Refactor code structure for improved readability and maintainability

This commit is contained in:
2026-05-18 11:41:20 +08:00
parent d39de39f96
commit 3f154a3077
43 changed files with 5046 additions and 113 deletions

14
.env

@@ -1,5 +1,15 @@
# DashScope API
DASHSCOPE_API_KEY=your_api_key_here
# ===== Qwen API configuration (Alibaba Cloud DashScope) =====
# Get an API key: https://dashscope.console.aliyun.com/
QWEN_API_KEY=your_api_key_here
QWEN_BASE_URL=https://new-api.fletcher0516.online/v1
QWEN_MODEL=qwen3.5-plus
QWEN_VL_MODEL=qwen3-vl-plus
# ===== DeepSeek API configuration =====
# Get an API key: https://platform.deepseek.com/
DEEPSEEK_API_KEY=your_api_key_here
DEEPSEEK_BASE_URL=https://new-api.fletcher0516.online/v1
DEEPSEEK_MODEL=deepseek-v3.2
# PostgreSQL
POSTGRES_HOST=localhost

45
.gitignore vendored

@@ -5,6 +5,51 @@ build/
dist/
wheels/
*.egg-info
*.egg
*.manifest
*.spec
pip-log.txt
pip-delete-this-directory.txt
# Virtual environments
.venv
venv/
ENV/
env/
# Tests
tests/
# IDE
.idea/
.vscode/
*.swp
*.swo
*~
# OS
.DS_Store
Thumbs.db
# Environment variables
.env
.env.local
.env.*.local
# Logs
*.log
logs/
# Database
*.db
*.sqlite3
# Cache
.pytest_cache/
.mypy_cache/
.ruff_cache/
.coverage
htmlcov/
# Jupyter
.ipynb_checkpoints/


@@ -1 +1 @@
3.9
3.13

419
README.md

@@ -1,68 +1,417 @@
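# Vehicle Regulation Intelligent Retrieval System - Backend
# Vehicle Regulation Intelligent Retrieval System - Backend API
A backend for regulation retrieval and compliance analysis, built on FastAPI + LangGraph + Milvus + Qwen.
A backend service for regulation retrieval and compliance analysis, built on FastAPI + LangGraph + Milvus + the Qwen large language model.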
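## Table of Contents
- [Tech Stack](#tech-stack)
- [Service Dependencies](#service-dependencies)
- [Quick Start](#quick-start)
- [Environment Configuration](#environment-configuration)
- [API Reference](#api-reference)
- [Project Structure](#project-structure)
- [Core Modules](#core-modules)
- [Workflow Design](#workflow-design)
- [Data Models](#data-models)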
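## Tech Stack
- **Web framework**: FastAPI
- **AI framework**: LangGraph
- **LLM**: Qwen-Max (DashScope API)
- **Embedding**: DashScope text-embedding-v3
- **Vector database**: Milvus
| Component | Technology | Description |
|------|----------|------|
| Web framework | FastAPI | High-performance asynchronous API framework |
| AI framework | LangGraph | State-graph workflow orchestration |
| LLM | Qwen-Max | Alibaba Cloud DashScope API |
| Embedding | text-embedding-v3 | DashScope text embedding service |
| Vector database | Milvus | Open-source, high-performance vector search engine |
| Relational database | PostgreSQL | Persistent data storage |
| Cache | Redis | Session cache and task queue |
| Graph database | Neo4j | Regulation relationship graph storage |
| Message queue | RabbitMQ | Asynchronous task processing |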
## Service Dependencies
The following infrastructure services must be running:
| Service | Port | User/Password |
|------|------|-----------|
| PostgreSQL | 5432 | postgresql/postgresql123456 |
| Redis | 6379 | redis@123 |
| Milvus | 19530, 9091 | - |
| MinIO | 9000, 9001 | minioadmin/minioadmin |
| Neo4j | 7474, 7687 | neo4j/neo4j123 |
| RabbitMQ | 5672, 15672 | admin/admin@123 |
## Quick Start
### 1. Start Milvus
### 1. Install dependencies
```bash
docker run -d --name milvus-standalone \
-p 19530:19530 \
-p 9091:9091 \
-v $(pwd)/milvus-data:/var/lib/milvus \
milvusdb/milvus:v2.3.3 standalone
cd backend
uv pip install -r requirements.txt
```
### 2. Configure environment variables
```bash
cp .env.example .env
# Edit the .env file and fill in your DashScope API key
# Edit the .env file and configure the parameters for each service
```
### 3. Install dependencies
### 3. Start the service
```bash
pip install -r requirements.txt
uv run uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```
### 4. Start the service
### 4. Open the API documentation
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
## Environment Configuration
Configuration options in the `.env` file:
```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
# DashScope API (LLM and Embedding)
DASHSCOPE_API_KEY=your_api_key_here
LLM_MODEL=qwen-max
EMBEDDING_MODEL=text-embedding-v3
EMBEDDING_DIM=1536
# PostgreSQL
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
POSTGRES_USER=postgresql
POSTGRES_PASSWORD=postgresql123456
POSTGRES_DB=mydb
# Redis
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=redis@123
# Milvus
MILVUS_HOST=localhost
MILVUS_PORT=19530
# MinIO
MINIO_ENDPOINT=localhost:9000
MINIO_ACCESS_KEY=minioadmin
MINIO_SECRET_KEY=minioadmin
# Neo4j
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=neo4j123
# RabbitMQ
RABBITMQ_HOST=localhost
RABBITMQ_PORT=5672
RABBITMQ_USER=admin
RABBITMQ_PASSWORD=admin@123
# Retrieval settings
VECTOR_TOP_K=10
BM25_TOP_K=10
FINAL_TOP_K=5
# Chunking settings
CHUNK_SIZE=800
CHUNK_OVERLAP=50
# Service settings
API_HOST=0.0.0.0
API_PORT=8000
```
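To illustrate how the `CHUNK_SIZE` and `CHUNK_OVERLAP` settings above might interact, here is a minimal sketch of fixed-size chunking with overlap. The function name `sliding_window_chunks` and the character-based windowing are assumptions for illustration; the project's own `chunking.py` may split text differently (for example, by clause).

```python
def sliding_window_chunks(text: str, chunk_size: int = 800, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters.

    Consecutive windows share `overlap` characters so that a sentence cut at a
    boundary still appears intact in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks
```

With the defaults shown in the configuration, a 2000-character text yields three chunks, and the last 50 characters of each chunk repeat at the start of the next one.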
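### 5. Open the API documentation
## API Reference
Open in a browser: http://localhost:8000/docs
### 1. Document management `/api/docs`
## API Endpoints
| Module | Path | Description |
| Endpoint | Method | Description |
|------|------|------|
| Document management | `/api/docs` | Upload, parse, and index regulation documents |
| RAG Q&A | `/api/rag` | SSE streaming Q&A |
| Compliance analysis | `/api/compliance` | Compliance analysis of design proposals |
| System status | `/api/status` | Statistics, configuration, and health checks |
| `/upload` | POST | Upload a regulation document (PDF/DOCX/TXT) |
| `/list` | GET | List indexed documents |
| `/parse/{doc_id}` | POST | Parse a document and split it into chunks |
| `/embed/{doc_id}` | POST | Vectorize a document and store it in Milvus |
| `/delete/{doc_id}` | DELETE | Delete a document |
**Example upload response:**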
```json
{
"doc_id": "doc-001",
"filename": "道路交通安全法.pdf",
"size": 102400,
"status": "uploaded"
}
```
### 2. RAG Q&A `/api/rag`
| Endpoint | Method | Description |
|------|------|------|
| `/chat` | POST | SSE streaming Q&A |
| `/quick-questions` | GET | Get preset quick questions |
**Request parameters:**
```json
{
"query": "Do electric bicycles need a license plate?",
"top_k": 5
}
```
**SSE event stream format:**
```json
{"type": "retrieving"}
{"type": "retrieved", "docs": [...]}
{"type": "generating", "text": "Generating answer..."}
{"type": "chunk", "text": "Answer fragment..."}
{"type": "done"}
```
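A minimal sketch of how a client might consume the event stream above. It assumes each event arrives as a standard SSE `data: <json>` line, which is how SSE payloads are normally framed; the helper names `iter_sse_events` and `collect_answer` are hypothetical and not part of this repository.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of every `data:` line in an SSE stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload:
            yield json.loads(payload)


def collect_answer(lines: Iterable[str]) -> str:
    """Concatenate the text of `chunk` events until a `done` event arrives."""
    parts: list[str] = []
    for event in iter_sse_events(lines):
        if event.get("type") == "chunk":
            parts.append(event.get("text", ""))
        elif event.get("type") == "done":
            break
    return "".join(parts)
```

Events of other types (`retrieving`, `retrieved`, `generating`) are ignored by `collect_answer`; a real UI would likely use them to show progress and the retrieved sources.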
### 3. Compliance analysis `/api/compliance`
| Endpoint | Method | Description |
|------|------|------|
| `/analyze` | POST | Upload a design proposal for analysis |
| `/result/{task_id}` | GET | Get the analysis result |
| `/chat/{segment_id}` | POST | Compliance conversation about a specific segment |
**Example analysis result response:**
```json
{
"task_id": "task-xxx",
"dashboard": {
"score": 78,
"high_risk_count": 2,
"medium_risk_count": 1,
"low_risk_count": 0,
"need_fix_segments": 3,
"status": "warning",
"status_label": "Needs optimization"
},
"segments": [
{
"id": 1,
"intent": "Vehicle body structure design",
"content": "...",
"risk_level": "high",
"regulations": [...]
}
],
"priority_actions": [...]
}
```
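As a rough illustration of how the per-level counters in `dashboard` could relate to `segments`, the sketch below tallies the `risk_level` of each segment. This is an assumption made for illustration: the README does not state how the counters or the `score` are actually computed, and the helper name `count_risk_levels` is hypothetical.

```python
from collections import Counter


def count_risk_levels(segments: list[dict]) -> dict[str, int]:
    """Count segments per risk level (assumed relation, not the project's formula)."""
    counts = Counter(segment.get("risk_level") for segment in segments)
    return {
        "high_risk_count": counts["high"],
        "medium_risk_count": counts["medium"],
        "low_risk_count": counts["low"],
    }
```

The overall `score` and `status` are deliberately left out, because nothing in the README defines them.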
### 4. System status `/api/status`
| Endpoint | Method | Description |
|------|------|------|
| `/stats` | GET | System statistics |
| `/config` | GET | Current configuration |
| `/milvus/health` | GET | Milvus health check |
**Statistics response:**
```json
{
"docs": 5,
"chunks": 510,
"vectors": 510,
"segments": 0
}
```
## Project Structure
```
app/
├── main.py # FastAPI entry point
├── core/config.py # Configuration management
├── api/routes/ # API routes
├── services/ # Service layer
├── workflows/ # LangGraph workflows
├── schemas/ # Pydantic models
└── utils/ # Utility functions
backend/
├── app/
│   ├── main.py # FastAPI application entry point
│   ├── core/
│   │   └── config.py # Pydantic Settings configuration management
│   ├── api/
│   │   ├── __init__.py # API route aggregation
│   │   └── routes/
│   │       ├── docs.py # Document management endpoints
│   │       ├── rag.py # RAG Q&A endpoints
│   │       ├── compliance.py # Compliance analysis endpoints
│   │       └── status.py # System status endpoints
│   ├── schemas/
│   │   ├── doc.py # Document data models
│   │   ├── rag.py # RAG Q&A data models
│   │   └── compliance.py # Compliance analysis data models
│   ├── services/
│   │   ├── llm.py # LLM service wrapper
│   │   ├── embedding.py # Embedding service wrapper
│   │   ├── milvus.py # Milvus vector store service
│   │   ├── document.py # Document parsing service
│   │   └── mock_data.py # Mock data (development and testing)
│   ├── workflows/
│   │   ├── rag_workflow.py # RAG workflow
│   │   └── compliance_workflow.py # Compliance analysis workflow
│   └── utils/
│       ├── chunking.py # Text chunking utilities
│       └── logger.py # Logging utilities
├── data/
│   ├── raw/ # Raw uploaded files
│   └── parsed/ # Parsed files
├── tests/ # Test directory
├── .env # Environment variable configuration
├── .env.example # Example environment variables
├── requirements.txt # Python dependencies
├── pyproject.toml # Project configuration
└── main.py # Entry script
```
## Core Modules
### Configuration management (`app/core/config.py`)
Configuration is managed with Pydantic Settings and loaded automatically from the `.env` file:
```python
class Settings(BaseSettings):
    dashscope_api_key: str = ""
    milvus_host: str = "localhost"
    milvus_port: int = 19530
    llm_model: str = "qwen-max"
    embedding_model: str = "text-embedding-v3"
    embedding_dim: int = 1536
    # ...
    class Config:
        env_file = ".env"
```
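### Service layer (`app/services/`)
#### LLM service (`llm.py`)
- Wraps DashScope API calls
- Supports streaming output
- Provides chat completion
#### Embedding service (`embedding.py`)
- Text vectorization
- Batch embedding support
- Configurable dimension (1536)
#### Milvus service (`milvus.py`)
- Collection creation and management
- Vector insertion and retrieval
- Hybrid retrieval (vector + BM25)
#### Mock data service (`mock_data.py`)
- Preset regulation document data
- Preset Q&A data
- Preset compliance analysis results
- Used during development and testing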
## Workflow Design
### RAG workflow (`rag_workflow.py`)
A state-graph workflow built on LangGraph:
```
[User query] -> [Vector store retrieval] -> [BM25 supplement] -> [Result fusion] -> [LLM generation] -> [Answer output]
```
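The README does not say how the result-fusion step combines the vector and BM25 result lists. One common choice is reciprocal rank fusion (RRF), sketched below purely as an illustration: the function name, the constant `k=60`, and the use of document IDs as plain strings are all assumptions, not the project's confirmed method.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60,
    top_n: int = 5,
) -> list[str]:
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    Each document scores 1 / (k + rank) for every list it appears in
    (rank starts at 1); documents are returned by descending total score.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Break score ties by document ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]
```

Here `top_n` plays the role of `FINAL_TOP_K`, while the two input lists would be the `VECTOR_TOP_K` and `BM25_TOP_K` candidates. RRF only needs ranks, so it avoids having to normalise cosine similarities against BM25 scores.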
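### Compliance analysis workflow (`compliance_workflow.py`)
```
[Upload document] -> [Parse document] -> [AI semantic segmentation] -> [Regulation matching] -> [Risk scoring] -> [Generate suggestions]
```
State nodes:
- `parse`: parse the document and extract its text
- `segment`: use AI to identify semantic segments
- `match`: match regulations via vector retrieval
- `score`: compute the risk level
- `suggest`: generate prioritized revision suggestions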
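The node sequence above can be pictured as a chain of functions that each read a shared state and return a partial update. The sketch below uses plain Python rather than LangGraph so that it stays self-contained; `run_nodes` and every node body are hypothetical placeholders, not the logic in `compliance_workflow.py`.

```python
from typing import Callable

Node = Callable[[dict], dict]


def run_nodes(state: dict, nodes: list[tuple[str, Node]]) -> dict:
    """Run nodes in order; each returns a partial update merged into the state."""
    for name, node in nodes:
        update = node(state)
        state = {**state, **update, "trace": [*state.get("trace", []), name]}
    return state


# Placeholder node bodies: they only show what each step reads and writes.
def parse(state: dict) -> dict:
    return {"text": state["raw"].strip()}


def segment(state: dict) -> dict:
    return {"segments": [s for s in state["text"].split("\n\n") if s]}


def match(state: dict) -> dict:
    return {"matches": {i: [] for i in range(len(state["segments"]))}}


def score(state: dict) -> dict:
    return {"risk_levels": {i: "low" for i in state["matches"]}}


def suggest(state: dict) -> dict:
    return {"priority_actions": []}


PIPELINE = [
    ("parse", parse),
    ("segment", segment),
    ("match", match),
    ("score", score),
    ("suggest", suggest),
]

result = run_nodes({"raw": " Body structure.\n\nBraking system. "}, PIPELINE)
```

Returning partial updates instead of mutating the state mirrors how state-graph frameworks usually merge node outputs, which keeps each node easy to test in isolation.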
## Data Models
### Document models (`schemas/doc.py`)
| Model | Fields | Description |
|------|------|------|
| DocumentUploadResponse | doc_id, filename, size, status | Upload response |
| DocumentInfo | id, name, chunks, status, created_at | Document info |
| DocumentListResponse | docs | Document list |
| ParseResponse | doc_id, chunks, status | Parse response |
| EmbedResponse | doc_id, vectors, status | Embed response |
### RAG models (`schemas/rag.py`)
| Model | Fields | Description |
|------|------|------|
| RagChatRequest | query, top_k | Q&A request |
| RetrievedDoc | id, doc_name, clause_id, score, content, preview | Retrieved document |
| QuickQuestion | id, question, category | Quick question |
| QuickQuestionsResponse | questions | Quick question list |
### Compliance models (`schemas/compliance.py`)
| Model | Fields | Description |
|------|------|------|
| Regulation | id, name, clause, score, match_keyword, category, full_content | Regulation entry |
| ComplianceSegment | id, index, intent, content, risk_level, regulations | Semantic segment |
| RiskDashboard | score, high_risk_count, medium_risk_count, low_risk_count, status | Risk dashboard |
| PriorityAction | regulation, issue, suggestion, severity | Priority suggestion |
| ComplianceResult | task_id, dashboard, segments, priority_actions | Analysis result |
### Risk level enum
```python
class RiskLevel(str, Enum):
    high = "high" # High risk: must be fixed immediately
    medium = "medium" # Medium risk: optimization recommended
    low = "low" # Low risk: largely compliant
```
### Compliance status enum
```python
class ComplianceStatus(str, Enum):
    pass_status = "pass" # Compliant
    warning = "warning" # Needs optimization
    fail = "fail" # Non-compliant
```
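## Development Notes
### Mock mode
When dependent services are not installed or the API key is not configured, the system automatically falls back to mock data mode and returns preset test data, which makes frontend development and debugging easier.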
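A minimal sketch of how such a fallback decision could be made. It is an assumption about the mechanism, not the project's actual check: the function name `use_mock_mode`, the treatment of the placeholder `your_api_key_here` as "not configured", and the default module list are all illustrative.

```python
import importlib.util


def use_mock_mode(api_key: str, required_modules: tuple[str, ...] = ("pymilvus",)) -> bool:
    """Return True when the service should fall back to preset mock data.

    Mock mode is chosen if the API key is empty or still the placeholder,
    or if any required top-level package is not importable.
    """
    if not api_key or api_key == "your_api_key_here":
        return True
    # find_spec returns None for a top-level module that is not installed.
    return any(importlib.util.find_spec(name) is None for name in required_modules)
```

Making the decision once at startup and logging it tends to be easier to debug than silently switching per request.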
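### SSE streaming output
Server-Sent Events streaming responses are implemented with the `sse-starlette` library and are used for:
- Real-time output of RAG answers
- Real-time responses in compliance conversations
### CORS configuration
Cross-origin requests from all origins are currently allowed; adjust this for production as needed.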
```python
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```
## Testing
```bash
pytest tests/
```
## License
MIT


@@ -1,115 +1,299 @@
from fastapi import APIRouter, UploadFile, File, HTTPException
"""Document management API"""
from fastapi import APIRouter, UploadFile, File, HTTPException, BackgroundTasks
import os
import uuid
from datetime import datetime
from typing import Optional
from app.schemas.doc import (
DocumentUploadResponse,
DocumentListResponse,
DocumentInfo,
ParseResponse,
EmbedResponse,
TaskStatusResponse,
)
from app.services.mock_data import get_mock_documents, generate_doc_id
from app.core.config import settings
from app.services.minio import minio_service
from app.services.database import db_service, init_db, DocStatus
from app.services.tasks import generate_task_id, task_manager, get_task_status
from app.workflows.document_workflow import (
generate_doc_id,
run_parse_workflow,
run_embedding_workflow,
)
from app.utils.logger import logger
router = APIRouter(prefix="/docs", tags=["文档管理"])
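# Temporary in-memory document store (includes the preset mock documents)
documents_store: dict[str, dict] = {}
# Initialize the database on startup
init_db()
# Load the mock documents on initialization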
for doc in get_mock_documents():
documents_store[doc["id"]] = doc
def get_content_type(filename: str) -> str:
"""Return the Content-Type for a file based on its extension"""
ext = os.path.splitext(filename)[1].lower()
content_types = {
".pdf": "application/pdf",
".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
".doc": "application/msword",
".txt": "text/plain",
}
return content_types.get(ext, "application/octet-stream")
@router.post("/upload", response_model=DocumentUploadResponse)
async def upload_document(file: UploadFile = File(...)):
"""Upload a regulation document"""
async def upload_document(
file: UploadFile = File(...),
background_tasks: BackgroundTasks = None,
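):
"""
Upload a regulation document to MinIO and automatically trigger asynchronous parsing
Steps:
1. Validate the file format
2. Generate a document ID
3. Upload to MinIO
4. Create the database record
5. Trigger the asynchronous parse task (can later be replaced with RabbitMQ)
"""
# Check the file format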
allowed_ext = [".pdf", ".docx", ".doc", ".txt"]
ext = os.path.splitext(file.filename)[1].lower()
if ext not in allowed_ext:
raise HTTPException(400, f"Unsupported file format: {ext}")
# Check the file size
content = await file.read()
max_size = 50 * 1024 * 1024 # 50MB
if len(content) > max_size:
raise HTTPException(400, f"File size exceeds limit: {max_size // 1024 // 1024}MB")
# Generate the document ID
doc_id = generate_doc_id()
# Save the file
raw_dir = "/airegulation/demo-mao/backend/data/raw"
os.makedirs(raw_dir, exist_ok=True)
file_path = os.path.join(raw_dir, f"{doc_id}_{file.filename}")
# Build the MinIO storage path
storage_filename = f"{doc_id}_{file.filename}"
minio_path = f"documents/{storage_filename}"
content = await file.read()
with open(file_path, "wb") as f:
f.write(content)
try:
# Upload to MinIO
content_type = get_content_type(file.filename)
minio_url = minio_service.upload_file(
minio_path,
content,
content_type,
)
# Record the document info
documents_store[doc_id] = {
"id": doc_id,
"name": file.filename,
"path": file_path,
"size": len(content),
"status": "uploaded",
"chunks": 0,
"created_at": datetime.now(),
}
# Create the database record
doc = db_service.create_document(
doc_id=doc_id,
filename=storage_filename,
original_name=file.filename,
minio_path=minio_path,
size=len(content),
)
return DocumentUploadResponse(
doc_id=doc_id,
filename=file.filename,
size=len(content),
)
logger.info(f"Document uploaded: {doc_id} - {file.filename}")
# Trigger the asynchronous parse task
parse_task_id = generate_task_id()
db_service.create_parse_task(parse_task_id, doc_id)
# Run parsing asynchronously with asyncio (to be replaced with RabbitMQ later)
background_tasks.add_task(
run_parse_workflow_sync,
parse_task_id,
doc_id,
)
return DocumentUploadResponse(
doc_id=doc_id,
filename=file.filename,
size=len(content),
status="uploaded",
parse_task_id=parse_task_id,
)
except Exception as e:
logger.error(f"Upload failed: {e}")
raise HTTPException(500, f"Upload failed: {str(e)}")
def run_parse_workflow_sync(task_id: str, doc_id: str):
"""Synchronous wrapper for use with BackgroundTasks"""
import asyncio
asyncio.run(run_parse_workflow(task_id, doc_id))
@router.get("/list", response_model=DocumentListResponse)
async def list_documents():
"""List indexed documents"""
docs = [
DocumentInfo(
id=d["id"],
name=d["name"],
chunks=d["chunks"],
status=d["status"],
created_at=d.get("created_at"),
)
for d in documents_store.values()
]
return DocumentListResponse(docs=docs)
docs = db_service.list_documents()
return DocumentListResponse(
docs=[
DocumentInfo(
id=d.id,
name=d.original_name,
chunks=d.chunks,
status=d.status,
created_at=d.created_at,
)
for d in docs
]
)
@router.get("/{doc_id}", response_model=DocumentInfo)
async def get_document(doc_id: str):
"""Get a single document's info"""
doc = db_service.get_document(doc_id)
if not doc:
raise HTTPException(404, "Document not found")
return DocumentInfo(
id=doc.id,
name=doc.original_name,
chunks=doc.chunks,
status=doc.status,
created_at=doc.created_at,
)
@router.post("/parse/{doc_id}", response_model=ParseResponse)
async def parse_document(doc_id: str):
"""Parse a document and split it into chunks"""
if doc_id not in documents_store:
async def parse_document(
doc_id: str,
background_tasks: BackgroundTasks = None,
):
"""
Manually trigger document parsing (for documents that were uploaded but not yet parsed)
"""
doc = db_service.get_document(doc_id)
if not doc:
raise HTTPException(404, "Document not found")
doc = documents_store[doc_id]
# Simulated parsing logic
doc["status"] = "parsed"
# Estimate the number of chunks from the file size
file_size = doc.get("size", 100000)
doc["chunks"] = max(20, file_size // 8000)
if doc.status not in [DocStatus.uploaded.value, DocStatus.failed.value]:
raise HTTPException(400, f"Document status is {doc.status}, cannot parse")
return ParseResponse(doc_id=doc_id, chunks=doc["chunks"])
# Create the parse task
task_id = generate_task_id()
db_service.create_parse_task(task_id, doc_id)
# Run asynchronously
background_tasks.add_task(
run_parse_workflow_sync,
task_id,
doc_id,
)
return ParseResponse(
doc_id=doc_id,
task_id=task_id,
status="parsing",
)
@router.post("/embed/{doc_id}", response_model=EmbedResponse)
async def embed_document(doc_id: str):
"""Embed and store in the vector database"""
if doc_id not in documents_store:
async def embed_document(
doc_id: str,
background_tasks: BackgroundTasks = None,
):
"""
Trigger document vectorization (the document must already be parsed)
"""
doc = db_service.get_document(doc_id)
if not doc:
raise HTTPException(404, "Document not found")
doc = documents_store[doc_id]
# Simulated embedding logic
doc["status"] = "indexed"
if doc.status != DocStatus.parsed.value:
raise HTTPException(400, f"Document must be parsed first. Current status: {doc.status}")
return EmbedResponse(doc_id=doc_id, vectors=doc["chunks"])
# Create the vectorization task
task_id = generate_task_id()
db_service.create_parse_task(task_id, doc_id)
# Run asynchronously
background_tasks.add_task(
run_embedding_workflow_sync,
task_id,
doc_id,
)
return EmbedResponse(
doc_id=doc_id,
task_id=task_id,
status="embedding",
)
def run_embedding_workflow_sync(task_id: str, doc_id: str):
"""Synchronous wrapper for use with BackgroundTasks"""
import asyncio
asyncio.run(run_embedding_workflow(task_id, doc_id))
@router.get("/task/{task_id}", response_model=TaskStatusResponse)
async def get_task_status_api(task_id: str):
"""Get task status"""
status = get_task_status(task_id)
if not status:
# Fall back to the task record in the database
task = db_service.get_parse_task(task_id)
if task:
return TaskStatusResponse(
task_id=task_id,
status=task.status,
progress=task.progress or 0,
message=task.message,
)
raise HTTPException(404, "Task not found")
return TaskStatusResponse(
task_id=task_id,
status=status.get("status", "unknown"),
progress=status.get("progress", 0),
message=status.get("message"),
result=status.get("result"),
)
@router.delete("/delete/{doc_id}")
async def delete_document(doc_id: str):
"""Delete a document"""
if doc_id not in documents_store:
"""
Delete a document
Also deletes:
- the file in MinIO
- the database record
- the parsed text file
"""
doc = db_service.get_document(doc_id)
if not doc:
raise HTTPException(404, "Document not found")
del documents_store[doc_id]
return {"success": True}
try:
# Delete the file from MinIO
minio_service.delete_file(doc.minio_path)
# Delete the local parsed file
parsed_path = f"{settings.data_parsed_dir}/{doc_id}.txt"
if os.path.exists(parsed_path):
os.remove(parsed_path)
# Delete the local temporary file (the parse workflow writes it as "{doc_id}_{doc.filename}")
temp_path = f"{settings.data_raw_dir}/{doc_id}_{doc.filename}"
if os.path.exists(temp_path):
os.remove(temp_path)
# Delete the database record
db_service.delete_document(doc_id)
logger.info(f"Document deleted: {doc_id}")
return {"success": True, "doc_id": doc_id}
except Exception as e:
logger.error(f"Delete failed: {e}")
raise HTTPException(500, f"Delete failed: {str(e)}")
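The status checks in `parse_document` and `embed_document` above amount to a small state guard: parsing may start from `uploaded` or `failed`, and embedding only from `parsed`. A minimal sketch of that rule as a lookup table follows; the names `ALLOWED_STATUSES` and `can_start` are hypothetical and not part of this commit.

```python
# Which document statuses allow each action to start, as enforced by the
# parse and embed endpoints in this file.
ALLOWED_STATUSES: dict[str, set[str]] = {
    "parse": {"uploaded", "failed"},
    "embed": {"parsed"},
}


def can_start(action: str, status: str) -> bool:
    """Return True if a document in `status` may begin `action`."""
    return status in ALLOWED_STATUSES.get(action, set())
```

Keeping the rule in one table could make it easier to stay consistent if more statuses or actions are added later.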


@@ -3,13 +3,51 @@ from typing import Optional
class Settings(BaseSettings):
# DashScope API
dashscope_api_key: str = ""
# Qwen API configuration
qwen_api_key: str = ""
qwen_base_url: str = "https://dashscope.aliyuncs.com/api/v1"
qwen_model: str = "qwen-max"
qwen_vl_model: str = "qwen-vl-plus"
# DeepSeek API configuration
deepseek_api_key: str = ""
deepseek_base_url: str = "https://api.deepseek.com/v1"
deepseek_model: str = "deepseek-v3"
# PostgreSQL
postgres_host: str = "localhost"
postgres_port: int = 5432
postgres_user: str = "postgresql"
postgres_password: str = "postgresql123456"
postgres_db: str = "mydb"
# Redis
redis_host: str = "localhost"
redis_port: int = 6379
redis_password: str = ""
# MinIO
minio_endpoint: str = "localhost:9000"
minio_access_key: str = "minioadmin"
minio_secret_key: str = "minioadmin"
minio_bucket: str = "regulation-docs"
minio_secure: bool = False
# Milvus
milvus_host: str = "localhost"
milvus_port: int = 19530
# Neo4j
neo4j_uri: str = "bolt://localhost:7687"
neo4j_user: str = "neo4j"
neo4j_password: str = "neo4j123"
# RabbitMQ
rabbitmq_host: str = "localhost"
rabbitmq_port: int = 5672
rabbitmq_user: str = "admin"
rabbitmq_password: str = "admin@123"
# LLM configuration
llm_model: str = "qwen-max"
embedding_model: str = "text-embedding-v3"
@@ -32,6 +70,10 @@ class Settings(BaseSettings):
regulations_collection: str = "vehicle_regulations"
compliance_collection: str = "compliance_cache"
# Data directories
data_raw_dir: str = "/airegulation/demo-mao/backend/data/raw"
data_parsed_dir: str = "/airegulation/demo-mao/backend/data/parsed"
class Config:
env_file = ".env"
env_file_encoding = "utf-8"


@@ -1,16 +1,21 @@
"""Document data models"""
from pydantic import BaseModel
from typing import Optional
from typing import Optional, Any
from datetime import datetime
class DocumentUploadResponse(BaseModel):
"""Document upload response"""
doc_id: str
filename: str
size: int
status: str = "uploaded"
parse_task_id: Optional[str] = None # Parse task ID
class DocumentInfo(BaseModel):
"""Document info"""
id: str
name: str
chunks: int
@@ -19,10 +24,12 @@ class DocumentInfo(BaseModel):
class DocumentListResponse(BaseModel):
"""Document list response"""
docs: list[DocumentInfo]
class ChunkInfo(BaseModel):
"""Text chunk info"""
chunk_id: str
doc_name: str
clause_id: Optional[str] = None
@@ -33,12 +40,25 @@ class ChunkInfo(BaseModel):
class ParseResponse(BaseModel):
"""Parse response"""
doc_id: str
chunks: int
status: str = "parsed"
task_id: Optional[str] = None
chunks: int = 0
status: str = "parsing"
class EmbedResponse(BaseModel):
"""Embed response"""
doc_id: str
vectors: int
status: str = "embedded"
task_id: Optional[str] = None
vectors: int = 0
status: str = "embedding"
class TaskStatusResponse(BaseModel):
"""Task status response"""
task_id: str
status: str
progress: int
message: Optional[str] = None
result: Optional[Any] = None


@@ -1,4 +1,9 @@
# Import mock data service
# Import services
from .minio import minio_service, MinioService
from .database import db_service, DatabaseService, init_db, Document, ParseTask
from .tasks import task_manager, get_task_status, set_task_status, generate_task_id
# Import mock data service (for development)
from .mock_data import (
get_mock_documents,
get_mock_quick_questions,
@@ -29,6 +34,18 @@ except ImportError:
get_document_service = None
__all__ = [
# Core services
"minio_service",
"MinioService",
"db_service",
"DatabaseService",
"init_db",
"Document",
"ParseTask",
"task_manager",
"get_task_status",
"set_task_status",
"generate_task_id",
# Mock data services
"get_mock_documents",
"get_mock_quick_questions",

228
app/services/database.py Normal file

@@ -0,0 +1,228 @@
"""Database service - PostgreSQL"""
from sqlalchemy import create_engine, Column, String, Integer, DateTime, Enum, Text
# declarative_base lives in sqlalchemy.orm since SQLAlchemy 1.4; the sqlalchemy.ext.declarative import is deprecated
from sqlalchemy.orm import declarative_base, sessionmaker, Session
from datetime import datetime
from typing import Optional, List
import enum
from app.core.config import settings
from app.utils.logger import logger
# Database connection
DATABASE_URL = f"postgresql://{settings.postgres_user}:{settings.postgres_password}@{settings.postgres_host}:{settings.postgres_port}/{settings.postgres_db}"
engine = create_engine(DATABASE_URL, echo=False, pool_pre_ping=True)
SessionLocal = sessionmaker(autocommit=False, autoflush=False, bind=engine)
Base = declarative_base()
class DocStatus(str, enum.Enum):
"""Document processing status"""
uploaded = "uploaded" # Uploaded
parsing = "parsing" # Parsing in progress
parsed = "parsed" # Parsed
embedding = "embedding" # Embedding in progress
indexed = "indexed" # Indexed
failed = "failed" # Processing failed
class Document(Base):
"""Documents table"""
__tablename__ = "documents"
id = Column(String(64), primary_key=True)
filename = Column(String(255), nullable=False)
original_name = Column(String(255), nullable=False)
minio_path = Column(String(512), nullable=False) # MinIO storage path
size = Column(Integer, default=0)
status = Column(String(32), default=DocStatus.uploaded.value)
chunks = Column(Integer, default=0)
vectors = Column(Integer, default=0)
error_message = Column(Text, nullable=True)
created_at = Column(DateTime, default=datetime.now)
updated_at = Column(DateTime, default=datetime.now, onupdate=datetime.now)
def to_dict(self):
return {
"id": self.id,
"name": self.original_name,
"chunks": self.chunks,
"status": self.status,
"created_at": self.created_at.isoformat() if self.created_at else None,
}
class ParseTask(Base):
"""Parse tasks table"""
__tablename__ = "parse_tasks"
id = Column(String(64), primary_key=True)
doc_id = Column(String(64), nullable=False)
status = Column(String(32), default="pending")
progress = Column(Integer, default=0)
message = Column(Text, nullable=True)
created_at = Column(DateTime, default=datetime.now)
started_at = Column(DateTime, nullable=True)
completed_at = Column(DateTime, nullable=True)
def init_db():
"""Create the database tables"""
try:
Base.metadata.create_all(bind=engine)
logger.info("Database tables created successfully")
except Exception as e:
logger.error(f"Database initialization failed: {e}")
def get_db() -> Session:
"""Get a database session"""
db = SessionLocal()
try:
return db
finally:
# Note: the caller is responsible for closing the session
pass
class DatabaseService:
"""Database service"""
def __init__(self):
self.engine = engine
self.SessionLocal = SessionLocal
def create_document(
self,
doc_id: str,
filename: str,
original_name: str,
minio_path: str,
size: int,
) -> Document:
"""Create a document record"""
db = self.SessionLocal()
try:
doc = Document(
id=doc_id,
filename=filename,
original_name=original_name,
minio_path=minio_path,
size=size,
status=DocStatus.uploaded.value,
)
db.add(doc)
db.commit()
db.refresh(doc)
return doc
finally:
db.close()
def get_document(self, doc_id: str) -> Optional[Document]:
"""Get a document"""
db = self.SessionLocal()
try:
return db.query(Document).filter(Document.id == doc_id).first()
finally:
db.close()
def update_document_status(
self,
doc_id: str,
status: str,
chunks: int = None,
vectors: int = None,
error_message: str = None,
) -> Optional[Document]:
"""Update a document's status"""
db = self.SessionLocal()
try:
doc = db.query(Document).filter(Document.id == doc_id).first()
if doc:
doc.status = status
doc.updated_at = datetime.now()
if chunks is not None:
doc.chunks = chunks
if vectors is not None:
doc.vectors = vectors
if error_message:
doc.error_message = error_message
db.commit()
db.refresh(doc)
return doc
finally:
db.close()
def list_documents(self) -> List[Document]:
"""List all documents"""
db = self.SessionLocal()
try:
return db.query(Document).order_by(Document.created_at.desc()).all()
finally:
db.close()
def delete_document(self, doc_id: str) -> bool:
"""Delete a document"""
db = self.SessionLocal()
try:
doc = db.query(Document).filter(Document.id == doc_id).first()
if doc:
db.delete(doc)
db.commit()
return True
return False
finally:
db.close()
def create_parse_task(self, task_id: str, doc_id: str) -> ParseTask:
"""Create a parse task"""
db = self.SessionLocal()
try:
task = ParseTask(id=task_id, doc_id=doc_id)
db.add(task)
db.commit()
db.refresh(task)
return task
finally:
db.close()
def get_parse_task(self, task_id: str) -> Optional[ParseTask]:
"""Get a parse task"""
db = self.SessionLocal()
try:
return db.query(ParseTask).filter(ParseTask.id == task_id).first()
finally:
db.close()
def update_parse_task(
self,
task_id: str,
status: str,
progress: int = None,
message: str = None,
) -> Optional[ParseTask]:
"""Update a parse task's status"""
db = self.SessionLocal()
try:
task = db.query(ParseTask).filter(ParseTask.id == task_id).first()
if task:
task.status = status
if progress is not None:
task.progress = progress
if message:
task.message = message
if status == "running":
task.started_at = datetime.now()
elif status in ("completed", "failed"):
task.completed_at = datetime.now()
db.commit()
db.refresh(task)
return task
finally:
db.close()
# Singleton
db_service = DatabaseService()
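`DATABASE_URL` in this file interpolates the user name and password directly into the connection string. A minimal sketch, assuming a password may contain reserved characters such as `@` (as several default credentials for other services in the README do): percent-encode the credentials before building the URL. The helper `build_postgres_url` is hypothetical and not part of this commit.

```python
from urllib.parse import quote_plus


def build_postgres_url(user: str, password: str, host: str, port: int, db: str) -> str:
    """Build a PostgreSQL URL with percent-encoded credentials."""
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{db}"


url = build_postgres_url("admin", "admin@123", "localhost", 5432, "mydb")
# url == "postgresql://admin:admin%40123@localhost:5432/mydb"
```

Without the encoding, an `@` inside the password would be read as the separator between credentials and host, and the connection would point at the wrong server.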

122
app/services/minio.py Normal file

@@ -0,0 +1,122 @@
"""MinIO file storage service"""
import io
from minio import Minio
from minio.error import S3Error
from app.core.config import settings
from app.utils.logger import logger
class MinioService:
"""MinIO file storage service"""
def __init__(self):
self.client = Minio(
settings.minio_endpoint,
access_key=settings.minio_access_key,
secret_key=settings.minio_secret_key,
secure=settings.minio_secure,
)
self.bucket = settings.minio_bucket
self._ensure_bucket()
def _ensure_bucket(self):
"""Ensure the bucket exists"""
try:
if not self.client.bucket_exists(self.bucket):
self.client.make_bucket(self.bucket)
logger.info(f"Created MinIO bucket: {self.bucket}")
except S3Error as e:
logger.error(f"MinIO bucket check failed: {e}")
def upload_file(
self,
object_name: str,
file_data: bytes,
content_type: str = "application/octet-stream",
) -> str:
"""
Upload a file to MinIO
Args:
object_name: object name (file path)
file_data: file contents as bytes
content_type: MIME type of the file
Returns:
The file's MinIO URL
"""
try:
data_stream = io.BytesIO(file_data)
self.client.put_object(
self.bucket,
object_name,
data_stream,
length=len(file_data),
content_type=content_type,
)
url = f"{settings.minio_endpoint}/{self.bucket}/{object_name}"
logger.info(f"Uploaded file to MinIO: {object_name}")
return url
except S3Error as e:
logger.error(f"MinIO upload failed: {e}")
raise
def get_file(self, object_name: str) -> bytes:
"""
Fetch a file from MinIO
Args:
object_name: object name
Returns:
File contents as bytes
"""
try:
response = self.client.get_object(self.bucket, object_name)
data = response.read()
response.close()
response.release_conn()
return data
except S3Error as e:
logger.error(f"MinIO get file failed: {e}")
raise
def delete_file(self, object_name: str) -> bool:
"""
Delete a file from MinIO
Args:
object_name: object name
Returns:
Whether the deletion succeeded
"""
try:
self.client.remove_object(self.bucket, object_name)
logger.info(f"Deleted file from MinIO: {object_name}")
return True
except S3Error as e:
logger.error(f"MinIO delete failed: {e}")
return False
def list_files(self, prefix: str = "") -> list[str]:
"""
List files in MinIO
Args:
prefix: filter by object name prefix
Returns:
List of object names
"""
try:
objects = self.client.list_objects(self.bucket, prefix=prefix)
return [obj.object_name for obj in objects]
except S3Error as e:
logger.error(f"MinIO list files failed: {e}")
return []
# Singleton
minio_service = MinioService()

89
app/services/tasks.py Normal file

@@ -0,0 +1,89 @@
"""Asynchronous task processing module
TODO: replace with a RabbitMQ message queue later
"""
import asyncio
import uuid
from datetime import datetime
from typing import Callable, Awaitable
from app.utils.logger import logger
# Task status store (to be replaced with Redis later)
_task_store: dict[str, dict] = {}
def generate_task_id() -> str:
"""Generate a task ID"""
return f"task-{uuid.uuid4().hex[:12]}"
class AsyncTaskManager:
"""Asynchronous task manager"""
def __init__(self):
self._running_tasks: dict[str, asyncio.Task] = {}
def create_task(
self,
task_id: str,
task_func: Callable[[str], Awaitable[None]],
) -> asyncio.Task:
"""
Create an asynchronous task
Args:
task_id: task ID
task_func: coroutine function that runs the task
Returns:
asyncio.Task
"""
task = asyncio.create_task(self._run_task(task_id, task_func))
self._running_tasks[task_id] = task
return task
async def _run_task(
self,
task_id: str,
task_func: Callable[[str], Awaitable[None]],
):
"""Run the task and record its status"""
try:
await task_func(task_id)
except Exception as e:
logger.error(f"Task {task_id} failed: {e}")
_task_store[task_id] = {
"status": "failed",
"error": str(e),
"completed_at": datetime.now(),
}
finally:
if task_id in self._running_tasks:
del self._running_tasks[task_id]
def get_task_status(self, task_id: str) -> dict | None:
"""Get task status"""
return _task_store.get(task_id)
def cancel_task(self, task_id: str) -> bool:
"""Cancel a task"""
if task_id in self._running_tasks:
self._running_tasks[task_id].cancel()
return True
return False
# Singleton
task_manager = AsyncTaskManager()
def get_task_status(task_id: str) -> dict | None:
"""Get task status"""
return _task_store.get(task_id)
def set_task_status(task_id: str, status: dict):
"""Set task status"""
_task_store[task_id] = status


@@ -0,0 +1,252 @@
"""Document parsing workflow - asynchronous processing"""
import asyncio
import uuid
from datetime import datetime
from typing import List
import io
from app.core.config import settings
from app.services.minio import minio_service
from app.services.database import db_service, DocStatus
from app.services.tasks import set_task_status, get_task_status
from app.services.document import DocumentService
from app.utils.chunking import TextChunker
from app.utils.logger import logger
def generate_doc_id() -> str:
"""Generate a document ID"""
return f"doc-{uuid.uuid4().hex[:12]}"
def generate_chunk_id(doc_id: str, index: int) -> str:
"""Generate a chunk ID"""
return f"{doc_id}-chunk-{index}"
async def run_parse_workflow(task_id: str, doc_id: str):
"""
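Run the document parsing workflow
Steps:
1. Fetch the file - download it from MinIO
2. Parse the document - extract the text content
3. Chunk the text - split by clause or by fixed size
4. Save the results - store the chunk data
Args:
task_id: task ID
doc_id: document ID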
"""
chunker = TextChunker()
doc_service = DocumentService(settings.data_raw_dir, settings.data_parsed_dir)
try:
# Step 1: fetch the file
set_task_status(task_id, {
"status": "running",
"step": "fetching",
"progress": 10,
"message": "Fetching the file from storage...",
"started_at": datetime.now(),
})
db_service.update_document_status(doc_id, DocStatus.parsing.value)
doc = db_service.get_document(doc_id)
if not doc:
raise ValueError(f"Document {doc_id} not found")
# Fetch the file from MinIO
file_data = minio_service.get_file(doc.minio_path)
# Save to the local temporary directory (used for parsing)
temp_path = f"{settings.data_raw_dir}/{doc_id}_{doc.filename}"
with open(temp_path, "wb") as f:
f.write(file_data)
await asyncio.sleep(0.5) # Simulated delay
# Step 2: parse the document
set_task_status(task_id, {
"status": "running",
"step": "parsing",
"progress": 30,
"message": "Parsing the document content...",
})
text = doc_service.parse_document(temp_path)
if not text:
raise ValueError("Document parsing returned empty content")
# Save the parsed text
parsed_path = doc_service.save_parsed_text(doc_id, text)
await asyncio.sleep(0.5)
# Step 3: 文本分块
set_task_status(task_id, {
"status": "running",
"step": "chunking",
"progress": 50,
"message": "正在进行文本分块...",
})
# 尝试按条款分块,如果不是法规格式则按大小分块
chunks = chunker.chunk_by_clause(text)
if len(chunks) == 0:
chunks = chunker.chunk_by_size(text)
await asyncio.sleep(0.5)
# Step 4: 保存分块结果
set_task_status(task_id, {
"status": "running",
"step": "saving",
"progress": 80,
"message": f"正在保存 {len(chunks)} 个文本块...",
})
# TODO: 将分块存储到数据库或向量库
# 这里先统计数量
chunk_count = len(chunks)
await asyncio.sleep(0.5)
# Step 5: 完成
set_task_status(task_id, {
"status": "completed",
"step": "done",
"progress": 100,
"message": f"解析完成,共生成 {chunk_count} 个文本块",
"completed_at": datetime.now(),
"result": {
"doc_id": doc_id,
"chunks": chunk_count,
"parsed_path": parsed_path,
}
})
db_service.update_document_status(
doc_id,
DocStatus.parsed.value,
chunks=chunk_count,
)
logger.info(f"Parse workflow completed for doc {doc_id}: {chunk_count} chunks")
except Exception as e:
logger.error(f"Parse workflow failed for doc {doc_id}: {e}")
set_task_status(task_id, {
"status": "failed",
"step": "error",
"progress": 0,
"message": str(e),
"completed_at": datetime.now(),
})
db_service.update_document_status(
doc_id,
DocStatus.failed.value,
error_message=str(e),
)
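
The fallback from clause-based to fixed-size chunking in Step 3 can be sketched as follows. `TextChunker` itself is not shown in this part of the commit, so the functions below are hypothetical stand-ins, and the clause pattern (dotted numbers such as `4.1` or `4.1.2` at the start of a line) is only an assumption about the regulation format:

```python
import re

# Assumed clause heading: a dotted number like "4.1" or "4.1.2" at line start.
CLAUSE_RE = re.compile(r"^(\d+(?:\.\d+)+)\s+", re.MULTILINE)


def chunk_by_clause(text: str) -> list[str]:
    """Split at clause headings; return [] when no heading is found.

    Text before the first heading is ignored in this sketch.
    """
    starts = [m.start() for m in CLAUSE_RE.finditer(text)]
    if not starts:
        return []
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def chunk_by_size(text: str, size: int = 500) -> list[str]:
    """Plain fixed-size slicing, used only as the fallback."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk(text: str, size: int = 500) -> list[str]:
    """Prefer clause chunks; fall back to fixed-size chunks when none are found."""
    return chunk_by_clause(text) or chunk_by_size(text, size)


regulation = "4.1 General\nBrakes must work.\n4.2 Lights\nLamps must work.\n"
clause_chunks = chunk(regulation)
fallback_chunks = chunk("plain text", size=4)
```

Returning an empty list from the clause splitter keeps the fallback decision in one place, which mirrors the `len(chunks) == 0` check in the workflow.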


async def run_embedding_workflow(task_id: str, doc_id: str):
    """
    Run the embedding workflow.

    Steps:
    1. Fetch the chunk data
    2. Generate vector embeddings
    3. Store them in the vector database

    Args:
        task_id: task ID
        doc_id: document ID
    """
    try:
        # Step 1: fetch the chunks
        set_task_status(task_id, {
            "status": "running",
            "step": "fetching_chunks",
            "progress": 10,
            "message": "Fetching text chunks...",
            "started_at": datetime.now(),
        })
        db_service.update_document_status(doc_id, DocStatus.embedding.value)

        doc = db_service.get_document(doc_id)
        if not doc:
            raise ValueError(f"Document {doc_id} not found")
        await asyncio.sleep(0.5)

        # Step 2: generate embeddings
        set_task_status(task_id, {
            "status": "running",
            "step": "embedding",
            "progress": 40,
            "message": "Generating vector embeddings...",
        })
        # TODO: call the Embedding service to generate vectors
        # Simulated for now
        vector_count = doc.chunks
        await asyncio.sleep(1)

        # Step 3: store in the vector database
        set_task_status(task_id, {
            "status": "running",
            "step": "storing",
            "progress": 70,
            "message": "Storing in the vector database...",
        })
        # TODO: store in Milvus
        await asyncio.sleep(0.5)

        # Step 4: done
        set_task_status(task_id, {
            "status": "completed",
            "step": "done",
            "progress": 100,
            "message": f"Embedding complete: {vector_count} vectors processed",
            "completed_at": datetime.now(),
            "result": {
                "doc_id": doc_id,
                "vectors": vector_count,
            }
        })
        db_service.update_document_status(
            doc_id,
            DocStatus.indexed.value,
            vectors=vector_count,
        )
        logger.info(f"Embedding workflow completed for doc {doc_id}")

    except Exception as e:
        logger.error(f"Embedding workflow failed for doc {doc_id}: {e}")
        set_task_status(task_id, {
            "status": "failed",
            "step": "error",
            "progress": 0,
            "message": str(e),
            "completed_at": datetime.now(),
        })
        db_service.update_document_status(
            doc_id,
            DocStatus.failed.value,
            error_message=str(e),
        )
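
Step 2 of the embedding workflow is still a placeholder. A minimal sketch of how chunks might be embedded in batches, assuming a hypothetical `embed_batch` callable that stands in for whatever embedding client the project ends up using (no real DashScope or Milvus call is made here) and an assumed batch size of 10:

```python
from typing import Callable, Iterator, Sequence

Vector = list[float]


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_chunks(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[Vector]],
    batch_size: int = 10,
) -> list[Vector]:
    """Embed all chunks batch by batch, keeping the original order."""
    vectors: list[Vector] = []
    for batch in batched(chunks, batch_size):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise ValueError("embedding call returned a different number of vectors")
        vectors.extend(result)
    return vectors


calls: list[int] = []


def fake_embed(batch: Sequence[str]) -> list[Vector]:
    """Stand-in embedder: one-dimensional vector holding the text length."""
    calls.append(len(batch))
    return [[float(len(text))] for text in batch]


vectors = embed_chunks([f"chunk {i}" for i in range(25)], fake_embed, batch_size=10)
```

Passing the embedder in as a callable keeps the batching logic testable without network access; the real client can be swapped in later.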

BIN
data/pdf_chunks/page_1.png Normal file

Binary file not shown.
Size: 149 KiB

@@ -0,0 +1,24 @@
=== Page 1 chunking result ===
Image: page_1.png
MD5: 3a0587908dec601c902b954eeecc6365
Text length: 194 characters
Chunk size: 500 characters
Chunk count: 1
--- Chunk 0 ---
Length: 194 characters
Content:
!"#!"!#!$
$%"
!
! " # $ % & ( ) *
%&#&"&"!$$(
"# "#$%&%&!$(
! " # $ % & ( )
#()*+,)-./,)0)1*2(3,45/67*,/4+46)2
$$(8$8#*+
$$"8$#8$#,-
!"#$%&(+,-./0/123
! ( ) * 4 5 6 7 8 9
* +
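
The extracted text in this file is not readable, which can happen when a PDF's embedded fonts lack a usable Unicode mapping. A minimal sketch of how a pipeline might flag such pages for an OCR fallback, assuming the source document is Chinese and using hypothetical helpers `cjk_ratio` and `needs_ocr` with an assumed threshold (none of these appear in the commit):

```python
def cjk_ratio(text: str) -> float:
    """Share of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def needs_ocr(text: str, min_ratio: float = 0.3) -> bool:
    """Flag a page whose text is empty or has too few CJK characters.

    The 0.3 threshold is an assumption and would need tuning on real pages.
    """
    return cjk_ratio(text) < min_ratio


garbled_page = '!"#!"!#!$\n$%"\n!\n! " # $ % & ( ) *'
readable_page = "车辆法规智能检索系统 GB 7258"
flags = [needs_ocr(garbled_page), needs_ocr(""), needs_ocr(readable_page)]
```

The check is deliberately crude: it only looks at the character mix, so it would misfire on pages that are legitimately all tables of numbers or Latin text.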

BIN
data/pdf_chunks/page_10.png Normal file

Binary file not shown.
Size: 581 KiB

@@ -0,0 +1,65 @@
=== Page 10 chunking result ===
Image: page_10.png
MD5: e043bd43659fa8cd131f0d1c4b90c4db
Text length: 1033 characters
Chunk size: 500 characters
Chunk count: 3
--- Chunk 0 ---
Length: 500 characters
Content:
7!!YZ+<"+<./W+<12!?.n~-)6f.7#zb~ef$8bKBC,-.
!GH%
"$&$$!GY8Z3[\
"$&$$$!!WX_
,-VY~W)$%$s$9.N?VY#bbt2,-:R;?VY%
"$&$$$$!WXab
0)12H4.n?<3$&=~[,-H<3\3/%
"$&$$$!WXGcb
C!G$0)12!p.7H~.2,-M>s?$%.N$s.NW9.N?3%
"$&$$$"!d3ef8mg
C,-!G$0)12!p.7H~.2,-M[)?@A$%.!/BB?TC%
"$&$$$(!*+hi
12345~.2ab\D+<12W+<./%
/!!CD(12345?,-$!G.&W0)12!H~.nEFXbst?+<+80)G
p<|FC0)"GHIJKM12!zbGLMrJUXNop%a3ZqpM12!0
)Wb*NO12?YZ%
1!!CD(123455T.0)12!H~]CEFXbst?+<12
(hjk%&Glm#n;o,pPq%)
2!!CD(12 3 45 ? , -.&W !G.&W0)12!H~. n<sm?n2Py .
--- Chunk 1 ---
Length: 499 characters
Content:
Y Z
RS%
"$&$$$&!*+Z3rstuKKv[\
/!!~.2QI?0)78*
1!!4Rm#~+Uqi+.k5GG{12&G=*
2!!~.2yz+LR?Sy0T"aDEFnyGH?Sy0T#?@IJnyGH?Sy0
Tn!%
"$&$$$#!w;8xy
~.2U5.\Dqi+~/o>V$W|$rXpiY?CD12%
"$&$$$)!zW{_8U|
~.2,-D,GZ8[gn\?VY.K]%
l/,-~.2^,-?2,K"$%&KK!XpW_Wl/WabCk$Z8[gn\
?VY.K]%
(!ABrs
($!!]^,-
($!$!!AB}X
24HMC56~C,39-Hls%
56cd~qp$lsj9-en!?yb#Nlsj9-Cn!?yb%a39-56N?n
!?ff/XN?56ybe8ls#4WCc?9-Hlsghyb?56%
($!$$!AB~!"~
ZEVa&/#M.Nk0?5_<fOC&>*;_?5_<fOC!>*,f?5_<fOC!D*
MC;<?5_<fOC+$&00%
($!$!AB#$
ZEVa&/#56"9-~CifO"#EC&E?=>
--- Chunk 2 ---
Length: 33 characters
Content:
kUqJH"F#zbCifO"#EC
!"!"#"#!$%%&

BIN
data/pdf_chunks/page_11.png Normal file

Binary file not shown.
Size: 536 KiB

106
data/pdf_chunks/page_11.txt Normal file

@@ -0,0 +1,106 @@
=== Page 11 chunking result ===
Image: page_11.png
MD5: 256acb056ed96cb4ab582d55b3792e49
Text length: 1147 characters
Chunk size: 500 characters
Chunk count: 3
--- Chunk 0 ---
Length: 500 characters
Content:
!+E=>kls56!
($$!c9defgh?AB"%"$!$!#
345HM0)?$W) ,-(&%"++#kj 6$!k$%M&/?TU.q$i+kV/WXY
Z[?5678q ,-(&%"++#BC 6&/?5678ls56!
($!xy1zAB"%"$!$$#
12345?TU?ab?5678q ,-(&%"++#BC -?CD&/ls!
($"!&AB"%"$$&"$($"$$#
)t#k?&/EBlC?@HaCNOPm~q&$)"CNuv?r/56#?RS)EB
YGCNOPH&WG?@\N)[t?i2HzCDgD|H+zz/%$&=;EB!
n !KLMNO;P$()*S84+34,-.n
d3
?@ABC6D(00
EB;_(=;
v)uv,f!
!
"#&&
!!
(D
"
#&&$#)+
!!
)D#+.
#
#)!$%"+
!#$&
*D
%
%"!$%&+
!&$)
*D
&
%&!$%)+
!(
*D#+.
%)!$&!+
!)
!+D
(
&!!$&&+
"+
!!D#+.
)
&&!$&(+
"+
!#D
*
&(!$+
--- Chunk 1 ---
Length: 500 characters
Content:
+
""$&
!&D
!+
+!$"+
""$&
!(D
!!
"!$+
"%$)
!(D
!"
!$)*+
"(
!(D
!#
)*!$(!+
"*$#
!(D
!!12345qHnEB}BN&T+$#0 ?Lfu0_ABCop?qSKH&Cr3O!AB
"12345~u\9:<s=>&zQ#Bb!
($(!!"#$AB"%"$$!#
q ,-(&%"++#k 4$&$)"ghij56#ls56!
($&!!"%&AB"%"$$$#
q ,-(&%"++#k 4$&$*"ghkl56#ls56!
($#!*+5AB"%"$$(#
q ,-(&%"++#k 4$&$""pq+56#ls56!
($)!=>@91AB"%"$"$!#
5_12345??@ABC6D"!.;<&@="#&)12345q=!?7JHCt#k?
&/uv,f?56v)H&012345?N48M[uv7Cos!
qt#k?&/EBC?@H}B&_C7~2\?@)shk7H7!&+00 u&12345?
!Oc*~z/\"32H&^2Ht12345tv)CHu2m&W0"4,Dv!
--- Chunk 2 ---
Length: 147 characters
Content:
$)0 Rw7
f?u2xy!tls56m~)z{)_4B|M&X{_R2}Z~~@_^d!C!>f+b
12345Z~^d!
($*!5#34?@91AB"%"$"$$$!#
)"4[5zT9)z{|M&zC12345??@Hqt#k?&/EBls}B&_C7~
(
!"!"#"#!$%%&

BIN
data/pdf_chunks/page_12.png Normal file

Binary file not shown.
Size: 246 KiB

@@ -0,0 +1,25 @@
=== Page 12 chunking result ===
Image: page_12.png
MD5: 36205c3c596cc35b23b053fb82ff16f0
Text length: 380 characters
Chunk size: 500 characters
Chunk count: 1
--- Chunk 0 ---
Length: 380 characters
Content:
2\?@)shk7H7!&+00 u!)345?fN54v"L!++00"]"4JH)L!++00#
@=%$#12345Z~C"^d!
12O#$
< "!5#34?@91AB
($!%! 5]34?@91AB"%"$"$$$$$
qt#k?&/EB C 1 2 3 4 5 ? @ H l s } B#) " 5 4 " L !++ 00"] N 4 J H ) L
!++00#@=&$#ym12345Z~CN^d!
a312345C3gNOPWN@?EFGH#3V12Wo%YCH)[")?<sG3,;
<#4~CTNOP"WN@$k7t[?@Ni\%?8MHqt#S}[<sG\,;_?EB#^
EBC7~2\^8MH6DO="M<??@ABC6D"!.;<$}!&+00 u!
12O#$
< (!5]34?@91AB
)
!"!"#"#!$%%&

BIN
data/pdf_chunks/page_13.png Normal file

Binary file not shown.
Size: 294 KiB

@@ -0,0 +1,25 @@
=== Page 13 chunking result ===
Image: page_13.png
MD5: c2a7f4413915dde39c39f15dd5b9c532
Text length: 422 characters
Chunk size: 500 characters
Chunk count: 1
--- Chunk 0 ---
Length: 422 characters
Content:
!!a312345C3gXH?NOPWN@?EFGH!4_CNuv?r/56~jNOP
l(ls"a3EFGHH?EB[?@H?EB\&!4?@H?EB~Ri3,fWC"(H
X0EBrsC!/2HH"?@H?EB[EFGHH?EBZ~\)*p"Ri?,fW(H6
DCr6fEBZ,D&?"+b~O\p"^R,W(Hm}~At12345CNuvm!;
<G*Qw*+,#j-;<G(V?.9KL/)"
($!!!ABC}~AB#%"$($$$$
)nt)DEFnyC\pk]mfu"hiDNF?wxoS}&++B ?0!_7C0"z[
DEFT?8Mk%&D,#@=$"
< &!ABC}~AB
($!$!ABC(8KLAB#%"$($$($
)DEF91KG{C5z."lEFs!qD,GLR?0TkynyGH!wNjDEF%"
lnyGHS}"+B&0 ?0T#@=($"
< #!ABC(8KLAB
!"!"#"#!$%%&

BIN
data/pdf_chunks/page_14.png Normal file

Binary file not shown.
Size: 139 KiB

@@ -0,0 +1,7 @@
=== Page 14 chunking result ===
Image: page_14.png
MD5: 1786903bed6f0742383508e7de103f6a
Text length: 0 characters
Chunk size: 500 characters
Chunk count: 0
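
The reports list a chunk size of 500 characters and, for this page, zero chunks from empty text. A minimal sketch of fixed-size chunking that produces the same kind of summary, assuming hypothetical helpers `chunk_by_size` and `page_report` (the script that generated these files is not part of the visible diff) and assuming each piece is whitespace-trimmed, which would be consistent with the 499-character chunks in some reports:

```python
def chunk_by_size(text: str, size: int = 500) -> list[str]:
    """Cut `text` into pieces of at most `size` characters, trimming whitespace."""
    if size <= 0:
        raise ValueError("size must be positive")
    pieces = (text[i:i + size].strip() for i in range(0, len(text), size))
    return [p for p in pieces if p]  # drop pieces that were only whitespace


def page_report(page: int, text: str, size: int = 500) -> dict:
    """Summarize one page the way the report headers do."""
    chunks = chunk_by_size(text, size)
    return {
        "page": page,
        "text_length": len(text),
        "chunk_size": size,
        "chunk_count": len(chunks),
        "chunk_lengths": [len(c) for c in chunks],
    }


empty_page = page_report(14, "")
long_page = page_report(2, "x" * 1858)
trimmed_page = page_report(10, "a" * 499 + " " + "b" * 533)
```

With trimming, the chunk lengths no longer have to add up to the text length, so a report that shows both numbers makes that gap visible.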

BIN
data/pdf_chunks/page_2.png Normal file

Binary file not shown.
Size: 250 KiB

143
data/pdf_chunks/page_2.txt Normal file

@@ -0,0 +1,143 @@
=== Page 2 chunking result ===
Image: page_2.png
MD5: 5a0da54af84e8f0a099ce6b22e5d870d
Text length: 1858 characters
Chunk size: 500 characters
Chunk count: 4
--- Chunk 0 ---
Length: 500 characters
Content:
!!!"
"#
!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!$%
!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
"!&$()*+
!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
#!,-./0
!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
#$!!12345
!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
#$"!46
!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
#$#!78
"
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
#$%!9:;<=>
"
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
#$&!?@ABC6D
"
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
#$!EFGHI
"
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
#$(!JKLM
"
!!!!!!!!!
--- Chunk 1 ---
Length: 500 characters
Content:
!!!!!!!!!!!!!!!!!!!!!!!!!!!
#$)!NOP
"
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%!Q,RS
#
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$!!TU
#
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$!$!!V/WXYZ[\]^_
#
!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$!$"!ab
#
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$"!cdef
#
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$#!ghij!ghkl!EFGHI!mno.pq+
#
!!!!!!!!!!!!!!!!!!!
%$#$!!ghij
#
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$#$"!ghkl
#
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$#$#!EFGHI
#
!!!!!!!!!!!!!!!!!!!!!
--- Chunk 2 ---
Length: 500 characters
Content:
!!!!!!!!!!!!!
%$#$%!mno
%
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$#$&!pq+
%
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$%!r/
%
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$%$!!str/
%
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$%$"!uvr/
%
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$&!q+
%
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$&$!!wxyz+
%
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$&$"!{|}~
%
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$&$#!!"#$
%
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$&$%!?@
&
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--- Chunk 3 ---
Length: 358 characters
Content:
!
%$&$&!%&ef
&
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$&$!()*+z
&
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$&$(!JKLMef
&
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$&$)!BC
&
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$!,-./.0)12
&
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$$!!34RS
&
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
%$$"!./.0)12
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
"
!"!"#"#!$%%&

BIN
data/pdf_chunks/page_3.png Normal file

Binary file not shown.
Size: 219 KiB

@@ -0,0 +1,85 @@
=== Page 3 chunking result ===
Image: page_3.png
MD5: 1ce32bc780c7e9192ce48acc2b9f7dea
Text length: 1089 characters
Chunk size: 500 characters
Chunk count: 3
--- Chunk 0 ---
Length: 499 characters
Content:
&!5678
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
&$!!34RS
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
&$!$!!569-
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
&$!$"!56:;<f
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
&$!$#!56=>
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
&$"!V/WXYZ[?56!@%$!$!"
(
!!!!!!!!!!!!!!!!!!!!!!!!!
&$#!ab56!@%$!$""
(
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
&$%!AB56!@%$"#%$&$%$""
(
!!!!!!!!!!!!!!!!!!!!!!!!!!!!
&$&!ghij56!@%$#$!"
(
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
&$!ghkl56!@%$#$""
(
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
--- Chunk 1 ---
Length: 500 characters
Content:
&$(!pq+56!@%$#$&"
(
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
&$)!str/56!@%$%$!"
(
!!!!!!!!!!!!!!!!!!!!!!!!!!!!
&$*!C"uv?r/56!@%$%$"$!"
(
!!!!!!!!!!!!!!!!!!!!!!!!
&$!+!CNuv?r/56!@%$%$"$""
)
!!!!!!!!!!!!!!!!!!!!!!!!
&$!!!DEFef56!@%$&$#$""
*
!!!!!!!!!!!!!!!!!!!!!!!!!!!
&$!"!DEFnyGH56!@%$&$#$&"
*
!!!!!!!!!!!!!!!!!!!!!!!!!
&$!#!?@IJnyGH56!@%$&$%$""
!+
!!!!!!!!!!!!!!!!!!!!!!!
&$!%!%&56!@%$&$&"
!+
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
&$!&!()*+z56!@%$&$"
!+
!!!!!!!!!!!!!!!!!!!!!!!!!
&$!!JKLMef56!@%$&$("
!+
!!!!!
--- Chunk 2 ---
Length: 89 characters
Content:
!!!!!!!!!!!!!!!!!!!!!
&$!(!BCDKLf56!@%$&$)$""
!+
!!!!!!!!!!!!!!!!!!!!!!!!!
#
!"!"#"#!$%%&

BIN
data/pdf_chunks/page_4.png Normal file

Binary file not shown.
Size: 289 KiB

@@ -0,0 +1,37 @@
=== Page 4 chunking result ===
Image: page_4.png
MD5: 0d28f24386def56047a5f6e6250c75ab
Text length: 585 characters
Chunk size: 500 characters
Chunk count: 2
--- Chunk 0 ---
Length: 500 characters
Content:
#!!$
!!M.NOeP.N!
M.NQRSTUV"WX ,-!%(%(#!**#$12345YZRS%!
M.N[ ,-!%(%(#!**#\]"^R_ab&
###cde#$( JKLM()#$) NOP(fg/0*
###hi ,-(&#"++#"j%$! TU(k?%$!$! V/WXYZ[\]^_(.%$!$" a
b(?Q,RS.5678lsemn*
###op ,-&"*$&$qr-0)12!st0)12%"j%$ ,-uv(lsemn*
###cde%$&$& %&ef()%$&$ ()*+z()%$&$( JKLMef(.%$&$)$"BC
DKLf(wxyQ,RS"z{|d}e\~5678*
###mne&$!+ CNuv?r/56(5678*
###cde5678?&$! 34RS("!"&$!$! 569-()&$!$" 56:;<f(.&$!$#
56=>(?34RS!
M.N#k$%&()*+H!
M.N#Z$st.NQ,,-*./!
M.NV012&34k%(56k7)89:,-;_<=>?k7)
--- Chunk 1 ---
Length: 85 characters
Content:
@AB12)-C^DE!
M.N^RV0F&GH!)IJK)LL!
M.NMWX.N?NOPMQRSTO&
###,-!%(%(#!**#!
!
!"!"#"#!$%%&

BIN
data/pdf_chunks/page_5.png Normal file

Binary file not shown.
Size: 300 KiB

@@ -0,0 +1,32 @@
=== Page 5 chunking result ===
Image: page_5.png
MD5: 839a0e80672006dd66ed82a984db2f17
Text length: 474 characters
Chunk size: 500 characters
Chunk count: 1
--- Chunk 0 ---
Length: 474 characters
Content:
% & ( ) * + , -
!!./
M.N&/eU3V12WXV12;Y?12345?YZQ,RS.5678!
M.NZ[)\st345W]^)\_Vab??345"acd345#!
$!0.12345
be*+k?fghiM.N?()jkOM.N?fg!lmnUo?()*+$_pNMC
?mn1"Z!"qr?st#WmuPvZ[)\M.N$wj$xyhiM.Nzk{|?}7~!
m"W0)#$*+?\cPM!lmZnUo?()*+$_\cPM[)\M.N!
,-(&%"++#!$%stYZQ,&$
!6789:
be,-./0[)\M.N!
$!
%&()!#$%&()%#*#&+,
3&45($}54[K)?x*o~b+k3,+W-+$z./F0BC12"4jst
?5(!a34B[K)?x*o*k?+=O-+$45466f~p\646?37!
$$
(;!()-#.
C3hDw48?f54T9?6D$:f54[K)x*?El;<"@=!#!
< !!)(=>?@91AB
!
!"!"#"#!$%%&

BIN
data/pdf_chunks/page_6.png Normal file

Binary file not shown.
Size: 308 KiB

@@ -0,0 +1,42 @@
=== Page 6 chunking result ===
Image: page_6.png
MD5: 4c7cdc6905a0bb9d38b4ccbb7ea9526a
Text length: 518 characters
Chunk size: 500 characters
Chunk count: 2
--- Chunk 0 ---
Length: 500 characters
Content:
$
CD!/)-#(0)+
TU?>7?fzAW0*+Zb@ABC9:0)kWb,D?EF!
$"
EFGHIJ!12)3-&)%+4+,(0)+
<5GYC12345??@H"IBJCBCH"IKLMDNFW!Oc*?PQ!
$(
KLMNO;P#!5QR$!%,(-1#+/)+3,-&+(26+-&
BCRAD@2\S?2H"T?@)k7UBCVB)k7?\]6D#@="$!
< $!KLMNO;P#!.QR$AS
$&
TUVWX!+762,+6)2()0,%21
lW)XbSTG"vYO12345?EFGHI%
/$!Z.N[\W[]^E"__GNGHaf]\)00&b_clde7fp\$#00
?ghqi+
1$!C[jkyN"]\[]?Ef;<?[]?EFGHil
2$!C[jkyN"GHil]\#$"00"9:<sm[<5G?ghi2Wbx*?[]?E
FGHil!
$#
YZ[\!-,,%,(860,$8)2
)\<|FJK12;<12345m"lWRCn?GH!
$)
]^_!9-#.8()+-&+
12345NiWU12oE?
--- Chunk 1 ---
Length: 18 characters
Content:
GH!
"
!"!"#"#!$%%&

BIN
data/pdf_chunks/page_7.png Normal file

Binary file not shown.
Size: 517 KiB

106
data/pdf_chunks/page_7.txt Normal file

@@ -0,0 +1,106 @@
=== Page 7 chunking result ===
Image: page_7.png
MD5: 24bb1954cdc6c4c03951780a2ffa40e3
Text length: 1017 characters
Chunk size: 500 characters
Chunk count: 3
--- Chunk 0 ---
Length: 499 characters
Content:
"!6,-
"$!!ab
"$!$!!c9defghijkS
"$!$!$!!lm,-
12345?W*pi+.TU!q&$""V/WXYZ[?56#56!V/WXYZ[?56)
3?r9s~W)t!k?\]^_?&/$
n !!%&()aboc9defgh?ijkS
Z!![
u
31
v
45
w
-/
x
67
y
68
z
91
{
:;
|
3<
\]^_%"0;%=;#
+
"&
!+++
(&
+
*+
+
&++
!!C}~A12?9:.W!@?sOm!a312345"$i+WTU#\_W*p&#b&
;_&;<W_V$W2%&Z()&*+W,-.k?/0!4#$i+WTUZ[)MRS$
"$!$!$$!ABpqrE
#\&$""V/WXYZ[?56#?<1f?2!C}~R?3T956)3m4R3g_r
9?l5)3$&$""V/WXYZ[?56#?l5)3~67t"kl5r9s!X8Ar9N?l
5)3$
l12345TU?l5)3r9s9\Wn\t!k\]^_!4:5OmW)M.N?RS$
n $!sghturEvw
Z!![
u
31
v
45
--- Chunk 1 ---
Length: 500 characters
Content:
w
-/
x
67
y
68
z
91
{
:;
|
3<
l5r9#;%">#
+
+
#+
#+
#+
#+
&+
+
!!<=
z?l5)3O!"+0;%=;!t"k?l5)3r9#;O#+>!4
l5)3r9s?!"+@!"+A#+>?!"+@#?)%"0;%=;#$
#g;>:5OW)M.N?RS"t!kWXYzZ[?\]^_O*+0;%=;#$
"$!$$!xy1z
12345?qi+?@0)ATU$
q&$#"ab56#56!~W) ,-(&("++#BC -"ab#?\DRS$
"$$!{|}~
12345C9:0).W!@?E9:0)?STb!Xpq&$%"AB56#ls56N!_g
hqi+vZ~HF78WGHW@?8I$
"$!!"#$&!"%&&TUVWX&()8*+5
"$$!!!"#$
q&$&"ghij56#56!12345HZ~JCghW*p?/0ghij$
"$$$!!"%&
q&$"ghkl56#56!12345HZ~JCghW*p?/0ghkl$
"$$!TUVWX
C=#M<? 4&-KLsZ~JCEFGH
--- Chunk 2 ---
Length: 17 characters
Content:
I$
#
!"!"#"#!$%%&

BIN
data/pdf_chunks/page_8.png Normal file

Binary file not shown.
Size: 458 KiB

@@ -0,0 +1,47 @@
=== Page 8 chunking result ===
Image: page_8.png
MD5: b1bbea8b2299d31d99e5213e8511e935
Text length: 754 characters
Chunk size: 500 characters
Chunk count: 2
--- Chunk 0 ---
Length: 500 characters
Content:
!!n!4 KL#hi?@t)k7""48Xp5DkR8MUfD|wMTNoMwk?MOO#
-KLO?@t)k7"N48Xp5DkR8MUfD|wMTNo?N7.H7#
< !,-./0TUVWX?12
"$$"!()
12345Z~CghW.kPQ?mno$<5GCgh<s2Hm$ghWb*p?R2il
%=a!4B[SPT9"RT)*?4Us?VW&v~p\&00 W]\!"00#
"$$(!*+5
U#gXpXb120)?12345$C56".56N$_WYZW56k[B?i+$q
&$(%pq+56&56mvZ~\Zt]pq+6?;#
"$"!@91
"$"$!!=>@91
12345q&$)%str/56&ls56m$Z~^d#
"$"$$!34@91
"$"$$$!!5#34?@91
12345q&$*%C"uv?r/56&ls56m$Z~C"^d#
"$"$$$$!5]34?@91
12345q&$!+%CNuv?r/56&ls56m$Z~CN^d#
"$(!+5
"$($!!67895
MC)_wxWyz)?[\"[]"[jn$CqM.NRSls56m
--- Chunk 1 ---
Length: 254 characters
Content:
$Z~HF78"["G
HW@?8IWa7~C?#b#
"$($$!:;<=
)\{|EFGHI?{|}~~bBc(+B d0jZ[B#
"$($!>?v@
"$($$!!ABCDEF~GH
a3DEFm3&WIJ?)*m$DEFH~C3gef?.gW=h$ijK.2DEFk
]"l+?\pk]mf#.gZ~nPDEF~C?ef$\pk]mfTDEFclVZ~p
\DEFof?"$&p$bDEF\pk]mf.gXb~CUqC3gFBof?afsrs_~
C?ef#
%
!"!"#"#!$%%&

BIN
data/pdf_chunks/page_9.png Normal file

Binary file not shown.
Size: 658 KiB

@@ -0,0 +1,66 @@
=== Page 9 chunking result ===
Image: page_9.png
MD5: a0f4f9f366a1ca5a79237df39dfcb248
Text length: 1168 characters
Chunk size: 500 characters
Chunk count: 3
--- Chunk 0 ---
Length: 500 characters
Content:
"$($$$!ABC?}~
DEFq&$!!!DEFef56"ls56#Z~78$
"$($$!AIC
DNF~X12345?OCk7Mk7rs_fl?jY#tDNFu\\L2H#?@u\\
92Hm#vwT9?6D~Z]\%&(00$
"$($$"!AICJ&
DNFfl~GCD|W_r|GH#D|W_r|GH~bBc(+B ?d0jZ~[B$
xUPk?DNFZcyfg?^P$
"$($$(!ABC(8KL
q&$!"!DEFnyGH56"ls56m#DEF["lEFT9Z~C\j2Y$DEF%"
l+p_q+vZ~nP$
"$($"!KL
"$($"$!!KCDEF~
a3?Fm3&WIJ?)*m#?FH~C3gef?.gW=h#ijK.2?Fk]5z
?\pk]mf!:?@WIJA?\]Lf"$.gZ~nP?F~C?ef#\pk]mfT?F
{lVZ~p\?Fof?"p#b?F\pk].gXb~CUqC3gFBof?afsrs_
~C?ef$
"$($"$$!KLMN(8KL
C9:0)?STb#?@n^~b+zKny?@#0_Z~Cgh7CHY2$1234
--- Chunk 1 ---
Length: 500 characters
Content:
5q
&$%!AB56"ls56N#@q&$!#!?@IJnyGH56"ls56m#?@nyGH\j\?F
Cgh7CH|Z~CY2#b?Fj\5zZ~CR2$
"$($(!OP}~
q&$!%!%&56"ls56N#12345?}i2Z~HF(V#b}~?n!Wef_+$
"$($&!QRpST91
12345a3GC(#4q&$!&!()*+z56"ls56m#(p(.5T))
uZ~78W"a#b$
"$($#!YZ[\}~
12345a3GCJKLM#4q&$!!JKLMef56"ls56m#JKLMpLM[5
Twxi2Z~78W"a#b$
"$($)!NO
"$($)$!!NOpS
12345?BCH&b|~CBV)#ZEBCC3g1/?#$BV)#bQ2KO<5G?
B{+UBV)$
"$($)$$!NOPUV~
q&$!(!BCDKLf56"ls56#BC?\9uDK)Z~p\%+00$
"$&!WXGY8Z3[\
"$&$!!]^,-
/"!12345,-?N%~!",-./.0)uv#bH\&\(?i2#0qrG91Y
ZK0)12345#)0)Zt.
--- Chunk 2 ---
Length: 168 characters
Content:
k?PQ*A\9$
1"!t0)12.YZ+<,m-)X&+m!aC12345MT.%W_!GH.n.%W
C_!Gs.B"#~r6_st?3/$
2"!C,-./.0)12H~0)&$0>$/0(&+1(&n2(nYZ+<?>T~]\
Wn\x34T>#+<st?>T~]\Wn\p534T>$
&
!"!"#"#!$%%&

@@ -3,5 +3,47 @@ name = "backend"
 version = "0.1.0"
 description = "Add your description here"
 readme = "README.md"
-requires-python = ">=3.9"
-dependencies = []
+requires-python = ">=3.12"
+dependencies = [
+    # Web framework
+    "fastapi>=0.110.0",
+    "uvicorn>=0.27.0",
+    # LangGraph & LangChain
+    "langgraph>=0.0.40",
+    "langchain>=0.2.0",
+    "langchain-community>=0.2.0",
+    # DashScope
+    "dashscope>=1.14.0",
+    # Milvus
+    "pymilvus>=2.3.0",
+    # MinIO
+    "minio>=7.1.0",
+    # PostgreSQL
+    "sqlalchemy>=2.0.0",
+    "psycopg2-binary>=2.9.0",
+    # Redis (optional, for caching)
+    "redis>=5.0.0",
+    # Document parsing
+    "pypdf2>=3.0.0",
+    "python-docx>=1.1.0",
+    "pdfplumber>=0.10.0",
+    "pdf2image>=1.16.0",
+    "pillow>=11.3.0",
+    "pymupdf>=1.26.5",
+    "pytesseract>=0.3.13",
+    # Pydantic settings
+    "pydantic>=2.0.0",
+    "pydantic-settings>=2.0.0",
+    # Utilities
+    "python-multipart>=0.0.9",
+    "sse-starlette>=1.8.0",
+    "python-dotenv>=1.0.0",
+    "tiktoken>=0.5.0",
+    "httpx>=0.25.0",
+]
+
+[dependency-groups]
+dev = [
+    "pytest>=7.4.0",
+    "pytest-asyncio>=0.21.0",
+]

@@ -13,10 +13,21 @@ dashscope>=1.14.0
 # Milvus
 pymilvus>=2.3.0
+# MinIO
+minio>=7.1.0
+# PostgreSQL
+sqlalchemy>=2.0.0
+psycopg2-binary>=2.9.0
+# Redis (optional, for caching)
+redis>=5.0.0
 # Document parsing
 pypdf2>=3.0.0
 python-docx>=1.1.0
+pdfplumber>=0.10.0
+pdf2image>=1.16.0
 # Pydantic settings
 pydantic>=2.0.0

2714
uv.lock generated

File diff suppressed because it is too large