feat: Migrate document parsing to Aliyun and update embedding configurations

- Updated LocalDocumentParser to include raw_layouts and artifact_prefix from settings.
- Added new documents with failure reasons and metadata to documents.json for better error tracking.
- Created a new documentation file detailing the Aliyun ingest implementation process.
- Updated RFC to reflect changes in the parsing backend and embedding dimensions.
- Modified tests to accommodate the new embedding dimension of 1024 and updated parser and chunk builder assertions.
- Verified migration configurations to ensure correct settings for embedding model and backend.
ash66
2026-05-18 22:30:28 +08:00
parent 3f69cad404
commit c22b03dc07
26 changed files with 1092 additions and 6500 deletions

@@ -39,7 +39,7 @@ AIRegulation-DocAnalysis-Demo/
### 1. Install dependencies
```bash
pip install -r backend/requirements.txt
./dev.sh setup
```
### 2. Start the Milvus vector database
@@ -57,7 +57,7 @@ docker-compose logs -f milvus
### 3. Start the API service
```bash
PYTHONPATH=backend uvicorn app.main:app --reload --port 8000
./dev.sh start api --foreground
```
API documentation: http://localhost:8000/docs
@@ -104,6 +104,8 @@ MILVUS_PORT=19530
# Aliyun document parsing
ALIBABA_ACCESS_KEY_ID=your_aliyun_access_key_id
ALIBABA_ACCESS_KEY_SECRET=your_aliyun_access_key_secret
PARSER_BACKEND=aliyun
CHUNK_BACKEND=aliyun
# Embedding configuration
EMBEDDING_MODEL=text-embedding-v3
@@ -121,6 +123,17 @@ CHUNK_SIZE=512
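- Hybrid-retrieval question answering
- Regulation change monitoring and automatic updates
## Parsing artifacts
After a successful upload, the system persists the intermediate results of Aliyun parsing to MinIO: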
- `artifacts/{doc_id}/layouts.json`
- `artifacts/{doc_id}/structure_nodes.json`
- `artifacts/{doc_id}/semantic_blocks.json`
- `artifacts/{doc_id}/vector_chunks.json`
The current default Milvus collection is `regulations_dense_1536_v2`.
## License
MIT License
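
The four per-document artifact paths listed in the README hunk above follow a single pattern, which can be sketched as a small key-building helper. This is a minimal illustration and not the repository's code: `artifact_keys`, `ARTIFACT_NAMES`, and the default `prefix` value are hypothetical names chosen here. Only the four file names and the `artifacts/{doc_id}/` pattern come from the README.

```python
from typing import Dict

# File stems taken from the README's artifact list; the constant name is hypothetical.
ARTIFACT_NAMES = ("layouts", "structure_nodes", "semantic_blocks", "vector_chunks")


def artifact_keys(doc_id: str, prefix: str = "artifacts") -> Dict[str, str]:
    """Map each artifact name to its object key under `prefix` (assumed layout)."""
    # Reject empty IDs and IDs containing a slash, which would change the key layout.
    if not doc_id or "/" in doc_id:
        raise ValueError(f"invalid doc_id: {doc_id!r}")
    return {name: f"{prefix}/{doc_id}/{name}.json" for name in ARTIFACT_NAMES}


if __name__ == "__main__":
    for name, key in artifact_keys("doc-123").items():
        print(f"{name}: {key}")
```

Keeping the prefix as a parameter mirrors the commit message's mention of an `artifact_prefix` setting, although how the project actually wires that setting in is not shown in this diff.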