feat: Migrate document parsing to Aliyun and update embedding configurations

- Updated LocalDocumentParser to include raw_layouts and artifact_prefix from settings.
- Added new documents with failure reasons and metadata to documents.json for better error tracking.
- Created a new documentation file detailing the Aliyun ingest implementation process.
- Updated RFC to reflect changes in the parsing backend and embedding dimensions.
- Modified tests to accommodate the new embedding dimension of 1024 and updated parser and chunk builder assertions.
- Verified migration configurations to ensure correct settings for embedding model and backend.

README.md
@@ -39,7 +39,7 @@ AIRegulation-DocAnalysis-Demo/
 ### 1. Install dependencies
 
 ```bash
-pip install -r backend/requirements.txt
+./dev.sh setup
 ```
 
 ### 2. Start the Milvus vector database
@@ -57,7 +57,7 @@ docker-compose logs -f milvus
 ### 3. Start the API service
 
 ```bash
-PYTHONPATH=backend uvicorn app.main:app --reload --port 8000
+./dev.sh start api --foreground
 ```
 
 API docs: http://localhost:8000/docs
@@ -104,6 +104,8 @@ MILVUS_PORT=19530
 # Aliyun document parsing
 ALIBABA_ACCESS_KEY_ID=your_aliyun_access_key_id
 ALIBABA_ACCESS_KEY_SECRET=your_aliyun_access_key_secret
+PARSER_BACKEND=aliyun
+CHUNK_BACKEND=aliyun
 
 # Embedding configuration
 EMBEDDING_MODEL=text-embedding-v3
@@ -121,6 +123,17 @@ CHUNK_SIZE=512
 - Hybrid-retrieval Q&A
 - Regulation change monitoring and automatic updates
 
+## Parsing artifacts
+
+After a successful upload, the system persists the intermediate results of Aliyun parsing to MinIO:
+
+- `artifacts/{doc_id}/layouts.json`
+- `artifacts/{doc_id}/structure_nodes.json`
+- `artifacts/{doc_id}/semantic_blocks.json`
+- `artifacts/{doc_id}/vector_chunks.json`
+
+The current default Milvus collection is `regulations_dense_1536_v2`.
+
 ## License
 
 MIT License
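
The artifact layout listed in the README hunk above follows one pattern per document. As a minimal sketch, the four object keys could be derived as shown below. `artifact_keys` and `ARTIFACT_NAMES` are hypothetical helpers, and the `prefix` parameter only mirrors the `artifact_prefix` setting mentioned in the commit message; the project's actual code may build these paths differently.

```python
from typing import Dict

# File stems taken from the README's artifact list.
ARTIFACT_NAMES = ("layouts", "structure_nodes", "semantic_blocks", "vector_chunks")


def artifact_keys(doc_id: str, prefix: str = "artifacts") -> Dict[str, str]:
    """Return the object key for each parsing artifact of one document."""
    # A doc_id containing "/" would silently change the key hierarchy.
    if not doc_id or "/" in doc_id:
        raise ValueError(f"invalid doc_id: {doc_id!r}")
    base = prefix.strip("/")
    return {name: f"{base}/{doc_id}/{name}.json" for name in ARTIFACT_NAMES}


if __name__ == "__main__":
    for name, key in artifact_keys("doc-123").items():
        print(f"{name}: {key}")
```

Keeping the key construction in one function means the upload path and any later reader of these objects agree on the same names.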