feat: Migrate document parsing to Aliyun and update embedding configurations
- Updated LocalDocumentParser to include raw_layouts and artifact_prefix from settings.
- Added new documents with failure reasons and metadata to documents.json for better error tracking.
- Created a new documentation file detailing the Aliyun ingest implementation process.
- Updated RFC to reflect changes in the parsing backend and embedding dimensions.
- Modified tests to accommodate the new embedding dimension of 1024 and updated parser and chunk builder assertions.
- Verified migration configurations to ensure correct settings for embedding model and backend.
@@ -29,7 +29,7 @@
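 The confirmed target requirements are as follows:

 - Document parsing switches entirely to Aliyun Document Intelligence
-- The current Aliyun integration is based on `backend/app/aliyun_parser/parse_pdf.py`
+- The current Aliyun integration has been migrated to `backend/app/infrastructure/parser/aliyun_layout_normalizer.py`
 - Parse results are built on three layers: `structure_nodes`, `semantic_blocks`, and `vector_chunks`
 - Chunking follows Aliyun's `vector_chunks` and no longer goes through the current local `RegulationChunker`
 - Embedding switches to an OpenAI-compatible API call, using the `text-embedding-v3` model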
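The last two requirements above (chunks come from Aliyun `vector_chunks`, embeddings go through an OpenAI-compatible API) can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the helper name `build_embedding_request` and the `text` key on each chunk are assumptions. Only the `vector_chunks` layer and the model name `text-embedding-v3` come from the RFC, and the payload shape (`model`, `input`) follows the OpenAI embeddings request format.

```python
from __future__ import annotations

from typing import Any

EMBEDDING_MODEL = "text-embedding-v3"  # model name taken from the RFC


def build_embedding_request(parse_result: dict[str, Any]) -> dict[str, Any]:
    """Build an OpenAI-style embeddings payload from Aliyun vector_chunks.

    Assumption: each entry in `vector_chunks` is a dict with a `text` key.
    """
    chunks = parse_result.get("vector_chunks", [])
    # Skip chunks whose text is empty or whitespace-only.
    texts = [c["text"].strip() for c in chunks if c.get("text", "").strip()]
    if not texts:
        raise ValueError("parse result contains no non-empty vector_chunks")
    return {"model": EMBEDDING_MODEL, "input": texts}


# Hypothetical parse result with the three layers named in the RFC.
sample = {
    "structure_nodes": [],
    "semantic_blocks": [],
    "vector_chunks": [{"text": "Article 1 ..."}, {"text": "  "}, {"text": "Article 2 ..."}],
}
payload = build_embedding_request(sample)
```

The payload would then be sent to the provider's embeddings endpoint. The endpoint URL and authentication are left out here because the RFC does not specify them.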
@@ -80,7 +80,7 @@
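 The affected parsing capabilities include:

 - The current local parser directory
-- `backend/app/aliyun_parser`
+- `backend/app/infrastructure/parser`

 After the migration, Aliyun Document Intelligence becomes the primary parsing source. The local PDF/DOCX/MinerU parsing pipeline needs its retain, decommission, or fallback strategy redefined, but the specific module organization is not defined in this document.
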
@@ -133,7 +133,7 @@
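 The following risks and constraints are already clear for this phase and must be handled first in the subsequent architecture and implementation stages:

 - The old Milvus collection is incompatible with the new `1536`-dimensional schema; a new collection and an index rebuild are required
-- The existing scripts in `backend/app/aliyun_parser` contain hard-coded keys; all of them must be moved to environment variables
+- Aliyun credentials must continue to be injected only through environment variables or the credential chain, and must not return to being hard-coded in scripts
 - The RAG downstream currently depends on `clause_number`; after the migration it must first be adapted to `section_title` and Aliyun chunk metadata
 - If the fields Aliyun returns differ from the current samples, an adapter layer must be added during the architecture stage
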
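The credential constraint above (environment variables only, never hard-coded) could be enforced with a small loader like the one below. This is a sketch under assumptions: the names `AliyunCredentials` and `load_credentials` are hypothetical, and the variable names `ALIBABA_CLOUD_ACCESS_KEY_ID` and `ALIBABA_CLOUD_ACCESS_KEY_SECRET` are assumed here, since the RFC does not say which variables the project reads.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from dataclasses import dataclass, field

# Assumed variable names; the RFC only requires "environment variables or the credential chain".
KEY_ID_VAR = "ALIBABA_CLOUD_ACCESS_KEY_ID"
KEY_SECRET_VAR = "ALIBABA_CLOUD_ACCESS_KEY_SECRET"


@dataclass(frozen=True)
class AliyunCredentials:
    access_key_id: str
    # repr=False keeps the secret out of logs and tracebacks.
    access_key_secret: str = field(repr=False)


def load_credentials(env: Mapping[str, str] = os.environ) -> AliyunCredentials:
    """Read credentials from the environment and fail fast if any are missing."""
    missing = [name for name in (KEY_ID_VAR, KEY_SECRET_VAR) if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return AliyunCredentials(env[KEY_ID_VAR], env[KEY_SECRET_VAR])
```

Passing the environment as a mapping keeps the loader testable without touching the real process environment, for example `load_credentials({KEY_ID_VAR: "id", KEY_SECRET_VAR: "secret"})`.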