feat: Migrate document parsing to Aliyun and update embedding configurations

- Updated LocalDocumentParser to include raw_layouts and artifact_prefix from settings.
- Added new documents with failure reasons and metadata to documents.json for better error tracking.
- Created a new documentation file detailing the Aliyun ingest implementation process.
- Updated RFC to reflect changes in the parsing backend and embedding dimensions.
- Modified tests to accommodate the new embedding dimension of 1024 and updated parser and chunk builder assertions.
- Verified migration configurations to ensure correct settings for embedding model and backend.
2026-05-18 22:30:28 +08:00
parent 3f69cad404
commit c22b03dc07
26 changed files with 1092 additions and 6500 deletions
--- a/docs/architecture/aliyun-ingest-implementation.md
+++ b/docs/architecture/aliyun-ingest-implementation.md
@@ -0,0 +1,71 @@
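+# Aliyun Parsing Main Pipeline: Implementation Notes
+
+This document describes the document ingest main pipeline as currently implemented in this repository. It serves as the closing note that ties the migration design to the code.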
+
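+## 1. Current default pipeline
+
+- The upload endpoint remains `/api/v1/documents/upload`
+- `PARSER_BACKEND=aliyun` by default
+- `CHUNK_BACKEND=aliyun` by default
+- The default Milvus collection is `regulations_dense_1536_v2`
+- Parsing artifacts are written to MinIO under `artifacts/{doc_id}/`
+
+The full main pipeline is as follows:
+
+1. Upload the original file to MinIO
+2. `AliyunDocmindGateway` submits an asynchronous Aliyun parsing job
+3. Poll the job status until it succeeds or times out
+4. Fetch `layouts` page by page
+5. Convert them into `structure_nodes / semantic_blocks / vector_chunks`
+6. Write the three-layer structure JSON back to MinIO
+7. Call the embedding API with `vector_chunks[*].embedding_text`
+8. Write into `regulations_dense_1536_v2`
+9. Update the document status to `indexed`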
+
+The runtime conversion logic lives in `backend/app/infrastructure/parser/aliyun_layout_normalizer.py`.
+The old `backend/app/aliyun_parser/` sample directory has been removed and is not part of the production runtime.
+
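+## 2. Persisting parsing artifacts
+
+For each document, the following objects are additionally written:
+
+- `artifacts/{doc_id}/layouts.json`
+- `artifacts/{doc_id}/structure_nodes.json`
+- `artifacts/{doc_id}/semantic_blocks.json`
+- `artifacts/{doc_id}/vector_chunks.json`
+
+`documents.json` keeps only the object keys, statistics, and processing stage; it does not store the full large JSON payloads.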
+
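+## 3. Failure policy
+
+- Currently `PARSER_FAILURE_MODE=fail`
+- An Aliyun parsing failure does not automatically fall back to the local parser
+- On failure, the original file and any artifacts already written are kept to aid troubleshooting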
+
+## 4. Runtime parameters
+
+The key environment variables are:
+
+- `ALIBABA_ACCESS_KEY_ID`
+- `ALIBABA_ACCESS_KEY_SECRET`
+- `ALIBABA_ENDPOINT`
+- `ALIYUN_PARSE_POLL_INTERVAL_SECONDS`
+- `ALIYUN_PARSE_TIMEOUT_SECONDS`
+- `ALIYUN_PARSE_LAYOUT_STEP_SIZE`
+- `ALIYUN_LLM_ENHANCEMENT`
+- `ALIYUN_ENHANCEMENT_MODE`
+- `DOCUMENT_PARSE_ARTIFACT_PREFIX`
+- `PARSER_BACKEND`
+- `CHUNK_BACKEND`
+
+## 5. Runtime verification
+
+The following fields can be checked via `/api/v1/status/config`:
+
+- `parser_backend`
+- `chunk_backend`
+- `milvus_collection`
+- `artifact_prefix`
+- `parser_failure_mode`
+
+These values confirm whether the service is actually running on the post-migration default pipeline.
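
The runtime check described in section 5 of the new document could be scripted roughly as follows. This is a minimal sketch, not code from the commit: it assumes the `/api/v1/status/config` response is a flat JSON object keyed by the listed field names, and `EXPECTED_CONFIG` and `find_config_mismatches` are hypothetical names. The expected values come from sections 1 and 3; `artifact_prefix` is left out because the document does not state its default.

```python
from typing import Any, Mapping

# Expected post-migration values, taken from sections 1 and 3 of the document.
# The flat response shape is an assumption; adjust if the endpoint nests fields.
EXPECTED_CONFIG: dict[str, Any] = {
    "parser_backend": "aliyun",
    "chunk_backend": "aliyun",
    "milvus_collection": "regulations_dense_1536_v2",
    "parser_failure_mode": "fail",
}


def find_config_mismatches(
    config: Mapping[str, Any],
    expected: Mapping[str, Any] = EXPECTED_CONFIG,
) -> dict[str, tuple[Any, Any]]:
    """Return {field: (expected, actual)} for each field that differs or is missing."""
    mismatches: dict[str, tuple[Any, Any]] = {}
    for field, want in expected.items():
        got = config.get(field)  # None when the field is absent
        if got != want:
            mismatches[field] = (want, got)
    return mismatches


if __name__ == "__main__":
    # Hypothetical response body, e.g. parsed from the status endpoint's JSON.
    sample = {
        "parser_backend": "aliyun",
        "chunk_backend": "local",
        "milvus_collection": "regulations_dense_1536_v2",
        "artifact_prefix": "artifacts",
        "parser_failure_mode": "fail",
    }
    for field, (want, got) in find_config_mismatches(sample).items():
        print(f"{field}: expected {want!r}, got {got!r}")
```

Keeping the comparison as a pure function over a mapping means it can be exercised in tests without a running service; the HTTP call that fetches the config is deliberately left to the caller.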
--- a/docs/rfc/backend-api-parsing-embedding-migration-requirements.md
+++ b/docs/rfc/backend-api-parsing-embedding-migration-requirements.md
@@ -29,7 +29,7 @@
 The confirmed target requirements are as follows:

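 - Document parsing is unified on Aliyun Document Intelligence
-- The current Aliyun integration is based on `backend/app/aliyun_parser/parse_pdf.py`
+- The current Aliyun integration base has been migrated to `backend/app/infrastructure/parser/aliyun_layout_normalizer.py`
 - Parsing results are built on the three-layer structure `structure_nodes`, `semantic_blocks`, `vector_chunks`
 - Chunking follows the Aliyun `vector_chunks`; the current local `RegulationChunker` is no longer used
 - Embedding switches to OpenAI-compatible API calls, using the `text-embedding-v3` model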
@@ -80,7 +80,7 @@
 The affected parsing capabilities include:

 - The current local parser directory
-- `backend/app/aliyun_parser`
+- `backend/app/infrastructure/parser`

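 After the migration, Aliyun Document Intelligence becomes the primary parsing source. The local PDF/DOCX/MinerU parsing pipelines need a renewed decision on whether to keep them, retire them, or use them as a fallback; the concrete module organization is not defined in this document.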

@@ -133,7 +133,7 @@
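 The following risks and constraints are already clear for this phase and must be handled first in the subsequent architecture and implementation stages: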

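 - The old Milvus collection is incompatible with the new `1536`-dimension schema; a new collection and an index rebuild are required
-- The existing scripts in `backend/app/aliyun_parser` contain hard-coded keys, which must all be moved to environment variables later
+- Aliyun credentials must continue to be injected only through environment variables or the credential chain, and must not go back to being hard-coded in scripts
 - The RAG downstream currently depends on `clause_number`; after the migration, adapting to `section_title` and Aliyun chunk metadata takes priority
 - If the fields returned by Aliyun differ from the current samples, an adapter layer must be added during the architecture stage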
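
The credential constraint in the RFC hunk above (environment variables only, never hard-coded) could look roughly like this in code. This is a sketch under stated assumptions, not the repository's implementation: `AliyunCredentials` and `load_aliyun_credentials` are hypothetical names, and the three variable names are the ones listed in `aliyun-ingest-implementation.md` in this same commit.

```python
import os
from dataclasses import dataclass, field
from typing import Mapping

# Variable names as listed in aliyun-ingest-implementation.md (section 4).
REQUIRED_VARS = (
    "ALIBABA_ACCESS_KEY_ID",
    "ALIBABA_ACCESS_KEY_SECRET",
    "ALIBABA_ENDPOINT",
)


@dataclass(frozen=True)
class AliyunCredentials:
    """Hypothetical container; the secret is excluded from repr to keep it out of logs."""

    access_key_id: str
    access_key_secret: str = field(repr=False)
    endpoint: str = ""


def load_aliyun_credentials(env: Mapping[str, str] = os.environ) -> AliyunCredentials:
    """Read credentials from the environment and fail fast if any are missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )
    return AliyunCredentials(
        access_key_id=env["ALIBABA_ACCESS_KEY_ID"],
        access_key_secret=env["ALIBABA_ACCESS_KEY_SECRET"],
        endpoint=env["ALIBABA_ENDPOINT"],
    )
```

Accepting the environment as a parameter keeps the loader testable with a plain dict, and failing at startup surfaces a missing variable before any parsing job is submitted.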