feat(document): add database and process documents
623
docs/architecture/document-core-processing-flow.md
Normal file
@@ -0,0 +1,623 @@
# Core Document Processing Pipeline

This document describes the core document processing flow of the current default production pipeline, which consists of:

- `AliyunDocumentParser`
- `AliyunVectorChunkBuilder`
- `OpenAICompatibleEmbeddingProvider`
- `MilvusVectorIndex`

It aims to answer four core questions:

1. Why `ParsedDocument` is a multi-layer structure
2. Where each of these structures is stored
3. Which step actually performs vectorization
4. What exactly ends up stored in Milvus

The database table design, relational model, DDL, and PostgreSQL responsibility boundaries now live in [document-processing-database-design.md](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/docs/architecture/document-processing-database-design.md:1). This document keeps the flow perspective and only summarizes storage responsibilities where necessary; it is no longer the authority on database design.

## 1. Pipeline Overview

The current default implementation is orchestrated end to end by `DocumentCommandService.upload_and_process()`. It does not push parser output straight into the vector store. It first produces structured parse artifacts, then sends only the retrieval-oriented layer to embedding and Milvus.

```mermaid
sequenceDiagram
    participant API as API / Service
    participant MinIO as BinaryStore
    participant Parser as AliyunDocumentParser
    participant PG as DocumentRepository / ParseArtifactStore
    participant Embed as EmbeddingProvider
    participant Milvus as VectorIndex

    API->>MinIO: Save original file
    API->>Parser: parse(file_path, doc_id, doc_name)
    Parser-->>API: ParsedDocument
    API->>MinIO: Save layouts/structure_nodes/semantic_blocks/vector_chunks JSON
    API->>PG: Update documents.status=parsed
    API->>PG: Save structure_nodes / semantic_blocks
    API->>API: chunk_builder.build(parsed_document)
    API->>Embed: embed_texts([chunk.embedding_text])
    Embed-->>API: vectors
    API->>Milvus: upsert(chunks, vectors)
    API->>PG: Update documents.status=indexed
```

The orchestration code is in [services.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/application/documents/services.py:83):

```python
# Abridged excerpt: the setup of `document`, `temp_path`, and `health` is omitted.
def upload_and_process(
    self,
    *,
    doc_id: str | None = None,
    file_name: str,
    content: bytes,
    content_type: str,
    doc_name: str | None,
    regulation_type: str,
    version: str,
    generate_summary: bool,
) -> DocumentProcessResult:
    doc_id = doc_id or str(uuid.uuid4())[:8]
    final_doc_name = doc_name or file_name
    object_name = f"{doc_id}/{file_name}"

    # 1. Create the metadata record and store the original file.
    self.document_repository.create(document)
    self.binary_store.save(object_name=object_name, data=content, content_type=content_type, metadata={"doc_id": doc_id})
    self.document_repository.update_status(doc_id, DocumentStatus.STORED)

    # 2. Parse, then archive the parse artifacts as JSON.
    parsed_document = self.parser.parse(file_path=temp_path, doc_id=doc_id, doc_name=final_doc_name)
    artifact_keys = self._save_parse_artifacts(doc_id=doc_id, parsed_document=parsed_document)
    self.document_repository.update_status(doc_id, DocumentStatus.PARSED, parser_name=parsed_document.parser_name, metadata={...})

    # 3. Persist the structured snapshot when a store is configured.
    if self.parse_artifact_store:
        self.parse_artifact_store.save(doc_id, parsed_document.structure_nodes, parsed_document.semantic_blocks)

    # 4. Build chunks, embed them, and write them to the vector index.
    chunks = self.chunk_builder.build(parsed_document=parsed_document, regulation_type=regulation_type, version=version)
    vectors = self.embedding_provider.embed_texts([chunk.embedding_text for chunk in chunks])
    inserted = self.vector_index.upsert(chunks, vectors)

    self.document_repository.update_status(doc_id, DocumentStatus.INDEXED, chunk_count=len(chunks), index_name=health.get("collection_name", ""), metadata={...})
```

The default bindings are in [bootstrap.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/shared/bootstrap.py:157):

```python
def get_parser():
    if settings.parser_backend == "aliyun":
        return AliyunDocumentParser()
    return LocalDocumentParser()


def get_chunk_builder():
    if settings.chunk_backend == "aliyun":
        return AliyunVectorChunkBuilder()
    return LocalRegulationChunkBuilder(...)


def get_embedding_provider() -> OpenAICompatibleEmbeddingProvider:
    return OpenAICompatibleEmbeddingProvider()


def get_vector_index() -> VectorIndex:
    return LazyVectorIndex(_build_vector_index)
```

In other words, the current default pipeline is:

- parser: `AliyunDocumentParser`
- chunk builder: `AliyunVectorChunkBuilder`
- embedding provider: `OpenAICompatibleEmbeddingProvider`
- vector index: `MilvusVectorIndex`

## 2. Why `ParsedDocument` Has Three Layers

`ParsedDocument` is not the final storage format. It is the unified intermediate structure that the parser hands to downstream steps, and it separates structural understanding from vector retrieval preparation across three layers.

It is defined in [models.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/domain/documents/models.py:49):

```python
@dataclass
class ParsedDocument:
    doc_id: str
    doc_name: str
    structure_nodes: list[dict[str, Any]]
    semantic_blocks: list[dict[str, Any]]
    vector_chunks: list[dict[str, Any]]
    parser_name: str
    raw_text: str = ""
    raw_layouts: list[dict[str, Any]] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)
```

The three layers have different responsibilities:

- `structure_nodes`
  - The heading hierarchy skeleton
  - Describes which chapters, sections, and clauses the document contains
  - Preserves structure; it is not embedded directly

- `semantic_blocks`
  - The semantic block layer
  - Organizes body text, tables, and figure captions into contiguous semantic units
  - The intermediate layer between raw layouts and retrieval chunks

- `vector_chunks`
  - The retrieval and vectorization layer
  - Already a chunk view suitable for sending to the embedding model
  - The downstream `ChunkBuilder` essentially maps this layer onto the unified `Chunk` model

### 2.1 How the Three Layers Are Generated from the Parser Result

`AliyunDocumentParser.parse()` first fetches the `layouts` returned by Aliyun through the gateway, then converts those `layouts` into the three layers.

The code is in [aliyun_document_parser.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/parser/aliyun_document_parser.py:28):

```python
def parse(self, *, file_path: str, doc_id: str, doc_name: str) -> ParsedDocument:
    payload = self.gateway.parse_document(file_path=file_path)
    layouts = payload.layouts
    # Layers 1 and 2 are derived from the raw layouts.
    structure_nodes = build_structure_nodes(layouts)
    semantic_blocks = build_semantic_blocks(layouts)
    # Layer 3 is derived from the semantic blocks, not from the layouts.
    vector_chunks = build_vector_chunks(
        semantic_blocks,
        doc_id=doc_id,
        doc_title=doc_name,
        max_chars=MAX_CHARS,
        overlap_chars=OVERLAP_CHARS,
    )
    raw_text = "\n\n".join(
        block.get("text", "")
        for block in semantic_blocks
        if block.get("text")
    )
    return ParsedDocument(
        doc_id=doc_id,
        doc_name=doc_name,
        structure_nodes=structure_nodes,
        semantic_blocks=semantic_blocks,
        vector_chunks=vector_chunks,
        parser_name=self.parser_name,
        raw_text=raw_text,
        raw_layouts=layouts,
        metadata={...},
    )
```

In other words:

- The parser's raw output is `layouts`
- What the system actually consumes is `ParsedDocument`
- `ParsedDocument` is derived from `layouts` by the normalizer
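
`build_vector_chunks` itself is not shown in this document, so how it applies `max_chars` and `overlap_chars` is not known here. As a rough illustration of what a character window with overlap usually means, the following is a minimal sketch. `split_with_overlap` is a hypothetical helper introduced for this example, not the project's implementation.

```python
def split_with_overlap(text: str, max_chars: int, overlap_chars: int) -> list[str]:
    """Split text into windows of at most max_chars characters.

    Each window after the first repeats the last overlap_chars characters
    of the previous window. Hypothetical helper, for illustration only.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must be in [0, max_chars)")

    pieces: list[str] = []
    step = max_chars - overlap_chars  # how far each window advances
    start = 0
    while start < len(text):
        pieces.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the window already reached the end of the text
        start += step
    return pieces


print(split_with_overlap("abcdefghij", max_chars=4, overlap_chars=1))
```

With these inputs the windows are `abcd`, `defg`, and `ghij`: each one starts with the last character of the previous window.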

### 2.2 Layer One: `structure_nodes`

This layer keeps only headings and their levels.

The code is in [aliyun_layout_normalizer.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/parser/aliyun_layout_normalizer.py:85):

```python
def build_structure_nodes(layouts: list[dict[str, Any]]) -> list[dict[str, Any]]:
    nodes: list[dict[str, Any]] = []
    for layout in layouts:
        # Only heading layouts contribute to the structure skeleton.
        if not is_title(layout):
            continue
        text = get_text(layout)
        # Skip empty headings and table-of-contents headings.
        if not text or text in TOC_TITLES:
            continue
        nodes.append(
            {
                "unique_id": layout.get("uniqueId"),
                "page": get_page(layout),
                "index": layout.get("index", 0),
                "level": layout.get("level", 0),
                "title": text,
                "type": layout.get("type"),
                "sub_type": layout.get("subType"),
            }
        )
    return nodes
```

Example (the sample titles read "1 Scope" and "1.1 Applicable objects"):

```json
[
  {
    "unique_id": "l-title-001",
    "page": 2,
    "index": 11,
    "level": 1,
    "title": "1 范围",
    "type": "title",
    "sub_type": "para_title"
  },
  {
    "unique_id": "l-title-002",
    "page": 3,
    "index": 18,
    "level": 2,
    "title": "1.1 适用对象",
    "type": "title",
    "sub_type": "para_title"
  }
]
```

The purpose of this layer is to preserve the table-of-contents tree; it is not used for retrieval directly.
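
Because `structure_nodes` is a flat list, the tree is implicit in the `level` field. A minimal sketch of recovering each heading's ancestor path with a stack follows. `attach_section_paths` is a hypothetical helper, and the rule that a heading closes every open heading at the same or a deeper level is an assumption; the project's own `update_section_path` is not shown here and may differ.

```python
from typing import Any


def attach_section_paths(nodes: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of the nodes with a `section_path` list of ancestor titles.

    Hypothetical helper: assumes a heading at level N closes every open
    heading whose level is N or deeper.
    """
    stack: list[dict[str, Any]] = []
    result: list[dict[str, Any]] = []
    for node in nodes:
        level = node.get("level", 0)
        # Drop headings that are siblings or descendants of earlier siblings.
        while stack and stack[-1].get("level", 0) >= level:
            stack.pop()
        stack.append(node)
        result.append({**node, "section_path": [n["title"] for n in stack]})
    return result


sample = [
    {"title": "1 范围", "level": 1},
    {"title": "1.1 适用对象", "level": 2},
    {"title": "2 术语", "level": 1},
]
for item in attach_section_paths(sample):
    print(item["section_path"])
```

The second heading inherits the first as its parent, while the third starts a new top-level path.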

### 2.3 Layer Two: `semantic_blocks`

This layer merges consecutive body text into a single semantic block, and handles tables and figure captions separately.

The code is in [aliyun_layout_normalizer.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/parser/aliyun_layout_normalizer.py:163):

```python
def build_semantic_blocks(layouts: list[dict[str, Any]]) -> list[dict[str, Any]]:
    semantic_blocks: list[dict[str, Any]] = []
    section_stack: list[dict[str, Any]] = []
    pending_text_blocks: list[dict[str, Any]] = []
    block_id = 1

    for layout in layouts:
        text = get_text(layout)
        page = get_page(layout)

        # A heading closes the pending text run and updates the section path.
        if is_title(layout):
            block_id = flush_text_block(pending_text_blocks, semantic_blocks, block_id)
            pending_text_blocks = []
            section_stack = update_section_path(section_stack, layout)
            continue

        section_path = section_path_titles(section_stack)
        # "未分类" means "Uncategorized".
        section_title = section_path[-1] if section_path else "未分类"
        section_level = len(section_path)

        # Each table becomes its own semantic block.
        if is_table(layout):
            ...
            semantic_blocks.append(
                {
                    "semantic_id": f"semantic-{block_id}",
                    "block_type": "table",
                    "page_start": page,
                    "page_end": page,
                    "section_path": section_path,
                    "section_level": section_level,
                    "section_title": section_title,
                    "source_ids": [layout.get("uniqueId")],
                    "text": table_text,
                }
            )
            continue

        # Body text is buffered until the next heading flushes it.
        if is_text(layout) and text:
            pending_text_blocks.append(
                {
                    "page": page,
                    "text": text,
                    "unique_id": layout.get("uniqueId"),
                    "section_path": section_path,
                    "section_level": section_level,
                    "section_title": section_title,
                }
            )
    # (excerpt ends here; the rest of the function is not shown)
```

After merging, body text forms a semantic block like this:

```json
[
  {
    "semantic_id": "semantic-1",
    "block_type": "section_text",
    "page_start": 2,
    "page_end": 2,
    "section_path": ["1 范围", "1.1 适用对象"],
    "section_level": 2,
    "section_title": "1.1 适用对象",
    "source_ids": ["l-text-001", "l-text-002"],
    "text": "本标准适用于道路车辆动力电池系统的安全要求。企业应建立一致的测试和验证方法。"
  }
]
```
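
The excerpt above calls `flush_text_block` without showing it. Judging only from the call site and the sample output, it turns the buffered text layouts into one `section_text` block and returns the next block id. A minimal sketch under that assumption follows; `flush_text_block_sketch` is a hypothetical stand-in, and joining the pieces without a separator is a guess based on the sample `text` value.

```python
from typing import Any


def flush_text_block_sketch(
    pending: list[dict[str, Any]],
    semantic_blocks: list[dict[str, Any]],
    block_id: int,
) -> int:
    """Merge buffered text layouts into one block; return the next block id.

    Hypothetical reconstruction, not the project's flush_text_block.
    """
    if not pending:
        return block_id  # nothing buffered, so the id is not consumed

    first = pending[0]
    pages = [item["page"] for item in pending]
    semantic_blocks.append(
        {
            "semantic_id": f"semantic-{block_id}",
            "block_type": "section_text",
            "page_start": min(pages),
            "page_end": max(pages),
            "section_path": first["section_path"],
            "section_level": first["section_level"],
            "section_title": first["section_title"],
            "source_ids": [item["unique_id"] for item in pending],
            # Assumption: pieces are concatenated with no separator.
            "text": "".join(item["text"] for item in pending),
        }
    )
    return block_id + 1


common = {"section_path": ["1 范围"], "section_level": 1, "section_title": "1 范围"}
pending = [
    {"page": 2, "text": "第一句。", "unique_id": "l-text-001", **common},
    {"page": 3, "text": "第二句。", "unique_id": "l-text-002", **common},
]
blocks: list[dict[str, Any]] = []
next_id = flush_text_block_sketch(pending, blocks, 1)
print(next_id, blocks[0]["source_ids"], blocks[0]["text"])
```

Keeping `source_ids` on the merged block is what lets a retrieval hit be traced back to the original layouts.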

## 3. Where Each Structure Is Stored

### 3.1 Original Files and Intermediate Artifacts Go to MinIO First

In the current pipeline, the upload stage first saves the original file to object storage. After parsing completes, the structured intermediate artifacts are saved there as JSON as well.

The code that saves the artifacts is in [services.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/application/documents/services.py:62):

```python
def _save_parse_artifacts(self, *, doc_id: str, parsed_document: ParsedDocument) -> dict[str, str]:
    # Object prefix: "<artifact_prefix>/<doc_id>", defaulting to "artifacts/<doc_id>".
    prefix = f"{parsed_document.metadata.get('artifact_prefix', 'artifacts').strip('/')}/{doc_id}"
    artifact_payloads = {
        "layouts": parsed_document.raw_layouts,
        "structure_nodes": parsed_document.structure_nodes,
        "semantic_blocks": parsed_document.semantic_blocks,
        "vector_chunks": parsed_document.vector_chunks,
    }
    artifact_keys: dict[str, str] = {}
    # One JSON object per artifact type.
    for name, payload in artifact_payloads.items():
        object_name = f"{prefix}/{name}.json"
        self.binary_store.save(
            object_name=object_name,
            data=json.dumps(payload, ensure_ascii=False, indent=2).encode("utf-8"),
            content_type="application/json",
            metadata={"doc_id": doc_id, "artifact_type": name},
        )
        artifact_keys[name] = object_name
    return artifact_keys
```

The current default implementation of `DocumentBinaryStore` is `MinioDocumentBinaryStore`, which means:

- The original uploaded file goes to MinIO
- `layouts.json` goes to MinIO
- `structure_nodes.json` goes to MinIO
- `semantic_blocks.json` goes to MinIO
- `vector_chunks.json` goes to MinIO
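
To make the resulting object layout concrete, the naming logic of `_save_parse_artifacts` can be isolated into a pure function. This is a sketch that mirrors the f-strings in the excerpt above; `artifact_object_names` is a hypothetical name, and the example `doc_id` is made up.

```python
def artifact_object_names(doc_id: str, artifact_prefix: str = "artifacts") -> dict[str, str]:
    """Compute the object name of each parse artifact for one document.

    Mirrors the prefix and file-name pattern shown in the excerpt; it does
    not write anything to object storage.
    """
    prefix = f"{artifact_prefix.strip('/')}/{doc_id}"
    names = ("layouts", "structure_nodes", "semantic_blocks", "vector_chunks")
    return {name: f"{prefix}/{name}.json" for name in names}


for name, object_name in artifact_object_names("ab12cd34").items():
    print(f"{name}: {object_name}")
```

Note that the original upload uses a different pattern, `{doc_id}/{file_name}`, so the raw file and its artifacts sit under different prefixes.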

### 3.2 PostgreSQL's Role in the Flow (Summary)

In the current flow, PostgreSQL holds document metadata and structured snapshots. It does not store vectors or large objects:

- `documents` stores the current main document record, status, statistics, and index information
- `structure_nodes` stores the heading structure of the latest parse snapshot
- `semantic_blocks` stores the semantic block structure of the latest parse snapshot

The fuller PostgreSQL design, covering:

- `documents`
- `document_processing_runs`
- `document_status_history`
- `document_artifacts`
- `structure_nodes`
- `semantic_blocks`

is described in [document-processing-database-design.md](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/docs/architecture/document-processing-database-design.md:1).

### 3.3 Storage Responsibilities at a Glance

| Data layer | Stored in | Used directly for embedding | Ends up in Milvus | Main purpose |
| --- | --- | --- | --- | --- |
| Original file | MinIO | No | No | Keeps the originally uploaded document |
| `raw_layouts` | MinIO `layouts.json` | No | No | Keeps the parser's raw response |
| `structure_nodes` | MinIO + PostgreSQL | No | No | Table-of-contents tree and hierarchy |
| `semantic_blocks` | MinIO + PostgreSQL | No | Indirectly | Semantic units, intermediate layer |
| `vector_chunks` | MinIO | Yes | Indirectly | Retrieval chunks before embedding |
| `Chunk` | In memory + Milvus | Yes | Yes | Unified model for vector ingestion |
| `documents` metadata | PostgreSQL | No | No | Processing status, statistics, index information |

## 4. Which Step Actually Produces Vectors

This is the most important point in the whole flow.

The conclusions first:

- Parsing does not vectorize
- Saving artifacts does not vectorize
- `ChunkBuilder.build()` does not vectorize either
- Only `EmbeddingProvider.embed_texts()` actually calls the embedding model
- Only `VectorIndex.upsert()` actually writes vectors into the vector store

### 4.1 `vector_chunks` Is First Mapped onto the Unified `Chunk`

`AliyunVectorChunkBuilder` does no embedding. Its only job is to convert `ParsedDocument.vector_chunks` into the domain layer's unified `Chunk` model.

The code is in [vector_chunk_builder.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/parser/vector_chunk_builder.py:12):

```python
def build(
    self,
    *,
    parsed_document: ParsedDocument,
    regulation_type: str,
    version: str,
) -> list[Chunk]:
    chunks: list[Chunk] = []
    for index, item in enumerate(parsed_document.vector_chunks):
        content = item.get("content") or item.get("text") or ""
        # Fall back to the plain content when no enriched text is present.
        embedding_text = item.get("embedding_text") or content
        if not embedding_text.strip():
            continue
        section_path = item.get("section_path") or []
        section_title = item.get("section_title") or (section_path[-1] if section_path else "")
        page_number = item.get("page_start") or item.get("page") or 0
        chunk_id = item.get("chunk_id") or f"{parsed_document.doc_id}-chunk-{index}"
        # Everything except the two text fields is carried along as metadata.
        metadata = {k: v for k, v in item.items() if k not in {"content", "embedding_text"}}
        chunks.append(
            Chunk(
                chunk_id=str(chunk_id),
                doc_id=parsed_document.doc_id,
                doc_name=parsed_document.doc_name,
                content=content,
                embedding_text=embedding_text,
                section_title=section_title,
                section_path=section_path,
                page_number=int(page_number or 0),
                regulation_type=regulation_type,
                version=version,
                semantic_id=item.get("semantic_id", ""),
                block_type=item.get("block_type", ""),
                metadata=metadata,
            )
        )
    return chunks
```

`Chunk` is defined in [models.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/domain/documents/models.py:63):

```python
@dataclass
class Chunk:
    chunk_id: str
    doc_id: str
    doc_name: str
    content: str
    embedding_text: str
    section_title: str = ""
    section_path: list[str] = field(default_factory=list)
    page_number: int = 0
    regulation_type: str = ""
    version: str = ""
    semantic_id: str = ""
    block_type: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)
```

A typical `Chunk` looks like this:

```json
{
  "chunk_id": "doc-001-chunk-1",
  "doc_id": "doc-001",
  "doc_name": "动力电池安全规范",
  "content": "本标准适用于道路车辆动力电池系统的安全要求。企业应建立一致的测试和验证方法。",
  "embedding_text": "标准:动力电池安全规范\n章节:1 范围 > 1.1 适用对象\n\n本标准适用于道路车辆动力电池系统的安全要求。企业应建立一致的测试和验证方法。",
  "section_title": "1.1 适用对象",
  "section_path": ["1 范围", "1.1 适用对象"],
  "page_number": 2,
  "regulation_type": "GB",
  "version": "2025",
  "semantic_id": "semantic-1",
  "block_type": "section_text",
  "metadata": {
    "chunk_index": 1,
    "piece_index": 1,
    "source_ids": ["l-text-001", "l-text-002"]
  }
}
```

The key point here is the difference between two fields:

- `content`
  - The text shown when a retrieval hit is displayed
  - Closer to the body fragment the user finally sees

- `embedding_text`
  - The text sent to the embedding model
  - Adds context on top of `content`: the standard name (the `标准:` line) and the section path (the `章节:` line)

So the vectorization input is not the bare body text but an enriched, context-bearing text.
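
The sample `embedding_text` above suggests a simple layout: a standard-name line, a section-path line joined with ` > `, a blank line, then the body. The code that actually assembles it is not shown in this document, so the following is only a sketch that reproduces the sample; `build_embedding_text` is a hypothetical helper, and skipping the section line when the path is empty is an assumption.

```python
def build_embedding_text(doc_title: str, section_path: list[str], content: str) -> str:
    """Prefix the body text with document and section context.

    Hypothetical helper that reproduces the layout of the sample value:
    "标准:" labels the standard name and "章节:" labels the section path.
    """
    header = [f"标准:{doc_title}"]
    if section_path:  # assumption: omit the section line when there is no path
        header.append("章节:" + " > ".join(section_path))
    return "\n".join(header) + "\n\n" + content


text = build_embedding_text(
    "动力电池安全规范",
    ["1 范围", "1.1 适用对象"],
    "本标准适用于道路车辆动力电池系统的安全要求。",
)
print(text)
```

Putting the section path into the embedded text lets otherwise similar clauses from different sections or standards land at different points in vector space, while `content` stays clean for display.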

### 4.2 Where the Embedding API Is Actually Called

The step that turns text into vectors is `OpenAICompatibleEmbeddingProvider.embed_texts()`.

The code is in [openai_compatible_embedding_provider.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/embedding/openai_compatible_embedding_provider.py:64):

```python
def embed_texts(self, texts: list[str]) -> list[list[float]]:
    # An empty input short-circuits without calling the embedding API.
    if not texts:
        return []
    # (excerpt ends here; the rest of the method is not shown)
```

In other words, only at this step:

- Input: a `list[str]` of `embedding_text` values
- Output: a `list[list[float]]` of dense vectors

The earlier parse, normalizer, and chunk builder steps only prepare text; none of them produces any vector values.
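
The body of `embed_texts` is not shown, so whether it batches requests is unknown. What the pipeline does rely on is the contract: one vector per input text, in the same order, because `upsert()` later pairs chunks with vectors by position. A minimal sketch of a batching wrapper that preserves that contract follows; `embed_in_batches`, its `batch_size` default, and the fake backend are all assumptions made for illustration.

```python
from typing import Callable, Sequence

EmbedFn = Callable[[list[str]], list[list[float]]]


def embed_in_batches(texts: Sequence[str], embed_batch: EmbedFn, batch_size: int = 16) -> list[list[float]]:
    """Embed texts in fixed-size batches, keeping one vector per text, in order.

    Hypothetical wrapper: `embed_batch` stands in for whatever call reaches
    the embedding model.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        result = embed_batch(batch)
        # Fail early rather than silently misaligning chunks and vectors.
        if len(result) != len(batch):
            raise ValueError("embedding backend returned a different number of vectors")
        vectors.extend(result)
    return vectors


def fake_embed(batch: list[str]) -> list[list[float]]:
    # Stand-in backend: a one-dimensional "vector" holding the text length.
    return [[float(len(text))] for text in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2))
```

An empty input yields an empty list without calling the backend, which matches the early return in the excerpt.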

### 4.3 Where Vectors Are Actually Written to Milvus

Only after the vector values exist does `MilvusVectorIndex.upsert()` write each `Chunk` together with its vector into the vector store.

The code is in [milvus_vector_index.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/vectorstore/milvus_vector_index.py:69):

```python
def upsert(self, chunks: list[Chunk], vectors: list[list[float]]) -> int:
    if len(chunks) != len(vectors):
        # Message: "chunks and vectors differ in count"
        raise ValueError("chunks 与 vectors 数量不一致")
    data = []
    now = int(time.time())
    for chunk, vector in zip(chunks, vectors):
        # String fields are sliced to the lengths declared in the schema.
        data.append(
            {
                "id": chunk.chunk_id,
                "doc_id": chunk.doc_id,
                "doc_name": chunk.doc_name,
                "content": chunk.content[:65535],
                "embedding": vector,
                "section_title": chunk.section_title[:512],
                "section_path": json.dumps(chunk.section_path, ensure_ascii=False)[:4096],
                "page_number": chunk.page_number,
                "regulation_type": chunk.regulation_type[:128],
                "version": chunk.version[:64],
                "semantic_id": chunk.semantic_id[:128],
                "block_type": chunk.block_type[:64],
                "metadata_json": json.dumps(chunk.metadata, ensure_ascii=False)[:65535],
                "created_at": now,
            }
        )
    self.collection.insert(data)
    self.collection.flush()
    return len(data)
```

In other words, what Milvus finally stores is:

- Primary key: `chunk_id`
- Document-level fields: `doc_id`, `doc_name`
- Retrieval display field: `content`
- Vector field: `embedding`
- Filter and traceability fields: `section_title`, `section_path`, `page_number`, `regulation_type`, `version`, `semantic_id`, `block_type`
- Extra metadata: `metadata_json`
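
One detail worth checking in the excerpt: the slices such as `[:65535]` count characters. The Milvus documentation describes `max_length` on `VARCHAR` fields as a byte limit, so multi-byte text such as Chinese may still exceed a field after character slicing. Whether that matters for the Milvus version deployed here is an assumption to verify. If it does, a byte-aware truncation could look like the sketch below; `truncate_utf8` is a hypothetical helper, not part of the project.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate text so that its UTF-8 encoding fits in max_bytes.

    Hypothetical helper: cuts on the byte level, then drops any character
    that was split in the middle.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" discards the incomplete trailing byte sequence, if any.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


sample = "范围范围"  # 4 characters, 12 bytes in UTF-8
print(len(sample[:7].encode("utf-8")))          # character slice: byte length can exceed 7
print(len(truncate_utf8(sample, 7).encode("utf-8")))  # byte-aware: never exceeds 7
```

Either kind of truncation can cut a JSON string such as `metadata_json` or `section_path` in the middle, so readers of those fields should be prepared for values that no longer parse.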

## 5. What Exactly Ends Up in Milvus

### 5.1 Collection schema

The schema that `MilvusVectorIndex` defines when it initializes the collection is in [milvus_vector_index.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/vectorstore/milvus_vector_index.py:37):

```python
schema = CollectionSchema(
    fields=[
        FieldSchema(name="id", dtype=DataType.VARCHAR, max_length=128, is_primary=True, auto_id=False),
        FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=64),
        FieldSchema(name="doc_name", dtype=DataType.VARCHAR, max_length=256),
        FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=settings.embedding_dim),
        FieldSchema(name="section_title", dtype=DataType.VARCHAR, max_length=512),
        FieldSchema(name="section_path", dtype=DataType.VARCHAR, max_length=4096),
        FieldSchema(name="page_number", dtype=DataType.INT64),
        FieldSchema(name="regulation_type", dtype=DataType.VARCHAR, max_length=128),
        FieldSchema(name="version", dtype=DataType.VARCHAR, max_length=64),
        FieldSchema(name="semantic_id", dtype=DataType.VARCHAR, max_length=128),
        FieldSchema(name="block_type", dtype=DataType.VARCHAR, max_length=64),
        FieldSchema(name="metadata_json", dtype=DataType.VARCHAR, max_length=65535),
        FieldSchema(name="created_at", dtype=DataType.INT64),
    ],
    description="Dense-only regulations index",
    enable_dynamic_field=False,
)
```

This shows that Milvus does not hold a minimal table with nothing but embeddings. Each row holds:

- One dense vector
- A set of structured fields that retrieval needs to return or filter on

Note, however, that this does not make Milvus the business system of record. It still mainly serves retrieval and does not replace PostgreSQL's document management role.

### 5.2 Why `list_documents()` Checks Milvus First

The document list query is implemented in [services.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/application/documents/services.py:271). It:

1. Queries Milvus for the documents that actually have vectors
2. Loads the document records from the document metadata repository
3. Merges the two, treating Milvus as the source of truth for index status

The reason is not that Milvus replaces PostgreSQL, but that:

- Whether the `indexed` status really holds depends on whether Milvus contains the corresponding chunks
- Download, deletion, retry, file location, and error information still depend on the document metadata repository

So:

- Milvus is the source of truth for the index
- PostgreSQL/JSON is the source of truth for document metadata

The two have different responsibilities and cannot substitute for each other.
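
The merge in `list_documents()` is described but not shown. The idea can be sketched as follows, with metadata records supplying the fields and the set of document ids found in Milvus deciding whether a document counts as indexed. `merge_document_views`, the `is_indexed` field, and the `orphan_doc_ids` list are hypothetical names chosen for this example.

```python
from typing import Any


def merge_document_views(
    indexed_doc_ids: set[str],
    records: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[str]]:
    """Combine metadata records with index presence.

    Returns the annotated records plus the ids that have vectors but no
    metadata record. Hypothetical sketch, not the project's merge logic.
    """
    merged: list[dict[str, Any]] = []
    for record in records:
        view = dict(record)  # copy so the caller's record is not mutated
        # Index presence, not the stored status, decides `is_indexed`.
        view["is_indexed"] = record["doc_id"] in indexed_doc_ids
        merged.append(view)

    known_ids = {record["doc_id"] for record in records}
    orphan_doc_ids = sorted(indexed_doc_ids - known_ids)
    return merged, orphan_doc_ids


records = [
    {"doc_id": "a1", "status": "indexed"},
    {"doc_id": "b2", "status": "indexed"},  # metadata says indexed, Milvus disagrees
]
merged, orphans = merge_document_views({"a1", "c3"}, records)
print([(m["doc_id"], m["is_indexed"]) for m in merged], orphans)
```

Surfacing both mismatches (a record marked `indexed` with no vectors, and vectors with no record) is one way to keep the two sources of truth honest with each other.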
508
docs/architecture/document-processing-database-design.md
Normal file
@@ -0,0 +1,508 @@
# Database Design for the Document Processing Pipeline

## 1. Purpose

This document defines the PostgreSQL database design for the current document processing pipeline. It covers the core path of upload, parsing, indexing, status queries, retry, and deletion, along with the common operations and audit needs around that path.

This document does not replace the flow description in [document-core-processing-flow.md](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/docs/architecture/document-core-processing-flow.md:1). It supplies the authority for relational storage, so that the later switch from JSON metadata to PostgreSQL has a clear, stable, and implementable design baseline.

### 1.1 Scope And Design Target

This document covers only the following:

- Main document records
- Document processing run records
- Document status history
- References to parse artifacts
- The current latest structured parse snapshot

This document does not cover:

- Agent sessions
- Feedback and manual review
- Compliance analysis tasks
- The detailed implementation of the Milvus collection schema

The design principle is `Compat First`:

- Stay compatible with the current `DocumentRepository` / `ParseArtifactStore` main flow
- Add relational tables to fill the gaps in operations and audit capabilities
- Do not force large-scale interface rewrites for the sake of an idealized model

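## 2. Storage Responsibilities

The current system uses three kinds of storage, and their responsibilities must stay clearly separated:

| Store | Contents | Business system of record | Notes |
| --- | --- | --- | --- |
| MinIO | Original files, `layouts.json`, `structure_nodes.json`, `semantic_blocks.json`, `vector_chunks.json` | No | Handles large objects and artifact archiving; does not serve relational queries |
| Milvus | Chunk-level vectors and retrieval helper fields | No | Handles vector retrieval; does not manage the document lifecycle |
| PostgreSQL | Document metadata, processing status, structured snapshots, processing history, artifact references | Yes | Handles document management, operational observability, and relational queries |

Constraints:

- PostgreSQL does not store embedding vectors.
- PostgreSQL does not add a content table for `vector_chunks`.
- Milvus may store retrieval helper fields such as `doc_id`, `doc_name`, `regulation_type`, and `version`, but it is not the business source of truth.
- Document download, deletion, and retry still start from the main document record in PostgreSQL.
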
## 3. Design Overview

### 3.1 Entity Responsibilities

The database uses a layered model of current-state main record, current snapshot, and historical process:

- `documents`
  - The current main document record
  - Stores the metadata and current status used directly by management, download, retry, and deletion
- `document_processing_runs`
  - One processing run per upload or retry
  - Stores run-level statistics, stage timestamps, and failure information
- `document_status_history`
  - An append-only stream of status events
  - Stores the context of each status change
- `document_artifacts`
  - Stores reference information for MinIO artifacts
  - Does not store the artifact content itself
- `structure_nodes`
  - The heading structure in the current latest parse snapshot
- `semantic_blocks`
  - The semantic block structure in the current latest parse snapshot

### 3.2 Current Snapshot Vs Historical Records

This design explicitly distinguishes two kinds of data:

- Current snapshot
  - `documents`
  - `structure_nodes`
  - `semantic_blocks`
- Historical process
  - `document_processing_runs`
  - `document_status_history`
  - `document_artifacts`

Where:

- `structure_nodes` and `semantic_blocks` store only the current snapshot from the most recent successful parse
- Tracing historical versions relies on `document_processing_runs`, `document_artifacts`, and the artifact files for the corresponding run in MinIO

## 4. Table Design

### 4.1 `documents`

Purpose:

- Serves as the main record table for the document lifecycle
- Provides the current-state truth for download, deletion, retry, management lists, and status queries

Field design:

```sql
CREATE TABLE IF NOT EXISTS documents (
    doc_id VARCHAR(128) PRIMARY KEY,
    doc_name VARCHAR(512) NOT NULL DEFAULT '',
    file_name VARCHAR(512) NOT NULL DEFAULT '',
    object_name VARCHAR(1024) NOT NULL DEFAULT '',
    content_type VARCHAR(128) NOT NULL DEFAULT '',
    size_bytes BIGINT NOT NULL DEFAULT 0,
    status VARCHAR(32) NOT NULL DEFAULT 'pending',
    regulation_type VARCHAR(128) NOT NULL DEFAULT '',
    version VARCHAR(64) NOT NULL DEFAULT '',
    summary TEXT NOT NULL DEFAULT '',
    summary_latency_ms INTEGER NOT NULL DEFAULT 0,
    chunk_count INTEGER NOT NULL DEFAULT 0,
    parser_name VARCHAR(128) NOT NULL DEFAULT '',
    index_name VARCHAR(128) NOT NULL DEFAULT '',
    error_message TEXT NOT NULL DEFAULT '',
    metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    CONSTRAINT chk_documents_status
        CHECK (status IN ('pending', 'stored', 'parsed', 'indexed', 'failed'))
);

CREATE INDEX IF NOT EXISTS idx_documents_status_updated_at
    ON documents(status, updated_at DESC);

CREATE INDEX IF NOT EXISTS idx_documents_regulation_version
    ON documents(regulation_type, version);

CREATE INDEX IF NOT EXISTS idx_documents_updated_at
    ON documents(updated_at DESC);
```

Field notes:

- `object_name`
  - The object path of the originally uploaded file in MinIO
  - The current implementation relies on this field for download, retry, and deletion; v1 does not split it into a separate file table
- `status`
  - The current document processing status
  - Represents the current state only; it carries no historical audit responsibility
- `metadata`
  - Stores lightweight, frequently changing extra information that does not yet warrant dedicated columns
  - Typical contents include `parse_task_id`, `processing_stage`, `artifact_keys`, and statistical counts

### 4.2 `document_processing_runs`
|
||||||
|
|
||||||
|
用途:
|
||||||
|
|
||||||
|
- 记录一次上传或一次重试的完整处理运行
|
||||||
|
- 用于解释“这份文档本次处理为什么成功或失败”
|
||||||
|
|
||||||
|
字段设计:
|
||||||
|
|
||||||
|
```sql
|
||||||
|
CREATE TABLE IF NOT EXISTS document_processing_runs (
|
||||||
|
run_id BIGSERIAL PRIMARY KEY,
|
||||||
|
doc_id VARCHAR(128) NOT NULL,
|
||||||
|
trigger_type VARCHAR(16) NOT NULL,
|
||||||
|
run_status VARCHAR(16) NOT NULL,
|
||||||
|
parser_backend VARCHAR(64) NOT NULL DEFAULT '',
|
||||||
|
chunk_backend VARCHAR(64) NOT NULL DEFAULT '',
|
||||||
|
embedding_model VARCHAR(128) NOT NULL DEFAULT '',
|
||||||
|
index_name VARCHAR(128) NOT NULL DEFAULT '',
|
||||||
|
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||||
|
stored_at TIMESTAMPTZ,
|
||||||
|
parsed_at TIMESTAMPTZ,
|
||||||
|
indexed_at TIMESTAMPTZ,
|
||||||
|
finished_at TIMESTAMPTZ,
|
||||||
|
layout_count INTEGER NOT NULL DEFAULT 0,
|
||||||
|
structure_node_count INTEGER NOT NULL DEFAULT 0,
|
||||||
|
semantic_block_count INTEGER NOT NULL DEFAULT 0,
|
||||||
|
vector_chunk_count INTEGER NOT NULL DEFAULT 0,
|
||||||
|
chunk_count INTEGER NOT NULL DEFAULT 0,
|
||||||
|
failure_stage VARCHAR(32) NOT NULL DEFAULT '',
|
||||||
|
error_message TEXT NOT NULL DEFAULT '',
|
||||||
|
metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
|
||||||
|
CONSTRAINT fk_runs_document
|
||||||
|
FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE,
|
||||||
|
CONSTRAINT chk_runs_trigger_type
|
||||||
|
CHECK (trigger_type IN ('upload', 'retry')),
|
||||||
|
CONSTRAINT chk_runs_status
|
||||||
|
CHECK (run_status IN ('running', 'succeeded', 'failed'))
|
||||||
|
);
|
||||||
|
|
||||||
|
CREATE INDEX IF NOT EXISTS idx_runs_doc_started_at
|
||||||
|
ON document_processing_runs(doc_id, started_at DESC);
|
||||||
|
|
||||||
|
CREATE INDEX IF NOT EXISTS idx_runs_status_started_at
|
||||||
|
ON document_processing_runs(run_status, started_at DESC);
|
||||||
|
```
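
Column notes:

- `trigger_type`
  - Indicates whether the run was triggered by the initial upload or by a retry.
- `run_status`
  - Describes the outcome of this run only; the document-level state lives in `documents.status`.
- `failure_stage`
  - Values should match the application's key processing stages, for example `store`, `parse`, `artifact_persist`, `embed`, `index`.
- `metadata`
  - Holds run-level context such as a configuration snapshot, backend implementation names, and a summary of the provider's response.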
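
The run lifecycle implied by these columns can be sketched in memory. This is a minimal illustration, not the project's implementation: the `ProcessingRun` class and its methods are hypothetical, and treating the suggested `failure_stage` values as a closed set is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Stage names taken from the suggested `failure_stage` values; enforcing them is an assumption.
FAILURE_STAGES = {"store", "parse", "artifact_persist", "embed", "index"}


@dataclass
class ProcessingRun:
    """In-memory stand-in for one document_processing_runs row (hypothetical class)."""

    doc_id: str
    trigger_type: str  # 'upload' or 'retry', mirroring chk_runs_trigger_type
    run_status: str = "running"
    failure_stage: str = ""
    error_message: str = ""
    finished_at: Optional[datetime] = None
    metadata: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.trigger_type not in ("upload", "retry"):
            raise ValueError(f"unknown trigger_type: {self.trigger_type}")

    def succeed(self) -> None:
        self.run_status = "succeeded"
        self.finished_at = datetime.now(timezone.utc)

    def fail(self, stage: str, message: str) -> None:
        if stage not in FAILURE_STAGES:
            raise ValueError(f"unknown failure_stage: {stage}")
        self.run_status = "failed"
        self.failure_stage = stage
        self.error_message = message
        self.finished_at = datetime.now(timezone.utc)


run = ProcessingRun(doc_id="doc-1", trigger_type="upload")
run.fail("embed", "embedding provider timed out")
```

Keeping `failure_stage` and `error_message` empty until `fail()` is called matches the `NOT NULL DEFAULT ''` columns in the DDL.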

### 4.3 `document_status_history`

Purpose:

- Stores the stream of status-change events.
- Supports troubleshooting, auditing, and run-trace analysis.

Column design:

```sql
CREATE TABLE IF NOT EXISTS document_status_history (
    event_id BIGSERIAL PRIMARY KEY,
    doc_id VARCHAR(128) NOT NULL,
    run_id BIGINT,
    from_status VARCHAR(32) NOT NULL DEFAULT '',
    to_status VARCHAR(32) NOT NULL,
    stage VARCHAR(32) NOT NULL DEFAULT '',
    message TEXT NOT NULL DEFAULT '',
    metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
    occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    CONSTRAINT fk_status_document
        FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE,
    CONSTRAINT fk_status_run
        FOREIGN KEY (run_id) REFERENCES document_processing_runs(run_id) ON DELETE CASCADE,
    CONSTRAINT chk_status_history_to_status
        CHECK (to_status IN ('pending', 'stored', 'parsed', 'indexed', 'failed'))
);

CREATE INDEX IF NOT EXISTS idx_status_history_doc_occurred_at
    ON document_status_history(doc_id, occurred_at DESC);

CREATE INDEX IF NOT EXISTS idx_status_history_run_occurred_at
    ON document_status_history(run_id, occurred_at DESC);
```

Column notes:

- `from_status` may be an empty string
  - Used for the first event, for example when a newly created document enters `pending`.
- `stage`
  - Records the business stage that the status transition corresponds to.
- `message`
  - Holds a human-readable note intended for troubleshooting.
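
As a rough sketch of how an application might build these events before inserting them, the helper below mirrors the `to_status` CHECK constraint and leaves `from_status` empty for the first event. `StatusEvent`, `make_event`, and the `"upload"` stage label are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Mirrors chk_status_history_to_status in the DDL.
ALLOWED_TO_STATUS = {"pending", "stored", "parsed", "indexed", "failed"}


@dataclass(frozen=True)
class StatusEvent:
    """One document_status_history row before insertion (hypothetical class)."""

    doc_id: str
    to_status: str
    from_status: str = ""  # empty for the first event of a document
    run_id: Optional[int] = None
    stage: str = ""
    message: str = ""


def make_event(doc_id: str, to_status: str, from_status: str = "",
               run_id: Optional[int] = None, stage: str = "", message: str = "") -> StatusEvent:
    """Validate the target status in the application before hitting the database."""
    if to_status not in ALLOWED_TO_STATUS:
        raise ValueError(f"unknown to_status: {to_status}")
    return StatusEvent(doc_id, to_status, from_status, run_id, stage, message)


first = make_event("doc-1", "pending", run_id=1, stage="upload", message="document created")
second = make_event("doc-1", "stored", from_status=first.to_status, run_id=1, stage="store")
```

Validating in the application gives a readable error earlier than the database constraint would, while the constraint remains the final guard.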

### 4.4 `document_artifacts`

Purpose:

- Stores the location and basic attributes of parse artifacts in MinIO.
- Lets later lookups find the artifacts of a given run without scanning object storage.

Column design:

```sql
CREATE TABLE IF NOT EXISTS document_artifacts (
    artifact_id BIGSERIAL PRIMARY KEY,
    doc_id VARCHAR(128) NOT NULL,
    run_id BIGINT,
    artifact_type VARCHAR(32) NOT NULL,
    object_name VARCHAR(1024) NOT NULL,
    content_type VARCHAR(128) NOT NULL DEFAULT 'application/json',
    byte_size BIGINT NOT NULL DEFAULT 0,
    checksum VARCHAR(128) NOT NULL DEFAULT '',
    metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    CONSTRAINT fk_artifacts_document
        FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE,
    CONSTRAINT fk_artifacts_run
        FOREIGN KEY (run_id) REFERENCES document_processing_runs(run_id) ON DELETE CASCADE,
    CONSTRAINT chk_artifact_type
        CHECK (artifact_type IN ('layouts', 'structure_nodes', 'semantic_blocks', 'vector_chunks'))
);

CREATE INDEX IF NOT EXISTS idx_artifacts_doc_created_at
    ON document_artifacts(doc_id, created_at DESC);

CREATE INDEX IF NOT EXISTS idx_artifacts_run_type
    ON document_artifacts(run_id, artifact_type);

CREATE UNIQUE INDEX IF NOT EXISTS uq_artifacts_run_type_object
    ON document_artifacts(run_id, artifact_type, object_name);
```

Column notes:

- This table records artifact references only; it does not record the original file.
- The original file is still referenced by `documents.object_name`, which keeps the current download and retry logic compatible.
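
To make the reference columns concrete, here is a small sketch of how one artifact row could be derived from a JSON payload. The `build_artifact_record` helper, the object key layout, and the choice of SHA-256 for `checksum` are all assumptions; the design above fixes neither a naming scheme nor a hash algorithm.

```python
import hashlib
import json

# Mirrors chk_artifact_type in the DDL.
ARTIFACT_TYPES = ("layouts", "structure_nodes", "semantic_blocks", "vector_chunks")


def build_artifact_record(doc_id: str, run_id: int, artifact_type: str, payload: list) -> dict:
    """Serialize a payload and describe it as a document_artifacts row (hypothetical helper)."""
    if artifact_type not in ARTIFACT_TYPES:
        raise ValueError(f"unknown artifact_type: {artifact_type}")
    body = json.dumps(payload, ensure_ascii=False, sort_keys=True).encode("utf-8")
    return {
        "doc_id": doc_id,
        "run_id": run_id,
        "artifact_type": artifact_type,
        # Hypothetical key layout; the real MinIO naming scheme is not specified here.
        "object_name": f"artifacts/{doc_id}/runs/{run_id}/{artifact_type}.json",
        "content_type": "application/json",
        "byte_size": len(body),
        # SHA-256 is an assumed choice; its 64-character hex digest fits VARCHAR(128).
        "checksum": hashlib.sha256(body).hexdigest(),
    }


record = build_artifact_record("doc-1", 7, "layouts", [{"page": 1, "text": "Scope"}])
```

Including `run_id` in the object key is one way to keep artifacts from different runs apart, which is what makes per-run replay possible.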

### 4.5 `structure_nodes`

Purpose:

- Stores the heading hierarchy from the latest parse snapshot.
- Serves table-of-contents tree queries, structured browsing, debugging, and auditing.

Column design:

```sql
CREATE TABLE IF NOT EXISTS structure_nodes (
    id BIGSERIAL PRIMARY KEY,
    doc_id VARCHAR(128) NOT NULL,
    unique_id VARCHAR(128),
    page INTEGER NOT NULL DEFAULT 0,
    idx INTEGER NOT NULL DEFAULT 0,
    level INTEGER NOT NULL DEFAULT 0,
    title TEXT NOT NULL DEFAULT '',
    type VARCHAR(64),
    sub_type VARCHAR(64),
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    CONSTRAINT fk_structure_nodes_document
        FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE
);

CREATE INDEX IF NOT EXISTS idx_structure_nodes_doc_idx
    ON structure_nodes(doc_id, idx);

CREATE INDEX IF NOT EXISTS idx_structure_nodes_doc_level
    ON structure_nodes(doc_id, level);
```

Design constraints:

- This table represents the current snapshot; it does not model multiple versions.
- Each new successful parse overwrites the previous snapshot for the same `doc_id`.
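
The overwrite rule is commonly implemented as delete-then-insert inside one transaction, so readers never see a half-written snapshot. The sketch below assumes that approach and uses an in-memory SQLite table with a reduced column set purely so it runs anywhere; the production target is PostgreSQL, and `replace_structure_nodes` is a hypothetical helper.

```python
import sqlite3

# SQLite stands in for PostgreSQL here; only a subset of the columns is modeled.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE structure_nodes ("
    "doc_id TEXT NOT NULL, idx INTEGER NOT NULL, level INTEGER NOT NULL, title TEXT NOT NULL)"
)


def replace_structure_nodes(conn: sqlite3.Connection, doc_id: str, nodes: list) -> None:
    """Overwrite the current snapshot for one document in a single transaction."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("DELETE FROM structure_nodes WHERE doc_id = ?", (doc_id,))
        conn.executemany(
            "INSERT INTO structure_nodes (doc_id, idx, level, title) VALUES (?, ?, ?, ?)",
            [(doc_id, n["idx"], n["level"], n["title"]) for n in nodes],
        )


# First parse writes two nodes; a later successful parse replaces them.
replace_structure_nodes(conn, "doc-1", [
    {"idx": 0, "level": 1, "title": "Scope"},
    {"idx": 1, "level": 2, "title": "Terms"},
])
replace_structure_nodes(conn, "doc-1", [{"idx": 0, "level": 1, "title": "Overview"}])

titles = [row[0] for row in conn.execute(
    "SELECT title FROM structure_nodes WHERE doc_id = ? ORDER BY idx", ("doc-1",)
)]
```

Because the delete is scoped to one `doc_id`, snapshots of other documents are untouched.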

### 4.6 `semantic_blocks`

Purpose:

- Stores the semantic blocks from the latest parse snapshot.
- Serves structure tracing, debugging, and future relational queries.

Column design:

```sql
CREATE TABLE IF NOT EXISTS semantic_blocks (
    id BIGSERIAL PRIMARY KEY,
    doc_id VARCHAR(128) NOT NULL,
    semantic_id VARCHAR(128) NOT NULL,
    block_type VARCHAR(64) NOT NULL DEFAULT '',
    page_start INTEGER NOT NULL DEFAULT 0,
    page_end INTEGER NOT NULL DEFAULT 0,
    section_path JSONB NOT NULL DEFAULT '[]'::jsonb,
    section_level INTEGER NOT NULL DEFAULT 0,
    section_title VARCHAR(512) NOT NULL DEFAULT '',
    source_ids JSONB NOT NULL DEFAULT '[]'::jsonb,
    text TEXT NOT NULL DEFAULT '',
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    CONSTRAINT fk_semantic_blocks_document
        FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE,
    CONSTRAINT uq_semantic_blocks_doc_semantic
        UNIQUE (doc_id, semantic_id)
);

CREATE INDEX IF NOT EXISTS idx_semantic_blocks_doc_id
    ON semantic_blocks(doc_id);

CREATE INDEX IF NOT EXISTS idx_semantic_blocks_doc_section_title
    ON semantic_blocks(doc_id, section_title);

CREATE INDEX IF NOT EXISTS idx_semantic_blocks_doc_block_type
    ON semantic_blocks(doc_id, block_type);
```

Design constraints:

- This table represents the current snapshot and keeps no historical versions.
- Historical lookups should go through the artifact files of the corresponding run.
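
For illustration, a parsed block could be mapped onto this table's columns as follows. The input dictionary shape and the `semantic_block_to_row` helper are assumptions; the only points taken from the DDL are the column names, the JSONB columns (serialized as JSON text here), and the 512-character limit on `section_title`.

```python
import json


def semantic_block_to_row(doc_id: str, block: dict) -> dict:
    """Map one parsed block to semantic_blocks insert parameters (hypothetical helper)."""
    return {
        "doc_id": doc_id,
        "semantic_id": block["semantic_id"],  # required: part of UNIQUE (doc_id, semantic_id)
        "block_type": block.get("block_type", ""),
        "page_start": block.get("page_start", 0),
        "page_end": block.get("page_end", 0),
        # JSONB columns are passed as JSON text; ensure_ascii=False keeps CJK titles readable.
        "section_path": json.dumps(block.get("section_path", []), ensure_ascii=False),
        "section_level": block.get("section_level", 0),
        # Truncate so the value fits VARCHAR(512).
        "section_title": block.get("section_title", "")[:512],
        "source_ids": json.dumps(block.get("source_ids", []), ensure_ascii=False),
        "text": block.get("text", ""),
    }


row = semantic_block_to_row("doc-1", {
    "semantic_id": "sem-001",
    "section_path": ["1 Scope", "1.1 Terms"],
    "section_level": 2,
    "section_title": "T" * 600,
    "text": "Example block text.",
})
```

Missing optional keys fall back to the same defaults the DDL declares, so partially populated blocks still insert cleanly.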

## 5. Relationship Model

The entities relate as follows:

```mermaid
erDiagram
    documents ||--o{ document_processing_runs : has
    documents ||--o{ document_status_history : has
    documents ||--o{ document_artifacts : has
    documents ||--o{ structure_nodes : has
    documents ||--o{ semantic_blocks : has
    document_processing_runs ||--o{ document_status_history : emits
    document_processing_runs ||--o{ document_artifacts : produces
```

Relationship semantics:

- `documents` is the aggregate root.
- `document_processing_runs` records one complete processing attempt.
- `document_status_history` records the trail of status transitions.
- `document_artifacts` records the replayable structured artifacts stored in MinIO.
- `structure_nodes` / `semantic_blocks` hold the relational snapshot of the "current version".

## 6. Flow-To-Table Mapping

### 6.1 Upload

When an upload starts:

1. Create the `documents` row.
2. Create a `document_processing_runs` row.
3. Write a `document_status_history` row with `to_status='pending'`.

### 6.2 Store Original File

After the original file has been written to MinIO:

1. Set `documents.status='stored'`.
2. Set `stored_at` on the current run.
3. Append a `document_status_history` row.

### 6.3 Parse And Persist Artifacts

After parsing succeeds:

1. Set `parsed_at` on the current run.
2. Update the run's `layout_count`, `structure_node_count`, `semantic_block_count`, and `vector_chunk_count`.
3. Set `documents.status='parsed'`.
4. Refresh `structure_nodes`.
5. Refresh `semantic_blocks`.
6. Write `document_artifacts` rows for `layouts`, `structure_nodes`, `semantic_blocks`, and `vector_chunks`.
7. Append a `document_status_history` row.

### 6.4 Embed And Index

After embedding and indexing succeed:

1. Set `indexed_at` and `finished_at` on the current run.
2. Set `run_status='succeeded'` on the current run.
3. Set `documents.status='indexed'`.
4. Update `documents.chunk_count` and `index_name`.
5. Append a `document_status_history` row.

### 6.5 Failure

When any stage fails:

1. Set `run_status='failed'` on the current run.
2. Record `failure_stage` and `error_message`.
3. Set `finished_at`.
4. Set `documents.status='failed'`.
5. Update `documents.error_message`.
6. Append a `document_status_history` row.
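
The success path (6.1 to 6.4) and the failure path (6.5) can be sketched together as one driver loop. This is a simplified model under stated assumptions: rows are plain dictionaries, timestamps and count columns are omitted, and `run_pipeline`, the `PIPELINE` table, and the `"upload"` stage label are hypothetical.

```python
# (stage name, document status reached when the stage succeeds)
PIPELINE = [
    ("store", "stored"),
    ("parse", "parsed"),
    ("embed", None),  # embedding alone does not change the document status
    ("index", "indexed"),
]


def run_pipeline(doc: dict, run: dict, history: list, steps: dict) -> None:
    """Drive one run; `steps` maps each stage name to a zero-argument callable."""

    def transition(to_status: str, stage: str, message: str = "") -> None:
        history.append({"from_status": doc["status"], "to_status": to_status,
                        "stage": stage, "message": message})
        doc["status"] = to_status

    for stage, to_status in PIPELINE:
        try:
            steps[stage]()
        except Exception as exc:
            # Failure path: mark the run and the document, then stop.
            run.update(run_status="failed", failure_stage=stage, error_message=str(exc))
            doc["error_message"] = str(exc)
            transition("failed", stage, str(exc))
            return
        if to_status is not None:
            transition(to_status, stage)
    run["run_status"] = "succeeded"


def failing_embed() -> None:
    raise RuntimeError("embedding provider unavailable")


doc = {"doc_id": "doc-1", "status": "pending", "error_message": ""}
run = {"run_status": "running", "failure_stage": "", "error_message": ""}
history = [{"from_status": "", "to_status": "pending", "stage": "upload", "message": ""}]
steps = {"store": lambda: None, "parse": lambda: None,
         "embed": failing_embed, "index": lambda: None}
run_pipeline(doc, run, history, steps)
```

In this example the run stops at `embed`, so the history reads `pending`, `stored`, `parsed`, `failed`, and the run keeps the stage name that explains where it broke.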

### 6.6 Retry

On retry:

1. Keep the existing `documents.doc_id`.
2. Create a new `document_processing_runs` row.
3. Write a fresh status history for this retry.
4. After the retry succeeds, overwrite the current snapshot in `structure_nodes` / `semantic_blocks`.
5. Keep the historical run and artifact records.
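
A retry therefore appends a run rather than mutating the failed one. The sketch below models that with a list of dictionaries; `start_run` and the counter that stands in for the `BIGSERIAL` sequence are hypothetical.

```python
from itertools import count

_run_ids = count(1)  # stand-in for the BIGSERIAL run_id sequence


def start_run(runs: list, doc_id: str, trigger_type: str) -> dict:
    """Append a new run for the document; earlier runs are never modified or removed."""
    run = {
        "run_id": next(_run_ids),
        "doc_id": doc_id,
        "trigger_type": trigger_type,
        "run_status": "running",
    }
    runs.append(run)
    return run


runs = []
first = start_run(runs, "doc-1", "upload")
first["run_status"] = "failed"  # the initial upload failed

retry = start_run(runs, "doc-1", "retry")  # same doc_id, new run_id
```

Since the failed run stays in place, its `failure_stage` and artifacts remain available for comparison with the retry.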

### 6.7 Delete

When a document is deleted:

1. The application layer first deletes the original file and artifacts from MinIO.
2. The application layer deletes the vectors associated with the `doc_id` from Milvus.
3. Finally, delete the `documents` row.
4. The `ON DELETE CASCADE` foreign keys then clean up runs, status history, artifacts, structure nodes, and semantic blocks.
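
The ordering and the cascade can be demonstrated end to end with a reduced schema. This sketch uses in-memory SQLite only so it runs without a server (PostgreSQL is the intended backend), models just three of the tables, and passes the MinIO and Milvus clean-up steps in as hypothetical callables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific; PostgreSQL enforces FKs by default
conn.executescript(
    """
    CREATE TABLE documents (doc_id TEXT PRIMARY KEY);
    CREATE TABLE document_processing_runs (
        run_id INTEGER PRIMARY KEY,
        doc_id TEXT NOT NULL REFERENCES documents(doc_id) ON DELETE CASCADE
    );
    CREATE TABLE document_artifacts (
        artifact_id INTEGER PRIMARY KEY,
        doc_id TEXT NOT NULL REFERENCES documents(doc_id) ON DELETE CASCADE,
        run_id INTEGER REFERENCES document_processing_runs(run_id) ON DELETE CASCADE
    );
    INSERT INTO documents VALUES ('doc-1');
    INSERT INTO document_processing_runs VALUES (1, 'doc-1');
    INSERT INTO document_artifacts VALUES (1, 'doc-1', 1);
    """
)


def delete_document(conn, doc_id, delete_objects, delete_vectors) -> None:
    """External stores first, relational rows last (callables are hypothetical)."""
    delete_objects(doc_id)  # step 1: MinIO original file and artifacts
    delete_vectors(doc_id)  # step 2: Milvus vectors for this doc_id
    with conn:              # steps 3 and 4: delete the row; cascades remove dependents
        conn.execute("DELETE FROM documents WHERE doc_id = ?", (doc_id,))


calls = []
delete_document(
    conn, "doc-1",
    delete_objects=lambda d: calls.append(("minio", d)),
    delete_vectors=lambda d: calls.append(("milvus", d)),
)
remaining_runs = conn.execute("SELECT COUNT(*) FROM document_processing_runs").fetchone()[0]
remaining_artifacts = conn.execute("SELECT COUNT(*) FROM document_artifacts").fetchone()[0]
```

Deleting the `documents` row last means that if an external clean-up step raises, the row and its artifact references still exist and the delete can be attempted again.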

## 7. Alignment With Current Backend

### 7.1 Compatible Parts

The current code is already compatible with the following parts of the design:

- `documents`
- `structure_nodes`
- `semantic_blocks`
- Overwrite-style updates of the current snapshot
- `doc_id` as the shared key across MinIO, Milvus, and PostgreSQL

### 7.2 Required Future Additions

If PostgreSQL later becomes the default metadata backend, the following internal stores or repositories should be added:

- `DocumentProcessingRunStore`
- `DocumentStatusEventStore`
- `DocumentArtifactStore`

These additions are internal enhancements and do not require changes to the existing HTTP API.
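
One possible shape for these stores is a set of `typing.Protocol` interfaces. Only the three class names come from this document; every method name and signature below is an assumption made for illustration, as is the `InMemoryRunStore` test double.

```python
from typing import Optional, Protocol


class DocumentProcessingRunStore(Protocol):
    def start_run(self, doc_id: str, trigger_type: str) -> int: ...

    def finish_run(self, run_id: int, run_status: str,
                   failure_stage: str = "", error_message: str = "") -> None: ...


class DocumentStatusEventStore(Protocol):
    def append(self, doc_id: str, to_status: str, from_status: str = "",
               run_id: Optional[int] = None, stage: str = "", message: str = "") -> None: ...


class DocumentArtifactStore(Protocol):
    def record(self, doc_id: str, run_id: int, artifact_type: str, object_name: str) -> None: ...


class InMemoryRunStore:
    """Tiny in-memory implementation that satisfies DocumentProcessingRunStore."""

    def __init__(self) -> None:
        self.runs = {}

    def start_run(self, doc_id: str, trigger_type: str) -> int:
        run_id = len(self.runs) + 1
        self.runs[run_id] = {"doc_id": doc_id, "trigger_type": trigger_type,
                             "run_status": "running", "failure_stage": "", "error_message": ""}
        return run_id

    def finish_run(self, run_id: int, run_status: str,
                   failure_stage: str = "", error_message: str = "") -> None:
        self.runs[run_id].update(run_status=run_status, failure_stage=failure_stage,
                                 error_message=error_message)


store: DocumentProcessingRunStore = InMemoryRunStore()
run_id = store.start_run("doc-1", "upload")
store.finish_run(run_id, "succeeded")
```

Structural typing keeps the service layer independent of the backend, so a PostgreSQL-backed store can replace the in-memory one without touching callers or the HTTP API.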

### 7.3 Migration Guidance

When switching from the current JSON metadata to PostgreSQL, the recommended order is:

1. Migrate the existing document master records from `documents.json` into `documents`.
2. Switch `DOCUMENT_REPOSITORY_BACKEND` to `postgres`.
3. Start writing run, status history, and artifact records for newly uploaded or retried documents.
4. Historical documents that lack run-level data may leave it empty; this does not block the switch.
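
Step 1 could look roughly like the following. The layout of `documents.json` is not described in this document, so the sketch assumes a JSON array of objects whose keys match the `documents` columns referenced in section 6; `load_document_rows` and the default values are likewise assumptions.

```python
import json

# Columns of `documents` referenced elsewhere in this design, with assumed defaults.
DOCUMENT_DEFAULTS = {
    "status": "pending",
    "object_name": "",
    "chunk_count": 0,
    "index_name": "",
    "error_message": "",
}


def load_document_rows(raw_json: str) -> list:
    """Turn legacy JSON records into insert-ready rows, skipping entries without a doc_id."""
    rows = []
    for record in json.loads(raw_json):
        if not record.get("doc_id"):
            continue  # cannot be keyed, so it cannot be migrated
        row = {"doc_id": record["doc_id"]}
        for column, default in DOCUMENT_DEFAULTS.items():
            row[column] = record.get(column, default)
        rows.append(row)
    return rows


sample = json.dumps([
    {"doc_id": "doc-1", "status": "indexed", "chunk_count": 12},
    {"status": "failed"},  # no doc_id: skipped
])
rows = load_document_rows(sample)
```

No run, status history, or artifact rows are synthesized for legacy documents, which matches step 4: run-level data simply stays empty for them.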

## 8. Non-Goals

The following capabilities are out of scope for v1 of this design:

- Replacing Milvus with PostgreSQL vector capabilities
- Storing vector columns in PostgreSQL
- Creating a dedicated relational table for `vector_chunks`
- Building a version-history repository for `structure_nodes` / `semantic_blocks`
- Abstracting the original file into a separate `document_files` table

These capabilities may be discussed in a later refactor, but they should not affect the current main-pipeline switch or compatibility with the existing application layer.

@@ -31,6 +31,7 @@ dependencies = [
     "redis>=4.5.0",
     "minio>=7.1.0",
     "psycopg2-binary>=2.9.0",
+    "sqlalchemy>=2.0.0",
 ]
 
 [dependency-groups]

@@ -32,6 +32,7 @@ redis>=4.5.0
 minio>=7.1.0
 
 # Database
+sqlalchemy>=2.0.0
 psycopg2-binary>=2.9.0
 # mysql-connector-python>=8.0.0
 