9 Commits

Author SHA1 Message Date
wangwei
3674f9171e Add Frontend: add a gradient background layer 2026-05-27 10:30:35 +08:00
ash66
30c7bda389 Refactor document handling and update Milvus collection settings
- Removed multiple failed document entries from `documents.json`.
- Added a new document entry with updated metadata and changed the index name to `regulations_dense_1024_v2`.
- Updated architecture documentation to reflect changes in the Milvus collection name.
- Adjusted requirements by removing the sqlalchemy dependency.
- Modified test cases to align with new document structure and naming conventions.
- Introduced a new test file for Milvus vector index runtime recovery and error handling.
- Updated assertions in various test files to ensure compatibility with the new schema.
2026-05-26 20:21:31 +08:00
ash66
fec22a3a2c Fix centered content layout widths 2026-05-26 12:34:12 +08:00
ash66
34d72d7ce9 feat(document): add database and process documents 2026-05-26 10:08:56 +08:00
ash66
987cc097da feat: implement new layout components and routing structure
- Added HeaderLayout component for the application header.
- Introduced KeepAliveViewport for managing tab states and rendering.
- Created TabNav for tab navigation with animated indicator.
- Removed old Tabs component in favor of new layout structure.
- Updated routing with AppRouter and defined appTabs for navigation.
- Enhanced theme context to manage dark mode styles.
- Added new UI components: Badge, Button, Separator, and Tabs.
- Refactored pages to utilize new layout components and improve responsiveness.
- Updated global styles for better theming and layout consistency.
- Introduced TypeScript path aliases for cleaner imports.
2026-05-25 16:19:18 +08:00
ash66
10a034e294 feat(bootstrap): refactor runtime dependency management and add lazy loading for binary store and vector index
feat(agent): update import for agent session service
feat(openai): add context truncation check in OpenAI answer generator
docs(README): update frontend environment file conventions
fix(vite): default local frontend development to local backend
2026-05-25 13:58:48 +08:00
ash66
091a02c522 Add AgentSessionService and refactor agent routes
Move session-related responsibilities into a new application-layer AgentSessionService (and AgentSessionFeedbackResult dataclass), provide a bootstrap factory (get_agent_session_service), and update agent API routes to call the service instead of accessing ConversationStore directly. Routes now translate ValueError into 404 responses and use service methods for get/list/history/delete/feedback. Also update package exports and docs/READMEs to declare the backend architecture authority, enforce api -> application -> domain ports -> infrastructure boundaries, and call out legacy services/workflows as migration-only. These changes centralize session logic in the application layer and tighten architecture guidance for future backend work.
2026-05-22 09:50:30 +08:00
wangwei
37f7a60b0a feat(perception): intelligent perception module - event feed, SSE impact analysis, tab registration 2026-05-22 00:42:28 +08:00
wangwei
f9ee644f25 feat(perception): backend - mock event store, perception service, /perception API routes 2026-05-22 00:33:43 +08:00
114 changed files with 18174 additions and 1639 deletions

2
.env

@@ -9,7 +9,7 @@ DEBUG=false
# ===== Milvus vector database configuration (existing) =====
MILVUS_HOST=6.86.80.8
MILVUS_PORT=19530
-MILVUS_COLLECTION=regulations_dense_1024_v1
+MILVUS_COLLECTION=regulations_dense_1024_v2
MILVUS_DB_NAME=default
MILVUS_INDEX_TYPE=IVF_FLAT
MILVUS_NLIST=128


@@ -4,7 +4,7 @@
# ===== Milvus vector database configuration (existing) =====
MILVUS_HOST=6.86.80.8
MILVUS_PORT=19530
-MILVUS_COLLECTION=regulations_dense_1024_v1
+MILVUS_COLLECTION=regulations_dense_1024_v2
MILVUS_DB_NAME=default
MILVUS_INDEX_TYPE=IVF_FLAT
MILVUS_NLIST=128


@@ -9,7 +9,7 @@ DEBUG=false
# ===== Milvus vector database configuration =====
MILVUS_HOST=6.86.80.8
MILVUS_PORT=19530
-MILVUS_COLLECTION=regulations_dense_1024_v1
+MILVUS_COLLECTION=regulations_dense_1024_v2
MILVUS_DB_NAME=default
MILVUS_INDEX_TYPE=IVF_FLAT
MILVUS_NLIST=128

3
.gitignore vendored

@@ -59,3 +59,6 @@ Thumbs.db
# logs files
logs/
# codex
.agents


@@ -4,6 +4,12 @@
- Backend code lives under `backend/app/`; frontend is the Vite app in `frontend/`.
## Frontend UX Constraints
- Frontend work in `frontend/` must target desktop Web first.
- Do not proactively add mobile-specific adaptations, responsive reflow for small screens, or mobile-first layout compromises unless the user explicitly asks for them.
- When desktop and mobile requirements conflict, preserve the desktop Web layout and interaction model by default.
## Entrypoints
- Backend entrypoint is `backend/app/main.py`, which re-exports `app` from `app.api.main`.
@@ -39,6 +45,15 @@
- `tests/verify_mvp.py` also expects the `BGEM3Embedder` stack to be available and explicitly mentions `FlagEmbedding`.
- For backend-only changes, prefer focused import/startup checks unless you know the external services and model dependencies are available.
## Backend Architecture Authority
- `docs/architecture/backend-project-architecture.md` is the authoritative backend architecture document for ongoing backend development.
- New backend business logic must follow `api -> application -> domain ports -> infrastructure`.
- Treat `backend/app/shared/bootstrap.py` as the current composition root for backend dependency wiring.
- Do not add new business orchestration to `backend/app/services/*` or `backend/app/workflows/*` unless the task is explicitly a migration step.
- API routes must not directly access `ConversationStore`; session access should go through application services.
- Legacy files may be patched for compatibility or bug fixes, but should not gain new long-term responsibilities.
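
The layering rule above can be illustrated with a minimal sketch. `AgentSessionService` and `ConversationStore` are named in this changeset; everything else here (the `ConversationStorePort` protocol, the in-memory adapter, the `get_history` method, and the tuple-based route) is a hypothetical stand-in rather than the project's actual code.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol, Tuple


class ConversationStorePort(Protocol):
    """Domain port: what the application layer needs from session storage."""

    def get(self, session_id: str) -> List[str]: ...


class InMemoryConversationStore:
    """Infrastructure adapter (hypothetical stand-in for the real store)."""

    def __init__(self) -> None:
        self._sessions: Dict[str, List[str]] = {"s1": ["hello"]}

    def get(self, session_id: str) -> List[str]:
        if session_id not in self._sessions:
            raise ValueError(f"unknown session: {session_id}")
        return self._sessions[session_id]


@dataclass
class AgentSessionService:
    """Application service: the only session entrypoint that routes call."""

    store: ConversationStorePort

    def get_history(self, session_id: str) -> List[str]:
        return self.store.get(session_id)


def get_history_route(service: AgentSessionService, session_id: str) -> Tuple[int, object]:
    """API layer: translate ValueError into a 404-style response."""
    try:
        return 200, service.get_history(session_id)
    except ValueError as exc:
        return 404, {"detail": str(exc)}
```

The route only sees the service, so swapping the storage adapter does not touch the API layer.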
## Backend Commenting Standard
- All comments and docstrings in `backend/**/*.py` must be written in English.


@@ -105,7 +105,7 @@ ALIBABA_ACCESS_KEY_ID=your_aliyun_access_key_id
ALIBABA_ACCESS_KEY_SECRET=your_aliyun_access_key_secret
EMBEDDING_API_KEY=your_embedding_api_key_here
EMBEDDING_MODEL=text-embedding-v3
-EMBEDDING_DIM=1536
+EMBEDDING_DIM=1024
PARSER_BACKEND=aliyun
CHUNK_BACKEND=aliyun
PARSER_FAILURE_MODE=fail
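
Because the embedding dimension and the Milvus collection name change together in this diff, a startup check along the following lines may help catch a mismatch early. This is only a sketch: it assumes the collection name encodes the dimension as `_dense_<dim>_`, a convention inferred from `regulations_dense_1024_v2`, and both function names are hypothetical.

```python
import re
from typing import Mapping, Optional


def dim_from_collection_name(name: str) -> Optional[int]:
    """Extract the dimension token from names like 'regulations_dense_1024_v2'."""
    match = re.search(r"_dense_(\d+)_", name)
    return int(match.group(1)) if match else None


def check_embedding_config(env: Mapping[str, str]) -> None:
    """Raise ValueError when EMBEDDING_DIM disagrees with the collection name."""
    dim = int(env["EMBEDDING_DIM"])
    named_dim = dim_from_collection_name(env["MILVUS_COLLECTION"])
    # Names that do not follow the assumed convention are skipped, not rejected.
    if named_dim is not None and named_dim != dim:
        raise ValueError(
            f"EMBEDDING_DIM={dim} does not match collection "
            f"{env['MILVUS_COLLECTION']!r} (expects {named_dim})"
        )
```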

139
README.md

@@ -1,139 +0,0 @@
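# AI+ Compliance Intelligence Hub - Parsing and Ingesting Laws and Regulations
A compliance intelligence platform for automakers and factories that parses, chunks, embeds, and stores regulatory documents as vectors.
## MVP Features
Core features implemented in this release (minimum viable version):
- ✅ PDF/DOC/DOCX document parsing (Alibaba Cloud Document Mind)
- ✅ Unified chunking based on Alibaba Cloud `vector_chunks`
- ✅ OpenAI-compatible embedding (`text-embedding-v3`, 1536 dimensions)
- ✅ Milvus vector database storage and dense-only retrieval
- ✅ FastAPI API wrapper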
## Project Structure
```text
AIRegulation-DocAnalysis-Demo/
├── backend/
│ ├── app/
│ │ ├── api/ # FastAPI API layer
│ │ ├── application/ # use-case orchestration layer
│ │ ├── domain/ # domain models and stable ports
│ │ ├── infrastructure/ # MinIO / Milvus / Alibaba Cloud / embedding / session adapters
│ │ ├── config/ # configuration and logging
│ │ └── workers/
│ ├── requirements.txt
│ └── main.py
├── frontend/ # Vite React frontend
├── tests/ # root-level tests that import backend/app
├── docker/
│ └── docker-compose.yml
├── pyproject.toml
└── .env.example
```
## Quick Start
### 1. Install dependencies
```bash
./dev.sh setup
```
### 2. Start the Milvus vector database
```bash
cd docker
docker-compose up -d
```
Wait for Milvus to finish starting (about 30 seconds):
```bash
docker-compose logs -f milvus
```
### 3. Start the API service
```bash
./dev.sh start api --foreground
```
Open the API docs: http://localhost:8000/docs
## API Endpoints
### Upload a document
```bash
curl -X POST http://localhost:8000/api/v1/documents/upload \
-F "file=@your_regulation.pdf" \
-F "doc_name=GB 7258-2017" \
-F "regulation_type=车辆安全"
```
### Search regulations
```bash
curl -X POST http://localhost:8000/api/v1/knowledge/search \
-H "Content-Type: application/json" \
-d '{"query": "机动车安全技术要求", "top_k": 10}'
```
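## Tech Stack
| Category | Technology |
|------|------|
| Document parsing | Alibaba Cloud Document Mind + python-docx |
| Chunking strategy | Alibaba Cloud `vector_chunks` |
| Embedding model | `text-embedding-v3` (1536-dimensional, dense) |
| Vector database | Milvus 2.4 (local Docker deployment) |
| Retrieval method | Dense-only retrieval |
| API framework | FastAPI |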
## Configuration
Create a `.env` file (see `.env.example`):
```env
# Milvus configuration
MILVUS_HOST=localhost
MILVUS_PORT=19530
# Alibaba Cloud document parsing
ALIBABA_ACCESS_KEY_ID=your_aliyun_access_key_id
ALIBABA_ACCESS_KEY_SECRET=your_aliyun_access_key_secret
PARSER_BACKEND=aliyun
CHUNK_BACKEND=aliyun
# Embedding configuration
EMBEDDING_MODEL=text-embedding-v3
EMBEDDING_DIM=1536
EMBEDDING_API_KEY=your_embedding_api_key_here
# Chunking configuration
CHUNK_SIZE=512
```
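## Future Iterations (outside the scope of this MVP)
- LLM summary generation (not generated by default on the current upload path)
- Document upload UI
- Hybrid retrieval and Q&A
- Regulation change monitoring and automatic updates
## Parsing Artifacts
After a successful upload, the system persists the intermediate results of the Alibaba Cloud parse to MinIO:
- `artifacts/{doc_id}/layouts.json`
- `artifacts/{doc_id}/structure_nodes.json`
- `artifacts/{doc_id}/semantic_blocks.json`
- `artifacts/{doc_id}/vector_chunks.json`
The current default Milvus collection is `regulations_dense_1536_v2`.
## License
MIT License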


@@ -2,6 +2,13 @@
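`backend` is the FastAPI backend directory currently in official use; its entrypoint is `app.main:app`.
## Architecture Constraints
- Authoritative backend architecture document: `docs/architecture/backend-project-architecture.md`
- Backend migration RFC: `docs/rfc/backend-api-parsing-embedding-migration-requirements.md`
- New backend features and refactors follow `api -> application -> domain ports -> infrastructure` by default.
- `backend/app/services/*` and `backend/app/workflows/*` are legacy directories for the migration period; do not add business orchestration logic there except for migration or compatibility fixes.
## Startup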
```bash
@@ -34,10 +41,15 @@ PYTHONPATH=backend uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```text
backend/
├── app/
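-│ ├── api/ # FastAPI routes and models
-│ ├── config/ # configuration and logging
-│ ├── services/ # document processing, LLM, RAG, storage
-│ └── workers/ # task-related code
+│ ├── api/ # FastAPI routes and transport models
+│ ├── application/ # use-case orchestration layer
+│ ├── domain/ # core business models and stable ports
+│ ├── infrastructure/ # adapters for external systems
+│ ├── shared/ # composition root and cross-cutting support
+│ ├── config/ # configuration and logging
+│ ├── services/ # legacy façade / compatibility entrypoints
+│ ├── workflows/ # legacy workflow entrypoints
+│ └── workers/ # task-related code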
├── .env.example
├── requirements.txt
└── main.py
@@ -46,4 +58,13 @@ backend/
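## Notes
- The route prefix stays `/api/v1` for compatibility with the current frontend.
-- `backend/app/api/routes/docs.py`, `rag.py`, `compliance.py`, and `status.py` remain in the repository but no longer serve as main route entrypoints.
+- The current main business entrypoints are `documents`, `knowledge`, and `agent`.
+- `compliance.py` is still mounted but does not yet meet the target architecture constraints; do not extend its business orchestration before it is migrated.
+- `docs.py` and `rag.py` are legacy, non-primary entrypoints and should not be extended.
+## Development Constraints
+- Read `docs/architecture/backend-project-architecture.md` before starting backend development.
+- New business capabilities go in the `application` layer by default and are called from `api`; do not write them directly in a route.
+- Routes must not directly access MinIO, Milvus, the parser SDK, the LLM SDK, or `ConversationStore`.
+- `backend/app/shared/bootstrap.py` is the current composition root; consolidate dependency wiring there first.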
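
One way to picture a composition root with lazy loading (the commit list mentions lazy loading for the binary store and vector index) is a cached factory. The sketch below is an assumption about the pattern, not the contents of `bootstrap.py`; `VectorIndex`, `get_vector_index`, and `KnowledgeSearchService` are hypothetical names.

```python
from functools import lru_cache


class VectorIndex:
    """Hypothetical stand-in for an expensive infrastructure client."""

    instances = 0  # counts constructions so the laziness is observable

    def __init__(self, collection: str) -> None:
        VectorIndex.instances += 1
        self.collection = collection


@lru_cache(maxsize=1)
def get_vector_index() -> VectorIndex:
    """Build the client on first use and reuse the same instance afterwards."""
    return VectorIndex(collection="regulations_dense_1024_v2")


class KnowledgeSearchService:
    """Application service that receives its dependency as a factory."""

    def __init__(self, index_factory=get_vector_index) -> None:
        self._index_factory = index_factory

    def collection_name(self) -> str:
        # The client is only constructed when a request actually needs it.
        return self._index_factory().collection
```

Importing the module creates no client; the first call to `collection_name()` does, and later calls reuse it.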


@@ -0,0 +1,8 @@
{
"permissions": {
"allow": [
"Bash(python3 *)",
"Bash(PGPASSWORD=postgresql123456 psql *)"
]
}
}


@@ -0,0 +1,475 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Parse a PDF with the Alibaba Cloud Document Mind API and emit three-layer chunks:
- structure_nodes: table-of-contents tree
- semantic_blocks: semantic blocks (section text, tables, figures)
- vector_chunks: retrieval chunks (split with overlap)
"""
import argparse
import json
import os
import re
import time
from pathlib import Path
from typing import Dict, List
from alibabacloud_docmind_api20220711.client import Client as DocmindClient
from alibabacloud_tea_openapi import models as open_api_models
from alibabacloud_docmind_api20220711 import models as docmind_models
from alibabacloud_tea_util import models as util_models
# ===================== Alibaba Cloud configuration =====================
# Credentials are read from the environment rather than hard-coded in source.
ALIBABA_ACCESS_KEY_ID = os.environ.get("ALIBABA_ACCESS_KEY_ID", "")
ALIBABA_ACCESS_KEY_SECRET = os.environ.get("ALIBABA_ACCESS_KEY_SECRET", "")
ALIBABA_ENDPOINT = "docmind-api.cn-hangzhou.aliyuncs.com"
# ===================== Chunking parameters =====================
MAX_CHARS = 600
OVERLAP_CHARS = 80
# ===================== Layout type constants =====================
TOC_TITLES = {"目次", "目录"}
TITLE_SUBTYPES = {"doc_title", "para_title"}
TEXT_SUBTYPES = {"para", "none"}
FIGURE_TYPES = {"figure", "figure_name", "figure_note"}
FIGURE_SUBTYPES = {"picture", "pic_title", "pic_caption"}
# ===================== Alibaba Cloud API client =====================
def init_client() -> DocmindClient:
    config = open_api_models.Config(
        access_key_id=ALIBABA_ACCESS_KEY_ID,
        access_key_secret=ALIBABA_ACCESS_KEY_SECRET,
    )
    config.endpoint = ALIBABA_ENDPOINT
    return DocmindClient(config)
def submit_job(client: DocmindClient, file_path: str) -> str:
    """Submit a document parsing job."""
    file_name = Path(file_path).name
    # Close the file handle once the upload request has completed.
    with open(file_path, "rb") as file_obj:
        request = docmind_models.SubmitDocParserJobAdvanceRequest(
            file_url_object=file_obj,
            file_name=file_name,
            file_name_extension=Path(file_path).suffix.lstrip("."),
            llm_enhancement=True,
            enhancement_mode="VLM",
        )
        runtime = util_models.RuntimeOptions()
        response = client.submit_doc_parser_job_advance(request, runtime)
    return response.body.data.id
def query_status(client: DocmindClient, task_id: str) -> Dict:
    """Query the job status; returns None when the response carries no data."""
    request = docmind_models.QueryDocParserStatusRequest(id=task_id)
    response = client.query_doc_parser_status(request)
    return response.body.data.to_map() if response.body.data else None
def wait_for_completion(client: DocmindClient, task_id: str, poll_interval: int = 5) -> bool:
    """Poll until the job succeeds or fails."""
    while True:
        status_data = query_status(client, task_id)
        if not status_data:
            return False
        status = status_data.get("Status", "").lower()
        if status == "success":
            return True
        elif status == "failed":
            print(f"Task failed: {status_data}")
            return False
        print(f"Task status: {status}, waiting...")
        time.sleep(poll_interval)
def get_result(client: DocmindClient, task_id: str, layout_num: int = 0, layout_step_size: int = 50) -> Dict:
    """Fetch one page of parsing results."""
    request = docmind_models.GetDocParserResultRequest(
        id=task_id,
        layout_step_size=layout_step_size,
        layout_num=layout_num,
    )
    response = client.get_doc_parser_result(request)
    return response.body.data if response.body.data else None
def collect_all_results(client: DocmindClient, task_id: str, layout_step_size: int = 50) -> List[Dict]:
    """Collect all parsing results by paging through the layouts."""
    all_layouts = []
    layout_num = 0
    while True:
        result_data = get_result(client, task_id, layout_num, layout_step_size)
        if not result_data:
            break
        layouts = result_data.get("layouts", [])
        if not layouts:
            break
        all_layouts.extend(layouts)
        layout_num += len(layouts)
        # A short page means there is nothing left to fetch.
        if len(layouts) < layout_step_size:
            break
    return all_layouts
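# ===================== Text processing =====================
def normalize_text(text: str) -> str:
    text = text.replace("\r", "\n")
    # Assumption: the replaced character was a non-breaking or full-width space
    # that extraction flattened to a plain space; both are normalized here.
    text = text.replace("\u00a0", " ").replace("\u3000", " ")
    text = re.sub(r"\n+", "\n", text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
def get_page(layout: Dict) -> int:
    return layout.get("pageNum", layout.get("pageNumber", 0))
def get_text(layout: Dict) -> str:
    # Prefer the plain text; fall back to the Markdown content when it is empty.
    text = normalize_text(layout.get("text", ""))
    if text:
        return text
    return normalize_text(layout.get("markdownContent", ""))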
# ===================== Layout type predicates =====================
def is_title(layout: Dict) -> bool:
    return layout.get("type") == "title" or layout.get("subType") in TITLE_SUBTYPES
def is_text(layout: Dict) -> bool:
    return layout.get("type") == "text" and layout.get("subType", "none") in TEXT_SUBTYPES
def is_figure(layout: Dict) -> bool:
    return layout.get("type") in FIGURE_TYPES or layout.get("subType") in FIGURE_SUBTYPES
def is_table(layout: Dict) -> bool:
    return layout.get("type") == "table"
def is_toc_layout(layout: Dict) -> bool:
    text = get_text(layout)
    if text in TOC_TITLES:
        return True
    # On page 1, a numbered line ending in dot leaders and a page number is a TOC entry.
    if get_page(layout) == 1 and re.match(r"^\d+(\.\d+)*\s+.+[.。…]{2,}\s*\d+$", text):
        return True
    return False
def extract_table_text(layout: Dict) -> str:
    rows = []
    for cell in layout.get("cells", []):
        texts = []
        for cell_layout in cell.get("layouts", []):
            cell_text = normalize_text(cell_layout.get("text", ""))
            if cell_text:
                texts.append(cell_text)
        if texts:
            rows.append(" ".join(texts))
    return "\n".join(rows).strip()
# ===================== Structure layer: table-of-contents tree =====================
def build_structure_nodes(layouts: List[Dict]) -> List[Dict]:
    nodes = []
    for layout in layouts:
        if not is_title(layout):
            continue
        text = get_text(layout)
        if not text or text in TOC_TITLES:
            continue
        nodes.append(
            {
                "unique_id": layout.get("uniqueId"),
                "page": get_page(layout),
                "index": layout.get("index", 0),
                "level": layout.get("level", 0),
                "title": text,
                "type": layout.get("type"),
                "sub_type": layout.get("subType"),
            }
        )
    return nodes
# ===================== Semantic layer: section content =====================
def update_section_path(section_stack: List[Dict], layout: Dict) -> List[Dict]:
    level = layout.get("level", 0)
    title = get_text(layout)
    # Pop siblings and deeper sections so the stack ends at this title's parent.
    while section_stack and section_stack[-1]["level"] >= level:
        section_stack.pop()
    section_stack.append(
        {
            "level": level,
            "title": title,
            "page": get_page(layout),
            "unique_id": layout.get("uniqueId"),
        }
    )
    return section_stack
def section_path_titles(section_stack: List[Dict]) -> List[str]:
    return [item["title"] for item in section_stack]
def flush_text_block(blocks: List[Dict], semantic_blocks: List[Dict], block_id: int) -> int:
    if not blocks:
        return block_id
    texts = [item["text"] for item in blocks if item["text"]]
    merged_text = "\n".join(texts).strip()
    if not merged_text:
        return block_id
    semantic_blocks.append(
        {
            "semantic_id": f"semantic-{block_id}",
            "block_type": "section_text",
            "page_start": min(item["page"] for item in blocks),
            "page_end": max(item["page"] for item in blocks),
            "section_path": blocks[0]["section_path"],
            "section_level": blocks[0]["section_level"],
            "section_title": blocks[0]["section_title"],
            "source_ids": [item["unique_id"] for item in blocks if item.get("unique_id")],
            "text": merged_text,
        }
    )
    return block_id + 1
def build_semantic_blocks(layouts: List[Dict]) -> List[Dict]:
    semantic_blocks = []
    section_stack = []
    pending_text_blocks = []
    block_id = 1
    skip_toc_page = False
    for layout in layouts:
        text = get_text(layout)
        page = get_page(layout)
        if is_toc_layout(layout):
            skip_toc_page = True
            continue
        if skip_toc_page and page == 1:
            continue
        if skip_toc_page and page != 1:
            skip_toc_page = False
        if is_title(layout):
            block_id = flush_text_block(pending_text_blocks, semantic_blocks, block_id)
            pending_text_blocks = []
            section_stack = update_section_path(section_stack, layout)
            continue
        section_path = section_path_titles(section_stack)
        section_title = section_path[-1] if section_path else "未分类"
        section_level = len(section_path)
        if is_table(layout):
            block_id = flush_text_block(pending_text_blocks, semantic_blocks, block_id)
            pending_text_blocks = []
            table_text = extract_table_text(layout)
            if table_text:
                semantic_blocks.append(
                    {
                        "semantic_id": f"semantic-{block_id}",
                        "block_type": "table",
                        "page_start": page,
                        "page_end": page,
                        "section_path": section_path,
                        "section_level": section_level,
                        "section_title": section_title,
                        "source_ids": [layout.get("uniqueId")],
                        "text": table_text,
                    }
                )
                block_id += 1
            continue
        if is_figure(layout):
            block_id = flush_text_block(pending_text_blocks, semantic_blocks, block_id)
            pending_text_blocks = []
            if text:
                semantic_blocks.append(
                    {
                        "semantic_id": f"semantic-{block_id}",
                        "block_type": "figure",
                        "page_start": page,
                        "page_end": page,
                        "section_path": section_path,
                        "section_level": section_level,
                        "section_title": section_title,
                        "source_ids": [layout.get("uniqueId")],
                        "text": text,
                    }
                )
                block_id += 1
            continue
        if is_text(layout) and text:
            pending_text_blocks.append(
                {
                    "page": page,
                    "text": text,
                    "unique_id": layout.get("uniqueId"),
                    "section_path": section_path,
                    "section_level": section_level,
                    "section_title": section_title,
                }
            )
    flush_text_block(pending_text_blocks, semantic_blocks, block_id)
    return semantic_blocks
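# ===================== Retrieval layer: vector chunks =====================
def split_text_with_overlap(text: str, max_chars: int, overlap_chars: int) -> List[str]:
    # Without this guard, overlap_chars >= max_chars keeps `start` from advancing
    # and the loop below never terminates.
    if max_chars <= 0 or not 0 <= overlap_chars < max_chars:
        raise ValueError("require max_chars > 0 and 0 <= overlap_chars < max_chars")
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    parts = []
    start = 0
    while start < len(text):
        end = min(len(text), start + max_chars)
        parts.append(text[start:end].strip())
        if end >= len(text):
            break
        # Step back by the overlap so adjacent chunks share context.
        start = max(0, end - overlap_chars)
    return [part for part in parts if part]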
def build_vector_chunks(
    semantic_blocks: List[Dict],
    doc_id: str,
    doc_title: str,
    max_chars: int,
    overlap_chars: int,
) -> List[Dict]:
    vector_chunks = []
    chunk_index = 1
    for block in semantic_blocks:
        pieces = split_text_with_overlap(block["text"], max_chars, overlap_chars)
        for piece_index, piece in enumerate(pieces, start=1):
            # Prefix each piece with the standard title and section path for embedding.
            if block["section_path"]:
                header = f"标准:{doc_title}\n章节:{' > '.join(block['section_path'])}\n\n"
            else:
                header = f"标准:{doc_title}\n\n"
            vector_chunks.append(
                {
                    "doc_id": doc_id,
                    "doc_title": doc_title,
                    "chunk_id": f"chunk-{chunk_index}",
                    "chunk_index": chunk_index,
                    "semantic_id": block["semantic_id"],
                    "chunk_type": block["block_type"],
                    "piece_index": piece_index,
                    "page_start": block["page_start"],
                    "page_end": block["page_end"],
                    "section_path": block["section_path"],
                    "section_level": block["section_level"],
                    "section_title": block["section_title"],
                    "source_ids": block["source_ids"],
                    "text": piece,
                    "embedding_text": header + piece,
                }
            )
            chunk_index += 1
    return vector_chunks
# ===================== Main conversion function =====================
def convert_layouts(
    layouts: List[Dict],
    doc_id: str,
    doc_title: str,
    max_chars: int,
    overlap_chars: int,
) -> Dict:
    structure_nodes = build_structure_nodes(layouts)
    semantic_blocks = build_semantic_blocks(layouts)
    vector_chunks = build_vector_chunks(
        semantic_blocks,
        doc_id=doc_id,
        doc_title=doc_title,
        max_chars=max_chars,
        overlap_chars=overlap_chars,
    )
    return {
        "doc_id": doc_id,
        "doc_title": doc_title,
        "structure_nodes": structure_nodes,
        "semantic_blocks": semantic_blocks,
        "vector_chunks": vector_chunks,
    }
# ===================== CLI entrypoint =====================
def main() -> None:
    parser = argparse.ArgumentParser(description="Parse a PDF with Alibaba Cloud Document Mind and emit three-layer chunks")
    parser.add_argument("pdf_path", help="Path to the PDF file")
    parser.add_argument("--out", default="vector_chunks.json", help="Output JSON file path")
    parser.add_argument("--layouts-out", dest="layouts_output", help="Output path for the raw layouts JSON")
    parser.add_argument("--doc-id", default="GB14747-2006", help="Document ID")
    parser.add_argument("--doc-title", default="GB 14747—2006 儿童三轮车安全要求", help="Document title")
    parser.add_argument("--max-chars", type=int, default=MAX_CHARS, help="Maximum characters per retrieval chunk")
    parser.add_argument("--overlap-chars", type=int, default=OVERLAP_CHARS, help="Overlapping characters between adjacent retrieval chunks")
    parser.add_argument("--poll-interval", type=int, default=5, help="Polling interval in seconds")
    args = parser.parse_args()
    pdf_path = Path(args.pdf_path).expanduser().resolve()
    if not pdf_path.exists():
        raise FileNotFoundError(f"PDF file does not exist: {pdf_path}")
    # 1. Submit the Alibaba Cloud job
    client = init_client()
    print(f"Submitting job: {pdf_path}")
    task_id = submit_job(client, str(pdf_path))
    print(f"Task ID: {task_id}")
    # 2. Wait for completion
    print("Waiting for the job to finish...")
    if not wait_for_completion(client, task_id, args.poll_interval):
        print("Job failed, exiting")
        return
    # 3. Fetch the layouts
    print("Fetching parsing results...")
    layouts = collect_all_results(client, task_id)
    print(f"Fetched {len(layouts)} layout blocks")
    # 4. Write the raw layouts (optional)
    if args.layouts_output:
        layouts_path = Path(args.layouts_output).expanduser().resolve()
        layouts_path.write_text(json.dumps(layouts, ensure_ascii=False, indent=2), encoding="utf-8")
        print(f"Raw layouts written to: {layouts_path}")
    # 5. Convert to the three-layer structure
    print("Converting to the three-layer structure...")
    data = convert_layouts(
        layouts,
        doc_id=args.doc_id,
        doc_title=args.doc_title,
        max_chars=args.max_chars,
        overlap_chars=args.overlap_chars,
    )
    # 6. Write the result
    output_path = Path(args.out).expanduser().resolve()
    output_path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")
    print(f"Structure-layer nodes: {len(data['structure_nodes'])}")
    print(f"Semantic-layer blocks: {len(data['semantic_blocks'])}")
    print(f"Retrieval-layer chunks: {len(data['vector_chunks'])}")
    print(f"Output file: {output_path}")
if __name__ == "__main__":
    main()
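
To make the overlap behaviour of `split_text_with_overlap` concrete, here is a small standalone sketch with a worked example. It mirrors the loop in the script above but is an illustrative reimplementation under a different name (`split_with_overlap`), with an explicit guard against `overlap_chars >= max_chars`.

```python
from typing import List


def split_with_overlap(text: str, max_chars: int, overlap_chars: int) -> List[str]:
    """Split text into windows of max_chars that share overlap_chars characters."""
    if max_chars <= 0 or not 0 <= overlap_chars < max_chars:
        raise ValueError("require max_chars > 0 and 0 <= overlap_chars < max_chars")
    text = text.strip()
    parts: List[str] = []
    start = 0
    while start < len(text):
        end = min(len(text), start + max_chars)
        parts.append(text[start:end])
        if end >= len(text):
            break
        # Each new window starts overlap_chars before the previous window's end.
        start = end - overlap_chars
    return parts


pieces = split_with_overlap("abcdefghij", max_chars=4, overlap_chars=1)
print(pieces)  # ['abcd', 'defg', 'ghij']
```

Each window advances by `max_chars - overlap_chars` characters, which is why the guard matters: with a step of zero or less the loop would never reach the end of the text.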


@@ -0,0 +1,115 @@
"""Rebuild the migrated Milvus collection from saved vector chunks."""
from __future__ import annotations
import argparse
import json
from pathlib import Path
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections, utility
DEFAULT_COLLECTION = "regulations_dense_1024_v2"
DEFAULT_DIM = 1024
def build_collection(name: str, dim: int) -> Collection:
    """Create the migrated Milvus collection from scratch."""
    # Any existing collection with this name is dropped before it is recreated.
    if utility.has_collection(name):
        utility.drop_collection(name)
    schema = CollectionSchema(
        fields=[
            FieldSchema(name="id", dtype=DataType.VARCHAR, max_length=128, is_primary=True, auto_id=False),
            FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=64),
            FieldSchema(name="doc_title", dtype=DataType.VARCHAR, max_length=256),
            FieldSchema(name="chunk_id", dtype=DataType.VARCHAR, max_length=128),
            FieldSchema(name="chunk_index", dtype=DataType.INT64),
            FieldSchema(name="piece_index", dtype=DataType.INT64),
            FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
            FieldSchema(name="embedding_text", dtype=DataType.VARCHAR, max_length=65535),
            FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim),
            FieldSchema(name="semantic_id", dtype=DataType.VARCHAR, max_length=128),
            FieldSchema(name="chunk_type", dtype=DataType.VARCHAR, max_length=64),
            FieldSchema(name="page_start", dtype=DataType.INT64),
            FieldSchema(name="page_end", dtype=DataType.INT64),
            FieldSchema(name="section_level", dtype=DataType.INT64),
            FieldSchema(name="source_ids", dtype=DataType.VARCHAR, max_length=4096),
            FieldSchema(name="section_path", dtype=DataType.VARCHAR, max_length=4096),
            FieldSchema(name="section_title", dtype=DataType.VARCHAR, max_length=512),
            FieldSchema(name="metadata_json", dtype=DataType.VARCHAR, max_length=65535),
            FieldSchema(name="created_at", dtype=DataType.INT64),
        ],
        description="Dense-only regulations index",
        enable_dynamic_field=False,
    )
    collection = Collection(name=name, schema=schema)
    collection.create_index(
        field_name="embedding",
        index_params={
            "metric_type": "COSINE",
            "index_type": "IVF_FLAT",
            "params": {"nlist": 128},
        },
    )
    return collection
def load_chunks(payload_path: Path) -> list[dict]:
    """Load vector chunks emitted by the Aliyun parser pipeline."""
    payload = json.loads(payload_path.read_text(encoding="utf-8"))
    if isinstance(payload, dict):
        chunks = payload.get("vector_chunks", [])
    else:
        chunks = payload
    if not isinstance(chunks, list):
        raise ValueError("vector chunk payload must be a list or a dict containing vector_chunks")
    return chunks
def main() -> None:
    """Rebuild the target collection from a vector chunk payload."""
    parser = argparse.ArgumentParser(description="Rebuild the migrated Milvus collection.")
    parser.add_argument("--host", default="127.0.0.1", help="Milvus host")
    parser.add_argument("--port", default="19530", help="Milvus port")
    parser.add_argument("--collection", default=DEFAULT_COLLECTION, help="Milvus collection name")
    parser.add_argument("--dim", type=int, default=DEFAULT_DIM, help="Embedding dimension")
    parser.add_argument("--payload", required=True, help="Path to vector_chunks.json or a compatible JSON file")
    args = parser.parse_args()
    connections.connect("default", host=args.host, port=args.port)
    collection = build_collection(args.collection, args.dim)
    chunks = load_chunks(Path(args.payload))
    if not chunks:
        print("No vector chunks found; collection was created but remains empty.")
        return
    # Column-ordered data: one list per schema field, in schema order.
    data = [
        [chunk["chunk_id"] for chunk in chunks],
        [chunk["doc_id"] for chunk in chunks],
        [chunk["doc_title"] for chunk in chunks],
        [chunk["chunk_id"] for chunk in chunks],
        [int(chunk.get("chunk_index", 0) or 0) for chunk in chunks],
        [int(chunk.get("piece_index", 0) or 0) for chunk in chunks],
        [str(chunk.get("text", ""))[:65535] for chunk in chunks],
        [str(chunk.get("embedding_text", chunk.get("text", "")))[:65535] for chunk in chunks],
        [chunk["embedding"] for chunk in chunks],
        [str(chunk.get("semantic_id", "")) for chunk in chunks],
        [str(chunk.get("chunk_type", "")) for chunk in chunks],
        [int(chunk.get("page_start", 0) or 0) for chunk in chunks],
        [int(chunk.get("page_end", 0) or 0) for chunk in chunks],
        [int(chunk.get("section_level", 0) or 0) for chunk in chunks],
        [json.dumps(chunk.get("source_ids", []), ensure_ascii=False) for chunk in chunks],
        [json.dumps(chunk.get("section_path", []), ensure_ascii=False) for chunk in chunks],
        [str(chunk.get("section_title", "")) for chunk in chunks],
        [json.dumps(chunk, ensure_ascii=False) for chunk in chunks],
        [int(chunk.get("created_at", 0) or 0) for chunk in chunks],
    ]
    collection.insert(data)
    collection.flush()
    collection.load()
    print(f"Rebuilt collection {args.collection} with {len(chunks)} chunks.")
if __name__ == "__main__":
    main()
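
The rebuild script reads `chunk["embedding"]` for every chunk and drops the existing collection before inserting, so a payload without embeddings (for example, the raw parser output) would fail only after the old data is gone. A pre-flight check along these lines could catch that earlier. It is a sketch only; `find_invalid_chunks` is a hypothetical helper that is not part of the script.

```python
from typing import Dict, List, Sequence


def find_invalid_chunks(chunks: Sequence[Dict], dim: int) -> List[str]:
    """Describe each chunk whose embedding is missing or has the wrong length."""
    problems: List[str] = []
    for position, chunk in enumerate(chunks):
        label = chunk.get("chunk_id", f"#{position}")
        embedding = chunk.get("embedding")
        if embedding is None:
            problems.append(f"{label}: missing embedding")
        elif len(embedding) != dim:
            problems.append(f"{label}: expected {dim} dims, got {len(embedding)}")
    return problems


sample = [
    {"chunk_id": "chunk-1", "embedding": [0.0] * 4},
    {"chunk_id": "chunk-2"},
    {"chunk_id": "chunk-3", "embedding": [0.0] * 3},
]
print(find_invalid_chunks(sample, dim=4))
```

Running the check before `build_collection` would let the script abort without touching the existing collection.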


@@ -0,0 +1,122 @@
-- Database schema for the regulation document vector retrieval system
-- PostgreSQL
-- ==================== Documents table ====================
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    doc_id VARCHAR(128) UNIQUE NOT NULL, -- unique document identifier, e.g. "GB14747-2006"
    title VARCHAR(512) NOT NULL, -- document title
    doc_type VARCHAR(32), -- document type: standard / regulation / specification (标准/法规/规范)
    standard_number VARCHAR(64), -- standard number, e.g. "GB 14747-2006"
    publish_date DATE, -- publication date
    implement_date DATE, -- effective date
    status VARCHAR(32), -- status: in force / repealed / revised (现行/废止/修订)
    source_url VARCHAR(512), -- source URL
    file_path VARCHAR(512), -- local PDF file path
    file_size INT, -- file size in bytes
    upload_time TIMESTAMP DEFAULT NOW(), -- upload time
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
COMMENT ON TABLE documents IS 'Document metadata table';
COMMENT ON COLUMN documents.doc_id IS 'Unique document identifier, used to link Milvus and other tables';
COMMENT ON COLUMN documents.standard_number IS 'Standard number, e.g. GB 14747-2006';
-- ==================== Section structure table ====================
CREATE TABLE sections (
    id SERIAL PRIMARY KEY,
    doc_id VARCHAR(128) NOT NULL,
    unique_id VARCHAR(64) NOT NULL, -- unique identifier returned by Alibaba Cloud
    level INT NOT NULL, -- depth level: 1, 2, 3...
    title VARCHAR(512) NOT NULL, -- section title
    page INT, -- page number
    index INT, -- order within the page
    parent_id INT, -- parent section ID (tree structure)
    created_at TIMESTAMP DEFAULT NOW(),
    CONSTRAINT fk_sections_doc_id FOREIGN KEY (doc_id) REFERENCES documents(doc_id),
    CONSTRAINT fk_sections_parent_id FOREIGN KEY (parent_id) REFERENCES sections(id),
    CONSTRAINT uq_sections_doc_unique UNIQUE (doc_id, unique_id)
);
COMMENT ON TABLE sections IS 'Section structure table, used for table-of-contents navigation';
COMMENT ON COLUMN sections.parent_id IS 'Parent section ID, used to build the tree structure';
COMMENT ON COLUMN sections.level IS 'Depth level; 1 is the top level';
-- ==================== Semantic blocks table ====================
CREATE TABLE semantic_blocks (
    id SERIAL PRIMARY KEY,
    doc_id VARCHAR(128) NOT NULL,
    semantic_id VARCHAR(64) NOT NULL, -- unique semantic block identifier
    block_type VARCHAR(32) NOT NULL, -- type: section_text/table/figure
    page_start INT NOT NULL, -- first page
    page_end INT NOT NULL, -- last page
    section_id INT, -- owning section
    section_title VARCHAR(512), -- section title (denormalized for easier queries)
    section_level INT, -- section depth level
    source_ids JSONB, -- original layout IDs (JSON array)
    text TEXT NOT NULL, -- full content (not split)
    created_at TIMESTAMP DEFAULT NOW(),
    CONSTRAINT fk_semantic_blocks_doc_id FOREIGN KEY (doc_id) REFERENCES documents(doc_id),
    CONSTRAINT fk_semantic_blocks_section_id FOREIGN KEY (section_id) REFERENCES sections(id),
    CONSTRAINT uq_semantic_blocks_doc_semantic UNIQUE (doc_id, semantic_id)
);
COMMENT ON TABLE semantic_blocks IS 'Semantic blocks table, used for neighborhood expansion and restoring full content';
COMMENT ON COLUMN semantic_blocks.block_type IS 'Type: section_text (body text), table, figure';
COMMENT ON COLUMN semantic_blocks.source_ids IS 'Array of uniqueId values from the original Alibaba Cloud layouts';
COMMENT ON COLUMN semantic_blocks.text IS 'Full semantic content, not split';
-- ==================== Vector chunk metadata table ====================
CREATE TABLE vector_chunks (
    id SERIAL PRIMARY KEY,
    doc_id VARCHAR(128) NOT NULL,
    chunk_id VARCHAR(64) NOT NULL, -- Milvus primary key
    semantic_id VARCHAR(64) NOT NULL, -- linked semantic block
    chunk_index INT NOT NULL, -- chunk sequence number (global)
    piece_index INT, -- piece sequence number within the same semantic block
    page_start INT,
    page_end INT,
    section_title VARCHAR(512),
    text VARCHAR(2048), -- chunk text (optional, shortened version for display)
    source_ids JSONB, -- original layout IDs (JSON array)
    created_at TIMESTAMP DEFAULT NOW(),
    CONSTRAINT fk_vector_chunks_doc_id FOREIGN KEY (doc_id) REFERENCES documents(doc_id),
    CONSTRAINT fk_vector_chunks_semantic_id FOREIGN KEY (doc_id, semantic_id)
        REFERENCES semantic_blocks(doc_id, semantic_id),
    CONSTRAINT uq_vector_chunks_doc_chunk UNIQUE (doc_id, chunk_id)
);
COMMENT ON TABLE vector_chunks IS 'Vector chunk metadata table, used for fast join queries';
COMMENT ON COLUMN vector_chunks.chunk_id IS 'Primary key in the Milvus vector store';
COMMENT ON COLUMN vector_chunks.piece_index IS 'Piece sequence number within the same semantic block, used to reassemble pieces in order';
-- ==================== Indexes ====================
CREATE INDEX idx_sections_doc_id ON sections(doc_id);
CREATE INDEX idx_sections_parent_id ON sections(parent_id);
CREATE INDEX idx_sections_level ON sections(level);
CREATE INDEX idx_semantic_blocks_doc_id ON semantic_blocks(doc_id);
CREATE INDEX idx_semantic_blocks_section_id ON semantic_blocks(section_id);
CREATE INDEX idx_semantic_blocks_block_type ON semantic_blocks(block_type);
CREATE INDEX idx_semantic_blocks_semantic_id ON semantic_blocks(semantic_id);
CREATE INDEX idx_vector_chunks_doc_id ON vector_chunks(doc_id);
CREATE INDEX idx_vector_chunks_semantic_id ON vector_chunks(semantic_id);
CREATE INDEX idx_vector_chunks_chunk_id ON vector_chunks(chunk_id);
-- ==================== Trigger: automatically update updated_at ====================
CREATE OR REPLACE FUNCTION update_updated_at()
RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = NOW();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER tr_documents_updated_at
    BEFORE UPDATE ON documents
    FOR EACH ROW EXECUTE FUNCTION update_updated_at();


@@ -0,0 +1,327 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Embed the chunks in vector_chunks.json and upload them to Milvus and PostgreSQL.
Embeddings are requested through a relay service that exposes an OpenAI-compatible API.
"""
import argparse
import json
import os
import time
from pathlib import Path
from typing import List, Dict
import psycopg2
from psycopg2.extras import execute_values
from pymilvus import (
    connections,
    Collection,
    FieldSchema,
    CollectionSchema,
    DataType,
    utility,
)
from openai import OpenAI
# ===================== Configuration =====================
# Relay service settings
RELAY_BASE_URL = "http://6.86.80.4:30080/v1"
# Secrets are read from the environment rather than committed to the repository.
# The variable names RELAY_API_KEY and PG_PASSWORD are a convention chosen here.
RELAY_API_KEY = os.environ.get("RELAY_API_KEY", "")
EMBEDDING_MODEL = "text-embedding-v3"  # embedding model supported by the relay
# Milvus settings
MILVUS_HOST = "localhost"
MILVUS_PORT = "19530"
COLLECTION_NAME = "regulation_chunks"
# PostgreSQL settings
PG_HOST = "6.86.80.10"
PG_PORT = 5432
PG_USER = "postgresql"
PG_PASSWORD = os.environ.get("PG_PASSWORD", "")
PG_DATABASE = "postgres"
# ===================== Embedding =====================
def get_openai_client(api_key: str, base_url: str) -> OpenAI:
    """Create an OpenAI client that connects to the relay service."""
    return OpenAI(api_key=api_key, base_url=base_url)


def get_embeddings_batch(client: OpenAI, texts: List[str], batch_size: int = 10) -> List[List[float]]:
    """Embed the texts in batches and return one vector per text."""
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        print(f"Embedding batch {i // batch_size + 1}/{(len(texts) - 1) // batch_size + 1}...")
        response = client.embeddings.create(
            model=EMBEDDING_MODEL,
            input=batch,
        )
        embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(embeddings)
    return all_embeddings
# ===================== Milvus =====================
def init_milvus(host: str, port: str):
    connections.connect("default", host=host, port=port)
    print(f"Connected to Milvus: {host}:{port}")


def create_collection(name: str, dim: int) -> Collection:
    """Create the collection, dropping and rebuilding it if it already exists."""
    if utility.has_collection(name):
        print(f"Collection '{name}' already exists; dropping and recreating it")
        utility.drop_collection(name)
    fields = [
        FieldSchema(name="chunk_id", dtype=DataType.VARCHAR, max_length=64, is_primary=True),
        FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=128),
        FieldSchema(name="doc_title", dtype=DataType.VARCHAR, max_length=512),
        FieldSchema(name="chunk_index", dtype=DataType.INT64),
        FieldSchema(name="semantic_id", dtype=DataType.VARCHAR, max_length=64),
        FieldSchema(name="chunk_type", dtype=DataType.VARCHAR, max_length=32),
        FieldSchema(name="page_start", dtype=DataType.INT64),
        FieldSchema(name="page_end", dtype=DataType.INT64),
        FieldSchema(name="section_title", dtype=DataType.VARCHAR, max_length=512),
        FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=2048),
        FieldSchema(name="source_ids", dtype=DataType.VARCHAR, max_length=4096),  # JSON string
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim),
    ]
    schema = CollectionSchema(fields, description="Retrieval chunks for regulation documents")
    collection = Collection(name, schema)
    # Create the vector index (IVF_FLAT suits small to medium collections)
    index_params = {
        "metric_type": "COSINE",
        "index_type": "IVF_FLAT",
        "params": {"nlist": 128},
    }
    collection.create_index("embedding", index_params)
    print(f"Collection '{name}' created and indexed")
    return collection


def insert_chunks(collection: Collection, chunks: List[Dict], embeddings: List[List[float]]):
    """Insert chunks into Milvus as column-ordered data matching the schema."""
    data = [
        [c["chunk_id"] for c in chunks],
        [c["doc_id"] for c in chunks],
        [c["doc_title"] for c in chunks],
        [c["chunk_index"] for c in chunks],
        [c["semantic_id"] for c in chunks],
        [c["chunk_type"] for c in chunks],
        [c["page_start"] for c in chunks],
        [c["page_end"] for c in chunks],
        [c["section_title"] for c in chunks],
        [c["text"] for c in chunks],
        [json.dumps(c.get("source_ids", [])) for c in chunks],  # JSON string
        embeddings,
    ]
    collection.insert(data)
    collection.flush()
    print(f"Inserted {len(chunks)} chunks")


def load_collection(collection: Collection):
    """Load the collection into memory (required before searching)."""
    collection.load()
    print("Collection loaded into memory")
# ===================== PostgreSQL =====================
def get_pg_connection(host: str, port: int, user: str, password: str, database: str):
    """Open a PostgreSQL connection."""
    conn = psycopg2.connect(
        host=host,
        port=port,
        user=user,
        password=password,
        database=database,
    )
    print(f"Connected to PostgreSQL: {host}:{port}/{database}")
    return conn


def insert_chunks_to_pg(conn, chunks: List[Dict], doc_data: Dict):
    """Insert the document, its semantic blocks, and chunk metadata into PostgreSQL."""
    cursor = conn.cursor()
    try:
        # 1. Upsert the document
        cursor.execute("""
            INSERT INTO documents (doc_id, title, standard_number, upload_time)
            VALUES (%s, %s, %s, NOW())
            ON CONFLICT (doc_id) DO UPDATE SET title = EXCLUDED.title, updated_at = NOW()
        """, (doc_data["doc_id"], doc_data["doc_title"], doc_data.get("standard_number")))
        # 2. Upsert the semantic blocks
        semantic_blocks = doc_data.get("semantic_blocks", [])
        if semantic_blocks:
            block_rows = [
                (
                    doc_data["doc_id"],
                    block["semantic_id"],
                    block["block_type"],
                    block["page_start"],
                    block["page_end"],
                    block.get("section_title"),
                    block.get("section_level"),
                    json.dumps(block.get("source_ids", [])),
                    block["text"],
                )
                for block in semantic_blocks
            ]
            execute_values(
                cursor,
                """
                INSERT INTO semantic_blocks
                (doc_id, semantic_id, block_type, page_start, page_end, section_title, section_level, source_ids, text)
                VALUES %s
                ON CONFLICT (doc_id, semantic_id) DO UPDATE SET text = EXCLUDED.text
                """,
                block_rows,
            )
            print(f"Inserted {len(semantic_blocks)} semantic blocks")
        # 3. Upsert the vector chunk metadata
        chunk_rows = [
            (
                doc_data["doc_id"],
                chunk["chunk_id"],
                chunk["semantic_id"],
                chunk["chunk_index"],
                chunk.get("piece_index"),
                chunk["page_start"],
                chunk["page_end"],
                chunk.get("section_title"),
                chunk["text"],
                json.dumps(chunk.get("source_ids", [])),
            )
            for chunk in chunks
        ]
        execute_values(
            cursor,
            """
            INSERT INTO vector_chunks
            (doc_id, chunk_id, semantic_id, chunk_index, piece_index, page_start, page_end, section_title, text, source_ids)
            VALUES %s
            ON CONFLICT (doc_id, chunk_id) DO UPDATE SET text = EXCLUDED.text
            """,
            chunk_rows,
        )
        print(f"Inserted metadata for {len(chunks)} vector chunks")
        conn.commit()
        print("PostgreSQL insert complete")
    except Exception:
        # Roll back the whole transaction, then re-raise with the original traceback.
        conn.rollback()
        raise
    finally:
        cursor.close()
# ===================== Main flow =====================
def load_data(file_path: Path) -> Dict:
    """Load vector_chunks.json and return the full payload."""
    data = json.loads(file_path.read_text(encoding="utf-8"))
    return data


def upload_to_milvus_and_pg(
    chunks_file: str,
    api_key: str,
    base_url: str,
    milvus_host: str,
    milvus_port: str,
    collection_name: str,
    batch_size: int,
    pg_host: str,
    pg_port: int,
    pg_user: str,
    pg_password: str,
    pg_database: str,
):
    # 1. Load the full payload
    chunks_path = Path(chunks_file).expanduser().resolve()
    if not chunks_path.exists():
        raise FileNotFoundError(f"File not found: {chunks_path}")
    data = load_data(chunks_path)
    chunks = data.get("vector_chunks", [])
    if not chunks:
        raise ValueError("vector_chunks is empty")
    print(f"Loaded {len(chunks)} chunks")
    # 2. Open connections
    client = get_openai_client(api_key, base_url)
    init_milvus(milvus_host, milvus_port)
    pg_conn = get_pg_connection(pg_host, pg_port, pg_user, pg_password, pg_database)
    try:
        # 3. Compute embeddings
        texts = [c["embedding_text"] for c in chunks]
        embeddings = get_embeddings_batch(client, texts, batch_size)
        print(f"Generated {len(embeddings)} vectors")
        # 4. Read the embedding dimension from the first vector
        embedding_dim = len(embeddings[0])
        print(f"Embedding dimension: {embedding_dim}")
        # 5. Create the collection and insert into Milvus
        collection = create_collection(collection_name, embedding_dim)
        insert_chunks(collection, chunks, embeddings)
        load_collection(collection)
        # 6. Insert into PostgreSQL
        insert_chunks_to_pg(pg_conn, chunks, data)
    finally:
        # 7. Close the connection even if an earlier step fails
        pg_conn.close()
    print("Upload complete!")
# ===================== CLI =====================
def main():
    parser = argparse.ArgumentParser(description="Embed vector_chunks and upload them to Milvus and PostgreSQL")
    parser.add_argument("chunks_file", help="Path to vector_chunks.json")
    parser.add_argument("--api-key", default=RELAY_API_KEY, help="Relay API key")
    parser.add_argument("--base-url", default=RELAY_BASE_URL, help="Relay base URL")
    parser.add_argument("--milvus-host", default=MILVUS_HOST, help="Milvus host")
    parser.add_argument("--milvus-port", default=MILVUS_PORT, help="Milvus port")
    parser.add_argument("--collection", default=COLLECTION_NAME, help="Milvus collection name")
    parser.add_argument("--batch-size", type=int, default=10, help="Embedding batch size (the relay allows at most 10)")
    parser.add_argument("--pg-host", default=PG_HOST, help="PostgreSQL host")
    parser.add_argument("--pg-port", type=int, default=PG_PORT, help="PostgreSQL port")
    parser.add_argument("--pg-user", default=PG_USER, help="PostgreSQL user")
    parser.add_argument("--pg-password", default=PG_PASSWORD, help="PostgreSQL password")
    parser.add_argument("--pg-database", default=PG_DATABASE, help="PostgreSQL database")
    args = parser.parse_args()
    upload_to_milvus_and_pg(
        chunks_file=args.chunks_file,
        api_key=args.api_key,
        base_url=args.base_url,
        milvus_host=args.milvus_host,
        milvus_port=args.milvus_port,
        collection_name=args.collection,
        batch_size=args.batch_size,
        pg_host=args.pg_host,
        pg_port=args.pg_port,
        pg_user=args.pg_user,
        pg_password=args.pg_password,
        pg_database=args.pg_database,
    )


if __name__ == "__main__":
    main()

File diff suppressed because it is too large.


@@ -0,0 +1,263 @@
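# Document Parsing and Vector Retrieval Guide
## Related files
- `aliyun_doc_parser.py`: calls Alibaba Cloud Document Intelligence to parse a PDF and produce the raw `layouts.json`
- `layouts_to_vector_chunks.py`: converts `layouts.json` into a three-layer structure suited to vector database ingestion
- `layouts.json`: the raw layout result returned by Alibaba Cloud
- `vector_chunks.json`: the structured output after conversion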
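## I. Structure of `layouts.json`
The top level of `layouts.json` is an array, and each element represents one layout block. Common fields:
- `type`: primary type, for example `title`, `text`, `table`, `figure`
- `subType`: finer-grained semantic type, for example `doc_title`, `para_title`, `para`, `picture`, `pic_title`, `pic_caption`
- `text`: plain text of the layout block
- `markdownContent`: text with markdown markup
- `pageNum`: page number
- `index`: order within the page
- `level`: heading level
- `uniqueId`: unique identifier of the layout block
- `blocks`: finer-grained text and style information
- `cells`: table cells, present only for the `table` type
This structure is not a plain OCR text stream. It is structured data that already carries layout understanding and semantic classification.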
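## II. Recommended three-layer structure
### 1. Structure layer `structure_nodes`
The structure layer rebuilds the document's heading tree. It is not itself a unit of vector retrieval.
Example:
- `1 范围` (Scope)
- `2 规范性引用文件` (Normative references)
- `3 术语和定义` (Terms and definitions)
- `3.1 儿童三轮车` (Children's tricycle)
- `3.2 轮距` (Wheel track)
The structure layer mainly serves to attach a `section_path` to downstream chunks.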
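### 2. Semantic layer `semantic_blocks`
The semantic layer holds content blocks aggregated by meaning. There are three kinds:
- `section_text`: consecutive body text under the same section, merged into one block
- `table`: each table becomes its own block
- `figure`: a figure together with its title and caption becomes its own block
This layer suits semantic understanding better than single layouts do, and it is the natural unit for later context expansion.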
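### 3. Retrieval layer `vector_chunks`
The retrieval layer holds the chunks that are finally written to the vector database.
Processing:
- Short blocks in `semantic_blocks` are stored as they are
- Longer blocks are split further by `max_chars`
- Adjacent pieces keep an overlap of `overlap_chars`
- Every chunk carries full metadata to support later filtering, reranking, and neighborhood expansion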
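
The splitting rule above can be sketched as a sliding character window. This is a minimal illustration, not the code in `layouts_to_vector_chunks.py`: the function name `split_with_overlap` is hypothetical, the defaults simply mirror the `--max-chars 500 --overlap-chars 80` example command, and the real script may well prefer sentence or clause boundaries over fixed offsets.

```python
def split_with_overlap(text: str, max_chars: int = 500, overlap_chars: int = 80) -> list[str]:
    """Split text into pieces of at most max_chars; neighbours share overlap_chars characters."""
    if max_chars <= 0 or not 0 <= overlap_chars < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap_chars < max_chars")
    if len(text) <= max_chars:
        return [text]  # short blocks are stored as a single chunk
    step = max_chars - overlap_chars  # how far the window advances each time
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return pieces
```

Each piece would then inherit the parent block's `semantic_id` and receive its own `piece_index`, so the pieces can be put back in order later.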
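## III. What the conversion script currently does
`layouts_to_vector_chunks.py` currently implements:
1. Filtering table-of-contents noise (such as `目次`)
2. Maintaining the section path from heading levels
3. Aggregating body text into `section_text`
4. Converting each table into a separate `table` block
5. Converting figure-related content into a separate `figure` block
6. Splitting long text further into the final `vector_chunks`
7. Generating `embedding_text` for every retrieval chunk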
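
Step 2 in the list above, maintaining the section path from heading levels, is commonly done with a stack. The sketch below is an assumption about how that can work and is not taken from the script: `build_section_paths` is a hypothetical name, and it assumes that `type == "title"` marks headings and that a smaller `level` means a higher-level heading.

```python
def build_section_paths(layouts: list[dict]) -> list[tuple[str, list[str]]]:
    """Return (text, section_path) for every non-title layout, in document order."""
    stack: list[tuple[int, str]] = []  # (level, title) from the root down to the current heading
    result = []
    for layout in layouts:
        if layout.get("type") == "title":
            level = int(layout.get("level", 1))
            # A new heading closes every open heading at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, layout.get("text", "")))
        else:
            result.append((layout.get("text", ""), [title for _, title in stack]))
    return result
```

Joining the returned list with ` > ` gives a path in the same shape as the context example later in this guide.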
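## IV. Why not ingest one vector per layout
If every layout in `layouts.json` is embedded directly:
- The granularity is too fine
- Headings are easily separated from their body text
- Tables lose their structural context
- Figure information cannot be expressed in full
- Retrieval hits are noisy
For standards documents, the most suitable unit is usually not the sentence but the clause-level semantic block.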
## V. Recommended fields to store
Each record in the vector database should store at least:
- `embedding_text`: used to generate the vector
- `text`: the original chunk text
- `chunk_id`
- `semantic_id`
- `chunk_type`: `section_text` / `table` / `figure`
- `section_path`
- `section_title`
- `section_level`
- `page_start`
- `page_end`
- `doc_id`
- `doc_title`
- `source_ids`
Of these:
- Vectorized field: `embedding_text`
- Display field: `text`
- Retrieval-enhancement fields: the remaining metadata
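## VI. Recommended retrieval approach
Do not rely on plain top-k vector search alone. The recommended approach is:
**vector recall + metadata reranking + neighborhood expansion**
### 1. Vector recall
Embed `vector_chunks[*].embedding_text` and retrieve the top 10 to 15 entries from the vector database.
At query time the user's question can be lightly rewritten, for example:
Original question:
`儿童三轮车的定义是什么?` (What is the definition of a children's tricycle?)
Possible rewrite:
`请检索 GB 14747—2006 儿童三轮车安全要求 中关于“儿童三轮车定义”的条款、术语、表格或图示说明。` (Retrieve the clauses, terms, tables, or figure notes about "the definition of a children's tricycle" in GB 14747-2006, Safety requirements for children's tricycles.)
This form is better suited to retrieval over standards documents.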
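### 2. Metadata reranking
After vector recall, apply lightweight rule-based reranking on the metadata.
Common rules:
- `chunk_type == section_text`: higher priority for definition and requirement questions
- `section_path` matches a query keyword: for example, when the query asks for a "定义" (definition), the `术语和定义` section comes first
- `chunk_type == table`: boosted for questions about "尺寸 / 参数 / 数值 / 对照 / 要求" (dimension / parameter / value / comparison / requirement)
- `chunk_type == figure`: boosted for questions about "图 / 结构 / 状态 / 示意" (figure / structure / state / illustration)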
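
The rules above can be expressed as a small additive bonus on top of the vector score. The sketch below is illustrative only: the function names, the hit field `score`, and the bonus weights 0.05 and 0.10 are assumptions made for this example, and the keyword lists are copied from the rules. It also assumes a higher `score` means a better match, as with cosine similarity.

```python
TABLE_HINTS = ("尺寸", "参数", "数值", "对照", "要求")
FIGURE_HINTS = ("图", "结构", "状态", "示意")
DEFINITION_HINTS = ("定义",)


def rule_bonus(query: str, hit: dict) -> float:
    """Return an additive bonus for one hit; the weights are placeholders to tune."""
    chunk_type = hit.get("chunk_type", "")
    path = hit.get("section_path", "")
    if isinstance(path, list):  # accept either a list of titles or a joined string
        path = " > ".join(path)
    bonus = 0.0
    if chunk_type == "section_text" and any(k in query for k in DEFINITION_HINTS + ("要求",)):
        bonus += 0.05
    if any(k in query and k in path for k in DEFINITION_HINTS):
        bonus += 0.10  # the section path itself matches the query keyword
    if chunk_type == "table" and any(k in query for k in TABLE_HINTS):
        bonus += 0.05
    if chunk_type == "figure" and any(k in query for k in FIGURE_HINTS):
        bonus += 0.05
    return bonus


def rerank(query: str, hits: list[dict]) -> list[dict]:
    """Sort hits by vector score plus rule bonus, best first."""
    return sorted(hits, key=lambda h: h.get("score", 0.0) + rule_bonus(query, h), reverse=True)
```

Keeping the bonuses small relative to the score range means the rules only reorder close candidates and do not override a clearly better vector match.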
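### 3. Neighborhood expansion
Retrieval hits are final pieces, but an answer usually needs fuller context.
After a `vector_chunk` is hit:
1. First pull back every chunk under the same `semantic_id`
2. If that is still not enough, add content from the same `section_path`, adjacent pages, or adjacent `chunk_index` values
This restores the complete clause instead of handing the model a small fragment.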
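
A minimal in-memory sketch of the two expansion steps follows. It assumes all chunks of the document are available as dictionaries with `doc_id`, `semantic_id`, `chunk_index`, and `text`. The name `expand_hit`, the `min_chars` threshold that decides when the block is "not enough", and the `window` size are all assumptions for illustration; a real implementation would more likely query the `vector_chunks` table.

```python
def expand_hit(hit: dict, all_chunks: list[dict], min_chars: int = 200, window: int = 1) -> list[dict]:
    """Return the hit's semantic block, widened to neighbouring chunks when the block is short."""
    same_doc = [c for c in all_chunks if c["doc_id"] == hit["doc_id"]]
    # Step 1: every chunk that shares the hit's semantic_id.
    picked = [c for c in same_doc if c["semantic_id"] == hit["semantic_id"]] or [hit]
    # Step 2: if the block is too short, add chunks with adjacent chunk_index values.
    if sum(len(c["text"]) for c in picked) < min_chars:
        low = min(c["chunk_index"] for c in picked) - window
        high = max(c["chunk_index"] for c in picked) + window
        widened = [c for c in same_doc if low <= c["chunk_index"] <= high]
        picked = widened or picked
    return sorted(picked, key=lambda c: c["chunk_index"])
```

The function returns chunks in order and leaves them unmerged, because adjacent pieces overlap and naive concatenation would duplicate the shared text.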
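## VII. Retrieval focus by question type
### 1. Definition questions
For example:
- `儿童三轮车的定义是什么?` (What is the definition of a children's tricycle?)
- `轮距是什么意思?` (What does wheel track mean?)
Prioritize:
- `section_text`
- content whose `section_path` contains `术语和定义`
### 2. Requirement questions
For example:
- `外露突出物有什么要求?` (What are the requirements for exposed protrusions?)
- `辅助推杆有哪些安全要求?` (What safety requirements apply to the auxiliary push rod?)
Prioritize:
- `section_text`
- `table`
### 3. Numeric / dimension / comparison questions
For example:
- `鞍座到脚蹬距离要求是什么?` (What is the required saddle-to-pedal distance?)
- `哪些项目需要满足规定尺寸?` (Which items must meet specified dimensions?)
Prioritize:
- `table`
- `section_text`
### 4. Figure questions
For example:
- `正常乘骑状态是什么意思?` (What does normal riding position mean?)
- `图1表示什么?` (What does Figure 1 show?)
Prioritize:
- `figure`
- adjacent `section_text` in the same section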
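## VIII. Recommended end-to-end retrieval flow
Use the following fixed flow:
1. Run embedding retrieval over `vector_chunks.embedding_text`
2. Take the top 10 to 15 candidates
3. Rerank by rule using `chunk_type + section_path`
4. Pull back the complete semantic block around each `semantic_id`
5. Pass 3 to 5 context groups to the LLM for answering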
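
The five steps can be wired together as one orchestration function. The sketch below is an assumption-laden outline: `retrieve_context` and its callable parameters (`embed`, `search`, `rerank`, `fetch_block`) are hypothetical names injected by the caller, so the function commits to no particular embedding client or vector database.

```python
from typing import Callable


def retrieve_context(
    question: str,
    embed: Callable[[str], list[float]],
    search: Callable[[list[float], int], list[dict]],
    rerank: Callable[[str, list[dict]], list[dict]],
    fetch_block: Callable[[str, str], list[dict]],
    top_k: int = 15,
    keep: int = 5,
) -> list[list[dict]]:
    """Return up to `keep` context groups, one per distinct semantic block."""
    vector = embed(question)               # step 1: embed the question
    candidates = search(vector, top_k)     # step 2: top-k candidates
    ranked = rerank(question, candidates)  # step 3: rule-based rerank
    groups: list[list[dict]] = []
    seen: set[tuple[str, str]] = set()
    for hit in ranked:                     # step 4: expand around semantic_id
        key = (hit["doc_id"], hit["semantic_id"])
        if key in seen:
            continue  # several hits from one block yield a single group
        seen.add(key)
        groups.append(fetch_block(hit["doc_id"], hit["semantic_id"]))
        if len(groups) == keep:            # step 5: keep only a few groups
            break
    return groups
```

Deduplicating on `(doc_id, semantic_id)` matters because overlapping pieces of one block often appear together in the top-k list.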
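## IX. Organizing context for the LLM
Do not hand the raw JSON to the model. Arrange the context in a format like this:
```text
[Hit 1]
Section: 3 术语和定义 > 3.1 儿童三轮车
Pages: 1-2
Type: section_text
Content:
......
[Hit 2]
Section: 4 要求 > 4.3 外露突出物
Pages: 5
Type: section_text
Content:
......
[Hit 3]
Section: 5 试验方法
Pages: 8
Type: table
Content:
......
```
This format helps the model answer consistently and cite its sources.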
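
A small formatter can produce this layout from expanded hits. The sketch assumes each group is a dictionary with `section_path` (already joined into a string), `page_start`, `page_end`, `chunk_type`, and `text`. The function name `format_context` and the English labels are choices made for this example.

```python
def format_context(groups: list[dict]) -> str:
    """Render context groups as labelled plain-text blocks for the prompt."""
    parts = []
    for number, group in enumerate(groups, start=1):
        start, end = group["page_start"], group["page_end"]
        pages = str(start) if start == end else f"{start}-{end}"  # single page or range
        parts.append(
            "\n".join(
                [
                    f"[Hit {number}]",
                    f"Section: {group['section_path']}",
                    f"Pages: {pages}",
                    f"Type: {group['chunk_type']}",
                    "Content:",
                    group["text"],
                ]
            )
        )
    return "\n".join(parts)
```

Numbering the blocks gives the model a stable handle such as "Hit 2" to cite in its answer.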
## X. Conversion commands
Generate the three-layer structure:
```bash
python3 /home/huaci/dev/ai/SuperMew/tests/layouts_to_vector_chunks.py \
--layouts /home/huaci/dev/ai/SuperMew/tests/layouts.json \
--out /home/huaci/dev/ai/SuperMew/tests/vector_chunks.json
```
Customize the chunk size:
```bash
python3 /home/huaci/dev/ai/SuperMew/tests/layouts_to_vector_chunks.py \
--layouts /home/huaci/dev/ai/SuperMew/tests/layouts.json \
--out /home/huaci/dev/ai/SuperMew/tests/vector_chunks.json \
--max-chars 500 \
--overlap-chars 80
```


@@ -3,6 +3,7 @@
from contextlib import asynccontextmanager
from fastapi import FastAPI, Request
from fastapi.encoders import jsonable_encoder
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
from loguru import logger
@@ -11,7 +12,8 @@ from app.api.models import ErrorResponse
from app.api.routes import api_router
from app.config.logging import setup_logging
from app.config.settings import settings
from app.services.llm.llm_factory import LLMFactory
from app.shared.bootstrap import cleanup_runtime_dependencies, preload_runtime_dependencies
from app.shared.errors import VectorStoreSchemaError
# Keep module behavior explicit so the backend flow stays easy to audit.
@@ -24,12 +26,12 @@ async def lifespan(app: FastAPI):
logger.info(f"启动 {settings.app_name} v{settings.app_version}")
logger.info(f"调试模式: {settings.debug}")
logger.info("预加载LLM客户端...")
LLMFactory.preload_clients(["qwen", "deepseek"])
preload_runtime_dependencies()
yield
logger.info("应用关闭,执行清理...")
LLMFactory.cleanup()
cleanup_runtime_dependencies()
app = FastAPI(
@@ -55,16 +57,33 @@ app.add_middleware(
app.include_router(api_router, prefix="/api/v1")
@app.exception_handler(VectorStoreSchemaError)
async def vector_store_schema_exception_handler(request: Request, exc: VectorStoreSchemaError):
"""Return a stable JSON response for vector store schema/runtime errors."""
logger.error(f"向量库 schema 异常: {exc}")
return JSONResponse(
status_code=500,
content=jsonable_encoder(
ErrorResponse(
error="VectorStoreSchemaError",
message=str(exc),
)
),
)
@app.exception_handler(Exception)
async def global_exception_handler(request: Request, exc: Exception):
"""Global exception handler."""
logger.error(f"未处理的异常: {exc}")
return JSONResponse(
status_code=500,
content=ErrorResponse(
error="InternalServerError",
message=str(exc),
).model_dump(),
content=jsonable_encoder(
ErrorResponse(
error="InternalServerError",
message=str(exc),
)
),
)


@@ -6,6 +6,8 @@ from .documents import router as documents_router
from .knowledge import router as knowledge_router
from .agent import router as agent_router
from .status import router as status_router
from .perception import router as perception_router
from .rag import router as rag_router
# Keep package boundaries explicit so backend imports stay predictable.
@@ -18,6 +20,8 @@ api_router.include_router(knowledge_router)
api_router.include_router(agent_router)
api_router.include_router(compliance_router)
api_router.include_router(status_router)
api_router.include_router(perception_router)
api_router.include_router(rag_router)
__all__ = [
"api_router",
@@ -26,4 +30,6 @@ __all__ = [
"agent_router",
"compliance_router",
"status_router",
"perception_router",
"rag_router",
]


@@ -20,7 +20,7 @@ from app.api.models import (
)
from app.config.settings import settings
from app.shared.async_utils import iter_in_thread
from app.shared.bootstrap import get_agent_conversation_service, get_conversation_store
from app.shared.bootstrap import get_agent_conversation_service, get_agent_session_service
# Keep route handlers close to their transport-layer wiring for easier auditing.
@@ -65,7 +65,7 @@ async def chat_with_session(request: ChatRequest):
model=request.model or settings.llm_model,
top_k=request.top_k or settings.rag_top_k,
)
session = get_conversation_store().get_session(session_id)
session = get_agent_session_service().get_session(session_id)
return ChatResponse(
session_id=session_id,
answer=result.answer,
@@ -133,45 +133,52 @@ async def chat_stream(request: ChatRequest):
@router.get("/session/{session_id}", response_model=SessionInfo)
async def get_session_info(session_id: str):
"""Return session info."""
session = get_conversation_store().get_session(session_id)
if not session:
raise HTTPException(status_code=404, detail="会话不存在或已过期")
return SessionInfo(
session_id=session.session_id,
message_count=len(session.messages),
created_at=session.created_at,
updated_at=session.updated_at,
)
try:
session = get_agent_session_service().get_session(session_id)
return SessionInfo(
session_id=session.session_id,
message_count=len(session.messages),
created_at=session.created_at,
updated_at=session.updated_at,
)
except ValueError as exc:
raise HTTPException(status_code=404, detail=str(exc))
@router.get("/session/{session_id}/history")
async def get_session_history(session_id: str, max_turns: int = 5):
"""Return session history."""
session = get_conversation_store().get_session(session_id)
if not session:
raise HTTPException(status_code=404, detail="会话不存在或已过期")
history = [{"role": msg.role, "content": msg.content} for msg in session.messages[-(max_turns * 2):]]
return {"session_id": session_id, "history": history}
try:
history = get_agent_session_service().get_history(session_id=session_id, max_turns=max_turns)
return {"session_id": session_id, "history": history}
except ValueError as exc:
raise HTTPException(status_code=404, detail=str(exc))
@router.delete("/session/{session_id}")
async def delete_session(session_id: str):
"""Delete session."""
if not get_conversation_store().delete_session(session_id):
raise HTTPException(status_code=404, detail="会话不存在")
return {"message": "会话已删除", "session_id": session_id}
try:
get_agent_session_service().delete_session(session_id)
return {"message": "会话已删除", "session_id": session_id}
except ValueError as exc:
raise HTTPException(status_code=404, detail=str(exc))
@router.get("/sessions", response_model=List[SessionInfo])
async def list_sessions():
"""List sessions."""
return [SessionInfo(**item) for item in get_conversation_store().list_sessions()]
return [SessionInfo(**item) for item in get_agent_session_service().list_sessions()]
@router.post("/feedback")
async def submit_feedback(request: FeedbackRequest):
"""Submit feedback."""
session = get_conversation_store().get_session(request.session_id)
if not session:
raise HTTPException(status_code=404, detail="会话不存在")
return {"message": "反馈已提交", "session_id": request.session_id, "message_index": request.message_index}
try:
result = get_agent_session_service().submit_feedback(
session_id=request.session_id,
message_index=request.message_index,
)
return {"message": "反馈已提交", "session_id": result.session_id, "message_index": result.message_index}
except ValueError as exc:
raise HTTPException(status_code=404, detail=str(exc))


@@ -29,14 +29,19 @@ async def search_knowledge(request: SearchRequest):
results=[
SearchResultItem(
id=index + 1,
content=item.content,
content=item.text,
score=item.score,
metadata={
"doc_id": item.doc_id,
"doc_name": item.doc_name,
"doc_title": item.doc_title,
"chunk_id": item.chunk_id,
"chunk_type": item.chunk_type,
"section_title": item.section_title,
"page_number": item.page_number,
"page_start": item.page_start,
"page_end": item.page_end,
"section_level": item.section_level,
"chunk_index": item.chunk_index,
"piece_index": item.piece_index,
**item.metadata,
},
)


@@ -0,0 +1,67 @@
"""Define API routes for perception (regulatory intelligence)."""
from __future__ import annotations

import json

from fastapi import APIRouter, HTTPException, Query
from fastapi.responses import StreamingResponse

from app.shared.bootstrap import get_perception_service
from app.shared.async_utils import iter_in_thread

router = APIRouter(prefix="/perception", tags=["智能感知"])


@router.get("/stats")
async def get_perception_stats():
    """Return KPI statistics for the perception dashboard."""
    return get_perception_service().get_stats()


@router.get("/events")
async def list_events(
    source: str | None = Query(default=None, description="Source filter (MIIT/UN-ECE/ISO/国标委/EUR-Lex/IATF)"),
    impact_level: str | None = Query(default=None, description="Impact level (high/medium/low)"),
    limit: int = Query(default=50, ge=1, le=100),
):
    """Return regulatory events with optional filters."""
    events = get_perception_service().list_events(
        source=source,
        impact_level=impact_level,
        limit=limit,
    )
    return {"events": events, "total": len(events)}


@router.get("/events/{event_id}")
async def get_event(event_id: str):
    """Return a single regulatory event by ID."""
    event = get_perception_service().get_event(event_id)
    if event is None:
        raise HTTPException(status_code=404, detail=f"Event {event_id} not found")
    return event


@router.post("/events/{event_id}/analyze")
async def analyze_event(event_id: str):
    """Stream SSE impact analysis for a regulatory event."""
    service = get_perception_service()

    async def event_stream():
        async for item in iter_in_thread(service.analyze_event(event_id)):
            event_name = item.get("event", "message")
            data = item.get("data", "")
            # Structured payloads are serialized to JSON; plain strings pass through unchanged.
            if isinstance(data, (dict, list)):
                data = json.dumps(data, ensure_ascii=False)
            yield f"event: {event_name}\ndata: {data}\n\n"

    return StreamingResponse(
        event_stream(),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "X-Accel-Buffering": "no",
        },
    )


@@ -50,8 +50,8 @@ async def rag_chat(request: RagChatRequest):
{
"id": str(s.get("chunk_id") or s.get("doc_id") or idx + 1),
"score": s.get("score", 0),
"preview": s.get("content", "")[:200],
"doc_name": s.get("doc_name", ""),
"preview": s.get("text", s.get("content", ""))[:200],
"doc_name": s.get("doc_title", s.get("doc_name", "")),
"clause": s.get("section_title", "法规片段"),
"doc_id": s.get("doc_id"),
"download_url": (


@@ -1,7 +1,7 @@
"""Initialize the app.application.agent package."""
from .services import AgentConversationService
from .services import AgentConversationService, AgentSessionFeedbackResult, AgentSessionService
# Keep package boundaries explicit so backend imports stay predictable.
__all__ = ["AgentConversationService"]
__all__ = ["AgentConversationService", "AgentSessionFeedbackResult", "AgentSessionService"]


@@ -1,7 +1,8 @@
"""Implement application-layer logic for services."""
"""Implement application-layer logic for agent services."""
from __future__ import annotations
from dataclasses import dataclass
from typing import Generator
from app.domain.conversation import AnswerGenerator, AnswerResult, ConversationStore
@@ -143,3 +144,48 @@ class AgentConversationService:
)
return session.session_id, event_stream()
@dataclass
class AgentSessionFeedbackResult:
"""Represent the result of storing session feedback."""
session_id: str
message_index: int
class AgentSessionService:
"""Provide application-layer access to session management workflows."""
def __init__(self, *, conversation_store: ConversationStore) -> None:
"""Initialize the Agent Session Service instance."""
self.conversation_store = conversation_store
def get_session(self, session_id: str):
"""Return a session by id or raise when it does not exist."""
session = self.conversation_store.get_session(session_id)
if not session:
raise ValueError("会话不存在或已过期")
return session
def get_history(self, *, session_id: str, max_turns: int = 5) -> list[dict[str, str]]:
"""Return the recent conversation history for a session."""
session = self.get_session(session_id)
return [{"role": msg.role, "content": msg.content} for msg in session.messages[-(max_turns * 2):]]
def delete_session(self, session_id: str) -> None:
"""Delete a session or raise when it does not exist."""
if not self.conversation_store.delete_session(session_id):
raise ValueError("会话不存在")
def list_sessions(self) -> list[dict]:
"""Return the list of visible sessions."""
return self.conversation_store.list_sessions()
def submit_feedback(self, *, session_id: str, message_index: int) -> AgentSessionFeedbackResult:
"""Validate feedback targets and return a normalized feedback result."""
session = self.get_session(session_id)
if message_index < 0 or message_index >= len(session.messages):
raise ValueError("消息索引不存在")
# Preserve the existing API behavior until a persistent feedback store is introduced.
return AgentSessionFeedbackResult(session_id=session_id, message_index=message_index)


@@ -7,16 +7,22 @@ import tempfile
import uuid
import json
from dataclasses import dataclass
from datetime import UTC, datetime
from loguru import logger
from app.config.settings import settings
from app.domain.documents import (
ChunkBuilder,
Document,
DocumentArtifact,
DocumentBinaryStore,
DocumentParser,
DocumentProcessingRun,
DocumentProcessingStore,
DocumentRepository,
DocumentStatus,
DocumentStatusEvent,
ParseArtifactStore,
ParsedDocument,
)
@@ -39,6 +45,7 @@ class DocumentProcessResult:
class DocumentCommandService:
"""Provide the Document Command Service service."""
def __init__(
self,
*,
@@ -49,6 +56,7 @@ class DocumentCommandService:
embedding_provider: EmbeddingProvider,
vector_index: VectorIndex,
parse_artifact_store: ParseArtifactStore | None = None,
document_processing_store: DocumentProcessingStore | None = None,
) -> None:
"""Initialize the Document Command Service instance."""
self.document_repository = document_repository
@@ -58,6 +66,11 @@ class DocumentCommandService:
self.embedding_provider = embedding_provider
self.vector_index = vector_index
self.parse_artifact_store = parse_artifact_store
self.document_processing_store = document_processing_store
def _utcnow(self) -> datetime:
"""Return the current UTC timestamp for persisted processing metadata."""
return datetime.now(UTC)
def _save_parse_artifacts(self, *, doc_id: str, parsed_document: ParsedDocument) -> dict[str, str]:
"""Persist parse artifacts so troubleshooting does not depend on provider retention windows."""
@@ -80,6 +93,143 @@ class DocumentCommandService:
artifact_keys[name] = object_name
return artifact_keys
def _safe_create_processing_run(self, *, doc_id: str, trigger_type: str, generate_summary: bool) -> str | None:
"""Create a processing run record when the optional store is available."""
if not self.document_processing_store:
return None
run = DocumentProcessingRun(
run_id=str(uuid.uuid4()),
doc_id=doc_id,
trigger_type=trigger_type,
run_status="running",
parser_backend=settings.parser_backend,
chunk_backend=settings.chunk_backend,
embedding_model=settings.embedding_model,
metadata={"generate_summary": generate_summary},
)
try:
created = self.document_processing_store.create_run(run)
return created.run_id
except Exception:
logger.warning("DocumentProcessingStore.create_run failed for doc_id={}", doc_id)
return None
def _safe_append_status_event(
self,
*,
doc_id: str,
run_id: str | None,
from_status: str,
to_status: str,
stage: str,
message: str = "",
metadata: dict | None = None,
) -> None:
"""Append a status event without allowing auxiliary persistence failures to abort processing."""
if not self.document_processing_store or not run_id:
return
event = DocumentStatusEvent(
event_id=str(uuid.uuid4()),
doc_id=doc_id,
run_id=run_id,
from_status=from_status,
to_status=to_status,
stage=stage,
message=message,
metadata=metadata or {},
)
try:
self.document_processing_store.append_status_event(event)
except Exception:
logger.warning(
"DocumentProcessingStore.append_status_event failed for doc_id={}, run_id={}",
doc_id,
run_id,
)
def _safe_mark_run_stored(self, *, doc_id: str, run_id: str | None) -> None:
"""Mark the processing run as stored without affecting the main workflow."""
if not self.document_processing_store or not run_id:
return
try:
self.document_processing_store.mark_run_stored(run_id, stored_at=self._utcnow())
except Exception:
logger.warning("DocumentProcessingStore.mark_run_stored failed for doc_id={}, run_id={}", doc_id, run_id)
def _safe_mark_run_parsed(self, *, doc_id: str, run_id: str | None, parsed_document: ParsedDocument) -> None:
"""Persist parse completion details without failing the document pipeline."""
if not self.document_processing_store or not run_id:
return
try:
self.document_processing_store.mark_run_parsed(
run_id,
parser_backend=parsed_document.parser_name,
layout_count=int(parsed_document.metadata.get("layout_count", len(parsed_document.raw_layouts)) or 0),
structure_node_count=len(parsed_document.structure_nodes),
semantic_block_count=len(parsed_document.semantic_blocks),
vector_chunk_count=len(parsed_document.vector_chunks),
parsed_at=self._utcnow(),
metadata={"parse_task_id": parsed_document.metadata.get("task_id", "")},
)
except Exception:
logger.warning("DocumentProcessingStore.mark_run_parsed failed for doc_id={}, run_id={}", doc_id, run_id)
def _safe_replace_processing_artifacts(self, *, doc_id: str, run_id: str | None, artifact_keys: dict[str, str]) -> None:
"""Store artifact references without turning persistence drift into a user-visible failure."""
if not self.document_processing_store or not run_id:
return
artifacts = [
DocumentArtifact(
artifact_id=str(uuid.uuid4()),
doc_id=doc_id,
run_id=run_id,
artifact_type=artifact_type,
object_name=object_name,
content_type="application/json",
byte_size=0,
checksum="",
)
for artifact_type, object_name in artifact_keys.items()
]
try:
self.document_processing_store.replace_artifacts_for_run(run_id, artifacts)
except Exception:
logger.warning(
"DocumentProcessingStore.replace_artifacts_for_run failed for doc_id={}, run_id={}",
doc_id,
run_id,
)
def _safe_mark_run_indexed(self, *, doc_id: str, run_id: str | None, chunk_count: int, index_name: str) -> None:
"""Mark the processing run as indexed without affecting the success path."""
if not self.document_processing_store or not run_id:
return
now = self._utcnow()
try:
self.document_processing_store.mark_run_indexed(
run_id,
chunk_count=chunk_count,
index_name=index_name,
indexed_at=now,
finished_at=now,
)
except Exception:
logger.warning("DocumentProcessingStore.mark_run_indexed failed for doc_id={}, run_id={}", doc_id, run_id)
def _safe_mark_run_failed(self, *, doc_id: str, run_id: str | None, failure_stage: str, error_message: str) -> None:
"""Mark the processing run as failed without masking the original error handling path."""
if not self.document_processing_store or not run_id:
return
try:
self.document_processing_store.mark_run_failed(
run_id,
failure_stage=failure_stage,
error_message=error_message,
finished_at=self._utcnow(),
)
except Exception:
logger.warning("DocumentProcessingStore.mark_run_failed failed for doc_id={}, run_id={}", doc_id, run_id)
def upload_and_process(
self,
*,
@@ -91,11 +241,15 @@ class DocumentCommandService:
regulation_type: str,
version: str,
generate_summary: bool,
trigger_type: str = "upload",
) -> DocumentProcessResult:
"""Handle upload and process for the Document Command Service instance."""
doc_id = doc_id or str(uuid.uuid4())[:8]
final_doc_name = doc_name or file_name
object_name = f"{doc_id}/{file_name}"
run_id: str | None = None
current_status = DocumentStatus.PENDING
current_stage = "store"
document = Document(
doc_id=doc_id,
@@ -109,6 +263,19 @@ class DocumentCommandService:
metadata={"generate_summary": generate_summary},
)
self.document_repository.create(document)
run_id = self._safe_create_processing_run(
doc_id=doc_id,
trigger_type=trigger_type,
generate_summary=generate_summary,
)
self._safe_append_status_event(
doc_id=doc_id,
run_id=run_id,
from_status="",
to_status=DocumentStatus.PENDING.value,
stage="document_created",
message="Document record created",
)
temp_path = ""
try:
@@ -119,6 +286,17 @@ class DocumentCommandService:
metadata={"doc_id": doc_id},
)
self.document_repository.update_status(doc_id, DocumentStatus.STORED)
current_status = DocumentStatus.STORED
current_stage = "parse"
self._safe_mark_run_stored(doc_id=doc_id, run_id=run_id)
self._safe_append_status_event(
doc_id=doc_id,
run_id=run_id,
from_status=DocumentStatus.PENDING.value,
to_status=DocumentStatus.STORED.value,
stage="store",
message="Source file stored",
)
suffix = os.path.splitext(file_name)[1]
with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as temp_file:
@@ -130,7 +308,13 @@ class DocumentCommandService:
doc_id=doc_id,
doc_name=final_doc_name,
)
artifact_keys = self._save_parse_artifacts(doc_id=doc_id, parsed_document=parsed_document)
self._safe_mark_run_parsed(doc_id=doc_id, run_id=run_id, parsed_document=parsed_document)
artifact_keys: dict[str, str] = {}
try:
artifact_keys = self._save_parse_artifacts(doc_id=doc_id, parsed_document=parsed_document)
except Exception:
logger.warning("Parse artifact binary persistence failed for doc_id={}", doc_id)
self.document_repository.update_status(
doc_id,
DocumentStatus.PARSED,
@@ -146,6 +330,18 @@ class DocumentCommandService:
"processing_stage": "parsed",
},
)
current_status = DocumentStatus.PARSED
current_stage = "embed"
self._safe_replace_processing_artifacts(doc_id=doc_id, run_id=run_id, artifact_keys=artifact_keys)
self._safe_append_status_event(
doc_id=doc_id,
run_id=run_id,
from_status=DocumentStatus.STORED.value,
to_status=DocumentStatus.PARSED.value,
stage="parse",
message="Document parsed",
metadata={"artifact_count": len(artifact_keys)},
)
if self.parse_artifact_store:
try:
self.parse_artifact_store.save(
@@ -165,6 +361,7 @@ class DocumentCommandService:
raise ValueError("解析完成但没有生成可入库的 chunks")
vectors = self.embedding_provider.embed_texts([chunk.embedding_text for chunk in chunks])
current_stage = "index"
inserted = self.vector_index.upsert(chunks, vectors)
if inserted != len(chunks):
logger.warning("Milvus upsert count mismatched: inserted={}, chunks={}", inserted, len(chunks))
@@ -182,6 +379,23 @@ class DocumentCommandService:
"processing_stage": "indexed",
},
)
current_status = DocumentStatus.INDEXED
index_name = health.get("collection_name", "")
self._safe_mark_run_indexed(
doc_id=doc_id,
run_id=run_id,
chunk_count=len(chunks),
index_name=index_name,
)
self._safe_append_status_event(
doc_id=doc_id,
run_id=run_id,
from_status=DocumentStatus.PARSED.value,
to_status=DocumentStatus.INDEXED.value,
stage="index",
message="Document indexed",
metadata={"chunk_count": len(chunks), "index_name": index_name},
)
stored = self.document_repository.get(doc_id)
return DocumentProcessResult(
doc_id=doc_id,
@@ -194,6 +408,7 @@ class DocumentCommandService:
)
except Exception as exc:
logger.exception("文档处理失败: doc_id={}", doc_id)
failure_stage = current_stage
self.document_repository.update_status(
doc_id,
DocumentStatus.FAILED,
@@ -201,8 +416,23 @@ class DocumentCommandService:
metadata={
"failure_reason": str(exc),
"processing_stage": "failed",
"failure_stage": failure_stage,
},
)
self._safe_mark_run_failed(
doc_id=doc_id,
run_id=run_id,
failure_stage=failure_stage,
error_message=str(exc),
)
self._safe_append_status_event(
doc_id=doc_id,
run_id=run_id,
from_status=current_status.value,
to_status=DocumentStatus.FAILED.value,
stage=failure_stage,
message=str(exc),
)
return DocumentProcessResult(
doc_id=doc_id,
doc_name=final_doc_name,
@@ -235,6 +465,11 @@ class DocumentCommandService:
self.parse_artifact_store.delete(doc_id)
except Exception:
logger.warning("ParseArtifactStore delete failed for doc_id={}", doc_id)
if self.document_processing_store:
try:
self.document_processing_store.delete_by_document(doc_id)
except Exception:
logger.warning("DocumentProcessingStore delete failed for doc_id={}", doc_id)
self.document_repository.delete(doc_id)
return True
@@ -253,6 +488,7 @@ class DocumentCommandService:
regulation_type=document.regulation_type,
version=document.version,
generate_summary=bool(document.metadata.get("generate_summary", False)),
trigger_type="retry",
)
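
The hunks above combine two ideas: audit writes (`_safe_*`) never break the main path, and `current_stage` is advanced so that a failure can be attributed to the step that raised. A minimal sketch of that pattern follows. `best_effort`, `run_pipeline`, and the stage names are hypothetical, and the standard `logging` module stands in for the project's logger.

```python
import logging
from typing import Callable

logger = logging.getLogger("sketch")


def best_effort(action: Callable[[], None], *, label: str) -> None:
    """Run an audit-side write; log and swallow any failure."""
    try:
        action()
    except Exception:
        logger.warning("%s failed", label)


def run_pipeline(
    stages: list[tuple[str, Callable[[], None]]],
    audit: Callable[[str], None],
) -> dict:
    """Run named stages in order and report which stage failed, if any."""
    current_stage = stages[0][0] if stages else ""
    try:
        for name, step in stages:
            # Advance the marker before running the step so that an
            # exception is attributed to the step that raised it.
            current_stage = name
            step()
            # The audit write is best-effort and cannot fail the pipeline.
            best_effort(lambda n=name: audit(n), label=f"audit[{name}]")
    except Exception as exc:
        return {"status": "failed", "failure_stage": current_stage, "error": str(exc)}
    return {"status": "indexed", "failure_stage": "", "error": ""}
```

Here a broken `audit` callback only produces warnings, while an exception raised by a stage itself ends the run with that stage's name in `failure_stage`.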
@@ -272,7 +508,7 @@ class DocumentQueryService:
"""Return documents with real-time state from Milvus as the authoritative source.
Algorithm:
1. Query Milvus for all doc metadata (doc_id, doc_name, chunk_count, …).
1. Query Milvus for all doc metadata (doc_id, doc_title, chunk_count, …).
2. Load JSON/PG metadata records and index them by doc_id.
3. Merge: Milvus-present docs get status=INDEXED and live chunk_count;
metadata-only docs with status=INDEXED are demoted to FAILED.
@@ -300,8 +536,8 @@ class DocumentQueryService:
doc.chunk_count = row["chunk_count"]
doc.status = DocumentStatus.INDEXED
# Backfill fields that may be missing from older JSON records.
if not doc.doc_name and row.get("doc_name"):
doc.doc_name = row["doc_name"]
if not doc.doc_name and row.get("doc_title"):
doc.doc_name = row["doc_title"]
if not doc.regulation_type and row.get("regulation_type"):
doc.regulation_type = row["regulation_type"]
if not doc.version and row.get("version"):
@@ -317,8 +553,8 @@ class DocumentQueryService:
if doc_id not in meta_by_id:
synthetic = Document(
doc_id=doc_id,
doc_name=row.get("doc_name", doc_id),
file_name=row.get("doc_name", doc_id),
doc_name=row.get("doc_title", doc_id),
file_name=row.get("doc_title", doc_id),
object_name="",
content_type="",
size_bytes=0,
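
The merge described in the `DocumentQueryService` docstring can be sketched as follows. This is an illustration under assumptions, not the service's code: `DocRecord`, `merge_documents`, and the plain-string statuses are hypothetical stand-ins for `Document` and `DocumentStatus`, and index rows are assumed to be dicts carrying `doc_id`, `doc_title`, and `chunk_count`.

```python
from dataclasses import dataclass


@dataclass
class DocRecord:
    doc_id: str
    doc_name: str = ""
    status: str = "pending"
    chunk_count: int = 0


def merge_documents(index_rows: list[dict], records: list[DocRecord]) -> list[DocRecord]:
    """Merge vector-index rows (authoritative) with stored metadata records."""
    meta_by_id = {record.doc_id: record for record in records}
    seen: set[str] = set()
    for row in index_rows:
        doc_id = row["doc_id"]
        seen.add(doc_id)
        doc = meta_by_id.get(doc_id)
        if doc is None:
            # Present in the index but unknown to metadata: synthesize a record.
            doc = DocRecord(doc_id=doc_id, doc_name=row.get("doc_title", doc_id))
            meta_by_id[doc_id] = doc
        elif not doc.doc_name and row.get("doc_title"):
            # Backfill a field that is missing from an older metadata record.
            doc.doc_name = row["doc_title"]
        doc.status = "indexed"
        doc.chunk_count = row["chunk_count"]
    for doc in records:
        # Metadata claims INDEXED but the index has no rows: demote to FAILED.
        if doc.doc_id not in seen and doc.status == "indexed":
            doc.status = "failed"
    return list(meta_by_id.values())
```

Records in other states (for example `parsed`) are left untouched, since only a claimed `indexed` status contradicts the index.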


@@ -29,11 +29,16 @@ def _reciprocal_rank_fusion(
RetrievedChunk(
chunk_id=chunk_map[ck].chunk_id,
doc_id=chunk_map[ck].doc_id,
doc_name=chunk_map[ck].doc_name,
content=chunk_map[ck].content,
doc_title=chunk_map[ck].doc_title,
text=chunk_map[ck].text,
score=scores[ck],
chunk_type=chunk_map[ck].chunk_type,
section_title=chunk_map[ck].section_title,
page_number=chunk_map[ck].page_number,
page_start=chunk_map[ck].page_start,
page_end=chunk_map[ck].page_end,
section_level=chunk_map[ck].section_level,
chunk_index=chunk_map[ck].chunk_index,
piece_index=chunk_map[ck].piece_index,
metadata=chunk_map[ck].metadata,
)
for ck in sorted_keys
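
Only the tail of `_reciprocal_rank_fusion` is visible in this hunk. For orientation, a generic sketch of reciprocal rank fusion is given below. The function name, the use of plain string keys, and the constant `k = 60` (a conventional default) are assumptions and may differ from the project's implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked key lists into one list sorted by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, key in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every key it contains.
            scores[key] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The fused keys would then be mapped back to chunk objects, as the `RetrievedChunk(...)` construction above does with `chunk_map[ck]` and `scores[ck]`.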


@@ -0,0 +1 @@
"""Perception application package."""


@@ -0,0 +1,143 @@
"""Perception application service — event listing and streaming impact analysis."""
from __future__ import annotations
import json
from typing import Generator
from app.application.knowledge.services import KnowledgeRetrievalService
from app.infrastructure.perception.mock_event_store import MockEventStore
from app.services.llm.llm_factory import get_llm_client
from app.config.settings import settings
_ANALYSIS_SYSTEM_PROMPT = (
"你是汽车行业法规合规专家,专注于中国国家标准(GB)、国际法规(UN-ECE、ISO)"
"及欧盟法规(EUR-Lex)的解读与合规建议。回答需专业、简洁、结构清晰。"
)
class PerceptionService:
"""Orchestrate regulatory event queries and streaming impact analysis."""
def __init__(
self,
event_store: MockEventStore,
retrieval_service: KnowledgeRetrievalService,
) -> None:
self._store = event_store
self._retrieval = retrieval_service
# ------------------------------------------------------------------
# Queries
# ------------------------------------------------------------------
def list_events(
self,
*,
source: str | None = None,
impact_level: str | None = None,
limit: int = 50,
) -> list[dict]:
return self._store.filter(source=source, impact_level=impact_level, limit=limit)
def get_event(self, event_id: str) -> dict | None:
return self._store.get(event_id)
def get_stats(self) -> dict:
return self._store.stats()
# ------------------------------------------------------------------
# Streaming analysis
# ------------------------------------------------------------------
def analyze_event(self, event_id: str) -> Generator[dict, None, None]:
"""Yield SSE-ready dicts: sources → content chunks → done."""
event = self._store.get(event_id)
if not event:
yield {"event": "error", "data": f"事件 {event_id} 不存在"}
return
# --- 1. RAG retrieval: find related library documents ---
query = event["title"] + " " + " ".join(event["tags"])
chunks: list = []
affected_docs: list[dict] = []
try:
chunks = self._retrieval.retrieve(query=query, top_k=5)
seen: set[str] = set()
for chunk in chunks:
if chunk.doc_id not in seen:
seen.add(chunk.doc_id)
affected_docs.append(
{
"doc_id": chunk.doc_id,
"doc_title": chunk.doc_title,
"score": round(float(chunk.score), 4),
"snippet": (chunk.text or "")[:180],
"clause": getattr(chunk, "section_title", "") or "",
}
)
except Exception: # noqa: BLE001
# Retrieval is best-effort: on failure, continue with no sources so the analysis still streams.
pass
yield {"event": "sources", "data": json.dumps(affected_docs, ensure_ascii=False)}
# --- 2. Build context from retrieved chunks ---
context_parts = [
f"[文档{i}: {c.doc_title}]\n{(c.text or '')[:400]}"
for i, c in enumerate(chunks[:5], 1)
]
context = "\n\n".join(context_parts) if context_parts else "(知识库中暂无相关文档)"
# --- 3. Build prompt ---
effective = event.get("effective_at") or "待定"
user_content = f"""请对以下法规动态进行专业影响分析。
【法规动态】
标准编号:{event['standard_code']}
标题:{event['title']}
来源:{event['source_label']}
摘要:{event['summary']}
生效日期:{effective}
分类:{event['category']}
关键词:{', '.join(event['tags'])}
【知识库关联文档】
{context}
请用 Markdown 格式,从以下四个维度进行分析:
## 核心变化
列出本次法规更新最关键的 3-5 项变化(用 - 列表)
## 业务影响
分析对现有产品、认证流程、技术文档的具体影响
## 整改建议
给出优先级排序的行动清单(标注 🔴高 🟡中 🟢低 优先级)
## 时间节点
关键合规时间表与里程碑提醒"""
messages = [
{"role": "system", "content": _ANALYSIS_SYSTEM_PROMPT},
{"role": "user", "content": user_content},
]
# --- 4. Stream LLM response ---
try:
client = get_llm_client(
provider=settings.llm_provider,
model=settings.llm_model,
)
if hasattr(client, "stream_chat"):
for chunk in client.stream_chat(messages):
yield {"event": "content", "data": chunk}
else:
response = client.chat(messages)
yield {"event": "content", "data": response.content or ""}
except Exception as exc: # noqa: BLE001
yield {"event": "error", "data": str(exc)}
return
yield {"event": "done", "data": "{}"}
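
`analyze_event` yields dicts with `event` and `data` keys, so some layer has to serialize them into server-sent-event frames. That layer is not part of this diff. The sketch below shows one way it could look, and `to_sse_frames` is a hypothetical helper name.

```python
from typing import Iterable, Iterator


def to_sse_frames(events: Iterable[dict]) -> Iterator[str]:
    """Serialize {"event": ..., "data": ...} dicts into SSE wire frames."""
    for item in events:
        lines = [f"event: {item['event']}"]
        data = str(item.get("data", ""))
        # A payload containing newlines must be sent as one "data:" line per
        # text line; the client joins them back together with "\n".
        for data_line in data.split("\n"):
            lines.append(f"data: {data_line}")
        # A blank line terminates each frame.
        yield "\n".join(lines) + "\n\n"
```

Splitting on newlines matters here because streamed Markdown content chunks can contain line breaks.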


@@ -33,7 +33,7 @@ class Settings(BaseSettings):
# Keep configuration setup explicit so runtime behavior is easy to reason about.
milvus_host: str = Field(default="6.86.80.8", description="Milvus服务地址")
milvus_port: int = Field(default=19530, description="Milvus服务端口")
milvus_collection: str = Field(default="regulations_dense_1024_v1", description="法规向量集合名称")
milvus_collection: str = Field(default="regulations_dense_1024_v2", description="法规向量集合名称")
milvus_db_name: str = Field(default="default", description="Milvus数据库名称")
# Keep configuration setup explicit so runtime behavior is easy to reason about.
@@ -78,6 +78,7 @@ class Settings(BaseSettings):
chunk_overlap: int = Field(default=50, description="分块重叠大小")
max_file_size_mb: int = Field(default=100, description="最大文件大小(MB)")
document_metadata_path: str = Field(default="backend/data/documents.json", description="文档元数据存储路径")
document_processing_metadata_path: str = Field(default="backend/data/document_processing.json", description="文档处理历史存储路径")
parser_backend: str = Field(default="aliyun", description="解析后端(local/aliyun)")
chunk_backend: str = Field(default="aliyun", description="分块后端(local/aliyun)")
document_repository_backend: str = Field(default="json", description="文档元数据存储后端 (json/postgres)")


@@ -27,7 +27,7 @@ class Settings(BaseSettings):
# Milvus
milvus_host: str = "6.86.80.8"
milvus_port: int = 19530
milvus_collection: str = "regulations_dense_1024_v1"
milvus_collection: str = "regulations_dense_1024_v2"
# LLM / embedding defaults aligned with the migrated backend path.
llm_model: str = "qwen-max"
@@ -47,7 +47,7 @@ class Settings(BaseSettings):
api_port: int = 8000
# Legacy aliases retained for old utility modules.
regulations_collection: str = "regulations_dense_1024_v1"
regulations_collection: str = "regulations_dense_1024_v2"
compliance_collection: str = "compliance_cache"
# Preserve the legacy module API while keeping env resolution centralized at the repo root.


@@ -8,18 +8,91 @@ from typing import Any
@dataclass
@dataclass(init=False)
class AnswerSource:
"""Represent answer source data."""
"""Represent answer source data with legacy aliases."""
doc_id: str
doc_name: str
doc_title: str
chunk_id: str
chunk_type: str
section_title: str
page_number: int
page_start: int
page_end: int
section_level: int
chunk_index: int
piece_index: int
score: float
content: str
text: str
metadata: dict[str, Any] = field(default_factory=dict)
def __init__(
self,
*,
doc_id: str,
doc_title: str | None = None,
chunk_id: str,
chunk_type: str = "",
section_title: str = "",
page_start: int = 0,
page_end: int = 0,
section_level: int = 0,
chunk_index: int = 0,
piece_index: int = 0,
score: float = 0.0,
text: str | None = None,
metadata: dict[str, Any] | None = None,
doc_name: str | None = None,
content: str | None = None,
page_number: int | None = None,
**_: Any,
) -> None:
"""Initialize the answer source while accepting legacy field names."""
self.doc_id = doc_id
self.doc_title = doc_title if doc_title is not None else (doc_name or "")
self.chunk_id = chunk_id
self.chunk_type = chunk_type
self.section_title = section_title
self.page_start = int(page_start or page_number or 0)
self.page_end = int(page_end or self.page_start)
self.section_level = int(section_level or 0)
self.chunk_index = int(chunk_index or 0)
self.piece_index = int(piece_index or 0)
self.score = float(score)
self.text = text if text is not None else (content or "")
self.metadata = dict(metadata or {})
@property
def doc_name(self) -> str:
"""Return the legacy document name alias."""
return self.doc_title
@doc_name.setter
def doc_name(self, value: str) -> None:
"""Update the legacy document name alias."""
self.doc_title = value
@property
def content(self) -> str:
"""Return the legacy content alias."""
return self.text
@content.setter
def content(self, value: str) -> None:
"""Update the legacy content alias."""
self.text = value
@property
def page_number(self) -> int:
"""Return the legacy page number alias."""
return self.page_start
@page_number.setter
def page_number(self, value: int) -> None:
"""Update the legacy page number alias."""
self.page_start = value
self.page_end = max(self.page_end, value)
@dataclass
class ConversationMessage:
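
The `AnswerSource` change keeps old call sites working by accepting legacy keyword names in a hand-written `__init__` and exposing them afterwards as properties. A reduced sketch of that pattern, using a hypothetical `Source` class with only three of the fields:

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(init=False)
class Source:
    doc_title: str
    page_start: int
    page_end: int
    metadata: dict[str, Any] = field(default_factory=dict)

    def __init__(
        self,
        *,
        doc_title: Optional[str] = None,
        page_start: int = 0,
        page_end: int = 0,
        metadata: Optional[dict[str, Any]] = None,
        doc_name: Optional[str] = None,      # legacy alias of doc_title
        page_number: Optional[int] = None,   # legacy alias of page_start
        **_: Any,                            # unknown keywords are ignored
    ) -> None:
        # The new name wins whenever it is passed, even if it is empty.
        self.doc_title = doc_title if doc_title is not None else (doc_name or "")
        self.page_start = int(page_start or page_number or 0)
        self.page_end = int(page_end or self.page_start)
        self.metadata = dict(metadata or {})

    @property
    def doc_name(self) -> str:
        return self.doc_title

    @doc_name.setter
    def doc_name(self, value: str) -> None:
        self.doc_title = value
```

Because the aliases are properties rather than annotated fields, `dataclasses.asdict` and the generated `__repr__` and `__eq__` see only the new names. Any serializer that still needs `doc_name` has to add it explicitly.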


@@ -1,18 +1,29 @@
"""Initialize the app.domain.documents package."""
from .models import Chunk, Document, DocumentStatus, ParsedDocument
from .ports import ChunkBuilder, DocumentBinaryStore, DocumentParser, DocumentRepository, ParseArtifactStore
from .models import Chunk, Document, DocumentArtifact, DocumentProcessingRun, DocumentStatus, DocumentStatusEvent, ParsedDocument
from .ports import (
ChunkBuilder,
DocumentBinaryStore,
DocumentParser,
DocumentProcessingStore,
DocumentRepository,
ParseArtifactStore,
)
# Keep package boundaries explicit so backend imports stay predictable.
__all__ = [
"Chunk",
"Document",
"DocumentArtifact",
"DocumentProcessingRun",
"DocumentStatus",
"DocumentStatusEvent",
"ParsedDocument",
"ChunkBuilder",
"DocumentBinaryStore",
"DocumentParser",
"DocumentProcessingStore",
"DocumentRepository",
"ParseArtifactStore",
]


@@ -60,19 +60,171 @@ class ParsedDocument:
metadata: dict[str, Any] = field(default_factory=dict)
@dataclass
@dataclass(init=False)
class Chunk:
"""Represent the Chunk type."""
"""Represent one retrieval chunk with backward-compatible aliases."""
chunk_id: str
doc_id: str
doc_name: str
content: str
doc_title: str
text: str
embedding_text: str
chunk_type: str = ""
chunk_index: int = 0
piece_index: int = 0
page_start: int = 0
page_end: int = 0
section_title: str = ""
section_path: list[str] = field(default_factory=list)
page_number: int = 0
section_level: int = 0
source_ids: list[str] = field(default_factory=list)
regulation_type: str = ""
version: str = ""
semantic_id: str = ""
block_type: str = ""
metadata: dict[str, Any] = field(default_factory=dict)
def __init__(
self,
*,
chunk_id: str,
doc_id: str,
doc_title: str | None = None,
text: str | None = None,
embedding_text: str = "",
chunk_type: str = "",
chunk_index: int = 0,
piece_index: int = 0,
page_start: int = 0,
page_end: int = 0,
section_title: str = "",
section_path: list[str] | None = None,
section_level: int = 0,
source_ids: list[str] | None = None,
regulation_type: str = "",
version: str = "",
semantic_id: str = "",
metadata: dict[str, Any] | None = None,
doc_name: str | None = None,
content: str | None = None,
page_number: int | None = None,
block_type: str | None = None,
**_: Any,
) -> None:
"""Initialize the chunk while accepting legacy field names."""
self.chunk_id = chunk_id
self.doc_id = doc_id
self.doc_title = doc_title if doc_title is not None else (doc_name or "")
self.text = text if text is not None else (content or "")
self.embedding_text = embedding_text or self.text
self.chunk_type = chunk_type or (block_type or "")
self.chunk_index = int(chunk_index or 0)
self.piece_index = int(piece_index or 0)
self.page_start = int(page_start or page_number or 0)
self.page_end = int(page_end or self.page_start)
self.section_title = section_title
self.section_path = list(section_path or [])
self.section_level = int(section_level or 0)
self.source_ids = list(source_ids or [])
self.regulation_type = regulation_type
self.version = version
self.semantic_id = semantic_id
self.metadata = dict(metadata or {})
@property
def doc_name(self) -> str:
"""Return the legacy document name alias."""
return self.doc_title
@doc_name.setter
def doc_name(self, value: str) -> None:
"""Update the legacy document name alias."""
self.doc_title = value
@property
def content(self) -> str:
"""Return the legacy content alias."""
return self.text
@content.setter
def content(self, value: str) -> None:
"""Update the legacy content alias."""
self.text = value
@property
def page_number(self) -> int:
"""Return the legacy page number alias."""
return self.page_start
@page_number.setter
def page_number(self, value: int) -> None:
"""Update the legacy page number alias."""
self.page_start = value
self.page_end = max(self.page_end, value)
@property
def block_type(self) -> str:
"""Return the legacy block type alias."""
return self.chunk_type
@block_type.setter
def block_type(self, value: str) -> None:
"""Update the legacy block type alias."""
self.chunk_type = value
@dataclass
class DocumentProcessingRun:
"""Represent one processing attempt for a document."""
run_id: str
doc_id: str
trigger_type: str
run_status: str
parser_backend: str = ""
chunk_backend: str = ""
embedding_model: str = ""
index_name: str = ""
started_at: datetime = field(default_factory=utcnow)
stored_at: datetime | None = None
parsed_at: datetime | None = None
indexed_at: datetime | None = None
finished_at: datetime | None = None
layout_count: int = 0
structure_node_count: int = 0
semantic_block_count: int = 0
vector_chunk_count: int = 0
chunk_count: int = 0
failure_stage: str = ""
error_message: str = ""
metadata: dict[str, Any] = field(default_factory=dict)
@dataclass
class DocumentStatusEvent:
"""Represent a document lifecycle event emitted during processing."""
event_id: str
doc_id: str
run_id: str
from_status: str
to_status: str
stage: str
message: str = ""
metadata: dict[str, Any] = field(default_factory=dict)
occurred_at: datetime = field(default_factory=utcnow)
@dataclass
class DocumentArtifact:
"""Represent a persisted artifact reference for one processing run."""
artifact_id: str
doc_id: str
run_id: str
artifact_type: str
object_name: str
content_type: str
byte_size: int = 0
checksum: str = ""
metadata: dict[str, Any] = field(default_factory=dict)
created_at: datetime = field(default_factory=utcnow)


@@ -4,7 +4,7 @@ from __future__ import annotations
from abc import ABC, abstractmethod
from .models import Chunk, Document, DocumentStatus, ParsedDocument
from .models import Chunk, Document, DocumentArtifact, DocumentProcessingRun, DocumentStatus, DocumentStatusEvent, ParsedDocument
# Keep domain contracts explicit so adapters can swap implementations cleanly.
@@ -128,3 +128,111 @@ class ParseArtifactStore(ABC):
def get_structure_nodes(self, doc_id: str) -> list[dict]:
"""Return all structure nodes for a document."""
pass
class DocumentProcessingStore(ABC):
"""Persist document processing runs, events, and artifact references."""
@abstractmethod
def create_run(self, run: DocumentProcessingRun) -> DocumentProcessingRun:
"""Create a new processing run record."""
pass
@abstractmethod
def mark_run_stored(
self,
run_id: str,
*,
stored_at: object | None = None,
metadata: dict | None = None,
) -> DocumentProcessingRun | None:
"""Mark a run as having persisted the source file."""
pass
@abstractmethod
def mark_run_parsed(
self,
run_id: str,
*,
parser_backend: str,
layout_count: int,
structure_node_count: int,
semantic_block_count: int,
vector_chunk_count: int,
parsed_at: object | None = None,
metadata: dict | None = None,
) -> DocumentProcessingRun | None:
"""Record parse completion details for a run."""
pass
@abstractmethod
def mark_run_indexed(
self,
run_id: str,
*,
chunk_count: int,
index_name: str,
indexed_at: object | None = None,
finished_at: object | None = None,
metadata: dict | None = None,
) -> DocumentProcessingRun | None:
"""Mark a run as successfully indexed."""
pass
@abstractmethod
def mark_run_failed(
self,
run_id: str,
*,
failure_stage: str,
error_message: str,
finished_at: object | None = None,
metadata: dict | None = None,
) -> DocumentProcessingRun | None:
"""Mark a run as failed."""
pass
@abstractmethod
def append_status_event(self, event: DocumentStatusEvent) -> DocumentStatusEvent:
"""Append a document status event."""
pass
@abstractmethod
def replace_artifacts_for_run(self, run_id: str, artifacts: list[DocumentArtifact]) -> list[DocumentArtifact]:
"""Replace all artifacts for a run with the provided list."""
pass
@abstractmethod
def delete_by_document(self, doc_id: str) -> None:
"""Delete all processing data for a document."""
pass
@abstractmethod
def list_runs_by_document(self, doc_id: str) -> list[DocumentProcessingRun]:
"""List all processing runs for a document."""
pass
@abstractmethod
def get_run(self, run_id: str) -> DocumentProcessingRun | None:
"""Return one processing run by identifier."""
pass
@abstractmethod
def list_status_events_by_document(self, doc_id: str) -> list[DocumentStatusEvent]:
"""List status events for a document."""
pass
@abstractmethod
def list_status_events_by_run(self, run_id: str) -> list[DocumentStatusEvent]:
"""List status events for a run."""
pass
@abstractmethod
def list_artifacts_by_document(self, doc_id: str) -> list[DocumentArtifact]:
"""List artifact references for a document."""
pass
@abstractmethod
def list_artifacts_by_run(self, run_id: str) -> list[DocumentArtifact]:
"""List artifact references for a run."""
pass
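
To make the `DocumentProcessingStore` contract concrete, here is an in-memory sketch covering a subset of its methods. It is an assumption-laden illustration: `Run` and `StatusEvent` are trimmed stand-ins for `DocumentProcessingRun` and `DocumentStatusEvent`, the class does not subclass the real port, and the status strings `"running"` and `"failed"` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


def utcnow() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Run:
    run_id: str
    doc_id: str
    run_status: str = "running"
    failure_stage: str = ""
    error_message: str = ""
    finished_at: Optional[datetime] = None


@dataclass
class StatusEvent:
    event_id: str
    doc_id: str
    run_id: str
    from_status: str
    to_status: str
    stage: str


class InMemoryProcessingStore:
    """Keep runs and status events in process memory (illustration only)."""

    def __init__(self) -> None:
        self._runs: dict[str, Run] = {}
        self._events: list[StatusEvent] = []

    def create_run(self, run: Run) -> Run:
        self._runs[run.run_id] = run
        return run

    def get_run(self, run_id: str) -> Optional[Run]:
        return self._runs.get(run_id)

    def mark_run_failed(
        self,
        run_id: str,
        *,
        failure_stage: str,
        error_message: str,
        finished_at: Optional[datetime] = None,
    ) -> Optional[Run]:
        run = self._runs.get(run_id)
        if run is None:
            # Mirrors the port's "Run | None" return type for unknown runs.
            return None
        run.run_status = "failed"
        run.failure_stage = failure_stage
        run.error_message = error_message
        run.finished_at = finished_at or utcnow()
        return run

    def append_status_event(self, event: StatusEvent) -> StatusEvent:
        self._events.append(event)
        return event

    def list_status_events_by_run(self, run_id: str) -> list[StatusEvent]:
        return [event for event in self._events if event.run_id == run_id]

    def delete_by_document(self, doc_id: str) -> None:
        self._runs = {k: r for k, r in self._runs.items() if r.doc_id != doc_id}
        self._events = [e for e in self._events if e.doc_id != doc_id]
```

A store like this could back unit tests for the `_safe_*` helpers without a database.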


@@ -16,14 +16,88 @@ class RetrievalQuery:
filters: str | None = None
@dataclass
@dataclass(init=False)
class RetrievedChunk:
"""Represent the Retrieved Chunk type."""
"""Represent the retrieved chunk payload with legacy aliases."""
chunk_id: str
doc_id: str
doc_name: str
content: str
doc_title: str
text: str
score: float
chunk_type: str = ""
section_title: str = ""
page_number: int = 0
page_start: int = 0
page_end: int = 0
section_level: int = 0
chunk_index: int = 0
piece_index: int = 0
metadata: dict[str, Any] = field(default_factory=dict)
def __init__(
self,
*,
chunk_id: str,
doc_id: str,
doc_title: str | None = None,
text: str | None = None,
score: float = 0.0,
chunk_type: str = "",
section_title: str = "",
page_start: int = 0,
page_end: int = 0,
section_level: int = 0,
chunk_index: int = 0,
piece_index: int = 0,
metadata: dict[str, Any] | None = None,
doc_name: str | None = None,
content: str | None = None,
page_number: int | None = None,
block_type: str | None = None,
**_: Any,
) -> None:
"""Initialize the retrieved chunk while accepting legacy field names."""
self.chunk_id = chunk_id
self.doc_id = doc_id
self.doc_title = doc_title if doc_title is not None else (doc_name or "")
self.text = text if text is not None else (content or "")
self.score = float(score)
self.chunk_type = chunk_type or (block_type or "")
self.section_title = section_title
self.page_start = int(page_start or page_number or 0)
self.page_end = int(page_end or self.page_start)
self.section_level = int(section_level or 0)
self.chunk_index = int(chunk_index or 0)
self.piece_index = int(piece_index or 0)
self.metadata = dict(metadata or {})
@property
def doc_name(self) -> str:
"""Return the legacy document name alias."""
return self.doc_title
@doc_name.setter
def doc_name(self, value: str) -> None:
"""Update the legacy document name alias."""
self.doc_title = value
@property
def content(self) -> str:
"""Return the legacy content alias."""
return self.text
@content.setter
def content(self, value: str) -> None:
"""Update the legacy content alias."""
self.text = value
@property
def page_number(self) -> int:
"""Return the legacy page number alias."""
return self.page_start
@page_number.setter
def page_number(self, value: int) -> None:
"""Update the legacy page number alias."""
self.page_start = value
self.page_end = max(self.page_end, value)


@@ -45,10 +45,10 @@ class OpenAICompatibleAnswerGenerator(AnswerGenerator):
context_tokens = 0
for idx, chunk in enumerate(retrieved_chunks, start=1):
block = (
f"[{idx}] 文档: {chunk.doc_name}\n"
f"[{idx}] 文档: {chunk.doc_title}\n"
f"章节: {chunk.section_title or '未标注'}\n"
f"页码: {chunk.page_number}\n"
f"内容: {chunk.content}"
f"页码: {chunk.page_start}" + (f"-{chunk.page_end}" if chunk.page_end and chunk.page_end != chunk.page_start else "") + "\n"
f"内容: {chunk.text}"
)
block_tokens = self._estimate_tokens(block)
if context_tokens + block_tokens > settings.rag_max_context_tokens:
@@ -67,17 +67,37 @@ class OpenAICompatibleAnswerGenerator(AnswerGenerator):
)
return messages, context_tokens
def _is_context_truncated(self, *, retrieved_chunks: list[RetrievedChunk], context_tokens: int) -> bool:
"""Return whether the prompt context had to omit retrieved chunks to fit the token budget."""
if not retrieved_chunks:
return False
estimated_total_tokens = sum(
self._estimate_tokens(
f"[{idx}] 文档: {chunk.doc_title}\n"
f"章节: {chunk.section_title or '未标注'}\n"
f"页码: {chunk.page_start}" + (f"-{chunk.page_end}" if chunk.page_end and chunk.page_end != chunk.page_start else "") + "\n"
f"内容: {chunk.text}"
)
for idx, chunk in enumerate(retrieved_chunks, start=1)
)
return estimated_total_tokens > context_tokens
def _sources(self, chunks: list[RetrievedChunk]) -> list[AnswerSource]:
"""Convert retrieved chunks into answer source records."""
return [
AnswerSource(
doc_id=chunk.doc_id,
doc_name=chunk.doc_name,
doc_title=chunk.doc_title,
chunk_id=chunk.chunk_id,
chunk_type=chunk.chunk_type,
section_title=chunk.section_title,
page_number=chunk.page_number,
page_start=chunk.page_start,
page_end=chunk.page_end,
section_level=chunk.section_level,
chunk_index=chunk.chunk_index,
piece_index=chunk.piece_index,
score=chunk.score,
content=chunk.content,
text=chunk.text,
metadata=chunk.metadata,
)
for chunk in chunks
@@ -111,7 +131,10 @@ class OpenAICompatibleAnswerGenerator(AnswerGenerator):
latency_ms=latency_ms,
retrieved_count=len(retrieved_chunks),
context_tokens=context_tokens,
truncated=len(retrieved_chunks) > len(messages),
truncated=self._is_context_truncated(
retrieved_chunks=retrieved_chunks,
context_tokens=context_tokens,
),
error=response.error,
)
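
The old check, `len(retrieved_chunks) > len(messages)`, compared a chunk count with a message count, which says nothing about whether chunks were dropped. The new `_is_context_truncated` compares token totals instead. The sketch below illustrates that idea together with the page-range label. `estimate_tokens` is a hypothetical four-characters-per-token heuristic, and stopping at the first block that exceeds the budget is an assumption about the packing loop, which this diff does not show in full.

```python
def page_label(page_start: int, page_end: int) -> str:
    """Render "3" for a single page and "3-5" for a range."""
    if page_end and page_end != page_start:
        return f"{page_start}-{page_end}"
    return str(page_start)


def estimate_tokens(text: str) -> int:
    # Hypothetical heuristic; the real estimator is not shown in this diff.
    return max(1, len(text) // 4)


def pack_context(blocks: list[str], budget: int) -> tuple[list[str], int, bool]:
    """Keep blocks while they fit the budget; report tokens used and truncation."""
    kept: list[str] = []
    used = 0
    for block in blocks:
        cost = estimate_tokens(block)
        if used + cost > budget:
            break
        kept.append(block)
        used += cost
    total = sum(estimate_tokens(block) for block in blocks)
    # Truncated means the full set of blocks would cost more than what was used.
    return kept, used, total > used
```

With three 10-token blocks and a budget of 25, two blocks are kept and the result is flagged as truncated. With a budget of 30, all three fit and it is not.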


@@ -10,6 +10,7 @@ class LocalRegulationChunkBuilder(ChunkBuilder):
"""Adapt the existing markdown chunker to the new chunk builder port."""
def __init__(self, *, chunk_size: int = 512, chunk_overlap: int = 50) -> None:
"""Initialize the local markdown chunk builder."""
self.chunker = RegulationChunker(
chunk_size=chunk_size,
chunk_overlap=chunk_overlap,
@@ -22,6 +23,7 @@ class LocalRegulationChunkBuilder(ChunkBuilder):
regulation_type: str,
version: str,
) -> list[Chunk]:
"""Build migrated chunk objects from the legacy markdown chunker output."""
markdown_text = parsed_document.raw_text.strip()
if not markdown_text:
return []
@@ -50,16 +52,18 @@ class LocalRegulationChunkBuilder(ChunkBuilder):
Chunk(
chunk_id=item.metadata.chunk_id,
doc_id=parsed_document.doc_id,
doc_name=parsed_document.doc_name,
content=item.content,
doc_title=parsed_document.doc_name,
text=item.content,
embedding_text=item.content,
chunk_type="local_markdown_chunk",
section_title=item.metadata.section_title or item.metadata.section_number,
section_path=section_path,
page_number=item.metadata.page_number,
page_start=item.metadata.page_number,
page_end=item.metadata.page_number,
section_level=len(section_path),
regulation_type=regulation_type,
version=version,
semantic_id=item.metadata.clause_number,
block_type="local_markdown_chunk",
metadata=metadata,
)
)


@@ -19,29 +19,35 @@ class AliyunVectorChunkBuilder(ChunkBuilder):
"""Build Chunk objects from the parsed document's vector chunks."""
chunks: list[Chunk] = []
for index, item in enumerate(parsed_document.vector_chunks):
content = item.get("content") or item.get("text") or ""
embedding_text = item.get("embedding_text") or content
text = item.get("text") or ""
embedding_text = item.get("embedding_text") or text
if not embedding_text.strip():
continue
section_path = item.get("section_path") or []
section_title = item.get("section_title") or (section_path[-1] if section_path else "")
page_number = item.get("page_start") or item.get("page") or 0
chunk_id = item.get("chunk_id") or f"{parsed_document.doc_id}-chunk-{index}"
metadata = {k: v for k, v in item.items() if k not in {"content", "embedding_text"}}
metadata = dict(item)
metadata["regulation_type"] = regulation_type
metadata["version"] = version
chunks.append(
Chunk(
chunk_id=str(chunk_id),
doc_id=parsed_document.doc_id,
doc_name=parsed_document.doc_name,
content=content,
doc_title=str(item.get("doc_title") or parsed_document.doc_name),
text=text,
embedding_text=embedding_text,
chunk_type=str(item.get("chunk_type", item.get("block_type", ""))),
chunk_index=int(item.get("chunk_index") or 0),
piece_index=int(item.get("piece_index") or 0),
page_start=int(item.get("page_start") or 0),
page_end=int(item.get("page_end") or 0),
section_title=section_title,
section_path=section_path,
page_number=int(page_number or 0),
section_level=int(item.get("section_level") or len(section_path)),
source_ids=[str(v) for v in item.get("source_ids", [])],
regulation_type=regulation_type,
version=version,
semantic_id=item.get("semantic_id", ""),
block_type=item.get("block_type", ""),
metadata=metadata,
)
)


@@ -0,0 +1 @@
"""Perception infrastructure package."""


@@ -0,0 +1,421 @@
"""Mock regulatory event store with 20 high-quality pre-seeded events."""
from __future__ import annotations
from typing import Any
MOCK_EVENTS: list[dict[str, Any]] = [
# ------------------------------------------------------------------ HIGH
{
"id": "evt-001",
"source": "MIIT",
"source_label": "工业和信息化部",
"standard_code": "GB 18384-2025",
"title": "《电动汽车安全要求》国家标准第三版正式发布",
"summary": (
"新增 IP67 级别高压系统密封防护要求;热失控预警响应时间压缩至 5 分钟;"
"调整碰撞安全测试工况,新增侧柱碰工况。本标准于 2026 年 7 月 1 日强制实施。"
),
"impact_level": "high",
"published_at": "2025-11-15",
"effective_at": "2026-07-01",
"category": "电动汽车安全",
"tags": ["电池安全", "高压防护", "碰撞安全", "热失控"],
"source_url": "https://openstd.samr.gov.cn",
"status": "enacted",
},
{
"id": "evt-002",
"source": "UN-ECE",
"source_label": "联合国欧洲经委会",
"standard_code": "UN R155 Amendment 3",
"title": "UN-ECE R155 网络安全法规第三次修订正式生效",
"summary": (
"新增对 OTA(空中升级)全生命周期的安全审计要求;强化车辆 TARA"
"(威胁分析与风险评估)文档化义务;扩展 CSMS 监控范围至售后服务商。"
),
"impact_level": "high",
"published_at": "2026-01-20",
"effective_at": "2026-07-01",
"category": "网络安全",
"tags": ["OTA", "网络安全", "CSMS", "TARA", "R155"],
"source_url": "https://unece.org/transport/vehicle-regulations",
"status": "enacted",
},
{
"id": "evt-003",
"source": "MIIT",
"source_label": "工业和信息化部",
"standard_code": "GB/T 40429-2026征求意见稿",
"title": "《汽车整车信息安全技术要求》修订征求意见",
"summary": (
"增加基于人工智能的异常行为检测要求;新增车云通信双向认证机制规范;"
"提出数据最小化原则在车辆 OBD 数据收集中的应用细则。"
),
"impact_level": "high",
"published_at": "2026-03-05",
"effective_at": None,
"category": "信息安全",
"tags": ["信息安全", "数据安全", "AI检测", "OBD"],
"source_url": "https://www.miit.gov.cn/",
"status": "draft",
},
{
"id": "evt-004",
"source": "MIIT",
"source_label": "工业和信息化部",
"standard_code": "NEV 双积分 2026",
"title": "2026 年度新能源汽车双积分管理办法年度调整",
"summary": (
"纯电动乘用车标准车型积分(CAFC)基准值上调 8%;"
"提高 A 级及以上续航里程门槛;新增氢燃料电池商用车积分计算细则。"
),
"impact_level": "high",
"published_at": "2026-02-28",
"effective_at": "2026-04-01",
"category": "新能源政策",
"tags": ["双积分", "纯电动", "燃料电池", "碳配额"],
"source_url": "https://www.miit.gov.cn/",
"status": "enacted",
},
{
"id": "evt-017",
"source": "MIIT",
"source_label": "工业和信息化部",
"standard_code": "智能网联汽车准入管理办法实施细则",
"title": "智能网联汽车准入管理实施细则正式落地",
"summary": (
"明确 L3 及以上自动驾驶功能的准入申报路径;"
"要求 OEM 建立数据安全管理体系并完成等保 2.0 三级认证;"
"道路测试数据留存期延长至 3 年。"
),
"impact_level": "high",
"published_at": "2026-03-01",
"effective_at": "2026-09-01",
"category": "智能网联",
"tags": ["智能网联", "L3自动驾驶", "准入管理", "数据留存"],
"source_url": "https://www.miit.gov.cn/",
"status": "enacted",
},
{
"id": "evt-018",
"source": "EUR-Lex",
"source_label": "欧盟官方公报",
"standard_code": "EU Cyber Resilience Act (CRA)",
"title": "《欧盟网络韧性法案》核心条款对车联网设备生效",
"summary": (
"联网汽车 ECU 须满足 CRA「重要类 II」安全要求;"
"强制 SBOM(软件物料清单)公开披露;"
"OEM 须提供至少 10 年的漏洞修复支持承诺。"
),
"impact_level": "high",
"published_at": "2026-02-15",
"effective_at": "2027-01-01",
"category": "网络安全",
"tags": ["CRA", "SBOM", "漏洞管理", "网络韧性"],
"source_url": "https://eur-lex.europa.eu",
"status": "enacted",
},
# --------------------------------------------------------------- MEDIUM
{
"id": "evt-005",
"source": "UN-ECE",
"source_label": "联合国欧洲经委会",
"standard_code": "UN R156 Amendment 2",
"title": "UN-ECE R156 软件升级与 SUMS 法规补充修订",
"summary": (
"明确 SUMS(软件更新管理系统)对 ECU 版本追溯的最低保留年限为 15 年;"
"新增售后 OTA 推送的用户知情同意要求规范。"
),
"impact_level": "medium",
"published_at": "2026-01-10",
"effective_at": "2026-07-01",
"category": "软件升级",
"tags": ["OTA", "SUMS", "软件版本", "R156"],
"source_url": "https://unece.org/transport/vehicle-regulations",
"status": "enacted",
},
{
"id": "evt-006",
"source": "国标委",
"source_label": "国家标准化管理委员会",
"standard_code": "GB/T 35273-2026",
"title": "《信息安全技术 个人信息安全规范》更新版发布",
"summary": (
"将车内人脸识别、声纹采集列为敏感个人信息;"
"补充自动驾驶场景下乘员行为数据的去标识化技术规范;"
"强化数据出境安全评估触发阈值。"
),
"impact_level": "medium",
"published_at": "2025-12-01",
"effective_at": "2026-06-01",
"category": "数据安全",
"tags": ["个人信息", "PIPL", "数据安全", "生物识别"],
"source_url": "https://openstd.samr.gov.cn",
"status": "enacted",
},
{
"id": "evt-007",
"source": "EUR-Lex",
"source_label": "欧盟官方公报",
"standard_code": "EU AI Act — Art.13 & Art.14",
"title": "《欧盟人工智能法案》第13-14条透明度与人工监督条款正式生效",
"summary": (
"要求在汽车 ADAS 系统中植入 AI 使用记录日志;"
"驾驶员监控 AI 系统须披露决策逻辑;"
"高风险 AI 系统需提供人工干预接口。"
),
"impact_level": "medium",
"published_at": "2026-02-01",
"effective_at": "2026-08-01",
"category": "AI 法规",
"tags": ["AI法案", "透明度", "ADAS", "高风险AI"],
"source_url": "https://eur-lex.europa.eu",
"status": "enacted",
},
{
"id": "evt-008",
"source": "ISO",
"source_label": "国际标准化组织",
"standard_code": "ISO 45001:2025 Amd.1",
"title": "ISO 45001 职业健康安全管理体系第一次修正",
"summary": (
"新增心理健康风险纳入 OHS 危害辨识范围;"
"明确远程办公人员安全管理职责;"
"更新绩效评价指标体系,新增事故未遂事件统计要求。"
),
"impact_level": "medium",
"published_at": "2025-10-20",
"effective_at": "2026-01-01",
"category": "EHS 管理",
"tags": ["ISO 45001", "EHS", "职业健康", "安全管理"],
"source_url": "https://www.iso.org",
"status": "enacted",
},
{
"id": "evt-009",
"source": "MIIT",
"source_label": "工业和信息化部",
"standard_code": "GB/T 28001-2026(征求意见)",
"title": "《汽车产品安全召回管理规程》修订征求意见",
"summary": (
"扩展召回触发条件,将 OTA 推送导致的功能异常纳入强制报告范围;"
"缩短重大安全隐患召回启动时限至 15 个工作日。"
),
"impact_level": "medium",
"published_at": "2026-03-15",
"effective_at": None,
"category": "召回管理",
"tags": ["召回", "OTA", "安全隐患", "产品安全"],
"source_url": "https://www.miit.gov.cn/",
"status": "draft",
},
{
"id": "evt-010",
"source": "国标委",
"source_label": "国家标准化管理委员会",
"standard_code": "GB 38031-2025",
"title": "《电动汽车用动力蓄电池安全要求》修订版发布",
"summary": (
"新增电池系统针刺、浸水、挤压等极端工况测试程序;"
"热扩散防护等级要求升级;"
"强化 BMS(电池管理系统)状态监测数据记录要求。"
),
"impact_level": "medium",
"published_at": "2025-09-15",
"effective_at": "2026-03-01",
"category": "电池安全",
"tags": ["动力电池", "BMS", "热扩散", "安全测试"],
"source_url": "https://openstd.samr.gov.cn",
"status": "enacted",
},
{
"id": "evt-016",
"source": "UN-ECE",
"source_label": "联合国欧洲经委会",
"standard_code": "UN R100 Rev.4(草案)",
"title": "UN R100 电动汽车安全认证法规第四次修订草案发布",
"summary": (
"拟对 400V 以上高压系统的绝缘电阻监测提出实时 CAN 总线传输要求;"
"新增极低温工况(-40°C)的电池性能验证程序。"
),
"impact_level": "medium",
"published_at": "2026-04-08",
"effective_at": None,
"category": "电动汽车安全",
"tags": ["R100", "高压安全", "绝缘监测", "低温性能"],
"source_url": "https://unece.org/transport/vehicle-regulations",
"status": "draft",
},
{
"id": "evt-019",
"source": "ISO",
"source_label": "国际标准化组织",
"standard_code": "ISO/SAE 21434:2026 Amd.1",
"title": "ISO/SAE 21434 汽车网络安全工程第一次修正",
"summary": (
"将 AI 推理组件纳入汽车网络安全工程范围;"
"补充端到端加密通信在 V2X 场景中的 TARA 建模要求;"
"新增第三方 ECU 供应商 CSMS 审计方法。"
),
"impact_level": "medium",
"published_at": "2026-04-10",
"effective_at": "2026-10-01",
"category": "网络安全",
"tags": ["ISO 21434", "网络安全", "V2X", "AI安全"],
"source_url": "https://www.iso.org",
"status": "enacted",
},
# ------------------------------------------------------------------ LOW
{
"id": "evt-011",
"source": "ISO",
"source_label": "国际标准化组织",
"standard_code": "ISO 26262:2026 Ed.3(征求意见)",
"title": "ISO 26262 功能安全第三版征求意见启动",
"summary": (
"拟新增对 AI/ML 组件功能安全验证方法的指导附录;"
"讨论 SOTIF(预期功能安全)与 ISO 26262 的协调融合路径。"
),
"impact_level": "low",
"published_at": "2026-04-01",
"effective_at": None,
"category": "功能安全",
"tags": ["功能安全", "ASIL", "AI安全", "SOTIF"],
"source_url": "https://www.iso.org",
"status": "consultation",
},
{
"id": "evt-012",
"source": "EUR-Lex",
"source_label": "欧盟官方公报",
"standard_code": "REACH Regulation Update 2026",
"title": "欧盟 REACH 法规限制物质清单更新(第 22 批)",
"summary": (
"新增 3 种 SVHCs(高度关注物质),包括特定阻燃剂和密封材料成分;"
"汽车零部件豁免条款调整,影响部分内饰材料供应商。"
),
"impact_level": "low",
"published_at": "2026-01-30",
"effective_at": "2026-09-01",
"category": "环保法规",
"tags": ["REACH", "SVHCs", "环保", "化学品管理"],
"source_url": "https://eur-lex.europa.eu",
"status": "enacted",
},
{
"id": "evt-013",
"source": "MIIT",
"source_label": "工业和信息化部",
"standard_code": "CCER 汽车碳配额 2026",
"title": "自愿减排(CCER)汽车行业核算方法学更新",
"summary": (
"更新纯电动汽车全生命周期碳排放核算边界;"
"新增动力电池回收环节碳减排量认定方法;"
"与全国碳市场对接的企业碳账户数据接口规范发布。"
),
"impact_level": "low",
"published_at": "2026-02-10",
"effective_at": "2026-06-01",
"category": "碳排放",
"tags": ["CCER", "碳排放", "碳中和", "碳核算"],
"source_url": "https://www.miit.gov.cn/",
"status": "enacted",
},
{
"id": "evt-014",
"source": "IATF",
"source_label": "国际汽车工作组",
"standard_code": "IATF 16949:2025 CSR 通告",
"title": "IATF 16949 质量管理体系客户特殊要求更新通告",
"summary": (
"多家主机厂(OEM)同步更新 CSR,涵盖软件定义汽车(SDV)"
"场景下的质量过程管控;电子电气 BOM 变更管理流程补充规范。"
),
"impact_level": "low",
"published_at": "2026-03-20",
"effective_at": "2026-07-01",
"category": "质量管理",
"tags": ["IATF 16949", "质量管理", "SDV", "CSR"],
"source_url": "https://www.iatfglobaloversight.org",
"status": "enacted",
},
{
"id": "evt-015",
"source": "国标委",
"source_label": "国家标准化管理委员会",
"standard_code": "GB 7258-2025 勘误",
"title": "《机动车运行安全技术条件》年度勘误发布",
"summary": (
"更正第 12 章灯光系统技术要求中的参数引用错误;"
"澄清前雾灯安装位置尺寸定义;此次为勘误性修订,不影响已认证车型。"
),
"impact_level": "low",
"published_at": "2026-01-05",
"effective_at": "2026-01-05",
"category": "运行安全",
"tags": ["GB 7258", "灯光", "运行安全", "勘误"],
"source_url": "https://openstd.samr.gov.cn",
"status": "enacted",
},
{
"id": "evt-020",
"source": "国标委",
"source_label": "国家标准化管理委员会",
"standard_code": "GB/T 27930-2026",
"title": "《电动汽车非车载传导式充电通信协议》更新版发布",
"summary": (
"兼容 CHAdeMO 4.0 与 CCS2 双协议栈;"
"新增大功率充电(>350kW)通信握手流程;"
"强化充电过程 BMS 实时诊断数据上报规范。"
),
"impact_level": "low",
"published_at": "2026-03-25",
"effective_at": "2026-12-01",
"category": "充电标准",
"tags": ["充电协议", "BMS", "大功率充电", "CHAdeMO"],
"source_url": "https://openstd.samr.gov.cn",
"status": "enacted",
},
]
# Index for fast lookup
_EVENT_INDEX: dict[str, dict] = {e["id"]: e for e in MOCK_EVENTS}


class MockEventStore:
    """In-memory mock store for regulatory events."""

    def all(self) -> list[dict]:
        return list(MOCK_EVENTS)

    def get(self, event_id: str) -> dict | None:
        return _EVENT_INDEX.get(event_id)

    def filter(
        self,
        *,
        source: str | None = None,
        impact_level: str | None = None,
        limit: int = 50,
    ) -> list[dict]:
        events = list(MOCK_EVENTS)
        if source:
            events = [e for e in events if e["source"] == source]
        if impact_level:
            events = [e for e in events if e["impact_level"] == impact_level]
        events.sort(key=lambda e: e["published_at"], reverse=True)
        return events[:limit]

    def stats(self) -> dict:
        from datetime import date, timedelta

        events = MOCK_EVENTS
        cutoff = (date.today() - timedelta(days=90)).isoformat()
        return {
            "total": len(events),
            "high_impact": sum(1 for e in events if e["impact_level"] == "high"),
            "medium_impact": sum(1 for e in events if e["impact_level"] == "medium"),
            "low_impact": sum(1 for e in events if e["impact_level"] == "low"),
            "recent_90d": sum(1 for e in events if e["published_at"] >= cutoff),
        }
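The `filter` and `stats` methods above compare `published_at` values as plain strings. This works because zero-padded ISO 8601 dates (`YYYY-MM-DD`) sort lexicographically in chronological order. A minimal sketch of that property follows; the three-event sample and the helper names `newest_first` and `count_recent` are hypothetical and are not part of the committed module.

```python
from datetime import date, timedelta

# Hypothetical sample; field names mirror the mock events above.
SAMPLE_EVENTS = [
    {"id": "a", "published_at": "2025-11-15", "impact_level": "high"},
    {"id": "b", "published_at": "2026-03-05", "impact_level": "high"},
    {"id": "c", "published_at": "2026-01-20", "impact_level": "low"},
]


def newest_first(events: list[dict]) -> list[dict]:
    """Sort by ISO date string; lexicographic order equals chronological order."""
    return sorted(events, key=lambda e: e["published_at"], reverse=True)


def count_recent(events: list[dict], today: date, days: int = 90) -> int:
    """Count events published on or after `today - days`, comparing ISO strings."""
    cutoff = (today - timedelta(days=days)).isoformat()
    return sum(1 for e in events if e["published_at"] >= cutoff)


ordered_ids = [e["id"] for e in newest_first(SAMPLE_EVENTS)]
recent = count_recent(SAMPLE_EVENTS, today=date(2026, 4, 1))
```

Passing `today` explicitly keeps the calculation deterministic in tests, whereas the store's `stats` reads `date.today()`. The string comparison holds only while every `published_at` value stays a zero-padded `YYYY-MM-DD` string.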


@@ -0,0 +1,373 @@
"""Implement infrastructure support for JSON document processing history."""
from __future__ import annotations

import json
from datetime import UTC, datetime
from pathlib import Path
from typing import Any

from app.domain.documents import DocumentArtifact, DocumentProcessingRun, DocumentProcessingStore, DocumentStatusEvent


# Keep JSON persistence behavior aligned with the lightweight document repository adapter.
class JsonDocumentProcessingStore(DocumentProcessingStore):
    """Persist processing history in a standalone JSON file."""

    def __init__(self, file_path: str) -> None:
        """Initialize the JSON processing history store."""
        self.file_path = Path(file_path)
        self.file_path.parent.mkdir(parents=True, exist_ok=True)
        if not self.file_path.exists():
            self._save(self._empty_payload())

    def _empty_payload(self) -> dict[str, dict[str, dict[str, Any]]]:
        """Return the canonical empty JSON structure for processing history."""
        return {"runs": {}, "status_events": {}, "artifacts": {}}

    def _load(self) -> dict[str, dict[str, dict[str, Any]]]:
        """Load the full JSON payload and normalize missing sections."""
        if not self.file_path.exists():
            return self._empty_payload()
        payload = json.loads(self.file_path.read_text(encoding="utf-8") or "{}")
        normalized = self._empty_payload()
        for key in normalized:
            section = payload.get(key, {})
            normalized[key] = section if isinstance(section, dict) else {}
        return normalized

    def _save(self, payload: dict[str, dict[str, dict[str, Any]]]) -> None:
        """Persist the full JSON payload with stable formatting."""
        self.file_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")

    def _serialize_datetime(self, value: datetime | None) -> str | None:
        """Serialize optional datetimes into ISO 8601 strings."""
        return value.isoformat() if value is not None else None

    def _deserialize_datetime(self, value: str | None) -> datetime | None:
        """Deserialize optional ISO 8601 strings into datetimes."""
        return datetime.fromisoformat(value) if value else None
    def _serialize_run(self, run: DocumentProcessingRun) -> dict[str, Any]:
        """Serialize one processing run to a JSON-compatible payload."""
        return {
            "run_id": run.run_id,
            "doc_id": run.doc_id,
            "trigger_type": run.trigger_type,
            "run_status": run.run_status,
            "parser_backend": run.parser_backend,
            "chunk_backend": run.chunk_backend,
            "embedding_model": run.embedding_model,
            "index_name": run.index_name,
            "started_at": self._serialize_datetime(run.started_at),
            "stored_at": self._serialize_datetime(run.stored_at),
            "parsed_at": self._serialize_datetime(run.parsed_at),
            "indexed_at": self._serialize_datetime(run.indexed_at),
            "finished_at": self._serialize_datetime(run.finished_at),
            "layout_count": run.layout_count,
            "structure_node_count": run.structure_node_count,
            "semantic_block_count": run.semantic_block_count,
            "vector_chunk_count": run.vector_chunk_count,
            "chunk_count": run.chunk_count,
            "failure_stage": run.failure_stage,
            "error_message": run.error_message,
            "metadata": run.metadata,
        }

    def _deserialize_run(self, payload: dict[str, Any]) -> DocumentProcessingRun:
        """Deserialize one JSON payload into a processing run dataclass."""
        return DocumentProcessingRun(
            run_id=payload["run_id"],
            doc_id=payload["doc_id"],
            trigger_type=payload["trigger_type"],
            run_status=payload["run_status"],
            parser_backend=payload.get("parser_backend", ""),
            chunk_backend=payload.get("chunk_backend", ""),
            embedding_model=payload.get("embedding_model", ""),
            index_name=payload.get("index_name", ""),
            started_at=self._deserialize_datetime(payload.get("started_at")) or datetime.now(UTC),
            stored_at=self._deserialize_datetime(payload.get("stored_at")),
            parsed_at=self._deserialize_datetime(payload.get("parsed_at")),
            indexed_at=self._deserialize_datetime(payload.get("indexed_at")),
            finished_at=self._deserialize_datetime(payload.get("finished_at")),
            layout_count=int(payload.get("layout_count", 0) or 0),
            structure_node_count=int(payload.get("structure_node_count", 0) or 0),
            semantic_block_count=int(payload.get("semantic_block_count", 0) or 0),
            vector_chunk_count=int(payload.get("vector_chunk_count", 0) or 0),
            chunk_count=int(payload.get("chunk_count", 0) or 0),
            failure_stage=payload.get("failure_stage", ""),
            error_message=payload.get("error_message", ""),
            metadata=payload.get("metadata", {}),
        )

    def _serialize_event(self, event: DocumentStatusEvent) -> dict[str, Any]:
        """Serialize one status event to a JSON-compatible payload."""
        return {
            "event_id": event.event_id,
            "doc_id": event.doc_id,
            "run_id": event.run_id,
            "from_status": event.from_status,
            "to_status": event.to_status,
            "stage": event.stage,
            "message": event.message,
            "metadata": event.metadata,
            "occurred_at": self._serialize_datetime(event.occurred_at),
        }

    def _deserialize_event(self, payload: dict[str, Any]) -> DocumentStatusEvent:
        """Deserialize one JSON payload into a status event dataclass."""
        return DocumentStatusEvent(
            event_id=payload["event_id"],
            doc_id=payload["doc_id"],
            run_id=payload["run_id"],
            from_status=payload.get("from_status", ""),
            to_status=payload["to_status"],
            stage=payload.get("stage", ""),
            message=payload.get("message", ""),
            metadata=payload.get("metadata", {}),
            occurred_at=self._deserialize_datetime(payload.get("occurred_at")) or datetime.now(UTC),
        )

    def _serialize_artifact(self, artifact: DocumentArtifact) -> dict[str, Any]:
        """Serialize one artifact reference to a JSON-compatible payload."""
        return {
            "artifact_id": artifact.artifact_id,
            "doc_id": artifact.doc_id,
            "run_id": artifact.run_id,
            "artifact_type": artifact.artifact_type,
            "object_name": artifact.object_name,
            "content_type": artifact.content_type,
            "byte_size": artifact.byte_size,
            "checksum": artifact.checksum,
            "metadata": artifact.metadata,
            "created_at": self._serialize_datetime(artifact.created_at),
        }

    def _deserialize_artifact(self, payload: dict[str, Any]) -> DocumentArtifact:
        """Deserialize one JSON payload into an artifact dataclass."""
        return DocumentArtifact(
            artifact_id=payload["artifact_id"],
            doc_id=payload["doc_id"],
            run_id=payload["run_id"],
            artifact_type=payload["artifact_type"],
            object_name=payload["object_name"],
            content_type=payload.get("content_type", ""),
            byte_size=int(payload.get("byte_size", 0) or 0),
            checksum=payload.get("checksum", ""),
            metadata=payload.get("metadata", {}),
            created_at=self._deserialize_datetime(payload.get("created_at")) or datetime.now(UTC),
        )
    def _merge_metadata(self, original: dict[str, Any], update: dict | None) -> dict[str, Any]:
        """Merge metadata updates onto an existing payload."""
        merged = dict(original)
        if update:
            merged.update(update)
        return merged

    def create_run(self, run: DocumentProcessingRun) -> DocumentProcessingRun:
        """Create a new processing run record."""
        payload = self._load()
        payload["runs"][run.run_id] = self._serialize_run(run)
        self._save(payload)
        return run

    def mark_run_stored(
        self,
        run_id: str,
        *,
        stored_at: datetime | None = None,
        metadata: dict | None = None,
    ) -> DocumentProcessingRun | None:
        """Mark a run as having persisted the source file."""
        payload = self._load()
        run_payload = payload["runs"].get(run_id)
        if not run_payload:
            return None
        run = self._deserialize_run(run_payload)
        run.stored_at = stored_at or datetime.now(UTC)
        run.metadata = self._merge_metadata(run.metadata, metadata)
        payload["runs"][run_id] = self._serialize_run(run)
        self._save(payload)
        return run

    def mark_run_parsed(
        self,
        run_id: str,
        *,
        parser_backend: str,
        layout_count: int,
        structure_node_count: int,
        semantic_block_count: int,
        vector_chunk_count: int,
        parsed_at: datetime | None = None,
        metadata: dict | None = None,
    ) -> DocumentProcessingRun | None:
        """Record parse completion details for a run."""
        payload = self._load()
        run_payload = payload["runs"].get(run_id)
        if not run_payload:
            return None
        run = self._deserialize_run(run_payload)
        run.parser_backend = parser_backend
        run.layout_count = layout_count
        run.structure_node_count = structure_node_count
        run.semantic_block_count = semantic_block_count
        run.vector_chunk_count = vector_chunk_count
        run.parsed_at = parsed_at or datetime.now(UTC)
        run.metadata = self._merge_metadata(run.metadata, metadata)
        payload["runs"][run_id] = self._serialize_run(run)
        self._save(payload)
        return run

    def mark_run_indexed(
        self,
        run_id: str,
        *,
        chunk_count: int,
        index_name: str,
        indexed_at: datetime | None = None,
        finished_at: datetime | None = None,
        metadata: dict | None = None,
    ) -> DocumentProcessingRun | None:
        """Mark a run as successfully indexed."""
        payload = self._load()
        run_payload = payload["runs"].get(run_id)
        if not run_payload:
            return None
        run = self._deserialize_run(run_payload)
        now = datetime.now(UTC)
        run.run_status = "succeeded"
        run.chunk_count = chunk_count
        run.index_name = index_name
        run.indexed_at = indexed_at or now
        run.finished_at = finished_at or now
        run.metadata = self._merge_metadata(run.metadata, metadata)
        payload["runs"][run_id] = self._serialize_run(run)
        self._save(payload)
        return run

    def mark_run_failed(
        self,
        run_id: str,
        *,
        failure_stage: str,
        error_message: str,
        finished_at: datetime | None = None,
        metadata: dict | None = None,
    ) -> DocumentProcessingRun | None:
        """Mark a run as failed."""
        payload = self._load()
        run_payload = payload["runs"].get(run_id)
        if not run_payload:
            return None
        run = self._deserialize_run(run_payload)
        run.run_status = "failed"
        run.failure_stage = failure_stage
        run.error_message = error_message
        run.finished_at = finished_at or datetime.now(UTC)
        run.metadata = self._merge_metadata(run.metadata, metadata)
        payload["runs"][run_id] = self._serialize_run(run)
        self._save(payload)
        return run
    def append_status_event(self, event: DocumentStatusEvent) -> DocumentStatusEvent:
        """Append a document status event."""
        payload = self._load()
        payload["status_events"][event.event_id] = self._serialize_event(event)
        self._save(payload)
        return event

    def replace_artifacts_for_run(self, run_id: str, artifacts: list[DocumentArtifact]) -> list[DocumentArtifact]:
        """Replace all artifacts for a run with the provided list."""
        payload = self._load()
        payload["artifacts"] = {
            artifact_id: artifact_payload
            for artifact_id, artifact_payload in payload["artifacts"].items()
            if artifact_payload.get("run_id") != run_id
        }
        for artifact in artifacts:
            payload["artifacts"][artifact.artifact_id] = self._serialize_artifact(artifact)
        self._save(payload)
        return artifacts

    def delete_by_document(self, doc_id: str) -> None:
        """Delete all processing data for a document."""
        payload = self._load()
        payload["runs"] = {
            run_id: run_payload
            for run_id, run_payload in payload["runs"].items()
            if run_payload.get("doc_id") != doc_id
        }
        payload["status_events"] = {
            event_id: event_payload
            for event_id, event_payload in payload["status_events"].items()
            if event_payload.get("doc_id") != doc_id
        }
        payload["artifacts"] = {
            artifact_id: artifact_payload
            for artifact_id, artifact_payload in payload["artifacts"].items()
            if artifact_payload.get("doc_id") != doc_id
        }
        self._save(payload)

    def list_runs_by_document(self, doc_id: str) -> list[DocumentProcessingRun]:
        """List all processing runs for a document."""
        payload = self._load()
        runs = [
            self._deserialize_run(run_payload)
            for run_payload in payload["runs"].values()
            if run_payload.get("doc_id") == doc_id
        ]
        runs.sort(key=lambda run: run.started_at)
        return runs

    def get_run(self, run_id: str) -> DocumentProcessingRun | None:
        """Return one processing run by identifier."""
        payload = self._load()
        run_payload = payload["runs"].get(run_id)
        return self._deserialize_run(run_payload) if run_payload else None

    def list_status_events_by_document(self, doc_id: str) -> list[DocumentStatusEvent]:
        """List status events for a document."""
        payload = self._load()
        events = [
            self._deserialize_event(event_payload)
            for event_payload in payload["status_events"].values()
            if event_payload.get("doc_id") == doc_id
        ]
        events.sort(key=lambda event: event.occurred_at)
        return events

    def list_status_events_by_run(self, run_id: str) -> list[DocumentStatusEvent]:
        """List status events for a run."""
        payload = self._load()
        events = [
            self._deserialize_event(event_payload)
            for event_payload in payload["status_events"].values()
            if event_payload.get("run_id") == run_id
        ]
        events.sort(key=lambda event: event.occurred_at)
        return events

    def list_artifacts_by_document(self, doc_id: str) -> list[DocumentArtifact]:
        """List artifact references for a document."""
        payload = self._load()
        artifacts = [
            self._deserialize_artifact(artifact_payload)
            for artifact_payload in payload["artifacts"].values()
            if artifact_payload.get("doc_id") == doc_id
        ]
        artifacts.sort(key=lambda artifact: artifact.created_at)
        return artifacts

    def list_artifacts_by_run(self, run_id: str) -> list[DocumentArtifact]:
        """List artifact references for a run."""
        payload = self._load()
        artifacts = [
            self._deserialize_artifact(artifact_payload)
            for artifact_payload in payload["artifacts"].values()
            if artifact_payload.get("run_id") == run_id
        ]
        artifacts.sort(key=lambda artifact: artifact.created_at)
        return artifacts
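Every mutating method in the JSON store above follows the same load, modify, save cycle over a single file, and `_load` tolerates a missing file or a malformed section. A minimal sketch of that cycle follows. It assumes a hypothetical `TinyJsonStore` with only a `runs` section and uses plain dicts in place of the domain dataclasses, so it illustrates the pattern rather than the committed class.

```python
from __future__ import annotations

import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Any


class TinyJsonStore:
    """Hypothetical stand-in illustrating the load/modify/save pattern."""

    def __init__(self, file_path: str) -> None:
        self.file_path = Path(file_path)
        self.file_path.parent.mkdir(parents=True, exist_ok=True)

    def _load(self) -> dict[str, dict[str, Any]]:
        # A missing file, an empty file, or a non-dict section all normalize to {}.
        if not self.file_path.exists():
            return {"runs": {}}
        payload = json.loads(self.file_path.read_text(encoding="utf-8") or "{}")
        runs = payload.get("runs", {})
        return {"runs": runs if isinstance(runs, dict) else {}}

    def _save(self, payload: dict[str, dict[str, Any]]) -> None:
        self.file_path.write_text(
            json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8"
        )

    def create_run(self, run_id: str, started_at: datetime) -> None:
        payload = self._load()
        payload["runs"][run_id] = {"run_id": run_id, "started_at": started_at.isoformat()}
        self._save(payload)

    def mark_failed(self, run_id: str, error_message: str) -> dict[str, Any] | None:
        payload = self._load()
        run = payload["runs"].get(run_id)
        if not run:
            return None  # Unknown run: nothing is written.
        run["run_status"] = "failed"
        run["error_message"] = error_message
        self._save(payload)
        return run


with tempfile.TemporaryDirectory() as tmp:
    store = TinyJsonStore(str(Path(tmp) / "history" / "processing.json"))
    started = datetime(2026, 5, 26, 10, 0, tzinfo=timezone.utc)
    store.create_run("run-1", started)
    failed = store.mark_failed("run-1", "parser crashed")
    missing = store.mark_failed("run-404", "ignored")
    # Timezone-aware datetimes survive the ISO 8601 round trip.
    restored = datetime.fromisoformat(store._load()["runs"]["run-1"]["started_at"])
```

Because each call rewrites the whole file, the pattern suits low-volume, single-process use. Concurrent writers would need locking, which neither this sketch nor the store above attempts.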


@@ -0,0 +1,466 @@
"""Implement infrastructure support for PostgreSQL document processing history."""
from __future__ import annotations

import json
from contextlib import contextmanager
from datetime import UTC, datetime
from typing import Any

import psycopg2
import psycopg2.extras
from psycopg2.pool import ThreadedConnectionPool

from app.config.settings import settings
from app.domain.documents import DocumentArtifact, DocumentProcessingRun, DocumentProcessingStore, DocumentStatusEvent

# Keep SQL mapping local to this adapter so the domain stays storage-agnostic.
_CREATE_RUNS_TABLE = """
CREATE TABLE IF NOT EXISTS document_processing_runs (
run_id VARCHAR(128) PRIMARY KEY,
doc_id VARCHAR(128) NOT NULL,
trigger_type VARCHAR(32) NOT NULL,
run_status VARCHAR(32) NOT NULL DEFAULT 'running',
parser_backend VARCHAR(128) NOT NULL DEFAULT '',
chunk_backend VARCHAR(128) NOT NULL DEFAULT '',
embedding_model VARCHAR(256) NOT NULL DEFAULT '',
index_name VARCHAR(128) NOT NULL DEFAULT '',
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
stored_at TIMESTAMPTZ,
parsed_at TIMESTAMPTZ,
indexed_at TIMESTAMPTZ,
finished_at TIMESTAMPTZ,
layout_count INTEGER NOT NULL DEFAULT 0,
structure_node_count INTEGER NOT NULL DEFAULT 0,
semantic_block_count INTEGER NOT NULL DEFAULT 0,
vector_chunk_count INTEGER NOT NULL DEFAULT 0,
chunk_count INTEGER NOT NULL DEFAULT 0,
failure_stage VARCHAR(64) NOT NULL DEFAULT '',
error_message TEXT NOT NULL DEFAULT '',
metadata JSONB NOT NULL DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT fk_dpr_doc FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_document_processing_runs_doc_id ON document_processing_runs(doc_id, started_at DESC);
"""
_CREATE_EVENTS_TABLE = """
CREATE TABLE IF NOT EXISTS document_status_history (
event_id VARCHAR(128) PRIMARY KEY,
doc_id VARCHAR(128) NOT NULL,
run_id VARCHAR(128) NOT NULL,
from_status VARCHAR(32) NOT NULL DEFAULT '',
to_status VARCHAR(32) NOT NULL,
stage VARCHAR(64) NOT NULL DEFAULT '',
message TEXT NOT NULL DEFAULT '',
metadata JSONB NOT NULL DEFAULT '{}',
occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT fk_dsh_doc FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE,
CONSTRAINT fk_dsh_run FOREIGN KEY (run_id) REFERENCES document_processing_runs(run_id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_document_status_history_doc_id ON document_status_history(doc_id, occurred_at ASC);
CREATE INDEX IF NOT EXISTS idx_document_status_history_run_id ON document_status_history(run_id, occurred_at ASC);
"""
_CREATE_ARTIFACTS_TABLE = """
CREATE TABLE IF NOT EXISTS document_artifacts (
artifact_id VARCHAR(128) PRIMARY KEY,
doc_id VARCHAR(128) NOT NULL,
run_id VARCHAR(128) NOT NULL,
artifact_type VARCHAR(64) NOT NULL,
object_name VARCHAR(1024) NOT NULL,
content_type VARCHAR(128) NOT NULL DEFAULT '',
byte_size BIGINT NOT NULL DEFAULT 0,
checksum VARCHAR(256) NOT NULL DEFAULT '',
metadata JSONB NOT NULL DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT fk_da_doc FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE,
CONSTRAINT fk_da_run FOREIGN KEY (run_id) REFERENCES document_processing_runs(run_id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_document_artifacts_doc_id ON document_artifacts(doc_id, created_at ASC);
CREATE INDEX IF NOT EXISTS idx_document_artifacts_run_id ON document_artifacts(run_id, created_at ASC);
"""
class PostgresDocumentProcessingStore(DocumentProcessingStore):
    """Persist processing history in PostgreSQL using handwritten SQL."""

    def __init__(self) -> None:
        """Initialize the store and ensure the required tables exist."""
        self._pool = ThreadedConnectionPool(
            minconn=1,
            maxconn=5,
            host=settings.postgres_host,
            port=settings.postgres_port,
            user=settings.postgres_user,
            password=settings.postgres_password,
            dbname=settings.postgres_db,
        )
        self._ensure_schema()

    def _ensure_schema(self) -> None:
        """Create processing history tables and indexes if they are missing."""
        with self._conn() as conn:
            with conn.cursor() as cur:
                cur.execute(_CREATE_RUNS_TABLE)
                cur.execute(_CREATE_EVENTS_TABLE)
                cur.execute(_CREATE_ARTIFACTS_TABLE)
            conn.commit()

    @contextmanager
    def _conn(self):
        """Borrow one connection from the pool and return it afterwards."""
        conn = self._pool.getconn()
        try:
            yield conn
        finally:
            self._pool.putconn(conn)

    def _normalize_metadata(self, value: Any) -> dict[str, Any]:
        """Return a JSON-object payload regardless of the row representation."""
        if isinstance(value, dict):
            return value
        if not value:
            return {}
        return json.loads(value)
    def _row_to_run(self, row: dict[str, Any]) -> DocumentProcessingRun:
        """Map one run row into the domain dataclass."""
        return DocumentProcessingRun(
            run_id=row["run_id"],
            doc_id=row["doc_id"],
            trigger_type=row["trigger_type"],
            run_status=row["run_status"],
            parser_backend=row["parser_backend"],
            chunk_backend=row["chunk_backend"],
            embedding_model=row["embedding_model"],
            index_name=row["index_name"],
            started_at=row["started_at"],
            stored_at=row["stored_at"],
            parsed_at=row["parsed_at"],
            indexed_at=row["indexed_at"],
            finished_at=row["finished_at"],
            layout_count=row["layout_count"],
            structure_node_count=row["structure_node_count"],
            semantic_block_count=row["semantic_block_count"],
            vector_chunk_count=row["vector_chunk_count"],
            chunk_count=row["chunk_count"],
            failure_stage=row["failure_stage"],
            error_message=row["error_message"],
            metadata=self._normalize_metadata(row["metadata"]),
        )

    def _row_to_event(self, row: dict[str, Any]) -> DocumentStatusEvent:
        """Map one event row into the domain dataclass."""
        return DocumentStatusEvent(
            event_id=row["event_id"],
            doc_id=row["doc_id"],
            run_id=row["run_id"],
            from_status=row["from_status"],
            to_status=row["to_status"],
            stage=row["stage"],
            message=row["message"],
            metadata=self._normalize_metadata(row["metadata"]),
            occurred_at=row["occurred_at"],
        )

    def _row_to_artifact(self, row: dict[str, Any]) -> DocumentArtifact:
        """Map one artifact row into the domain dataclass."""
        return DocumentArtifact(
            artifact_id=row["artifact_id"],
            doc_id=row["doc_id"],
            run_id=row["run_id"],
            artifact_type=row["artifact_type"],
            object_name=row["object_name"],
            content_type=row["content_type"],
            byte_size=row["byte_size"],
            checksum=row["checksum"],
            metadata=self._normalize_metadata(row["metadata"]),
            created_at=row["created_at"],
        )
    def _update_run(
        self,
        run_id: str,
        *,
        assignments: dict[str, Any],
        metadata: dict | None = None,
    ) -> DocumentProcessingRun | None:
        """Update one run row and return the latest stored state."""
        set_clauses = []
        params: dict[str, Any] = {"run_id": run_id, "updated_at": datetime.now(UTC)}
        for key, value in assignments.items():
            set_clauses.append(f"{key} = %({key})s")
            params[key] = value
        set_clauses.append("updated_at = %(updated_at)s")
        if metadata is not None:
            set_clauses.append("metadata = COALESCE(metadata, '{}'::jsonb) || %(metadata)s::jsonb")
            params["metadata"] = json.dumps(metadata, ensure_ascii=False)
        sql = f"""
            UPDATE document_processing_runs
            SET {", ".join(set_clauses)}
            WHERE run_id = %(run_id)s
            RETURNING *
        """
        with self._conn() as conn:
            with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
                cur.execute(sql, params)
                row = cur.fetchone()
            conn.commit()
        return self._row_to_run(dict(row)) if row else None
    def create_run(self, run: DocumentProcessingRun) -> DocumentProcessingRun:
        """Create a new processing run record."""
        sql = """
            INSERT INTO document_processing_runs
                (run_id, doc_id, trigger_type, run_status, parser_backend, chunk_backend,
                 embedding_model, index_name, started_at, stored_at, parsed_at, indexed_at,
                 finished_at, layout_count, structure_node_count, semantic_block_count,
                 vector_chunk_count, chunk_count, failure_stage, error_message, metadata)
            VALUES
                (%(run_id)s, %(doc_id)s, %(trigger_type)s, %(run_status)s, %(parser_backend)s,
                 %(chunk_backend)s, %(embedding_model)s, %(index_name)s, %(started_at)s,
                 %(stored_at)s, %(parsed_at)s, %(indexed_at)s, %(finished_at)s, %(layout_count)s,
                 %(structure_node_count)s, %(semantic_block_count)s, %(vector_chunk_count)s,
                 %(chunk_count)s, %(failure_stage)s, %(error_message)s, %(metadata)s)
        """
        with self._conn() as conn:
            with conn.cursor() as cur:
                cur.execute(
                    sql,
                    {
                        "run_id": run.run_id,
                        "doc_id": run.doc_id,
                        "trigger_type": run.trigger_type,
                        "run_status": run.run_status,
                        "parser_backend": run.parser_backend,
                        "chunk_backend": run.chunk_backend,
                        "embedding_model": run.embedding_model,
                        "index_name": run.index_name,
                        "started_at": run.started_at,
                        "stored_at": run.stored_at,
                        "parsed_at": run.parsed_at,
                        "indexed_at": run.indexed_at,
                        "finished_at": run.finished_at,
                        "layout_count": run.layout_count,
                        "structure_node_count": run.structure_node_count,
                        "semantic_block_count": run.semantic_block_count,
                        "vector_chunk_count": run.vector_chunk_count,
                        "chunk_count": run.chunk_count,
                        "failure_stage": run.failure_stage,
                        "error_message": run.error_message,
                        "metadata": json.dumps(run.metadata, ensure_ascii=False),
                    },
                )
            conn.commit()
        return run
    def mark_run_stored(
        self,
        run_id: str,
        *,
        stored_at: datetime | None = None,
        metadata: dict | None = None,
    ) -> DocumentProcessingRun | None:
        """Mark a run as having persisted its source file."""
        return self._update_run(
            run_id,
            assignments={"stored_at": stored_at or datetime.now(UTC)},
            metadata=metadata,
        )

    def mark_run_parsed(
        self,
        run_id: str,
        *,
        parser_backend: str,
        layout_count: int,
        structure_node_count: int,
        semantic_block_count: int,
        vector_chunk_count: int,
        parsed_at: datetime | None = None,
        metadata: dict | None = None,
    ) -> DocumentProcessingRun | None:
        """Record parse completion metrics for a run."""
        return self._update_run(
            run_id,
            assignments={
                "parser_backend": parser_backend,
                "parsed_at": parsed_at or datetime.now(UTC),
                "layout_count": layout_count,
                "structure_node_count": structure_node_count,
                "semantic_block_count": semantic_block_count,
                "vector_chunk_count": vector_chunk_count,
            },
            metadata=metadata,
        )

    def mark_run_indexed(
        self,
        run_id: str,
        *,
        chunk_count: int,
        index_name: str,
        indexed_at: datetime | None = None,
        finished_at: datetime | None = None,
        metadata: dict | None = None,
    ) -> DocumentProcessingRun | None:
        """Mark a run as successfully indexed."""
        now = datetime.now(UTC)
        return self._update_run(
            run_id,
            assignments={
                "run_status": "succeeded",
                "chunk_count": chunk_count,
                "index_name": index_name,
                "indexed_at": indexed_at or now,
                "finished_at": finished_at or now,
            },
            metadata=metadata,
        )

    def mark_run_failed(
        self,
        run_id: str,
        *,
        failure_stage: str,
        error_message: str,
        finished_at: datetime | None = None,
        metadata: dict | None = None,
    ) -> DocumentProcessingRun | None:
        """Mark a run as failed and persist the terminal error details."""
        return self._update_run(
            run_id,
            assignments={
                "run_status": "failed",
                "failure_stage": failure_stage,
                "error_message": error_message,
                "finished_at": finished_at or datetime.now(UTC),
            },
            metadata=metadata,
        )
def append_status_event(self, event: DocumentStatusEvent) -> DocumentStatusEvent:
"""Append a document status event."""
sql = """
INSERT INTO document_status_history
(event_id, doc_id, run_id, from_status, to_status, stage, message, metadata, occurred_at)
VALUES
(%(event_id)s, %(doc_id)s, %(run_id)s, %(from_status)s, %(to_status)s,
%(stage)s, %(message)s, %(metadata)s, %(occurred_at)s)
"""
with self._conn() as conn:
with conn.cursor() as cur:
cur.execute(
sql,
{
"event_id": event.event_id,
"doc_id": event.doc_id,
"run_id": event.run_id,
"from_status": event.from_status,
"to_status": event.to_status,
"stage": event.stage,
"message": event.message,
"metadata": json.dumps(event.metadata, ensure_ascii=False),
"occurred_at": event.occurred_at,
},
)
conn.commit()
return event
def replace_artifacts_for_run(self, run_id: str, artifacts: list[DocumentArtifact]) -> list[DocumentArtifact]:
"""Replace all artifact references for one run using a delete-then-insert strategy."""
with self._conn() as conn:
with conn.cursor() as cur:
cur.execute("DELETE FROM document_artifacts WHERE run_id = %s", (run_id,))
if artifacts:
psycopg2.extras.execute_values(
cur,
"""
INSERT INTO document_artifacts
(artifact_id, doc_id, run_id, artifact_type, object_name,
content_type, byte_size, checksum, metadata, created_at)
VALUES %s
""",
[
(
artifact.artifact_id,
artifact.doc_id,
artifact.run_id,
artifact.artifact_type,
artifact.object_name,
artifact.content_type,
artifact.byte_size,
artifact.checksum,
json.dumps(artifact.metadata, ensure_ascii=False),
artifact.created_at,
)
for artifact in artifacts
],
)
conn.commit()
return artifacts
def delete_by_document(self, doc_id: str) -> None:
"""Delete all processing rows for a document explicitly."""
with self._conn() as conn:
with conn.cursor() as cur:
cur.execute("DELETE FROM document_status_history WHERE doc_id = %s", (doc_id,))
cur.execute("DELETE FROM document_artifacts WHERE doc_id = %s", (doc_id,))
cur.execute("DELETE FROM document_processing_runs WHERE doc_id = %s", (doc_id,))
conn.commit()
def list_runs_by_document(self, doc_id: str) -> list[DocumentProcessingRun]:
"""List processing runs for a document in chronological order."""
sql = "SELECT * FROM document_processing_runs WHERE doc_id = %s ORDER BY started_at ASC"
with self._conn() as conn:
with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
cur.execute(sql, (doc_id,))
rows = cur.fetchall()
return [self._row_to_run(dict(row)) for row in rows]
def get_run(self, run_id: str) -> DocumentProcessingRun | None:
"""Return one processing run by identifier."""
sql = "SELECT * FROM document_processing_runs WHERE run_id = %s"
with self._conn() as conn:
with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
cur.execute(sql, (run_id,))
row = cur.fetchone()
return self._row_to_run(dict(row)) if row else None
def list_status_events_by_document(self, doc_id: str) -> list[DocumentStatusEvent]:
"""List all status events for a document."""
sql = "SELECT * FROM document_status_history WHERE doc_id = %s ORDER BY occurred_at ASC"
with self._conn() as conn:
with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
cur.execute(sql, (doc_id,))
rows = cur.fetchall()
return [self._row_to_event(dict(row)) for row in rows]
def list_status_events_by_run(self, run_id: str) -> list[DocumentStatusEvent]:
"""List all status events for a run."""
sql = "SELECT * FROM document_status_history WHERE run_id = %s ORDER BY occurred_at ASC"
with self._conn() as conn:
with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
cur.execute(sql, (run_id,))
rows = cur.fetchall()
return [self._row_to_event(dict(row)) for row in rows]
def list_artifacts_by_document(self, doc_id: str) -> list[DocumentArtifact]:
"""List all artifact references for a document."""
sql = "SELECT * FROM document_artifacts WHERE doc_id = %s ORDER BY created_at ASC"
with self._conn() as conn:
with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
cur.execute(sql, (doc_id,))
rows = cur.fetchall()
return [self._row_to_artifact(dict(row)) for row in rows]
def list_artifacts_by_run(self, run_id: str) -> list[DocumentArtifact]:
"""List all artifact references for a run."""
sql = "SELECT * FROM document_artifacts WHERE run_id = %s ORDER BY created_at ASC"
with self._conn() as conn:
with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
cur.execute(sql, (run_id,))
rows = cur.fetchall()
return [self._row_to_artifact(dict(row)) for row in rows]
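
Review note: the delete-then-insert strategy in `replace_artifacts_for_run` only works if both statements commit together. Below is a minimal sketch of that idea. It uses the standard-library `sqlite3` module as a stand-in for psycopg2 so that it runs without a database server; that substitution, the simplified three-column table, and the `make_demo_connection` helper are all assumptions for illustration, not part of the commit.

```python
import sqlite3


def replace_artifacts_for_run(conn, run_id, artifacts):
    """Replace all artifact rows for one run inside a single transaction."""
    # `with conn` commits if the block succeeds and rolls back if it raises,
    # so a failed insert also undoes the preceding delete.
    with conn:
        conn.execute("DELETE FROM document_artifacts WHERE run_id = ?", (run_id,))
        conn.executemany(
            "INSERT INTO document_artifacts (artifact_id, run_id, object_name) "
            "VALUES (?, ?, ?)",
            [(a["artifact_id"], run_id, a["object_name"]) for a in artifacts],
        )
    return artifacts


def make_demo_connection():
    """Create an in-memory database with a simplified, hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE document_artifacts ("
        "artifact_id TEXT PRIMARY KEY, run_id TEXT, object_name TEXT)"
    )
    return conn
```

Calling the function twice for the same `run_id` leaves only the second batch of rows. If the insert fails, for example on a duplicate primary key, the earlier rows are still present because the delete is rolled back with it.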

@@ -56,7 +56,21 @@ class BM25Retriever:
try:
rows = self._vector_index.collection.query(
expr='doc_id != ""',
output_fields=["id", "doc_id", "doc_name", "content", "section_title", "page_number"],
output_fields=[
"id",
"chunk_id",
"doc_id",
"doc_title",
"text",
"chunk_type",
"section_title",
"page_start",
"page_end",
"section_level",
"chunk_index",
"piece_index",
"metadata_json",
],
limit=16384,
)
except Exception:
@@ -64,19 +78,33 @@ class BM25Retriever:
return []
return [
RetrievedChunk(
chunk_id=str(row.get("id", "")),
chunk_id=str(row.get("chunk_id") or row.get("id", "")),
doc_id=str(row.get("doc_id", "")),
doc_name=str(row.get("doc_name", "")),
content=str(row.get("content", "")),
doc_title=str(row.get("doc_title", "")),
text=str(row.get("text", "")),
score=0.0,
chunk_type=str(row.get("chunk_type", "")),
section_title=str(row.get("section_title", "")),
page_number=int(row.get("page_number") or 0),
metadata={},
page_start=int(row.get("page_start") or 0),
page_end=int(row.get("page_end") or 0),
section_level=int(row.get("section_level") or 0),
chunk_index=int(row.get("chunk_index") or 0),
piece_index=int(row.get("piece_index") or 0),
metadata=self._parse_metadata_json(row.get("metadata_json", "")),
)
for row in rows
if row.get("content")
if row.get("text")
]
def _parse_metadata_json(self, raw_metadata: str) -> dict:
"""Parse metadata_json into a dict for BM25-side filtering."""
if not raw_metadata:
return {}
try:
import json
return dict(json.loads(raw_metadata))
except Exception:
return {}
def _ensure_built(self) -> None:
if self._index is not None:
return
@@ -93,7 +121,7 @@ class BM25Retriever:
self._chunks = []
self._index = BM25Okapi([[]])
return
tokenized = [_tokenize(c.content) for c in chunks]
tokenized = [_tokenize(c.text) for c in chunks]
self._chunks = chunks
self._index = BM25Okapi(tokenized)
logger.info("BM25Retriever: index built with %d chunks", len(chunks))
@@ -127,20 +155,26 @@ class BM25Retriever:
for score, chunk in ranked[: top_k * 2]:
if score <= 0:
break
# Apply simple regulation_type filter if provided
if filters and chunk.metadata.get("regulation_type"):
types = [t.strip() for t in filters.split(",")]
if chunk.metadata.get("regulation_type") not in types:
continue
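# Only a single `doc_title == "..."` equality filter is honored here;
# any other filter expression leaves the BM25 results unfiltered.
if filters: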
normalized_filter = filters.replace("doc_name", "doc_title").strip()
if normalized_filter.startswith('doc_title == "'):
expected_title = normalized_filter[len('doc_title == "'):-1]
if chunk.doc_title != expected_title:
continue
results.append(
RetrievedChunk(
chunk_id=chunk.chunk_id,
doc_id=chunk.doc_id,
doc_name=chunk.doc_name,
content=chunk.content,
doc_title=chunk.doc_title,
text=chunk.text,
score=score,
chunk_type=chunk.chunk_type,
section_title=chunk.section_title,
page_number=chunk.page_number,
page_start=chunk.page_start,
page_end=chunk.page_end,
section_level=chunk.section_level,
chunk_index=chunk.chunk_index,
piece_index=chunk.piece_index,
metadata=chunk.metadata,
)
)
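
Review note: the BM25-side filter above recovers the expected title by slicing the string, and it handles only the single-equality form. The sketch below is one possible alternative that also accepts the `doc_title in [...]` form emitted by the Milvus adapter. The function names `allowed_titles` and `passes` are hypothetical and do not exist in the repository.

```python
import re

# Patterns for the two filter shapes the Milvus adapter is seen to emit.
_EQ = re.compile(r'^doc_title\s*==\s*"(?P<title>[^"]*)"$')
_IN = re.compile(r'^doc_title\s+in\s+\[(?P<items>.*)\]$')


def allowed_titles(filters):
    """Return the set of titles a filter allows, or None if it is not recognized."""
    if not filters:
        return None
    # Rename only a leading legacy field name, never text inside the quoted value.
    expr = re.sub(r"^doc_name\b", "doc_title", filters.strip())
    match = _EQ.match(expr)
    if match:
        return {match.group("title")}
    match = _IN.match(expr)
    if match:
        return set(re.findall(r'"([^"]*)"', match.group("items")))
    return None


def passes(filters, doc_title):
    """Keep a chunk when the filter is unrecognized or its title is allowed."""
    titles = allowed_titles(filters)
    return True if titles is None else doc_title in titles
```

As in the committed code, an unrecognized filter leaves results unfiltered rather than dropping everything. Whether that is the desired fallback is a design decision for the authors.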

@@ -31,7 +31,7 @@ class OpenAICompatibleReranker(Reranker):
if not chunks:
return []
texts = [chunk.content for chunk in chunks]
texts = [chunk.text for chunk in chunks]
start = time.time()
try:
scores = self._call_reranker(query, texts)

@@ -4,57 +4,150 @@ from __future__ import annotations
import json
import time
from typing import Iterable
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections, utility
from loguru import logger
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, MilvusException, connections, utility
from app.config.settings import settings
from app.domain.documents import Chunk
from app.domain.retrieval import RetrievedChunk, VectorIndex
from app.shared.errors import VectorStoreSchemaError
# Keep adapter behavior explicit so integration details remain easy to audit.
_REQUIRED_SCHEMA_FIELDS = (
"doc_id",
"doc_title",
"chunk_id",
"text",
"embedding",
"section_title",
"metadata_json",
)
_SCHEMA_RECOVERY_TOKENS = (
"field doc_title not exist",
"field text not exist",
"field embedding not exist",
"collection not loaded",
"can't find collection",
"not found[collection",
)
class MilvusVectorIndex(VectorIndex):
"""Milvus-backed implementation of the VectorIndex interface."""
def __init__(self) -> None:
"""Read connection settings, connect under a dedicated alias, and bind the collection."""
self.collection_name = settings.milvus_collection
self.db_name = settings.milvus_db_name
self.host = settings.milvus_host
self.port = settings.milvus_port
# Use an adapter-specific alias so this index never reuses unrelated global Milvus state.
self.alias = f"vector-index::{self.host}:{self.port}/{self.db_name}/{self.collection_name}"
self._connect()
self.collection = self._bind_collection()
def _connect(self, *, refresh: bool = False) -> None:
"""Establish the Milvus connection for this adapter."""
if refresh:
try:
connections.disconnect(self.alias)
except Exception:
# Best-effort disconnect keeps refresh idempotent when no alias is active yet.
pass
connections.connect(
alias="default",
host=settings.milvus_host,
port=settings.milvus_port,
alias=self.alias,
host=self.host,
port=self.port,
db_name=self.db_name,
)
self.collection = self._ensure_collection()
def _schema_field_names(self, collection: Collection) -> list[str]:
"""Return the field names exposed by the bound Milvus collection."""
return [field.name for field in collection.schema.fields]
def _raise_schema_error(self, *, message: str, actual_fields: Iterable[str]) -> None:
"""Raise a typed schema error for the active collection."""
raise VectorStoreSchemaError(
message=message,
host=self.host,
db_name=self.db_name,
collection_name=self.collection_name,
expected_fields=list(_REQUIRED_SCHEMA_FIELDS),
actual_fields=list(actual_fields),
)
def _validate_schema(self, collection: Collection) -> None:
"""Ensure the collection schema matches the dense-only adapter contract."""
actual_fields = self._schema_field_names(collection)
missing_fields = [field_name for field_name in _REQUIRED_SCHEMA_FIELDS if field_name not in actual_fields]
if missing_fields:
self._raise_schema_error(
message=f"Milvus collection schema mismatch; missing required fields: {missing_fields}",
actual_fields=actual_fields,
)
def _log_collection_binding(self, collection: Collection, *, event: str) -> None:
"""Record the bound collection details for runtime diagnostics."""
try:
num_entities = collection.num_entities
except Exception:
num_entities = "unknown"
logger.info(
"Milvus binding {} alias={} host={} db={} collection={} fields={} num_entities={}",
event,
self.alias,
self.host,
self.db_name,
self.collection_name,
self._schema_field_names(collection),
num_entities,
)
def _bind_collection(self, *, force_refresh: bool = False) -> Collection:
"""Bind and validate the configured Milvus collection."""
if force_refresh:
self._connect(refresh=True)
collection = self._ensure_collection()
self._validate_schema(collection)
self._log_collection_binding(collection, event="refreshed" if force_refresh else "initialized")
return collection
def _ensure_collection(self) -> Collection:
"""Load the configured collection, creating it with the dense-only schema if it does not exist."""
if utility.has_collection(self.collection_name):
collection = Collection(self.collection_name)
if utility.has_collection(self.collection_name, using=self.alias):
collection = Collection(self.collection_name, using=self.alias)
collection.load()
return collection
schema = CollectionSchema(
fields=[
FieldSchema(name="id", dtype=DataType.VARCHAR, max_length=128, is_primary=True, auto_id=False),
FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=64),
FieldSchema(name="doc_name", dtype=DataType.VARCHAR, max_length=256),
FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535),
FieldSchema(name="doc_title", dtype=DataType.VARCHAR, max_length=256),
FieldSchema(name="chunk_id", dtype=DataType.VARCHAR, max_length=128),
FieldSchema(name="chunk_index", dtype=DataType.INT64),
FieldSchema(name="piece_index", dtype=DataType.INT64),
FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
FieldSchema(name="embedding_text", dtype=DataType.VARCHAR, max_length=65535),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=settings.embedding_dim),
FieldSchema(name="section_title", dtype=DataType.VARCHAR, max_length=512),
FieldSchema(name="section_path", dtype=DataType.VARCHAR, max_length=4096),
FieldSchema(name="page_number", dtype=DataType.INT64),
FieldSchema(name="regulation_type", dtype=DataType.VARCHAR, max_length=128),
FieldSchema(name="version", dtype=DataType.VARCHAR, max_length=64),
FieldSchema(name="semantic_id", dtype=DataType.VARCHAR, max_length=128),
FieldSchema(name="block_type", dtype=DataType.VARCHAR, max_length=64),
FieldSchema(name="chunk_type", dtype=DataType.VARCHAR, max_length=64),
FieldSchema(name="page_start", dtype=DataType.INT64),
FieldSchema(name="page_end", dtype=DataType.INT64),
FieldSchema(name="section_level", dtype=DataType.INT64),
FieldSchema(name="source_ids", dtype=DataType.VARCHAR, max_length=4096),
FieldSchema(name="section_path", dtype=DataType.VARCHAR, max_length=4096),
FieldSchema(name="section_title", dtype=DataType.VARCHAR, max_length=512),
FieldSchema(name="metadata_json", dtype=DataType.VARCHAR, max_length=65535),
FieldSchema(name="created_at", dtype=DataType.INT64),
],
description="Dense-only regulations index",
enable_dynamic_field=False,
)
collection = Collection(name=self.collection_name, schema=schema)
collection = Collection(name=self.collection_name, schema=schema, using=self.alias)
collection.create_index(
field_name="embedding",
index_params={
@@ -73,21 +166,34 @@ class MilvusVectorIndex(VectorIndex):
data = []
now = int(time.time())
for chunk, vector in zip(chunks, vectors):
metadata = dict(chunk.metadata)
doc_title = str(metadata.get("doc_title", chunk.doc_title))
text = str(metadata.get("text", chunk.text))
embedding_text = str(metadata.get("embedding_text", chunk.embedding_text))
page_start = int(metadata.get("page_start", 0) or 0)
page_end = int(metadata.get("page_end", 0) or 0)
section_path = metadata.get("section_path", chunk.section_path)
source_ids = metadata.get("source_ids", [])
data.append(
{
"id": chunk.chunk_id,
"doc_id": chunk.doc_id,
"doc_name": chunk.doc_name,
"content": chunk.content[:65535],
"doc_title": doc_title[:256],
"chunk_id": chunk.chunk_id[:128],
"chunk_index": int(metadata.get("chunk_index", chunk.chunk_index) or 0),
"piece_index": int(metadata.get("piece_index", chunk.piece_index) or 0),
"text": text[:65535],
"embedding_text": embedding_text[:65535],
"embedding": vector,
"section_title": chunk.section_title[:512],
"section_path": json.dumps(chunk.section_path, ensure_ascii=False)[:4096],
"page_number": chunk.page_number,
"regulation_type": chunk.regulation_type[:128],
"version": chunk.version[:64],
"semantic_id": chunk.semantic_id[:128],
"block_type": chunk.block_type[:64],
"metadata_json": json.dumps(chunk.metadata, ensure_ascii=False)[:65535],
"semantic_id": str(metadata.get("semantic_id", chunk.semantic_id))[:128],
"chunk_type": str(metadata.get("chunk_type", chunk.chunk_type))[:64],
"page_start": page_start,
"page_end": page_end,
"section_level": int(metadata.get("section_level", chunk.section_level) or 0),
"source_ids": json.dumps(source_ids, ensure_ascii=False)[:4096],
"section_path": json.dumps(section_path, ensure_ascii=False)[:4096],
"section_title": str(metadata.get("section_title", chunk.section_title))[:512],
"metadata_json": json.dumps(metadata, ensure_ascii=False)[:65535],
"created_at": now,
}
)
@@ -107,47 +213,97 @@ class MilvusVectorIndex(VectorIndex):
filters = filters.strip()
# Normalize legacy field names so callers can keep older filter payloads.
replacements = {
"doc_name": "doc_title",
"content": "text",
"page_number": "page_start",
"block_type": "chunk_type",
}
for legacy_name, new_name in replacements.items():
filters = filters.replace(legacy_name, new_name)
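# Treat the filter as a ready-made Milvus expression if it contains an operator.
# NOTE: this is a substring test, so a bare title that contains "in" or "or"
# (for example "Braking") also matches and is returned unchanged.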
if any(op in filters for op in ["==", "!=", "in", "not in", ">", "<", ">=", "<=", "and", "or"]):
return filters
# Parse simple regulation_type filter
# Support: "GB" or "GB,UN-ECE" or "GB, UN-ECE"
types = [t.strip() for t in filters.split(",") if t.strip()]
# Parse simple document-title filter.
titles = [title.strip() for title in filters.split(",") if title.strip()]
if not types:
if not titles:
return None
if len(types) == 1:
# Single value: regulation_type == "GB"
return f'regulation_type == "{types[0]}"'
else:
# Multiple values: regulation_type in ["GB", "UN-ECE"]
quoted_types = [f'"{t}"' for t in types]
return f'regulation_type in [{", ".join(quoted_types)}]'
if len(titles) == 1:
return f'doc_title == "{titles[0]}"'
quoted_titles = [f'"{title}"' for title in titles]
return f'doc_title in [{", ".join(quoted_titles)}]'
def _should_refresh_after_exception(self, exc: Exception) -> bool:
"""Return whether the Milvus error suggests stale connection or collection state."""
if not isinstance(exc, MilvusException):
return False
normalized = str(exc).lower()
return any(token in normalized for token in _SCHEMA_RECOVERY_TOKENS)
def _run_with_refresh(self, operation):
"""Run a Milvus operation and retry once after a forced reconnect when appropriate."""
try:
return operation()
except VectorStoreSchemaError:
raise
except Exception as exc:
if not self._should_refresh_after_exception(exc):
raise
logger.warning(
"Milvus operation failed for alias={} collection={}; forcing reconnect and retry: {}",
self.alias,
self.collection_name,
exc,
)
self.collection = self._bind_collection(force_refresh=True)
try:
return operation()
except VectorStoreSchemaError:
raise
except Exception as retry_exc:
if isinstance(retry_exc, MilvusException):
self._raise_schema_error(
message=f"Milvus operation failed after refresh: {retry_exc}",
actual_fields=self._schema_field_names(self.collection),
)
raise
def search(self, query_vector: list[float], top_k: int, filters: str | None = None) -> list[RetrievedChunk]:
"""Run a dense vector search and return the matching chunks."""
milvus_expr = self._parse_filters(filters)
results = self.collection.search(
data=[query_vector],
anns_field="embedding",
param={"metric_type": "COSINE", "params": {"nprobe": settings.milvus_nprobe}},
limit=top_k,
expr=milvus_expr,
output_fields=[
"doc_id",
"doc_name",
"content",
"section_title",
"page_number",
"regulation_type",
"version",
"semantic_id",
"block_type",
"metadata_json",
],
results = self._run_with_refresh(
lambda: self.collection.search(
data=[query_vector],
anns_field="embedding",
param={"metric_type": "COSINE", "params": {"nprobe": settings.milvus_nprobe}},
limit=top_k,
expr=milvus_expr,
output_fields=[
"doc_id",
"doc_title",
"chunk_id",
"chunk_index",
"piece_index",
"text",
"embedding_text",
"section_title",
"semantic_id",
"chunk_type",
"page_start",
"page_end",
"section_level",
"source_ids",
"section_path",
"metadata_json",
],
)
)
payload: list[RetrievedChunk] = []
for hits in results:
@@ -161,13 +317,18 @@ class MilvusVectorIndex(VectorIndex):
metadata = {"raw_metadata": raw_metadata}
payload.append(
RetrievedChunk(
chunk_id=str(hit.id),
chunk_id=str(hit.entity.get("chunk_id", hit.id)),
doc_id=hit.entity.get("doc_id", ""),
doc_name=hit.entity.get("doc_name", ""),
content=hit.entity.get("content", ""),
doc_title=hit.entity.get("doc_title", ""),
text=hit.entity.get("text", ""),
score=float(hit.score),
chunk_type=hit.entity.get("chunk_type", ""),
section_title=hit.entity.get("section_title", ""),
page_number=int(hit.entity.get("page_number", 0) or 0),
page_start=int(hit.entity.get("page_start", 0) or 0),
page_end=int(hit.entity.get("page_end", 0) or 0),
section_level=int(hit.entity.get("section_level", 0) or 0),
chunk_index=int(hit.entity.get("chunk_index", 0) or 0),
piece_index=int(hit.entity.get("piece_index", 0) or 0),
metadata=metadata,
)
)
@@ -176,7 +337,9 @@ class MilvusVectorIndex(VectorIndex):
def count_by_document(self) -> dict[str, int]:
"""Return doc_id -> chunk count from Milvus."""
try:
rows = self.collection.query(expr="doc_id != \"\"", output_fields=["doc_id"])
rows = self._run_with_refresh(
lambda: self.collection.query(expr="doc_id != \"\"", output_fields=["doc_id", "doc_title"])
)
except Exception:
return {}
counts: dict[str, int] = {}
@@ -189,9 +352,11 @@ class MilvusVectorIndex(VectorIndex):
def list_document_metadata(self) -> list[dict]:
"""Return one metadata row per document from Milvus (single query, no embeddings)."""
try:
rows = self.collection.query(
expr="doc_id != \"\"",
output_fields=["doc_id", "doc_name", "regulation_type", "version"],
rows = self._run_with_refresh(
lambda: self.collection.query(
expr="doc_id != \"\"",
output_fields=["doc_id", "doc_title", "metadata_json"],
)
)
except Exception:
return []
@@ -204,15 +369,26 @@ class MilvusVectorIndex(VectorIndex):
continue
counts[doc_id] = counts.get(doc_id, 0) + 1
if doc_id not in seen:
metadata: dict[str, object] = {}
raw_metadata = row.get("metadata_json", "")
if raw_metadata:
try:
metadata = json.loads(raw_metadata)
except json.JSONDecodeError:
metadata = {}
seen[doc_id] = {
"doc_id": doc_id,
"doc_name": row.get("doc_name", ""),
"regulation_type": row.get("regulation_type", ""),
"version": row.get("version", ""),
"doc_title": row.get("doc_title", ""),
"regulation_type": str(metadata.get("regulation_type", "")),
"version": str(metadata.get("version", "")),
}
return [
{**meta, "chunk_count": counts[meta["doc_id"]]}
{
**meta,
"doc_name": meta.get("doc_title", ""),
"chunk_count": counts[meta["doc_id"]],
}
for meta in seen.values()
]
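
Review note: `_run_with_refresh` follows a retry-once pattern: run the operation, refresh the connection if the error looks like stale state, then run it one more time. The sketch below shows that control flow without any Milvus dependency. `StaleStateError` is a hypothetical stand-in for `MilvusException`, and `FlakyCollection` is a made-up test double.

```python
class StaleStateError(Exception):
    """Hypothetical stand-in for pymilvus.MilvusException."""


RECOVERY_TOKENS = ("collection not loaded", "can't find collection")


def run_with_refresh(operation, refresh):
    """Run operation; on a recoverable error, refresh once and retry once."""
    try:
        return operation()
    except StaleStateError as exc:
        message = str(exc).lower()
        if not any(token in message for token in RECOVERY_TOKENS):
            raise  # unrelated failure: do not mask it with a retry
    refresh()
    return operation()  # a second failure propagates to the caller


class FlakyCollection:
    """Test double that fails until it has been reloaded."""

    def __init__(self):
        self.loaded = False
        self.calls = 0

    def search(self):
        self.calls += 1
        if not self.loaded:
            raise StaleStateError("Collection not loaded")
        return ["hit"]

    def reload(self):
        self.loaded = True
```

Unlike this sketch, the committed method converts a second `MilvusException` into a `VectorStoreSchemaError`. The committed code also passes a lambda that reads `self.collection` when it is called, so the retry uses the freshly rebound collection and not the stale one.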

@@ -67,14 +67,14 @@ class DocumentProcessor:
return [
{
"id": item.chunk_id,
"content": item.content,
"content": item.text,
"score": item.score,
"metadata": {
"doc_id": item.doc_id,
"doc_name": item.doc_name,
"doc_name": item.doc_title,
"chunk_id": item.chunk_id,
"section_title": item.section_title,
"page_number": item.page_number,
"page_number": item.page_start,
**item.metadata,
},
}

@@ -3,29 +3,136 @@
from __future__ import annotations
from functools import lru_cache
from typing import Callable
from app.application.agent import AgentConversationService
from app.application.agent import AgentConversationService, AgentSessionService
from app.application.documents import DocumentCommandService, DocumentQueryService
from app.application.knowledge import KnowledgeRetrievalService
from app.application.perception.services import PerceptionService
from app.config.settings import settings
from app.domain.documents import DocumentBinaryStore
from app.domain.retrieval import VectorIndex
from app.infrastructure.embedding.openai_compatible_embedding_provider import OpenAICompatibleEmbeddingProvider
from app.infrastructure.llm.openai_compatible_answer_generator import OpenAICompatibleAnswerGenerator
from app.infrastructure.parser.aliyun_document_parser import AliyunDocumentParser
from app.infrastructure.parser.local_chunk_builder import LocalRegulationChunkBuilder
from app.infrastructure.parser.local_document_parser import LocalDocumentParser
from app.infrastructure.parser.vector_chunk_builder import AliyunVectorChunkBuilder
from app.infrastructure.perception.mock_event_store import MockEventStore
from app.infrastructure.session.in_memory_conversation_store import InMemoryConversationStore
from app.infrastructure.storage.json_document_processing_store import JsonDocumentProcessingStore
from app.infrastructure.storage.json_document_repository import JsonDocumentRepository
from app.infrastructure.storage.minio_binary_store import MinioDocumentBinaryStore
from app.infrastructure.storage.postgres_document_processing_store import PostgresDocumentProcessingStore
from app.infrastructure.storage.postgres_document_repository import PostgresDocumentRepository
from app.infrastructure.storage.postgres_parse_artifact_store import PostgresParseArtifactStore
from app.infrastructure.vectorstore.bm25_retriever import BM25Retriever
from app.infrastructure.vectorstore.cross_encoder_reranker import OpenAICompatibleReranker
from app.infrastructure.vectorstore.dense_retriever import DenseRetriever
from app.infrastructure.vectorstore.milvus_vector_index import MilvusVectorIndex
from app.infrastructure.vectorstore.cross_encoder_reranker import OpenAICompatibleReranker
from app.services.llm.llm_factory import LLMFactory
# Keep shared wiring centralized so dependency construction remains consistent.
class LazyBinaryStore(DocumentBinaryStore):
"""Delay MinIO connection work until binary storage is actually needed."""
def __init__(self, factory: Callable[[], DocumentBinaryStore]) -> None:
"""Initialize the lazy binary store wrapper."""
self._factory = factory
self._store: DocumentBinaryStore | None = None
def _get_store(self) -> DocumentBinaryStore:
"""Create the underlying store on first use and reuse it afterwards."""
if self._store is None:
self._store = self._factory()
return self._store
@property
def client(self):
"""Expose the underlying client for compatibility with health endpoints."""
return self._get_store().client
def save(
self,
*,
object_name: str,
data: bytes,
content_type: str,
metadata: dict[str, str] | None = None,
) -> None:
"""Save data through the underlying binary store implementation."""
self._get_store().save(
object_name=object_name,
data=data,
content_type=content_type,
metadata=metadata,
)
def read(self, object_name: str) -> bytes:
"""Read data through the underlying binary store implementation."""
return self._get_store().read(object_name)
def delete(self, object_name: str) -> None:
"""Delete data through the underlying binary store implementation."""
self._get_store().delete(object_name)
class LazyVectorIndex(VectorIndex):
"""Delay Milvus connection work until vector operations are actually needed."""
def __init__(self, factory: Callable[[], VectorIndex]) -> None:
"""Initialize the lazy vector index wrapper."""
self._factory = factory
self._index: VectorIndex | None = None
def _get_index(self) -> VectorIndex:
"""Create the underlying index on first use and reuse it afterwards."""
if self._index is None:
self._index = self._factory()
return self._index
@property
def collection(self):
"""Expose the underlying Milvus collection for compatibility adapters."""
return self._get_index().collection
def upsert(self, chunks, vectors) -> int:
"""Insert or update vectors through the underlying vector index implementation."""
return self._get_index().upsert(chunks, vectors)
def delete_by_document(self, doc_id: str) -> int:
"""Delete vectors through the underlying vector index implementation."""
return self._get_index().delete_by_document(doc_id)
def search(self, query_vector: list[float], top_k: int, filters: str | None = None):
"""Search vectors through the underlying vector index implementation."""
return self._get_index().search(query_vector, top_k, filters)
def count_by_document(self) -> dict[str, int]:
"""Count document vectors through the underlying vector index implementation."""
return self._get_index().count_by_document()
def list_document_metadata(self) -> list[dict]:
"""List document metadata through the underlying vector index implementation."""
return self._get_index().list_document_metadata()
def health(self) -> dict:
"""Return vector index health through the underlying vector index implementation."""
return self._get_index().health()
@lru_cache
def _build_binary_store() -> MinioDocumentBinaryStore:
"""Return the concrete binary store implementation."""
return MinioDocumentBinaryStore()
@lru_cache
def _build_vector_index() -> MilvusVectorIndex:
"""Return the concrete vector index implementation."""
return MilvusVectorIndex()
@lru_cache
def get_document_repository():
@@ -44,9 +151,17 @@ def get_parse_artifact_store():
@lru_cache
def get_binary_store() -> MinioDocumentBinaryStore:
def get_document_processing_store():
"""Return document processing store for the active repository backend."""
if settings.document_repository_backend == "postgres":
return PostgresDocumentProcessingStore()
return JsonDocumentProcessingStore(settings.document_processing_metadata_path)
@lru_cache
def get_binary_store() -> DocumentBinaryStore:
"""Return the lazily initialized binary store."""
return MinioDocumentBinaryStore()
return LazyBinaryStore(_build_binary_store)
@lru_cache
@@ -75,9 +190,9 @@ def get_embedding_provider() -> OpenAICompatibleEmbeddingProvider:
@lru_cache
def get_vector_index() -> MilvusVectorIndex:
def get_vector_index() -> VectorIndex:
"""Return the lazily initialized vector index."""
return MilvusVectorIndex()
return LazyVectorIndex(_build_vector_index)
@lru_cache
@@ -121,6 +236,7 @@ def get_document_command_service() -> DocumentCommandService:
embedding_provider=get_embedding_provider(),
vector_index=get_vector_index(),
parse_artifact_store=get_parse_artifact_store(),
document_processing_store=get_document_processing_store(),
)
@@ -151,3 +267,28 @@ def get_agent_conversation_service() -> AgentConversationService:
answer_generator=OpenAICompatibleAnswerGenerator(),
conversation_store=get_conversation_store(),
)
@lru_cache
def get_perception_service() -> PerceptionService:
"""Return perception service for regulatory intelligence."""
return PerceptionService(
event_store=MockEventStore(),
retrieval_service=get_retrieval_service(),
)
@lru_cache
def get_agent_session_service() -> AgentSessionService:
"""Return agent session service."""
return AgentSessionService(conversation_store=get_conversation_store())
def preload_runtime_dependencies() -> None:
"""Warm dependencies that are safe and useful to preload during startup."""
LLMFactory.preload_clients(["qwen", "deepseek"])
def cleanup_runtime_dependencies() -> None:
"""Release runtime dependencies that expose explicit cleanup hooks."""
LLMFactory.cleanup()
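
Review note: the lazy wrappers combine two layers. An `lru_cache`d getter returns a cheap proxy, and the proxy calls a cached factory only when a method is first used. The sketch below illustrates that behavior. `ExpensiveIndex`, `LazyIndex`, `_build_index`, and `get_index` are hypothetical stand-ins for `MilvusVectorIndex`, `LazyVectorIndex`, `_build_vector_index`, and `get_vector_index`.

```python
from functools import lru_cache
from typing import Callable, Optional


class ExpensiveIndex:
    """Hypothetical stand-in for a client that connects in its constructor."""

    instances = 0  # counts constructions so the laziness is observable

    def __init__(self) -> None:
        ExpensiveIndex.instances += 1

    def health(self) -> dict:
        return {"status": "ok"}


class LazyIndex:
    """Proxy that builds the real index on first use and reuses it afterwards."""

    def __init__(self, factory: Callable[[], ExpensiveIndex]) -> None:
        self._factory = factory
        self._index: Optional[ExpensiveIndex] = None

    def _get_index(self) -> ExpensiveIndex:
        if self._index is None:
            self._index = self._factory()
        return self._index

    def health(self) -> dict:
        return self._get_index().health()


@lru_cache
def _build_index() -> ExpensiveIndex:
    """Build the concrete index once per process."""
    return ExpensiveIndex()


@lru_cache
def get_index() -> LazyIndex:
    """Return the shared proxy; no connection work happens here."""
    return LazyIndex(_build_index)
```

With this arrangement, building services that depend on `get_index()` does not construct `ExpensiveIndex`. Construction happens on the first method call and is reused afterwards.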

@@ -0,0 +1,30 @@
"""Define shared backend exception types."""
from __future__ import annotations
class VectorStoreSchemaError(RuntimeError):
"""Signal that the active vector store schema does not match backend expectations."""
def __init__(
self,
*,
message: str,
host: str,
db_name: str,
collection_name: str,
expected_fields: list[str],
actual_fields: list[str],
) -> None:
"""Initialize the vector store schema error details."""
self.host = host
self.db_name = db_name
self.collection_name = collection_name
self.expected_fields = expected_fields
self.actual_fields = actual_fields
# Keep the message self-contained so runtime logs show the full mismatch context.
details = (
f"{message} | host={host} db={db_name} collection={collection_name} "
f"expected_fields={expected_fields} actual_fields={actual_fields}"
)
super().__init__(details)
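
Review note: because `VectorStoreSchemaError` keeps its context as attributes, a caller can report the mismatch without parsing the message. The sketch below uses a trimmed copy of the class so it runs standalone. `validate_schema` and `describe` are hypothetical helpers written for this example, and the host and collection values in the usage are made up.

```python
class VectorStoreSchemaError(RuntimeError):
    """Trimmed copy of the exception above, repeated so the sketch is standalone."""

    def __init__(self, *, message, host, db_name, collection_name,
                 expected_fields, actual_fields):
        self.host = host
        self.db_name = db_name
        self.collection_name = collection_name
        self.expected_fields = expected_fields
        self.actual_fields = actual_fields
        super().__init__(
            f"{message} | host={host} db={db_name} collection={collection_name} "
            f"expected_fields={expected_fields} actual_fields={actual_fields}"
        )


def validate_schema(required, actual, *, host, db_name, collection_name):
    """Raise a typed error when any required field is absent."""
    missing = [name for name in required if name not in actual]
    if missing:
        raise VectorStoreSchemaError(
            message=f"missing required fields: {missing}",
            host=host,
            db_name=db_name,
            collection_name=collection_name,
            expected_fields=list(required),
            actual_fields=list(actual),
        )


def describe(exc):
    """Summarize the mismatch from the structured attributes."""
    missing = sorted(set(exc.expected_fields) - set(exc.actual_fields))
    return {"collection": exc.collection_name, "missing_fields": missing}
```

A health endpoint could, for example, catch the error and return `describe(exc)` in place of a generic failure.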

@@ -1 +0,0 @@
{}

@@ -0,0 +1,131 @@
{
"runs": {
"8e722053-5009-40fe-a483-535b40ebbb16": {
"run_id": "8e722053-5009-40fe-a483-535b40ebbb16",
"doc_id": "7cbdfe3c",
"trigger_type": "upload",
"run_status": "succeeded",
"parser_backend": "aliyun_docmind",
"chunk_backend": "aliyun",
"embedding_model": "text-embedding-v3",
"index_name": "regulations_dense_1024_v2",
"started_at": "2026-05-26T12:18:27.208692+00:00",
"stored_at": "2026-05-26T12:18:27.712855+00:00",
"parsed_at": "2026-05-26T12:18:42.989238+00:00",
"indexed_at": "2026-05-26T12:18:51.172418+00:00",
"finished_at": "2026-05-26T12:18:51.172418+00:00",
"layout_count": 48,
"structure_node_count": 6,
"semantic_block_count": 33,
"vector_chunk_count": 34,
"chunk_count": 34,
"failure_stage": "",
"error_message": "",
"metadata": {
"generate_summary": true,
"parse_task_id": "docmind-20260526-10b94713ccb348498b12180a5dcf32ff"
}
}
},
"status_events": {
"d0532baf-0d65-4130-b282-ec51f04132fd": {
"event_id": "d0532baf-0d65-4130-b282-ec51f04132fd",
"doc_id": "7cbdfe3c",
"run_id": "8e722053-5009-40fe-a483-535b40ebbb16",
"from_status": "",
"to_status": "pending",
"stage": "document_created",
"message": "Document record created",
"metadata": {},
"occurred_at": "2026-05-26T12:18:27.235921+00:00"
},
"a5e32db5-25c3-4c73-a987-7311f0e72a31": {
"event_id": "a5e32db5-25c3-4c73-a987-7311f0e72a31",
"doc_id": "7cbdfe3c",
"run_id": "8e722053-5009-40fe-a483-535b40ebbb16",
"from_status": "pending",
"to_status": "stored",
"stage": "store",
"message": "Source file stored",
"metadata": {},
"occurred_at": "2026-05-26T12:18:27.741462+00:00"
},
"18e04ce7-9d7a-4008-8600-e2590100bd85": {
"event_id": "18e04ce7-9d7a-4008-8600-e2590100bd85",
"doc_id": "7cbdfe3c",
"run_id": "8e722053-5009-40fe-a483-535b40ebbb16",
"from_status": "stored",
"to_status": "parsed",
"stage": "parse",
"message": "Document parsed",
"metadata": {
"artifact_count": 4
},
"occurred_at": "2026-05-26T12:18:43.218026+00:00"
},
"d3b06025-5c91-4a42-9e5f-dce1c5312b96": {
"event_id": "d3b06025-5c91-4a42-9e5f-dce1c5312b96",
"doc_id": "7cbdfe3c",
"run_id": "8e722053-5009-40fe-a483-535b40ebbb16",
"from_status": "parsed",
"to_status": "indexed",
"stage": "index",
"message": "Document indexed",
"metadata": {
"chunk_count": 34,
"index_name": "regulations_dense_1024_v2"
},
"occurred_at": "2026-05-26T12:18:51.195442+00:00"
}
},
"artifacts": {
"47fe2877-a8f5-4e1d-901b-80cd0194ba96": {
"artifact_id": "47fe2877-a8f5-4e1d-901b-80cd0194ba96",
"doc_id": "7cbdfe3c",
"run_id": "8e722053-5009-40fe-a483-535b40ebbb16",
"artifact_type": "layouts",
"object_name": "artifacts/7cbdfe3c/layouts.json",
"content_type": "application/json",
"byte_size": 0,
"checksum": "",
"metadata": {},
"created_at": "2026-05-26T12:18:43.188467+00:00"
},
"44aa075b-86b2-48a7-9d14-a2453bd53863": {
"artifact_id": "44aa075b-86b2-48a7-9d14-a2453bd53863",
"doc_id": "7cbdfe3c",
"run_id": "8e722053-5009-40fe-a483-535b40ebbb16",
"artifact_type": "structure_nodes",
"object_name": "artifacts/7cbdfe3c/structure_nodes.json",
"content_type": "application/json",
"byte_size": 0,
"checksum": "",
"metadata": {},
"created_at": "2026-05-26T12:18:43.188494+00:00"
},
"dedcc8fe-fa58-4de6-984d-f44332af5204": {
"artifact_id": "dedcc8fe-fa58-4de6-984d-f44332af5204",
"doc_id": "7cbdfe3c",
"run_id": "8e722053-5009-40fe-a483-535b40ebbb16",
"artifact_type": "semantic_blocks",
"object_name": "artifacts/7cbdfe3c/semantic_blocks.json",
"content_type": "application/json",
"byte_size": 0,
"checksum": "",
"metadata": {},
"created_at": "2026-05-26T12:18:43.188511+00:00"
},
"9b0d8bda-e69e-4a4e-ae06-a308afe43109": {
"artifact_id": "9b0d8bda-e69e-4a4e-ae06-a308afe43109",
"doc_id": "7cbdfe3c",
"run_id": "8e722053-5009-40fe-a483-535b40ebbb16",
"artifact_type": "vector_chunks",
"object_name": "artifacts/7cbdfe3c/vector_chunks.json",
"content_type": "application/json",
"byte_size": 0,
"checksum": "",
"metadata": {},
"created_at": "2026-05-26T12:18:43.188526+00:00"
}
}
}


@@ -1,385 +1,38 @@
{
"69280841": {
"doc_id": "69280841",
"doc_name": "TCT算法接口.pdf",
"file_name": "TCT算法接口.pdf",
"object_name": "69280841/TCT算法接口.pdf",
"content_type": "application/pdf",
"size_bytes": 165557,
"status": "failed",
"regulation_type": "",
"version": "",
"summary": "",
"summary_latency_ms": 0,
"chunk_count": 0,
"parser_name": "local_markdown_parser",
"index_name": "",
"error_message": "embedding 维度不匹配,期望 1536",
"created_at": "2026-05-18T07:12:16.668306+00:00",
"updated_at": "2026-05-18T07:12:19.417142+00:00",
"metadata": {
"generate_summary": true,
"structure_nodes": 0
}
},
"44121fbb": {
"doc_id": "44121fbb",
"doc_name": "大众汽车手册.pdf",
"file_name": "大众汽车手册.pdf",
"object_name": "44121fbb/大众汽车手册.pdf",
"content_type": "application/pdf",
"size_bytes": 766565,
"status": "failed",
"regulation_type": "",
"version": "",
"summary": "",
"summary_latency_ms": 0,
"chunk_count": 0,
"parser_name": "",
"index_name": "",
"error_message": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614a5cb9d0>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
"created_at": "2026-05-18T09:53:47.996183+00:00",
"updated_at": "2026-05-18T09:53:50.825868+00:00",
"metadata": {
"generate_summary": true,
"failure_reason": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614a5cb9d0>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
"processing_stage": "failed"
}
},
"77debb4a": {
"doc_id": "77debb4a",
"doc_name": "大众汽车手册.pdf",
"file_name": "大众汽车手册.pdf",
"object_name": "77debb4a/大众汽车手册.pdf",
"content_type": "application/pdf",
"size_bytes": 766565,
"status": "failed",
"regulation_type": "",
"version": "",
"summary": "",
"summary_latency_ms": 0,
"chunk_count": 0,
"parser_name": "",
"index_name": "",
"error_message": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614a6dd480>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
"created_at": "2026-05-18T10:05:46.104259+00:00",
"updated_at": "2026-05-18T10:05:48.704061+00:00",
"metadata": {
"generate_summary": true,
"failure_reason": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614a6dd480>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
"processing_stage": "failed"
}
},
"d12bdcc8": {
"doc_id": "d12bdcc8",
"doc_name": "TCT算法接口.pdf",
"file_name": "TCT算法接口.pdf",
"object_name": "d12bdcc8/TCT算法接口.pdf",
"content_type": "application/pdf",
"size_bytes": 165557,
"status": "failed",
"regulation_type": "",
"version": "",
"summary": "",
"summary_latency_ms": 0,
"chunk_count": 0,
"parser_name": "",
"index_name": "",
"error_message": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614a5bf570>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
"created_at": "2026-05-18T10:07:22.199824+00:00",
"updated_at": "2026-05-18T10:07:24.653751+00:00",
"metadata": {
"generate_summary": true,
"failure_reason": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614a5bf570>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
"processing_stage": "failed"
}
},
"3c2e8c9c": {
"doc_id": "3c2e8c9c",
"doc_name": "20260415_Continental tire mobile app solution.pdf",
"file_name": "20260415_Continental tire mobile app solution.pdf",
"object_name": "3c2e8c9c/20260415_Continental tire mobile app solution.pdf",
"content_type": "application/pdf",
"size_bytes": 2178074,
"status": "failed",
"regulation_type": "",
"version": "",
"summary": "",
"summary_latency_ms": 0,
"chunk_count": 0,
"parser_name": "",
"index_name": "",
"error_message": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614a5bc8d0>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
"created_at": "2026-05-18T10:09:58.338274+00:00",
"updated_at": "2026-05-18T10:10:01.295502+00:00",
"metadata": {
"generate_summary": true,
"failure_reason": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614a5bc8d0>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
"processing_stage": "failed"
}
},
"d22d21a0": {
"doc_id": "d22d21a0",
"doc_name": "20260415_Continental tire mobile app solution.pdf",
"file_name": "20260415_Continental tire mobile app solution.pdf",
"object_name": "d22d21a0/20260415_Continental tire mobile app solution.pdf",
"content_type": "application/pdf",
"size_bytes": 2178074,
"status": "failed",
"regulation_type": "",
"version": "",
"summary": "",
"summary_latency_ms": 0,
"chunk_count": 0,
"parser_name": "",
"index_name": "",
"error_message": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614b994160>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
"created_at": "2026-05-18T10:12:20.078027+00:00",
"updated_at": "2026-05-18T10:12:22.999843+00:00",
"metadata": {
"generate_summary": true,
"failure_reason": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614b994160>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
"processing_stage": "failed"
}
},
"35f129d3": {
"doc_id": "35f129d3",
"doc_name": "大众汽车手册.pdf",
"file_name": "大众汽车手册.pdf",
"object_name": "35f129d3/大众汽车手册.pdf",
"content_type": "application/pdf",
"size_bytes": 766565,
"status": "failed",
"regulation_type": "",
"version": "",
"summary": "",
"summary_latency_ms": 0,
"chunk_count": 0,
"parser_name": "",
"index_name": "",
"error_message": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614b995370>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
"created_at": "2026-05-18T10:13:24.706512+00:00",
"updated_at": "2026-05-18T10:13:27.180509+00:00",
"metadata": {
"generate_summary": true,
"failure_reason": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614b995370>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
"processing_stage": "failed"
}
},
"efc21515": {
"doc_id": "efc21515",
"doc_name": "大众汽车手册.pdf",
"file_name": "大众汽车手册.pdf",
"object_name": "efc21515/大众汽车手册.pdf",
"content_type": "application/pdf",
"size_bytes": 766565,
"status": "failed",
"regulation_type": "",
"version": "",
"summary": "",
"summary_latency_ms": 0,
"chunk_count": 0,
"parser_name": "aliyun_docmind",
"index_name": "",
"error_message": "Client error '400 Bad Request' for url 'http://6.86.80.4:30080/v1/embeddings'\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400",
"created_at": "2026-05-18T13:47:32.076786+00:00",
"updated_at": "2026-05-18T13:47:57.998073+00:00",
"metadata": {
"generate_summary": true,
"parser_backend": "aliyun_docmind",
"parse_task_id": "docmind-20260518-a6e84447457f43cb85f95225cfc6495b",
"layout_count": 87,
"structure_node_count": 20,
"semantic_block_count": 27,
"vector_chunk_count": 27,
"artifact_keys": {
"layouts": "artifacts/efc21515/layouts.json",
"structure_nodes": "artifacts/efc21515/structure_nodes.json",
"semantic_blocks": "artifacts/efc21515/semantic_blocks.json",
"vector_chunks": "artifacts/efc21515/vector_chunks.json"
},
"processing_stage": "failed",
"failure_reason": "Client error '400 Bad Request' for url 'http://6.86.80.4:30080/v1/embeddings'\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400"
}
},
"0d4b08bc": {
"doc_id": "0d4b08bc",
"doc_name": "大众汽车手册.pdf",
"file_name": "大众汽车手册.pdf",
"object_name": "0d4b08bc/大众汽车手册.pdf",
"content_type": "application/pdf",
"size_bytes": 766565,
"status": "failed",
"regulation_type": "",
"version": "",
"summary": "",
"summary_latency_ms": 0,
"chunk_count": 0,
"parser_name": "aliyun_docmind",
"index_name": "",
"error_message": "Client error '404 Not Found' for url 'http://6.86.80.4:30080/v1/embeddings'\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404",
"created_at": "2026-05-18T14:03:15.134344+00:00",
"updated_at": "2026-05-18T14:03:34.843448+00:00",
"metadata": {
"generate_summary": true,
"parser_backend": "aliyun_docmind",
"parse_task_id": "docmind-20260518-78353d85daa24147b68d8fb71895179f",
"layout_count": 87,
"structure_node_count": 20,
"semantic_block_count": 27,
"vector_chunk_count": 27,
"artifact_keys": {
"layouts": "artifacts/0d4b08bc/layouts.json",
"structure_nodes": "artifacts/0d4b08bc/structure_nodes.json",
"semantic_blocks": "artifacts/0d4b08bc/semantic_blocks.json",
"vector_chunks": "artifacts/0d4b08bc/vector_chunks.json"
},
"processing_stage": "failed",
"failure_reason": "Client error '404 Not Found' for url 'http://6.86.80.4:30080/v1/embeddings'\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404"
}
},
"4302f314": {
"doc_id": "4302f314",
"doc_name": "大众汽车手册.pdf",
"file_name": "大众汽车手册.pdf",
"object_name": "4302f314/大众汽车手册.pdf",
"content_type": "application/pdf",
"size_bytes": 766565,
"status": "failed",
"regulation_type": "",
"version": "",
"summary": "",
"summary_latency_ms": 0,
"chunk_count": 0,
"parser_name": "aliyun_docmind",
"index_name": "",
"error_message": "embedding 维度不匹配,期望 1536",
"created_at": "2026-05-18T14:11:29.943973+00:00",
"updated_at": "2026-05-18T14:11:48.554500+00:00",
"metadata": {
"generate_summary": true,
"parser_backend": "aliyun_docmind",
"parse_task_id": "docmind-20260518-23935ee455ac4b26ac4201ac4781ee52",
"layout_count": 87,
"structure_node_count": 20,
"semantic_block_count": 27,
"vector_chunk_count": 27,
"artifact_keys": {
"layouts": "artifacts/4302f314/layouts.json",
"structure_nodes": "artifacts/4302f314/structure_nodes.json",
"semantic_blocks": "artifacts/4302f314/semantic_blocks.json",
"vector_chunks": "artifacts/4302f314/vector_chunks.json"
},
"processing_stage": "failed",
"failure_reason": "embedding 维度不匹配,期望 1536"
}
},
"765ed1ee": {
"doc_id": "765ed1ee",
"doc_name": "大众汽车手册.pdf",
"file_name": "大众汽车手册.pdf",
"object_name": "765ed1ee/大众汽车手册.pdf",
"content_type": "application/pdf",
"size_bytes": 766565,
"status": "failed",
"regulation_type": "",
"version": "",
"summary": "",
"summary_latency_ms": 0,
"chunk_count": 0,
"parser_name": "aliyun_docmind",
"index_name": "",
"error_message": "<MilvusException: (code=1100, message=the dim (1024) of field data(embedding) is not equal to schema dim (1536): invalid parameter[expected=1536][actual=1024])>",
"created_at": "2026-05-18T14:18:28.875138+00:00",
"updated_at": "2026-05-18T14:18:57.389110+00:00",
"metadata": {
"generate_summary": true,
"parser_backend": "aliyun_docmind",
"parse_task_id": "docmind-20260518-f116856bc29245baa2531b245078a701",
"layout_count": 87,
"structure_node_count": 20,
"semantic_block_count": 27,
"vector_chunk_count": 27,
"artifact_keys": {
"layouts": "artifacts/765ed1ee/layouts.json",
"structure_nodes": "artifacts/765ed1ee/structure_nodes.json",
"semantic_blocks": "artifacts/765ed1ee/semantic_blocks.json",
"vector_chunks": "artifacts/765ed1ee/vector_chunks.json"
},
"processing_stage": "failed",
"failure_reason": "<MilvusException: (code=1100, message=the dim (1024) of field data(embedding) is not equal to schema dim (1536): invalid parameter[expected=1536][actual=1024])>"
}
},
"05cabe09": {
"doc_id": "05cabe09",
"doc_name": "大众汽车手册.pdf",
"file_name": "大众汽车手册.pdf",
"object_name": "05cabe09/大众汽车手册.pdf",
"content_type": "application/pdf",
"size_bytes": 766565,
"status": "failed",
"regulation_type": "",
"version": "",
"summary": "",
"summary_latency_ms": 0,
"chunk_count": 0,
"parser_name": "aliyun_docmind",
"index_name": "",
"error_message": "embedding 维度不匹配,期望 1536",
"created_at": "2026-05-18T14:24:32.156500+00:00",
"updated_at": "2026-05-18T14:24:50.114138+00:00",
"metadata": {
"generate_summary": true,
"parser_backend": "aliyun_docmind",
"parse_task_id": "docmind-20260518-897d858983df48e28e9819e563d46208",
"layout_count": 87,
"structure_node_count": 20,
"semantic_block_count": 27,
"vector_chunk_count": 27,
"artifact_keys": {
"layouts": "artifacts/05cabe09/layouts.json",
"structure_nodes": "artifacts/05cabe09/structure_nodes.json",
"semantic_blocks": "artifacts/05cabe09/semantic_blocks.json",
"vector_chunks": "artifacts/05cabe09/vector_chunks.json"
},
"processing_stage": "failed",
"failure_reason": "embedding 维度不匹配,期望 1536"
}
},
"9acb2ba0": {
"doc_id": "9acb2ba0",
"doc_name": "大众汽车手册.pdf",
"file_name": "大众汽车手册.pdf",
"object_name": "9acb2ba0/大众汽车手册.pdf",
"content_type": "application/pdf",
"size_bytes": 766565,
"7cbdfe3c": {
"doc_id": "7cbdfe3c",
"doc_name": "使用RSA Token连接CheckPoint VPN及PIN码设置_220.181.114.93 or 10.25.134.3.docx",
"file_name": "使用RSA Token连接CheckPoint VPN及PIN码设置_220.181.114.93 or 10.25.134.3.docx",
"object_name": "7cbdfe3c/使用RSA Token连接CheckPoint VPN及PIN码设置_220.181.114.93 or 10.25.134.3.docx",
"content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"size_bytes": 1199920,
"status": "indexed",
"regulation_type": "",
"version": "",
"summary": "",
"summary_latency_ms": 0,
"chunk_count": 27,
"chunk_count": 34,
"parser_name": "aliyun_docmind",
"index_name": "regulations_dense_1024_v1",
"index_name": "regulations_dense_1024_v2",
"error_message": "",
"created_at": "2026-05-18T14:29:01.368719+00:00",
"updated_at": "2026-05-18T14:29:23.699068+00:00",
"created_at": "2026-05-26T12:18:27.206125+00:00",
"updated_at": "2026-05-26T12:18:51.171308+00:00",
"metadata": {
"generate_summary": true,
"parser_backend": "aliyun_docmind",
"parse_task_id": "docmind-20260518-e5fd4a5419e74d569c562e389e6ae72c",
"layout_count": 87,
"structure_node_count": 20,
"semantic_block_count": 27,
"vector_chunk_count": 27,
"parse_task_id": "docmind-20260526-10b94713ccb348498b12180a5dcf32ff",
"layout_count": 48,
"structure_node_count": 6,
"semantic_block_count": 33,
"vector_chunk_count": 34,
"artifact_keys": {
"layouts": "artifacts/9acb2ba0/layouts.json",
"structure_nodes": "artifacts/9acb2ba0/structure_nodes.json",
"semantic_blocks": "artifacts/9acb2ba0/semantic_blocks.json",
"vector_chunks": "artifacts/9acb2ba0/vector_chunks.json"
"layouts": "artifacts/7cbdfe3c/layouts.json",
"structure_nodes": "artifacts/7cbdfe3c/structure_nodes.json",
"semantic_blocks": "artifacts/7cbdfe3c/semantic_blocks.json",
"vector_chunks": "artifacts/7cbdfe3c/vector_chunks.json"
},
"processing_stage": "indexed",
"index_collection": "regulations_dense_1024_v1"
"index_collection": "regulations_dense_1024_v2"
}
}
}


@@ -7,7 +7,7 @@
- The upload entry point remains `/api/v1/documents/upload`
- Default `PARSER_BACKEND=aliyun`
- Default `CHUNK_BACKEND=aliyun`
- The default Milvus collection is `regulations_dense_1536_v2`
- The default Milvus collection is `regulations_dense_1024_v2`
- Parse artifacts are written to MinIO under `artifacts/{doc_id}/`
The complete main pipeline is as follows:
@@ -19,7 +19,7 @@
5. Convert to `structure_nodes / semantic_blocks / vector_chunks`
6. Write the three-layer structure JSON back to MinIO
7. Call the embedding API with `vector_chunks[*].embedding_text`
8. Write to `regulations_dense_1536_v2`
8. Write to `regulations_dense_1024_v2`
9. Update the document status to `indexed`
The runtime conversion logic lives in `backend/app/infrastructure/parser/aliyun_layout_normalizer.py`
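
Steps 7 and 8 only succeed when the embedding vectors have the dimension the collection was created with. Below is a minimal sketch of a pre-insert check, assuming the `1024` in `regulations_dense_1024_v2` denotes the vector dimension. `EXPECTED_DIM` and `check_embedding_dims` are hypothetical names used for illustration and are not part of the backend.

```python
from __future__ import annotations

# Assumption: inferred from the "1024" in the collection name.
EXPECTED_DIM = 1024


def check_embedding_dims(
    vectors: list[list[float]], expected_dim: int = EXPECTED_DIM
) -> None:
    """Raise ValueError if any vector does not match the expected dimension."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"embedding dimension mismatch at index {position}: "
                f"expected {expected_dim}, got {len(vector)}"
            )
```

Running a check like this before the write step would surface a mismatch as a local error rather than as a rejection from the vector store.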


@@ -10,6 +10,31 @@
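- This document freezes the target module boundaries, dependency rules, and implementation organization.
- Any later code refactoring, capability replacement, or platform upgrade must satisfy both the RFC and this document.
## 1.1 Document Status And Authority
This document is not a reference-only "target-state draft"; it is the mandatory architecture baseline for ongoing backend development.
- New backend features must by default follow the module boundaries and dependency directions defined in this document.
- The existence of historical implementations, code under migration, and compatibility façades does not justify further deviation from this document.
- Where the current state conflicts with this document, new code is placed according to this document; old code is converged step by step under the migration plan, but the legacy boundary must not be expanded further.
- Reviews, refactoring acceptance, and later architecture discussions all treat this document as the authority on the backend's internal structure.
## 1.2 Authoritative Scope
The backend scope governed by this document includes:
- `backend/app/api/*`
- `backend/app/application/*`
- `backend/app/domain/*`
- `backend/app/infrastructure/*`
- `backend/app/shared/*`
Notes:
- `backend/app/services/*` and `backend/app/workflows/*` are currently migration-period legacy directories and are not the default home for new business logic.
- `backend/app/api/routes/docs.py` and `backend/app/api/routes/rag.py` are treated as legacy or non-primary entry points and should not be extended further, except for migration, compatibility, or decommissioning work.
- `backend/app/api/routes/compliance.py` is still exposed externally but does not yet fully satisfy the constraints of this document; until it is migrated to an application service, it should be treated as a controlled legacy entry point rather than a new architectural template.
## 2. Current-State Problems
Based on the current code, the backend already has the following capabilities: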
@@ -22,6 +47,18 @@
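However, these capabilities are currently mostly just "runnable", not yet "clearly structured, easy to replace, and easy to evolve". The core problems are as follows.
### 2.0 Current-State Verdict
Based on the current repository, the verdict on the current state is:
- Largely compliant: the `documents` upload/query main pipeline is already consolidated behind `DocumentCommandService` and `DocumentQueryService`.
- Largely compliant: `knowledge` retrieval is already exposed uniformly through `KnowledgeRetrievalService`.
- Largely compliant: the `agent` Q&A main pipeline is already consolidated behind `AgentConversationService`, and `shared/bootstrap.py` already acts as the composition root.
- Partially compliant: the Agent session detail, history, deletion, and feedback endpoints used to access `ConversationStore` directly and still need to be consolidated into an application service.
- Not fully compliant: the `compliance` route still handles file persistence, task state, and mock results directly, which does not conform to `api -> application -> domain ports -> infrastructure`.
- Not fully compliant: some `infrastructure` adapters still depend on legacy implementations inside `services/*`, which shows the migration is not yet complete.
- Not fully compliant: the lifecycle warm-up logic in `api/main.py` still depends directly on the old LLM factory and has not yet fully returned to a unified wiring boundary.
### 2.1 `DocumentProcessor` Is Overloaded With Responsibilities
Current-state assessment: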
@@ -603,6 +640,7 @@ infrastructure -> external systems
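- `application` may depend only on `domain`, port interfaces, and implementation instances injected through the composition root
- `domain` must not depend on `api` or `infrastructure`
- `infrastructure` may depend on the ports and data models defined by `domain`, but must not drive application logic in reverse
- Application entry points such as `api/main.py` may keep lightweight startup/shutdown lifecycle code, but should not depend directly on a legacy service factory over the long term; warm-up and wiring logic should gradually be consolidated into an explicit wiring boundary
Notes: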
@@ -739,6 +777,54 @@ infrastructure -> external systems
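- Internal DTO / VO / domain objects converge into `application` and `domain`
- API models must not leak directly into the domain
### 10.10 Application Entry Point And Startup Lifecycle
Current:
- `backend/app/api/main.py`
Target:
- Keep the FastAPI app, middleware, and lifespan entry responsibilities
- Gradually remove the direct dependency on the legacy LLM factory
- Warm-up, cleanup, and dependency wiring should stay within an explicit wiring / bootstrap boundary, instead of continuing to bake the old service factory in as an application entry dependency
### 10.11 Compliance Route
Current:
- `backend/app/api/routes/compliance.py`
Target:
- If this capability is kept, it should be migrated to a dedicated application service and stable ports
- Until the migration is complete, this route is treated as a controlled legacy entry point: bugs may be fixed, but its business orchestration responsibilities should not be extended further
### 10.12 Legacy Route Entry Points
Current:
- `backend/app/api/routes/docs.py`
- `backend/app/api/routes/rag.py`
Target:
- Gradually archive, decommission, or migrate them as legacy or demo entry points
- No longer use them as development entry points for new backend capabilities
### 10.13 Legacy Workflow And Service Directories
Current:
- `backend/app/workflows/*`
- `backend/app/services/*`
Target:
- Keep their compatibility value during the migration period, but no longer host new long-term business orchestration in them
- If a legacy implementation is still reused indirectly by an `infrastructure` adapter, treat it as a transitional dependency and gradually move it into `infrastructure` or a more stable low-level support module
- No new backend business capability should use these directories as its default home
## 11. Technology Replacement Boundaries
### 11.1 Local Parsing / MinerU -> Alibaba Cloud Document Parsing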
@@ -790,6 +876,10 @@ infrastructure -> external systems
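- Do not create a second "do-everything pipeline class" to replace `DocumentProcessor`
- Do not let `knowledge` and `agent` each maintain a separate retrieval implementation
- Do not let replacement of the parser, embedding, vector index, or LLM provider leak into the API layer
- Do not add routes that access `ConversationStore` directly
- Do not add code that treats `backend/app/services/*` or `backend/app/workflows/*` as the default home for business logic
- Do not add new transitional `infrastructure -> services/*` dependencies; existing ones may only be eliminated gradually within the migration window and must not spread further
- Do not describe the legacy directories as the backend's current main structure in the README, development notes, or review conclusions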
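
Some of the dependency prohibitions above can be checked mechanically. The following is a minimal sketch, assuming the backend packages are imported as `app.*` (for example `app.services`); the `FORBIDDEN` table and `find_violations` are hypothetical names for illustration and do not exist in the repository.

```python
from __future__ import annotations

import ast

# Hypothetical rule table: owning package prefix -> import prefixes it must not use.
FORBIDDEN = {
    "app.infrastructure": ("app.services", "app.workflows"),
    "app.domain": ("app.api", "app.infrastructure"),
}


def find_violations(module_name: str, source: str) -> list[str]:
    """Return the forbidden absolute imports found in `source` for `module_name`."""
    banned: tuple[str, ...] = ()
    for owner, prefixes in FORBIDDEN.items():
        if module_name == owner or module_name.startswith(owner + "."):
            banned = prefixes
            break
    violations: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue  # not an absolute import statement
        for name in names:
            if any(name == p or name.startswith(p + ".") for p in banned):
                violations.append(name)
    return violations
```

A check of this kind could run in CI over every file under `backend/app/`, failing the build when the returned list is non-empty. It only inspects static absolute imports, so dynamic imports and relative imports would need separate handling.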
## 13. Architecture Review Checklist
@@ -807,3 +897,7 @@ infrastructure -> external systems
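10. Whether it is made explicit that `knowledge` and `agent` share the same retrieval foundation.
11. Whether it is made explicit that the API layer handles only transport concerns and no longer takes on business orchestration directly.
12. Whether upper-layer application services and external API contracts are guaranteed not to be forced to change when an implementation is replaced later.
13. Whether any route still accesses `ConversationStore`, the file system, object storage, or task-state storage directly.
14. Whether any new `infrastructure -> services/*` dependency has been added.
15. Whether new backend business logic has been written into `services/*` or `workflows/*`.
16. Whether the README, the backend README, and the collaboration notes remain consistent with the current authoritative architecture.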


@@ -0,0 +1,623 @@
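# Core Document Processing Main Pipeline
This document describes the core document processing flow in the current default production pipeline, namely:
- `AliyunDocumentParser`
- `AliyunVectorChunkBuilder`
- `OpenAICompatibleEmbeddingProvider`
- `MilvusVectorIndex`
The goal is to answer four core questions:
1. Why `ParsedDocument` is a multi-layer structure
2. Where each of these structures is stored
3. Which step actually performs vectorization
4. What exactly ends up stored in Milvus
The database table design, relational model, DDL, and PostgreSQL responsibility boundaries have been moved to [document-processing-database-design.md](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/docs/architecture/document-processing-database-design.md:1). This document keeps the process view, gives only a summary of the storage division of labor where necessary, and no longer serves as the authority on database design.
## 1. Main Pipeline Overview
The current default implementation is orchestrated by `DocumentCommandService.upload_and_process()`. It does not go "straight into the vector store once parsing is done"; instead it first produces structured parse artifacts and then sends the layer suited to retrieval on to embedding and Milvus.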
```mermaid
sequenceDiagram
participant API as API / Service
participant MinIO as BinaryStore
participant Parser as AliyunDocumentParser
participant PG as DocumentRepository / ParseArtifactStore
participant Embed as EmbeddingProvider
participant Milvus as VectorIndex
API->>MinIO: Save the original file
API->>Parser: parse(file_path, doc_id, doc_name)
Parser-->>API: ParsedDocument
API->>MinIO: Save layouts/structure_nodes/semantic_blocks/vector_chunks JSON
API->>PG: Update documents.status=parsed
API->>PG: Save structure_nodes / semantic_blocks
API->>API: chunk_builder.build(parsed_document)
API->>Embed: embed_texts([chunk.embedding_text])
Embed-->>API: vectors
API->>Milvus: upsert(chunks, vectors)
API->>PG: Update documents.status=indexed
```
The orchestration code for the main pipeline is in [services.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/application/documents/services.py:83):
```python
def upload_and_process(
    self,
    *,
    doc_id: str | None = None,
    file_name: str,
    content: bytes,
    content_type: str,
    doc_name: str | None,
    regulation_type: str,
    version: str,
    generate_summary: bool,
) -> DocumentProcessResult:
    doc_id = doc_id or str(uuid.uuid4())[:8]
    final_doc_name = doc_name or file_name
    object_name = f"{doc_id}/{file_name}"
    self.document_repository.create(document)
    self.binary_store.save(object_name=object_name, data=content, content_type=content_type, metadata={"doc_id": doc_id})
    self.document_repository.update_status(doc_id, DocumentStatus.STORED)
    parsed_document = self.parser.parse(file_path=temp_path, doc_id=doc_id, doc_name=final_doc_name)
    artifact_keys = self._save_parse_artifacts(doc_id=doc_id, parsed_document=parsed_document)
    self.document_repository.update_status(doc_id, DocumentStatus.PARSED, parser_name=parsed_document.parser_name, metadata={...})
    if self.parse_artifact_store:
        self.parse_artifact_store.save(doc_id, parsed_document.structure_nodes, parsed_document.semantic_blocks)
    chunks = self.chunk_builder.build(parsed_document=parsed_document, regulation_type=regulation_type, version=version)
    vectors = self.embedding_provider.embed_texts([chunk.embedding_text for chunk in chunks])
    inserted = self.vector_index.upsert(chunks, vectors)
    self.document_repository.update_status(doc_id, DocumentStatus.INDEXED, chunk_count=len(chunks), index_name=health.get("collection_name", ""), metadata={...})
```
The default bindings are in [bootstrap.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/shared/bootstrap.py:157):
```python
def get_parser():
    if settings.parser_backend == "aliyun":
        return AliyunDocumentParser()
    return LocalDocumentParser()


def get_chunk_builder():
    if settings.chunk_backend == "aliyun":
        return AliyunVectorChunkBuilder()
    return LocalRegulationChunkBuilder(...)


def get_embedding_provider() -> OpenAICompatibleEmbeddingProvider:
    return OpenAICompatibleEmbeddingProvider()


def get_vector_index() -> VectorIndex:
    return LazyVectorIndex(_build_vector_index)
```
In other words, the current default main pipeline is:
- parser: `AliyunDocumentParser`
- chunk builder: `AliyunVectorChunkBuilder`
- embedding provider: `OpenAICompatibleEmbeddingProvider`
- vector index: `MilvusVectorIndex`
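## 2. Why `ParsedDocument` has three layers
`ParsedDocument` is not the final storage format. It is the unified intermediate structure that the parser hands to later processing steps, and it splits "structure understanding" and "vector retrieval preparation" into three layers.
It is defined in [models.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/domain/documents/models.py:49):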
```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ParsedDocument:
    doc_id: str
    doc_name: str
    structure_nodes: list[dict[str, Any]]
    semantic_blocks: list[dict[str, Any]]
    vector_chunks: list[dict[str, Any]]
    parser_name: str
    raw_text: str = ""
    raw_layouts: list[dict[str, Any]] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)
```
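The three layers have different responsibilities:
- `structure_nodes`
  - the heading hierarchy skeleton
  - describes which chapters, sections, and clauses the document has
  - preserves structure and is not embedded directly
- `semantic_blocks`
  - the semantic block layer
  - organizes body text, tables, and figure captions into continuous semantic units
  - the intermediate layer between raw layouts and retrieval chunks
- `vector_chunks`
  - the retrieval and vectorization layer
  - already a chunk view suitable for sending to the embedding model
  - the later `ChunkBuilder` mostly just maps this layer onto the unified `Chunk`
### 2.1 How the three layers are built from the parser result
`AliyunDocumentParser.parse()` first fetches the `layouts` returned by Aliyun through the gateway, then converts those `layouts` into the three layers.
The code is in [aliyun_document_parser.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/parser/aliyun_document_parser.py:28):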
```python
def parse(self, *, file_path: str, doc_id: str, doc_name: str) -> ParsedDocument:
    payload = self.gateway.parse_document(file_path=file_path)
    layouts = payload.layouts
    structure_nodes = build_structure_nodes(layouts)
    semantic_blocks = build_semantic_blocks(layouts)
    vector_chunks = build_vector_chunks(
        semantic_blocks,
        doc_id=doc_id,
        doc_title=doc_name,
        max_chars=MAX_CHARS,
        overlap_chars=OVERLAP_CHARS,
    )
    raw_text = "\n\n".join(
        block.get("text", "")
        for block in semantic_blocks
        if block.get("text")
    )
    return ParsedDocument(
        doc_id=doc_id,
        doc_name=doc_name,
        structure_nodes=structure_nodes,
        semantic_blocks=semantic_blocks,
        vector_chunks=vector_chunks,
        parser_name=self.parser_name,
        raw_text=raw_text,
        raw_layouts=layouts,
        metadata={...},
    )
```
In other words:
- the parser's raw output is `layouts`
- what the system actually consumes is `ParsedDocument`
- `ParsedDocument` is normalized from `layouts` by the normalizer
### 2.2 Layer 1: `structure_nodes`
This layer keeps only headings and their levels.
The code is in [aliyun_layout_normalizer.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/parser/aliyun_layout_normalizer.py:85):
```python
def build_structure_nodes(layouts: list[dict[str, Any]]) -> list[dict[str, Any]]:
    nodes: list[dict[str, Any]] = []
    for layout in layouts:
        if not is_title(layout):
            continue
        text = get_text(layout)
        if not text or text in TOC_TITLES:
            continue
        nodes.append(
            {
                "unique_id": layout.get("uniqueId"),
                "page": get_page(layout),
                "index": layout.get("index", 0),
                "level": layout.get("level", 0),
                "title": text,
                "type": layout.get("type"),
                "sub_type": layout.get("subType"),
            }
        )
    return nodes
```
Example:
```json
[
{
"unique_id": "l-title-001",
"page": 2,
"index": 11,
"level": 1,
"title": "1 范围",
"type": "title",
"sub_type": "para_title"
},
{
"unique_id": "l-title-002",
"page": 3,
"index": 18,
"level": 2,
"title": "1.1 适用对象",
"type": "title",
"sub_type": "para_title"
}
]
```
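The purpose of this layer is to preserve the table-of-contents tree; it is not used directly for retrieval.
### 2.3 Layer 2: `semantic_blocks`
This layer merges consecutive body text into a single semantic block and handles tables and figure captions separately.
The code is in [aliyun_layout_normalizer.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/parser/aliyun_layout_normalizer.py:163):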
```python
def build_semantic_blocks(layouts: list[dict[str, Any]]) -> list[dict[str, Any]]:
    semantic_blocks: list[dict[str, Any]] = []
    section_stack: list[dict[str, Any]] = []
    pending_text_blocks: list[dict[str, Any]] = []
    block_id = 1
    for layout in layouts:
        text = get_text(layout)
        page = get_page(layout)
        if is_title(layout):
            # A heading closes the text collected so far and opens a new section.
            block_id = flush_text_block(pending_text_blocks, semantic_blocks, block_id)
            pending_text_blocks = []
            section_stack = update_section_path(section_stack, layout)
            continue
        section_path = section_path_titles(section_stack)
        section_title = section_path[-1] if section_path else "未分类"
        section_level = len(section_path)
        if is_table(layout):
            ...
            semantic_blocks.append(
                {
                    "semantic_id": f"semantic-{block_id}",
                    "block_type": "table",
                    "page_start": page,
                    "page_end": page,
                    "section_path": section_path,
                    "section_level": section_level,
                    "section_title": section_title,
                    "source_ids": [layout.get("uniqueId")],
                    "text": table_text,
                }
            )
            continue
        if is_text(layout) and text:
            # Body text is buffered and merged at the next flush.
            pending_text_blocks.append(
                {
                    "page": page,
                    "text": text,
                    "unique_id": layout.get("uniqueId"),
                    "section_path": section_path,
                    "section_level": section_level,
                    "section_title": section_title,
                }
            )
    # The remainder of the function is not shown in this excerpt.
```
After merging, body text forms semantic blocks like this:
```json
[
{
"semantic_id": "semantic-1",
"block_type": "section_text",
"page_start": 2,
"page_end": 2,
"section_path": ["1 范围", "1.1 适用对象"],
"section_level": 2,
"section_title": "1.1 适用对象",
"source_ids": ["l-text-001", "l-text-002"],
"text": "本标准适用于道路车辆动力电池系统的安全要求。企业应建立一致的测试和验证方法。"
}
]
```
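The merge step itself is not shown in this document, so the following is only a minimal sketch of what `flush_text_block` might look like. The signature is taken from the call site above; the merge rules (span the pages, collect the `unique_id` values, concatenate the text without a separator) are assumptions inferred from the JSON example, not the project's code.

```python
from typing import Any


def flush_text_block(
    pending: list[dict[str, Any]],
    semantic_blocks: list[dict[str, Any]],
    block_id: int,
) -> int:
    """Merge buffered text layouts into one `section_text` block.

    Returns the next free block id. Hypothetical sketch: the merge rules
    are assumptions inferred from the example, not the real implementation.
    """
    if not pending:
        return block_id  # nothing buffered, so no id is consumed

    first = pending[0]
    pages = [item["page"] for item in pending]
    semantic_blocks.append(
        {
            "semantic_id": f"semantic-{block_id}",
            "block_type": "section_text",
            "page_start": min(pages),
            "page_end": max(pages),
            # The buffer is flushed at every heading, so all buffered items
            # share the section fields of the first one.
            "section_path": first["section_path"],
            "section_level": first["section_level"],
            "section_title": first["section_title"],
            "source_ids": [item["unique_id"] for item in pending],
            # Assumption: fragments are concatenated without a separator.
            "text": "".join(item["text"] for item in pending),
        }
    )
    return block_id + 1


section = {"section_path": ["1", "1.1"], "section_level": 2, "section_title": "1.1"}
pending = [
    {"page": 2, "text": "A.", "unique_id": "l-text-001", **section},
    {"page": 3, "text": "B.", "unique_id": "l-text-002", **section},
]
blocks: list[dict[str, Any]] = []
next_id = flush_text_block(pending, blocks, 1)
```

The caller in `build_semantic_blocks` resets `pending_text_blocks` itself, which is why this sketch does not clear the buffer.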
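## 3. Where each structure is stored
### 3.1 The original file and intermediate artifacts go to MinIO first
During upload, the pipeline first saves the original file to object storage; once parsing completes, it saves the structured intermediate artifacts as JSON.
The code that saves artifacts is in [services.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/application/documents/services.py:62):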
```python
def _save_parse_artifacts(self, *, doc_id: str, parsed_document: ParsedDocument) -> dict[str, str]:
    prefix = f"{parsed_document.metadata.get('artifact_prefix', 'artifacts').strip('/')}/{doc_id}"
    artifact_payloads = {
        "layouts": parsed_document.raw_layouts,
        "structure_nodes": parsed_document.structure_nodes,
        "semantic_blocks": parsed_document.semantic_blocks,
        "vector_chunks": parsed_document.vector_chunks,
    }
    artifact_keys: dict[str, str] = {}
    for name, payload in artifact_payloads.items():
        object_name = f"{prefix}/{name}.json"
        self.binary_store.save(
            object_name=object_name,
            data=json.dumps(payload, ensure_ascii=False, indent=2).encode("utf-8"),
            content_type="application/json",
            metadata={"doc_id": doc_id, "artifact_type": name},
        )
        artifact_keys[name] = object_name
    return artifact_keys
```
The current default implementation of `DocumentBinaryStore` is `MinioDocumentBinaryStore`, which means:
- the original uploaded file goes to MinIO
- `layouts.json` goes to MinIO
- `structure_nodes.json` goes to MinIO
- `semantic_blocks.json` goes to MinIO
- `vector_chunks.json` goes to MinIO
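To make the resulting object layout concrete, the key derivation in `_save_parse_artifacts()` can be isolated into a small function. This is a sketch for illustration: `artifact_object_names` and `ARTIFACT_NAMES` are hypothetical names introduced here, and only the path logic mirrors the excerpt above.

```python
ARTIFACT_NAMES = ("layouts", "structure_nodes", "semantic_blocks", "vector_chunks")


def artifact_object_names(doc_id: str, artifact_prefix: str = "artifacts") -> dict[str, str]:
    """Return the object name of each artifact JSON for one document.

    Hypothetical helper: it performs no I/O and only reproduces the
    `{prefix}/{doc_id}/{name}.json` naming used in the excerpt.
    """
    prefix = f"{artifact_prefix.strip('/')}/{doc_id}"
    return {name: f"{prefix}/{name}.json" for name in ARTIFACT_NAMES}


default_keys = artifact_object_names("doc-001")
custom_keys = artifact_object_names("doc-001", artifact_prefix="/parsed/")
```

With the default prefix, the layouts of `doc-001` land at `artifacts/doc-001/layouts.json`. Because the key contains only the `doc_id`, a later run for the same document writes to the same object names.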
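### 3.2 Summary of PostgreSQL's role in the pipeline
In the current pipeline, PostgreSQL holds document metadata and structured snapshots; it is not used for vectors or large objects:
- `documents` stores the current document master record, status, statistics, and index information
- `structure_nodes` stores the heading structure of the latest parse snapshot
- `semantic_blocks` stores the semantic block structure of the latest parse snapshot
The complete PostgreSQL design, which covers:
- `documents`
- `document_processing_runs`
- `document_status_history`
- `document_artifacts`
- `structure_nodes`
- `semantic_blocks`
is described in [document-processing-database-design.md](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/docs/architecture/document-processing-database-design.md:1).
### 3.3 Storage responsibilities at a glance
| Data layer | Stored in | Used directly for embedding | Ends up in Milvus | Main purpose |
| --- | --- | --- | --- | --- |
| Original file | MinIO | No | No | Keeps the original upload |
| `raw_layouts` | MinIO `layouts.json` | No | No | Keeps the parser's raw response |
| `structure_nodes` | MinIO + PostgreSQL | No | No | Heading tree and hierarchy |
| `semantic_blocks` | MinIO + PostgreSQL | No | Indirectly | Semantic units, intermediate layer |
| `vector_chunks` | MinIO | Yes | Indirectly | Retrieval chunks before embedding |
| `Chunk` | In memory + Milvus | Yes | Yes | Unified model for vector ingestion |
| `documents` metadata | PostgreSQL | No | No | Processing status, statistics, index information |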
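## 4. Which step actually produces vectors
This is the most important point in the whole pipeline.
The conclusion up front:
- parsing does not vectorize
- saving artifacts does not vectorize
- `ChunkBuilder.build()` does not vectorize either
- only `EmbeddingProvider.embed_texts()` actually calls the embedding model
- only `VectorIndex.upsert()` actually writes vectors to the vector store
### 4.1 `vector_chunks` are first mapped to the unified `Chunk`
`AliyunVectorChunkBuilder` does no embedding. It only converts `ParsedDocument.vector_chunks` into the domain layer's unified `Chunk` model.
The code is in [vector_chunk_builder.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/parser/vector_chunk_builder.py:12):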
```python
def build(
    self,
    *,
    parsed_document: ParsedDocument,
    regulation_type: str,
    version: str,
) -> list[Chunk]:
    chunks: list[Chunk] = []
    for index, item in enumerate(parsed_document.vector_chunks):
        content = item.get("content") or item.get("text") or ""
        embedding_text = item.get("embedding_text") or content
        if not embedding_text.strip():
            continue
        section_path = item.get("section_path") or []
        section_title = item.get("section_title") or (section_path[-1] if section_path else "")
        page_number = item.get("page_start") or item.get("page") or 0
        chunk_id = item.get("chunk_id") or f"{parsed_document.doc_id}-chunk-{index}"
        metadata = {k: v for k, v in item.items() if k not in {"content", "embedding_text"}}
        chunks.append(
            Chunk(
                chunk_id=str(chunk_id),
                doc_id=parsed_document.doc_id,
                doc_name=parsed_document.doc_name,
                content=content,
                embedding_text=embedding_text,
                section_title=section_title,
                section_path=section_path,
                page_number=int(page_number or 0),
                regulation_type=regulation_type,
                version=version,
                semantic_id=item.get("semantic_id", ""),
                block_type=item.get("block_type", ""),
                metadata=metadata,
            )
        )
    return chunks
```
`Chunk` is defined in [models.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/domain/documents/models.py:63):
```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Chunk:
    chunk_id: str
    doc_id: str
    doc_name: str
    content: str
    embedding_text: str
    section_title: str = ""
    section_path: list[str] = field(default_factory=list)
    page_number: int = 0
    regulation_type: str = ""
    version: str = ""
    semantic_id: str = ""
    block_type: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)
```
A typical `Chunk` looks like this:
```json
{
  "chunk_id": "doc-001-chunk-1",
  "doc_id": "doc-001",
  "doc_name": "动力电池安全规范",
  "content": "本标准适用于道路车辆动力电池系统的安全要求。企业应建立一致的测试和验证方法。",
  "embedding_text": "标准：动力电池安全规范\n章节：1 范围 > 1.1 适用对象\n\n本标准适用于道路车辆动力电池系统的安全要求。企业应建立一致的测试和验证方法。",
  "section_title": "1.1 适用对象",
  "section_path": ["1 范围", "1.1 适用对象"],
  "page_number": 2,
  "regulation_type": "GB",
  "version": "2025",
  "semantic_id": "semantic-1",
  "block_type": "section_text",
  "metadata": {
    "chunk_index": 1,
    "piece_index": 1,
    "source_ids": ["l-text-001", "l-text-002"]
  }
}
```
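The key here is to distinguish two fields:
- `content`
  - the text displayed after a retrieval hit
  - closer to the body fragment the user ultimately sees
- `embedding_text`
  - the text sent to the embedding model
  - adds "standard name + section path" context on top of `content`
So the vectorization input is not the bare body text but a context-enriched version of it.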
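The relationship between the two fields can be sketched as a small function. This is a hypothetical helper, not the project's `build_vector_chunks`: the name `build_embedding_text`, the label strings, and the exact line layout are assumptions inferred from the `Chunk` example.

```python
def build_embedding_text(doc_title: str, section_path: list[str], content: str) -> str:
    """Prefix the body text with document and section context.

    Hypothetical sketch: labels and layout are inferred from the example
    `Chunk`, so the real format may differ.
    """
    header_lines = [f"标准：{doc_title}"]
    if section_path:
        # Join the heading trail from outermost to innermost section.
        header_lines.append("章节：" + " > ".join(section_path))
    # A blank line separates the context header from the body text.
    return "\n".join(header_lines) + "\n\n" + content


content = "本标准适用于道路车辆动力电池系统的安全要求。"
embedding_text = build_embedding_text(
    "动力电池安全规范",
    ["1 范围", "1.1 适用对象"],
    content,
)
```

Only `embedding_text` would be sent to the embedding model; `content` stays unchanged so that a retrieval hit can be displayed without the added header.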
### 4.2 Where the embedding API is actually called
Text is actually turned into vectors by `OpenAICompatibleEmbeddingProvider.embed_texts()`.
The code is in [openai_compatible_embedding_provider.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/embedding/openai_compatible_embedding_provider.py:64):
```python
def embed_texts(self, texts: list[str]) -> list[list[float]]:
    if not texts:
        return []
    # The remainder of the method is not shown in this excerpt.
```
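In other words, this is the only step where:
- input: a `list[str]` of `embedding_text` values
- output: dense vectors as `list[list[float]]`
Parsing, normalization, and chunk building before it only prepare text; none of them produces vector values.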
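When exercising the steps around this call, the contract above (a list of strings in, one vector per string out, in the same order) can be stood in for without calling a model. The class below is a hypothetical test double, not part of the project: `FakeEmbeddingProvider` and its `dim` parameter are assumptions, and the vectors it returns carry no semantic meaning.

```python
import hashlib


class FakeEmbeddingProvider:
    """Stand-in for an embedding provider, for illustration and tests only.

    It calls no model: each vector is derived from a SHA-256 hash of the
    text. `dim` is an assumed parameter and must not exceed 32, the
    digest length in bytes.
    """

    def __init__(self, dim: int = 8) -> None:
        if not 1 <= dim <= 32:
            raise ValueError("dim must be between 1 and 32")
        self.dim = dim

    def embed_texts(self, texts: list[str]) -> list[list[float]]:
        if not texts:
            return []  # same guard as the method excerpt above
        vectors: list[list[float]] = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Map the first `dim` bytes to floats in the range [0, 1].
            vectors.append([byte / 255 for byte in digest[: self.dim]])
        return vectors


provider = FakeEmbeddingProvider(dim=8)
vectors = provider.embed_texts(["first chunk", "second chunk"])
```

One vector per input, in input order, is what lets `upsert(chunks, vectors)` pair each `Chunk` with its vector by position.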
### 4.3 Where vectors are actually written to Milvus
Only after the vector values exist does `MilvusVectorIndex.upsert()` write `Chunk + vector` to the vector store.
The code is in [milvus_vector_index.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/vectorstore/milvus_vector_index.py:69):
```python
def upsert(self, chunks: list[Chunk], vectors: list[list[float]]) -> int:
    if len(chunks) != len(vectors):
        raise ValueError("chunks 与 vectors 数量不一致")
    data = []
    now = int(time.time())
    for chunk, vector in zip(chunks, vectors):
        data.append(
            {
                "id": chunk.chunk_id,
                "doc_id": chunk.doc_id,
                "doc_name": chunk.doc_name,
                "content": chunk.content[:65535],
                "embedding": vector,
                "section_title": chunk.section_title[:512],
                "section_path": json.dumps(chunk.section_path, ensure_ascii=False)[:4096],
                "page_number": chunk.page_number,
                "regulation_type": chunk.regulation_type[:128],
                "version": chunk.version[:64],
                "semantic_id": chunk.semantic_id[:128],
                "block_type": chunk.block_type[:64],
                "metadata_json": json.dumps(chunk.metadata, ensure_ascii=False)[:65535],
                "created_at": now,
            }
        )
    self.collection.insert(data)
    self.collection.flush()
    return len(data)
```
In other words, what Milvus finally stores is:
- primary key: `id`, holding the `chunk_id`
- document-level fields: `doc_id`, `doc_name`
- display field for retrieval: `content`
- vector field: `embedding`
- filter and traceability fields: `section_title`, `section_path`, `page_number`, `regulation_type`, `version`, `semantic_id`, `block_type`
- extra metadata: `metadata_json`
## 5. What Milvus ultimately stores
### 5.1 Collection schema
The schema that `MilvusVectorIndex` defines when initializing the collection is in [milvus_vector_index.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/vectorstore/milvus_vector_index.py:37):
```python
schema = CollectionSchema(
    fields=[
        FieldSchema(name="id", dtype=DataType.VARCHAR, max_length=128, is_primary=True, auto_id=False),
        FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=64),
        FieldSchema(name="doc_name", dtype=DataType.VARCHAR, max_length=256),
        FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535),
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=settings.embedding_dim),
        FieldSchema(name="section_title", dtype=DataType.VARCHAR, max_length=512),
        FieldSchema(name="section_path", dtype=DataType.VARCHAR, max_length=4096),
        FieldSchema(name="page_number", dtype=DataType.INT64),
        FieldSchema(name="regulation_type", dtype=DataType.VARCHAR, max_length=128),
        FieldSchema(name="version", dtype=DataType.VARCHAR, max_length=64),
        FieldSchema(name="semantic_id", dtype=DataType.VARCHAR, max_length=128),
        FieldSchema(name="block_type", dtype=DataType.VARCHAR, max_length=64),
        FieldSchema(name="metadata_json", dtype=DataType.VARCHAR, max_length=65535),
        FieldSchema(name="created_at", dtype=DataType.INT64),
    ],
    description="Dense-only regulations index",
    enable_dynamic_field=False,
)
```
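This shows that Milvus does not store a minimal embedding-only vector table. It stores:
- one dense vector
- a set of structured fields to return or filter on at retrieval time
Note, however, that this does not make Milvus the system of record for business data. It still mainly serves retrieval and does not replace PostgreSQL's document management role.
### 5.2 Why `list_documents()` looks at Milvus first
The document list query is implemented in [services.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/application/documents/services.py:271). It:
1. queries Milvus for the documents that actually have vectors
2. loads document records from the document metadata repository
3. merges the two, treating Milvus as the source of truth for index state
The reason is not that Milvus replaces PostgreSQL, but that:
- whether the `indexed` status really holds depends on whether Milvus contains the corresponding chunks
- download, delete, retry, file location, and error information still depend on the document metadata repository
So:
- Milvus is the source of truth for the index
- PostgreSQL/JSON is the source of truth for document metadata
The two have different responsibilities and cannot replace each other.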
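The merge described in this section can be sketched as a pure function over the two inputs. This is not the project's `list_documents()`: the function name, the `index_state` field, and its values are assumptions made for illustration, since the actual merge rules are not shown in this document.

```python
from typing import Any


def merge_document_views(
    indexed_doc_ids: set[str],
    records: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Annotate metadata records with what the vector index actually holds.

    Hypothetical sketch: `index_state` and its values are assumptions,
    not the project's merge rules.
    """
    merged: list[dict[str, Any]] = []
    known_ids: set[str] = set()
    for record in records:
        doc_id = record["doc_id"]
        known_ids.add(doc_id)
        row = dict(record)  # copy so the caller's record is not mutated
        if doc_id in indexed_doc_ids:
            row["index_state"] = "present"
        elif record.get("status") == "indexed":
            # Metadata claims `indexed`, but the index has no chunks for it.
            row["index_state"] = "missing_vectors"
        else:
            row["index_state"] = "absent"
        merged.append(row)
    # Vectors whose document has no metadata record at all.
    for doc_id in sorted(indexed_doc_ids - known_ids):
        merged.append({"doc_id": doc_id, "index_state": "orphan_vectors"})
    return merged


rows = merge_document_views(
    {"doc-001", "doc-003"},
    [
        {"doc_id": "doc-001", "status": "indexed"},
        {"doc_id": "doc-002", "status": "indexed"},
        {"doc_id": "doc-004", "status": "parsed"},
    ],
)
```

Keeping the index view as a separate annotation, instead of overwriting `status`, leaves the metadata record intact for download, retry, and deletion.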


@@ -0,0 +1,508 @@
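# Database Design for the Document Processing Pipeline
## 1. Purpose
This document defines the PostgreSQL database design for the current document processing pipeline. It covers the core path of upload, parsing, indexing, status queries, retry, and deletion, plus the common operations and audit needs around that path.
This document does not aim to replace the flow description in [document-core-processing-flow.md](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/docs/architecture/document-core-processing-flow.md:1). It fills in the authoritative definition of relational storage, so that the later switch from JSON metadata to PostgreSQL has a clear, stable, and implementable design baseline.
### 1.1 Scope And Design Target
This document covers only:
- the document master record
- document processing run records
- document status history
- references to parse artifacts
- the current (latest) structured parse snapshot
This document does not cover:
- Agent sessions
- feedback and manual review
- compliance analysis tasks
- the detailed implementation of the Milvus collection schema
The design principle is `Compat First`:
- stay compatible with the current main flow of `DocumentRepository` / `ParseArtifactStore`
- add relational tables to fill operations and audit gaps
- do not force large-scale interface rewrites for the sake of an idealized model
## 2. Storage Responsibilities
The system uses three kinds of storage, and their responsibilities must stay clearly separated:
| Storage | Contents | System of record | Notes |
| --- | --- | --- | --- |
| MinIO | Original file, `layouts.json`, `structure_nodes.json`, `semantic_blocks.json`, `vector_chunks.json` | No | Handles large objects and artifact archiving; not used for relational queries |
| Milvus | Chunk-level vectors and retrieval helper fields | No | Handles vector retrieval; does not manage the document lifecycle |
| PostgreSQL | Document metadata, processing status, structured snapshots, processing history, artifact references | Yes | Handles document management, operational observability, and relational queries |
Constraints:
- PostgreSQL does not store embedding vectors.
- PostgreSQL does not add a content table for `vector_chunks`.
- Milvus may store retrieval helper fields such as `doc_id`, `doc_name`, `regulation_type`, and `version`, but it is not the business source of truth.
- Document download, deletion, and retry still start from the document master record in PostgreSQL.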
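## 3. Design Overview
### 3.1 Entity Responsibilities
The database uses a layered model of current master record + current snapshot + historical process:
- `documents`
  - the current document master record
  - stores the metadata and current status used directly by management, download, retry, and deletion
- `document_processing_runs`
  - one processing run per upload or retry
  - stores run-level statistics, stage timestamps, and failure information
- `document_status_history`
  - an append-only stream of status events
  - stores the context of each status change
- `document_artifacts`
  - stores references to MinIO artifacts
  - does not store the artifact content itself
- `structure_nodes`
  - the heading structure of the latest parse snapshot
- `semantic_blocks`
  - the semantic block structure of the latest parse snapshot
### 3.2 Current Snapshot Vs Historical Records
This design explicitly distinguishes two kinds of data:
- current snapshot
  - `documents`
  - `structure_nodes`
  - `semantic_blocks`
- historical process
  - `document_processing_runs`
  - `document_status_history`
  - `document_artifacts`
Specifically:
- `structure_nodes` and `semantic_blocks` store only the current snapshot from the latest successful parse
- tracing historical versions relies on `document_processing_runs`, `document_artifacts`, and the artifact files of the corresponding run in MinIO
## 4. Table Design
### 4.1 `documents`
Purpose:
- serves as the master record table for the document lifecycle
- provides the current-state truth for download, deletion, retry, the management list, and status queries
Columns: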
```sql
CREATE TABLE IF NOT EXISTS documents (
doc_id VARCHAR(128) PRIMARY KEY,
doc_name VARCHAR(512) NOT NULL DEFAULT '',
file_name VARCHAR(512) NOT NULL DEFAULT '',
object_name VARCHAR(1024) NOT NULL DEFAULT '',
content_type VARCHAR(128) NOT NULL DEFAULT '',
size_bytes BIGINT NOT NULL DEFAULT 0,
status VARCHAR(32) NOT NULL DEFAULT 'pending',
regulation_type VARCHAR(128) NOT NULL DEFAULT '',
version VARCHAR(64) NOT NULL DEFAULT '',
summary TEXT NOT NULL DEFAULT '',
summary_latency_ms INTEGER NOT NULL DEFAULT 0,
chunk_count INTEGER NOT NULL DEFAULT 0,
parser_name VARCHAR(128) NOT NULL DEFAULT '',
index_name VARCHAR(128) NOT NULL DEFAULT '',
error_message TEXT NOT NULL DEFAULT '',
metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT chk_documents_status
CHECK (status IN ('pending', 'stored', 'parsed', 'indexed', 'failed'))
);
CREATE INDEX IF NOT EXISTS idx_documents_status_updated_at
ON documents(status, updated_at DESC);
CREATE INDEX IF NOT EXISTS idx_documents_regulation_version
ON documents(regulation_type, version);
CREATE INDEX IF NOT EXISTS idx_documents_updated_at
ON documents(updated_at DESC);
```
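Column notes:
- `object_name`
  - object path of the original upload in MinIO
  - the current implementation relies on this field for download, retry, and deletion; v1 does not split it into a separate file table
- `status`
  - current processing status of the document
  - represents the current state only and carries no audit history
- `metadata`
  - holds lightweight, frequently changing extra information that is not yet worth modeling as columns
  - typical contents include `parse_task_id`, `processing_stage`, `artifact_keys`, and statistics counters
### 4.2 `document_processing_runs`
Purpose:
- records one complete processing run for an upload or a retry
- explains why this particular run of the document succeeded or failed
Columns: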
```sql
CREATE TABLE IF NOT EXISTS document_processing_runs (
run_id BIGSERIAL PRIMARY KEY,
doc_id VARCHAR(128) NOT NULL,
trigger_type VARCHAR(16) NOT NULL,
run_status VARCHAR(16) NOT NULL,
parser_backend VARCHAR(64) NOT NULL DEFAULT '',
chunk_backend VARCHAR(64) NOT NULL DEFAULT '',
embedding_model VARCHAR(128) NOT NULL DEFAULT '',
index_name VARCHAR(128) NOT NULL DEFAULT '',
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
stored_at TIMESTAMPTZ,
parsed_at TIMESTAMPTZ,
indexed_at TIMESTAMPTZ,
finished_at TIMESTAMPTZ,
layout_count INTEGER NOT NULL DEFAULT 0,
structure_node_count INTEGER NOT NULL DEFAULT 0,
semantic_block_count INTEGER NOT NULL DEFAULT 0,
vector_chunk_count INTEGER NOT NULL DEFAULT 0,
chunk_count INTEGER NOT NULL DEFAULT 0,
failure_stage VARCHAR(32) NOT NULL DEFAULT '',
error_message TEXT NOT NULL DEFAULT '',
metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
CONSTRAINT fk_runs_document
FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE,
CONSTRAINT chk_runs_trigger_type
CHECK (trigger_type IN ('upload', 'retry')),
CONSTRAINT chk_runs_status
CHECK (run_status IN ('running', 'succeeded', 'failed'))
);
CREATE INDEX IF NOT EXISTS idx_runs_doc_started_at
ON document_processing_runs(doc_id, started_at DESC);
CREATE INDEX IF NOT EXISTS idx_runs_status_started_at
ON document_processing_runs(run_status, started_at DESC);
```
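Column notes:
- `trigger_type`
  - indicates whether the run was triggered by the first upload or by a retry
- `run_status`
  - describes the outcome of this run only: `running` while in progress, then `succeeded` or `failed`
- `failure_stage`
  - values should match the application's key stages, for example `store`, `parse`, `artifact_persist`, `embed`, `index`
- `metadata`
  - holds run-level context such as a configuration snapshot, backend implementation names, and a summary of provider responses
### 4.3 `document_status_history`
Purpose:
- stores the stream of status change events
- supports troubleshooting, auditing, and run trace analysis
Columns: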
```sql
CREATE TABLE IF NOT EXISTS document_status_history (
event_id BIGSERIAL PRIMARY KEY,
doc_id VARCHAR(128) NOT NULL,
run_id BIGINT,
from_status VARCHAR(32) NOT NULL DEFAULT '',
to_status VARCHAR(32) NOT NULL,
stage VARCHAR(32) NOT NULL DEFAULT '',
message TEXT NOT NULL DEFAULT '',
metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT fk_status_document
FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE,
CONSTRAINT fk_status_run
FOREIGN KEY (run_id) REFERENCES document_processing_runs(run_id) ON DELETE CASCADE,
CONSTRAINT chk_status_history_to_status
CHECK (to_status IN ('pending', 'stored', 'parsed', 'indexed', 'failed'))
);
CREATE INDEX IF NOT EXISTS idx_status_history_doc_occurred_at
ON document_status_history(doc_id, occurred_at DESC);
CREATE INDEX IF NOT EXISTS idx_status_history_run_occurred_at
ON document_status_history(run_id, occurred_at DESC);
```
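Column notes:
- `from_status` may be an empty string
  - used for the first event, for example when a newly created document enters `pending`
- `stage`
  - records the business stage that the status change corresponds to
- `message`
  - records a human-readable note for troubleshooting
### 4.4 `document_artifacts`
Purpose:
- stores the location and basic attributes of parse artifacts in MinIO
- lets later lookups find the artifacts of a given run without scanning object storage
Columns: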
```sql
CREATE TABLE IF NOT EXISTS document_artifacts (
artifact_id BIGSERIAL PRIMARY KEY,
doc_id VARCHAR(128) NOT NULL,
run_id BIGINT,
artifact_type VARCHAR(32) NOT NULL,
object_name VARCHAR(1024) NOT NULL,
content_type VARCHAR(128) NOT NULL DEFAULT 'application/json',
byte_size BIGINT NOT NULL DEFAULT 0,
checksum VARCHAR(128) NOT NULL DEFAULT '',
metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT fk_artifacts_document
FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE,
CONSTRAINT fk_artifacts_run
FOREIGN KEY (run_id) REFERENCES document_processing_runs(run_id) ON DELETE CASCADE,
CONSTRAINT chk_artifact_type
CHECK (artifact_type IN ('layouts', 'structure_nodes', 'semantic_blocks', 'vector_chunks'))
);
CREATE INDEX IF NOT EXISTS idx_artifacts_doc_created_at
ON document_artifacts(doc_id, created_at DESC);
CREATE INDEX IF NOT EXISTS idx_artifacts_run_type
ON document_artifacts(run_id, artifact_type);
CREATE UNIQUE INDEX IF NOT EXISTS uq_artifacts_run_type_object
ON document_artifacts(run_id, artifact_type, object_name);
```
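Notes:
- this table records artifact references only, not the original file
- the original file is still represented by `documents.object_name`, which keeps the current download and retry logic compatible
### 4.5 `structure_nodes`
Purpose:
- stores the heading hierarchy of the latest parse snapshot
- used for table-of-contents queries, structured browsing, debugging, and auditing
Columns: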
```sql
CREATE TABLE IF NOT EXISTS structure_nodes (
id BIGSERIAL PRIMARY KEY,
doc_id VARCHAR(128) NOT NULL,
unique_id VARCHAR(128),
page INTEGER NOT NULL DEFAULT 0,
idx INTEGER NOT NULL DEFAULT 0,
level INTEGER NOT NULL DEFAULT 0,
title TEXT NOT NULL DEFAULT '',
type VARCHAR(64),
sub_type VARCHAR(64),
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT fk_structure_nodes_document
FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE
);
CREATE INDEX IF NOT EXISTS idx_structure_nodes_doc_idx
ON structure_nodes(doc_id, idx);
CREATE INDEX IF NOT EXISTS idx_structure_nodes_doc_level
ON structure_nodes(doc_id, level);
```
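Design constraints:
- this table represents the current snapshot and does not model multiple versions
- a new successful parse overwrites the old snapshot for the same `doc_id`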
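The overwrite rule can be sketched as a delete-then-insert inside one transaction. The example below is an assumption-laden illustration: it uses an in-memory SQLite database from the Python standard library in place of PostgreSQL, a reduced column set, and a hypothetical function name `replace_structure_nodes`. Note that the JSON key `index` maps to the column `idx`.

```python
import sqlite3
from typing import Any


def replace_structure_nodes(
    conn: sqlite3.Connection, doc_id: str, nodes: list[dict[str, Any]]
) -> None:
    """Overwrite the current snapshot of one document atomically.

    Hypothetical sketch: SQLite stands in for PostgreSQL here.
    """
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("DELETE FROM structure_nodes WHERE doc_id = ?", (doc_id,))
        conn.executemany(
            "INSERT INTO structure_nodes (doc_id, unique_id, page, idx, level, title) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            [
                (
                    doc_id,
                    node.get("unique_id"),
                    node.get("page", 0),
                    node.get("index", 0),  # JSON `index` is stored as `idx`
                    node.get("level", 0),
                    node.get("title", ""),
                )
                for node in nodes
            ],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE structure_nodes ("
    "id INTEGER PRIMARY KEY, doc_id TEXT NOT NULL, unique_id TEXT, "
    "page INTEGER, idx INTEGER, level INTEGER, title TEXT)"
)
replace_structure_nodes(conn, "doc-001", [{"unique_id": "a", "index": 1, "level": 1, "title": "1 Scope"}])
replace_structure_nodes(conn, "doc-001", [{"unique_id": "b", "index": 1, "level": 1, "title": "1 Scope (v2)"}])
snapshot = conn.execute(
    "SELECT unique_id, title FROM structure_nodes WHERE doc_id = ?", ("doc-001",)
).fetchall()
```

Doing both statements in one transaction means a failed insert leaves the previous snapshot in place instead of an empty table.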
### 4.6 `semantic_blocks`
Purpose:
- stores the semantic blocks of the latest parse snapshot
- used for structure tracing, debugging, and later relational queries
Columns:
```sql
CREATE TABLE IF NOT EXISTS semantic_blocks (
id BIGSERIAL PRIMARY KEY,
doc_id VARCHAR(128) NOT NULL,
semantic_id VARCHAR(128) NOT NULL,
block_type VARCHAR(64) NOT NULL DEFAULT '',
page_start INTEGER NOT NULL DEFAULT 0,
page_end INTEGER NOT NULL DEFAULT 0,
section_path JSONB NOT NULL DEFAULT '[]'::jsonb,
section_level INTEGER NOT NULL DEFAULT 0,
section_title VARCHAR(512) NOT NULL DEFAULT '',
source_ids JSONB NOT NULL DEFAULT '[]'::jsonb,
text TEXT NOT NULL DEFAULT '',
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT fk_semantic_blocks_document
FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE,
CONSTRAINT uq_semantic_blocks_doc_semantic
UNIQUE (doc_id, semantic_id)
);
CREATE INDEX IF NOT EXISTS idx_semantic_blocks_doc_id
ON semantic_blocks(doc_id);
CREATE INDEX IF NOT EXISTS idx_semantic_blocks_doc_section_title
ON semantic_blocks(doc_id, section_title);
CREATE INDEX IF NOT EXISTS idx_semantic_blocks_doc_block_type
ON semantic_blocks(doc_id, block_type);
```
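Design constraints:
- this table represents the current snapshot and does not store historical versions
- historical tracing should go through the artifact files of the corresponding run
## 5. Relationship Model
The entity relationships are: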
```mermaid
erDiagram
documents ||--o{ document_processing_runs : has
documents ||--o{ document_status_history : has
documents ||--o{ document_artifacts : has
documents ||--o{ structure_nodes : has
documents ||--o{ semantic_blocks : has
document_processing_runs ||--o{ document_status_history : emits
document_processing_runs ||--o{ document_artifacts : produces
```
Relationship semantics:
- `documents` is the aggregate root
- `document_processing_runs` records one complete processing attempt
- `document_status_history` records the trail of status changes
- `document_artifacts` records the replayable structured artifacts in MinIO
- `structure_nodes` / `semantic_blocks` are the relational snapshot of the current version
## 6. Flow-To-Table Mapping
### 6.1 Upload
When an upload starts:
1. Create the `documents` row
2. Create a `document_processing_runs` row
3. Write a `document_status_history` row with `to_status='pending'`
### 6.2 Store Original File
After the original file is written to MinIO:
1. Update `documents.status='stored'`
2. Update `stored_at` on the current run
3. Append to `document_status_history`
### 6.3 Parse And Persist Artifacts
After parsing succeeds:
1. Update `parsed_at` on the current run
2. Update the run's `layout_count`, `structure_node_count`, `semantic_block_count`, and `vector_chunk_count`
3. Update `documents.status='parsed'`
4. Refresh `structure_nodes`
5. Refresh `semantic_blocks`
6. Record `layouts`, `structure_nodes`, `semantic_blocks`, and `vector_chunks` in `document_artifacts`
7. Append to `document_status_history`
### 6.4 Embed And Index
After embedding and indexing succeed:
1. Update `indexed_at` and `finished_at` on the current run
2. Set `run_status='succeeded'` on the current run
3. Update `documents.status='indexed'`
4. Update `documents.chunk_count` and `index_name`
5. Append to `document_status_history`
### 6.5 Failure
When any stage fails:
1. Set `run_status='failed'` on the current run
2. Record `failure_stage` and `error_message`
3. Update `finished_at`
4. Update `documents.status='failed'`
5. Update `documents.error_message`
6. Append to `document_status_history`
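The bookkeeping in sections 6.1 to 6.5 can be sketched with an in-memory stand-in for the three tables involved. `RunRecorder` is a hypothetical class introduced only for illustration; the status and stage values follow this document, except the `create` stage of the first event, which is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecorder:
    """In-memory stand-in for `documents`, one run, and the status history.

    Hypothetical sketch: it models the state changes only, not persistence.
    """

    doc_status: str = "pending"
    run_status: str = "running"
    failure_stage: str = ""
    error_message: str = ""
    # Each event is (from_status, to_status, stage).
    history: list[tuple[str, str, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # First event: `from_status` is empty. The stage name is assumed.
        self.history.append(("", "pending", "create"))

    def advance(self, to_status: str, stage: str) -> None:
        self.history.append((self.doc_status, to_status, stage))
        self.doc_status = to_status
        if to_status == "indexed":
            self.run_status = "succeeded"

    def fail(self, stage: str, message: str) -> None:
        self.history.append((self.doc_status, "failed", stage))
        self.doc_status = "failed"
        self.run_status = "failed"
        self.failure_stage = stage
        self.error_message = message


recorder = RunRecorder()
recorder.advance("stored", "store")
recorder.advance("parsed", "parse")
recorder.fail("embed", "embedding request timed out")
```

The history keeps every transition, while `doc_status` and `run_status` hold only the latest state, which mirrors the split between `document_status_history` and the current-state columns.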
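### 6.6 Retry
On retry:
1. Keep the existing `documents.doc_id`
2. Create a new `document_processing_runs` row
3. Write status history again for this retry
4. After the retry succeeds, overwrite the current `structure_nodes` / `semantic_blocks` snapshot
5. Keep historical run and artifact records
### 6.7 Delete
When deleting a document:
1. The application layer first deletes the original file and artifacts from MinIO
2. The application layer deletes the vectors associated with the `doc_id` from Milvus
3. Finally, delete the `documents` row
4. Rely on the `ON DELETE CASCADE` foreign keys to clean up runs, status history, artifacts, structure nodes, and semantic blocks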
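The deletion order can be sketched with three interchangeable stores. Everything named here is hypothetical: `DocScopedStore`, `delete_by_doc_id`, `delete_document`, and `RecordingStore` are illustration-only names, not the project's interfaces. One reading of the ordering is that the master record goes last, so a failure in MinIO or Milvus leaves the `documents` row in place and the deletion can be attempted again.

```python
from typing import Protocol


class DocScopedStore(Protocol):
    """Hypothetical interface: anything that can drop data for one document."""

    def delete_by_doc_id(self, doc_id: str) -> None: ...


def delete_document(
    doc_id: str,
    binary_store: DocScopedStore,
    vector_index: DocScopedStore,
    repository: DocScopedStore,
) -> None:
    # External stores first; the master record is removed last, so an
    # exception above this line leaves the record available for a retry.
    binary_store.delete_by_doc_id(doc_id)
    vector_index.delete_by_doc_id(doc_id)
    repository.delete_by_doc_id(doc_id)


class RecordingStore:
    """Test double that records the call order and can simulate a failure."""

    def __init__(self, name: str, log: list[str], fail: bool = False) -> None:
        self.name = name
        self.log = log
        self.fail = fail

    def delete_by_doc_id(self, doc_id: str) -> None:
        if self.fail:
            raise RuntimeError(f"{self.name} unavailable")
        self.log.append(self.name)


log: list[str] = []
delete_document(
    "doc-001",
    RecordingStore("minio", log),
    RecordingStore("milvus", log),
    RecordingStore("postgres", log),
)
```

In the relational database, the final step is a single delete on `documents`; the dependent rows are removed by the cascading foreign keys described in step 4.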
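## 7. Alignment With Current Backend
### 7.1 Compatible Parts
The current code is already compatible with the following parts of the design:
- `documents`
- `structure_nodes`
- `semantic_blocks`
- overwrite-style updates of the current snapshot
- `doc_id` as the single correlation key across MinIO / Milvus / PostgreSQL
### 7.2 Required Future Additions
If PostgreSQL later becomes the default metadata backend, the following internal stores or repositories should be added:
- `DocumentProcessingRunStore`
- `DocumentStatusEventStore`
- `DocumentArtifactStore`
These additions are internal enhancements and do not require changes to the existing HTTP API.
### 7.3 Migration Guidance
When switching from the current JSON metadata to PostgreSQL, the suggested order is:
1. Migrate the existing document master records in `documents.json` to `documents`
2. Switch `DOCUMENT_REPOSITORY_BACKEND` to `postgres`
3. Start writing run / status history / artifact records for newly uploaded or retried documents
4. Historical documents that lack run-level data may leave it empty; this does not block the switch
## 8. Non-Goals
The following are out of scope for v1 of this design:
- replacing Milvus with PostgreSQL vector capabilities
- storing vector fields in PostgreSQL
- creating a separate relational table for `vector_chunks`
- building a historical version store for `structure_nodes` / `semantic_blocks`
- abstracting the original file into a separate `document_files` table
These may be discussed in a later refactoring, but they should not affect the current switch of the main pipeline or compatibility with the existing application layer.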

frontend/.env Normal file

@@ -0,0 +1,2 @@
VITE_API_PROXY_TARGET=http://6.86.80.8:8000
FRONTEND_PORT=5173


@@ -0,0 +1,2 @@
VITE_API_PROXY_TARGET=http://127.0.0.1:8000
FRONTEND_PORT=5173

frontend/.env.example Normal file

@@ -0,0 +1,2 @@
VITE_API_PROXY_TARGET=http://127.0.0.1:8000
FRONTEND_PORT=5173


@@ -49,6 +49,12 @@ npm run dev
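Starts the local development server, available at `http://localhost:5173` by default.
Frontend environment files follow these conventions:
- `frontend/.env.development`: local development, proxies to `http://127.0.0.1:8000` by default
- `frontend/.env.production`: production build, proxies to `http://6.86.80.8:8000` by default
- `frontend/.env.local`: temporary overrides for the local machine; takes precedence over the two above
### Build for production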
```bash

frontend/components.json Normal file

@@ -0,0 +1,25 @@
{
"$schema": "https://ui.shadcn.com/schema.json",
"style": "radix-nova",
"rsc": false,
"tsx": true,
"tailwind": {
"config": "tailwind.config.js",
"css": "src/styles/globals.css",
"baseColor": "neutral",
"cssVariables": true,
"prefix": ""
},
"iconLibrary": "lucide",
"rtl": false,
"aliases": {
"components": "@/components",
"utils": "@/lib/utils",
"ui": "@/components/shadcn/ui",
"lib": "@/lib",
"hooks": "@/hooks"
},
"menuColor": "default",
"menuAccent": "subtle",
"registries": {}
}


@@ -0,0 +1,67 @@
import * as React from "react"
import { cva, type VariantProps } from "class-variance-authority"
import { Slot } from "radix-ui"
import { cn } from "@/lib/utils"
const buttonVariants = cva(
"group/button inline-flex shrink-0 items-center justify-center rounded-lg border border-transparent bg-clip-padding text-sm font-medium whitespace-nowrap transition-all outline-none select-none focus-visible:border-ring focus-visible:ring-3 focus-visible:ring-ring/50 active:not-aria-[haspopup]:translate-y-px disabled:pointer-events-none disabled:opacity-50 aria-invalid:border-destructive aria-invalid:ring-3 aria-invalid:ring-destructive/20 dark:aria-invalid:border-destructive/50 dark:aria-invalid:ring-destructive/40 [&_svg]:pointer-events-none [&_svg]:shrink-0 [&_svg:not([class*='size-'])]:size-4",
{
variants: {
variant: {
default: "bg-primary text-primary-foreground [a]:hover:bg-primary/80",
outline:
"border-border bg-background hover:bg-muted hover:text-foreground aria-expanded:bg-muted aria-expanded:text-foreground dark:border-input dark:bg-input/30 dark:hover:bg-input/50",
secondary:
"bg-secondary text-secondary-foreground hover:bg-secondary/80 aria-expanded:bg-secondary aria-expanded:text-secondary-foreground",
ghost:
"hover:bg-muted hover:text-foreground aria-expanded:bg-muted aria-expanded:text-foreground dark:hover:bg-muted/50",
destructive:
"bg-destructive/10 text-destructive hover:bg-destructive/20 focus-visible:border-destructive/40 focus-visible:ring-destructive/20 dark:bg-destructive/20 dark:hover:bg-destructive/30 dark:focus-visible:ring-destructive/40",
link: "text-primary underline-offset-4 hover:underline",
},
size: {
default:
"h-8 gap-1.5 px-2.5 has-data-[icon=inline-end]:pr-2 has-data-[icon=inline-start]:pl-2",
xs: "h-6 gap-1 rounded-[min(var(--radius-md),10px)] px-2 text-xs in-data-[slot=button-group]:rounded-lg has-data-[icon=inline-end]:pr-1.5 has-data-[icon=inline-start]:pl-1.5 [&_svg:not([class*='size-'])]:size-3",
sm: "h-7 gap-1 rounded-[min(var(--radius-md),12px)] px-2.5 text-[0.8rem] in-data-[slot=button-group]:rounded-lg has-data-[icon=inline-end]:pr-1.5 has-data-[icon=inline-start]:pl-1.5 [&_svg:not([class*='size-'])]:size-3.5",
lg: "h-9 gap-1.5 px-2.5 has-data-[icon=inline-end]:pr-2 has-data-[icon=inline-start]:pl-2",
icon: "size-8",
"icon-xs":
"size-6 rounded-[min(var(--radius-md),10px)] in-data-[slot=button-group]:rounded-lg [&_svg:not([class*='size-'])]:size-3",
"icon-sm":
"size-7 rounded-[min(var(--radius-md),12px)] in-data-[slot=button-group]:rounded-lg",
"icon-lg": "size-9",
},
},
defaultVariants: {
variant: "default",
size: "default",
},
}
)
function Button({
className,
variant = "default",
size = "default",
asChild = false,
...props
}: React.ComponentProps<"button"> &
VariantProps<typeof buttonVariants> & {
asChild?: boolean
}) {
const Comp = asChild ? Slot.Root : "button"
return (
<Comp
data-slot="button"
data-variant={variant}
data-size={size}
className={cn(buttonVariants({ variant, size, className }))}
{...props}
/>
)
}
export { Button }


@@ -9,7 +9,8 @@
"version": "0.0.0",
"dependencies": {
"react": "^19.2.5",
- "react-dom": "^19.2.5"
+ "react-dom": "^19.2.5",
+ "react-router-dom": "^7.9.6"
},
"devDependencies": {
"@eslint/js": "^10.0.1",
@@ -1631,6 +1632,19 @@
"dev": true,
"license": "MIT"
},
"node_modules/cookie": {
"version": "1.1.1",
"resolved": "https://registry.npmjs.org/cookie/-/cookie-1.1.1.tgz",
"integrity": "sha512-ei8Aos7ja0weRpFzJnEA9UHJ/7XQmqglbRwnf2ATjcB9Wq874VKH9kfjjirM6UhU2/E5fFYadylyhFldcqSidQ==",
"license": "MIT",
"engines": {
"node": ">=18"
},
"funding": {
"type": "opencollective",
"url": "https://opencollective.com/express"
}
},
"node_modules/cross-spawn": {
"version": "7.0.6",
"resolved": "https://registry.npmjs.org/cross-spawn/-/cross-spawn-7.0.6.tgz",
@@ -2751,6 +2765,44 @@
"react": "^19.2.5"
}
},
"node_modules/react-router": {
"version": "7.15.1",
"resolved": "https://registry.npmjs.org/react-router/-/react-router-7.15.1.tgz",
"integrity": "sha512-R8rl9HhgikFYoPJymnUtPXWbnDb3oget6lQnfIoupbt61aT9aOhRkDsY2XRhZRyX1Z/8a5sL74fXmFNm3NRK5A==",
"license": "MIT",
"dependencies": {
"cookie": "^1.0.1",
"set-cookie-parser": "^2.6.0"
},
"engines": {
"node": ">=20.0.0"
},
"peerDependencies": {
"react": ">=18",
"react-dom": ">=18"
},
"peerDependenciesMeta": {
"react-dom": {
"optional": true
}
}
},
"node_modules/react-router-dom": {
"version": "7.15.1",
"resolved": "https://registry.npmjs.org/react-router-dom/-/react-router-dom-7.15.1.tgz",
"integrity": "sha512-AzF62gjY6U9rkMq4RfP/r2EVtQ7DMfNMjyOp/flLTCrtRylLiK4wT4pSq6O8rOXZ2eXdZYJPEYe+ifomiv+Igg==",
"license": "MIT",
"dependencies": {
"react-router": "7.15.1"
},
"engines": {
"node": ">=20.0.0"
},
"peerDependencies": {
"react": ">=18",
"react-dom": ">=18"
}
},
"node_modules/rolldown": {
"version": "1.0.0-rc.17",
"resolved": "https://registry.npmjs.org/rolldown/-/rolldown-1.0.0-rc.17.tgz",
@@ -2808,6 +2860,12 @@
"semver": "bin/semver.js"
}
},
"node_modules/set-cookie-parser": {
"version": "2.7.2",
"resolved": "https://registry.npmjs.org/set-cookie-parser/-/set-cookie-parser-2.7.2.tgz",
"integrity": "sha512-oeM1lpU/UvhTxw+g3cIfxXHyJRc/uidd3yK1P242gzHds0udQBYzs3y8j4gCCW+ZJ7ad0yctld8RYO+bdurlvw==",
"license": "MIT"
},
"node_modules/shebang-command": {
"version": "2.0.0",
"resolved": "https://registry.npmjs.org/shebang-command/-/shebang-command-2.0.0.tgz",

View File

@@ -10,8 +10,17 @@
"preview": "vite preview"
},
"dependencies": {
"@fontsource-variable/geist": "^5.2.9",
"class-variance-authority": "^0.7.1",
"clsx": "^2.1.1",
"lucide-react": "^1.16.0",
"radix-ui": "^1.4.3",
"react": "^19.2.5",
"react-dom": "^19.2.5"
"react-dom": "^19.2.5",
"react-router-dom": "^7.9.6",
"shadcn": "^4.8.0",
"tailwind-merge": "^3.6.0",
"tw-animate-css": "^1.4.0"
},
"devDependencies": {
"@eslint/js": "^10.0.1",

4076
frontend/pnpm-lock.yaml generated

File diff suppressed because it is too large

View File

@@ -1,46 +1,11 @@
import './styles/globals.css';
import { ThemeProvider, AppProvider, useApp, useTheme } from './contexts';
import { Header, Tabs } from './components/layout';
import { CompliancePage } from './pages/Compliance';
import { DocsPage } from './pages/Docs';
import { StatusPage } from './pages/Status';
import { RagChatPage } from './pages/RagChat';
const PageContent = () => {
const { activeTab } = useApp();
switch (activeTab) {
case 'docs':
return <DocsPage />;
case 'compliance':
return <CompliancePage />;
case 'status':
return <StatusPage />;
case 'rag':
return <RagChatPage />;
default:
return <CompliancePage />;
}
};
const AppContent = () => {
const { theme } = useTheme();
return (
<div className="h-full flex flex-col min-h-screen" style={{ backgroundColor: theme.bg }}>
<Header />
<Tabs />
<PageContent />
</div>
);
};
import { ThemeProvider } from './contexts';
import { AppRouter } from './router/AppRouter';
function App() {
return (
<ThemeProvider>
<AppProvider>
<AppContent />
</AppProvider>
<AppRouter />
</ThemeProvider>
);
}

View File

@@ -0,0 +1,128 @@
const PERCEPTION_API_BASE = '/api/v1';
export type ImpactLevel = 'high' | 'medium' | 'low';
export type EventStatus = 'enacted' | 'draft' | 'consultation';
export type EventSource = 'MIIT' | 'UN-ECE' | 'ISO' | '国标委' | 'EUR-Lex' | 'IATF';
export interface RegulationEvent {
id: string;
source: EventSource;
source_label: string;
standard_code: string;
title: string;
summary: string;
impact_level: ImpactLevel;
published_at: string;
effective_at: string | null;
category: string;
tags: string[];
source_url: string;
status: EventStatus;
}
export interface PerceptionStats {
total: number;
high_impact: number;
medium_impact: number;
low_impact: number;
recent_90d: number;
}
export interface EventListResponse {
events: RegulationEvent[];
total: number;
}
export interface AffectedDoc {
doc_id: string;
doc_name: string;
score: number;
snippet: string;
clause: string;
}
export interface AnalysisSSEMessage {
type: 'sources' | 'content' | 'done' | 'error';
docs?: AffectedDoc[];
text?: string;
}
export async function getPerceptionStats(): Promise<PerceptionStats> {
const res = await fetch(`${PERCEPTION_API_BASE}/perception/stats`);
if (!res.ok) throw new Error(`stats failed: ${res.status}`);
return res.json() as Promise<PerceptionStats>;
}
export async function listEvents(params?: {
source?: string;
impact_level?: string;
limit?: number;
}): Promise<EventListResponse> {
const query = new URLSearchParams();
if (params?.source) query.set('source', params.source);
if (params?.impact_level) query.set('impact_level', params.impact_level);
if (params?.limit) query.set('limit', String(params.limit));
const res = await fetch(`${PERCEPTION_API_BASE}/perception/events?${query.toString()}`);
if (!res.ok) throw new Error(`list events failed: ${res.status}`);
return res.json() as Promise<EventListResponse>;
}
export async function analyzeEvent(
eventId: string,
onMessage: (msg: AnalysisSSEMessage) => void,
onComplete?: () => void,
signal?: AbortSignal,
): Promise<void> {
try {
const res = await fetch(`${PERCEPTION_API_BASE}/perception/events/${eventId}/analyze`, {
method: 'POST',
headers: { Accept: 'text/event-stream' },
signal,
});
if (!res.ok || !res.body) throw new Error(`analyze failed: ${res.status}`);
const reader = res.body.getReader();
const decoder = new TextDecoder();
let buffer = '';
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
const parts = buffer.split('\n\n');
buffer = parts.pop() ?? '';
for (const block of parts) {
if (!block.trim()) continue;
let eventName = 'message';
const dataLines: string[] = [];
for (const line of block.split('\n')) {
if (line.startsWith('event:')) eventName = line.slice(6).trim();
else if (line.startsWith('data:')) dataLines.push(line.slice(5).trim());
}
const payload = dataLines.join('\n');
if (!payload) continue;
if (eventName === 'sources') {
try {
const docs = JSON.parse(payload) as AffectedDoc[];
onMessage({ type: 'sources', docs });
} catch { /* skip a sources payload that is not valid JSON */ }
} else if (eventName === 'content') {
onMessage({ type: 'content', text: payload });
} else if (eventName === 'done') {
onMessage({ type: 'done' });
} else if (eventName === 'error') {
onMessage({ type: 'error', text: payload });
}
}
}
if (buffer.trim()) {
// NOTE: a trailing block that was not terminated by a blank line is left unparsed here
}
onComplete?.();
} catch (err) {
if (err instanceof DOMException && err.name === 'AbortError') return;
onMessage({ type: 'error', text: err instanceof Error ? err.message : String(err) });
onComplete?.();
}
}
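
Review note: the block parsing inside `analyzeEvent` could be pulled out into pure functions so it can be tested without a network stream. The following is only a sketch under that assumption; `SseFrame`, `parseSseBlock`, and `drainSseBuffer` are hypothetical names that do not exist in this commit. It mirrors the splitting on blank lines and the `event:` / `data:` prefixes used above, and returns the unfinished remainder so a caller can decide whether to flush it.

```typescript
// Hypothetical helpers for illustration; not part of this commit.
interface SseFrame {
  event: string;
  data: string;
}

// Parse one block (the text between blank lines) into an event name and payload.
// Returns null when the block carries no data lines, matching the `continue` above.
function parseSseBlock(block: string): SseFrame | null {
  let event = 'message';
  const dataLines: string[] = [];
  for (const line of block.split('\n')) {
    if (line.startsWith('event:')) event = line.slice(6).trim();
    else if (line.startsWith('data:')) dataLines.push(line.slice(5).trim());
  }
  const data = dataLines.join('\n');
  return data ? { event, data } : null;
}

// Split accumulated text into complete frames plus the unfinished remainder.
function drainSseBuffer(buffer: string): { frames: SseFrame[]; rest: string } {
  const parts = buffer.split('\n\n');
  const rest = parts.pop() ?? '';
  const frames: SseFrame[] = [];
  for (const part of parts) {
    const frame = parseSseBlock(part);
    if (frame) frames.push(frame);
  }
  return { frames, rest };
}
```

With helpers like these, the read loop would only append decoded text, call `drainSseBuffer`, and dispatch each frame by `event`. After the stream ends, a non-empty `rest` could be passed through `parseSseBlock` once more if the backend is known to omit the final blank line.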

View File

@@ -1,35 +1,34 @@
import React from 'react';
import { Moon, Sun, SunMedium } from 'lucide-react';
import { useTheme } from '../../contexts';
import { Button } from '../shadcn/ui/button';
const NEXT_LABELS: Record<string, string> = {
dark: '过渡色模式',
dim: '亮色模式',
light: '暗色模式',
};
export const ThemeToggle: React.FC = () => {
const { isDark, toggleTheme, theme } = useTheme();
const { themeMode, toggleTheme } = useTheme();
// Shows the NEXT state's icon: dark→SunMedium(dim next), dim→Sun(light next), light→Moon(dark next)
const Icon =
themeMode === 'dark' ? SunMedium :
themeMode === 'dim' ? Sun :
Moon;
return (
<button
<Button
onClick={toggleTheme}
style={{
width: 44,
height: 44,
borderRadius: 10,
background: isDark ? theme.bgHover : theme.bgCard,
border: `1px solid ${theme.border}`,
cursor: 'pointer',
display: 'flex',
alignItems: 'center',
justifyContent: 'center',
transition: 'all 0.3s ease',
}}
variant="outline"
size="icon-lg"
className="rounded-xl border-border bg-card text-muted-foreground hover:bg-muted hover:text-foreground"
aria-label={`切换到${NEXT_LABELS[themeMode]}`}
title={`切换到${NEXT_LABELS[themeMode]}`}
>
{isDark ? (
<svg width="20" height="20" viewBox="0 0 24 24" fill="none">
<circle cx="12" cy="12" r="4" fill={theme.accent}/>
<path d="M12 2V4M12 20V22M4 12H2M22 12H20M6.34 6.34L4.93 4.93M19.07 19.07L17.66 17.66M6.34 17.66L4.93 19.07M19.07 4.93L17.66 6.34" stroke={theme.accent} strokeWidth="2" strokeLinecap="round"/>
</svg>
) : (
<svg width="20" height="20" viewBox="0 0 24 24" fill="none">
<path d="M21 12.79A9 9 0 1 1 11.21 3 7 7 0 0 0 21 12.79z" fill={theme.accent} stroke={theme.accent} strokeWidth="1"/>
</svg>
)}
</button>
<Icon />
</Button>
);
};

View File

@@ -0,0 +1,22 @@
import { useLocation } from 'react-router-dom';
import { FooterLayout } from './FooterLayout';
import { HeaderLayout } from './HeaderLayout';
import { ContentLayout } from './ContentLayout';
import { KeepAliveViewport } from './KeepAliveViewport';
import { getTabByPath } from '../../router/tabs';
export function AppShell() {
const location = useLocation();
const activeTab = getTabByPath(location.pathname);
return (
<div className="flex min-h-screen flex-col bg-t-bg text-t-text">
<HeaderLayout activeTab={activeTab} />
<ContentLayout tab={activeTab}>
<KeepAliveViewport activeTab={activeTab} />
</ContentLayout>
<FooterLayout />
</div>
);
}

View File

@@ -1,27 +0,0 @@
import React from 'react';
import { useTheme } from '../../contexts';
interface ContentProps {
children: React.ReactNode;
wide?: boolean;
}
export const Content: React.FC<ContentProps> = ({ children, wide = false }) => {
const { theme } = useTheme();
return (
<main
style={{
flex: 1,
padding: '48px 56px',
maxWidth: wide ? 1400 : 1100,
margin: '0 auto',
width: '100%',
position: 'relative',
backgroundColor: theme.bg,
}}
>
{children}
</main>
);
};

View File

@@ -0,0 +1,40 @@
import type { ReactNode } from 'react';
import type { AppTabConfig } from '../../router/tabs';
import { shellFrameClassName } from './shell-config';
interface ContentLayoutProps {
children: ReactNode;
tab: AppTabConfig;
}
const widthClassMap = {
default: 'mx-auto w-full max-w-[1120px]',
wide: 'mx-auto w-full max-w-[1440px]',
full: 'w-full',
} as const;
export function ContentLayout({ children, tab }: ContentLayoutProps) {
const widthClass = widthClassMap[tab.contentWidth];
return (
<main className="flex min-h-0 flex-1 bg-t-bg">
<div
className={[
shellFrameClassName,
'relative flex min-h-0 flex-1 justify-center py-8',
].join(' ')}
>
<div
className={[
'relative flex min-h-0 w-full',
widthClass,
tab.fillHeight ? 'overflow-hidden' : '',
].join(' ')}
>
{children}
</div>
</div>
</main>
);
}

View File

@@ -0,0 +1,38 @@
import { Badge } from '../shadcn/ui/badge';
import { Separator } from '../shadcn/ui/separator';
import { shellFrameClassName, shellMeta } from './shell-config';
export function FooterLayout() {
return (
<footer className="border-t border-t-border bg-t-bg">
<div
className={[
shellFrameClassName,
'flex items-center justify-between gap-6 py-4 text-xs text-t-text3',
].join(' ')}
>
<div className="min-w-0 max-w-[360px]">
<div className="mono mb-1 tracking-[0.18em] text-t-text2">
{shellMeta.productLabel}
</div>
</div>
<div className="flex shrink-0 items-center gap-3 whitespace-nowrap rounded-xl border border-border bg-card px-3 py-2 shadow-sm">
<Badge variant="secondary" className="mono border-0 bg-transparent px-0 py-0 text-[11px] tracking-[0.24em] text-muted-foreground">
{shellMeta.version}
</Badge>
<Separator orientation="vertical" className="h-4 bg-border" />
<Badge variant="outline" className="mono gap-2 border-0 bg-transparent px-0 py-0 text-[11px] tracking-[0.24em] text-[var(--t-green)]">
<span className="size-2 rounded-full bg-[var(--t-green)]" />
{shellMeta.status}
</Badge>
<Separator orientation="vertical" className="h-4 bg-border" />
<span className="mono text-[11px] tracking-[0.18em] text-muted-foreground">
{shellMeta.surface}
</span>
</div>
</div>
</footer>
);
}

View File

@@ -1,47 +0,0 @@
import React from 'react';
import { useTheme } from '../../contexts';
import { TLogo } from '../common/TLogo';
import { ThemeToggle } from '../common/ThemeToggle';
export const Header: React.FC = () => {
const { theme } = useTheme();
return (
<header
className="h-[72px] flex items-center justify-between sticky top-0 z-[100]"
style={{
padding: '0 48px',
borderBottom: `1px solid ${theme.border}`,
backgroundColor: theme.bg,
}}
>
<div className="flex items-center" style={{ gap: 20 }}>
<TLogo size={80} />
<div className="flex items-baseline" style={{ gap: 12 }}>
<span style={{ fontWeight: 700, fontSize: 20, letterSpacing: '-0.5px', color: theme.text }}>
T-Systems
</span>
<span style={{ fontWeight: 300, fontSize: 16, color: theme.text2 }}>
Regulation
</span>
</div>
</div>
<div className="flex items-center" style={{ gap: 16 }}>
<ThemeToggle />
<div
className="flex items-center rounded-lg"
style={{
padding: '8px 16px',
gap: 8,
backgroundColor: theme.bgHover,
borderRadius: 8,
}}
>
<span className="mono" style={{ fontSize: 11, color: theme.text3 }}>v1.0.0</span>
<div style={{ width: 1, height: 12, background: theme.border }} />
<span className="mono" style={{ fontSize: 12, color: theme.green }}> ONLINE</span>
</div>
</div>
</header>
);
};

View File

@@ -0,0 +1,19 @@
import { TLogo } from '../common/TLogo';
export function HeaderBrand() {
return (
<div className="flex min-w-[280px] shrink-0 items-center gap-4 whitespace-nowrap">
<div className="shrink-0">
<TLogo size={46} />
</div>
<div className="flex min-w-0 items-center gap-2 whitespace-nowrap">
<span className="text-[1.18rem] font-semibold tracking-[-0.04em] text-foreground">
T-Systems
</span>
<span className="text-[1.02rem] font-light text-muted-foreground">
Regulation
</span>
</div>
</div>
);
}

View File

@@ -0,0 +1,38 @@
import type { AppTabConfig } from '../../router/tabs';
import { ThemeToggle } from '../common/ThemeToggle';
import { Badge } from '../shadcn/ui/badge';
import { Separator } from '../shadcn/ui/separator';
import { HeaderBrand } from './HeaderBrand';
import { shellFrameClassName, shellMeta } from './shell-config';
import { TabNav } from './TabNav';
interface HeaderLayoutProps {
activeTab: AppTabConfig;
}
export function HeaderLayout({ activeTab }: HeaderLayoutProps) {
return (
<header className="sticky top-0 z-[100] border-b border-border bg-background/95 backdrop-blur supports-[backdrop-filter]:bg-background/80">
<div className={[shellFrameClassName, 'flex h-20 items-center gap-8'].join(' ')}>
<HeaderBrand />
<div className="min-w-0 flex-1 self-stretch overflow-hidden">
<TabNav activeTab={activeTab} />
</div>
<div className="ml-auto flex shrink-0 items-center gap-3 self-center">
<ThemeToggle />
<div className="flex h-11 shrink-0 items-center gap-3 whitespace-nowrap rounded-xl border border-border bg-card px-3 shadow-sm">
<Badge variant="secondary" className="mono border-0 bg-transparent px-0 py-0 text-[11px] tracking-[0.24em] text-muted-foreground">
{shellMeta.version}
</Badge>
<Separator orientation="vertical" className="h-4 bg-border" />
<Badge variant="outline" className="mono gap-2 border-0 bg-transparent px-0 py-0 text-[11px] tracking-[0.24em] text-[var(--t-green)]">
<span className="size-2 rounded-full bg-[var(--t-green)]" />
{shellMeta.status}
</Badge>
</div>
</div>
</div>
</header>
);
}

View File

@@ -0,0 +1,45 @@
import { useEffect, useState } from 'react';
import { appTabs, type AppTabConfig } from '../../router/tabs';
interface KeepAliveViewportProps {
activeTab: AppTabConfig;
}
export function KeepAliveViewport({ activeTab }: KeepAliveViewportProps) {
const [mountedTabIds, setMountedTabIds] = useState<string[]>([activeTab.id]);
useEffect(() => {
const timerId = window.setTimeout(() => {
setMountedTabIds((prev) => (prev.includes(activeTab.id) ? prev : [...prev, activeTab.id]));
}, 0);
return () => window.clearTimeout(timerId);
}, [activeTab.id]);
return (
<div className="flex min-h-0 flex-1">
{appTabs.map((tab) => {
const shouldRender = tab.keepAlive ? mountedTabIds.includes(tab.id) : tab.id === activeTab.id;
if (!shouldRender) {
return null;
}
const TabComponent = tab.component;
const isActive = tab.id === activeTab.id;
return (
<div
key={tab.id}
aria-hidden={!isActive}
className={[
'min-h-0 flex-1',
isActive ? 'flex' : 'hidden',
].join(' ')}
>
<TabComponent />
</div>
);
})}
</div>
);
}
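
Review note: the render decision in `KeepAliveViewport` can be modeled without React, which makes its behavior easier to reason about. This is a minimal sketch, assuming a tab only needs an `id` and a `keepAlive` flag; `TabLike`, `nextMounted`, and `visibleTabs` are hypothetical names introduced here for illustration.

```typescript
// Hypothetical model of the keep-alive logic; not part of this commit.
interface TabLike {
  id: string;
  keepAlive: boolean;
}

// Same update as the setMountedTabIds callback: add the active id once, never remove.
function nextMounted(prev: string[], activeId: string): string[] {
  return prev.includes(activeId) ? prev : [...prev, activeId];
}

// Same rule as `shouldRender`: keep-alive tabs render once mounted,
// other tabs render only while they are active.
function visibleTabs(
  tabs: TabLike[],
  mounted: string[],
  activeId: string,
): Array<{ id: string; active: boolean }> {
  return tabs
    .filter((tab) => (tab.keepAlive ? mounted.includes(tab.id) : tab.id === activeId))
    .map((tab) => ({ id: tab.id, active: tab.id === activeId }));
}
```

One thing the model makes visible: because the mounted list is updated in a `setTimeout(..., 0)` callback, a keep-alive tab that is visited for the first time is not in the rendered set until that callback has run. Whether that one-tick gap matters in practice depends on the page; it is noted here only as an observation.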

View File

@@ -0,0 +1,125 @@
import { useEffect, useLayoutEffect, useRef, useState } from 'react';
import { useNavigate } from 'react-router-dom';
import type { AppTabConfig, TabId } from '../../router/tabs';
import { appTabs } from '../../router/tabs';
interface TabNavProps {
activeTab: AppTabConfig;
}
interface IndicatorStyle {
opacity: number;
transform: string;
width: number;
}
const reducedMotionQuery = '(prefers-reduced-motion: reduce)';
export function TabNav({ activeTab }: TabNavProps) {
const navigate = useNavigate();
const trackRef = useRef<HTMLDivElement | null>(null);
const buttonRefs = useRef<Record<TabId, HTMLButtonElement | null>>({
perception: null,
docs: null,
compliance: null,
status: null,
rag: null,
});
const [indicatorStyle, setIndicatorStyle] = useState<IndicatorStyle>({
opacity: 0,
transform: 'translateX(0px)',
width: 0,
});
const [reducedMotion, setReducedMotion] = useState(false);
const handleValueChange = (value: string) => {
const nextTab = appTabs.find((tab) => tab.id === value);
if (nextTab && nextTab.path !== activeTab.path) {
navigate(nextTab.path);
}
};
useEffect(() => {
const mediaQuery = window.matchMedia(reducedMotionQuery);
const updateMotionPreference = () => {
setReducedMotion(mediaQuery.matches);
};
updateMotionPreference();
mediaQuery.addEventListener('change', updateMotionPreference);
return () => {
mediaQuery.removeEventListener('change', updateMotionPreference);
};
}, []);
useLayoutEffect(() => {
const updateIndicator = () => {
const trackNode = trackRef.current;
const activeNode = buttonRefs.current[activeTab.id];
if (!trackNode || !activeNode) {
return;
}
const trackRect = trackNode.getBoundingClientRect();
const activeRect = activeNode.getBoundingClientRect();
setIndicatorStyle({
opacity: 1,
transform: `translateX(${activeRect.left - trackRect.left}px)`,
width: activeRect.width,
});
};
updateIndicator();
window.addEventListener('resize', updateIndicator);
return () => {
window.removeEventListener('resize', updateIndicator);
};
}, [activeTab.id]);
return (
<nav className="flex h-full min-w-0 items-stretch overflow-x-auto overflow-y-hidden">
<div
ref={trackRef}
className="relative flex h-full min-w-max flex-nowrap items-stretch gap-3 pr-6"
>
<div
aria-hidden="true"
className={[
'pointer-events-none absolute bottom-0 left-0 h-0.5 rounded-full bg-primary',
reducedMotion
? 'transition-none'
: 'transition-[transform,width,opacity] duration-220 ease-[cubic-bezier(0.22,1,0.36,1)]',
].join(' ')}
style={indicatorStyle}
/>
{appTabs.map((tab) => (
<button
key={tab.id}
ref={(node) => {
buttonRefs.current[tab.id] = node;
}}
data-shell-tab="true"
type="button"
onClick={() => handleValueChange(tab.id satisfies TabId)}
aria-current={tab.id === activeTab.id ? 'page' : undefined}
className={[
'inline-flex h-full shrink-0 appearance-none items-center justify-center border-0 border-b-2 border-transparent bg-transparent px-5 pt-1 text-[0.95rem] font-medium tracking-[0.02em] outline-none',
reducedMotion
? 'transition-none'
: 'transition-[color,opacity] duration-200 ease-out',
tab.id === activeTab.id
? 'text-foreground'
: 'text-muted-foreground hover:text-foreground',
].join(' ')}
>
{tab.label}
</button>
))}
</div>
</nav>
);
}
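
Review note: the indicator math in `updateIndicator` depends only on two rectangles, so it can be expressed as a pure function. The sketch below assumes that only `left` and `width` are needed; `RectLike` and `computeIndicator` are hypothetical names and not part of this commit.

```typescript
// Hypothetical pure version of the indicator calculation; not part of this commit.
interface IndicatorStyle {
  opacity: number;
  transform: string;
  width: number;
}

interface RectLike {
  left: number;
  width: number;
}

// Returns null when either node is missing, mirroring the early return above.
function computeIndicator(track: RectLike | null, active: RectLike | null): IndicatorStyle | null {
  if (!track || !active) return null;
  return {
    opacity: 1,
    // Both rects are in viewport coordinates, so the difference is the offset inside the track.
    transform: `translateX(${active.left - track.left}px)`,
    width: active.width,
  };
}
```

In the component this would be called with the results of `getBoundingClientRect()` for the track and the active button, and the state would only be updated when the result is not null.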

View File

@@ -1,48 +0,0 @@
import React from 'react';
import { useTheme, useApp } from '../../contexts';
import type { TabId } from '../../contexts';
const tabs: Array<{ id: TabId; label: string }> = [
{ id: 'docs', label: '文档管理' },
{ id: 'compliance', label: '合规分析' },
{ id: 'status', label: '系统状态' },
{ id: 'rag', label: '法规对话' },
];
export const Tabs: React.FC = () => {
const { theme } = useTheme();
const { activeTab, setActiveTab } = useApp();
return (
<nav
className="h-[56px] flex items-center"
style={{
padding: '0 48px',
borderBottom: `1px solid ${theme.border}`,
backgroundColor: theme.bg,
}}
>
{tabs.map((tab) => (
<button
key={tab.id}
onClick={() => setActiveTab(tab.id)}
style={{
height: 56,
padding: '0 32px',
fontSize: 15,
fontWeight: activeTab === tab.id ? 600 : 400,
color: activeTab === tab.id ? theme.accent : theme.text3,
background: 'transparent',
border: 'none',
borderBottom: activeTab === tab.id ? `3px solid ${theme.accent}` : '3px solid transparent',
marginBottom: -1,
cursor: 'pointer',
transition: 'all 0.2s ease',
}}
>
{tab.label}
</button>
))}
</nav>
);
};

View File

@@ -1,3 +1,4 @@
export { Header } from './Header';
export { Tabs } from './Tabs';
export { Content } from './Content';
export { AppShell } from './AppShell';
export { ContentLayout } from './ContentLayout';
export { FooterLayout } from './FooterLayout';
export { HeaderLayout } from './HeaderLayout';

View File

@@ -0,0 +1,12 @@
import { appTabs } from '../../router/tabs';
export const shellFrameClassName = 'mx-auto w-full max-w-[1680px] px-8';
export const shellMeta = {
productLabel: 'T-Systems Regulation',
version: 'v1.0.0',
status: 'ONLINE',
surface: 'Desktop Web',
} as const;
export const shellModuleSummary = appTabs.map((tab) => tab.label).join(' / ');

View File

@@ -0,0 +1,30 @@
import { cva, type VariantProps } from 'class-variance-authority';
import type * as React from 'react';
import { cn } from '@/lib/utils';
const badgeVariants = cva(
'inline-flex items-center rounded-md border px-2 py-1 text-[11px] font-medium tracking-[0.22em] uppercase transition-colors',
{
variants: {
variant: {
default: 'border-primary/30 bg-primary/10 text-primary',
secondary: 'border-border bg-muted text-muted-foreground',
outline: 'border-border bg-transparent text-foreground',
},
},
defaultVariants: {
variant: 'default',
},
},
);
function Badge({
className,
variant,
...props
}: React.ComponentProps<'span'> & VariantProps<typeof badgeVariants>) {
return <span className={cn(badgeVariants({ variant }), className)} {...props} />;
}
export { Badge };

View File

@@ -0,0 +1,65 @@
import * as React from 'react';
import { cva, type VariantProps } from 'class-variance-authority';
import { Slot } from 'radix-ui';
import { cn } from '@/lib/utils';
const buttonVariants = cva(
'group/button inline-flex shrink-0 items-center justify-center rounded-lg border border-transparent bg-clip-padding text-sm font-medium whitespace-nowrap transition-all outline-none select-none focus-visible:border-ring focus-visible:ring-3 focus-visible:ring-ring/50 active:not-aria-[haspopup]:translate-y-px disabled:pointer-events-none disabled:opacity-50 aria-invalid:border-destructive aria-invalid:ring-3 aria-invalid:ring-destructive/20 dark:aria-invalid:border-destructive/50 dark:aria-invalid:ring-destructive/40 [&_svg]:pointer-events-none [&_svg]:shrink-0 [&_svg:not([class*=size-])]:size-4',
{
variants: {
variant: {
default: 'bg-primary text-primary-foreground hover:bg-primary/90',
outline:
'border-border bg-background hover:bg-muted hover:text-foreground aria-expanded:bg-muted aria-expanded:text-foreground dark:bg-input/30 dark:hover:bg-input/50',
secondary:
'bg-secondary text-secondary-foreground hover:bg-secondary/80 aria-expanded:bg-secondary aria-expanded:text-secondary-foreground',
ghost:
'hover:bg-muted hover:text-foreground aria-expanded:bg-muted aria-expanded:text-foreground dark:hover:bg-muted/50',
destructive:
'bg-destructive/10 text-destructive hover:bg-destructive/20 focus-visible:border-destructive/40 focus-visible:ring-destructive/20 dark:bg-destructive/20 dark:hover:bg-destructive/30 dark:focus-visible:ring-destructive/40',
link: 'text-primary underline-offset-4 hover:underline',
},
size: {
default:
'h-8 gap-1.5 px-2.5 has-data-[icon=inline-end]:pr-2 has-data-[icon=inline-start]:pl-2',
xs: 'h-6 gap-1 rounded-[min(var(--radius-md),10px)] px-2 text-xs has-data-[icon=inline-end]:pr-1.5 has-data-[icon=inline-start]:pl-1.5 [&_svg:not([class*=size-])]:size-3',
sm: 'h-7 gap-1 rounded-[min(var(--radius-md),12px)] px-2.5 text-[0.8rem] has-data-[icon=inline-end]:pr-1.5 has-data-[icon=inline-start]:pl-1.5 [&_svg:not([class*=size-])]:size-3.5',
lg: 'h-9 gap-1.5 px-3 has-data-[icon=inline-end]:pr-2 has-data-[icon=inline-start]:pl-2',
icon: 'size-8',
'icon-xs': 'size-6 rounded-[min(var(--radius-md),10px)] [&_svg:not([class*=size-])]:size-3',
'icon-sm': 'size-7 rounded-[min(var(--radius-md),12px)]',
'icon-lg': 'size-9',
},
},
defaultVariants: {
variant: 'default',
size: 'default',
},
},
);
function Button({
className,
variant,
size,
asChild = false,
...props
}: React.ComponentProps<'button'> &
VariantProps<typeof buttonVariants> & {
asChild?: boolean;
}) {
const Comp = asChild ? Slot.Root : 'button';
return (
<Comp
data-slot="button"
data-variant={variant}
data-size={size}
className={cn(buttonVariants({ variant, size, className }))}
{...props}
/>
);
}
export { Button };

View File

@@ -0,0 +1,27 @@
import * as React from 'react';
import { Separator as SeparatorPrimitive } from 'radix-ui';
import { cn } from '@/lib/utils';
function Separator({
className,
orientation = 'horizontal',
decorative = true,
...props
}: React.ComponentProps<typeof SeparatorPrimitive.Root>) {
return (
<SeparatorPrimitive.Root
data-slot="separator-root"
decorative={decorative}
orientation={orientation}
className={cn(
'shrink-0 bg-border',
orientation === 'horizontal' ? 'h-px w-full' : 'h-full w-px',
className,
)}
{...props}
/>
);
}
export { Separator };

View File

@@ -0,0 +1,48 @@
import * as React from 'react';
import { Tabs as TabsPrimitive } from 'radix-ui';
import { cn } from '@/lib/utils';
function Tabs({
className,
...props
}: React.ComponentProps<typeof TabsPrimitive.Root>) {
return (
<TabsPrimitive.Root
data-slot="tabs"
className={cn('w-full', className)}
{...props}
/>
);
}
function TabsList({
className,
...props
}: React.ComponentProps<typeof TabsPrimitive.List>) {
return (
<TabsPrimitive.List
data-slot="tabs-list"
className={cn('inline-flex items-center gap-2', className)}
{...props}
/>
);
}
function TabsTrigger({
className,
...props
}: React.ComponentProps<typeof TabsPrimitive.Trigger>) {
return (
<TabsPrimitive.Trigger
data-slot="tabs-trigger"
className={cn(
'inline-flex items-center justify-center whitespace-nowrap outline-none transition-colors focus-visible:ring-2 focus-visible:ring-ring focus-visible:ring-offset-2 disabled:pointer-events-none disabled:opacity-50',
className,
)}
{...props}
/>
);
}
export { Tabs, TabsList, TabsTrigger };

View File

@@ -1,17 +0,0 @@
import { useState, type ReactNode } from 'react';
import { AppContext, type TabId } from './app-context';
interface AppProviderProps {
children: ReactNode;
}
export const AppProvider: React.FC<AppProviderProps> = ({ children }) => {
const [activeTab, setActiveTab] = useState<TabId>('compliance');
return (
<AppContext.Provider value={{ activeTab, setActiveTab }}>
{children}
</AppContext.Provider>
);
};

View File

@@ -1,33 +1,58 @@
import React, { useEffect, useState, type ReactNode } from 'react';
import { darkTheme, lightTheme } from '../types/theme';
import { darkTheme, dimTheme, lightTheme, type ThemeMode } from '../types/theme';
import { ThemeContext } from './theme-context';
const STORAGE_KEY = 'app-theme-mode';
function getInitialMode(): ThemeMode {
try {
const stored = localStorage.getItem(STORAGE_KEY);
if (stored === 'dark' || stored === 'dim' || stored === 'light') return stored;
} catch {
// localStorage may be unavailable; fall through to the default mode
}
return 'dark';
}
const THEME_MAP = { dark: darkTheme, dim: dimTheme, light: lightTheme };
const BG_MAP: Record<ThemeMode, string> = {
dark: '#0a0a12',
dim: '#1e1b2e',
light: '#ffffff',
};
interface ThemeProviderProps {
children: ReactNode;
}
export const ThemeProvider: React.FC<ThemeProviderProps> = ({ children }) => {
const [isDark, setIsDark] = useState<boolean>(true);
const theme = isDark ? darkTheme : lightTheme;
const [themeMode, setThemeMode] = useState<ThemeMode>(getInitialMode);
const theme = THEME_MAP[themeMode];
const isDark = themeMode === 'dark';
const toggleTheme = () => {
setIsDark((prev) => !prev);
setThemeMode((prev) =>
prev === 'dark' ? 'dim' : prev === 'dim' ? 'light' : 'dark'
);
};
useEffect(() => {
if (isDark) {
document.documentElement.classList.add('dark');
document.body.style.background = '#0a0a12';
return;
}
const root = document.documentElement;
root.classList.remove('dark', 'dim');
if (themeMode !== 'light') root.classList.add(themeMode);
document.documentElement.classList.remove('dark');
document.body.style.background = '#ffffff';
}, [isDark]);
document.body.style.background = BG_MAP[themeMode];
try {
localStorage.setItem(STORAGE_KEY, themeMode);
} catch {
// localStorage may be unavailable; the mode is simply not persisted
}
}, [themeMode]);
return (
<ThemeContext.Provider value={{ isDark, theme, toggleTheme }}>
<ThemeContext.Provider value={{ isDark, themeMode, theme, toggleTheme }}>
{children}
</ThemeContext.Provider>
);
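
Review note: the three-state cycle and the stored-value check in `ThemeProvider` can be written as small pure functions. This is a sketch only. It assumes `ThemeMode` is the union `'dark' | 'dim' | 'light'` (the type itself lives in `../types/theme` and is not shown in this diff), and `NEXT_MODE`, `nextThemeMode`, and `parseStoredMode` are hypothetical names. The lookup table is a stand-in for the nested ternary in `toggleTheme`, not the author's implementation.

```typescript
// Assumed shape of ThemeMode; the real definition is in ../types/theme.
type ThemeMode = 'dark' | 'dim' | 'light';

// Hypothetical lookup table replacing the nested ternary: dark -> dim -> light -> dark.
const NEXT_MODE: Record<ThemeMode, ThemeMode> = {
  dark: 'dim',
  dim: 'light',
  light: 'dark',
};

function nextThemeMode(mode: ThemeMode): ThemeMode {
  return NEXT_MODE[mode];
}

// Same validation as getInitialMode, without touching localStorage:
// anything that is not a known mode falls back to 'dark'.
function parseStoredMode(stored: string | null): ThemeMode {
  return stored === 'dark' || stored === 'dim' || stored === 'light' ? stored : 'dark';
}
```

Keeping the storage access outside these functions means the cycle and the fallback can be checked in isolation, while the `try`/`catch` around `localStorage` stays in the provider.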

View File

@@ -1,10 +0,0 @@
import { createContext } from 'react';
export type TabId = 'docs' | 'compliance' | 'status' | 'rag';
export interface AppContextValue {
activeTab: TabId;
setActiveTab: (tab: TabId) => void;
}
export const AppContext = createContext<AppContextValue | undefined>(undefined);

View File

@@ -1,5 +1,2 @@
export { ThemeProvider } from './ThemeContext';
export { useTheme } from './useTheme';
export { AppProvider } from './AppContext';
export type { AppContextValue, TabId } from './app-context';
export { useApp } from './useApp';

View File

@@ -1,9 +1,10 @@
import { createContext } from 'react';
import type { ThemeColors } from '../types/theme';
import type { ThemeColors, ThemeMode } from '../types/theme';
export interface ThemeContextValue {
isDark: boolean;
themeMode: ThemeMode;
theme: ThemeColors;
toggleTheme: () => void;
}

View File

@@ -1,11 +0,0 @@
import { useContext } from 'react';
import { AppContext, type AppContextValue } from './app-context';
export function useApp(): AppContextValue {
const context = useContext(AppContext);
if (!context) {
throw new Error('useApp must be used within an AppProvider');
}
return context;
}

View File

@@ -587,7 +587,7 @@ export const CompliancePage: React.FC = () => {
flex: 1,
display: 'flex',
height: '100%',
minHeight: 'calc(100vh - 128px)',
minHeight: 0,
position: 'relative',
}}>
{/* Main Content Area */}

View File

@@ -1,6 +1,5 @@
import React, { useEffect, useRef, useState } from 'react';
import { useTheme } from '../../contexts';
import { Content } from '../../components/layout/Content';
import { TPattern } from '../../components/common/TPattern';
import { getDocumentList, getDocumentStatus, searchRegulations, uploadDocument, deleteDocument, retryDocument, type RegulationSearchItem } from '../../api/docs';
import type { Doc } from '../../types';
@@ -40,6 +39,7 @@ export const DocsPage: React.FC = () => {
const [searchResults, setSearchResults] = useState<RegulationSearchItem[]>([]);
const [searchLoading, setSearchLoading] = useState(false);
const [searchError, setSearchError] = useState('');
const [batchQueueLength, setBatchQueueLength] = useState(0);
// Upload metadata
const [regulationType, setRegulationType] = useState('');
@@ -48,12 +48,17 @@ export const DocsPage: React.FC = () => {
// Batch queue: files waiting to be uploaded after the current one finishes
const batchQueueRef = useRef<File[]>([]);
const setBatchQueue = (files: File[]) => {
batchQueueRef.current = files;
setBatchQueueLength(files.length);
};
async function loadDocuments() {
setLoading(true);
try {
const response = await getDocumentList();
const apiDocs: Doc[] = response.docs.map((doc) => ({
-const apiDocs: Doc[] = response.docs.map((doc) => ({
-id: parseInt(String(doc.id).replace('doc-', ''), 10) || Math.floor(Math.random() * 10000),
+const apiDocs: Doc[] = response.docs.map((doc, index) => ({
+id: Number.parseInt(String(doc.id).replace('doc-', ''), 10) || -(index + 1),
name: doc.name,
chunks: doc.chunks,
size: doc.updated_at ? new Date(doc.updated_at).toLocaleString() : 'Indexed document',
@@ -209,6 +214,7 @@ export const DocsPage: React.FC = () => {
// Process next file in batch queue
const next = batchQueueRef.current.shift();
+setBatchQueueLength(batchQueueRef.current.length);
if (next) {
const nextRunId = pipelineRunIdRef.current + 1;
pipelineRunIdRef.current = nextRunId;
@@ -222,7 +228,7 @@ export const DocsPage: React.FC = () => {
if (files.length === 0 || uploading) return;
const [first, ...rest] = files;
-batchQueueRef.current = rest;
+setBatchQueue(rest);
const runId = pipelineRunIdRef.current + 1;
pipelineRunIdRef.current = runId;
@@ -262,7 +268,7 @@ export const DocsPage: React.FC = () => {
if (files.length === 0 || uploading) return;
const [first, ...rest] = files;
-batchQueueRef.current = rest;
+setBatchQueue(rest);
const runId = pipelineRunIdRef.current + 1;
pipelineRunIdRef.current = runId;
void uploadSingleFile(first, runId);
@@ -282,8 +288,7 @@ export const DocsPage: React.FC = () => {
const getPipelineHint = () => {
if (pipelineStatus === 'running') {
-const queueLen = batchQueueRef.current.length;
-const suffix = queueLen > 0 ? ` (+${queueLen} 待上传)` : '';
+const suffix = batchQueueLength > 0 ? ` (+${batchQueueLength} 待上传)` : '';
return `${activeStep >= 0 ? PIPELINE_STEPS[activeStep].name : 'LOAD'} · ${uploadFileName}${suffix}`;
}
if (pipelineStatus === 'completed') return 'PIPELINE COMPLETE';
@@ -291,6 +296,11 @@ export const DocsPage: React.FC = () => {
return 'WAITING FOR UPLOAD';
};
+const getDocKey = (doc: Doc) => {
+// Prefer the backend document identifier because the numeric display id is not guaranteed unique.
+return doc.docId ?? `local-${doc.id}-${doc.name}`;
+};
const inputStyle: React.CSSProperties = {
padding: '8px 12px',
fontSize: 13,
@@ -302,7 +312,7 @@ export const DocsPage: React.FC = () => {
};
return (
-<Content>
+<div className="relative w-full">
<TPattern />
<section style={{ marginBottom: 56 }}>
@@ -432,7 +442,7 @@ export const DocsPage: React.FC = () => {
<div style={{ display: 'flex', flexDirection: 'column', gap: 12 }}>
{docs.map((doc) => (
<div
-key={doc.id}
+key={getDocKey(doc)}
style={{ display: 'flex', alignItems: 'center', justifyContent: 'space-between', padding: 20, background: theme.bgCard, borderRadius: 12, border: `1px solid ${doc.status === 'parsing' ? theme.accent : theme.border}`, transition: 'all 0.2s ease', boxShadow: !isDark ? '0 2px 8px rgba(226,0,116,0.04)' : 'none' }}
>
<div style={{ display: 'flex', alignItems: 'flex-start', gap: 16 }}>
@@ -543,6 +553,6 @@ export const DocsPage: React.FC = () => {
</div>
</div>
</section>
-</Content>
+</div>
);
};
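
Review note: the hunks above replace the random fallback id with a deterministic one and key each list row by the backend identifier. The following is a minimal standalone sketch of that idea, not the component's actual code. `DocLike`, `toDisplayId`, and the sample values are hypothetical names for illustration; only the parsing expression and the `getDocKey` body mirror the diff.

```typescript
// Hypothetical, trimmed-down shape; the real `Doc` type lives in ../../types.
interface DocLike {
  id: number;
  name: string;
  docId?: string;
}

// Same expression as the diff: "doc-12" parses to 12; anything unparsable
// falls back to a negative id derived from the list position, which stays
// the same across renders (unlike Math.random()).
function toDisplayId(rawId: unknown, index: number): number {
  return Number.parseInt(String(rawId).replace('doc-', ''), 10) || -(index + 1);
}

// Same body as getDocKey in the diff: prefer the backend identifier.
function getDocKey(doc: DocLike): string {
  return doc.docId ?? `local-${doc.id}-${doc.name}`;
}

// Sample data (assumed): one parsable id with a backend docId, one without.
const rows: DocLike[] = ['doc-12', 'abc'].map((raw, index) => ({
  id: toDisplayId(raw, index),
  name: `file-${index}.pdf`,
  docId: index === 0 ? raw : undefined,
}));

console.log(rows.map(getDocKey)); // ['doc-12', 'local--2-file-1.pdf']
```

One caveat of the `||` fallback: an id that parses to `0` (for example `doc-0`) is falsy and therefore also takes the negative fallback branch.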


@@ -0,0 +1,207 @@
import React, { useRef } from 'react';
import { useTheme } from '../../contexts';
import type { RegulationEvent, AffectedDoc } from '../../api/perception';
interface AnalysisPanelProps {
event: RegulationEvent | null;
analyzing: boolean;
analysisText: string;
affectedDocs: AffectedDoc[];
onAnalyze: () => void;
onAbort: () => void;
}
// Minimal markdown renderer — handles ##/### headings, **bold**, bullet lists
function MarkdownText({ text, textColor, accent }: { text: string; textColor: string; accent: string }) {
const lines = text.split('\n');
return (
<div style={{ fontSize: 14, lineHeight: 1.75, color: textColor }}>
{lines.map((line, i) => {
if (line.startsWith('## ')) {
return <div key={i} style={{ fontSize: 15, fontWeight: 700, color: accent, marginTop: 18, marginBottom: 6 }}>{line.slice(3)}</div>;
}
if (line.startsWith('### ')) {
return <div key={i} style={{ fontSize: 13, fontWeight: 700, marginTop: 12, marginBottom: 4 }}>{line.slice(4)}</div>;
}
if (line.startsWith('- ') || line.startsWith('* ')) {
const content = line.slice(2);
return (
<div key={i} style={{ display: 'flex', gap: 8, marginBottom: 4, paddingLeft: 8 }}>
<span style={{ color: accent, flexShrink: 0 }}>·</span>
<span dangerouslySetInnerHTML={{ __html: content.replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>') }} />
</div>
);
}
if (/^\d+\./.test(line)) {
return (
<div key={i} style={{ marginBottom: 4, paddingLeft: 8 }}>
<span dangerouslySetInnerHTML={{ __html: line.replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>') }} />
</div>
);
}
if (!line.trim()) return <div key={i} style={{ height: 8 }} />;
return (
<div key={i} style={{ marginBottom: 4 }}>
<span dangerouslySetInnerHTML={{ __html: line.replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>') }} />
</div>
);
})}
</div>
);
}
const IMPACT_COLORS = { high: '#d64545', medium: '#ff8800', low: '#00d4aa' };
const SOURCE_COLORS: Record<string, string> = {
MIIT: '#e20074', 'UN-ECE': '#4a90d9', ISO: '#7b68ee',
'国标委': '#00b89c', 'EUR-Lex': '#f5a623', IATF: '#9b59b6',
};
const STATUS_LABEL: Record<string, string> = { enacted: '已生效', draft: '征求意见', consultation: '公众咨询' };
export const AnalysisPanel: React.FC<AnalysisPanelProps> = ({
event, analyzing, analysisText, affectedDocs, onAnalyze, onAbort,
}) => {
const { theme, isDark } = useTheme();
const analysisRef = useRef<HTMLDivElement>(null);
if (!event) {
return (
<div style={{ height: '100%', display: 'flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', gap: 12 }}>
<div style={{ fontSize: 48, opacity: 0.15 }}></div>
<div style={{ fontSize: 14, color: theme.text3 }}></div>
</div>
);
}
const impactColor = IMPACT_COLORS[event.impact_level];
const srcColor = SOURCE_COLORS[event.source] || theme.accent;
return (
<div style={{ display: 'flex', flexDirection: 'column', height: '100%', gap: 0 }}>
{/* Event header */}
<div style={{
padding: '20px 24px',
background: theme.bgCard,
borderRadius: 12,
border: `1px solid ${theme.border}`,
borderLeft: `4px solid ${impactColor}`,
marginBottom: 16,
flexShrink: 0,
boxShadow: !isDark ? '0 2px 8px rgba(226,0,116,0.04)' : 'none',
}}>
{/* Source + status */}
<div style={{ display: 'flex', alignItems: 'center', gap: 8, marginBottom: 10 }}>
<span style={{ fontSize: 11, fontWeight: 700, color: srcColor, background: srcColor + '18', borderRadius: 4, padding: '3px 8px' }}>{event.source}</span>
<span className="mono" style={{ fontSize: 11, color: theme.text3 }}>{event.standard_code}</span>
<span style={{ marginLeft: 'auto', fontSize: 11, color: event.status === 'enacted' ? theme.green : '#ff8800', fontWeight: 600 }}>
{STATUS_LABEL[event.status] ?? event.status}
</span>
</div>
{/* Title */}
<div style={{ fontSize: 16, fontWeight: 700, color: theme.text, lineHeight: 1.4, marginBottom: 10 }}>
{event.title}
</div>
{/* Summary */}
<div style={{ fontSize: 13, color: theme.text2, lineHeight: 1.6, marginBottom: 12 }}>
{event.summary}
</div>
{/* Tags */}
<div style={{ display: 'flex', flexWrap: 'wrap', gap: 6, marginBottom: 12 }}>
{event.tags.map(tag => (
<span key={tag} style={{ fontSize: 11, color: theme.text3, background: theme.bgHover, borderRadius: 4, padding: '2px 8px', border: `1px solid ${theme.border}` }}>
{tag}
</span>
))}
</div>
{/* Dates + Analyze button */}
<div style={{ display: 'flex', alignItems: 'center', justifyContent: 'space-between' }}>
<div className="mono" style={{ fontSize: 11, color: theme.text3 }}>
{event.published_at}
{event.effective_at && <span style={{ marginLeft: 12 }}><span style={{ color: impactColor }}>{event.effective_at}</span></span>}
</div>
{analyzing ? (
<button onClick={onAbort} style={{ padding: '7px 18px', borderRadius: 8, border: '1px solid #d64545', background: 'transparent', color: '#d64545', cursor: 'pointer', fontSize: 13, fontWeight: 600 }}>
</button>
) : (
<button onClick={onAnalyze} style={{ padding: '7px 18px', borderRadius: 8, border: 'none', background: theme.gradientAccent, color: '#fff', cursor: 'pointer', fontSize: 13, fontWeight: 600, boxShadow: '0 2px 8px rgba(226,0,116,0.3)' }}>
</button>
)}
</div>
</div>
{/* Affected documents */}
{affectedDocs.length > 0 && (
<div style={{ marginBottom: 16, flexShrink: 0 }}>
<div className="mono" style={{ fontSize: 11, color: theme.text3, letterSpacing: '1px', marginBottom: 8 }}>
{affectedDocs.length}
</div>
<div style={{ display: 'flex', flexDirection: 'column', gap: 6 }}>
{affectedDocs.map(doc => (
<div key={doc.doc_id} style={{
padding: '10px 14px',
background: theme.bgCard,
border: `1px solid ${theme.border}`,
borderLeft: `3px solid ${theme.accent}`,
borderRadius: 8,
display: 'flex',
alignItems: 'flex-start',
gap: 10,
}}>
<div style={{ flex: 1, minWidth: 0 }}>
<div style={{ fontSize: 13, fontWeight: 600, color: theme.text, overflow: 'hidden', textOverflow: 'ellipsis', whiteSpace: 'nowrap' }}>
{doc.doc_name}
</div>
{doc.snippet && (
<div style={{ fontSize: 12, color: theme.text3, marginTop: 3, overflow: 'hidden', textOverflow: 'ellipsis', whiteSpace: 'nowrap' }}>
{doc.snippet}
</div>
)}
</div>
<span className="mono" style={{ fontSize: 11, color: theme.accent, flexShrink: 0 }}>
{Math.round(doc.score * 100)}%
</span>
</div>
))}
</div>
</div>
)}
{/* Streaming analysis output */}
{(analysisText || analyzing) && (
<div ref={analysisRef} style={{
flex: 1,
overflowY: 'auto',
padding: '20px 24px',
background: theme.bgCard,
border: `1px solid ${theme.border}`,
borderRadius: 12,
boxShadow: !isDark ? '0 2px 8px rgba(0,0,0,0.03)' : 'none',
}}>
<div className="mono" style={{ fontSize: 11, color: theme.accent, letterSpacing: '1px', marginBottom: 14 }}>
ANALYSIS {analyzing && <span style={{ animation: 'blink 1s step-end infinite' }}></span>}
</div>
{analysisText && (
<MarkdownText text={analysisText} textColor={theme.text2} accent={theme.accent} />
)}
{analyzing && !analysisText && (
<div style={{ color: theme.text3, fontSize: 13 }}>...</div>
)}
</div>
)}
{/* Empty analysis state */}
{!analysisText && !analyzing && (
<div style={{ flex: 1, display: 'flex', alignItems: 'center', justifyContent: 'center' }}>
<div style={{ textAlign: 'center', color: theme.text3, fontSize: 13 }}>
AI
</div>
</div>
)}
</div>
);
};
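
Review note: `MarkdownText` passes streamed analysis text to `dangerouslySetInnerHTML` after only a `**bold**` substitution, so any HTML in that text would be rendered as markup. One possible hardening, sketched under the assumption that the text should be treated as untrusted, is to escape HTML first and only then insert the `<strong>` tags. `escapeHtml` and `renderInlineBold` are hypothetical helpers, not part of this diff.

```typescript
// Characters that must be escaped before text is placed into HTML.
const HTML_ESCAPES: Record<string, string> = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

// Hypothetical helper: neutralise any markup contained in the raw text.
function escapeHtml(raw: string): string {
  return raw.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch] ?? ch);
}

// Hypothetical helper: escape first, then apply the same **bold** regex the
// component uses, so <strong> is the only tag that can reach the DOM.
function renderInlineBold(line: string): string {
  return escapeHtml(line).replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>');
}

console.log(renderInlineBold('**a** <img src=x onerror=alert(1)>'));
// <strong>a</strong> &lt;img src=x onerror=alert(1)&gt;
```

In the component, each `line.replace(...)` / `content.replace(...)` call would become `renderInlineBold(line)` / `renderInlineBold(content)`; rendering React elements instead of an HTML string would avoid `dangerouslySetInnerHTML` altogether.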


@@ -0,0 +1,157 @@
import React from 'react';
import { useTheme } from '../../contexts';
import type { RegulationEvent, ImpactLevel, EventSource } from '../../api/perception';
const IMPACT_CONFIG: Record<ImpactLevel, { color: string; label: string; dot: string }> = {
high: { color: '#d64545', label: '高影响', dot: '●' },
medium: { color: '#ff8800', label: '中影响', dot: '●' },
low: { color: '#00d4aa', label: '低影响', dot: '●' },
};
const STATUS_LABEL: Record<string, string> = {
enacted: '已生效',
draft: '征求意见',
consultation: '公众咨询',
};
const SOURCE_COLORS: Record<string, string> = {
MIIT: '#e20074',
'UN-ECE': '#4a90d9',
ISO: '#7b68ee',
'国标委': '#00b89c',
'EUR-Lex': '#f5a623',
IATF: '#9b59b6',
};
interface EventFeedProps {
events: RegulationEvent[];
selectedId: string | null;
onSelect: (id: string) => void;
filterSource: string;
filterImpact: string;
onFilterSource: (v: string) => void;
onFilterImpact: (v: string) => void;
stats: { total: number; high_impact: number; medium_impact: number; low_impact: number; recent_90d: number } | null;
loading: boolean;
}
export const EventFeed: React.FC<EventFeedProps> = ({
events, selectedId, onSelect,
filterSource, filterImpact, onFilterSource, onFilterImpact,
stats, loading,
}) => {
const { theme, isDark } = useTheme();
const sources: EventSource[] = ['MIIT', 'UN-ECE', 'ISO', '国标委', 'EUR-Lex', 'IATF'];
const impacts: ImpactLevel[] = ['high', 'medium', 'low'];
return (
<div style={{ display: 'flex', flexDirection: 'column', height: '100%', gap: 16 }}>
{/* KPI mini-cards */}
{stats && (
<div style={{ display: 'grid', gridTemplateColumns: 'repeat(4, 1fr)', gap: 8 }}>
{[
{ label: '总计', value: stats.total, color: theme.text },
{ label: '高影响', value: stats.high_impact, color: '#d64545' },
{ label: '中影响', value: stats.medium_impact, color: '#ff8800' },
{ label: '近90天', value: stats.recent_90d, color: theme.accent },
].map(({ label, value, color }) => (
<div key={label} style={{
padding: '10px 12px',
background: theme.bgCard,
border: `1px solid ${theme.border}`,
borderRadius: 10,
position: 'relative',
overflow: 'hidden',
boxShadow: 'none',
}}>
<div style={{ position: 'absolute', top: 0, left: 0, right: 0, height: 2, background: color }} />
<div className="mono" style={{ fontSize: 10, color: theme.text3, letterSpacing: '0.5px' }}>{label}</div>
<div className="mono" style={{ fontSize: 22, fontWeight: 700, color }}>{value}</div>
</div>
))}
</div>
)}
{/* Filter row */}
<div style={{ display: 'flex', gap: 6, flexWrap: 'wrap' }}>
<button
onClick={() => onFilterSource('')}
style={{ padding: '4px 10px', borderRadius: 20, border: `1px solid ${filterSource === '' ? theme.accent : theme.border}`, background: filterSource === '' ? theme.accent + '20' : 'transparent', color: filterSource === '' ? theme.accent : theme.text3, fontSize: 11, cursor: 'pointer' }}
></button>
{sources.map(s => (
<button key={s} onClick={() => onFilterSource(filterSource === s ? '' : s)}
style={{ padding: '4px 10px', borderRadius: 20, border: `1px solid ${filterSource === s ? (SOURCE_COLORS[s] || theme.accent) : theme.border}`, background: filterSource === s ? (SOURCE_COLORS[s] || theme.accent) + '20' : 'transparent', color: filterSource === s ? (SOURCE_COLORS[s] || theme.accent) : theme.text3, fontSize: 11, cursor: 'pointer' }}>
{s}
</button>
))}
</div>
<div style={{ display: 'flex', gap: 6 }}>
<button onClick={() => onFilterImpact('')}
style={{ padding: '4px 10px', borderRadius: 20, border: `1px solid ${filterImpact === '' ? theme.accent : theme.border}`, background: filterImpact === '' ? theme.accent + '20' : 'transparent', color: filterImpact === '' ? theme.accent : theme.text3, fontSize: 11, cursor: 'pointer' }}>
</button>
{impacts.map(lvl => (
<button key={lvl} onClick={() => onFilterImpact(filterImpact === lvl ? '' : lvl)}
style={{ padding: '4px 10px', borderRadius: 20, border: `1px solid ${filterImpact === lvl ? IMPACT_CONFIG[lvl].color : theme.border}`, background: filterImpact === lvl ? IMPACT_CONFIG[lvl].color + '22' : 'transparent', color: filterImpact === lvl ? IMPACT_CONFIG[lvl].color : theme.text3, fontSize: 11, cursor: 'pointer' }}>
{IMPACT_CONFIG[lvl].dot} {IMPACT_CONFIG[lvl].label}
</button>
))}
</div>
{/* Event list */}
<div style={{ flex: 1, overflowY: 'auto', display: 'flex', flexDirection: 'column', gap: 8 }}>
{loading && (
<div className="mono" style={{ fontSize: 12, color: theme.text3, padding: '16px 0' }}>...</div>
)}
{!loading && events.length === 0 && (
<div style={{ fontSize: 13, color: theme.text3, padding: '32px 0', textAlign: 'center' }}></div>
)}
{events.map(evt => {
const cfg = IMPACT_CONFIG[evt.impact_level];
const isSelected = evt.id === selectedId;
const srcColor = SOURCE_COLORS[evt.source] || theme.accent;
return (
<div
key={evt.id}
onClick={() => onSelect(evt.id)}
style={{
padding: '14px 16px',
background: isSelected ? theme.bgHover : theme.bgCard,
borderRadius: 10,
border: `1px solid ${isSelected ? theme.accent : theme.border}`,
borderLeft: `4px solid ${cfg.color}`,
cursor: 'pointer',
transition: 'all 0.15s ease',
boxShadow: isSelected ? `0 0 0 1px ${theme.accent}40` : 'none',
}}
>
{/* Source + Status row */}
<div style={{ display: 'flex', alignItems: 'center', gap: 6, marginBottom: 6 }}>
<span style={{ fontSize: 10, fontWeight: 700, color: srcColor, background: srcColor + '18', borderRadius: 4, padding: '2px 7px' }}>{evt.source}</span>
<span className="mono" style={{ fontSize: 10, color: theme.text3 }}>{evt.standard_code}</span>
<span style={{ marginLeft: 'auto', fontSize: 10, color: evt.status === 'enacted' ? theme.green : '#ff8800', background: evt.status === 'enacted' ? theme.green + '18' : '#ff880018', borderRadius: 4, padding: '2px 6px', fontWeight: 600 }}>
{STATUS_LABEL[evt.status] ?? evt.status}
</span>
</div>
{/* Title */}
<div style={{ fontSize: 13, fontWeight: 600, color: theme.text, lineHeight: 1.4, marginBottom: 6, display: '-webkit-box', WebkitLineClamp: 2, WebkitBoxOrient: 'vertical', overflow: 'hidden' }}>
{evt.title}
</div>
{/* Date + impact */}
<div style={{ display: 'flex', alignItems: 'center', justifyContent: 'space-between' }}>
<span className="mono" style={{ fontSize: 11, color: theme.text3 }}>
{evt.published_at}{evt.effective_at ? `${evt.effective_at}` : ''}
</span>
<span style={{ fontSize: 10, color: cfg.color, fontWeight: 700 }}>{cfg.dot} {cfg.label}</span>
</div>
</div>
);
})}
</div>
</div>
);
};


@@ -0,0 +1,146 @@
import React, { useCallback, useEffect, useRef, useState } from 'react';
import { useTheme } from '../../contexts';
import { TPattern } from '../../components/common/TPattern';
import {
listEvents,
getPerceptionStats,
analyzeEvent,
type RegulationEvent,
type PerceptionStats,
type AffectedDoc,
} from '../../api/perception';
import { EventFeed } from './EventFeed';
import { AnalysisPanel } from './AnalysisPanel';
export const PerceptionPage: React.FC = () => {
const { theme } = useTheme();
// Feed state
const [events, setEvents] = useState<RegulationEvent[]>([]);
const [stats, setStats] = useState<PerceptionStats | null>(null);
const [feedLoading, setFeedLoading] = useState(true);
const [filterSource, setFilterSource] = useState('');
const [filterImpact, setFilterImpact] = useState('');
// Selected event
const [selectedId, setSelectedId] = useState<string | null>(null);
const selectedEvent = events.find(e => e.id === selectedId) ?? null;
// Analysis state
const [analyzing, setAnalyzing] = useState(false);
const [analysisText, setAnalysisText] = useState('');
const [affectedDocs, setAffectedDocs] = useState<AffectedDoc[]>([]);
const abortRef = useRef<AbortController | null>(null);
// Load events + stats
const loadFeed = useCallback(async () => {
setFeedLoading(true);
try {
const [evtRes, statsRes] = await Promise.all([
listEvents({
source: filterSource || undefined,
impact_level: filterImpact || undefined,
}),
getPerceptionStats(),
]);
setEvents(evtRes.events);
setStats(statsRes);
} catch {
// silent
} finally {
setFeedLoading(false);
}
}, [filterSource, filterImpact]);
useEffect(() => {
const timerId = window.setTimeout(() => { void loadFeed(); }, 0);
return () => window.clearTimeout(timerId);
}, [loadFeed]);
// When selecting a new event, clear previous analysis
const handleSelectEvent = (id: string) => {
if (id === selectedId) return;
abortRef.current?.abort();
setSelectedId(id);
setAnalysisText('');
setAffectedDocs([]);
setAnalyzing(false);
};
const handleAnalyze = useCallback(() => {
if (!selectedId || analyzing) return;
abortRef.current?.abort();
const ctrl = new AbortController();
abortRef.current = ctrl;
setAnalysisText('');
setAffectedDocs([]);
setAnalyzing(true);
void analyzeEvent(
selectedId,
(msg) => {
if (msg.type === 'sources' && msg.docs) {
setAffectedDocs(msg.docs);
} else if (msg.type === 'content' && msg.text) {
setAnalysisText(prev => prev + msg.text);
} else if (msg.type === 'error') {
setAnalysisText(prev => prev + `\n\n⚠ 分析出错:${msg.text ?? '未知错误'}`);
}
},
() => setAnalyzing(false),
ctrl.signal,
);
}, [selectedId, analyzing]);
const handleAbort = () => {
abortRef.current?.abort();
setAnalyzing(false);
};
return (
<div className="relative flex min-h-0 flex-1 flex-col">
<style>{`
@keyframes blink { 0%,100%{opacity:1} 50%{opacity:0} }
`}</style>
<TPattern />
{/* Page header */}
<div style={{ display: 'flex', alignItems: 'baseline', gap: 16, marginBottom: 24 }}>
<h1 style={{ fontSize: 20, fontWeight: 700, color: theme.text, margin: 0 }}></h1>
<span style={{ fontSize: 13, color: theme.text3 }}> · </span>
</div>
{/* Split layout */}
<div style={{
display: 'grid',
gridTemplateColumns: '400px 1fr',
gap: 24,
flex: 1,
minHeight: 560,
}}>
{/* Left: Event feed */}
<EventFeed
events={events}
selectedId={selectedId}
onSelect={handleSelectEvent}
filterSource={filterSource}
filterImpact={filterImpact}
onFilterSource={setFilterSource}
onFilterImpact={setFilterImpact}
stats={stats}
loading={feedLoading}
/>
{/* Right: Analysis panel */}
<AnalysisPanel
event={selectedEvent}
analyzing={analyzing}
analysisText={analysisText}
affectedDocs={affectedDocs}
onAnalyze={handleAnalyze}
onAbort={handleAbort}
/>
</div>
</div>
);
};
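
Review note: `handleSelectEvent` and `handleAbort` call `abort()` and reset state immediately, but whether `analyzeEvent` stops invoking its callbacks once the signal is aborted is not visible in this diff. If it does not, a late chunk could append text to the newly selected event. A minimal sketch of a guard follows; `ignoreAfterAbort` is a hypothetical helper, and only `AbortController` / `AbortSignal.aborted` are standard APIs.

```typescript
// Hypothetical helper: wrap a callback so it becomes a no-op once the
// associated AbortSignal has been aborted.
function ignoreAfterAbort<T>(
  signal: AbortSignal,
  handler: (value: T) => void,
): (value: T) => void {
  return (value) => {
    if (signal.aborted) return; // drop late messages from a cancelled run
    handler(value);
  };
}

// Usage with a stand-in for the streamed text state.
const ctrl = new AbortController();
let text = '';
const onChunk = ignoreAfterAbort<string>(ctrl.signal, (chunk) => {
  text += chunk;
});

onChunk('first ');
ctrl.abort();
onChunk('late'); // ignored: the run was aborted
console.log(JSON.stringify(text)); // "first "
```

In `handleAnalyze`, the message and completion callbacks passed to `analyzeEvent` could each be wrapped this way with `ctrl.signal`.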


@@ -0,0 +1 @@
export { PerceptionPage } from './PerceptionPage';


@@ -1,4 +1,4 @@
-import React, { useRef } from 'react';
+import React from 'react';
import { useTheme } from '../../contexts';
import type { RetrievalData } from '../../types';


@@ -133,7 +133,6 @@ export const RagChatPage: React.FC = () => {
sessionId,
abortRef.current.signal,
);
-// eslint-disable-next-line react-hooks/exhaustive-deps
}, [filterRegulationType, sessionId]);
const sendMessage = (text: string) => {
@@ -173,7 +172,7 @@ export const RagChatPage: React.FC = () => {
};
return (
-<div style={{ flex: 1, display: 'flex', height: 'calc(100vh - 128px)' }}>
+<div style={{ flex: 1, display: 'flex', minHeight: 0, height: '100%' }}>
{/* ── Left: chat panel ─────────────────────────────────── */}
<div style={{
flex: '0 0 60%',


@@ -1,6 +1,5 @@
import React, { useCallback, useEffect, useState } from 'react';
import { useTheme } from '../../contexts';
-import { Content } from '../../components/layout/Content';
import { TPattern } from '../../components/common/TPattern';
import { getSystemStats, getSystemConfig, getSystemHealth, type SystemStats, type SystemConfig, type SystemHealth } from '../../api/status';
import { getDocumentList, type DocInfo } from '../../api/docs';
@@ -107,7 +106,8 @@ export const StatusPage: React.FC = () => {
// Initial load
useEffect(() => {
-void loadData();
+const timerId = window.setTimeout(() => { void loadData(); }, 0);
+return () => window.clearTimeout(timerId);
}, [loadData]);
// Auto-poll every 5 s while any document is still processing
@@ -119,7 +119,7 @@ export const StatusPage: React.FC = () => {
}, [docs, loadData]);
return (
-<Content>
+<div className="relative w-full">
<style>{`@keyframes spin { to { transform: rotate(360deg); } }`}</style>
<TPattern />
@@ -386,6 +386,6 @@ export const StatusPage: React.FC = () => {
</div>
))}
</section>
-</Content>
+</div>
);
};


@@ -0,0 +1,20 @@
import { BrowserRouter, Navigate, Route, Routes } from 'react-router-dom';
import { AppShell } from '../components/layout/AppShell';
import { appTabs, defaultTab } from './tabs';
export function AppRouter() {
return (
<BrowserRouter>
<Routes>
<Route element={<AppShell />}>
<Route index element={<Navigate to={defaultTab.path} replace />} />
{appTabs.map((tab) => (
<Route key={tab.id} path={tab.path.slice(1)} element={null} />
))}
<Route path="*" element={<Navigate to={defaultTab.path} replace />} />
</Route>
</Routes>
</BrowserRouter>
);
}


@@ -0,0 +1,73 @@
import type { ComponentType } from 'react';
import { CompliancePage } from '../pages/Compliance';
import { DocsPage } from '../pages/Docs';
import { PerceptionPage } from '../pages/Perception';
import { RagChatPage } from '../pages/RagChat';
import { StatusPage } from '../pages/Status';
export type TabId = 'perception' | 'docs' | 'compliance' | 'status' | 'rag';
export type ContentWidth = 'default' | 'wide' | 'full';
export interface AppTabConfig {
id: TabId;
path: string;
label: string;
component: ComponentType;
keepAlive: boolean;
contentWidth: ContentWidth;
fillHeight?: boolean;
}
export const appTabs: AppTabConfig[] = [
{
id: 'perception',
path: '/perception',
label: '智能感知',
component: PerceptionPage,
keepAlive: true,
contentWidth: 'wide',
fillHeight: true,
},
{
id: 'docs',
path: '/docs',
label: '文档管理',
component: DocsPage,
keepAlive: true,
contentWidth: 'default',
},
{
id: 'compliance',
path: '/compliance',
label: '合规分析',
component: CompliancePage,
keepAlive: true,
contentWidth: 'wide',
fillHeight: true,
},
{
id: 'status',
path: '/status',
label: '系统状态',
component: StatusPage,
keepAlive: true,
contentWidth: 'default',
},
{
id: 'rag',
path: '/rag',
label: '法规对话',
component: RagChatPage,
keepAlive: true,
contentWidth: 'wide',
fillHeight: true,
},
];
export const defaultTab = appTabs.find((tab) => tab.id === 'compliance') ?? appTabs[0];
export function getTabByPath(pathname: string): AppTabConfig {
return appTabs.find((tab) => tab.path === pathname) ?? defaultTab;
}
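
Review note: `getTabByPath` compares `pathname` with strict equality, so a URL such as `/docs/` (trailing slash) resolves to `defaultTab` rather than the docs tab. How the shell calls this function is not shown in this diff, so the following is only a sketch of one way to normalise the path first. `TabLike`, `normalizePath`, `findTab`, and the two sample tabs are hypothetical stand-ins for `AppTabConfig`, `appTabs`, and `defaultTab`.

```typescript
// Hypothetical, trimmed-down stand-in for AppTabConfig.
interface TabLike {
  id: string;
  path: string;
}

const fallbackTab: TabLike = { id: 'compliance', path: '/compliance' };
const tabs: TabLike[] = [fallbackTab, { id: 'docs', path: '/docs' }];

// Strip trailing slashes; keep the root path as "/".
function normalizePath(pathname: string): string {
  const trimmed = pathname.replace(/\/+$/, '');
  return trimmed === '' ? '/' : trimmed;
}

// Same lookup as getTabByPath, applied to the normalised path.
function findTab(pathname: string): TabLike {
  const normalized = normalizePath(pathname);
  return tabs.find((tab) => tab.path === normalized) ?? fallbackTab;
}

console.log(findTab('/docs/').id); // docs
console.log(findTab('/unknown').id); // compliance
```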


@@ -1,8 +1,9 @@
@import url('https://fonts.googleapis.com/css2?family=TeleNeo:wght@400;500;600;700&family=JetBrains+Mono:wght@400;500;600&display=swap');
+@import "tw-animate-css";
+@import "tailwindcss";
+@custom-variant dark (&:is(.dark *));
-@tailwind base;
-@tailwind components;
-@tailwind utilities;
/* Light mode (default) */
:root {
@@ -19,6 +20,38 @@
--t-orange: #ff7700;
--t-accent-glow: rgba(226,0,116,0.08);
--t-pattern-opacity: 0.04;
+--background: var(--t-bg);
+--foreground: var(--t-text);
+--card: var(--t-bg-card);
+--card-foreground: var(--t-text);
+--popover: var(--t-bg-card);
+--popover-foreground: var(--t-text);
+--primary: #e20074;
+--primary-foreground: #ffffff;
+--secondary: var(--t-bg-hover);
+--secondary-foreground: var(--t-text);
+--muted: var(--t-bg-hover);
+--muted-foreground: var(--t-text3);
+--accent: rgba(226, 0, 116, 0.08);
+--accent-foreground: #e20074;
+--destructive: #ff4444;
+--border: var(--t-border);
+--input: var(--t-border);
+--ring: rgba(226, 0, 116, 0.35);
+--chart-1: #e20074;
+--chart-2: #be0060;
+--chart-3: #00b89c;
+--chart-4: #ff7700;
+--chart-5: #4a4a5a;
+--radius: 0.625rem;
+--sidebar: var(--t-bg-card);
+--sidebar-foreground: var(--t-text);
+--sidebar-primary: #e20074;
+--sidebar-primary-foreground: #ffffff;
+--sidebar-accent: var(--t-bg-hover);
+--sidebar-accent-foreground: var(--t-text);
+--sidebar-border: var(--t-border);
+--sidebar-ring: rgba(226, 0, 116, 0.35);
}
/* Dark mode */
@@ -36,6 +69,80 @@
--t-orange: #ff8800;
--t-accent-glow: rgba(226,0,116,0.12);
--t-pattern-opacity: 0.03;
+--background: var(--t-bg);
+--foreground: var(--t-text);
+--card: var(--t-bg-card);
+--card-foreground: var(--t-text);
+--popover: var(--t-bg-card);
+--popover-foreground: var(--t-text);
+--primary: #e20074;
+--primary-foreground: #ffffff;
+--secondary: var(--t-bg-hover);
+--secondary-foreground: var(--t-text);
+--muted: var(--t-bg-hover);
+--muted-foreground: var(--t-text3);
+--accent: rgba(226, 0, 116, 0.14);
+--accent-foreground: #ff7abf;
+--destructive: #ff4444;
+--border: var(--t-border);
+--input: var(--t-border-light);
+--ring: rgba(226, 0, 116, 0.45);
+--chart-1: #e20074;
+--chart-2: #f04090;
+--chart-3: #00d4aa;
+--chart-4: #ff8800;
+--chart-5: #c0c0d0;
+--sidebar: var(--t-bg-card);
+--sidebar-foreground: var(--t-text);
+--sidebar-primary: #e20074;
+--sidebar-primary-foreground: #ffffff;
+--sidebar-accent: var(--t-bg-hover);
+--sidebar-accent-foreground: var(--t-text);
+--sidebar-border: var(--t-border);
+--sidebar-ring: rgba(226, 0, 116, 0.45);
+}
+/* Dim mode — Indigo Dusk: deep navy-purple mid-tone between dark and light */
+.dim {
+--t-bg: #1e1b2e;
+--t-bg-card: #252237;
+--t-bg-hover: #2d2945;
+--t-bg-elevated: #292541;
+--t-border: #3a3650;
+--t-border-light: #504c6e;
+--t-text: #f0eeff;
+--t-text2: #b8b4d8;
+--t-text3: #7a7698;
+--t-green: #00c4a0;
+--t-orange: #ff8820;
+--t-accent-glow: rgba(226,0,116,0.14);
+--t-pattern-opacity: 0.04;
+--background: var(--t-bg);
+--foreground: var(--t-text);
+--card: var(--t-bg-card);
+--card-foreground: var(--t-text);
+--popover: var(--t-bg-card);
+--popover-foreground: var(--t-text);
+--primary: #e20074;
+--primary-foreground: #ffffff;
+--secondary: var(--t-bg-hover);
+--secondary-foreground: var(--t-text);
+--muted: var(--t-bg-hover);
+--muted-foreground: var(--t-text3);
+--accent: rgba(226, 0, 116, 0.12);
+--accent-foreground: #f04090;
+--destructive: #ff4444;
+--border: var(--t-border);
+--input: var(--t-border-light);
+--ring: rgba(226, 0, 116, 0.40);
+--sidebar: var(--t-bg-card);
+--sidebar-foreground: var(--t-text);
+--sidebar-primary: #e20074;
+--sidebar-primary-foreground: #ffffff;
+--sidebar-accent: var(--t-bg-hover);
+--sidebar-accent-foreground: var(--t-text);
+--sidebar-border: var(--t-border);
+--sidebar-ring: rgba(226, 0, 116, 0.40);
}
/* Base styles */
@@ -66,6 +173,13 @@ body.dark-mode {
background: #0a0a12;
}
+/* Dim mode body */
+.dim body,
+body.dim-mode {
+color: #f0eeff;
+background: #1e1b2e;
+}
/* Selection */
::selection {
background: rgba(226, 0, 116, 0.3);
@@ -100,6 +214,11 @@ button, input {
transition: none;
}
+/* Shell navigation manages its own transition timing. */
+[data-shell-tab='true'] {
+transition: color 0.2s ease-out;
+}
/* T-Systems Button Style */
.t-btn,
.t-btn:hover {
@@ -131,17 +250,22 @@ button, input {
background: linear-gradient(180deg, #12121f, #0a0a12);
}
+/* Card gradient for dim mode */
+.dim .t-card-gradient {
+background: linear-gradient(180deg, #e8e5f4, #f0eef8);
+}
/* Card gradient for light mode */
-:not(.dark) .t-card-gradient {
+:not(.dark):not(.dim) .t-card-gradient {
background: linear-gradient(180deg, #ffffff, #fafafa);
}
/* Light mode shadow for cards */
-:not(.dark) .t-card-shadow {
+:not(.dark):not(.dim) .t-card-shadow {
box-shadow: 0 2px 8px rgba(226,0,116,0.04);
}
-:not(.dark) .t-card-shadow-lg {
+:not(.dark):not(.dim) .t-card-shadow-lg {
box-shadow: 0 4px 16px rgba(226,0,116,0.08);
}
@@ -272,3 +396,59 @@ button, input {
background: linear-gradient(135deg, #f0208a 0%, #d01070 100%);
}
}
+@theme inline {
+--font-heading: 'TeleNeo', 'Segoe UI', system-ui, sans-serif;
+--font-sans: 'TeleNeo', 'Segoe UI', system-ui, sans-serif;
+--font-mono: 'JetBrains Mono', monospace;
+--color-sidebar-ring: var(--sidebar-ring);
+--color-sidebar-border: var(--sidebar-border);
+--color-sidebar-accent-foreground: var(--sidebar-accent-foreground);
+--color-sidebar-accent: var(--sidebar-accent);
+--color-sidebar-primary-foreground: var(--sidebar-primary-foreground);
+--color-sidebar-primary: var(--sidebar-primary);
+--color-sidebar-foreground: var(--sidebar-foreground);
+--color-sidebar: var(--sidebar);
+--color-chart-5: var(--chart-5);
+--color-chart-4: var(--chart-4);
+--color-chart-3: var(--chart-3);
+--color-chart-2: var(--chart-2);
+--color-chart-1: var(--chart-1);
+--color-ring: var(--ring);
+--color-input: var(--input);
+--color-border: var(--border);
+--color-destructive: var(--destructive);
+--color-accent-foreground: var(--accent-foreground);
+--color-accent: var(--accent);
+--color-muted-foreground: var(--muted-foreground);
+--color-muted: var(--muted);
+--color-secondary-foreground: var(--secondary-foreground);
+--color-secondary: var(--secondary);
+--color-primary-foreground: var(--primary-foreground);
+--color-primary: var(--primary);
+--color-popover-foreground: var(--popover-foreground);
+--color-popover: var(--popover);
+--color-card-foreground: var(--card-foreground);
+--color-card: var(--card);
+--color-foreground: var(--foreground);
+--color-background: var(--background);
+--radius-sm: calc(var(--radius) * 0.6);
+--radius-md: calc(var(--radius) * 0.8);
+--radius-lg: var(--radius);
+--radius-xl: calc(var(--radius) * 1.4);
+--radius-2xl: calc(var(--radius) * 1.8);
+--radius-3xl: calc(var(--radius) * 2.2);
+--radius-4xl: calc(var(--radius) * 2.6);
+}
+@layer base {
+* {
+@apply border-border outline-ring/50;
+}
+body {
+@apply bg-background text-foreground;
+}
+html {
+@apply font-sans;
+}
+}

Some files were not shown because too many files have changed in this diff.