Compare commits
9 Commits
09f9cf2bf0
...
main-20260
| Author | SHA1 | Date |
|---|---|---|
| | 3674f9171e | |
| | 30c7bda389 | |
| | fec22a3a2c | |
| | 34d72d7ce9 | |
| | 987cc097da | |
| | 10a034e294 | |
| | 091a02c522 | |
| | 37f7a60b0a | |
| | f9ee644f25 | |
.env
@@ -9,7 +9,7 @@ DEBUG=false
 # ===== Milvus vector database configuration (existing) =====
 MILVUS_HOST=6.86.80.8
 MILVUS_PORT=19530
-MILVUS_COLLECTION=regulations_dense_1024_v1
+MILVUS_COLLECTION=regulations_dense_1024_v2
 MILVUS_DB_NAME=default
 MILVUS_INDEX_TYPE=IVF_FLAT
 MILVUS_NLIST=128
@@ -4,7 +4,7 @@
 # ===== Milvus vector database configuration (existing) =====
 MILVUS_HOST=6.86.80.8
 MILVUS_PORT=19530
-MILVUS_COLLECTION=regulations_dense_1024_v1
+MILVUS_COLLECTION=regulations_dense_1024_v2
 MILVUS_DB_NAME=default
 MILVUS_INDEX_TYPE=IVF_FLAT
 MILVUS_NLIST=128
@@ -9,7 +9,7 @@ DEBUG=false
 # ===== Milvus vector database configuration =====
 MILVUS_HOST=6.86.80.8
 MILVUS_PORT=19530
-MILVUS_COLLECTION=regulations_dense_1024_v1
+MILVUS_COLLECTION=regulations_dense_1024_v2
 MILVUS_DB_NAME=default
 MILVUS_INDEX_TYPE=IVF_FLAT
 MILVUS_NLIST=128
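The three hunks above only switch `MILVUS_COLLECTION` from `_v1` to `_v2`. As a rough illustration of how these variables might be consumed, here is a minimal sketch that loads them into a typed object. `MilvusSettings` and `load_milvus_settings` are hypothetical names for this example and are not taken from the repository; the fallback values mirror the hunks above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MilvusSettings:
    """Hypothetical container for the MILVUS_* variables shown in the diff."""
    host: str
    port: int
    collection: str
    db_name: str
    index_type: str
    nlist: int


def load_milvus_settings(env: Optional[Mapping[str, str]] = None) -> MilvusSettings:
    """Read MILVUS_* values from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return MilvusSettings(
        host=env.get("MILVUS_HOST", "localhost"),
        port=int(env.get("MILVUS_PORT", "19530")),
        collection=env.get("MILVUS_COLLECTION", "regulations_dense_1024_v2"),
        db_name=env.get("MILVUS_DB_NAME", "default"),
        index_type=env.get("MILVUS_INDEX_TYPE", "IVF_FLAT"),
        nlist=int(env.get("MILVUS_NLIST", "128")),
    )
```

Passing an explicit mapping keeps the loader testable without touching the process environment.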
.gitignore
@@ -59,3 +59,6 @@ Thumbs.db

 # logs files
 logs/
+
+# codex
+.agents
AGENTS.md
@@ -4,6 +4,12 @@

 - Backend code lives under `backend/app/`; frontend is the Vite app in `frontend/`.

+## Frontend UX Constraints
+
+- Frontend work in `frontend/` must target desktop Web first.
+- Do not proactively add mobile-specific adaptations, responsive reflow for small screens, or mobile-first layout compromises unless the user explicitly asks for them.
+- When desktop and mobile requirements conflict, preserve the desktop Web layout and interaction model by default.
+
 ## Entrypoints

 - Backend entrypoint is `backend/app/main.py`, which re-exports `app` from `app.api.main`.
@@ -39,6 +45,15 @@
 - `tests/verify_mvp.py` also expects the `BGEM3Embedder` stack to be available and explicitly mentions `FlagEmbedding`.
 - For backend-only changes, prefer focused import/startup checks unless you know the external services and model dependencies are available.

+## Backend Architecture Authority
+
+- `docs/architecture/backend-project-architecture.md` is the authoritative backend architecture document for ongoing backend development.
+- New backend business logic must follow `api -> application -> domain ports -> infrastructure`.
+- Treat `backend/app/shared/bootstrap.py` as the current composition root for backend dependency wiring.
+- Do not add new business orchestration to `backend/app/services/*` or `backend/app/workflows/*` unless the task is explicitly a migration step.
+- API routes must not directly access `ConversationStore`; session access should go through application services.
+- Legacy files may be patched for compatibility or bug fixes, but should not gain new long-term responsibilities.
+
 ## Backend Commenting Standard

 - All comments and docstrings in `backend/**/*.py` must be written in English.
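The `api -> application -> domain ports -> infrastructure` rule in the hunk above can be illustrated with a small sketch. Every name here (`SessionPort`, `InMemorySessionStore`, `ConversationService`, `get_history_route`) is hypothetical and chosen for illustration only; the repository's real classes, including `ConversationStore`, may look quite different.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


class SessionPort(Protocol):
    """Domain port: the only session interface the application layer sees."""

    def get_history(self, session_id: str) -> list[str]: ...


class InMemorySessionStore:
    """Infrastructure adapter (stand-in for a real session store)."""

    def __init__(self) -> None:
        self._data: dict[str, list[str]] = {}

    def append(self, session_id: str, message: str) -> None:
        self._data.setdefault(session_id, []).append(message)

    def get_history(self, session_id: str) -> list[str]:
        return list(self._data.get(session_id, []))


@dataclass
class ConversationService:
    """Application service: orchestrates use cases through the port."""

    sessions: SessionPort

    def history(self, session_id: str) -> list[str]:
        return self.sessions.get_history(session_id)


def get_history_route(service: ConversationService, session_id: str) -> dict:
    """API layer: calls the application service, never the store directly."""
    return {"session_id": session_id, "messages": service.history(session_id)}
```

In this sketch the wiring of `InMemorySessionStore` into `ConversationService` would happen in a composition root, which is the role the hunk assigns to `backend/app/shared/bootstrap.py`.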
@@ -105,7 +105,7 @@ ALIBABA_ACCESS_KEY_ID=your_aliyun_access_key_id
 ALIBABA_ACCESS_KEY_SECRET=your_aliyun_access_key_secret
 EMBEDDING_API_KEY=your_embedding_api_key_here
 EMBEDDING_MODEL=text-embedding-v3
-EMBEDDING_DIM=1536
+EMBEDDING_DIM=1024
 PARSER_BACKEND=aliyun
 CHUNK_BACKEND=aliyun
 PARSER_FAILURE_MODE=fail
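This hunk moves `EMBEDDING_DIM` to 1024, while the collection names in this comparison encode a dimension too (`regulations_dense_1024_v2`). A minimal sketch of a startup check that catches a mismatch between the two is shown below. It assumes the `_dense_<dim>_` naming pattern always holds, which the diff suggests but does not guarantee; `collection_dim` and `check_embedding_dim` are hypothetical helpers.

```python
import re
from typing import Optional


def collection_dim(collection_name: str) -> Optional[int]:
    """Extract the dimension from names like 'regulations_dense_1024_v2'."""
    match = re.search(r"_dense_(\d+)_", collection_name)
    return int(match.group(1)) if match else None


def check_embedding_dim(collection_name: str, embedding_dim: int) -> None:
    """Raise if the configured dimension disagrees with the collection name."""
    expected = collection_dim(collection_name)
    if expected is not None and expected != embedding_dim:
        raise ValueError(
            f"EMBEDDING_DIM={embedding_dim} does not match collection "
            f"{collection_name!r} (expects {expected})"
        )
```

Names that do not follow the pattern are silently accepted, since the check has nothing to compare against.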
README.md
@@ -1,139 +0,0 @@
# AI + Compliance Intelligence Hub - Regulation Document Parsing and Ingestion

A compliance intelligence platform for automakers and factories that parses, chunks, embeds, and stores regulation documents as vectors.

## MVP Features

Core features in this release (minimum viable version):

- ✅ PDF/DOC/DOCX parsing (Alibaba Cloud Document Mind)
- ✅ Unified chunking based on Alibaba Cloud `vector_chunks`
- ✅ OpenAI-compatible embedding (`text-embedding-v3`, 1536 dimensions)
- ✅ Milvus vector database storage and dense-only retrieval
- ✅ FastAPI endpoints

## Project Structure

```text
AIRegulation-DocAnalysis-Demo/
├── backend/
│   ├── app/
│   │   ├── api/              # FastAPI API layer
│   │   ├── application/      # use-case orchestration layer
│   │   ├── domain/           # domain models and stable ports
│   │   ├── infrastructure/   # MinIO / Milvus / Alibaba Cloud / embedding / session adapters
│   │   ├── config/           # configuration and logging
│   │   └── workers/
│   ├── requirements.txt
│   └── main.py
├── frontend/                 # Vite React frontend
├── tests/                    # root-level tests that import backend/app
├── docker/
│   └── docker-compose.yml
├── pyproject.toml
└── .env.example
```

## Quick Start

### 1. Install dependencies

```bash
./dev.sh setup
```

### 2. Start the Milvus vector database

```bash
cd docker
docker-compose up -d
```

Wait for Milvus to finish starting (about 30 seconds):
```bash
docker-compose logs -f milvus
```

### 3. Start the API service

```bash
./dev.sh start api --foreground
```

API documentation: http://localhost:8000/docs

## API

### Upload a document

```bash
curl -X POST http://localhost:8000/api/v1/documents/upload \
  -F "file=@your_regulation.pdf" \
  -F "doc_name=GB 7258-2017" \
  -F "regulation_type=车辆安全"
```

### Search regulations

```bash
curl -X POST http://localhost:8000/api/v1/knowledge/search \
  -H "Content-Type: application/json" \
  -d '{"query": "机动车安全技术要求", "top_k": 10}'
```

## Tech Stack

| Category | Technology |
|------|------|
| Document parsing | Alibaba Cloud Document Mind + python-docx |
| Chunking strategy | Alibaba Cloud `vector_chunks` |
| Embedding model | `text-embedding-v3` (1536-dimensional dense) |
| Vector database | Milvus 2.4 (local Docker deployment) |
| Retrieval | Dense-only retrieval |
| API framework | FastAPI |

## Configuration

Create a `.env` file (see `.env.example`):

```env
# Milvus configuration
MILVUS_HOST=localhost
MILVUS_PORT=19530

# Alibaba Cloud document parsing
ALIBABA_ACCESS_KEY_ID=your_aliyun_access_key_id
ALIBABA_ACCESS_KEY_SECRET=your_aliyun_access_key_secret
PARSER_BACKEND=aliyun
CHUNK_BACKEND=aliyun

# Embedding configuration
EMBEDDING_MODEL=text-embedding-v3
EMBEDDING_DIM=1536
EMBEDDING_API_KEY=your_embedding_api_key_here

# Chunking configuration
CHUNK_SIZE=512
```

## Future Iterations (out of scope for this MVP)

- LLM summary generation (the main upload path does not generate summaries by default)
- Document upload UI
- Hybrid-retrieval Q&A
- Regulation change monitoring and automatic updates

## Parsing Artifacts

After a successful upload, the system persists the intermediate Alibaba Cloud parsing results to MinIO:

- `artifacts/{doc_id}/layouts.json`
- `artifacts/{doc_id}/structure_nodes.json`
- `artifacts/{doc_id}/semantic_blocks.json`
- `artifacts/{doc_id}/vector_chunks.json`

The current default Milvus collection is `regulations_dense_1536_v2`.

## License

MIT License
@@ -2,6 +2,13 @@

 `backend` is the FastAPI backend directory currently in production use; its entrypoint is `app.main:app`.

+## Architecture Constraints
+
+- Authoritative backend architecture document: `docs/architecture/backend-project-architecture.md`
+- Backend migration RFC: `docs/rfc/backend-api-parsing-embedding-migration-requirements.md`
+- New backend features and refactors follow `api -> application -> domain ports -> infrastructure` by default.
+- `backend/app/services/*` and `backend/app/workflows/*` are legacy directories during the migration; apart from migration or compatibility fixes, do not add new business orchestration logic to them.
+
 ## Startup

 ```bash
@@ -34,9 +41,14 @@ PYTHONPATH=backend uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
 ```text
 backend/
 ├── app/
-│   ├── api/                 # FastAPI routes and models
+│   ├── api/                 # FastAPI routes and transport models
+│   ├── application/         # use-case orchestration layer
+│   ├── domain/              # core business models and stable ports
+│   ├── infrastructure/      # adapters for external systems
+│   ├── shared/              # composition root and cross-cutting support
 │   ├── config/              # configuration and logging
-│   ├── services/            # document processing, LLM, RAG, storage
+│   ├── services/            # legacy façade / compatibility entrypoint
+│   ├── workflows/           # legacy workflow entrypoint
 │   └── workers/             # task-related code
 ├── .env.example
 ├── requirements.txt
@@ -46,4 +58,13 @@ backend/
 ## Notes

 - The route prefix stays `/api/v1` for compatibility with the current frontend.
-- The original `backend/app/api/routes/docs.py`, `rag.py`, `compliance.py`, and `status.py` remain in the repository but are no longer the main route entrypoints.
+- The current main business entrypoints are `documents`, `knowledge`, and `agent`.
+- `compliance.py` is still mounted but does not yet satisfy the target architecture constraints; do not extend its business orchestration before it is migrated.
+- `docs.py` and `rag.py` are legacy, non-primary entrypoints and should not be extended.
+
+## Development Constraints
+
+- Read `docs/architecture/backend-project-architecture.md` before starting backend development.
+- New business capabilities go in the `application` layer by default and are called from `api`; do not write them directly into routes.
+- Routes must not directly access MinIO, Milvus, the Parser SDK, the LLM SDK, or `ConversationStore`.
+- `backend/app/shared/bootstrap.py` is the current composition root; consolidate dependency wiring there.
8
backend/aliyun_parser/.claude/settings.local.json
Normal file
8
backend/aliyun_parser/.claude/settings.local.json
Normal file
@@ -0,0 +1,8 @@
|
||||
{
|
||||
"permissions": {
|
||||
"allow": [
|
||||
"Bash(python3 *)",
|
||||
"Bash(PGPASSWORD=postgresql123456 psql *)"
|
||||
]
|
||||
}
|
||||
}
|
||||
backend/aliyun_parser/parse_pdf.py
Normal file
@@ -0,0 +1,475 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Parse a PDF with the Alibaba Cloud Document Mind API and emit three layers of chunks:
- structure_nodes: table-of-contents tree
- semantic_blocks: semantic blocks (section text, tables, figures)
- vector_chunks: retrieval chunks (split with overlap)
"""

import argparse
import json
import os
import re
import time
from pathlib import Path
from typing import Dict, List

from alibabacloud_docmind_api20220711.client import Client as DocmindClient
from alibabacloud_tea_openapi import models as open_api_models
from alibabacloud_docmind_api20220711 import models as docmind_models
from alibabacloud_tea_util import models as util_models

# ===================== Alibaba Cloud configuration =====================
# Credentials are read from the environment so that they are not committed to source control.
ALIBABA_ACCESS_KEY_ID = os.environ.get("ALIBABA_ACCESS_KEY_ID", "")
ALIBABA_ACCESS_KEY_SECRET = os.environ.get("ALIBABA_ACCESS_KEY_SECRET", "")
ALIBABA_ENDPOINT = "docmind-api.cn-hangzhou.aliyuncs.com"

# ===================== Chunking parameters =====================
MAX_CHARS = 600
OVERLAP_CHARS = 80

# ===================== Layout type constants =====================
# Table-of-contents headings as they appear in the (Chinese) source documents.
TOC_TITLES = {"目次", "目录"}
TITLE_SUBTYPES = {"doc_title", "para_title"}
TEXT_SUBTYPES = {"para", "none"}
FIGURE_TYPES = {"figure", "figure_name", "figure_note"}
FIGURE_SUBTYPES = {"picture", "pic_title", "pic_caption"}


# ===================== Alibaba Cloud API client =====================
def init_client() -> DocmindClient:
    config = open_api_models.Config(
        access_key_id=ALIBABA_ACCESS_KEY_ID,
        access_key_secret=ALIBABA_ACCESS_KEY_SECRET,
    )
    config.endpoint = ALIBABA_ENDPOINT
    return DocmindClient(config)


def submit_job(client: DocmindClient, file_path: str) -> str:
    """Submit a document parsing job and return its task ID."""
    file_name = Path(file_path).name
    # Keep the upload stream open only for the duration of the request.
    with open(file_path, "rb") as file_stream:
        request = docmind_models.SubmitDocParserJobAdvanceRequest(
            file_url_object=file_stream,
            file_name=file_name,
            file_name_extension=Path(file_path).suffix.lstrip("."),
            llm_enhancement=True,
            enhancement_mode="VLM",
        )
        runtime = util_models.RuntimeOptions()
        response = client.submit_doc_parser_job_advance(request, runtime)
    return response.body.data.id


def query_status(client: DocmindClient, task_id: str) -> Dict:
    """Query the job status; returns None when the response carries no data."""
    request = docmind_models.QueryDocParserStatusRequest(id=task_id)
    response = client.query_doc_parser_status(request)
    return response.body.data.to_map() if response.body.data else None


def wait_for_completion(client: DocmindClient, task_id: str, poll_interval: int = 5) -> bool:
    """Poll until the job succeeds or fails."""
    while True:
        status_data = query_status(client, task_id)
        if not status_data:
            return False
        status = status_data.get("Status", "").lower()
        if status == "success":
            return True
        elif status == "failed":
            print(f"Job failed: {status_data}")
            return False
        print(f"Job status: {status}, waiting...")
        time.sleep(poll_interval)


def get_result(client: DocmindClient, task_id: str, layout_num: int = 0, layout_step_size: int = 50) -> Dict:
    """Fetch one page of parsing results starting at layout_num."""
    request = docmind_models.GetDocParserResultRequest(
        id=task_id,
        layout_step_size=layout_step_size,
        layout_num=layout_num,
    )
    response = client.get_doc_parser_result(request)
    return response.body.data if response.body.data else None


def collect_all_results(client: DocmindClient, task_id: str, layout_step_size: int = 50) -> List[Dict]:
    """Page through the results and collect every layout."""
    all_layouts = []
    layout_num = 0
    while True:
        result_data = get_result(client, task_id, layout_num, layout_step_size)
        if not result_data:
            break
        layouts = result_data.get("layouts", [])
        if not layouts:
            break
        all_layouts.extend(layouts)
        layout_num += len(layouts)
        # A short page means there is nothing left to fetch.
        if len(layouts) < layout_step_size:
            break
    return all_layouts


# ===================== Text processing =====================
def normalize_text(text: str) -> str:
    text = text.replace("\r", "\n")
    # Assumption: the replaced character was a non-breaking space; it is rendered as a plain space in this view.
    text = text.replace("\u00a0", " ")
    text = re.sub(r"\n+", "\n", text)
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


def get_page(layout: Dict) -> int:
    return layout.get("pageNum", layout.get("pageNumber", 0))


def get_text(layout: Dict) -> str:
    # Prefer plain text and fall back to the Markdown rendering.
    text = normalize_text(layout.get("text", ""))
    if text:
        return text
    return normalize_text(layout.get("markdownContent", ""))


# ===================== Layout type checks =====================
def is_title(layout: Dict) -> bool:
    return layout.get("type") == "title" or layout.get("subType") in TITLE_SUBTYPES


def is_text(layout: Dict) -> bool:
    return layout.get("type") == "text" and layout.get("subType", "none") in TEXT_SUBTYPES


def is_figure(layout: Dict) -> bool:
    return layout.get("type") in FIGURE_TYPES or layout.get("subType") in FIGURE_SUBTYPES


def is_table(layout: Dict) -> bool:
    return layout.get("type") == "table"


def is_toc_layout(layout: Dict) -> bool:
    text = get_text(layout)
    if text in TOC_TITLES:
        return True
    # On page 1, treat "1.2 Title ..... 3" style lines as table-of-contents entries.
    if get_page(layout) == 1 and re.match(r"^\d+(\.\d+)*\s+.+[.。…]{2,}\s*\d+$", text):
        return True
    return False


def extract_table_text(layout: Dict) -> str:
    # Emit one output line per cell, joining the text of its nested layouts.
    rows = []
    for cell in layout.get("cells", []):
        texts = []
        for cell_layout in cell.get("layouts", []):
            cell_text = normalize_text(cell_layout.get("text", ""))
            if cell_text:
                texts.append(cell_text)
        if texts:
            rows.append(" ".join(texts))
    return "\n".join(rows).strip()


# ===================== Structure layer: table-of-contents tree =====================
def build_structure_nodes(layouts: List[Dict]) -> List[Dict]:
    nodes = []
    for layout in layouts:
        if not is_title(layout):
            continue
        text = get_text(layout)
        if not text or text in TOC_TITLES:
            continue
        nodes.append(
            {
                "unique_id": layout.get("uniqueId"),
                "page": get_page(layout),
                "index": layout.get("index", 0),
                "level": layout.get("level", 0),
                "title": text,
                "type": layout.get("type"),
                "sub_type": layout.get("subType"),
            }
        )
    return nodes


# ===================== Semantic layer: section content =====================
def update_section_path(section_stack: List[Dict], layout: Dict) -> List[Dict]:
    level = layout.get("level", 0)
    title = get_text(layout)
    # Pop siblings and deeper sections so the stack ends at this title's parent.
    while section_stack and section_stack[-1]["level"] >= level:
        section_stack.pop()
    section_stack.append(
        {
            "level": level,
            "title": title,
            "page": get_page(layout),
            "unique_id": layout.get("uniqueId"),
        }
    )
    return section_stack


def section_path_titles(section_stack: List[Dict]) -> List[str]:
    return [item["title"] for item in section_stack]


def flush_text_block(blocks: List[Dict], semantic_blocks: List[Dict], block_id: int) -> int:
    """Merge pending text layouts into one section_text block; return the next block ID."""
    if not blocks:
        return block_id

    texts = [item["text"] for item in blocks if item["text"]]
    merged_text = "\n".join(texts).strip()
    if not merged_text:
        return block_id

    semantic_blocks.append(
        {
            "semantic_id": f"semantic-{block_id}",
            "block_type": "section_text",
            "page_start": min(item["page"] for item in blocks),
            "page_end": max(item["page"] for item in blocks),
            "section_path": blocks[0]["section_path"],
            "section_level": blocks[0]["section_level"],
            "section_title": blocks[0]["section_title"],
            "source_ids": [item["unique_id"] for item in blocks if item.get("unique_id")],
            "text": merged_text,
        }
    )
    return block_id + 1


def build_semantic_blocks(layouts: List[Dict]) -> List[Dict]:
    semantic_blocks = []
    section_stack = []
    pending_text_blocks = []
    block_id = 1
    skip_toc_page = False

    for layout in layouts:
        text = get_text(layout)
        page = get_page(layout)

        # Once a table-of-contents layout is seen, skip the rest of page 1.
        if is_toc_layout(layout):
            skip_toc_page = True
            continue
        if skip_toc_page and page == 1:
            continue
        if skip_toc_page and page != 1:
            skip_toc_page = False

        if is_title(layout):
            block_id = flush_text_block(pending_text_blocks, semantic_blocks, block_id)
            pending_text_blocks = []
            section_stack = update_section_path(section_stack, layout)
            continue

        section_path = section_path_titles(section_stack)
        # "未分类" ("uncategorized") labels content that precedes any title.
        section_title = section_path[-1] if section_path else "未分类"
        section_level = len(section_path)

        if is_table(layout):
            block_id = flush_text_block(pending_text_blocks, semantic_blocks, block_id)
            pending_text_blocks = []
            table_text = extract_table_text(layout)
            if table_text:
                semantic_blocks.append(
                    {
                        "semantic_id": f"semantic-{block_id}",
                        "block_type": "table",
                        "page_start": page,
                        "page_end": page,
                        "section_path": section_path,
                        "section_level": section_level,
                        "section_title": section_title,
                        "source_ids": [layout.get("uniqueId")],
                        "text": table_text,
                    }
                )
                block_id += 1
            continue

        if is_figure(layout):
            block_id = flush_text_block(pending_text_blocks, semantic_blocks, block_id)
            pending_text_blocks = []
            if text:
                semantic_blocks.append(
                    {
                        "semantic_id": f"semantic-{block_id}",
                        "block_type": "figure",
                        "page_start": page,
                        "page_end": page,
                        "section_path": section_path,
                        "section_level": section_level,
                        "section_title": section_title,
                        "source_ids": [layout.get("uniqueId")],
                        "text": text,
                    }
                )
                block_id += 1
            continue

        if is_text(layout) and text:
            pending_text_blocks.append(
                {
                    "page": page,
                    "text": text,
                    "unique_id": layout.get("uniqueId"),
                    "section_path": section_path,
                    "section_level": section_level,
                    "section_title": section_title,
                }
            )

    flush_text_block(pending_text_blocks, semantic_blocks, block_id)
    return semantic_blocks


# ===================== Retrieval layer: vector chunks =====================
def split_text_with_overlap(text: str, max_chars: int, overlap_chars: int) -> List[str]:
    # An overlap at or above max_chars would keep `start` from advancing and loop forever.
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must satisfy 0 <= overlap_chars < max_chars")

    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []

    parts = []
    start = 0
    while start < len(text):
        end = min(len(text), start + max_chars)
        parts.append(text[start:end].strip())
        if end >= len(text):
            break
        start = max(0, end - overlap_chars)
    return [part for part in parts if part]


def build_vector_chunks(
    semantic_blocks: List[Dict],
    doc_id: str,
    doc_title: str,
    max_chars: int,
    overlap_chars: int,
) -> List[Dict]:
    vector_chunks = []
    chunk_index = 1

    for block in semantic_blocks:
        pieces = split_text_with_overlap(block["text"], max_chars, overlap_chars)
        for piece_index, piece in enumerate(pieces, start=1):
            # Prefix each piece with the standard title ("标准") and section path ("章节") before embedding.
            if block["section_path"]:
                header = f"标准:{doc_title}\n章节:{' > '.join(block['section_path'])}\n\n"
            else:
                header = f"标准:{doc_title}\n\n"
            vector_chunks.append(
                {
                    "doc_id": doc_id,
                    "doc_title": doc_title,
                    "chunk_id": f"chunk-{chunk_index}",
                    "chunk_index": chunk_index,
                    "semantic_id": block["semantic_id"],
                    "chunk_type": block["block_type"],
                    "piece_index": piece_index,
                    "page_start": block["page_start"],
                    "page_end": block["page_end"],
                    "section_path": block["section_path"],
                    "section_level": block["section_level"],
                    "section_title": block["section_title"],
                    "source_ids": block["source_ids"],
                    "text": piece,
                    "embedding_text": header + piece,
                }
            )
            chunk_index += 1

    return vector_chunks


# ===================== Main conversion function =====================
def convert_layouts(
    layouts: List[Dict],
    doc_id: str,
    doc_title: str,
    max_chars: int,
    overlap_chars: int,
) -> Dict:
    structure_nodes = build_structure_nodes(layouts)
    semantic_blocks = build_semantic_blocks(layouts)
    vector_chunks = build_vector_chunks(
        semantic_blocks,
        doc_id=doc_id,
        doc_title=doc_title,
        max_chars=max_chars,
        overlap_chars=overlap_chars,
    )
    return {
        "doc_id": doc_id,
        "doc_title": doc_title,
        "structure_nodes": structure_nodes,
        "semantic_blocks": semantic_blocks,
        "vector_chunks": vector_chunks,
    }


# ===================== CLI entrypoint =====================
def main() -> None:
    parser = argparse.ArgumentParser(
        description="Parse a PDF with Alibaba Cloud Document Mind and emit three layers of chunks"
    )
    parser.add_argument("pdf_path", help="Path to the PDF file")
    parser.add_argument("--out", default="vector_chunks.json", help="Output JSON file path")
    parser.add_argument("--layouts-out", dest="layouts_output", help="Write the raw layouts JSON to this path")
    parser.add_argument("--doc-id", default="GB14747-2006", help="Document ID")
    parser.add_argument("--doc-title", default="GB 14747—2006 儿童三轮车安全要求", help="Document title")
    parser.add_argument("--max-chars", type=int, default=MAX_CHARS, help="Maximum characters per retrieval chunk")
    parser.add_argument("--overlap-chars", type=int, default=OVERLAP_CHARS, help="Characters shared by adjacent retrieval chunks")
    parser.add_argument("--poll-interval", type=int, default=5, help="Polling interval in seconds")
    args = parser.parse_args()

    pdf_path = Path(args.pdf_path).expanduser().resolve()
    if not pdf_path.exists():
        raise FileNotFoundError(f"PDF file not found: {pdf_path}")

    # 1. Submit the Alibaba Cloud job
    client = init_client()
    print(f"Submitting job: {pdf_path}")
    task_id = submit_job(client, str(pdf_path))
    print(f"Task ID: {task_id}")

    # 2. Wait for completion
    print("Waiting for the job to finish...")
    if not wait_for_completion(client, task_id, args.poll_interval):
        print("Job failed, exiting")
        # Exit with a non-zero status so callers can detect the failure.
        raise SystemExit(1)

    # 3. Fetch layouts
    print("Fetching parsing results...")
    layouts = collect_all_results(client, task_id)
    print(f"Fetched {len(layouts)} layout blocks")

    # 4. Write raw layouts (optional)
    if args.layouts_output:
        layouts_path = Path(args.layouts_output).expanduser().resolve()
        layouts_path.write_text(json.dumps(layouts, ensure_ascii=False, indent=2), encoding="utf-8")
        print(f"Raw layouts written to: {layouts_path}")

    # 5. Convert to the three-layer structure
    print("Converting to the three-layer structure...")
    data = convert_layouts(
        layouts,
        doc_id=args.doc_id,
        doc_title=args.doc_title,
        max_chars=args.max_chars,
        overlap_chars=args.overlap_chars,
    )

    # 6. Write the result
    output_path = Path(args.out).expanduser().resolve()
    output_path.write_text(json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8")

    print(f"Structure-layer nodes: {len(data['structure_nodes'])}")
    print(f"Semantic-layer blocks: {len(data['semantic_blocks'])}")
    print(f"Retrieval-layer chunks: {len(data['vector_chunks'])}")
    print(f"Output file: {output_path}")


if __name__ == "__main__":
    main()
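The overlap splitting in `parse_pdf.py` is easiest to reason about as a sliding window that advances by `max_chars - overlap_chars` characters. The sketch below is a simplified stand-alone version for experimentation, not the file's exact function: `split_with_overlap` omits the per-piece `strip()`, and `expected_piece_count` is a hypothetical helper that predicts how many pieces a text of a given length yields.

```python
from typing import List


def split_with_overlap(text: str, max_chars: int, overlap_chars: int) -> List[str]:
    """Sliding-window split; consecutive pieces share overlap_chars characters."""
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must satisfy 0 <= overlap_chars < max_chars")
    text = text.strip()
    step = max_chars - overlap_chars  # how far the window advances each time
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the window already reached the end of the text
    return pieces


def expected_piece_count(length: int, max_chars: int, overlap_chars: int) -> int:
    """Number of pieces produced for a text of `length` characters."""
    if length == 0:
        return 0
    if length <= max_chars:
        return 1
    step = max_chars - overlap_chars
    # One full window plus ceil(remaining / step) further windows.
    return 1 + -(-(length - max_chars) // step)
```

With the defaults in the file (`MAX_CHARS = 600`, `OVERLAP_CHARS = 80`), the window advances 520 characters per piece, so a 1500-character block is covered by windows starting at offsets 0, 520, and 1040.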
backend/aliyun_parser/rebuild_milvus_collection.py
Normal file
@@ -0,0 +1,115 @@
"""Rebuild the migrated Milvus collection from saved vector chunks."""

from __future__ import annotations

import argparse
import json
from pathlib import Path

from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections, utility


DEFAULT_COLLECTION = "regulations_dense_1024_v2"
DEFAULT_DIM = 1024


def build_collection(name: str, dim: int) -> Collection:
    """Create the migrated Milvus collection from scratch."""
    if utility.has_collection(name):
        utility.drop_collection(name)

    schema = CollectionSchema(
        fields=[
            FieldSchema(name="id", dtype=DataType.VARCHAR, max_length=128, is_primary=True, auto_id=False),
            FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=64),
            FieldSchema(name="doc_title", dtype=DataType.VARCHAR, max_length=256),
            FieldSchema(name="chunk_id", dtype=DataType.VARCHAR, max_length=128),
            FieldSchema(name="chunk_index", dtype=DataType.INT64),
            FieldSchema(name="piece_index", dtype=DataType.INT64),
            FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
            FieldSchema(name="embedding_text", dtype=DataType.VARCHAR, max_length=65535),
            FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim),
            FieldSchema(name="semantic_id", dtype=DataType.VARCHAR, max_length=128),
            FieldSchema(name="chunk_type", dtype=DataType.VARCHAR, max_length=64),
            FieldSchema(name="page_start", dtype=DataType.INT64),
            FieldSchema(name="page_end", dtype=DataType.INT64),
            FieldSchema(name="section_level", dtype=DataType.INT64),
            FieldSchema(name="source_ids", dtype=DataType.VARCHAR, max_length=4096),
            FieldSchema(name="section_path", dtype=DataType.VARCHAR, max_length=4096),
            FieldSchema(name="section_title", dtype=DataType.VARCHAR, max_length=512),
            FieldSchema(name="metadata_json", dtype=DataType.VARCHAR, max_length=65535),
            FieldSchema(name="created_at", dtype=DataType.INT64),
        ],
        description="Dense-only regulations index",
        enable_dynamic_field=False,
    )
    collection = Collection(name=name, schema=schema)
    collection.create_index(
        field_name="embedding",
        index_params={
            "metric_type": "COSINE",
            "index_type": "IVF_FLAT",
            "params": {"nlist": 128},
        },
    )
    return collection


def load_chunks(payload_path: Path) -> list[dict]:
    """Load vector chunks emitted by the Aliyun parser pipeline."""
    payload = json.loads(payload_path.read_text(encoding="utf-8"))
    if isinstance(payload, dict):
        chunks = payload.get("vector_chunks", [])
    else:
        chunks = payload
    if not isinstance(chunks, list):
        raise ValueError("vector chunk payload must be a list or a dict containing vector_chunks")
    return chunks


def main() -> None:
    """Rebuild the target collection from a vector chunk payload."""
    parser = argparse.ArgumentParser(description="Rebuild the migrated Milvus collection.")
    parser.add_argument("--host", default="127.0.0.1", help="Milvus host")
    parser.add_argument("--port", default="19530", help="Milvus port")
    parser.add_argument("--collection", default=DEFAULT_COLLECTION, help="Milvus collection name")
    parser.add_argument("--dim", type=int, default=DEFAULT_DIM, help="Embedding dimension")
    parser.add_argument("--payload", required=True, help="Path to vector_chunks.json or a compatible JSON file")
    args = parser.parse_args()

    connections.connect("default", host=args.host, port=args.port)
    collection = build_collection(args.collection, args.dim)
    chunks = load_chunks(Path(args.payload))
    if not chunks:
        print("No vector chunks found; collection was created but remains empty.")
        return

    data = [
        [chunk["chunk_id"] for chunk in chunks],
        [chunk["doc_id"] for chunk in chunks],
        [chunk["doc_title"] for chunk in chunks],
        [chunk["chunk_id"] for chunk in chunks],
        [int(chunk.get("chunk_index", 0) or 0) for chunk in chunks],
        [int(chunk.get("piece_index", 0) or 0) for chunk in chunks],
        [str(chunk.get("text", ""))[:65535] for chunk in chunks],
        [str(chunk.get("embedding_text", chunk.get("text", "")))[:65535] for chunk in chunks],
        [chunk["embedding"] for chunk in chunks],
        [str(chunk.get("semantic_id", "")) for chunk in chunks],
        [str(chunk.get("chunk_type", "")) for chunk in chunks],
        [int(chunk.get("page_start", 0) or 0) for chunk in chunks],
        [int(chunk.get("page_end", 0) or 0) for chunk in chunks],
        [int(chunk.get("section_level", 0) or 0) for chunk in chunks],
        [json.dumps(chunk.get("source_ids", []), ensure_ascii=False) for chunk in chunks],
        [json.dumps(chunk.get("section_path", []), ensure_ascii=False) for chunk in chunks],
        [str(chunk.get("section_title", "")) for chunk in chunks],
        [json.dumps(chunk, ensure_ascii=False) for chunk in chunks],
        [int(chunk.get("created_at", 0) or 0) for chunk in chunks],
    ]
    collection.insert(data)
    collection.flush()
    collection.load()
    print(f"Rebuilt collection {args.collection} with {len(chunks)} chunks.")


if __name__ == "__main__":
    main()
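The rebuild script reads `chunk["embedding"]` for every chunk, but the chunks written by `parse_pdf.py` in this comparison carry no `embedding` key, so the payload presumably passes through an embedding step first. A minimal pre-flight check, assuming embeddings are stored as plain lists of floats, might look like the sketch below. `find_invalid_chunks` is a hypothetical helper and not part of the script.

```python
from typing import Dict, List


def find_invalid_chunks(chunks: List[Dict], dim: int) -> List[str]:
    """Return one message per chunk whose embedding is missing or mis-sized."""
    problems = []
    for position, chunk in enumerate(chunks):
        label = chunk.get("chunk_id", f"#{position}")
        embedding = chunk.get("embedding")
        if embedding is None:
            problems.append(f"{label}: missing 'embedding'")
        elif len(embedding) != dim:
            problems.append(f"{label}: expected {dim} dims, got {len(embedding)}")
    return problems
```

Running such a check before the existing collection is dropped would let the script stop early instead of failing after the old data is gone.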
122
backend/aliyun_parser/schema.sql
Normal file
122
backend/aliyun_parser/schema.sql
Normal file
@@ -0,0 +1,122 @@
-- Database schema for the regulation document vector retrieval system
-- PostgreSQL

-- ==================== Documents ====================
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    doc_id VARCHAR(128) UNIQUE NOT NULL,    -- unique document identifier, e.g. "GB14747-2006"
    title VARCHAR(512) NOT NULL,            -- document title
    doc_type VARCHAR(32),                   -- document type: standard / regulation / specification (标准/法规/规范)
    standard_number VARCHAR(64),            -- standard number, e.g. "GB 14747-2006"
    publish_date DATE,                      -- publication date
    implement_date DATE,                    -- implementation date
    status VARCHAR(32),                     -- status: in force / repealed / revised (现行/废止/修订)
    source_url VARCHAR(512),                -- source URL
    file_path VARCHAR(512),                 -- local PDF file path
    file_size INT,                          -- file size in bytes
    upload_time TIMESTAMP DEFAULT NOW(),    -- upload time
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

COMMENT ON TABLE documents IS 'Document metadata table';
COMMENT ON COLUMN documents.doc_id IS 'Unique document identifier, used to join with Milvus and the other tables';
COMMENT ON COLUMN documents.standard_number IS 'Standard number, e.g. GB 14747-2006';

-- ==================== Section structure ====================
CREATE TABLE sections (
    id SERIAL PRIMARY KEY,
    doc_id VARCHAR(128) NOT NULL,
    unique_id VARCHAR(64) NOT NULL,         -- unique identifier returned by Alibaba Cloud
    level INT NOT NULL,                     -- heading level: 1, 2, 3...
    title VARCHAR(512) NOT NULL,            -- section title
    page INT,                               -- page number
    index INT,                              -- order within the page
    parent_id INT,                          -- parent section ID (tree structure)
    created_at TIMESTAMP DEFAULT NOW(),

    CONSTRAINT fk_sections_doc_id FOREIGN KEY (doc_id) REFERENCES documents(doc_id),
    CONSTRAINT fk_sections_parent_id FOREIGN KEY (parent_id) REFERENCES sections(id),
    CONSTRAINT uq_sections_doc_unique UNIQUE (doc_id, unique_id)
);

COMMENT ON TABLE sections IS 'Section structure table, used for table-of-contents navigation';
COMMENT ON COLUMN sections.parent_id IS 'Parent section ID, used to build the tree structure';
COMMENT ON COLUMN sections.level IS 'Depth in the hierarchy; 1 is the top level';

-- ==================== Semantic blocks ====================
CREATE TABLE semantic_blocks (
    id SERIAL PRIMARY KEY,
    doc_id VARCHAR(128) NOT NULL,
    semantic_id VARCHAR(64) NOT NULL,       -- unique identifier of the semantic block
    block_type VARCHAR(32) NOT NULL,        -- type: section_text/table/figure
    page_start INT NOT NULL,                -- first page
    page_end INT NOT NULL,                  -- last page
    section_id INT,                         -- owning section
    section_title VARCHAR(512),             -- section title (denormalized for easier queries)
    section_level INT,                      -- section level
    source_ids JSONB,                       -- original layout IDs (JSON array)
    text TEXT NOT NULL,                     -- full content (not split)
    created_at TIMESTAMP DEFAULT NOW(),

    CONSTRAINT fk_semantic_blocks_doc_id FOREIGN KEY (doc_id) REFERENCES documents(doc_id),
    CONSTRAINT fk_semantic_blocks_section_id FOREIGN KEY (section_id) REFERENCES sections(id),
    CONSTRAINT uq_semantic_blocks_doc_semantic UNIQUE (doc_id, semantic_id)
);

COMMENT ON TABLE semantic_blocks IS 'Semantic block table, used for neighborhood expansion to restore full content';
COMMENT ON COLUMN semantic_blocks.block_type IS 'Type: section_text (body text), table, figure';
COMMENT ON COLUMN semantic_blocks.source_ids IS 'Array of uniqueId values from the original Alibaba Cloud layouts';
COMMENT ON COLUMN semantic_blocks.text IS 'Full semantic content, not split';

-- ==================== Vector chunk metadata ====================
CREATE TABLE vector_chunks (
    id SERIAL PRIMARY KEY,
    doc_id VARCHAR(128) NOT NULL,
    chunk_id VARCHAR(64) NOT NULL,          -- Milvus primary key
    semantic_id VARCHAR(64) NOT NULL,       -- owning semantic block
    chunk_index INT NOT NULL,               -- slice number (global)
    piece_index INT,                        -- slice number within the semantic block
    page_start INT,
    page_end INT,
    section_title VARCHAR(512),
    text VARCHAR(2048),                     -- slice text (optional, shortened version for display)
    source_ids JSONB,                       -- original layout IDs (JSON array)
    created_at TIMESTAMP DEFAULT NOW(),

    CONSTRAINT fk_vector_chunks_doc_id FOREIGN KEY (doc_id) REFERENCES documents(doc_id),
    CONSTRAINT fk_vector_chunks_semantic_id FOREIGN KEY (doc_id, semantic_id)
        REFERENCES semantic_blocks(doc_id, semantic_id),
    CONSTRAINT uq_vector_chunks_doc_chunk UNIQUE (doc_id, chunk_id)
);

COMMENT ON TABLE vector_chunks IS 'Vector chunk metadata table, used for fast join queries';
COMMENT ON COLUMN vector_chunks.chunk_id IS 'Primary key in the Milvus vector store';
COMMENT ON COLUMN vector_chunks.piece_index IS 'Slice number within the semantic block, used to reassemble slices in order';

-- ==================== Indexes ====================
CREATE INDEX idx_sections_doc_id ON sections(doc_id);
CREATE INDEX idx_sections_parent_id ON sections(parent_id);
CREATE INDEX idx_sections_level ON sections(level);

CREATE INDEX idx_semantic_blocks_doc_id ON semantic_blocks(doc_id);
CREATE INDEX idx_semantic_blocks_section_id ON semantic_blocks(section_id);
CREATE INDEX idx_semantic_blocks_block_type ON semantic_blocks(block_type);
CREATE INDEX idx_semantic_blocks_semantic_id ON semantic_blocks(semantic_id);

CREATE INDEX idx_vector_chunks_doc_id ON vector_chunks(doc_id);
CREATE INDEX idx_vector_chunks_semantic_id ON vector_chunks(semantic_id);
CREATE INDEX idx_vector_chunks_chunk_id ON vector_chunks(chunk_id);

-- ==================== Trigger: keep updated_at current ====================
CREATE OR REPLACE FUNCTION update_updated_at()
RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = NOW();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tr_documents_updated_at
    BEFORE UPDATE ON documents
    FOR EACH ROW EXECUTE FUNCTION update_updated_at();
327
backend/aliyun_parser/upload_to_milvus.py
Normal file
@@ -0,0 +1,327 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Embed vector_chunks.json and upload the result to Milvus and PostgreSQL.
Uses the OpenAI-compatible API exposed by the relay service.
"""

import argparse
import json
from pathlib import Path
from typing import List, Dict

import psycopg2
from psycopg2.extras import execute_values
from pymilvus import (
    connections,
    Collection,
    FieldSchema,
    CollectionSchema,
    DataType,
    utility,
)
from openai import OpenAI

# ===================== Configuration =====================
# Relay service
RELAY_BASE_URL = "http://6.86.80.4:30080/v1"
RELAY_API_KEY = "sk-5HeY7gfSIlyZMacfuXOf5cphpymsNqufEu1ou4U3avbULcyY"
EMBEDDING_MODEL = "text-embedding-v3"  # embedding model supported by the relay service

# Milvus
MILVUS_HOST = "localhost"
MILVUS_PORT = "19530"
COLLECTION_NAME = "regulation_chunks"

# PostgreSQL
PG_HOST = "6.86.80.10"
PG_PORT = 5432
PG_USER = "postgresql"
PG_PASSWORD = "postgresql123456"
PG_DATABASE = "postgres"


# ===================== Embedding =====================
def get_openai_client(api_key: str, base_url: str) -> OpenAI:
    """Create an OpenAI client that talks to the relay service."""
    return OpenAI(api_key=api_key, base_url=base_url)


def get_embeddings_batch(client: OpenAI, texts: List[str], batch_size: int = 10) -> List[List[float]]:
    """Fetch embeddings for the texts in batches."""
    all_embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        print(f"Embedding batch {i // batch_size + 1}/{(len(texts) - 1) // batch_size + 1}...")

        response = client.embeddings.create(
            model=EMBEDDING_MODEL,
            input=batch,
        )

        embeddings = [item.embedding for item in response.data]
        all_embeddings.extend(embeddings)

    return all_embeddings


# ===================== Milvus =====================
def init_milvus(host: str, port: str):
    connections.connect("default", host=host, port=port)
    print(f"Connected to Milvus: {host}:{port}")


def create_collection(name: str, dim: int) -> Collection:
    """Create the collection, dropping and recreating it if it already exists."""
    if utility.has_collection(name):
        print(f"Collection '{name}' already exists; dropping and recreating it")
        utility.drop_collection(name)

    fields = [
        FieldSchema(name="chunk_id", dtype=DataType.VARCHAR, max_length=64, is_primary=True),
        FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=128),
        FieldSchema(name="doc_title", dtype=DataType.VARCHAR, max_length=512),
        FieldSchema(name="chunk_index", dtype=DataType.INT64),
        FieldSchema(name="semantic_id", dtype=DataType.VARCHAR, max_length=64),
        FieldSchema(name="chunk_type", dtype=DataType.VARCHAR, max_length=32),
        FieldSchema(name="page_start", dtype=DataType.INT64),
        FieldSchema(name="page_end", dtype=DataType.INT64),
        FieldSchema(name="section_title", dtype=DataType.VARCHAR, max_length=512),
        FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=2048),
        FieldSchema(name="source_ids", dtype=DataType.VARCHAR, max_length=4096),  # JSON string
        FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=dim),
    ]

    schema = CollectionSchema(fields, description="Regulation document retrieval chunks")
    collection = Collection(name, schema)

    # Build the vector index (IVF_FLAT, suited to small and medium collections)
    index_params = {
        "metric_type": "COSINE",
        "index_type": "IVF_FLAT",
        "params": {"nlist": 128},
    }
    collection.create_index("embedding", index_params)
    print(f"Collection '{name}' created and indexed")

    return collection


def insert_chunks(collection: Collection, chunks: List[Dict], embeddings: List[List[float]]):
    """Insert chunks into Milvus. Column order must match the collection schema."""
    data = [
        [c["chunk_id"] for c in chunks],
        [c["doc_id"] for c in chunks],
        [c["doc_title"] for c in chunks],
        [c["chunk_index"] for c in chunks],
        [c["semantic_id"] for c in chunks],
        [c["chunk_type"] for c in chunks],
        [c["page_start"] for c in chunks],
        [c["page_end"] for c in chunks],
        [c["section_title"] for c in chunks],
        [c["text"] for c in chunks],
        [json.dumps(c.get("source_ids", [])) for c in chunks],  # JSON string
        embeddings,
    ]

    collection.insert(data)
    collection.flush()
    print(f"Inserted {len(chunks)} chunks")


def load_collection(collection: Collection):
    """Load the collection into memory (required before searching)."""
    collection.load()
    print("Collection loaded into memory")


# ===================== PostgreSQL =====================
def get_pg_connection(host: str, port: int, user: str, password: str, database: str):
    """Open a PostgreSQL connection."""
    conn = psycopg2.connect(
        host=host,
        port=port,
        user=user,
        password=password,
        database=database,
    )
    print(f"Connected to PostgreSQL: {host}:{port}/{database}")
    return conn


def insert_chunks_to_pg(conn, chunks: List[Dict], doc_data: Dict):
    """Insert the chunks and their related data into PostgreSQL."""
    cursor = conn.cursor()

    try:
        # 1. Upsert the document
        cursor.execute("""
            INSERT INTO documents (doc_id, title, standard_number, upload_time)
            VALUES (%s, %s, %s, NOW())
            ON CONFLICT (doc_id) DO UPDATE SET title = EXCLUDED.title, updated_at = NOW()
        """, (doc_data["doc_id"], doc_data["doc_title"], doc_data.get("standard_number")))

        # 2. Upsert the semantic blocks
        semantic_blocks = doc_data.get("semantic_blocks", [])
        if semantic_blocks:
            block_rows = [
                (
                    doc_data["doc_id"],
                    block["semantic_id"],
                    block["block_type"],
                    block["page_start"],
                    block["page_end"],
                    block.get("section_title"),
                    block.get("section_level"),
                    json.dumps(block.get("source_ids", [])),
                    block["text"],
                )
                for block in semantic_blocks
            ]
            execute_values(
                cursor,
                """
                INSERT INTO semantic_blocks
                (doc_id, semantic_id, block_type, page_start, page_end, section_title, section_level, source_ids, text)
                VALUES %s
                ON CONFLICT (doc_id, semantic_id) DO UPDATE SET text = EXCLUDED.text
                """,
                block_rows,
            )
            print(f"Inserted {len(semantic_blocks)} semantic blocks")

        # 3. Upsert the vector chunk metadata
        chunk_rows = [
            (
                doc_data["doc_id"],
                chunk["chunk_id"],
                chunk["semantic_id"],
                chunk["chunk_index"],
                chunk.get("piece_index"),
                chunk["page_start"],
                chunk["page_end"],
                chunk.get("section_title"),
                chunk["text"],
                json.dumps(chunk.get("source_ids", [])),
            )
            for chunk in chunks
        ]
        execute_values(
            cursor,
            """
            INSERT INTO vector_chunks
            (doc_id, chunk_id, semantic_id, chunk_index, piece_index, page_start, page_end, section_title, text, source_ids)
            VALUES %s
            ON CONFLICT (doc_id, chunk_id) DO UPDATE SET text = EXCLUDED.text
            """,
            chunk_rows,
        )
        print(f"Inserted metadata for {len(chunks)} vector chunks")

        conn.commit()
        print("PostgreSQL insert complete")

    except Exception:
        # Undo the partial write, then re-raise with the original traceback
        conn.rollback()
        raise
    finally:
        cursor.close()


# ===================== Main flow =====================
def load_data(file_path: Path) -> Dict:
    """Load vector_chunks.json and return the full payload."""
    data = json.loads(file_path.read_text(encoding="utf-8"))
    return data


def upload_to_milvus_and_pg(
    chunks_file: str,
    api_key: str,
    base_url: str,
    milvus_host: str,
    milvus_port: str,
    collection_name: str,
    batch_size: int,
    pg_host: str,
    pg_port: int,
    pg_user: str,
    pg_password: str,
    pg_database: str,
):
    # 1. Load the full payload
    chunks_path = Path(chunks_file).expanduser().resolve()
    if not chunks_path.exists():
        raise FileNotFoundError(f"File not found: {chunks_path}")

    data = load_data(chunks_path)
    chunks = data.get("vector_chunks", [])
    if not chunks:
        raise ValueError("vector_chunks is empty")
    print(f"Loaded {len(chunks)} chunks")

    # 2. Initialize connections
    client = get_openai_client(api_key, base_url)
    init_milvus(milvus_host, milvus_port)
    pg_conn = get_pg_connection(pg_host, pg_port, pg_user, pg_password, pg_database)

    try:
        # 3. Fetch embeddings
        texts = [c["embedding_text"] for c in chunks]
        embeddings = get_embeddings_batch(client, texts, batch_size)
        print(f"Generated {len(embeddings)} vectors")

        # 4. Read the embedding dimension from the first vector
        embedding_dim = len(embeddings[0])
        print(f"Embedding dimension: {embedding_dim}")

        # 5. Create the collection and insert into Milvus
        collection = create_collection(collection_name, embedding_dim)
        insert_chunks(collection, chunks, embeddings)
        load_collection(collection)

        # 6. Insert into PostgreSQL
        insert_chunks_to_pg(pg_conn, chunks, data)
    finally:
        # 7. Close the connection even if an earlier step fails
        pg_conn.close()

    print("Upload complete!")


# ===================== CLI =====================
def main():
    parser = argparse.ArgumentParser(description="Embed vector_chunks and upload them to Milvus and PostgreSQL")
    parser.add_argument("chunks_file", help="Path to vector_chunks.json")
    parser.add_argument("--api-key", default=RELAY_API_KEY, help="Relay service API key")
    parser.add_argument("--base-url", default=RELAY_BASE_URL, help="Relay service base URL")
    parser.add_argument("--milvus-host", default=MILVUS_HOST, help="Milvus host")
    parser.add_argument("--milvus-port", default=MILVUS_PORT, help="Milvus port")
    parser.add_argument("--collection", default=COLLECTION_NAME, help="Milvus collection name")
    parser.add_argument("--batch-size", type=int, default=10, help="Embedding batch size (the relay service allows at most 10)")
    parser.add_argument("--pg-host", default=PG_HOST, help="PostgreSQL host")
    parser.add_argument("--pg-port", type=int, default=PG_PORT, help="PostgreSQL port")
    parser.add_argument("--pg-user", default=PG_USER, help="PostgreSQL user")
    parser.add_argument("--pg-password", default=PG_PASSWORD, help="PostgreSQL password")
    parser.add_argument("--pg-database", default=PG_DATABASE, help="PostgreSQL database")
    args = parser.parse_args()

    upload_to_milvus_and_pg(
        chunks_file=args.chunks_file,
        api_key=args.api_key,
        base_url=args.base_url,
        milvus_host=args.milvus_host,
        milvus_port=args.milvus_port,
        collection_name=args.collection,
        batch_size=args.batch_size,
        pg_host=args.pg_host,
        pg_port=args.pg_port,
        pg_user=args.pg_user,
        pg_password=args.pg_password,
        pg_database=args.pg_database,
    )


if __name__ == "__main__":
    main()
5212
backend/aliyun_parser/vector_chunks.json
Normal file
File diff suppressed because it is too large
263
backend/aliyun_parser/嵌入和召回.md
Normal file
@@ -0,0 +1,263 @@
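# Document Parsing and Vector Retrieval Guide

## Related Files

- `aliyun_doc_parser.py`: calls Alibaba Cloud Document Intelligence to parse a PDF and produce the raw `layouts.json`
- `layouts_to_vector_chunks.py`: converts `layouts.json` into a three-layer structure suited to vector database ingestion
- `layouts.json`: the raw layout result returned by Alibaba Cloud
- `vector_chunks.json`: the structured output after conversion
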
## I. Structure of `layouts.json`

The top level of `layouts.json` is an array in which each element represents one layout block. Common fields:

- `type`: main type, for example `title`, `text`, `table`, `figure`
- `subType`: finer-grained semantic type, for example `doc_title`, `para_title`, `para`, `picture`, `pic_title`, `pic_caption`
- `text`: plain text of the layout block
- `markdownContent`: text with Markdown markup
- `pageNum`: page number
- `index`: order within the page
- `level`: heading level
- `uniqueId`: unique identifier of the layout block
- `blocks`: finer-grained text and style information
- `cells`: table cells, present only for the `table` type

This is not a plain OCR text stream; it is structured data that already carries layout understanding and semantic classification.

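## II. Recommended Three-Layer Conversion Structure

### 1. Structure layer `structure_nodes`

The structure layer rebuilds the document's heading tree; it is not used directly as the final vector retrieval unit.

Example:

- `1 范围` (Scope)
- `2 规范性引用文件` (Normative references)
- `3 术语和定义` (Terms and definitions)
- `3.1 儿童三轮车` (Children's tricycle)
- `3.2 轮距` (Wheel track)

Its main job is to attach a `section_path` to downstream chunks.
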
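### 2. Semantic layer `semantic_blocks`

The semantic layer holds content blocks aggregated by document meaning. There are three kinds:

- `section_text`: consecutive body text under the same section, merged into one block
- `table`: each table as its own block
- `figure`: a figure with its title and caption as its own block

This layer suits semantic understanding better than individual layouts do, and it also supports later context expansion.
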
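### 3. Retrieval layer `vector_chunks`

The retrieval layer holds the chunks that are finally written to the vector database.

Processing:

- Short blocks in `semantic_blocks` are ingested as-is
- Longer blocks are split further by `max_chars`
- Adjacent slices keep an overlap of `overlap_chars`
- Every chunk carries complete metadata for later filtering, reranking, and neighborhood expansion
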
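The splitting rule for the retrieval layer can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `layouts_to_vector_chunks.py`: the function name `split_with_overlap` is hypothetical, and it assumes slicing by character count with a fixed stride of `max_chars - overlap_chars`.

```python
def split_with_overlap(text: str, max_chars: int, overlap_chars: int) -> list[str]:
    """Split text into slices of at most max_chars that overlap by overlap_chars.

    Hypothetical helper: the real script may split on sentence or clause
    boundaries instead of raw character offsets.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must be in [0, max_chars)")
    if not text:
        return []
    if len(text) <= max_chars:
        # Short blocks are kept whole.
        return [text]

    step = max_chars - overlap_chars
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last slice already reaches the end of the text
        start += step
    return pieces
```

A fixed stride keeps the sketch easy to reason about: every slice except possibly the last has exactly `max_chars` characters, and each slice repeats the final `overlap_chars` characters of the one before it.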
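## III. What the Conversion Script Currently Does

`layouts_to_vector_chunks.py` currently implements:

1. Filtering table-of-contents noise (such as `目次`)
2. Maintaining the section path from heading levels
3. Aggregating body text into `section_text`
4. Converting each table into a separate `table` block
5. Converting figure-related content into a separate `figure` block
6. Splitting long text further into the final `vector_chunks`
7. Generating `embedding_text` for every retrieval chunk
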
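## IV. Why Not Ingest Layouts Directly

If every layout in `layouts.json` is embedded directly:

- Granularity is too fine
- Headings and body text are easily separated
- Tables lose their structural context
- Figure information cannot be expressed completely
- Retrieval hits are noisy

For standards documents, the most suitable unit is usually not the "sentence" but the "clause-level semantic block".
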
## V. Recommended Ingestion Fields

Each vector database record should store at least:

- `embedding_text`: used to generate the vector
- `text`: the original chunk text
- `chunk_id`
- `semantic_id`
- `chunk_type`: `section_text` / `table` / `figure`
- `section_path`
- `section_title`
- `section_level`
- `page_start`
- `page_end`
- `doc_id`
- `doc_title`
- `source_ids`

Of these:

- Embedded field: `embedding_text`
- Display field: `text`
- Retrieval-enhancement fields: the remaining metadata

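## VI. Recommended Retrieval Approach

Plain top-k vector search alone is not enough. The recommended approach is:

**Vector recall + metadata reranking + neighborhood expansion**

### 1. Vector recall

Embed `vector_chunks[*].embedding_text` and retrieve the top 10 to 15 results from the vector database.

At query time, the user's question can be lightly rewritten. For example:

Original question:

`儿童三轮车的定义是什么?` (What is the definition of a children's tricycle?)

Rewritten as:

`请检索 GB 14747—2006 儿童三轮车安全要求 中关于“儿童三轮车定义”的条款、术语、表格或图示说明。` (Retrieve the clauses, terms, tables, or figure descriptions in GB 14747-2006, Safety Requirements for Children's Tricycles, that concern the "definition of a children's tricycle".)

The rewritten form suits standards-document retrieval better.
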
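The rewrite shown in the example can be expressed as a simple template. The sketch below is only an illustration: the function name `build_retrieval_query` is hypothetical, the template text is copied from the example, and extracting the topic from the user's question is assumed to happen elsewhere.

```python
def build_retrieval_query(standard_number: str, doc_title: str, topic: str) -> str:
    """Wrap a topic in the retrieval-oriented template from the example above.

    Hypothetical helper: the caller supplies the topic; this function does not
    try to derive it from the raw user question.
    """
    return (
        f"请检索 {standard_number} {doc_title} "
        f"中关于“{topic}”的条款、术语、表格或图示说明。"
    )
```

Keeping the standard number and title in the query gives the embedding model the same document context that `embedding_text` is expected to carry on the chunk side.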
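### 2. Metadata reranking

After vector recall, apply lightweight rule-based reranking using the metadata.

Common rules:

- `chunk_type == section_text`: higher priority for definition-type and requirement-type questions
- `section_path` matches a query keyword: for example, when the query asks about "定义" (definition), the `术语和定义` (Terms and definitions) section comes first
- `chunk_type == table`: boosted for questions about 尺寸 / 参数 / 数值 / 对照 / 要求 (dimensions / parameters / values / comparisons / requirements)
- `chunk_type == figure`: boosted for questions about 图 / 结构 / 状态 / 示意 (figure / structure / state / schematic)
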
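One way to express these rules is an additive bonus on top of the vector score. The sketch below makes several assumptions that are not in the source: hits are dictionaries with a `score` where higher means more similar, `section_path` is already a list of titles, and the bonus weights (`TYPE_BONUS`, `PATH_BONUS`) are arbitrary illustrative values. The names `rule_bonus` and `rerank` are hypothetical.

```python
# Keyword lists follow the rules above; the weights are assumptions.
TEXT_HINTS = ("定义", "要求")
TABLE_HINTS = ("尺寸", "参数", "数值", "对照", "要求")
FIGURE_HINTS = ("图", "结构", "状态", "示意")
TYPE_BONUS = 0.05
PATH_BONUS = 0.10

HINTS_BY_TYPE = {
    "section_text": TEXT_HINTS,
    "table": TABLE_HINTS,
    "figure": FIGURE_HINTS,
}


def rule_bonus(query: str, hit: dict) -> float:
    """Return the rule-based bonus for one hit."""
    bonus = 0.0
    # Boost the chunk type when the query contains one of its hint words.
    type_hints = HINTS_BY_TYPE.get(hit.get("chunk_type"), ())
    if any(word in query for word in type_hints):
        bonus += TYPE_BONUS
    # Boost the "terms and definitions" section for definition questions.
    path = " ".join(hit.get("section_path", []))
    if "定义" in query and "术语和定义" in path:
        bonus += PATH_BONUS
    return bonus


def rerank(query: str, hits: list[dict]) -> list[dict]:
    """Sort hits by vector score plus rule bonus, best first."""
    return sorted(hits, key=lambda hit: hit["score"] + rule_bonus(query, hit), reverse=True)
```

An additive bonus keeps the vector score as the main signal, so the rules only reorder candidates whose scores are already close.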
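### 3. Neighborhood expansion

A hit is a final slice, but an answer usually needs fuller context.

After a `vector_chunk` is hit:

1. First pull back all chunks under the same `semantic_id`
2. If that is still not enough, add content from the same `section_path`, adjacent pages, or adjacent `chunk_index` values

This restores the complete clause instead of giving the model only a small fragment.
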
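The two expansion steps can be sketched over an in-memory list of chunks. Everything here is an assumption made for illustration: the name `expand_hit` is hypothetical, "not enough" is approximated by a `min_chars` threshold, and only the adjacent `chunk_index` fallback is shown. In a real deployment the lookups would more likely be queries against the metadata store than list scans.

```python
def expand_hit(hit: dict, all_chunks: list[dict], min_chars: int = 200, window: int = 1) -> list[dict]:
    """Return the hit's semantic block, widened to neighbors when it is short.

    Hypothetical helper: min_chars and window are illustrative defaults.
    """
    same_doc = [c for c in all_chunks if c["doc_id"] == hit["doc_id"]]

    # Step 1: every chunk that shares the hit's semantic_id.
    block = [c for c in same_doc if c["semantic_id"] == hit["semantic_id"]]
    selected = {c["chunk_id"]: c for c in block}

    # Step 2: if the block is still short, add chunks with adjacent chunk_index.
    if sum(len(c["text"]) for c in block) < min_chars:
        indexes = [c["chunk_index"] for c in block] or [hit["chunk_index"]]
        low, high = min(indexes) - window, max(indexes) + window
        for c in same_doc:
            if low <= c["chunk_index"] <= high:
                selected.setdefault(c["chunk_id"], c)

    # Reading order follows the global chunk_index.
    return sorted(selected.values(), key=lambda c: c["chunk_index"])
```

The function returns chunks rather than joined text because adjacent slices overlap by `overlap_chars`; concatenating them naively would duplicate the overlapping characters.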
## VII. Retrieval Focus by Question Type

### 1. Definition questions

Examples:

- `儿童三轮车的定义是什么?` (What is the definition of a children's tricycle?)
- `轮距是什么意思?` (What does wheel track mean?)

Prioritize:

- `section_text`
- content whose `section_path` contains `术语和定义` (Terms and definitions)

### 2. Requirement questions

Examples:

- `外露突出物有什么要求?` (What are the requirements for exposed protrusions?)
- `辅助推杆有哪些安全要求?` (What safety requirements apply to the auxiliary push bar?)

Prioritize:

- `section_text`
- `table`

### 3. Value / dimension / comparison questions

Examples:

- `鞍座到脚蹬距离要求是什么?` (What is the required saddle-to-pedal distance?)
- `哪些项目需要满足规定尺寸?` (Which items must meet the specified dimensions?)

Prioritize:

- `table`
- `section_text`

### 4. Figure explanation questions

Examples:

- `正常乘骑状态是什么意思?` (What does "normal riding state" mean?)
- `图1表示什么?` (What does Figure 1 show?)

Prioritize:

- `figure`
- adjacent `section_text` in the same section

## VIII. Recommended Final Retrieval Pipeline

Use the following fixed pipeline:

1. Run embedding retrieval against `vector_chunks.embedding_text`
2. Take the top 10 to 15 candidates
3. Rerank by rules on `chunk_type + section_path`
4. Pull back the complete semantic block around each hit's `semantic_id`
5. Select 3 to 5 groups of context to give the LLM for answering

## IX. Organizing Context for the LLM

Do not pass the raw JSON straight to the model. Organize it in a format like the following:

```text
[Hit 1]
Section: 3 术语和定义 > 3.1 儿童三轮车
Pages: 1-2
Type: section_text
Content:
......

[Hit 2]
Section: 4 要求 > 4.3 外露突出物
Pages: 5
Type: section_text
Content:
......

[Hit 3]
Section: 5 试验方法
Pages: 8
Type: table
Content:
......
```

This format helps the model answer consistently and cite its sources.

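A small formatter can produce this layout from expanded hits. The sketch below is an assumption-laden illustration: the name `format_context` is hypothetical, the labels are rendered in English, and each hit is assumed to be a dictionary carrying `section_path` (a list), `page_start`, `page_end`, `chunk_type`, and `text`.

```python
def format_context(hits: list[dict]) -> str:
    """Render hits as labeled text blocks separated by blank lines."""
    blocks = []
    for number, hit in enumerate(hits, start=1):
        start, end = hit["page_start"], hit["page_end"]
        # A single page is shown as "5", a range as "1-2".
        pages = str(start) if start == end else f"{start}-{end}"
        section = " > ".join(hit["section_path"])
        blocks.append(
            "\n".join(
                [
                    f"[Hit {number}]",
                    f"Section: {section}",
                    f"Pages: {pages}",
                    f"Type: {hit['chunk_type']}",
                    "Content:",
                    hit["text"],
                ]
            )
        )
    return "\n\n".join(blocks)
```

Putting the section path and page range ahead of the content gives the model what it needs to cite a clause without parsing JSON.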
## X. Conversion Commands

Generate the three-layer structure:

```bash
python3 /home/huaci/dev/ai/SuperMew/tests/layouts_to_vector_chunks.py \
  --layouts /home/huaci/dev/ai/SuperMew/tests/layouts.json \
  --out /home/huaci/dev/ai/SuperMew/tests/vector_chunks.json
```

Customize the slice size:

```bash
python3 /home/huaci/dev/ai/SuperMew/tests/layouts_to_vector_chunks.py \
  --layouts /home/huaci/dev/ai/SuperMew/tests/layouts.json \
  --out /home/huaci/dev/ai/SuperMew/tests/vector_chunks.json \
  --max-chars 500 \
  --overlap-chars 80
```
@@ -3,6 +3,7 @@
|
||||
from contextlib import asynccontextmanager
|
||||
|
||||
from fastapi import FastAPI, Request
|
||||
from fastapi.encoders import jsonable_encoder
|
||||
from fastapi.middleware.cors import CORSMiddleware
|
||||
from fastapi.responses import JSONResponse
|
||||
from loguru import logger
|
||||
@@ -11,7 +12,8 @@ from app.api.models import ErrorResponse
|
||||
from app.api.routes import api_router
|
||||
from app.config.logging import setup_logging
|
||||
from app.config.settings import settings
|
||||
from app.services.llm.llm_factory import LLMFactory
|
||||
from app.shared.bootstrap import cleanup_runtime_dependencies, preload_runtime_dependencies
|
||||
from app.shared.errors import VectorStoreSchemaError
|
||||
# Keep module behavior explicit so the backend flow stays easy to audit.
|
||||
|
||||
|
||||
@@ -24,12 +26,12 @@ async def lifespan(app: FastAPI):
|
||||
logger.info(f"启动 {settings.app_name} v{settings.app_version}")
|
||||
logger.info(f"调试模式: {settings.debug}")
|
||||
logger.info("预加载LLM客户端...")
|
||||
LLMFactory.preload_clients(["qwen", "deepseek"])
|
||||
preload_runtime_dependencies()
|
||||
|
||||
yield
|
||||
|
||||
logger.info("应用关闭,执行清理...")
|
||||
LLMFactory.cleanup()
|
||||
cleanup_runtime_dependencies()
|
||||
|
||||
|
||||
app = FastAPI(
|
||||
@@ -55,16 +57,33 @@ app.add_middleware(
|
||||
app.include_router(api_router, prefix="/api/v1")
|
||||
|
||||
|
||||
@app.exception_handler(VectorStoreSchemaError)
|
||||
async def vector_store_schema_exception_handler(request: Request, exc: VectorStoreSchemaError):
|
||||
"""Return a stable JSON response for vector store schema/runtime errors."""
|
||||
logger.error(f"向量库 schema 异常: {exc}")
|
||||
return JSONResponse(
|
||||
status_code=500,
|
||||
content=jsonable_encoder(
|
||||
ErrorResponse(
|
||||
error="VectorStoreSchemaError",
|
||||
message=str(exc),
|
||||
)
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
@app.exception_handler(Exception)
|
||||
async def global_exception_handler(request: Request, exc: Exception):
|
||||
"""Global exception handler."""
|
||||
logger.error(f"未处理的异常: {exc}")
|
||||
return JSONResponse(
|
||||
status_code=500,
|
||||
content=ErrorResponse(
|
||||
content=jsonable_encoder(
|
||||
ErrorResponse(
|
||||
error="InternalServerError",
|
||||
message=str(exc),
|
||||
).model_dump(),
|
||||
)
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
|
||||
@@ -6,6 +6,8 @@ from .documents import router as documents_router
|
||||
from .knowledge import router as knowledge_router
|
||||
from .agent import router as agent_router
|
||||
from .status import router as status_router
|
||||
from .perception import router as perception_router
|
||||
from .rag import router as rag_router
|
||||
# Keep package boundaries explicit so backend imports stay predictable.
|
||||
|
||||
|
||||
@@ -18,6 +20,8 @@ api_router.include_router(knowledge_router)
|
||||
api_router.include_router(agent_router)
|
||||
api_router.include_router(compliance_router)
|
||||
api_router.include_router(status_router)
|
||||
api_router.include_router(perception_router)
|
||||
api_router.include_router(rag_router)
|
||||
|
||||
__all__ = [
|
||||
"api_router",
|
||||
@@ -26,4 +30,6 @@ __all__ = [
|
||||
"agent_router",
|
||||
"compliance_router",
|
||||
"status_router",
|
||||
"perception_router",
|
||||
"rag_router",
|
||||
]
|
||||
|
||||
@@ -20,7 +20,7 @@ from app.api.models import (
|
||||
)
|
||||
from app.config.settings import settings
|
||||
from app.shared.async_utils import iter_in_thread
|
||||
from app.shared.bootstrap import get_agent_conversation_service, get_conversation_store
|
||||
from app.shared.bootstrap import get_agent_conversation_service, get_agent_session_service
|
||||
# Keep route handlers close to their transport-layer wiring for easier auditing.
|
||||
|
||||
|
||||
@@ -65,7 +65,7 @@ async def chat_with_session(request: ChatRequest):
|
||||
model=request.model or settings.llm_model,
|
||||
top_k=request.top_k or settings.rag_top_k,
|
||||
)
|
||||
session = get_conversation_store().get_session(session_id)
|
||||
session = get_agent_session_service().get_session(session_id)
|
||||
return ChatResponse(
|
||||
session_id=session_id,
|
||||
answer=result.answer,
|
||||
@@ -133,45 +133,52 @@ async def chat_stream(request: ChatRequest):
|
||||
@router.get("/session/{session_id}", response_model=SessionInfo)
|
||||
async def get_session_info(session_id: str):
|
||||
"""Return session info."""
|
||||
session = get_conversation_store().get_session(session_id)
|
||||
if not session:
|
||||
raise HTTPException(status_code=404, detail="会话不存在或已过期")
|
||||
try:
|
||||
session = get_agent_session_service().get_session(session_id)
|
||||
return SessionInfo(
|
||||
session_id=session.session_id,
|
||||
message_count=len(session.messages),
|
||||
created_at=session.created_at,
|
||||
updated_at=session.updated_at,
|
||||
)
|
||||
except ValueError as exc:
|
||||
raise HTTPException(status_code=404, detail=str(exc))
|
||||
|
||||
|
||||
@router.get("/session/{session_id}/history")
|
||||
async def get_session_history(session_id: str, max_turns: int = 5):
|
||||
"""Return session history."""
|
||||
session = get_conversation_store().get_session(session_id)
|
||||
if not session:
|
||||
raise HTTPException(status_code=404, detail="会话不存在或已过期")
|
||||
history = [{"role": msg.role, "content": msg.content} for msg in session.messages[-(max_turns * 2):]]
|
||||
try:
|
||||
history = get_agent_session_service().get_history(session_id=session_id, max_turns=max_turns)
|
||||
return {"session_id": session_id, "history": history}
|
||||
except ValueError as exc:
|
||||
raise HTTPException(status_code=404, detail=str(exc))
|
||||
|
||||
|
||||
@router.delete("/session/{session_id}")
|
||||
async def delete_session(session_id: str):
|
||||
"""Delete session."""
|
||||
if not get_conversation_store().delete_session(session_id):
|
||||
raise HTTPException(status_code=404, detail="会话不存在")
|
||||
try:
|
||||
get_agent_session_service().delete_session(session_id)
|
||||
return {"message": "会话已删除", "session_id": session_id}
|
||||
except ValueError as exc:
|
||||
raise HTTPException(status_code=404, detail=str(exc))
|
||||
|
||||
|
||||
@router.get("/sessions", response_model=List[SessionInfo])
|
||||
async def list_sessions():
|
||||
"""List sessions."""
|
||||
return [SessionInfo(**item) for item in get_conversation_store().list_sessions()]
|
||||
return [SessionInfo(**item) for item in get_agent_session_service().list_sessions()]
|
||||
|
||||
|
||||
@router.post("/feedback")
|
||||
async def submit_feedback(request: FeedbackRequest):
|
||||
"""Submit feedback."""
|
||||
session = get_conversation_store().get_session(request.session_id)
|
||||
if not session:
|
||||
raise HTTPException(status_code=404, detail="会话不存在")
|
||||
return {"message": "反馈已提交", "session_id": request.session_id, "message_index": request.message_index}
|
||||
try:
|
||||
result = get_agent_session_service().submit_feedback(
|
||||
session_id=request.session_id,
|
||||
message_index=request.message_index,
|
||||
)
|
||||
return {"message": "反馈已提交", "session_id": result.session_id, "message_index": result.message_index}
|
||||
except ValueError as exc:
|
||||
raise HTTPException(status_code=404, detail=str(exc))
|
||||
|
||||
@@ -29,14 +29,19 @@ async def search_knowledge(request: SearchRequest):
|
||||
results=[
|
||||
SearchResultItem(
|
||||
id=index + 1,
|
||||
content=item.content,
|
||||
content=item.text,
|
||||
score=item.score,
|
||||
metadata={
|
||||
"doc_id": item.doc_id,
|
||||
"doc_name": item.doc_name,
|
||||
"doc_title": item.doc_title,
|
||||
"chunk_id": item.chunk_id,
|
||||
"chunk_type": item.chunk_type,
|
||||
"section_title": item.section_title,
|
||||
"page_number": item.page_number,
|
||||
"page_start": item.page_start,
|
||||
"page_end": item.page_end,
|
||||
"section_level": item.section_level,
|
||||
"chunk_index": item.chunk_index,
|
||||
"piece_index": item.piece_index,
|
||||
**item.metadata,
|
||||
},
|
||||
)
|
||||
|
||||
67
backend/app/api/routes/perception.py
Normal file
@@ -0,0 +1,67 @@
|
||||
"""Define API routes for perception (regulatory intelligence)."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
|
||||
from fastapi import APIRouter, Query
|
||||
from fastapi.responses import StreamingResponse
|
||||
|
||||
from app.shared.bootstrap import get_perception_service
|
||||
from app.shared.async_utils import iter_in_thread
|
||||
|
||||
router = APIRouter(prefix="/perception", tags=["智能感知"])
|
||||
|
||||
|
||||
@router.get("/stats")
|
||||
async def get_perception_stats():
|
||||
"""Return KPI statistics for the perception dashboard."""
|
||||
return get_perception_service().get_stats()
|
||||
|
||||
|
||||
@router.get("/events")
|
||||
async def list_events(
|
||||
source: str | None = Query(default=None, description="来源筛选 (MIIT/UN-ECE/ISO/国标委/EUR-Lex/IATF)"),
|
||||
impact_level: str | None = Query(default=None, description="影响等级 (high/medium/low)"),
|
||||
limit: int = Query(default=50, ge=1, le=100),
|
||||
):
|
||||
"""Return regulatory events with optional filters."""
|
||||
events = get_perception_service().list_events(
|
||||
source=source,
|
||||
impact_level=impact_level,
|
||||
limit=limit,
|
||||
)
|
||||
return {"events": events, "total": len(events)}
|
||||
|
||||
|
||||
@router.get("/events/{event_id}")
|
||||
async def get_event(event_id: str):
|
||||
"""Return a single regulatory event by ID."""
|
||||
event = get_perception_service().get_event(event_id)
|
||||
if event is None:
|
||||
from fastapi import HTTPException
|
||||
raise HTTPException(status_code=404, detail=f"Event {event_id} not found")
|
||||
return event
|
||||
|
||||
|
||||
@router.post("/events/{event_id}/analyze")
|
||||
async def analyze_event(event_id: str):
|
||||
"""Stream SSE impact analysis for a regulatory event."""
|
||||
service = get_perception_service()
|
||||
|
||||
async def event_stream():
|
||||
async for item in iter_in_thread(service.analyze_event(event_id)):
|
||||
event_name = item.get("event", "message")
|
||||
data = item.get("data", "")
|
||||
if isinstance(data, (dict, list)):
|
||||
data = json.dumps(data, ensure_ascii=False)
|
||||
yield f"event: {event_name}\ndata: {data}\n\n"
|
||||
|
||||
return StreamingResponse(
|
||||
event_stream(),
|
||||
media_type="text/event-stream",
|
||||
headers={
|
||||
"Cache-Control": "no-cache",
|
||||
"X-Accel-Buffering": "no",
|
||||
},
|
||||
)
|
||||
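The `event_stream` generator above writes each payload onto a single `data:` line. Under the Server-Sent Events format a payload that contains a newline needs one `data:` field per line; otherwise a spec-compliant client may discard the text after the first line break, and a blank line inside the payload would end the frame early. The following is a minimal sketch of a framing helper, not the code from this change: `format_sse` is a hypothetical name, and whether the frontend parser expects multi-line `data:` fields is an assumption.

```python
import json
from typing import Any


def format_sse(event: str, data: Any) -> str:
    """Serialize one Server-Sent Events frame (hypothetical helper)."""
    # Mirror the router: structured payloads are sent as JSON text.
    if isinstance(data, (dict, list)):
        data = json.dumps(data, ensure_ascii=False)
    # Emit one "data:" field per payload line so embedded newlines survive.
    lines = str(data).split("\n")
    body = "\n".join(f"data: {line}" for line in lines)
    # A blank line terminates the frame.
    return f"event: {event}\n{body}\n\n"
```

With a helper like this, the generator body could reduce to `yield format_sse(event_name, data)`; a standard `EventSource` client rejoins the `data:` lines with `\n` before delivering the message.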
@@ -50,8 +50,8 @@ async def rag_chat(request: RagChatRequest):
|
||||
{
|
||||
"id": str(s.get("chunk_id") or s.get("doc_id") or idx + 1),
|
||||
"score": s.get("score", 0),
|
||||
"preview": s.get("content", "")[:200],
|
||||
"doc_name": s.get("doc_name", ""),
|
||||
"preview": s.get("text", s.get("content", ""))[:200],
|
||||
"doc_name": s.get("doc_title", s.get("doc_name", "")),
|
||||
"clause": s.get("section_title", "法规片段"),
|
||||
"doc_id": s.get("doc_id"),
|
||||
"download_url": (
|
||||
|
||||
@@ -1,7 +1,7 @@
|
||||
"""Initialize the app.application.agent package."""
|
||||
|
||||
from .services import AgentConversationService
|
||||
from .services import AgentConversationService, AgentSessionFeedbackResult, AgentSessionService
|
||||
# Keep package boundaries explicit so backend imports stay predictable.
|
||||
|
||||
|
||||
__all__ = ["AgentConversationService"]
|
||||
__all__ = ["AgentConversationService", "AgentSessionFeedbackResult", "AgentSessionService"]
|
||||
|
||||
@@ -1,7 +1,8 @@
|
||||
"""Implement application-layer logic for services."""
|
||||
"""Implement application-layer logic for agent services."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass
|
||||
from typing import Generator
|
||||
|
||||
from app.domain.conversation import AnswerGenerator, AnswerResult, ConversationStore
|
||||
@@ -143,3 +144,48 @@ class AgentConversationService:
|
||||
)
|
||||
|
||||
return session.session_id, event_stream()
|
||||
|
||||
|
||||
@dataclass
|
||||
class AgentSessionFeedbackResult:
|
||||
"""Represent the result of storing session feedback."""
|
||||
|
||||
session_id: str
|
||||
message_index: int
|
||||
|
||||
|
||||
class AgentSessionService:
|
||||
"""Provide application-layer access to session management workflows."""
|
||||
|
||||
def __init__(self, *, conversation_store: ConversationStore) -> None:
|
||||
"""Initialize the Agent Session Service instance."""
|
||||
self.conversation_store = conversation_store
|
||||
|
||||
def get_session(self, session_id: str):
|
||||
"""Return a session by id or raise when it does not exist."""
|
||||
session = self.conversation_store.get_session(session_id)
|
||||
if not session:
|
||||
raise ValueError("会话不存在或已过期")
|
||||
return session
|
||||
|
||||
def get_history(self, *, session_id: str, max_turns: int = 5) -> list[dict[str, str]]:
|
||||
"""Return the recent conversation history for a session."""
|
||||
session = self.get_session(session_id)
|
||||
return [{"role": msg.role, "content": msg.content} for msg in (session.messages[-(max_turns * 2):] if max_turns > 0 else [])]
|
||||
|
||||
def delete_session(self, session_id: str) -> None:
|
||||
"""Delete a session or raise when it does not exist."""
|
||||
if not self.conversation_store.delete_session(session_id):
|
||||
raise ValueError("会话不存在")
|
||||
|
||||
def list_sessions(self) -> list[dict]:
|
||||
"""Return the list of visible sessions."""
|
||||
return self.conversation_store.list_sessions()
|
||||
|
||||
def submit_feedback(self, *, session_id: str, message_index: int) -> AgentSessionFeedbackResult:
|
||||
"""Validate feedback targets and return a normalized feedback result."""
|
||||
session = self.get_session(session_id)
|
||||
if message_index < 0 or message_index >= len(session.messages):
|
||||
raise ValueError("消息索引不存在")
|
||||
# Preserve the existing API behavior until a persistent feedback store is introduced.
|
||||
return AgentSessionFeedbackResult(session_id=session_id, message_index=message_index)
|
||||
|
||||
@@ -7,16 +7,22 @@ import tempfile
|
||||
import uuid
|
||||
import json
|
||||
from dataclasses import dataclass
|
||||
from datetime import UTC, datetime
|
||||
|
||||
from loguru import logger
|
||||
from app.config.settings import settings
|
||||
|
||||
from app.domain.documents import (
|
||||
ChunkBuilder,
|
||||
Document,
|
||||
DocumentArtifact,
|
||||
DocumentBinaryStore,
|
||||
DocumentParser,
|
||||
DocumentProcessingRun,
|
||||
DocumentProcessingStore,
|
||||
DocumentRepository,
|
||||
DocumentStatus,
|
||||
DocumentStatusEvent,
|
||||
ParseArtifactStore,
|
||||
ParsedDocument,
|
||||
)
|
||||
@@ -39,6 +45,7 @@ class DocumentProcessResult:
|
||||
|
||||
class DocumentCommandService:
|
||||
"""Coordinate document upload, processing, deletion, and retry commands."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
@@ -49,6 +56,7 @@ class DocumentCommandService:
|
||||
embedding_provider: EmbeddingProvider,
|
||||
vector_index: VectorIndex,
|
||||
parse_artifact_store: ParseArtifactStore | None = None,
|
||||
document_processing_store: DocumentProcessingStore | None = None,
|
||||
) -> None:
|
||||
"""Initialize the Document Command Service instance."""
|
||||
self.document_repository = document_repository
|
||||
@@ -58,6 +66,11 @@ class DocumentCommandService:
|
||||
self.embedding_provider = embedding_provider
|
||||
self.vector_index = vector_index
|
||||
self.parse_artifact_store = parse_artifact_store
|
||||
self.document_processing_store = document_processing_store
|
||||
|
||||
def _utcnow(self) -> datetime:
|
||||
"""Return the current UTC timestamp for persisted processing metadata."""
|
||||
return datetime.now(UTC)
|
||||
|
||||
def _save_parse_artifacts(self, *, doc_id: str, parsed_document: ParsedDocument) -> dict[str, str]:
|
||||
"""Persist parse artifacts so troubleshooting does not depend on provider retention windows."""
|
||||
@@ -80,6 +93,143 @@ class DocumentCommandService:
|
||||
artifact_keys[name] = object_name
|
||||
return artifact_keys
|
||||
|
||||
def _safe_create_processing_run(self, *, doc_id: str, trigger_type: str, generate_summary: bool) -> str | None:
|
||||
"""Create a processing run record when the optional store is available."""
|
||||
if not self.document_processing_store:
|
||||
return None
|
||||
run = DocumentProcessingRun(
|
||||
run_id=str(uuid.uuid4()),
|
||||
doc_id=doc_id,
|
||||
trigger_type=trigger_type,
|
||||
run_status="running",
|
||||
parser_backend=settings.parser_backend,
|
||||
chunk_backend=settings.chunk_backend,
|
||||
embedding_model=settings.embedding_model,
|
||||
metadata={"generate_summary": generate_summary},
|
||||
)
|
||||
try:
|
||||
created = self.document_processing_store.create_run(run)
|
||||
return created.run_id
|
||||
except Exception:
|
||||
logger.warning("DocumentProcessingStore.create_run failed for doc_id={}", doc_id)
|
||||
return None
|
||||
|
||||
def _safe_append_status_event(
|
||||
self,
|
||||
*,
|
||||
doc_id: str,
|
||||
run_id: str | None,
|
||||
from_status: str,
|
||||
to_status: str,
|
||||
stage: str,
|
||||
message: str = "",
|
||||
metadata: dict | None = None,
|
||||
) -> None:
|
||||
"""Append a status event without allowing auxiliary persistence failures to abort processing."""
|
||||
if not self.document_processing_store or not run_id:
|
||||
return
|
||||
event = DocumentStatusEvent(
|
||||
event_id=str(uuid.uuid4()),
|
||||
doc_id=doc_id,
|
||||
run_id=run_id,
|
||||
from_status=from_status,
|
||||
to_status=to_status,
|
||||
stage=stage,
|
||||
message=message,
|
||||
metadata=metadata or {},
|
||||
)
|
||||
try:
|
||||
self.document_processing_store.append_status_event(event)
|
||||
except Exception:
|
||||
logger.warning(
|
||||
"DocumentProcessingStore.append_status_event failed for doc_id={}, run_id={}",
|
||||
doc_id,
|
||||
run_id,
|
||||
)
|
||||
|
||||
def _safe_mark_run_stored(self, *, doc_id: str, run_id: str | None) -> None:
|
||||
"""Mark the processing run as stored without affecting the main workflow."""
|
||||
if not self.document_processing_store or not run_id:
|
||||
return
|
||||
try:
|
||||
self.document_processing_store.mark_run_stored(run_id, stored_at=self._utcnow())
|
||||
except Exception:
|
||||
logger.warning("DocumentProcessingStore.mark_run_stored failed for doc_id={}, run_id={}", doc_id, run_id)
|
||||
|
||||
def _safe_mark_run_parsed(self, *, doc_id: str, run_id: str | None, parsed_document: ParsedDocument) -> None:
|
||||
"""Persist parse completion details without failing the document pipeline."""
|
||||
if not self.document_processing_store or not run_id:
|
||||
return
|
||||
try:
|
||||
self.document_processing_store.mark_run_parsed(
|
||||
run_id,
|
||||
parser_backend=parsed_document.parser_name,
|
||||
layout_count=int(parsed_document.metadata.get("layout_count", len(parsed_document.raw_layouts)) or 0),
|
||||
structure_node_count=len(parsed_document.structure_nodes),
|
||||
semantic_block_count=len(parsed_document.semantic_blocks),
|
||||
vector_chunk_count=len(parsed_document.vector_chunks),
|
||||
parsed_at=self._utcnow(),
|
||||
metadata={"parse_task_id": parsed_document.metadata.get("task_id", "")},
|
||||
)
|
||||
except Exception:
|
||||
logger.warning("DocumentProcessingStore.mark_run_parsed failed for doc_id={}, run_id={}", doc_id, run_id)
|
||||
|
||||
def _safe_replace_processing_artifacts(self, *, doc_id: str, run_id: str | None, artifact_keys: dict[str, str]) -> None:
|
||||
"""Store artifact references without turning persistence drift into a user-visible failure."""
|
||||
if not self.document_processing_store or not run_id:
|
||||
return
|
||||
artifacts = [
|
||||
DocumentArtifact(
|
||||
artifact_id=str(uuid.uuid4()),
|
||||
doc_id=doc_id,
|
||||
run_id=run_id,
|
||||
artifact_type=artifact_type,
|
||||
object_name=object_name,
|
||||
content_type="application/json",
|
||||
byte_size=0,
|
||||
checksum="",
|
||||
)
|
||||
for artifact_type, object_name in artifact_keys.items()
|
||||
]
|
||||
try:
|
||||
self.document_processing_store.replace_artifacts_for_run(run_id, artifacts)
|
||||
except Exception:
|
||||
logger.warning(
|
||||
"DocumentProcessingStore.replace_artifacts_for_run failed for doc_id={}, run_id={}",
|
||||
doc_id,
|
||||
run_id,
|
||||
)
|
||||
|
||||
def _safe_mark_run_indexed(self, *, doc_id: str, run_id: str | None, chunk_count: int, index_name: str) -> None:
|
||||
"""Mark the processing run as indexed without affecting the success path."""
|
||||
if not self.document_processing_store or not run_id:
|
||||
return
|
||||
now = self._utcnow()
|
||||
try:
|
||||
self.document_processing_store.mark_run_indexed(
|
||||
run_id,
|
||||
chunk_count=chunk_count,
|
||||
index_name=index_name,
|
||||
indexed_at=now,
|
||||
finished_at=now,
|
||||
)
|
||||
except Exception:
|
||||
logger.warning("DocumentProcessingStore.mark_run_indexed failed for doc_id={}, run_id={}", doc_id, run_id)
|
||||
|
||||
def _safe_mark_run_failed(self, *, doc_id: str, run_id: str | None, failure_stage: str, error_message: str) -> None:
|
||||
"""Mark the processing run as failed without masking the original error handling path."""
|
||||
if not self.document_processing_store or not run_id:
|
||||
return
|
||||
try:
|
||||
self.document_processing_store.mark_run_failed(
|
||||
run_id,
|
||||
failure_stage=failure_stage,
|
||||
error_message=error_message,
|
||||
finished_at=self._utcnow(),
|
||||
)
|
||||
except Exception:
|
||||
logger.warning("DocumentProcessingStore.mark_run_failed failed for doc_id={}, run_id={}", doc_id, run_id)
|
||||
|
||||
def upload_and_process(
|
||||
self,
|
||||
*,
|
||||
@@ -91,11 +241,15 @@ class DocumentCommandService:
|
||||
regulation_type: str,
|
||||
version: str,
|
||||
generate_summary: bool,
|
||||
trigger_type: str = "upload",
|
||||
) -> DocumentProcessResult:
|
||||
"""Store, parse, embed, and index an uploaded document, returning the processing result."""
|
||||
doc_id = doc_id or str(uuid.uuid4())[:8]
|
||||
final_doc_name = doc_name or file_name
|
||||
object_name = f"{doc_id}/{file_name}"
|
||||
run_id: str | None = None
|
||||
current_status = DocumentStatus.PENDING
|
||||
current_stage = "store"
|
||||
|
||||
document = Document(
|
||||
doc_id=doc_id,
|
||||
@@ -109,6 +263,19 @@ class DocumentCommandService:
|
||||
metadata={"generate_summary": generate_summary},
|
||||
)
|
||||
self.document_repository.create(document)
|
||||
run_id = self._safe_create_processing_run(
|
||||
doc_id=doc_id,
|
||||
trigger_type=trigger_type,
|
||||
generate_summary=generate_summary,
|
||||
)
|
||||
self._safe_append_status_event(
|
||||
doc_id=doc_id,
|
||||
run_id=run_id,
|
||||
from_status="",
|
||||
to_status=DocumentStatus.PENDING.value,
|
||||
stage="document_created",
|
||||
message="Document record created",
|
||||
)
|
||||
|
||||
temp_path = ""
|
||||
try:
|
||||
@@ -119,6 +286,17 @@ class DocumentCommandService:
|
||||
metadata={"doc_id": doc_id},
|
||||
)
|
||||
self.document_repository.update_status(doc_id, DocumentStatus.STORED)
|
||||
current_status = DocumentStatus.STORED
|
||||
current_stage = "parse"
|
||||
self._safe_mark_run_stored(doc_id=doc_id, run_id=run_id)
|
||||
self._safe_append_status_event(
|
||||
doc_id=doc_id,
|
||||
run_id=run_id,
|
||||
from_status=DocumentStatus.PENDING.value,
|
||||
to_status=DocumentStatus.STORED.value,
|
||||
stage="store",
|
||||
message="Source file stored",
|
||||
)
|
||||
|
||||
suffix = os.path.splitext(file_name)[1]
|
||||
with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as temp_file:
|
||||
@@ -130,7 +308,13 @@ class DocumentCommandService:
|
||||
doc_id=doc_id,
|
||||
doc_name=final_doc_name,
|
||||
)
|
||||
self._safe_mark_run_parsed(doc_id=doc_id, run_id=run_id, parsed_document=parsed_document)
|
||||
|
||||
artifact_keys: dict[str, str] = {}
|
||||
try:
|
||||
artifact_keys = self._save_parse_artifacts(doc_id=doc_id, parsed_document=parsed_document)
|
||||
except Exception:
|
||||
logger.warning("Parse artifact binary persistence failed for doc_id={}", doc_id)
|
||||
self.document_repository.update_status(
|
||||
doc_id,
|
||||
DocumentStatus.PARSED,
|
||||
@@ -146,6 +330,18 @@ class DocumentCommandService:
|
||||
"processing_stage": "parsed",
|
||||
},
|
||||
)
|
||||
current_status = DocumentStatus.PARSED
|
||||
current_stage = "embed"
|
||||
self._safe_replace_processing_artifacts(doc_id=doc_id, run_id=run_id, artifact_keys=artifact_keys)
|
||||
self._safe_append_status_event(
|
||||
doc_id=doc_id,
|
||||
run_id=run_id,
|
||||
from_status=DocumentStatus.STORED.value,
|
||||
to_status=DocumentStatus.PARSED.value,
|
||||
stage="parse",
|
||||
message="Document parsed",
|
||||
metadata={"artifact_count": len(artifact_keys)},
|
||||
)
|
||||
if self.parse_artifact_store:
|
||||
try:
|
||||
self.parse_artifact_store.save(
|
||||
@@ -165,6 +361,7 @@ class DocumentCommandService:
|
||||
raise ValueError("解析完成但没有生成可入库的 chunks")
|
||||
|
||||
vectors = self.embedding_provider.embed_texts([chunk.embedding_text for chunk in chunks])
|
||||
current_stage = "index"
|
||||
inserted = self.vector_index.upsert(chunks, vectors)
|
||||
if inserted != len(chunks):
|
||||
logger.warning("Milvus upsert count mismatched: inserted={}, chunks={}", inserted, len(chunks))
|
||||
@@ -182,6 +379,23 @@ class DocumentCommandService:
|
||||
"processing_stage": "indexed",
|
||||
},
|
||||
)
|
||||
current_status = DocumentStatus.INDEXED
|
||||
index_name = health.get("collection_name", "")
|
||||
self._safe_mark_run_indexed(
|
||||
doc_id=doc_id,
|
||||
run_id=run_id,
|
||||
chunk_count=len(chunks),
|
||||
index_name=index_name,
|
||||
)
|
||||
self._safe_append_status_event(
|
||||
doc_id=doc_id,
|
||||
run_id=run_id,
|
||||
from_status=DocumentStatus.PARSED.value,
|
||||
to_status=DocumentStatus.INDEXED.value,
|
||||
stage="index",
|
||||
message="Document indexed",
|
||||
metadata={"chunk_count": len(chunks), "index_name": index_name},
|
||||
)
|
||||
stored = self.document_repository.get(doc_id)
|
||||
return DocumentProcessResult(
|
||||
doc_id=doc_id,
|
||||
@@ -194,6 +408,7 @@ class DocumentCommandService:
|
||||
)
|
||||
except Exception as exc:
|
||||
logger.exception("文档处理失败: doc_id={}", doc_id)
|
||||
failure_stage = current_stage
|
||||
self.document_repository.update_status(
|
||||
doc_id,
|
||||
DocumentStatus.FAILED,
|
||||
@@ -201,8 +416,23 @@ class DocumentCommandService:
|
||||
metadata={
|
||||
"failure_reason": str(exc),
|
||||
"processing_stage": "failed",
|
||||
"failure_stage": failure_stage,
|
||||
},
|
||||
)
|
||||
self._safe_mark_run_failed(
|
||||
doc_id=doc_id,
|
||||
run_id=run_id,
|
||||
failure_stage=failure_stage,
|
||||
error_message=str(exc),
|
||||
)
|
||||
self._safe_append_status_event(
|
||||
doc_id=doc_id,
|
||||
run_id=run_id,
|
||||
from_status=current_status.value,
|
||||
to_status=DocumentStatus.FAILED.value,
|
||||
stage=failure_stage,
|
||||
message=str(exc),
|
||||
)
|
||||
return DocumentProcessResult(
|
||||
doc_id=doc_id,
|
||||
doc_name=final_doc_name,
|
||||
@@ -235,6 +465,11 @@ class DocumentCommandService:
|
||||
self.parse_artifact_store.delete(doc_id)
|
||||
except Exception:
|
||||
logger.warning("ParseArtifactStore delete failed for doc_id={}", doc_id)
|
||||
if self.document_processing_store:
|
||||
try:
|
||||
self.document_processing_store.delete_by_document(doc_id)
|
||||
except Exception:
|
||||
logger.warning("DocumentProcessingStore delete failed for doc_id={}", doc_id)
|
||||
self.document_repository.delete(doc_id)
|
||||
return True
|
||||
|
||||
@@ -253,6 +488,7 @@ class DocumentCommandService:
|
||||
regulation_type=document.regulation_type,
|
||||
version=document.version,
|
||||
generate_summary=bool(document.metadata.get("generate_summary", False)),
|
||||
trigger_type="retry",
|
||||
)
|
||||
|
||||
|
||||
@@ -272,7 +508,7 @@ class DocumentQueryService:
|
||||
"""Return documents with real-time state from Milvus as the authoritative source.
|
||||
|
||||
Algorithm:
|
||||
1. Query Milvus for all doc metadata (doc_id, doc_name, chunk_count, …).
|
||||
1. Query Milvus for all doc metadata (doc_id, doc_title, chunk_count, …).
|
||||
2. Load JSON/PG metadata records and index them by doc_id.
|
||||
3. Merge: Milvus-present docs get status=INDEXED and live chunk_count;
|
||||
metadata-only docs with status=INDEXED are demoted to FAILED.
|
||||
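The merge algorithm described in this docstring can be sketched with plain dictionaries. This is an illustration under stated assumptions rather than the repository's implementation: `merge_document_state` is a hypothetical name, documents are modelled as dicts instead of `Document` objects, and the status strings `"indexed"` and `"failed"` stand in for the `DocumentStatus` members.

```python
def merge_document_state(milvus_rows: list[dict], metadata: list[dict]) -> list[dict]:
    """Merge live Milvus rows with stored metadata, treating Milvus as authoritative."""
    # Step 2: index the metadata records by doc_id (copied so inputs stay untouched).
    meta_by_id = {doc["doc_id"]: dict(doc) for doc in metadata}
    seen: set[str] = set()
    merged: list[dict] = []

    # Step 3a: every document present in Milvus is INDEXED with a live chunk count.
    for row in milvus_rows:
        doc_id = row["doc_id"]
        seen.add(doc_id)
        doc = meta_by_id.get(doc_id)
        if doc is None:
            # Milvus-only document: synthesize a minimal record.
            doc = {"doc_id": doc_id, "doc_name": row.get("doc_title", doc_id)}
        elif not doc.get("doc_name") and row.get("doc_title"):
            # Backfill a name missing from an older metadata record.
            doc["doc_name"] = row["doc_title"]
        doc["status"] = "indexed"
        doc["chunk_count"] = row["chunk_count"]
        merged.append(doc)

    # Step 3b: metadata-only documents that claim INDEXED are demoted to FAILED.
    for doc_id, doc in meta_by_id.items():
        if doc_id in seen:
            continue
        if doc.get("status") == "indexed":
            doc["status"] = "failed"
        merged.append(doc)
    return merged
```

Metadata-only documents in any other state (for example still parsing) pass through unchanged, so only a stale INDEXED claim is demoted.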
@@ -300,8 +536,8 @@ class DocumentQueryService:
|
||||
doc.chunk_count = row["chunk_count"]
|
||||
doc.status = DocumentStatus.INDEXED
|
||||
# Backfill fields that may be missing from older JSON records.
|
||||
if not doc.doc_name and row.get("doc_name"):
|
||||
doc.doc_name = row["doc_name"]
|
||||
if not doc.doc_name and row.get("doc_title"):
|
||||
doc.doc_name = row["doc_title"]
|
||||
if not doc.regulation_type and row.get("regulation_type"):
|
||||
doc.regulation_type = row["regulation_type"]
|
||||
if not doc.version and row.get("version"):
|
||||
@@ -317,8 +553,8 @@ class DocumentQueryService:
|
||||
if doc_id not in meta_by_id:
|
||||
synthetic = Document(
|
||||
doc_id=doc_id,
|
||||
doc_name=row.get("doc_name", doc_id),
|
||||
file_name=row.get("doc_name", doc_id),
|
||||
doc_name=row.get("doc_title", doc_id),
|
||||
file_name=row.get("doc_title", doc_id),
|
||||
object_name="",
|
||||
content_type="",
|
||||
size_bytes=0,
|
||||
|
||||
@@ -29,11 +29,16 @@ def _reciprocal_rank_fusion(
|
||||
RetrievedChunk(
|
||||
chunk_id=chunk_map[ck].chunk_id,
|
||||
doc_id=chunk_map[ck].doc_id,
|
||||
doc_name=chunk_map[ck].doc_name,
|
||||
content=chunk_map[ck].content,
|
||||
doc_title=chunk_map[ck].doc_title,
|
||||
text=chunk_map[ck].text,
|
||||
score=scores[ck],
|
||||
chunk_type=chunk_map[ck].chunk_type,
|
||||
section_title=chunk_map[ck].section_title,
|
||||
page_number=chunk_map[ck].page_number,
|
||||
page_start=chunk_map[ck].page_start,
|
||||
page_end=chunk_map[ck].page_end,
|
||||
section_level=chunk_map[ck].section_level,
|
||||
chunk_index=chunk_map[ck].chunk_index,
|
||||
piece_index=chunk_map[ck].piece_index,
|
||||
metadata=chunk_map[ck].metadata,
|
||||
)
|
||||
for ck in sorted_keys
|
||||
|
||||
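For readers unfamiliar with the fusion step this hunk touches, reciprocal rank fusion scores each item by summing `1 / (k + rank)` over every ranking in which it appears. The sketch below shows the general technique only: the function name, the string keys, and the constant `k = 60` (a commonly used default) are assumptions, and the repository's `_reciprocal_rank_fusion` may differ in signature and constants.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of keys into one list sorted by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; items near the top of a list contribute more.
        for rank, key in enumerate(ranking, start=1):
            scores[key] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because only ranks enter the formula, results from retrievers with incomparable score scales (for example dense and keyword search) can be combined without normalization.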
1
backend/app/application/perception/__init__.py
Normal file
@@ -0,0 +1 @@
|
||||
"""Perception application package."""
|
||||
143
backend/app/application/perception/services.py
Normal file
@@ -0,0 +1,143 @@
|
||||
"""Perception application service — event listing and streaming impact analysis."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from typing import Generator
|
||||
|
||||
from app.application.knowledge.services import KnowledgeRetrievalService
|
||||
from app.infrastructure.perception.mock_event_store import MockEventStore
|
||||
from app.services.llm.llm_factory import get_llm_client
|
||||
from app.config.settings import settings
|
||||
|
||||
|
||||
_ANALYSIS_SYSTEM_PROMPT = (
|
||||
"你是汽车行业法规合规专家,专注于中国国家标准(GB)、国际法规(UN-ECE、ISO)"
|
||||
"及欧盟法规(EUR-Lex)的解读与合规建议。回答需专业、简洁、结构清晰。"
|
||||
)
|
||||
|
||||
|
||||
class PerceptionService:
|
||||
"""Orchestrate regulatory event queries and streaming impact analysis."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
event_store: MockEventStore,
|
||||
retrieval_service: KnowledgeRetrievalService,
|
||||
) -> None:
|
||||
self._store = event_store
|
||||
self._retrieval = retrieval_service
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Queries
|
||||
# ------------------------------------------------------------------
|
||||
|
||||
def list_events(
|
||||
self,
|
||||
*,
|
||||
source: str | None = None,
|
||||
impact_level: str | None = None,
|
||||
limit: int = 50,
|
||||
) -> list[dict]:
|
||||
return self._store.filter(source=source, impact_level=impact_level, limit=limit)
|
||||
|
||||
def get_event(self, event_id: str) -> dict | None:
|
||||
return self._store.get(event_id)
|
||||
|
||||
def get_stats(self) -> dict:
|
||||
return self._store.stats()
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Streaming analysis
|
||||
# ------------------------------------------------------------------
|
||||
|
||||
def analyze_event(self, event_id: str) -> Generator[dict, None, None]:
|
||||
"""Yield SSE-ready dicts: sources → content chunks → done."""
|
||||
event = self._store.get(event_id)
|
||||
if not event:
|
||||
yield {"event": "error", "data": f"事件 {event_id} 不存在"}
|
||||
return
|
||||
|
||||
# --- 1. RAG retrieval: find related library documents ---
|
||||
query = event["title"] + " " + " ".join(event["tags"])
|
||||
chunks: list = []
|
||||
affected_docs: list[dict] = []
|
||||
try:
|
||||
chunks = self._retrieval.retrieve(query=query, top_k=5)
|
||||
seen: set[str] = set()
|
||||
for chunk in chunks:
|
||||
if chunk.doc_id not in seen:
|
||||
seen.add(chunk.doc_id)
|
||||
affected_docs.append(
|
||||
{
|
||||
"doc_id": chunk.doc_id,
|
||||
"doc_title": chunk.doc_title,
|
||||
"score": round(float(chunk.score), 4),
|
||||
"snippet": (chunk.text or "")[:180],
|
||||
"clause": getattr(chunk, "section_title", "") or "",
|
||||
}
|
||||
)
|
||||
except Exception: # noqa: BLE001
|
||||
pass  # Retrieval is best-effort; the analysis continues with whatever sources were collected.
|
||||
|
||||
yield {"event": "sources", "data": json.dumps(affected_docs, ensure_ascii=False)}
|
||||
|
||||
# --- 2. Build context from retrieved chunks ---
|
||||
context_parts = [
|
||||
f"[文档{i}: {c.doc_title}]\n{(c.text or '')[:400]}"
|
||||
for i, c in enumerate(chunks[:5], 1)
|
||||
]
|
||||
context = "\n\n".join(context_parts) if context_parts else "(知识库中暂无相关文档)"
|
||||
|
||||
# --- 3. Build prompt ---
|
||||
effective = event.get("effective_at") or "待定"
|
||||
user_content = f"""请对以下法规动态进行专业影响分析。
|
||||
|
||||
【法规动态】
|
||||
标准编号:{event['standard_code']}
|
||||
标题:{event['title']}
|
||||
来源:{event['source_label']}
|
||||
摘要:{event['summary']}
|
||||
生效日期:{effective}
|
||||
分类:{event['category']}
|
||||
关键词:{', '.join(event['tags'])}
|
||||
|
||||
【知识库关联文档】
|
||||
{context}
|
||||
|
||||
请用 Markdown 格式,从以下四个维度进行分析:
|
||||
|
||||
## 核心变化
|
||||
列出本次法规更新最关键的 3-5 项变化(用 - 列表)
|
||||
|
||||
## 业务影响
|
||||
分析对现有产品、认证流程、技术文档的具体影响
|
||||
|
||||
## 整改建议
|
||||
给出优先级排序的行动清单(标注 🔴高 🟡中 🟢低 优先级)
|
||||
|
||||
## 时间节点
|
||||
关键合规时间表与里程碑提醒"""
|
||||
|
||||
messages = [
|
||||
{"role": "system", "content": _ANALYSIS_SYSTEM_PROMPT},
|
||||
{"role": "user", "content": user_content},
|
||||
]
|
||||
|
||||
# --- 4. Stream LLM response ---
|
||||
try:
|
||||
client = get_llm_client(
|
||||
provider=settings.llm_provider,
|
||||
model=settings.llm_model,
|
||||
)
|
||||
if hasattr(client, "stream_chat"):
|
||||
for chunk in client.stream_chat(messages):
|
||||
yield {"event": "content", "data": chunk}
|
||||
else:
|
||||
response = client.chat(messages)
|
||||
yield {"event": "content", "data": response.content or ""}
|
||||
except Exception as exc: # noqa: BLE001
|
||||
yield {"event": "error", "data": str(exc)}
|
||||
return
|
||||
|
||||
yield {"event": "done", "data": "{}"}
|
||||
@@ -33,7 +33,7 @@ class Settings(BaseSettings):
|
||||
# Keep configuration setup explicit so runtime behavior is easy to reason about.
|
||||
milvus_host: str = Field(default="6.86.80.8", description="Milvus服务地址")
|
||||
milvus_port: int = Field(default=19530, description="Milvus服务端口")
|
||||
milvus_collection: str = Field(default="regulations_dense_1024_v1", description="法规向量集合名称")
|
||||
milvus_collection: str = Field(default="regulations_dense_1024_v2", description="法规向量集合名称")
|
||||
milvus_db_name: str = Field(default="default", description="Milvus数据库名称")
|
||||
|
||||
# Keep configuration setup explicit so runtime behavior is easy to reason about.
|
||||
@@ -78,6 +78,7 @@ class Settings(BaseSettings):
|
||||
chunk_overlap: int = Field(default=50, description="分块重叠大小")
|
||||
max_file_size_mb: int = Field(default=100, description="最大文件大小(MB)")
|
||||
document_metadata_path: str = Field(default="backend/data/documents.json", description="文档元数据存储路径")
|
||||
document_processing_metadata_path: str = Field(default="backend/data/document_processing.json", description="文档处理历史存储路径")
|
||||
parser_backend: str = Field(default="aliyun", description="解析后端(local/aliyun)")
|
||||
chunk_backend: str = Field(default="aliyun", description="分块后端(local/aliyun)")
|
||||
document_repository_backend: str = Field(default="json", description="文档元数据存储后端 (json/postgres)")
|
||||
|
||||
@@ -27,7 +27,7 @@ class Settings(BaseSettings):
|
||||
# Milvus
|
||||
milvus_host: str = "6.86.80.8"
|
||||
milvus_port: int = 19530
|
||||
milvus_collection: str = "regulations_dense_1024_v1"
|
||||
milvus_collection: str = "regulations_dense_1024_v2"
|
||||
|
||||
# LLM / embedding defaults aligned with the migrated backend path.
|
||||
llm_model: str = "qwen-max"
|
||||
@@ -47,7 +47,7 @@ class Settings(BaseSettings):
|
||||
api_port: int = 8000
|
||||
|
||||
# Legacy aliases retained for old utility modules.
|
||||
regulations_collection: str = "regulations_dense_1024_v1"
|
||||
regulations_collection: str = "regulations_dense_1024_v2"
|
||||
compliance_collection: str = "compliance_cache"
|
||||
|
||||
# Preserve the legacy module API while keeping env resolution centralized at the repo root.
|
||||
|
||||
@@ -8,18 +8,91 @@ from typing import Any
|
||||
|
||||
|
||||
|
||||
@dataclass
|
||||
@dataclass(init=False)
|
||||
class AnswerSource:
|
||||
"""Represent answer source data."""
|
||||
"""Represent answer source data with legacy aliases."""
|
||||
|
||||
doc_id: str
|
||||
doc_name: str
|
||||
doc_title: str
|
||||
chunk_id: str
|
||||
chunk_type: str
|
||||
section_title: str
|
||||
page_number: int
|
||||
page_start: int
|
||||
page_end: int
|
||||
section_level: int
|
||||
chunk_index: int
|
||||
piece_index: int
|
||||
score: float
|
||||
content: str
|
||||
text: str
|
||||
metadata: dict[str, Any] = field(default_factory=dict)
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
doc_id: str,
|
||||
doc_title: str | None = None,
|
||||
chunk_id: str,
|
||||
chunk_type: str = "",
|
||||
section_title: str = "",
|
||||
page_start: int = 0,
|
||||
page_end: int = 0,
|
||||
section_level: int = 0,
|
||||
chunk_index: int = 0,
|
||||
piece_index: int = 0,
|
||||
score: float = 0.0,
|
||||
text: str | None = None,
|
||||
metadata: dict[str, Any] | None = None,
|
||||
doc_name: str | None = None,
|
||||
content: str | None = None,
|
||||
page_number: int | None = None,
|
||||
**_: Any,
|
||||
) -> None:
|
||||
"""Initialize the answer source while accepting legacy field names."""
|
||||
self.doc_id = doc_id
|
||||
self.doc_title = doc_title if doc_title is not None else (doc_name or "")
|
||||
self.chunk_id = chunk_id
|
||||
self.chunk_type = chunk_type
|
||||
self.section_title = section_title
|
||||
self.page_start = int(page_start or page_number or 0)
|
||||
self.page_end = int(page_end or self.page_start)
|
||||
self.section_level = int(section_level or 0)
|
||||
self.chunk_index = int(chunk_index or 0)
|
||||
self.piece_index = int(piece_index or 0)
|
||||
self.score = float(score)
|
||||
self.text = text if text is not None else (content or "")
|
||||
self.metadata = dict(metadata or {})
|
||||
|
||||
@property
|
||||
def doc_name(self) -> str:
|
||||
"""Return the legacy document name alias."""
|
||||
return self.doc_title
|
||||
|
||||
@doc_name.setter
|
||||
def doc_name(self, value: str) -> None:
|
||||
"""Update the legacy document name alias."""
|
||||
self.doc_title = value
|
||||
|
||||
@property
|
||||
def content(self) -> str:
|
||||
"""Return the legacy content alias."""
|
||||
return self.text
|
||||
|
||||
@content.setter
|
||||
def content(self, value: str) -> None:
|
||||
"""Update the legacy content alias."""
|
||||
self.text = value
|
||||
|
||||
@property
|
||||
def page_number(self) -> int:
|
||||
"""Return the legacy page number alias."""
|
||||
return self.page_start
|
||||
|
||||
@page_number.setter
|
||||
def page_number(self, value: int) -> None:
|
||||
"""Update the legacy page number alias."""
|
||||
self.page_start = value
|
||||
self.page_end = max(self.page_end, value)
|
||||
|
||||
|
||||
@dataclass
|
||||
class ConversationMessage:
|
||||
|
||||
@@ -1,18 +1,29 @@
|
||||
"""Initialize the app.domain.documents package."""
|
||||
|
||||
from .models import Chunk, Document, DocumentStatus, ParsedDocument
|
||||
from .ports import ChunkBuilder, DocumentBinaryStore, DocumentParser, DocumentRepository, ParseArtifactStore
|
||||
from .models import Chunk, Document, DocumentArtifact, DocumentProcessingRun, DocumentStatus, DocumentStatusEvent, ParsedDocument
|
||||
from .ports import (
|
||||
ChunkBuilder,
|
||||
DocumentBinaryStore,
|
||||
DocumentParser,
|
||||
DocumentProcessingStore,
|
||||
DocumentRepository,
|
||||
ParseArtifactStore,
|
||||
)
|
||||
# Keep package boundaries explicit so backend imports stay predictable.
|
||||
|
||||
|
||||
__all__ = [
|
||||
"Chunk",
|
||||
"Document",
|
||||
"DocumentArtifact",
|
||||
"DocumentProcessingRun",
|
||||
"DocumentStatus",
|
||||
"DocumentStatusEvent",
|
||||
"ParsedDocument",
|
||||
"ChunkBuilder",
|
||||
"DocumentBinaryStore",
|
||||
"DocumentParser",
|
||||
"DocumentProcessingStore",
|
||||
"DocumentRepository",
|
||||
"ParseArtifactStore",
|
||||
]
|
||||
|
||||
@@ -60,19 +60,171 @@ class ParsedDocument:
|
||||
metadata: dict[str, Any] = field(default_factory=dict)
|
||||
|
||||
|
||||
@dataclass
|
||||
@dataclass(init=False)
|
||||
class Chunk:
|
||||
"""Represent the Chunk type."""
|
||||
"""Represent one retrieval chunk with backward-compatible aliases."""
|
||||
|
||||
chunk_id: str
|
||||
doc_id: str
|
||||
doc_name: str
|
||||
content: str
|
||||
doc_title: str
|
||||
text: str
|
||||
embedding_text: str
|
||||
chunk_type: str = ""
|
||||
chunk_index: int = 0
|
||||
piece_index: int = 0
|
||||
page_start: int = 0
|
||||
page_end: int = 0
|
||||
section_title: str = ""
|
||||
section_path: list[str] = field(default_factory=list)
|
||||
page_number: int = 0
|
||||
section_level: int = 0
|
||||
source_ids: list[str] = field(default_factory=list)
|
||||
regulation_type: str = ""
|
||||
version: str = ""
|
||||
semantic_id: str = ""
|
||||
block_type: str = ""
|
||||
metadata: dict[str, Any] = field(default_factory=dict)
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
chunk_id: str,
|
||||
doc_id: str,
|
||||
doc_title: str | None = None,
|
||||
text: str | None = None,
|
||||
embedding_text: str = "",
|
||||
chunk_type: str = "",
|
||||
chunk_index: int = 0,
|
||||
piece_index: int = 0,
|
||||
page_start: int = 0,
|
||||
page_end: int = 0,
|
||||
section_title: str = "",
|
||||
section_path: list[str] | None = None,
|
||||
section_level: int = 0,
|
||||
source_ids: list[str] | None = None,
|
||||
regulation_type: str = "",
|
||||
version: str = "",
|
||||
semantic_id: str = "",
|
||||
metadata: dict[str, Any] | None = None,
|
||||
doc_name: str | None = None,
|
||||
content: str | None = None,
|
||||
page_number: int | None = None,
|
||||
block_type: str | None = None,
|
||||
**_: Any,
|
||||
) -> None:
|
||||
"""Initialize the chunk while accepting legacy field names."""
|
||||
self.chunk_id = chunk_id
|
||||
self.doc_id = doc_id
|
||||
self.doc_title = doc_title if doc_title is not None else (doc_name or "")
|
||||
self.text = text if text is not None else (content or "")
|
||||
self.embedding_text = embedding_text or self.text
|
||||
self.chunk_type = chunk_type or (block_type or "")
|
||||
self.chunk_index = int(chunk_index or 0)
|
||||
self.piece_index = int(piece_index or 0)
|
||||
self.page_start = int(page_start or page_number or 0)
|
||||
self.page_end = int(page_end or self.page_start)
|
||||
self.section_title = section_title
|
||||
self.section_path = list(section_path or [])
|
||||
self.section_level = int(section_level or 0)
|
||||
self.source_ids = list(source_ids or [])
|
||||
self.regulation_type = regulation_type
|
||||
self.version = version
|
||||
self.semantic_id = semantic_id
|
||||
self.metadata = dict(metadata or {})
|
||||
|
||||
@property
|
||||
def doc_name(self) -> str:
|
||||
"""Return the legacy document name alias."""
|
||||
return self.doc_title
|
||||
|
||||
@doc_name.setter
|
||||
def doc_name(self, value: str) -> None:
|
||||
"""Update the legacy document name alias."""
|
||||
self.doc_title = value
|
||||
|
||||
@property
|
||||
def content(self) -> str:
|
||||
"""Return the legacy content alias."""
|
||||
return self.text
|
||||
|
||||
@content.setter
|
||||
def content(self, value: str) -> None:
|
||||
"""Update the legacy content alias."""
|
||||
self.text = value
|
||||
|
||||
@property
|
||||
def page_number(self) -> int:
|
||||
"""Return the legacy page number alias."""
|
||||
return self.page_start
|
||||
|
||||
@page_number.setter
|
||||
def page_number(self, value: int) -> None:
|
||||
"""Update the legacy page number alias."""
|
||||
self.page_start = value
|
||||
self.page_end = max(self.page_end, value)
|
||||
|
||||
@property
|
||||
def block_type(self) -> str:
|
||||
"""Return the legacy block type alias."""
|
||||
return self.chunk_type
|
||||
|
||||
@block_type.setter
|
||||
def block_type(self, value: str) -> None:
|
||||
"""Update the legacy block type alias."""
|
||||
self.chunk_type = value
|
||||
|
||||
|
||||
@dataclass
|
||||
class DocumentProcessingRun:
|
||||
"""Represent one processing attempt for a document."""
|
||||
|
||||
run_id: str
|
||||
doc_id: str
|
||||
trigger_type: str
|
||||
run_status: str
|
||||
parser_backend: str = ""
|
||||
chunk_backend: str = ""
|
||||
embedding_model: str = ""
|
||||
index_name: str = ""
|
||||
started_at: datetime = field(default_factory=utcnow)
|
||||
stored_at: datetime | None = None
|
||||
parsed_at: datetime | None = None
|
||||
indexed_at: datetime | None = None
|
||||
finished_at: datetime | None = None
|
||||
layout_count: int = 0
|
||||
structure_node_count: int = 0
|
||||
semantic_block_count: int = 0
|
||||
vector_chunk_count: int = 0
|
||||
chunk_count: int = 0
|
||||
failure_stage: str = ""
|
||||
error_message: str = ""
|
||||
metadata: dict[str, Any] = field(default_factory=dict)
|
||||
|
||||
|
||||
@dataclass
|
||||
class DocumentStatusEvent:
|
||||
"""Represent a document lifecycle event emitted during processing."""
|
||||
|
||||
event_id: str
|
||||
doc_id: str
|
||||
run_id: str
|
||||
from_status: str
|
||||
to_status: str
|
||||
stage: str
|
||||
message: str = ""
|
||||
metadata: dict[str, Any] = field(default_factory=dict)
|
||||
occurred_at: datetime = field(default_factory=utcnow)
|
||||
|
||||
|
||||
@dataclass
|
||||
class DocumentArtifact:
|
||||
"""Represent a persisted artifact reference for one processing run."""
|
||||
|
||||
artifact_id: str
|
||||
doc_id: str
|
||||
run_id: str
|
||||
artifact_type: str
|
||||
object_name: str
|
||||
content_type: str
|
||||
byte_size: int = 0
|
||||
checksum: str = ""
|
||||
metadata: dict[str, Any] = field(default_factory=dict)
|
||||
created_at: datetime = field(default_factory=utcnow)
|
||||
|
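Review note: the constructor above accepts both the new field names and the legacy ones. The sketch below reproduces only that fallback logic so it can be checked in isolation. `MiniChunk` is a hypothetical, trimmed stand-in written for this note, not the project's `Chunk`, and it keeps just three of the aliases.

```python
from __future__ import annotations

from typing import Any


class MiniChunk:
    """Hypothetical stand-in that mirrors the alias handling shown in the diff."""

    def __init__(
        self,
        *,
        chunk_id: str,
        doc_title: str | None = None,
        text: str | None = None,
        page_start: int = 0,
        page_end: int = 0,
        doc_name: str | None = None,  # legacy alias for doc_title
        content: str | None = None,  # legacy alias for text
        page_number: int | None = None,  # legacy alias for page_start
        **_: Any,  # unknown keywords are accepted and ignored
    ) -> None:
        self.chunk_id = chunk_id
        # The new name wins whenever it is passed, even as an empty string.
        self.doc_title = doc_title if doc_title is not None else (doc_name or "")
        self.text = text if text is not None else (content or "")
        self.page_start = int(page_start or page_number or 0)
        # A missing page_end collapses to a single-page range.
        self.page_end = int(page_end or self.page_start)

    @property
    def page_number(self) -> int:
        """Read the legacy alias from the new field."""
        return self.page_start

    @page_number.setter
    def page_number(self, value: int) -> None:
        """Write through the legacy alias; page_end is only ever widened."""
        self.page_start = value
        self.page_end = max(self.page_end, value)


legacy = MiniChunk(chunk_id="c1", doc_name="GB 18384", content="body", page_number=7)
modern = MiniChunk(chunk_id="c2", doc_title="GB 18384", text="body", page_start=7, page_end=9)
```

Two behaviours may be worth confirming against callers: an explicit `doc_title=""` suppresses the `doc_name` fallback, and assigning a smaller `page_number` leaves `page_end` unchanged because the setter uses `max`.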
||||
@@ -4,7 +4,7 @@ from __future__ import annotations
|
||||
|
||||
from abc import ABC, abstractmethod
|
||||
|
||||
from .models import Chunk, Document, DocumentStatus, ParsedDocument
|
||||
from .models import Chunk, Document, DocumentArtifact, DocumentProcessingRun, DocumentStatus, DocumentStatusEvent, ParsedDocument
|
||||
# Keep domain contracts explicit so adapters can swap implementations cleanly.
|
||||
|
||||
|
||||
@@ -128,3 +128,111 @@ class ParseArtifactStore(ABC):
|
||||
def get_structure_nodes(self, doc_id: str) -> list[dict]:
|
||||
"""Return all structure nodes for a document."""
|
||||
pass
|
||||
|
||||
|
||||
class DocumentProcessingStore(ABC):
|
||||
"""Persist document processing runs, events, and artifact references."""
|
||||
|
||||
@abstractmethod
|
||||
def create_run(self, run: DocumentProcessingRun) -> DocumentProcessingRun:
|
||||
"""Create a new processing run record."""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def mark_run_stored(
|
||||
self,
|
||||
run_id: str,
|
||||
*,
|
||||
stored_at: object | None = None,
|
||||
metadata: dict | None = None,
|
||||
) -> DocumentProcessingRun | None:
|
||||
"""Mark a run as having persisted the source file."""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def mark_run_parsed(
|
||||
self,
|
||||
run_id: str,
|
||||
*,
|
||||
parser_backend: str,
|
||||
layout_count: int,
|
||||
structure_node_count: int,
|
||||
semantic_block_count: int,
|
||||
vector_chunk_count: int,
|
||||
parsed_at: object | None = None,
|
||||
metadata: dict | None = None,
|
||||
) -> DocumentProcessingRun | None:
|
||||
"""Record parse completion details for a run."""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def mark_run_indexed(
|
||||
self,
|
||||
run_id: str,
|
||||
*,
|
||||
chunk_count: int,
|
||||
index_name: str,
|
||||
indexed_at: object | None = None,
|
||||
finished_at: object | None = None,
|
||||
metadata: dict | None = None,
|
||||
) -> DocumentProcessingRun | None:
|
||||
"""Mark a run as successfully indexed."""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def mark_run_failed(
|
||||
self,
|
||||
run_id: str,
|
||||
*,
|
||||
failure_stage: str,
|
||||
error_message: str,
|
||||
finished_at: object | None = None,
|
||||
metadata: dict | None = None,
|
||||
) -> DocumentProcessingRun | None:
|
||||
"""Mark a run as failed."""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def append_status_event(self, event: DocumentStatusEvent) -> DocumentStatusEvent:
|
||||
"""Append a document status event."""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def replace_artifacts_for_run(self, run_id: str, artifacts: list[DocumentArtifact]) -> list[DocumentArtifact]:
|
||||
"""Replace all artifacts for a run with the provided list."""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def delete_by_document(self, doc_id: str) -> None:
|
||||
"""Delete all processing data for a document."""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def list_runs_by_document(self, doc_id: str) -> list[DocumentProcessingRun]:
|
||||
"""List all processing runs for a document."""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def get_run(self, run_id: str) -> DocumentProcessingRun | None:
|
||||
"""Return one processing run by identifier."""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def list_status_events_by_document(self, doc_id: str) -> list[DocumentStatusEvent]:
|
||||
"""List status events for a document."""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def list_status_events_by_run(self, run_id: str) -> list[DocumentStatusEvent]:
|
||||
"""List status events for a run."""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def list_artifacts_by_document(self, doc_id: str) -> list[DocumentArtifact]:
|
||||
"""List artifact references for a document."""
|
||||
pass
|
||||
|
||||
@abstractmethod
|
||||
def list_artifacts_by_run(self, run_id: str) -> list[DocumentArtifact]:
|
||||
"""List artifact references for a run."""
|
||||
pass
|
||||
|
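Review note: the port above describes a run lifecycle (create, stored, parsed, indexed or failed) in which every `mark_*` method returns `None` for an unknown `run_id`. A minimal in-memory sketch of that contract follows. `Run`, `InMemoryProcessingStore` and the status strings `"stored"`, `"indexed"` and `"failed"` are assumptions made for illustration; the diff does not show which status values the real adapters write.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from datetime import datetime, timezone


def _now() -> datetime:
    """Return a timezone-aware UTC timestamp."""
    return datetime.now(timezone.utc)


@dataclass
class Run:
    """Hypothetical, reduced stand-in for DocumentProcessingRun."""

    run_id: str
    doc_id: str
    run_status: str = "pending"  # assumed initial status
    stored_at: datetime | None = None
    indexed_at: datetime | None = None
    finished_at: datetime | None = None
    chunk_count: int = 0
    index_name: str = ""
    failure_stage: str = ""
    error_message: str = ""


class InMemoryProcessingStore:
    """Illustrative store covering a subset of the port's methods."""

    def __init__(self) -> None:
        self._runs: dict[str, Run] = {}

    def create_run(self, run: Run) -> Run:
        self._runs[run.run_id] = run
        return run

    def _update(self, run_id: str, **changes: object) -> Run | None:
        # Unknown runs yield None, matching the Optional return types of the port.
        run = self._runs.get(run_id)
        if run is None:
            return None
        updated = replace(run, **changes)
        self._runs[run_id] = updated
        return updated

    def mark_run_stored(self, run_id: str, *, stored_at: datetime | None = None) -> Run | None:
        return self._update(run_id, run_status="stored", stored_at=stored_at or _now())

    def mark_run_indexed(
        self, run_id: str, *, chunk_count: int, index_name: str,
        indexed_at: datetime | None = None, finished_at: datetime | None = None,
    ) -> Run | None:
        when = indexed_at or _now()
        return self._update(
            run_id, run_status="indexed", chunk_count=chunk_count,
            index_name=index_name, indexed_at=when, finished_at=finished_at or when,
        )

    def mark_run_failed(
        self, run_id: str, *, failure_stage: str, error_message: str,
        finished_at: datetime | None = None,
    ) -> Run | None:
        return self._update(
            run_id, run_status="failed", failure_stage=failure_stage,
            error_message=error_message, finished_at=finished_at or _now(),
        )

    def list_runs_by_document(self, doc_id: str) -> list[Run]:
        return [run for run in self._runs.values() if run.doc_id == doc_id]


store = InMemoryProcessingStore()
store.create_run(Run(run_id="run-1", doc_id="doc-1"))
store.mark_run_stored("run-1")
done = store.mark_run_indexed("run-1", chunk_count=12, index_name="regulations_dense_1024_v2")
missing = store.mark_run_failed("run-404", failure_stage="parse", error_message="not found")
```

A store of this shape can also serve as a test double for services that depend on the port.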
||||
@@ -16,14 +16,88 @@ class RetrievalQuery:
|
||||
filters: str | None = None
|
||||
|
||||
|
||||
@dataclass
|
||||
@dataclass(init=False)
|
||||
class RetrievedChunk:
|
||||
"""Represent the Retrieved Chunk type."""
|
||||
"""Represent the retrieved chunk payload with legacy aliases."""
|
||||
|
||||
chunk_id: str
|
||||
doc_id: str
|
||||
doc_name: str
|
||||
content: str
|
||||
doc_title: str
|
||||
text: str
|
||||
score: float
|
||||
chunk_type: str = ""
|
||||
section_title: str = ""
|
||||
page_number: int = 0
|
||||
page_start: int = 0
|
||||
page_end: int = 0
|
||||
section_level: int = 0
|
||||
chunk_index: int = 0
|
||||
piece_index: int = 0
|
||||
metadata: dict[str, Any] = field(default_factory=dict)
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
chunk_id: str,
|
||||
doc_id: str,
|
||||
doc_title: str | None = None,
|
||||
text: str | None = None,
|
||||
score: float = 0.0,
|
||||
chunk_type: str = "",
|
||||
section_title: str = "",
|
||||
page_start: int = 0,
|
||||
page_end: int = 0,
|
||||
section_level: int = 0,
|
||||
chunk_index: int = 0,
|
||||
piece_index: int = 0,
|
||||
metadata: dict[str, Any] | None = None,
|
||||
doc_name: str | None = None,
|
||||
content: str | None = None,
|
||||
page_number: int | None = None,
|
||||
block_type: str | None = None,
|
||||
**_: Any,
|
||||
) -> None:
|
||||
"""Initialize the retrieved chunk while accepting legacy field names."""
|
||||
self.chunk_id = chunk_id
|
||||
self.doc_id = doc_id
|
||||
self.doc_title = doc_title if doc_title is not None else (doc_name or "")
|
||||
self.text = text if text is not None else (content or "")
|
||||
self.score = float(score)
|
||||
self.chunk_type = chunk_type or (block_type or "")
|
||||
self.section_title = section_title
|
||||
self.page_start = int(page_start or page_number or 0)
|
||||
self.page_end = int(page_end or self.page_start)
|
||||
self.section_level = int(section_level or 0)
|
||||
self.chunk_index = int(chunk_index or 0)
|
||||
self.piece_index = int(piece_index or 0)
|
||||
self.metadata = dict(metadata or {})
|
||||
|
||||
@property
|
||||
def doc_name(self) -> str:
|
||||
"""Return the legacy document name alias."""
|
||||
return self.doc_title
|
||||
|
||||
@doc_name.setter
|
||||
def doc_name(self, value: str) -> None:
|
||||
"""Update the legacy document name alias."""
|
||||
self.doc_title = value
|
||||
|
||||
@property
|
||||
def content(self) -> str:
|
||||
"""Return the legacy content alias."""
|
||||
return self.text
|
||||
|
||||
@content.setter
|
||||
def content(self, value: str) -> None:
|
||||
"""Update the legacy content alias."""
|
||||
self.text = value
|
||||
|
||||
@property
|
||||
def page_number(self) -> int:
|
||||
"""Return the legacy page number alias."""
|
||||
return self.page_start
|
||||
|
||||
@page_number.setter
|
||||
def page_number(self, value: int) -> None:
|
||||
"""Update the legacy page number alias."""
|
||||
self.page_start = value
|
||||
self.page_end = max(self.page_end, value)
|
||||
|
||||
@@ -45,10 +45,10 @@ class OpenAICompatibleAnswerGenerator(AnswerGenerator):
|
||||
context_tokens = 0
|
||||
for idx, chunk in enumerate(retrieved_chunks, start=1):
|
||||
block = (
|
||||
f"[{idx}] 文档: {chunk.doc_name}\n"
|
||||
f"[{idx}] 文档: {chunk.doc_title}\n"
|
||||
f"章节: {chunk.section_title or '未标注'}\n"
|
||||
f"页码: {chunk.page_number}\n"
|
||||
f"内容: {chunk.content}"
|
||||
f"页码: {chunk.page_start}" + (f"-{chunk.page_end}" if chunk.page_end and chunk.page_end != chunk.page_start else "") + "\n"
|
||||
f"内容: {chunk.text}"
|
||||
)
|
||||
block_tokens = self._estimate_tokens(block)
|
||||
if context_tokens + block_tokens > settings.rag_max_context_tokens:
|
||||
@@ -67,17 +67,37 @@ class OpenAICompatibleAnswerGenerator(AnswerGenerator):
|
||||
)
|
||||
return messages, context_tokens
|
||||
|
||||
def _is_context_truncated(self, *, retrieved_chunks: list[RetrievedChunk], context_tokens: int) -> bool:
|
||||
"""Return whether the prompt context had to omit retrieved chunks to fit the token budget."""
|
||||
if not retrieved_chunks:
|
||||
return False
|
||||
estimated_total_tokens = sum(
|
||||
self._estimate_tokens(
|
||||
f"[{idx}] 文档: {chunk.doc_title}\n"
|
||||
f"章节: {chunk.section_title or '未标注'}\n"
|
||||
f"页码: {chunk.page_start}" + (f"-{chunk.page_end}" if chunk.page_end and chunk.page_end != chunk.page_start else "") + "\n"
|
||||
f"内容: {chunk.text}"
|
||||
)
|
||||
for idx, chunk in enumerate(retrieved_chunks, start=1)
|
||||
)
|
||||
return estimated_total_tokens > context_tokens
|
||||
|
||||
def _sources(self, chunks: list[RetrievedChunk]) -> list[AnswerSource]:
|
||||
"""Build the answer source list from the retrieved chunks."""
|
||||
return [
|
||||
AnswerSource(
|
||||
doc_id=chunk.doc_id,
|
||||
doc_name=chunk.doc_name,
|
||||
doc_title=chunk.doc_title,
|
||||
chunk_id=chunk.chunk_id,
|
||||
chunk_type=chunk.chunk_type,
|
||||
section_title=chunk.section_title,
|
||||
page_number=chunk.page_number,
|
||||
page_start=chunk.page_start,
|
||||
page_end=chunk.page_end,
|
||||
section_level=chunk.section_level,
|
||||
chunk_index=chunk.chunk_index,
|
||||
piece_index=chunk.piece_index,
|
||||
score=chunk.score,
|
||||
content=chunk.content,
|
||||
text=chunk.text,
|
||||
metadata=chunk.metadata,
|
||||
)
|
||||
for chunk in chunks
|
||||
@@ -111,7 +131,10 @@ class OpenAICompatibleAnswerGenerator(AnswerGenerator):
|
||||
latency_ms=latency_ms,
|
||||
retrieved_count=len(retrieved_chunks),
|
||||
context_tokens=context_tokens,
|
||||
truncated=len(retrieved_chunks) > len(messages),
|
||||
truncated=self._is_context_truncated(
|
||||
retrieved_chunks=retrieved_chunks,
|
||||
context_tokens=context_tokens,
|
||||
),
|
||||
error=response.error,
|
||||
)
|
||||
|
||||
|
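Review note: the new `truncated` flag compares the estimated token cost of all retrieved chunks with the tokens that actually went into the prompt, whereas the old expression compared the chunk count with the message count. The sketch below shows the idea in standalone form. `Hit`, `estimate_tokens`, `pack_context` and the one-token-per-character estimate are assumptions for illustration, and the loop is assumed to stop at the first block that does not fit, which the hunk does not show.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical stand-in for RetrievedChunk with only the rendered fields."""

    doc_title: str
    section_title: str
    page_start: int
    page_end: int
    text: str


def estimate_tokens(text: str) -> int:
    """Crude estimate (one token per character); the real estimator is not shown."""
    return len(text)


def render_block(idx: int, hit: Hit) -> str:
    """Render one context block with the same labels the generator uses."""
    pages = str(hit.page_start)
    if hit.page_end and hit.page_end != hit.page_start:
        pages += f"-{hit.page_end}"
    return (
        f"[{idx}] 文档: {hit.doc_title}\n"
        f"章节: {hit.section_title or '未标注'}\n"
        f"页码: {pages}\n"
        f"内容: {hit.text}"
    )


def pack_context(hits: list[Hit], budget: int) -> tuple[list[str], int]:
    """Add blocks in order until the next one would exceed the token budget."""
    blocks: list[str] = []
    used = 0
    for idx, hit in enumerate(hits, start=1):
        block = render_block(idx, hit)
        cost = estimate_tokens(block)
        if used + cost > budget:
            break
        blocks.append(block)
        used += cost
    return blocks, used


def is_truncated(hits: list[Hit], used_tokens: int) -> bool:
    """Report truncation when rendering every hit would cost more than was used."""
    if not hits:
        return False
    total = sum(estimate_tokens(render_block(i, h)) for i, h in enumerate(hits, start=1))
    return total > used_tokens


hits = [
    Hit("GB 18384-2025", "5.1", 12, 13, "a" * 40),
    Hit("GB 38031-2025", "", 4, 4, "b" * 40),
]
all_blocks, all_used = pack_context(hits, budget=10_000)
first_cost = estimate_tokens(render_block(1, hits[0]))
some_blocks, some_used = pack_context(hits, budget=first_cost)
```

The check relies on both sides using the same rendering and the same estimator; if the prompt builder and the check drift apart, the flag can report truncation that did not happen.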
||||
@@ -10,6 +10,7 @@ class LocalRegulationChunkBuilder(ChunkBuilder):
|
||||
"""Adapt the existing markdown chunker to the new chunk builder port."""
|
||||
|
||||
def __init__(self, *, chunk_size: int = 512, chunk_overlap: int = 50) -> None:
|
||||
"""Initialize the local markdown chunk builder."""
|
||||
self.chunker = RegulationChunker(
|
||||
chunk_size=chunk_size,
|
||||
chunk_overlap=chunk_overlap,
|
||||
@@ -22,6 +23,7 @@ class LocalRegulationChunkBuilder(ChunkBuilder):
|
||||
regulation_type: str,
|
||||
version: str,
|
||||
) -> list[Chunk]:
|
||||
"""Build migrated chunk objects from the legacy markdown chunker output."""
|
||||
markdown_text = parsed_document.raw_text.strip()
|
||||
if not markdown_text:
|
||||
return []
|
||||
@@ -50,16 +52,18 @@ class LocalRegulationChunkBuilder(ChunkBuilder):
|
||||
Chunk(
|
||||
chunk_id=item.metadata.chunk_id,
|
||||
doc_id=parsed_document.doc_id,
|
||||
doc_name=parsed_document.doc_name,
|
||||
content=item.content,
|
||||
doc_title=parsed_document.doc_name,
|
||||
text=item.content,
|
||||
embedding_text=item.content,
|
||||
chunk_type="local_markdown_chunk",
|
||||
section_title=item.metadata.section_title or item.metadata.section_number,
|
||||
section_path=section_path,
|
||||
page_number=item.metadata.page_number,
|
||||
page_start=item.metadata.page_number,
|
||||
page_end=item.metadata.page_number,
|
||||
section_level=len(section_path),
|
||||
regulation_type=regulation_type,
|
||||
version=version,
|
||||
semantic_id=item.metadata.clause_number,
|
||||
block_type="local_markdown_chunk",
|
||||
metadata=metadata,
|
||||
)
|
||||
)
|
||||
|
||||
@@ -19,29 +19,35 @@ class AliyunVectorChunkBuilder(ChunkBuilder):
|
||||
"""Build chunks from the vector chunk payloads produced by the Aliyun parser."""
|
||||
chunks: list[Chunk] = []
|
||||
for index, item in enumerate(parsed_document.vector_chunks):
|
||||
content = item.get("content") or item.get("text") or ""
|
||||
embedding_text = item.get("embedding_text") or content
|
||||
text = item.get("text") or ""
|
||||
embedding_text = item.get("embedding_text") or text
|
||||
if not embedding_text.strip():
|
||||
continue
|
||||
section_path = item.get("section_path") or []
|
||||
section_title = item.get("section_title") or (section_path[-1] if section_path else "")
|
||||
page_number = item.get("page_start") or item.get("page") or 0
|
||||
chunk_id = item.get("chunk_id") or f"{parsed_document.doc_id}-chunk-{index}"
|
||||
metadata = {k: v for k, v in item.items() if k not in {"content", "embedding_text"}}
|
||||
metadata = dict(item)
|
||||
metadata["regulation_type"] = regulation_type
|
||||
metadata["version"] = version
|
||||
chunks.append(
|
||||
Chunk(
|
||||
chunk_id=str(chunk_id),
|
||||
doc_id=parsed_document.doc_id,
|
||||
doc_name=parsed_document.doc_name,
|
||||
content=content,
|
||||
doc_title=str(item.get("doc_title") or parsed_document.doc_name),
|
||||
text=text,
|
||||
embedding_text=embedding_text,
|
||||
chunk_type=str(item.get("chunk_type", item.get("block_type", ""))),
|
||||
chunk_index=int(item.get("chunk_index") or 0),
|
||||
piece_index=int(item.get("piece_index") or 0),
|
||||
page_start=int(item.get("page_start") or 0),
|
||||
page_end=int(item.get("page_end") or 0),
|
||||
section_title=section_title,
|
||||
section_path=section_path,
|
||||
page_number=int(page_number or 0),
|
||||
section_level=int(item.get("section_level") or len(section_path)),
|
||||
source_ids=[str(v) for v in item.get("source_ids", [])],
|
||||
regulation_type=regulation_type,
|
||||
version=version,
|
||||
semantic_id=item.get("semantic_id", ""),
|
||||
block_type=item.get("block_type", ""),
|
||||
metadata=metadata,
|
||||
)
|
||||
)
|
||||
|
||||
1
backend/app/infrastructure/perception/__init__.py
Normal file
@@ -0,0 +1 @@
|
||||
"""Perception infrastructure package."""
|
||||
421
backend/app/infrastructure/perception/mock_event_store.py
Normal file
@@ -0,0 +1,421 @@
|
||||
"""Mock regulatory event store pre-seeded with 20 events."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import Any
|
||||
|
||||
MOCK_EVENTS: list[dict[str, Any]] = [
|
||||
# ------------------------------------------------------------------ HIGH
|
||||
{
|
||||
"id": "evt-001",
|
||||
"source": "MIIT",
|
||||
"source_label": "工业和信息化部",
|
||||
"standard_code": "GB 18384-2025",
|
||||
"title": "《电动汽车安全要求》国家标准第三版正式发布",
|
||||
"summary": (
|
||||
"新增 IP67 级别高压系统密封防护要求;热失控预警响应时间压缩至 5 分钟;"
|
||||
"调整碰撞安全测试工况,新增侧柱碰工况。本标准于 2026 年 7 月 1 日强制实施。"
|
||||
),
|
||||
"impact_level": "high",
|
||||
"published_at": "2025-11-15",
|
||||
"effective_at": "2026-07-01",
|
||||
"category": "电动汽车安全",
|
||||
"tags": ["电池安全", "高压防护", "碰撞安全", "热失控"],
|
||||
"source_url": "https://openstd.samr.gov.cn",
|
||||
"status": "enacted",
|
||||
},
|
||||
{
|
||||
"id": "evt-002",
|
||||
"source": "UN-ECE",
|
||||
"source_label": "联合国欧洲经委会",
|
||||
"standard_code": "UN R155 Amendment 3",
|
||||
"title": "UN-ECE R155 网络安全法规第三次修订正式生效",
|
||||
"summary": (
|
||||
"新增对 OTA(空中升级)全生命周期的安全审计要求;强化车辆 TARA"
|
||||
"(威胁分析与风险评估)文档化义务;扩展 CSMS 监控范围至售后服务商。"
|
||||
),
|
||||
"impact_level": "high",
|
||||
"published_at": "2026-01-20",
|
||||
"effective_at": "2026-07-01",
|
||||
"category": "网络安全",
|
||||
"tags": ["OTA", "网络安全", "CSMS", "TARA", "R155"],
|
||||
"source_url": "https://unece.org/transport/vehicle-regulations",
|
||||
"status": "enacted",
|
||||
},
|
||||
{
|
||||
"id": "evt-003",
|
||||
"source": "MIIT",
|
||||
"source_label": "工业和信息化部",
|
||||
"standard_code": "GB/T 40429-2026(征求意见稿)",
|
||||
"title": "《汽车整车信息安全技术要求》修订征求意见",
|
||||
"summary": (
|
||||
"增加基于人工智能的异常行为检测要求;新增车云通信双向认证机制规范;"
|
||||
"提出数据最小化原则在车辆 OBD 数据收集中的应用细则。"
|
||||
),
|
||||
"impact_level": "high",
|
||||
"published_at": "2026-03-05",
|
||||
"effective_at": None,
|
||||
"category": "信息安全",
|
||||
"tags": ["信息安全", "数据安全", "AI检测", "OBD"],
|
||||
"source_url": "https://www.miit.gov.cn/",
|
||||
"status": "draft",
|
||||
},
|
||||
{
|
||||
"id": "evt-004",
|
||||
"source": "MIIT",
|
||||
"source_label": "工业和信息化部",
|
||||
"standard_code": "NEV 双积分 2026",
|
||||
"title": "2026 年度新能源汽车双积分管理办法年度调整",
|
||||
"summary": (
|
||||
"纯电动乘用车标准车型积分(CAFC)基准值上调 8%;"
|
||||
"提高 A 级及以上续航里程门槛;新增氢燃料电池商用车积分计算细则。"
|
||||
),
|
||||
"impact_level": "high",
|
||||
"published_at": "2026-02-28",
|
||||
"effective_at": "2026-04-01",
|
||||
"category": "新能源政策",
|
||||
"tags": ["双积分", "纯电动", "燃料电池", "碳配额"],
|
||||
"source_url": "https://www.miit.gov.cn/",
|
||||
"status": "enacted",
|
||||
},
|
||||
{
|
||||
"id": "evt-017",
|
||||
"source": "MIIT",
|
||||
"source_label": "工业和信息化部",
|
||||
"standard_code": "智能网联汽车准入管理办法实施细则",
|
||||
"title": "智能网联汽车准入管理实施细则正式落地",
|
||||
"summary": (
|
||||
"明确 L3 及以上自动驾驶功能的准入申报路径;"
|
||||
"要求 OEM 建立数据安全管理体系并完成等保 2.0 三级认证;"
|
||||
"道路测试数据留存期延长至 3 年。"
|
||||
),
|
||||
"impact_level": "high",
|
||||
"published_at": "2026-03-01",
|
||||
"effective_at": "2026-09-01",
|
||||
"category": "智能网联",
|
||||
"tags": ["智能网联", "L3自动驾驶", "准入管理", "数据留存"],
|
||||
"source_url": "https://www.miit.gov.cn/",
|
||||
"status": "enacted",
|
||||
},
|
||||
{
|
||||
"id": "evt-018",
|
||||
"source": "EUR-Lex",
|
||||
"source_label": "欧盟官方公报",
|
||||
"standard_code": "EU Cyber Resilience Act (CRA)",
|
||||
"title": "《欧盟网络韧性法案》核心条款对车联网设备生效",
|
||||
"summary": (
|
||||
"联网汽车 ECU 须满足 CRA「重要类 II」安全要求;"
|
||||
"强制 SBOM(软件物料清单)公开披露;"
|
||||
"OEM 须提供至少 10 年的漏洞修复支持承诺。"
|
||||
),
|
||||
"impact_level": "high",
|
||||
"published_at": "2026-02-15",
|
||||
"effective_at": "2027-01-01",
|
||||
"category": "网络安全",
|
||||
"tags": ["CRA", "SBOM", "漏洞管理", "网络韧性"],
|
||||
"source_url": "https://eur-lex.europa.eu",
|
||||
"status": "enacted",
|
||||
},
|
||||
# --------------------------------------------------------------- MEDIUM
|
||||
{
|
||||
"id": "evt-005",
|
||||
"source": "UN-ECE",
|
||||
"source_label": "联合国欧洲经委会",
|
||||
"standard_code": "UN R156 Amendment 2",
|
||||
"title": "UN-ECE R156 软件升级与 SUMS 法规补充修订",
|
||||
"summary": (
|
||||
"明确 SUMS(软件更新管理系统)对 ECU 版本追溯的最低保留年限为 15 年;"
|
||||
"新增售后 OTA 推送的用户知情同意要求规范。"
|
||||
),
|
||||
"impact_level": "medium",
|
||||
"published_at": "2026-01-10",
|
||||
"effective_at": "2026-07-01",
|
||||
"category": "软件升级",
|
||||
"tags": ["OTA", "SUMS", "软件版本", "R156"],
|
||||
"source_url": "https://unece.org/transport/vehicle-regulations",
|
||||
"status": "enacted",
|
||||
},
|
||||
{
|
||||
"id": "evt-006",
|
||||
"source": "国标委",
|
||||
"source_label": "国家标准化管理委员会",
|
||||
"standard_code": "GB/T 35273-2026",
|
||||
"title": "《信息安全技术 个人信息安全规范》更新版发布",
|
||||
"summary": (
|
||||
"将车内人脸识别、声纹采集列为敏感个人信息;"
|
||||
"补充自动驾驶场景下乘员行为数据的去标识化技术规范;"
|
||||
"强化数据出境安全评估触发阈值。"
|
||||
),
|
||||
"impact_level": "medium",
|
||||
"published_at": "2025-12-01",
|
||||
"effective_at": "2026-06-01",
|
||||
"category": "数据安全",
|
||||
"tags": ["个人信息", "PIPL", "数据安全", "生物识别"],
|
||||
"source_url": "https://openstd.samr.gov.cn",
|
||||
"status": "enacted",
|
||||
},
|
||||
{
|
||||
"id": "evt-007",
|
||||
"source": "EUR-Lex",
|
||||
"source_label": "欧盟官方公报",
|
||||
"standard_code": "EU AI Act — Art.13 & Art.14",
|
||||
"title": "《欧盟人工智能法案》第13-14条透明度与人工监督条款正式生效",
|
||||
"summary": (
|
||||
"要求在汽车 ADAS 系统中植入 AI 使用记录日志;"
|
||||
"驾驶员监控 AI 系统须披露决策逻辑;"
|
||||
"高风险 AI 系统需提供人工干预接口。"
|
||||
),
|
||||
"impact_level": "medium",
|
||||
"published_at": "2026-02-01",
|
||||
"effective_at": "2026-08-01",
|
||||
"category": "AI 法规",
|
||||
"tags": ["AI法案", "透明度", "ADAS", "高风险AI"],
|
||||
"source_url": "https://eur-lex.europa.eu",
|
||||
"status": "enacted",
|
||||
},
|
||||
{
|
||||
"id": "evt-008",
|
||||
"source": "ISO",
|
||||
"source_label": "国际标准化组织",
|
||||
"standard_code": "ISO 45001:2025 Amd.1",
|
||||
"title": "ISO 45001 职业健康安全管理体系第一次修正",
|
||||
"summary": (
|
||||
"新增心理健康风险纳入 OHS 危害辨识范围;"
|
||||
"明确远程办公人员安全管理职责;"
|
||||
"更新绩效评价指标体系,新增事故未遂事件统计要求。"
|
||||
),
|
||||
"impact_level": "medium",
|
||||
"published_at": "2025-10-20",
|
||||
"effective_at": "2026-01-01",
|
||||
"category": "EHS 管理",
|
||||
"tags": ["ISO 45001", "EHS", "职业健康", "安全管理"],
|
||||
"source_url": "https://www.iso.org",
|
||||
"status": "enacted",
|
||||
},
|
||||
{
|
||||
"id": "evt-009",
|
||||
"source": "MIIT",
|
||||
"source_label": "工业和信息化部",
|
||||
"standard_code": "GB/T 28001-2026(征求意见)",
|
||||
"title": "《汽车产品安全召回管理规程》修订征求意见",
|
||||
"summary": (
|
||||
"扩展召回触发条件,将 OTA 推送导致的功能异常纳入强制报告范围;"
|
||||
"缩短重大安全隐患召回启动时限至 15 个工作日。"
|
||||
),
|
||||
"impact_level": "medium",
|
||||
"published_at": "2026-03-15",
|
||||
"effective_at": None,
|
||||
"category": "召回管理",
|
||||
"tags": ["召回", "OTA", "安全隐患", "产品安全"],
|
||||
"source_url": "https://www.miit.gov.cn/",
|
||||
"status": "draft",
|
||||
},
|
||||
{
|
||||
"id": "evt-010",
|
||||
"source": "国标委",
|
||||
"source_label": "国家标准化管理委员会",
|
||||
"standard_code": "GB 38031-2025",
|
||||
"title": "《电动汽车用动力蓄电池安全要求》修订版发布",
|
||||
"summary": (
|
||||
"新增电池系统针刺、浸水、挤压等极端工况测试程序;"
|
||||
"热扩散防护等级要求升级;"
|
||||
"强化 BMS(电池管理系统)状态监测数据记录要求。"
|
||||
),
|
||||
"impact_level": "medium",
|
||||
"published_at": "2025-09-15",
|
||||
"effective_at": "2026-03-01",
|
||||
"category": "电池安全",
|
||||
"tags": ["动力电池", "BMS", "热扩散", "安全测试"],
|
||||
"source_url": "https://openstd.samr.gov.cn",
|
||||
"status": "enacted",
|
||||
},
|
||||
{
|
||||
"id": "evt-016",
|
||||
"source": "UN-ECE",
|
||||
"source_label": "联合国欧洲经委会",
|
||||
"standard_code": "UN R100 Rev.4(草案)",
|
||||
"title": "UN R100 电动汽车安全认证法规第四次修订草案发布",
|
||||
"summary": (
|
||||
"拟对 400V 以上高压系统的绝缘电阻监测提出实时 CAN 总线传输要求;"
|
||||
"新增极低温工况(-40°C)的电池性能验证程序。"
|
||||
),
|
||||
"impact_level": "medium",
|
||||
"published_at": "2026-04-08",
|
||||
"effective_at": None,
|
||||
"category": "电动汽车安全",
|
||||
"tags": ["R100", "高压安全", "绝缘监测", "低温性能"],
|
||||
"source_url": "https://unece.org/transport/vehicle-regulations",
|
||||
"status": "draft",
|
||||
},
|
||||
{
|
||||
"id": "evt-019",
|
||||
"source": "ISO",
|
||||
"source_label": "国际标准化组织",
|
||||
"standard_code": "ISO/SAE 21434:2026 Amd.1",
|
||||
"title": "ISO/SAE 21434 汽车网络安全工程第一次修正",
|
||||
"summary": (
|
||||
"将 AI 推理组件纳入汽车网络安全工程范围;"
|
||||
"补充端到端加密通信在 V2X 场景中的 TARA 建模要求;"
|
||||
"新增第三方 ECU 供应商 CSMS 审计方法。"
|
||||
),
|
||||
"impact_level": "medium",
|
||||
"published_at": "2026-04-10",
|
||||
"effective_at": "2026-10-01",
|
||||
"category": "网络安全",
|
||||
"tags": ["ISO 21434", "网络安全", "V2X", "AI安全"],
|
||||
"source_url": "https://www.iso.org",
|
||||
"status": "enacted",
|
||||
},
|
||||
# ------------------------------------------------------------------ LOW
|
||||
{
|
||||
"id": "evt-011",
|
||||
"source": "ISO",
|
||||
"source_label": "国际标准化组织",
|
||||
"standard_code": "ISO 26262:2026 Ed.3(征求意见)",
|
||||
"title": "ISO 26262 功能安全第三版征求意见启动",
|
||||
"summary": (
|
||||
"拟新增对 AI/ML 组件功能安全验证方法的指导附录;"
|
||||
"讨论 SOTIF(预期功能安全)与 ISO 26262 的协调融合路径。"
|
||||
),
|
||||
"impact_level": "low",
|
||||
"published_at": "2026-04-01",
|
||||
"effective_at": None,
|
||||
"category": "功能安全",
|
||||
"tags": ["功能安全", "ASIL", "AI安全", "SOTIF"],
|
||||
"source_url": "https://www.iso.org",
|
||||
"status": "consultation",
|
||||
},
|
||||
{
|
||||
"id": "evt-012",
|
||||
"source": "EUR-Lex",
|
||||
"source_label": "欧盟官方公报",
|
||||
"standard_code": "REACH Regulation Update 2026",
|
||||
"title": "欧盟 REACH 法规限制物质清单更新(第 22 批)",
|
||||
"summary": (
|
||||
"新增 3 种 SVHCs(高度关注物质),包括特定阻燃剂和密封材料成分;"
|
||||
"汽车零部件豁免条款调整,影响部分内饰材料供应商。"
|
||||
),
|
||||
"impact_level": "low",
|
||||
"published_at": "2026-01-30",
|
||||
"effective_at": "2026-09-01",
|
||||
"category": "环保法规",
|
||||
"tags": ["REACH", "SVHCs", "环保", "化学品管理"],
|
||||
"source_url": "https://eur-lex.europa.eu",
|
||||
"status": "enacted",
|
||||
},
|
||||
{
|
||||
"id": "evt-013",
|
||||
"source": "MIIT",
|
||||
"source_label": "工业和信息化部",
|
||||
"standard_code": "CCER 汽车碳配额 2026",
|
||||
"title": "自愿减排(CCER)汽车行业核算方法学更新",
|
||||
"summary": (
|
||||
"更新纯电动汽车全生命周期碳排放核算边界;"
|
||||
"新增动力电池回收环节碳减排量认定方法;"
|
||||
"与全国碳市场对接的企业碳账户数据接口规范发布。"
|
||||
),
|
||||
"impact_level": "low",
|
||||
"published_at": "2026-02-10",
|
||||
"effective_at": "2026-06-01",
|
||||
"category": "碳排放",
|
||||
"tags": ["CCER", "碳排放", "碳中和", "碳核算"],
|
||||
"source_url": "https://www.miit.gov.cn/",
|
||||
"status": "enacted",
|
||||
},
|
||||
{
|
||||
"id": "evt-014",
|
||||
"source": "IATF",
|
||||
"source_label": "国际汽车工作组",
|
||||
"standard_code": "IATF 16949:2025 CSR 通告",
|
||||
"title": "IATF 16949 质量管理体系客户特殊要求更新通告",
|
||||
"summary": (
|
||||
"多家主机厂(OEM)同步更新 CSR,涵盖软件定义汽车(SDV)"
|
||||
"场景下的质量过程管控;电子电气 BOM 变更管理流程补充规范。"
|
||||
),
|
||||
"impact_level": "low",
|
||||
"published_at": "2026-03-20",
|
||||
"effective_at": "2026-07-01",
|
||||
"category": "质量管理",
|
||||
"tags": ["IATF 16949", "质量管理", "SDV", "CSR"],
|
||||
"source_url": "https://www.iatfglobaloversight.org",
|
||||
"status": "enacted",
|
||||
},
|
||||
{
|
||||
"id": "evt-015",
|
||||
"source": "国标委",
|
||||
"source_label": "国家标准化管理委员会",
|
||||
"standard_code": "GB 7258-2025 勘误",
|
||||
"title": "《机动车运行安全技术条件》年度勘误发布",
|
||||
"summary": (
|
||||
"更正第 12 章灯光系统技术要求中的参数引用错误;"
|
||||
"澄清前雾灯安装位置尺寸定义;此次为勘误性修订,不影响已认证车型。"
|
||||
),
|
||||
"impact_level": "low",
|
||||
"published_at": "2026-01-05",
|
||||
"effective_at": "2026-01-05",
|
||||
"category": "运行安全",
|
||||
"tags": ["GB 7258", "灯光", "运行安全", "勘误"],
|
||||
"source_url": "https://openstd.samr.gov.cn",
|
||||
"status": "enacted",
|
||||
},
|
||||
{
|
||||
"id": "evt-020",
|
||||
"source": "国标委",
|
||||
"source_label": "国家标准化管理委员会",
|
||||
"standard_code": "GB/T 27930-2026",
|
||||
"title": "《电动汽车非车载传导式充电通信协议》更新版发布",
|
||||
"summary": (
|
||||
"兼容 CHAdeMO 4.0 与 CCS2 双协议栈;"
|
||||
"新增大功率充电(>350kW)通信握手流程;"
|
||||
"强化充电过程 BMS 实时诊断数据上报规范。"
|
||||
),
|
||||
"impact_level": "low",
|
||||
"published_at": "2026-03-25",
|
||||
"effective_at": "2026-12-01",
|
||||
"category": "充电标准",
|
||||
"tags": ["充电协议", "BMS", "大功率充电", "CHAdeMO"],
|
||||
"source_url": "https://openstd.samr.gov.cn",
|
||||
"status": "enacted",
|
||||
},
|
||||
]
|
||||
|
||||
# Index for fast lookup
|
||||
_EVENT_INDEX: dict[str, dict] = {e["id"]: e for e in MOCK_EVENTS}
|
||||
|
||||
|
||||
class MockEventStore:
|
||||
"""In-memory mock store for regulatory events."""
|
||||
|
||||
def all(self) -> list[dict]:
|
||||
return list(MOCK_EVENTS)
|
||||
|
||||
def get(self, event_id: str) -> dict | None:
|
||||
return _EVENT_INDEX.get(event_id)
|
||||
|
||||
def filter(
|
||||
self,
|
||||
*,
|
||||
source: str | None = None,
|
||||
impact_level: str | None = None,
|
||||
limit: int = 50,
|
||||
) -> list[dict]:
|
||||
events = list(MOCK_EVENTS)
|
||||
if source:
|
||||
events = [e for e in events if e["source"] == source]
|
||||
if impact_level:
|
||||
events = [e for e in events if e["impact_level"] == impact_level]
|
||||
events.sort(key=lambda e: e["published_at"], reverse=True)
|
||||
return events[:limit]
|
||||
|
||||
def stats(self) -> dict:
|
||||
from datetime import date, timedelta
|
||||
|
||||
events = MOCK_EVENTS
|
||||
cutoff = (date.today() - timedelta(days=90)).isoformat()
|
||||
return {
|
||||
"total": len(events),
|
||||
"high_impact": sum(1 for e in events if e["impact_level"] == "high"),
|
||||
"medium_impact": sum(1 for e in events if e["impact_level"] == "medium"),
|
||||
"low_impact": sum(1 for e in events if e["impact_level"] == "low"),
|
||||
"recent_90d": sum(1 for e in events if e["published_at"] >= cutoff),
|
||||
}
|
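Review note: `filter` and `stats` both compare `published_at` as plain strings, which works because ISO 8601 dates (`YYYY-MM-DD`) sort lexicographically in chronological order. The sketch below isolates that behaviour. The three events and the helper names are made up for illustration, and `today` is passed in explicitly (unlike the `date.today()` call in the diff) so the result is reproducible.

```python
from __future__ import annotations

from datetime import date, timedelta
from typing import Any

# Hypothetical sample events; identifiers do not correspond to MOCK_EVENTS.
EVENTS: list[dict[str, Any]] = [
    {"id": "evt-a", "source": "MIIT", "impact_level": "high", "published_at": "2026-03-01"},
    {"id": "evt-b", "source": "ISO", "impact_level": "low", "published_at": "2026-04-01"},
    {"id": "evt-c", "source": "MIIT", "impact_level": "low", "published_at": "2025-09-15"},
]


def filter_events(
    events: list[dict[str, Any]],
    *,
    source: str | None = None,
    impact_level: str | None = None,
    limit: int = 50,
) -> list[dict[str, Any]]:
    """Apply the optional filters, then return the newest events first."""
    selected = list(events)
    if source:
        selected = [e for e in selected if e["source"] == source]
    if impact_level:
        selected = [e for e in selected if e["impact_level"] == impact_level]
    # String comparison is sufficient for zero-padded ISO dates.
    selected.sort(key=lambda e: e["published_at"], reverse=True)
    return selected[:limit]


def recent_count(events: list[dict[str, Any]], *, today: date, days: int = 90) -> int:
    """Count events published within the last `days` days, inclusive of the cutoff."""
    cutoff = (today - timedelta(days=days)).isoformat()
    return sum(1 for e in events if e["published_at"] >= cutoff)


miit_ids = [e["id"] for e in filter_events(EVENTS, source="MIIT")]
newest = filter_events(EVENTS, limit=1)[0]["id"]
recent = recent_count(EVENTS, today=date(2026, 4, 20))
```

If a date ever arrives in another format (for example `2026/4/1`), the string comparison silently misorders it, so parsing to `date` would be the safer choice once real data replaces the mock.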
||||
@@ -0,0 +1,373 @@
|
||||
"""JSON-file-backed store for document processing history."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from datetime import UTC, datetime
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
from app.domain.documents import DocumentArtifact, DocumentProcessingRun, DocumentProcessingStore, DocumentStatusEvent
|
||||
# Keep JSON persistence behavior aligned with the lightweight document repository adapter.
|
||||
|
||||
|
||||
class JsonDocumentProcessingStore(DocumentProcessingStore):
|
||||
"""Persist processing history in a standalone JSON file."""
|
||||
|
||||
def __init__(self, file_path: str) -> None:
|
||||
"""Initialize the JSON processing history store."""
|
||||
self.file_path = Path(file_path)
|
||||
self.file_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
if not self.file_path.exists():
|
||||
self._save(self._empty_payload())
|
||||
|
||||
def _empty_payload(self) -> dict[str, dict[str, dict[str, Any]]]:
|
||||
"""Return the canonical empty JSON structure for processing history."""
|
||||
return {"runs": {}, "status_events": {}, "artifacts": {}}
|
||||
|
||||
    def _load(self) -> dict[str, dict[str, dict[str, Any]]]:
        """Load the full JSON payload and normalize missing sections."""
        if not self.file_path.exists():
            return self._empty_payload()
        payload = json.loads(self.file_path.read_text(encoding="utf-8") or "{}")
        # Guard against a file whose top-level JSON value is not an object.
        if not isinstance(payload, dict):
            payload = {}
        normalized = self._empty_payload()
        for key in normalized:
            section = payload.get(key, {})
            normalized[key] = section if isinstance(section, dict) else {}
        return normalized
|
||||
|
||||
def _save(self, payload: dict[str, dict[str, dict[str, Any]]]) -> None:
|
||||
"""Persist the full JSON payload with stable formatting."""
|
||||
self.file_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
|
||||
|
||||
def _serialize_datetime(self, value: datetime | None) -> str | None:
|
||||
"""Serialize optional datetimes into ISO8601 strings."""
|
||||
return value.isoformat() if value is not None else None
|
||||
|
||||
def _deserialize_datetime(self, value: str | None) -> datetime | None:
|
||||
"""Deserialize optional ISO8601 strings into datetimes."""
|
||||
return datetime.fromisoformat(value) if value else None
|
||||
|
||||
def _serialize_run(self, run: DocumentProcessingRun) -> dict[str, Any]:
|
||||
"""Serialize one processing run to a JSON-compatible payload."""
|
||||
return {
|
||||
"run_id": run.run_id,
|
||||
"doc_id": run.doc_id,
|
||||
"trigger_type": run.trigger_type,
|
||||
"run_status": run.run_status,
|
||||
"parser_backend": run.parser_backend,
|
||||
"chunk_backend": run.chunk_backend,
|
||||
"embedding_model": run.embedding_model,
|
||||
"index_name": run.index_name,
|
||||
"started_at": self._serialize_datetime(run.started_at),
|
||||
"stored_at": self._serialize_datetime(run.stored_at),
|
||||
"parsed_at": self._serialize_datetime(run.parsed_at),
|
||||
"indexed_at": self._serialize_datetime(run.indexed_at),
|
||||
"finished_at": self._serialize_datetime(run.finished_at),
|
||||
"layout_count": run.layout_count,
|
||||
"structure_node_count": run.structure_node_count,
|
||||
"semantic_block_count": run.semantic_block_count,
|
||||
"vector_chunk_count": run.vector_chunk_count,
|
||||
"chunk_count": run.chunk_count,
|
||||
"failure_stage": run.failure_stage,
|
||||
"error_message": run.error_message,
|
||||
"metadata": run.metadata,
|
||||
}
|
||||
|
||||
def _deserialize_run(self, payload: dict[str, Any]) -> DocumentProcessingRun:
|
||||
"""Deserialize one JSON payload into a processing run dataclass."""
|
||||
return DocumentProcessingRun(
|
||||
run_id=payload["run_id"],
|
||||
doc_id=payload["doc_id"],
|
||||
trigger_type=payload["trigger_type"],
|
||||
run_status=payload["run_status"],
|
||||
parser_backend=payload.get("parser_backend", ""),
|
||||
chunk_backend=payload.get("chunk_backend", ""),
|
||||
embedding_model=payload.get("embedding_model", ""),
|
||||
index_name=payload.get("index_name", ""),
|
||||
started_at=self._deserialize_datetime(payload.get("started_at")) or datetime.now(UTC),
|
||||
stored_at=self._deserialize_datetime(payload.get("stored_at")),
|
||||
parsed_at=self._deserialize_datetime(payload.get("parsed_at")),
|
||||
indexed_at=self._deserialize_datetime(payload.get("indexed_at")),
|
||||
finished_at=self._deserialize_datetime(payload.get("finished_at")),
|
||||
layout_count=int(payload.get("layout_count", 0) or 0),
|
||||
structure_node_count=int(payload.get("structure_node_count", 0) or 0),
|
||||
semantic_block_count=int(payload.get("semantic_block_count", 0) or 0),
|
||||
vector_chunk_count=int(payload.get("vector_chunk_count", 0) or 0),
|
||||
chunk_count=int(payload.get("chunk_count", 0) or 0),
|
||||
failure_stage=payload.get("failure_stage", ""),
|
||||
error_message=payload.get("error_message", ""),
|
||||
metadata=payload.get("metadata", {}),
|
||||
)
|
||||
|
||||
def _serialize_event(self, event: DocumentStatusEvent) -> dict[str, Any]:
|
||||
"""Serialize one status event to a JSON-compatible payload."""
|
||||
return {
|
||||
"event_id": event.event_id,
|
||||
"doc_id": event.doc_id,
|
||||
"run_id": event.run_id,
|
||||
"from_status": event.from_status,
|
||||
"to_status": event.to_status,
|
||||
"stage": event.stage,
|
||||
"message": event.message,
|
||||
"metadata": event.metadata,
|
||||
"occurred_at": self._serialize_datetime(event.occurred_at),
|
||||
}
|
||||
|
||||
def _deserialize_event(self, payload: dict[str, Any]) -> DocumentStatusEvent:
|
||||
"""Deserialize one JSON payload into a status event dataclass."""
|
||||
return DocumentStatusEvent(
|
||||
event_id=payload["event_id"],
|
||||
doc_id=payload["doc_id"],
|
||||
run_id=payload["run_id"],
|
||||
from_status=payload.get("from_status", ""),
|
||||
to_status=payload["to_status"],
|
||||
stage=payload.get("stage", ""),
|
||||
message=payload.get("message", ""),
|
||||
metadata=payload.get("metadata", {}),
|
||||
occurred_at=self._deserialize_datetime(payload.get("occurred_at")) or datetime.now(UTC),
|
||||
)
|
||||
|
||||
def _serialize_artifact(self, artifact: DocumentArtifact) -> dict[str, Any]:
|
||||
"""Serialize one artifact reference to a JSON-compatible payload."""
|
||||
return {
|
||||
"artifact_id": artifact.artifact_id,
|
||||
"doc_id": artifact.doc_id,
|
||||
"run_id": artifact.run_id,
|
||||
"artifact_type": artifact.artifact_type,
|
||||
"object_name": artifact.object_name,
|
||||
"content_type": artifact.content_type,
|
||||
"byte_size": artifact.byte_size,
|
||||
"checksum": artifact.checksum,
|
||||
"metadata": artifact.metadata,
|
||||
"created_at": self._serialize_datetime(artifact.created_at),
|
||||
}
|
||||
|
||||
def _deserialize_artifact(self, payload: dict[str, Any]) -> DocumentArtifact:
|
||||
"""Deserialize one JSON payload into an artifact dataclass."""
|
||||
return DocumentArtifact(
|
||||
artifact_id=payload["artifact_id"],
|
||||
doc_id=payload["doc_id"],
|
||||
run_id=payload["run_id"],
|
||||
artifact_type=payload["artifact_type"],
|
||||
object_name=payload["object_name"],
|
||||
content_type=payload.get("content_type", ""),
|
||||
byte_size=int(payload.get("byte_size", 0) or 0),
|
||||
checksum=payload.get("checksum", ""),
|
||||
metadata=payload.get("metadata", {}),
|
||||
created_at=self._deserialize_datetime(payload.get("created_at")) or datetime.now(UTC),
|
||||
)
|
||||
|
||||
def _merge_metadata(self, original: dict[str, Any], update: dict | None) -> dict[str, Any]:
|
||||
"""Merge metadata updates onto an existing payload."""
|
||||
merged = dict(original)
|
||||
if update:
|
||||
merged.update(update)
|
||||
return merged
|
||||
|
||||
def create_run(self, run: DocumentProcessingRun) -> DocumentProcessingRun:
|
||||
"""Create a new processing run record."""
|
||||
payload = self._load()
|
||||
payload["runs"][run.run_id] = self._serialize_run(run)
|
||||
self._save(payload)
|
||||
return run
|
||||
|
||||
def mark_run_stored(
|
||||
self,
|
||||
run_id: str,
|
||||
*,
|
||||
stored_at: datetime | None = None,
|
||||
metadata: dict | None = None,
|
||||
) -> DocumentProcessingRun | None:
|
||||
"""Mark a run as having persisted the source file."""
|
||||
payload = self._load()
|
||||
run_payload = payload["runs"].get(run_id)
|
||||
if not run_payload:
|
||||
return None
|
||||
run = self._deserialize_run(run_payload)
|
||||
run.stored_at = stored_at or datetime.now(UTC)
|
||||
run.metadata = self._merge_metadata(run.metadata, metadata)
|
||||
payload["runs"][run_id] = self._serialize_run(run)
|
||||
self._save(payload)
|
||||
return run
|
||||
|
||||
def mark_run_parsed(
|
||||
self,
|
||||
run_id: str,
|
||||
*,
|
||||
parser_backend: str,
|
||||
layout_count: int,
|
||||
structure_node_count: int,
|
||||
semantic_block_count: int,
|
||||
vector_chunk_count: int,
|
||||
parsed_at: datetime | None = None,
|
||||
metadata: dict | None = None,
|
||||
) -> DocumentProcessingRun | None:
|
||||
"""Record parse completion details for a run."""
|
||||
payload = self._load()
|
||||
run_payload = payload["runs"].get(run_id)
|
||||
if not run_payload:
|
||||
return None
|
||||
run = self._deserialize_run(run_payload)
|
||||
run.parser_backend = parser_backend
|
||||
run.layout_count = layout_count
|
||||
run.structure_node_count = structure_node_count
|
||||
run.semantic_block_count = semantic_block_count
|
||||
run.vector_chunk_count = vector_chunk_count
|
||||
run.parsed_at = parsed_at or datetime.now(UTC)
|
||||
run.metadata = self._merge_metadata(run.metadata, metadata)
|
||||
payload["runs"][run_id] = self._serialize_run(run)
|
||||
self._save(payload)
|
||||
return run
|
||||
|
||||
def mark_run_indexed(
|
||||
self,
|
||||
run_id: str,
|
||||
*,
|
||||
chunk_count: int,
|
||||
index_name: str,
|
||||
indexed_at: datetime | None = None,
|
||||
finished_at: datetime | None = None,
|
||||
metadata: dict | None = None,
|
||||
) -> DocumentProcessingRun | None:
|
||||
"""Mark a run as successfully indexed."""
|
||||
payload = self._load()
|
||||
run_payload = payload["runs"].get(run_id)
|
||||
if not run_payload:
|
||||
return None
|
||||
run = self._deserialize_run(run_payload)
|
||||
now = datetime.now(UTC)
|
||||
run.run_status = "succeeded"
|
||||
run.chunk_count = chunk_count
|
||||
run.index_name = index_name
|
||||
run.indexed_at = indexed_at or now
|
||||
run.finished_at = finished_at or now
|
||||
run.metadata = self._merge_metadata(run.metadata, metadata)
|
||||
payload["runs"][run_id] = self._serialize_run(run)
|
||||
self._save(payload)
|
||||
return run
|
||||
|
||||
def mark_run_failed(
|
||||
self,
|
||||
run_id: str,
|
||||
*,
|
||||
failure_stage: str,
|
||||
error_message: str,
|
||||
finished_at: datetime | None = None,
|
||||
metadata: dict | None = None,
|
||||
) -> DocumentProcessingRun | None:
|
||||
"""Mark a run as failed."""
|
||||
payload = self._load()
|
||||
run_payload = payload["runs"].get(run_id)
|
||||
if not run_payload:
|
||||
return None
|
||||
run = self._deserialize_run(run_payload)
|
||||
run.run_status = "failed"
|
||||
run.failure_stage = failure_stage
|
||||
run.error_message = error_message
|
||||
run.finished_at = finished_at or datetime.now(UTC)
|
||||
run.metadata = self._merge_metadata(run.metadata, metadata)
|
||||
payload["runs"][run_id] = self._serialize_run(run)
|
||||
self._save(payload)
|
||||
return run
|
||||
|
||||
def append_status_event(self, event: DocumentStatusEvent) -> DocumentStatusEvent:
|
||||
"""Append a document status event."""
|
||||
payload = self._load()
|
||||
payload["status_events"][event.event_id] = self._serialize_event(event)
|
||||
self._save(payload)
|
||||
return event
|
||||
|
||||
def replace_artifacts_for_run(self, run_id: str, artifacts: list[DocumentArtifact]) -> list[DocumentArtifact]:
|
||||
"""Replace all artifacts for a run with the provided list."""
|
||||
payload = self._load()
|
||||
payload["artifacts"] = {
|
||||
artifact_id: artifact_payload
|
||||
for artifact_id, artifact_payload in payload["artifacts"].items()
|
||||
if artifact_payload.get("run_id") != run_id
|
||||
}
|
||||
for artifact in artifacts:
|
||||
payload["artifacts"][artifact.artifact_id] = self._serialize_artifact(artifact)
|
||||
self._save(payload)
|
||||
return artifacts
|
||||
|
||||
def delete_by_document(self, doc_id: str) -> None:
|
||||
"""Delete all processing data for a document."""
|
||||
payload = self._load()
|
||||
payload["runs"] = {
|
||||
run_id: run_payload
|
||||
for run_id, run_payload in payload["runs"].items()
|
||||
if run_payload.get("doc_id") != doc_id
|
||||
}
|
||||
payload["status_events"] = {
|
||||
event_id: event_payload
|
||||
for event_id, event_payload in payload["status_events"].items()
|
||||
if event_payload.get("doc_id") != doc_id
|
||||
}
|
||||
payload["artifacts"] = {
|
||||
artifact_id: artifact_payload
|
||||
for artifact_id, artifact_payload in payload["artifacts"].items()
|
||||
if artifact_payload.get("doc_id") != doc_id
|
||||
}
|
||||
self._save(payload)
|
||||
|
||||
def list_runs_by_document(self, doc_id: str) -> list[DocumentProcessingRun]:
|
||||
"""List all processing runs for a document."""
|
||||
payload = self._load()
|
||||
runs = [
|
||||
self._deserialize_run(run_payload)
|
||||
for run_payload in payload["runs"].values()
|
||||
if run_payload.get("doc_id") == doc_id
|
||||
]
|
||||
runs.sort(key=lambda run: run.started_at)
|
||||
return runs
|
||||
|
||||
def get_run(self, run_id: str) -> DocumentProcessingRun | None:
|
||||
"""Return one processing run by identifier."""
|
||||
payload = self._load()
|
||||
run_payload = payload["runs"].get(run_id)
|
||||
return self._deserialize_run(run_payload) if run_payload else None
|
||||
|
||||
def list_status_events_by_document(self, doc_id: str) -> list[DocumentStatusEvent]:
|
||||
"""List status events for a document."""
|
||||
payload = self._load()
|
||||
events = [
|
||||
self._deserialize_event(event_payload)
|
||||
for event_payload in payload["status_events"].values()
|
||||
if event_payload.get("doc_id") == doc_id
|
||||
]
|
||||
events.sort(key=lambda event: event.occurred_at)
|
||||
return events
|
||||
|
||||
def list_status_events_by_run(self, run_id: str) -> list[DocumentStatusEvent]:
|
||||
"""List status events for a run."""
|
||||
payload = self._load()
|
||||
events = [
|
||||
self._deserialize_event(event_payload)
|
||||
for event_payload in payload["status_events"].values()
|
||||
if event_payload.get("run_id") == run_id
|
||||
]
|
||||
events.sort(key=lambda event: event.occurred_at)
|
||||
return events
|
||||
|
||||
def list_artifacts_by_document(self, doc_id: str) -> list[DocumentArtifact]:
|
||||
"""List artifact references for a document."""
|
||||
payload = self._load()
|
||||
artifacts = [
|
||||
self._deserialize_artifact(artifact_payload)
|
||||
for artifact_payload in payload["artifacts"].values()
|
||||
if artifact_payload.get("doc_id") == doc_id
|
||||
]
|
||||
artifacts.sort(key=lambda artifact: artifact.created_at)
|
||||
return artifacts
|
||||
|
||||
def list_artifacts_by_run(self, run_id: str) -> list[DocumentArtifact]:
|
||||
"""List artifact references for a run."""
|
||||
payload = self._load()
|
||||
artifacts = [
|
||||
self._deserialize_artifact(artifact_payload)
|
||||
for artifact_payload in payload["artifacts"].values()
|
||||
if artifact_payload.get("run_id") == run_id
|
||||
]
|
||||
artifacts.sort(key=lambda artifact: artifact.created_at)
|
||||
return artifacts
|
||||
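The JSON store above rewrites its whole file on every mutation via `write_text`, so an interrupted write could leave truncated JSON behind. One possible hardening, sketched here under the assumption that the history file lives on a local filesystem, is to write to a temporary file in the same directory and swap it into place. The helper name `atomic_write_json` is hypothetical and is not part of the diff.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any


def atomic_write_json(file_path: Path, payload: dict[str, Any]) -> None:
    """Write payload to a temp file in the same directory, then swap it in."""
    file_path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=file_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, ensure_ascii=False, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, file_path)
    except BaseException:
        # Remove the temp file if anything failed before the swap.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

This does not address concurrent writers: two processes doing load-modify-save can still overwrite each other's changes, which would need a lock or a database-backed store.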
@@ -0,0 +1,466 @@
|
||||
"""PostgreSQL-backed store for document processing history."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from contextlib import contextmanager
|
||||
from datetime import UTC, datetime
|
||||
from typing import Any
|
||||
|
||||
import psycopg2
|
||||
import psycopg2.extras
|
||||
from psycopg2.pool import ThreadedConnectionPool
|
||||
|
||||
from app.config.settings import settings
|
||||
from app.domain.documents import DocumentArtifact, DocumentProcessingRun, DocumentProcessingStore, DocumentStatusEvent
|
||||
# Keep SQL mapping local to this adapter so the domain stays storage-agnostic.
|
||||
|
||||
_CREATE_RUNS_TABLE = """
|
||||
CREATE TABLE IF NOT EXISTS document_processing_runs (
|
||||
run_id VARCHAR(128) PRIMARY KEY,
|
||||
doc_id VARCHAR(128) NOT NULL,
|
||||
trigger_type VARCHAR(32) NOT NULL,
|
||||
run_status VARCHAR(32) NOT NULL DEFAULT 'running',
|
||||
parser_backend VARCHAR(128) NOT NULL DEFAULT '',
|
||||
chunk_backend VARCHAR(128) NOT NULL DEFAULT '',
|
||||
embedding_model VARCHAR(256) NOT NULL DEFAULT '',
|
||||
index_name VARCHAR(128) NOT NULL DEFAULT '',
|
||||
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||
stored_at TIMESTAMPTZ,
|
||||
parsed_at TIMESTAMPTZ,
|
||||
indexed_at TIMESTAMPTZ,
|
||||
finished_at TIMESTAMPTZ,
|
||||
layout_count INTEGER NOT NULL DEFAULT 0,
|
||||
structure_node_count INTEGER NOT NULL DEFAULT 0,
|
||||
semantic_block_count INTEGER NOT NULL DEFAULT 0,
|
||||
vector_chunk_count INTEGER NOT NULL DEFAULT 0,
|
||||
chunk_count INTEGER NOT NULL DEFAULT 0,
|
||||
failure_stage VARCHAR(64) NOT NULL DEFAULT '',
|
||||
error_message TEXT NOT NULL DEFAULT '',
|
||||
metadata JSONB NOT NULL DEFAULT '{}',
|
||||
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||
CONSTRAINT fk_dpr_doc FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_document_processing_runs_doc_id ON document_processing_runs(doc_id, started_at DESC);
|
||||
"""
|
||||
|
||||
_CREATE_EVENTS_TABLE = """
|
||||
CREATE TABLE IF NOT EXISTS document_status_history (
|
||||
event_id VARCHAR(128) PRIMARY KEY,
|
||||
doc_id VARCHAR(128) NOT NULL,
|
||||
run_id VARCHAR(128) NOT NULL,
|
||||
from_status VARCHAR(32) NOT NULL DEFAULT '',
|
||||
to_status VARCHAR(32) NOT NULL,
|
||||
stage VARCHAR(64) NOT NULL DEFAULT '',
|
||||
message TEXT NOT NULL DEFAULT '',
|
||||
metadata JSONB NOT NULL DEFAULT '{}',
|
||||
occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||
CONSTRAINT fk_dsh_doc FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE,
|
||||
CONSTRAINT fk_dsh_run FOREIGN KEY (run_id) REFERENCES document_processing_runs(run_id) ON DELETE CASCADE
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_document_status_history_doc_id ON document_status_history(doc_id, occurred_at ASC);
|
||||
CREATE INDEX IF NOT EXISTS idx_document_status_history_run_id ON document_status_history(run_id, occurred_at ASC);
|
||||
"""
|
||||
|
||||
_CREATE_ARTIFACTS_TABLE = """
|
||||
CREATE TABLE IF NOT EXISTS document_artifacts (
|
||||
artifact_id VARCHAR(128) PRIMARY KEY,
|
||||
doc_id VARCHAR(128) NOT NULL,
|
||||
run_id VARCHAR(128) NOT NULL,
|
||||
artifact_type VARCHAR(64) NOT NULL,
|
||||
object_name VARCHAR(1024) NOT NULL,
|
||||
content_type VARCHAR(128) NOT NULL DEFAULT '',
|
||||
byte_size BIGINT NOT NULL DEFAULT 0,
|
||||
checksum VARCHAR(256) NOT NULL DEFAULT '',
|
||||
metadata JSONB NOT NULL DEFAULT '{}',
|
||||
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||
CONSTRAINT fk_da_doc FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE,
|
||||
CONSTRAINT fk_da_run FOREIGN KEY (run_id) REFERENCES document_processing_runs(run_id) ON DELETE CASCADE
|
||||
);
|
||||
CREATE INDEX IF NOT EXISTS idx_document_artifacts_doc_id ON document_artifacts(doc_id, created_at ASC);
|
||||
CREATE INDEX IF NOT EXISTS idx_document_artifacts_run_id ON document_artifacts(run_id, created_at ASC);
|
||||
"""
|
||||
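The three table definitions above chain `ON DELETE CASCADE` from `documents` to runs, and from runs to status events and artifacts. The sketch below illustrates that cascade using SQLite from the Python standard library as a stand-in for PostgreSQL, purely so it runs without a server. The tables are trimmed to their key columns and the function name `cascade_demo` is hypothetical.

```python
import sqlite3


def cascade_demo() -> tuple[int, int]:
    """Delete one document and report how many child rows survive."""
    conn = sqlite3.connect(":memory:")
    # SQLite enforces foreign keys only when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE documents (doc_id TEXT PRIMARY KEY);
        CREATE TABLE document_processing_runs (
            run_id TEXT PRIMARY KEY,
            doc_id TEXT NOT NULL REFERENCES documents(doc_id) ON DELETE CASCADE
        );
        CREATE TABLE document_artifacts (
            artifact_id TEXT PRIMARY KEY,
            doc_id TEXT NOT NULL REFERENCES documents(doc_id) ON DELETE CASCADE,
            run_id TEXT NOT NULL
                REFERENCES document_processing_runs(run_id) ON DELETE CASCADE
        );
        """
    )
    conn.execute("INSERT INTO documents VALUES ('doc-1')")
    conn.execute("INSERT INTO document_processing_runs VALUES ('run-1', 'doc-1')")
    conn.execute("INSERT INTO document_artifacts VALUES ('art-1', 'doc-1', 'run-1')")
    # Removing the parent document removes its runs and artifacts too.
    conn.execute("DELETE FROM documents WHERE doc_id = 'doc-1'")
    runs_left = conn.execute("SELECT COUNT(*) FROM document_processing_runs").fetchone()[0]
    artifacts_left = conn.execute("SELECT COUNT(*) FROM document_artifacts").fetchone()[0]
    conn.close()
    return runs_left, artifacts_left
```

With cascades in place, the explicit per-table deletes in `delete_by_document` are only needed when processing history is cleared while the `documents` row itself is kept.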
|
||||
|
||||
class PostgresDocumentProcessingStore(DocumentProcessingStore):
|
||||
"""Persist processing history in PostgreSQL using handwritten SQL."""
|
||||
|
||||
def __init__(self) -> None:
|
||||
"""Initialize the store and ensure the required tables exist."""
|
||||
self._pool = ThreadedConnectionPool(
|
||||
minconn=1,
|
||||
maxconn=5,
|
||||
host=settings.postgres_host,
|
||||
port=settings.postgres_port,
|
||||
user=settings.postgres_user,
|
||||
password=settings.postgres_password,
|
||||
dbname=settings.postgres_db,
|
||||
)
|
||||
self._ensure_schema()
|
||||
|
||||
def _ensure_schema(self) -> None:
|
||||
"""Create processing history tables and indexes if they are missing."""
|
||||
with self._conn() as conn:
|
||||
with conn.cursor() as cur:
|
||||
cur.execute(_CREATE_RUNS_TABLE)
|
||||
cur.execute(_CREATE_EVENTS_TABLE)
|
||||
cur.execute(_CREATE_ARTIFACTS_TABLE)
|
||||
conn.commit()
|
||||
|
||||
@contextmanager
|
||||
def _conn(self):
|
||||
"""Borrow one connection from the pool and return it afterwards."""
|
||||
conn = self._pool.getconn()
|
||||
try:
|
||||
yield conn
|
||||
finally:
|
||||
self._pool.putconn(conn)
|
||||
|
||||
def _normalize_metadata(self, value: Any) -> dict[str, Any]:
|
||||
"""Return a JSON-object payload regardless of the row representation."""
|
||||
if isinstance(value, dict):
|
||||
return value
|
||||
if not value:
|
||||
return {}
|
||||
return json.loads(value)
|
||||
|
||||
def _row_to_run(self, row: dict[str, Any]) -> DocumentProcessingRun:
|
||||
"""Map one run row into the domain dataclass."""
|
||||
return DocumentProcessingRun(
|
||||
run_id=row["run_id"],
|
||||
doc_id=row["doc_id"],
|
||||
trigger_type=row["trigger_type"],
|
||||
run_status=row["run_status"],
|
||||
parser_backend=row["parser_backend"],
|
||||
chunk_backend=row["chunk_backend"],
|
||||
embedding_model=row["embedding_model"],
|
||||
index_name=row["index_name"],
|
||||
started_at=row["started_at"],
|
||||
stored_at=row["stored_at"],
|
||||
parsed_at=row["parsed_at"],
|
||||
indexed_at=row["indexed_at"],
|
||||
finished_at=row["finished_at"],
|
||||
layout_count=row["layout_count"],
|
||||
structure_node_count=row["structure_node_count"],
|
||||
semantic_block_count=row["semantic_block_count"],
|
||||
vector_chunk_count=row["vector_chunk_count"],
|
||||
chunk_count=row["chunk_count"],
|
||||
failure_stage=row["failure_stage"],
|
||||
error_message=row["error_message"],
|
||||
metadata=self._normalize_metadata(row["metadata"]),
|
||||
)
|
||||
|
||||
def _row_to_event(self, row: dict[str, Any]) -> DocumentStatusEvent:
|
||||
"""Map one event row into the domain dataclass."""
|
||||
return DocumentStatusEvent(
|
||||
event_id=row["event_id"],
|
||||
doc_id=row["doc_id"],
|
||||
run_id=row["run_id"],
|
||||
from_status=row["from_status"],
|
||||
to_status=row["to_status"],
|
||||
stage=row["stage"],
|
||||
message=row["message"],
|
||||
metadata=self._normalize_metadata(row["metadata"]),
|
||||
occurred_at=row["occurred_at"],
|
||||
)
|
||||
|
||||
def _row_to_artifact(self, row: dict[str, Any]) -> DocumentArtifact:
|
||||
"""Map one artifact row into the domain dataclass."""
|
||||
return DocumentArtifact(
|
||||
artifact_id=row["artifact_id"],
|
||||
doc_id=row["doc_id"],
|
||||
run_id=row["run_id"],
|
||||
artifact_type=row["artifact_type"],
|
||||
object_name=row["object_name"],
|
||||
content_type=row["content_type"],
|
||||
byte_size=row["byte_size"],
|
||||
checksum=row["checksum"],
|
||||
metadata=self._normalize_metadata(row["metadata"]),
|
||||
created_at=row["created_at"],
|
||||
)
|
||||
|
||||
def _update_run(
|
||||
self,
|
||||
run_id: str,
|
||||
*,
|
||||
assignments: dict[str, Any],
|
||||
metadata: dict | None = None,
|
||||
) -> DocumentProcessingRun | None:
|
||||
"""Update one run row and return the latest stored state."""
|
||||
set_clauses = []
|
||||
params: dict[str, Any] = {"run_id": run_id, "updated_at": datetime.now(UTC)}
|
||||
for key, value in assignments.items():
|
||||
set_clauses.append(f"{key} = %({key})s")
|
||||
params[key] = value
|
||||
set_clauses.append("updated_at = %(updated_at)s")
|
||||
if metadata is not None:
|
||||
set_clauses.append("metadata = COALESCE(metadata, '{}'::jsonb) || %(metadata)s::jsonb")
|
||||
params["metadata"] = json.dumps(metadata, ensure_ascii=False)
|
||||
sql = f"""
|
||||
UPDATE document_processing_runs
|
||||
SET {", ".join(set_clauses)}
|
||||
WHERE run_id = %(run_id)s
|
||||
RETURNING *
|
||||
"""
|
||||
with self._conn() as conn:
|
||||
with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
|
||||
cur.execute(sql, params)
|
||||
row = cur.fetchone()
|
||||
conn.commit()
|
||||
return self._row_to_run(dict(row)) if row else None
|
||||
|
||||
def create_run(self, run: DocumentProcessingRun) -> DocumentProcessingRun:
|
||||
"""Create a new processing run record."""
|
||||
sql = """
|
||||
INSERT INTO document_processing_runs
|
||||
(run_id, doc_id, trigger_type, run_status, parser_backend, chunk_backend,
|
||||
embedding_model, index_name, started_at, stored_at, parsed_at, indexed_at,
|
||||
finished_at, layout_count, structure_node_count, semantic_block_count,
|
||||
vector_chunk_count, chunk_count, failure_stage, error_message, metadata)
|
||||
VALUES
|
||||
(%(run_id)s, %(doc_id)s, %(trigger_type)s, %(run_status)s, %(parser_backend)s,
|
||||
%(chunk_backend)s, %(embedding_model)s, %(index_name)s, %(started_at)s,
|
||||
%(stored_at)s, %(parsed_at)s, %(indexed_at)s, %(finished_at)s, %(layout_count)s,
|
||||
%(structure_node_count)s, %(semantic_block_count)s, %(vector_chunk_count)s,
|
||||
%(chunk_count)s, %(failure_stage)s, %(error_message)s, %(metadata)s)
|
||||
"""
|
||||
with self._conn() as conn:
|
||||
with conn.cursor() as cur:
|
||||
cur.execute(
|
||||
sql,
|
||||
{
|
||||
"run_id": run.run_id,
|
||||
"doc_id": run.doc_id,
|
||||
"trigger_type": run.trigger_type,
|
||||
"run_status": run.run_status,
|
||||
"parser_backend": run.parser_backend,
|
||||
"chunk_backend": run.chunk_backend,
|
||||
"embedding_model": run.embedding_model,
|
||||
"index_name": run.index_name,
|
||||
"started_at": run.started_at,
|
||||
"stored_at": run.stored_at,
|
||||
"parsed_at": run.parsed_at,
|
||||
"indexed_at": run.indexed_at,
|
||||
"finished_at": run.finished_at,
|
||||
"layout_count": run.layout_count,
|
||||
"structure_node_count": run.structure_node_count,
|
||||
"semantic_block_count": run.semantic_block_count,
|
||||
"vector_chunk_count": run.vector_chunk_count,
|
||||
"chunk_count": run.chunk_count,
|
||||
"failure_stage": run.failure_stage,
|
||||
"error_message": run.error_message,
|
||||
"metadata": json.dumps(run.metadata, ensure_ascii=False),
|
||||
},
|
||||
)
|
||||
conn.commit()
|
||||
return run
|
||||
|
||||
def mark_run_stored(
|
||||
self,
|
||||
run_id: str,
|
||||
*,
|
||||
stored_at: datetime | None = None,
|
||||
metadata: dict | None = None,
|
||||
) -> DocumentProcessingRun | None:
|
||||
"""Mark a run as having persisted its source file."""
|
||||
return self._update_run(
|
||||
run_id,
|
||||
assignments={"stored_at": stored_at or datetime.now(UTC)},
|
||||
metadata=metadata,
|
||||
)
|
||||
|
||||
def mark_run_parsed(
|
||||
self,
|
||||
run_id: str,
|
||||
*,
|
||||
parser_backend: str,
|
||||
layout_count: int,
|
||||
structure_node_count: int,
|
||||
semantic_block_count: int,
|
||||
vector_chunk_count: int,
|
||||
parsed_at: datetime | None = None,
|
||||
metadata: dict | None = None,
|
||||
) -> DocumentProcessingRun | None:
|
||||
"""Record parse completion metrics for a run."""
|
||||
return self._update_run(
|
||||
run_id,
|
||||
assignments={
|
||||
"parser_backend": parser_backend,
|
||||
"parsed_at": parsed_at or datetime.now(UTC),
|
||||
"layout_count": layout_count,
|
||||
"structure_node_count": structure_node_count,
|
||||
"semantic_block_count": semantic_block_count,
|
||||
"vector_chunk_count": vector_chunk_count,
|
||||
},
|
||||
metadata=metadata,
|
||||
)
|
||||
|
||||
def mark_run_indexed(
|
||||
self,
|
||||
run_id: str,
|
||||
*,
|
||||
chunk_count: int,
|
||||
index_name: str,
|
||||
indexed_at: datetime | None = None,
|
||||
finished_at: datetime | None = None,
|
||||
metadata: dict | None = None,
|
||||
) -> DocumentProcessingRun | None:
|
||||
"""Mark a run as successfully indexed."""
|
||||
now = datetime.now(UTC)
|
||||
return self._update_run(
|
||||
run_id,
|
||||
assignments={
|
||||
"run_status": "succeeded",
|
||||
"chunk_count": chunk_count,
|
||||
"index_name": index_name,
|
||||
"indexed_at": indexed_at or now,
|
||||
"finished_at": finished_at or now,
|
||||
},
|
||||
metadata=metadata,
|
||||
)
|
||||
|
||||
def mark_run_failed(
|
||||
self,
|
||||
run_id: str,
|
||||
*,
|
||||
failure_stage: str,
|
||||
error_message: str,
|
||||
finished_at: datetime | None = None,
|
||||
metadata: dict | None = None,
|
||||
) -> DocumentProcessingRun | None:
|
||||
"""Mark a run as failed and persist the terminal error details."""
|
||||
return self._update_run(
|
||||
run_id,
|
||||
assignments={
|
||||
"run_status": "failed",
|
||||
"failure_stage": failure_stage,
|
||||
"error_message": error_message,
|
||||
"finished_at": finished_at or datetime.now(UTC),
|
||||
},
|
||||
metadata=metadata,
|
||||
)
|
||||
|
||||
def append_status_event(self, event: DocumentStatusEvent) -> DocumentStatusEvent:
|
||||
"""Append a document status event."""
|
||||
sql = """
|
||||
INSERT INTO document_status_history
|
||||
(event_id, doc_id, run_id, from_status, to_status, stage, message, metadata, occurred_at)
|
||||
VALUES
|
||||
(%(event_id)s, %(doc_id)s, %(run_id)s, %(from_status)s, %(to_status)s,
|
||||
%(stage)s, %(message)s, %(metadata)s, %(occurred_at)s)
|
||||
"""
|
||||
with self._conn() as conn:
|
||||
with conn.cursor() as cur:
|
||||
cur.execute(
|
||||
sql,
|
||||
{
|
||||
"event_id": event.event_id,
|
||||
"doc_id": event.doc_id,
|
||||
"run_id": event.run_id,
|
||||
"from_status": event.from_status,
|
||||
"to_status": event.to_status,
|
||||
"stage": event.stage,
|
||||
"message": event.message,
|
||||
"metadata": json.dumps(event.metadata, ensure_ascii=False),
|
||||
"occurred_at": event.occurred_at,
|
||||
},
|
||||
)
|
||||
conn.commit()
|
||||
return event
|
||||
|
||||
def replace_artifacts_for_run(self, run_id: str, artifacts: list[DocumentArtifact]) -> list[DocumentArtifact]:
|
||||
"""Replace all artifact references for one run using a delete-then-insert strategy."""
|
||||
with self._conn() as conn:
|
||||
with conn.cursor() as cur:
|
||||
cur.execute("DELETE FROM document_artifacts WHERE run_id = %s", (run_id,))
|
||||
if artifacts:
|
||||
psycopg2.extras.execute_values(
|
||||
cur,
|
||||
"""
|
||||
INSERT INTO document_artifacts
|
||||
(artifact_id, doc_id, run_id, artifact_type, object_name,
|
||||
content_type, byte_size, checksum, metadata, created_at)
|
||||
VALUES %s
|
||||
""",
|
||||
[
|
||||
(
|
||||
artifact.artifact_id,
|
||||
artifact.doc_id,
|
||||
artifact.run_id,
|
||||
artifact.artifact_type,
|
||||
artifact.object_name,
|
||||
artifact.content_type,
|
||||
artifact.byte_size,
|
||||
artifact.checksum,
|
||||
json.dumps(artifact.metadata, ensure_ascii=False),
|
||||
artifact.created_at,
|
||||
)
|
||||
for artifact in artifacts
|
||||
],
|
||||
)
|
||||
conn.commit()
|
||||
return artifacts
|
||||
|
||||
def delete_by_document(self, doc_id: str) -> None:
|
||||
"""Delete all processing rows for a document explicitly."""
|
||||
with self._conn() as conn:
|
||||
with conn.cursor() as cur:
|
||||
cur.execute("DELETE FROM document_status_history WHERE doc_id = %s", (doc_id,))
|
||||
cur.execute("DELETE FROM document_artifacts WHERE doc_id = %s", (doc_id,))
|
||||
cur.execute("DELETE FROM document_processing_runs WHERE doc_id = %s", (doc_id,))
|
||||
conn.commit()
|
||||
|
||||
def list_runs_by_document(self, doc_id: str) -> list[DocumentProcessingRun]:
|
||||
"""List processing runs for a document in chronological order."""
|
||||
sql = "SELECT * FROM document_processing_runs WHERE doc_id = %s ORDER BY started_at ASC"
|
||||
with self._conn() as conn:
|
||||
with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
|
||||
cur.execute(sql, (doc_id,))
|
||||
rows = cur.fetchall()
|
||||
return [self._row_to_run(dict(row)) for row in rows]
|
||||
|
||||
def get_run(self, run_id: str) -> DocumentProcessingRun | None:
|
||||
"""Return one processing run by identifier."""
|
||||
sql = "SELECT * FROM document_processing_runs WHERE run_id = %s"
|
||||
with self._conn() as conn:
|
||||
with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
|
||||
cur.execute(sql, (run_id,))
|
||||
row = cur.fetchone()
|
||||
return self._row_to_run(dict(row)) if row else None
|
||||
|
||||
def list_status_events_by_document(self, doc_id: str) -> list[DocumentStatusEvent]:
|
||||
"""List all status events for a document."""
|
||||
sql = "SELECT * FROM document_status_history WHERE doc_id = %s ORDER BY occurred_at ASC"
|
||||
with self._conn() as conn:
|
||||
with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
|
||||
cur.execute(sql, (doc_id,))
|
||||
rows = cur.fetchall()
|
||||
return [self._row_to_event(dict(row)) for row in rows]
|
||||
|
||||
def list_status_events_by_run(self, run_id: str) -> list[DocumentStatusEvent]:
|
||||
"""List all status events for a run."""
|
||||
sql = "SELECT * FROM document_status_history WHERE run_id = %s ORDER BY occurred_at ASC"
|
||||
with self._conn() as conn:
|
||||
with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
|
||||
cur.execute(sql, (run_id,))
|
||||
rows = cur.fetchall()
|
||||
return [self._row_to_event(dict(row)) for row in rows]
|
||||
|
||||
def list_artifacts_by_document(self, doc_id: str) -> list[DocumentArtifact]:
|
||||
"""List all artifact references for a document."""
|
||||
sql = "SELECT * FROM document_artifacts WHERE doc_id = %s ORDER BY created_at ASC"
|
||||
with self._conn() as conn:
|
||||
with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
|
||||
cur.execute(sql, (doc_id,))
|
||||
rows = cur.fetchall()
|
||||
return [self._row_to_artifact(dict(row)) for row in rows]
|
||||
|
||||
def list_artifacts_by_run(self, run_id: str) -> list[DocumentArtifact]:
|
||||
"""List all artifact references for a run."""
|
||||
sql = "SELECT * FROM document_artifacts WHERE run_id = %s ORDER BY created_at ASC"
|
||||
with self._conn() as conn:
|
||||
with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
|
||||
cur.execute(sql, (run_id,))
|
||||
rows = cur.fetchall()
|
||||
return [self._row_to_artifact(dict(row)) for row in rows]
|
||||
@@ -56,7 +56,21 @@ class BM25Retriever:
|
||||
try:
|
||||
rows = self._vector_index.collection.query(
|
||||
expr='doc_id != ""',
|
||||
output_fields=["id", "doc_id", "doc_name", "content", "section_title", "page_number"],
|
||||
output_fields=[
|
||||
"id",
|
||||
"chunk_id",
|
||||
"doc_id",
|
||||
"doc_title",
|
||||
"text",
|
||||
"chunk_type",
|
||||
"section_title",
|
||||
"page_start",
|
||||
"page_end",
|
||||
"section_level",
|
||||
"chunk_index",
|
||||
"piece_index",
|
||||
"metadata_json",
|
||||
],
|
||||
limit=16384,
|
||||
)
|
||||
except Exception:
|
||||
@@ -64,19 +78,33 @@ class BM25Retriever:
|
||||
return []
|
||||
return [
|
||||
RetrievedChunk(
|
||||
chunk_id=str(row.get("id", "")),
|
||||
chunk_id=str(row.get("chunk_id") or row.get("id", "")),
|
||||
doc_id=str(row.get("doc_id", "")),
|
||||
doc_name=str(row.get("doc_name", "")),
|
||||
content=str(row.get("content", "")),
|
||||
doc_title=str(row.get("doc_title", "")),
|
||||
text=str(row.get("text", "")),
|
||||
score=0.0,
|
||||
chunk_type=str(row.get("chunk_type", "")),
|
||||
section_title=str(row.get("section_title", "")),
|
||||
page_number=int(row.get("page_number") or 0),
|
||||
metadata={},
|
||||
page_start=int(row.get("page_start") or 0),
|
||||
page_end=int(row.get("page_end") or 0),
|
||||
section_level=int(row.get("section_level") or 0),
|
||||
chunk_index=int(row.get("chunk_index") or 0),
|
||||
piece_index=int(row.get("piece_index") or 0),
|
||||
metadata=self._parse_metadata_json(row.get("metadata_json", "")),
|
||||
)
|
||||
for row in rows
|
||||
if row.get("content")
|
||||
if row.get("text")
|
||||
]
|
||||
|
||||
def _parse_metadata_json(self, raw_metadata: str) -> dict:
|
||||
"""Parse metadata_json into a dict for BM25-side filtering."""
|
||||
if not raw_metadata:
|
||||
return {}
|
||||
try:
|
||||
return dict(__import__("json").loads(raw_metadata))
|
||||
except Exception:
|
||||
return {}
|
||||
|
||||
def _ensure_built(self) -> None:
|
||||
if self._index is not None:
|
||||
return
|
||||
@@ -93,7 +121,7 @@ class BM25Retriever:
|
||||
self._chunks = []
|
||||
self._index = BM25Okapi([[]])
|
||||
return
|
||||
tokenized = [_tokenize(c.content) for c in chunks]
|
||||
tokenized = [_tokenize(c.text) for c in chunks]
|
||||
self._chunks = chunks
|
||||
self._index = BM25Okapi(tokenized)
|
||||
logger.info("BM25Retriever: index built with %d chunks", len(chunks))
|
||||
@@ -127,20 +155,26 @@ class BM25Retriever:
|
||||
for score, chunk in ranked[: top_k * 2]:
|
||||
if score <= 0:
|
||||
break
|
||||
# Apply simple regulation_type filter if provided
|
||||
if filters and chunk.metadata.get("regulation_type"):
|
||||
types = [t.strip() for t in filters.split(",")]
|
||||
if chunk.metadata.get("regulation_type") not in types:
|
||||
if filters:
|
||||
normalized_filter = filters.replace("doc_name", "doc_title").strip()
|
||||
if normalized_filter.startswith('doc_title == "'):
|
||||
expected_title = normalized_filter[len('doc_title == "'):-1]
|
||||
if chunk.doc_title != expected_title:
|
||||
continue
|
||||
results.append(
|
||||
RetrievedChunk(
|
||||
chunk_id=chunk.chunk_id,
|
||||
doc_id=chunk.doc_id,
|
||||
doc_name=chunk.doc_name,
|
||||
content=chunk.content,
|
||||
doc_title=chunk.doc_title,
|
||||
text=chunk.text,
|
||||
score=score,
|
||||
chunk_type=chunk.chunk_type,
|
||||
section_title=chunk.section_title,
|
||||
page_number=chunk.page_number,
|
||||
page_start=chunk.page_start,
|
||||
page_end=chunk.page_end,
|
||||
section_level=chunk.section_level,
|
||||
chunk_index=chunk.chunk_index,
|
||||
piece_index=chunk.piece_index,
|
||||
metadata=chunk.metadata,
|
||||
)
|
||||
)
|
||||
|
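The BM25 hunks above decode `metadata_json` defensively and apply an exact-match `doc_title` filter in Python rather than in Milvus. The following is a minimal standalone sketch of those two steps, not the repository's implementation: `parse_metadata_json` and `matches_title_filter` are illustrative names, and the rule that non-object JSON is discarded is an assumption.

```python
from __future__ import annotations

import json

# Prefix of the only filter shape this sketch understands (assumed from the diff).
_TITLE_PREFIX = 'doc_title == "'


def parse_metadata_json(raw_metadata: str) -> dict:
    """Return the decoded metadata dict, or {} when the payload is empty or invalid."""
    if not raw_metadata:
        return {}
    try:
        decoded = json.loads(raw_metadata)
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError.
        return {}
    # Only JSON objects are usable as metadata; lists and scalars are dropped.
    return decoded if isinstance(decoded, dict) else {}


def matches_title_filter(doc_title: str, filters: str | None) -> bool:
    """Apply an exact-match doc_title filter; any other filter shape is ignored."""
    if not filters:
        return True
    # Legacy callers may still send doc_name, mirroring the rename in the diff.
    normalized = filters.replace("doc_name", "doc_title").strip()
    if normalized.startswith(_TITLE_PREFIX) and normalized.endswith('"'):
        expected_title = normalized[len(_TITLE_PREFIX):-1]
        return doc_title == expected_title
    return True
```

Catching only the decoding errors, instead of a bare `Exception`, keeps unrelated bugs visible while still tolerating malformed rows.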
||||
@@ -31,7 +31,7 @@ class OpenAICompatibleReranker(Reranker):
|
||||
if not chunks:
|
||||
return []
|
||||
|
||||
texts = [chunk.content for chunk in chunks]
|
||||
texts = [chunk.text for chunk in chunks]
|
||||
start = time.time()
|
||||
try:
|
||||
scores = self._call_reranker(query, texts)
|
||||
|
||||
@@ -4,57 +4,150 @@ from __future__ import annotations
|
||||
|
||||
import json
|
||||
import time
|
||||
from typing import Iterable
|
||||
|
||||
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, connections, utility
|
||||
from loguru import logger
|
||||
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema, MilvusException, connections, utility
|
||||
|
||||
from app.config.settings import settings
|
||||
from app.domain.documents import Chunk
|
||||
from app.domain.retrieval import RetrievedChunk, VectorIndex
|
||||
from app.shared.errors import VectorStoreSchemaError
|
||||
# Keep adapter behavior explicit so integration details remain easy to audit.
|
||||
|
||||
|
||||
_REQUIRED_SCHEMA_FIELDS = (
|
||||
"doc_id",
|
||||
"doc_title",
|
||||
"chunk_id",
|
||||
"text",
|
||||
"embedding",
|
||||
"section_title",
|
||||
"metadata_json",
|
||||
)
|
||||
_SCHEMA_RECOVERY_TOKENS = (
|
||||
"field doc_title not exist",
|
||||
"field text not exist",
|
||||
"field embedding not exist",
|
||||
"collection not loaded",
|
||||
"can't find collection",
|
||||
"not found[collection",
|
||||
)
|
||||
|
||||
|
||||
|
||||
class MilvusVectorIndex(VectorIndex):
|
||||
"""Provide the Milvus Vector Index index implementation."""
|
||||
|
||||
def __init__(self) -> None:
|
||||
"""Initialize the Milvus Vector Index instance."""
|
||||
self.collection_name = settings.milvus_collection
|
||||
self.db_name = settings.milvus_db_name
|
||||
self.host = settings.milvus_host
|
||||
self.port = settings.milvus_port
|
||||
# Use an adapter-specific alias so this index never reuses unrelated global Milvus state.
|
||||
self.alias = f"vector-index::{self.host}:{self.port}/{self.db_name}/{self.collection_name}"
|
||||
self._connect()
|
||||
self.collection = self._bind_collection()
|
||||
|
||||
def _connect(self, *, refresh: bool = False) -> None:
|
||||
"""Establish the Milvus connection for this adapter."""
|
||||
if refresh:
|
||||
try:
|
||||
connections.disconnect(self.alias)
|
||||
except Exception:
|
||||
# Best-effort disconnect keeps refresh idempotent when no alias is active yet.
|
||||
pass
|
||||
connections.connect(
|
||||
alias="default",
|
||||
host=settings.milvus_host,
|
||||
port=settings.milvus_port,
|
||||
alias=self.alias,
|
||||
host=self.host,
|
||||
port=self.port,
|
||||
db_name=self.db_name,
|
||||
)
|
||||
self.collection = self._ensure_collection()
|
||||
|
||||
def _schema_field_names(self, collection: Collection) -> list[str]:
|
||||
"""Return the field names exposed by the bound Milvus collection."""
|
||||
return [field.name for field in collection.schema.fields]
|
||||
|
||||
def _raise_schema_error(self, *, message: str, actual_fields: Iterable[str]) -> None:
|
||||
"""Raise a typed schema error for the active collection."""
|
||||
raise VectorStoreSchemaError(
|
||||
message=message,
|
||||
host=self.host,
|
||||
db_name=self.db_name,
|
||||
collection_name=self.collection_name,
|
||||
expected_fields=list(_REQUIRED_SCHEMA_FIELDS),
|
||||
actual_fields=list(actual_fields),
|
||||
)
|
||||
|
||||
def _validate_schema(self, collection: Collection) -> None:
|
||||
"""Ensure the collection schema matches the dense-only adapter contract."""
|
||||
actual_fields = self._schema_field_names(collection)
|
||||
missing_fields = [field_name for field_name in _REQUIRED_SCHEMA_FIELDS if field_name not in actual_fields]
|
||||
if missing_fields:
|
||||
self._raise_schema_error(
|
||||
message=f"Milvus collection schema mismatch; missing required fields: {missing_fields}",
|
||||
actual_fields=actual_fields,
|
||||
)
|
||||
|
||||
def _log_collection_binding(self, collection: Collection, *, event: str) -> None:
|
||||
"""Record the bound collection details for runtime diagnostics."""
|
||||
try:
|
||||
num_entities = collection.num_entities
|
||||
except Exception:
|
||||
num_entities = "unknown"
|
||||
logger.info(
|
||||
"Milvus binding {} alias={} host={} db={} collection={} fields={} num_entities={}",
|
||||
event,
|
||||
self.alias,
|
||||
self.host,
|
||||
self.db_name,
|
||||
self.collection_name,
|
||||
self._schema_field_names(collection),
|
||||
num_entities,
|
||||
)
|
||||
|
||||
def _bind_collection(self, *, force_refresh: bool = False) -> Collection:
|
||||
"""Bind and validate the configured Milvus collection."""
|
||||
if force_refresh:
|
||||
self._connect(refresh=True)
|
||||
collection = self._ensure_collection()
|
||||
self._validate_schema(collection)
|
||||
self._log_collection_binding(collection, event="refreshed" if force_refresh else "initialized")
|
||||
return collection
|
||||
|
||||
def _ensure_collection(self) -> Collection:
|
||||
"""Handle ensure collection for this module for the Milvus Vector Index instance."""
|
||||
if utility.has_collection(self.collection_name):
|
||||
collection = Collection(self.collection_name)
|
||||
if utility.has_collection(self.collection_name, using=self.alias):
|
||||
collection = Collection(self.collection_name, using=self.alias)
|
||||
collection.load()
|
||||
return collection
|
||||
schema = CollectionSchema(
|
||||
fields=[
|
||||
FieldSchema(name="id", dtype=DataType.VARCHAR, max_length=128, is_primary=True, auto_id=False),
|
||||
FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=64),
|
||||
FieldSchema(name="doc_name", dtype=DataType.VARCHAR, max_length=256),
|
||||
FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535),
|
||||
FieldSchema(name="doc_title", dtype=DataType.VARCHAR, max_length=256),
|
||||
FieldSchema(name="chunk_id", dtype=DataType.VARCHAR, max_length=128),
|
||||
FieldSchema(name="chunk_index", dtype=DataType.INT64),
|
||||
FieldSchema(name="piece_index", dtype=DataType.INT64),
|
||||
FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
|
||||
FieldSchema(name="embedding_text", dtype=DataType.VARCHAR, max_length=65535),
|
||||
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=settings.embedding_dim),
|
||||
FieldSchema(name="section_title", dtype=DataType.VARCHAR, max_length=512),
|
||||
FieldSchema(name="section_path", dtype=DataType.VARCHAR, max_length=4096),
|
||||
FieldSchema(name="page_number", dtype=DataType.INT64),
|
||||
FieldSchema(name="regulation_type", dtype=DataType.VARCHAR, max_length=128),
|
||||
FieldSchema(name="version", dtype=DataType.VARCHAR, max_length=64),
|
||||
FieldSchema(name="semantic_id", dtype=DataType.VARCHAR, max_length=128),
|
||||
FieldSchema(name="block_type", dtype=DataType.VARCHAR, max_length=64),
|
||||
FieldSchema(name="chunk_type", dtype=DataType.VARCHAR, max_length=64),
|
||||
FieldSchema(name="page_start", dtype=DataType.INT64),
|
||||
FieldSchema(name="page_end", dtype=DataType.INT64),
|
||||
FieldSchema(name="section_level", dtype=DataType.INT64),
|
||||
FieldSchema(name="source_ids", dtype=DataType.VARCHAR, max_length=4096),
|
||||
FieldSchema(name="section_path", dtype=DataType.VARCHAR, max_length=4096),
|
||||
FieldSchema(name="section_title", dtype=DataType.VARCHAR, max_length=512),
|
||||
FieldSchema(name="metadata_json", dtype=DataType.VARCHAR, max_length=65535),
|
||||
FieldSchema(name="created_at", dtype=DataType.INT64),
|
||||
],
|
||||
description="Dense-only regulations index",
|
||||
enable_dynamic_field=False,
|
||||
)
|
||||
collection = Collection(name=self.collection_name, schema=schema)
|
||||
collection = Collection(name=self.collection_name, schema=schema, using=self.alias)
|
||||
collection.create_index(
|
||||
field_name="embedding",
|
||||
index_params={
|
||||
@@ -73,21 +166,34 @@ class MilvusVectorIndex(VectorIndex):
|
||||
data = []
|
||||
now = int(time.time())
|
||||
for chunk, vector in zip(chunks, vectors):
|
||||
metadata = dict(chunk.metadata)
|
||||
doc_title = str(metadata.get("doc_title", chunk.doc_title))
|
||||
text = str(metadata.get("text", chunk.text))
|
||||
embedding_text = str(metadata.get("embedding_text", chunk.embedding_text))
|
||||
page_start = int(metadata.get("page_start", 0) or 0)
|
||||
page_end = int(metadata.get("page_end", 0) or 0)
|
||||
section_path = metadata.get("section_path", chunk.section_path)
|
||||
source_ids = metadata.get("source_ids", [])
|
||||
data.append(
|
||||
{
|
||||
"id": chunk.chunk_id,
|
||||
"doc_id": chunk.doc_id,
|
||||
"doc_name": chunk.doc_name,
|
||||
"content": chunk.content[:65535],
|
||||
"doc_title": doc_title[:256],
|
||||
"chunk_id": chunk.chunk_id[:128],
|
||||
"chunk_index": int(metadata.get("chunk_index", chunk.chunk_index) or 0),
|
||||
"piece_index": int(metadata.get("piece_index", chunk.piece_index) or 0),
|
||||
"text": text[:65535],
|
||||
"embedding_text": embedding_text[:65535],
|
||||
"embedding": vector,
|
||||
"section_title": chunk.section_title[:512],
|
||||
"section_path": json.dumps(chunk.section_path, ensure_ascii=False)[:4096],
|
||||
"page_number": chunk.page_number,
|
||||
"regulation_type": chunk.regulation_type[:128],
|
||||
"version": chunk.version[:64],
|
||||
"semantic_id": chunk.semantic_id[:128],
|
||||
"block_type": chunk.block_type[:64],
|
||||
"metadata_json": json.dumps(chunk.metadata, ensure_ascii=False)[:65535],
|
||||
"semantic_id": str(metadata.get("semantic_id", chunk.semantic_id))[:128],
|
||||
"chunk_type": str(metadata.get("chunk_type", chunk.chunk_type))[:64],
|
||||
"page_start": page_start,
|
||||
"page_end": page_end,
|
||||
"section_level": int(metadata.get("section_level", chunk.section_level) or 0),
|
||||
"source_ids": json.dumps(source_ids, ensure_ascii=False)[:4096],
|
||||
"section_path": json.dumps(section_path, ensure_ascii=False)[:4096],
|
||||
"section_title": str(metadata.get("section_title", chunk.section_title))[:512],
|
||||
"metadata_json": json.dumps(metadata, ensure_ascii=False)[:65535],
|
||||
"created_at": now,
|
||||
}
|
||||
)
|
||||
@@ -107,30 +213,73 @@ class MilvusVectorIndex(VectorIndex):
|
||||
|
||||
filters = filters.strip()
|
||||
|
||||
# Normalize legacy field names so callers can keep older filter payloads.
|
||||
replacements = {
|
||||
"doc_name": "doc_title",
|
||||
"content": "text",
|
||||
"page_number": "page_start",
|
||||
"block_type": "chunk_type",
|
||||
}
|
||||
for legacy_name, new_name in replacements.items():
|
||||
filters = filters.replace(legacy_name, new_name)
|
||||
|
||||
# Check if already a Milvus expression (contains operators)
|
||||
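# Match symbolic operators as substrings, but require word operators to be whole tokens so a plain title such as "Lighting" is not mistaken for an expression.
if any(op in filters for op in ["==", "!=", ">", "<"]) or any(word in filters.split() for word in ["in", "and", "or"]):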
|
||||
return filters
|
||||
|
||||
# Parse simple regulation_type filter
|
||||
# Support: "GB" or "GB,UN-ECE" or "GB, UN-ECE"
|
||||
types = [t.strip() for t in filters.split(",") if t.strip()]
|
||||
# Parse simple document-title filter.
|
||||
titles = [title.strip() for title in filters.split(",") if title.strip()]
|
||||
|
||||
if not types:
|
||||
if not titles:
|
||||
return None
|
||||
|
||||
if len(types) == 1:
|
||||
# Single value: regulation_type == "GB"
|
||||
return f'regulation_type == "{types[0]}"'
|
||||
else:
|
||||
# Multiple values: regulation_type in ["GB", "UN-ECE"]
|
||||
quoted_types = [f'"{t}"' for t in types]
|
||||
return f'regulation_type in [{", ".join(quoted_types)}]'
|
||||
if len(titles) == 1:
|
||||
return f'doc_title == "{titles[0]}"'
|
||||
|
||||
quoted_titles = [f'"{title}"' for title in titles]
|
||||
return f'doc_title in [{", ".join(quoted_titles)}]'
|
||||
|
||||
def _should_refresh_after_exception(self, exc: Exception) -> bool:
|
||||
"""Return whether the Milvus error suggests stale connection or collection state."""
|
||||
if not isinstance(exc, MilvusException):
|
||||
return False
|
||||
normalized = str(exc).lower()
|
||||
return any(token in normalized for token in _SCHEMA_RECOVERY_TOKENS)
|
||||
|
||||
def _run_with_refresh(self, operation):
|
||||
"""Run a Milvus operation and retry once after a forced reconnect when appropriate."""
|
||||
try:
|
||||
return operation()
|
||||
except VectorStoreSchemaError:
|
||||
raise
|
||||
except Exception as exc:
|
||||
if not self._should_refresh_after_exception(exc):
|
||||
raise
|
||||
logger.warning(
|
||||
"Milvus operation failed for alias={} collection={}; forcing reconnect and retry: {}",
|
||||
self.alias,
|
||||
self.collection_name,
|
||||
exc,
|
||||
)
|
||||
self.collection = self._bind_collection(force_refresh=True)
|
||||
try:
|
||||
return operation()
|
||||
except VectorStoreSchemaError:
|
||||
raise
|
||||
except Exception as retry_exc:
|
||||
if isinstance(retry_exc, MilvusException):
|
||||
self._raise_schema_error(
|
||||
message=f"Milvus operation failed after refresh: {retry_exc}",
|
||||
actual_fields=self._schema_field_names(self.collection),
|
||||
)
|
||||
raise
|
||||
|
||||
def search(self, query_vector: list[float], top_k: int, filters: str | None = None) -> list[RetrievedChunk]:
|
||||
"""Handle search for the Milvus Vector Index instance."""
|
||||
milvus_expr = self._parse_filters(filters)
|
||||
|
||||
results = self.collection.search(
|
||||
results = self._run_with_refresh(
|
||||
lambda: self.collection.search(
|
||||
data=[query_vector],
|
||||
anns_field="embedding",
|
||||
param={"metric_type": "COSINE", "params": {"nprobe": settings.milvus_nprobe}},
|
||||
@@ -138,17 +287,24 @@ class MilvusVectorIndex(VectorIndex):
|
||||
expr=milvus_expr,
|
||||
output_fields=[
|
||||
"doc_id",
|
||||
"doc_name",
|
||||
"content",
|
||||
"doc_title",
|
||||
"chunk_id",
|
||||
"chunk_index",
|
||||
"piece_index",
|
||||
"text",
|
||||
"embedding_text",
|
||||
"section_title",
|
||||
"page_number",
|
||||
"regulation_type",
|
||||
"version",
|
||||
"semantic_id",
|
||||
"block_type",
|
||||
"chunk_type",
|
||||
"page_start",
|
||||
"page_end",
|
||||
"section_level",
|
||||
"source_ids",
|
||||
"section_path",
|
||||
"metadata_json",
|
||||
],
|
||||
)
|
||||
)
|
||||
payload: list[RetrievedChunk] = []
|
||||
for hits in results:
|
||||
for hit in hits:
|
||||
@@ -161,13 +317,18 @@ class MilvusVectorIndex(VectorIndex):
|
||||
metadata = {"raw_metadata": raw_metadata}
|
||||
payload.append(
|
||||
RetrievedChunk(
|
||||
chunk_id=str(hit.id),
|
||||
chunk_id=str(hit.entity.get("chunk_id", hit.id)),
|
||||
doc_id=hit.entity.get("doc_id", ""),
|
||||
doc_name=hit.entity.get("doc_name", ""),
|
||||
content=hit.entity.get("content", ""),
|
||||
doc_title=hit.entity.get("doc_title", ""),
|
||||
text=hit.entity.get("text", ""),
|
||||
score=float(hit.score),
|
||||
chunk_type=hit.entity.get("chunk_type", ""),
|
||||
section_title=hit.entity.get("section_title", ""),
|
||||
page_number=int(hit.entity.get("page_number", 0) or 0),
|
||||
page_start=int(hit.entity.get("page_start", 0) or 0),
|
||||
page_end=int(hit.entity.get("page_end", 0) or 0),
|
||||
section_level=int(hit.entity.get("section_level", 0) or 0),
|
||||
chunk_index=int(hit.entity.get("chunk_index", 0) or 0),
|
||||
piece_index=int(hit.entity.get("piece_index", 0) or 0),
|
||||
metadata=metadata,
|
||||
)
|
||||
)
|
||||
@@ -176,7 +337,9 @@ class MilvusVectorIndex(VectorIndex):
|
||||
def count_by_document(self) -> dict[str, int]:
|
||||
"""Return doc_id -> chunk count from Milvus."""
|
||||
try:
|
||||
rows = self.collection.query(expr="doc_id != \"\"", output_fields=["doc_id"])
|
||||
rows = self._run_with_refresh(
|
||||
lambda: self.collection.query(expr="doc_id != \"\"", output_fields=["doc_id", "doc_title"])
|
||||
)
|
||||
except Exception:
|
||||
return {}
|
||||
counts: dict[str, int] = {}
|
||||
@@ -189,9 +352,11 @@ class MilvusVectorIndex(VectorIndex):
|
||||
def list_document_metadata(self) -> list[dict]:
|
||||
"""Return one metadata row per document from Milvus (single query, no embeddings)."""
|
||||
try:
|
||||
rows = self.collection.query(
|
||||
rows = self._run_with_refresh(
|
||||
lambda: self.collection.query(
|
||||
expr="doc_id != \"\"",
|
||||
output_fields=["doc_id", "doc_name", "regulation_type", "version"],
|
||||
output_fields=["doc_id", "doc_title", "metadata_json"],
|
||||
)
|
||||
)
|
||||
except Exception:
|
||||
return []
|
||||
@@ -204,15 +369,26 @@ class MilvusVectorIndex(VectorIndex):
|
||||
continue
|
||||
counts[doc_id] = counts.get(doc_id, 0) + 1
|
||||
if doc_id not in seen:
|
||||
metadata: dict[str, object] = {}
|
||||
raw_metadata = row.get("metadata_json", "")
|
||||
if raw_metadata:
|
||||
try:
|
||||
metadata = json.loads(raw_metadata)
|
||||
except json.JSONDecodeError:
|
||||
metadata = {}
|
||||
seen[doc_id] = {
|
||||
"doc_id": doc_id,
|
||||
"doc_name": row.get("doc_name", ""),
|
||||
"regulation_type": row.get("regulation_type", ""),
|
||||
"version": row.get("version", ""),
|
||||
"doc_title": row.get("doc_title", ""),
|
||||
"regulation_type": str(metadata.get("regulation_type", "")),
|
||||
"version": str(metadata.get("version", "")),
|
||||
}
|
||||
|
||||
return [
|
||||
{**meta, "chunk_count": counts[meta["doc_id"]]}
|
||||
{
|
||||
**meta,
|
||||
"doc_name": meta.get("doc_title", ""),
|
||||
"chunk_count": counts[meta["doc_id"]],
|
||||
}
|
||||
for meta in seen.values()
|
||||
]
|
||||
|
||||
|
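The `_run_with_refresh` helper in this file retries a Milvus call once after a forced reconnect, but only when the error text suggests stale connection or collection state. The sketch below shows that retry-once pattern in isolation. `StaleStateError`, `RECOVERY_TOKENS`, and the `refresh` callback are hypothetical stand-ins for `MilvusException`, `_SCHEMA_RECOVERY_TOKENS`, and `_bind_collection(force_refresh=True)`. Unlike the diff, the sketch lets a second failure propagate unchanged instead of wrapping it in a schema error.

```python
from __future__ import annotations

from typing import Callable, TypeVar

T = TypeVar("T")


class StaleStateError(Exception):
    """Hypothetical stand-in for a driver error such as MilvusException."""


# Lowercase substrings that mark an error as recoverable (assumed subset).
RECOVERY_TOKENS = ("collection not loaded", "can't find collection")


def should_refresh(exc: Exception) -> bool:
    """Return True only for driver errors whose message matches a recovery token."""
    if not isinstance(exc, StaleStateError):
        return False
    message = str(exc).lower()
    return any(token in message for token in RECOVERY_TOKENS)


def run_with_refresh(operation: Callable[[], T], refresh: Callable[[], None]) -> T:
    """Run operation; on a recoverable error, refresh once and retry once."""
    try:
        return operation()
    except Exception as exc:
        if not should_refresh(exc):
            raise
    # Reached only after a recoverable failure: rebuild state, then retry.
    refresh()
    return operation()
```

Passing the operation as a zero-argument callable matters here: the retry has to re-read whatever state `refresh` replaced, which is why the diff wraps `self.collection.search(...)` in a `lambda`.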
||||
@@ -67,14 +67,14 @@ class DocumentProcessor:
|
||||
return [
|
||||
{
|
||||
"id": item.chunk_id,
|
||||
"content": item.content,
|
||||
"content": item.text,
|
||||
"score": item.score,
|
||||
"metadata": {
|
||||
"doc_id": item.doc_id,
|
||||
"doc_name": item.doc_name,
|
||||
"doc_name": item.doc_title,
|
||||
"chunk_id": item.chunk_id,
|
||||
"section_title": item.section_title,
|
||||
"page_number": item.page_number,
|
||||
"page_number": item.page_start,
|
||||
**item.metadata,
|
||||
},
|
||||
}
|
||||
|
||||
@@ -3,29 +3,136 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from functools import lru_cache
|
||||
from typing import Callable
|
||||
|
||||
from app.application.agent import AgentConversationService
|
||||
from app.application.agent import AgentConversationService, AgentSessionService
|
||||
from app.application.documents import DocumentCommandService, DocumentQueryService
|
||||
from app.application.knowledge import KnowledgeRetrievalService
|
||||
from app.application.perception.services import PerceptionService
|
||||
from app.config.settings import settings
|
||||
from app.domain.documents import DocumentBinaryStore
|
||||
from app.domain.retrieval import VectorIndex
|
||||
from app.infrastructure.embedding.openai_compatible_embedding_provider import OpenAICompatibleEmbeddingProvider
|
||||
from app.infrastructure.llm.openai_compatible_answer_generator import OpenAICompatibleAnswerGenerator
|
||||
from app.infrastructure.parser.aliyun_document_parser import AliyunDocumentParser
|
||||
from app.infrastructure.parser.local_chunk_builder import LocalRegulationChunkBuilder
|
||||
from app.infrastructure.parser.local_document_parser import LocalDocumentParser
|
||||
from app.infrastructure.parser.vector_chunk_builder import AliyunVectorChunkBuilder
|
||||
from app.infrastructure.perception.mock_event_store import MockEventStore
|
||||
from app.infrastructure.session.in_memory_conversation_store import InMemoryConversationStore
|
||||
from app.infrastructure.storage.json_document_processing_store import JsonDocumentProcessingStore
|
||||
from app.infrastructure.storage.json_document_repository import JsonDocumentRepository
|
||||
from app.infrastructure.storage.minio_binary_store import MinioDocumentBinaryStore
|
||||
from app.infrastructure.storage.postgres_document_processing_store import PostgresDocumentProcessingStore
|
||||
from app.infrastructure.storage.postgres_document_repository import PostgresDocumentRepository
|
||||
from app.infrastructure.storage.postgres_parse_artifact_store import PostgresParseArtifactStore
|
||||
from app.infrastructure.vectorstore.bm25_retriever import BM25Retriever
|
||||
from app.infrastructure.vectorstore.cross_encoder_reranker import OpenAICompatibleReranker
|
||||
from app.infrastructure.vectorstore.dense_retriever import DenseRetriever
|
||||
from app.infrastructure.vectorstore.milvus_vector_index import MilvusVectorIndex
|
||||
from app.infrastructure.vectorstore.cross_encoder_reranker import OpenAICompatibleReranker
|
||||
from app.services.llm.llm_factory import LLMFactory
|
||||
# Keep shared wiring centralized so dependency construction remains consistent.
|
||||
|
||||
|
||||
class LazyBinaryStore(DocumentBinaryStore):
|
||||
"""Delay MinIO connection work until binary storage is actually needed."""
|
||||
|
||||
def __init__(self, factory: Callable[[], DocumentBinaryStore]) -> None:
|
||||
"""Initialize the lazy binary store wrapper."""
|
||||
self._factory = factory
|
||||
self._store: DocumentBinaryStore | None = None
|
||||
|
||||
def _get_store(self) -> DocumentBinaryStore:
|
||||
"""Create the underlying store on first use and reuse it afterwards."""
|
||||
if self._store is None:
|
||||
self._store = self._factory()
|
||||
return self._store
|
||||
|
||||
@property
|
||||
def client(self):
|
||||
"""Expose the underlying client for compatibility with health endpoints."""
|
||||
return self._get_store().client
|
||||
|
||||
def save(
|
||||
self,
|
||||
*,
|
||||
object_name: str,
|
||||
data: bytes,
|
||||
content_type: str,
|
||||
metadata: dict[str, str] | None = None,
|
||||
) -> None:
|
||||
"""Save data through the underlying binary store implementation."""
|
||||
self._get_store().save(
|
||||
object_name=object_name,
|
||||
data=data,
|
||||
content_type=content_type,
|
||||
metadata=metadata,
|
||||
)
|
||||
|
||||
def read(self, object_name: str) -> bytes:
|
||||
"""Read data through the underlying binary store implementation."""
|
||||
return self._get_store().read(object_name)
|
||||
|
||||
def delete(self, object_name: str) -> None:
|
||||
"""Delete data through the underlying binary store implementation."""
|
||||
self._get_store().delete(object_name)
|
||||
|
||||
|
||||
class LazyVectorIndex(VectorIndex):
|
||||
"""Delay Milvus connection work until vector operations are actually needed."""
|
||||
|
||||
def __init__(self, factory: Callable[[], VectorIndex]) -> None:
|
||||
"""Initialize the lazy vector index wrapper."""
|
||||
self._factory = factory
|
||||
self._index: VectorIndex | None = None
|
||||
|
||||
def _get_index(self) -> VectorIndex:
|
||||
"""Create the underlying index on first use and reuse it afterwards."""
|
||||
if self._index is None:
|
||||
self._index = self._factory()
|
||||
return self._index
|
||||
|
||||
@property
|
||||
def collection(self):
|
||||
"""Expose the underlying Milvus collection for compatibility adapters."""
|
||||
return self._get_index().collection
|
||||
|
||||
def upsert(self, chunks, vectors) -> int:
|
||||
"""Insert or update vectors through the underlying vector index implementation."""
|
||||
return self._get_index().upsert(chunks, vectors)
|
||||
|
||||
def delete_by_document(self, doc_id: str) -> int:
|
||||
"""Delete vectors through the underlying vector index implementation."""
|
||||
return self._get_index().delete_by_document(doc_id)
|
||||
|
||||
def search(self, query_vector: list[float], top_k: int, filters: str | None = None):
|
||||
"""Search vectors through the underlying vector index implementation."""
|
||||
return self._get_index().search(query_vector, top_k, filters)
|
||||
|
||||
def count_by_document(self) -> dict[str, int]:
|
||||
"""Count document vectors through the underlying vector index implementation."""
|
||||
return self._get_index().count_by_document()
|
||||
|
||||
def list_document_metadata(self) -> list[dict]:
|
||||
"""List document metadata through the underlying vector index implementation."""
|
||||
return self._get_index().list_document_metadata()
|
||||
|
||||
def health(self) -> dict:
|
||||
"""Return vector index health through the underlying vector index implementation."""
|
||||
return self._get_index().health()
|
||||
|
||||
|
||||
@lru_cache
|
||||
def _build_binary_store() -> MinioDocumentBinaryStore:
|
||||
"""Return the concrete binary store implementation."""
|
||||
return MinioDocumentBinaryStore()
|
||||
|
||||
|
||||
@lru_cache
|
||||
def _build_vector_index() -> MilvusVectorIndex:
|
||||
"""Return the concrete vector index implementation."""
|
||||
return MilvusVectorIndex()
|
||||
|
||||
|
||||
@lru_cache
|
||||
def get_document_repository():
|
||||
@@ -44,9 +151,17 @@ def get_parse_artifact_store():
|
||||
|
||||
|
||||
@lru_cache
|
||||
def get_binary_store() -> MinioDocumentBinaryStore:
|
||||
def get_document_processing_store():
|
||||
"""Return document processing store for the active repository backend."""
|
||||
if settings.document_repository_backend == "postgres":
|
||||
return PostgresDocumentProcessingStore()
|
||||
return JsonDocumentProcessingStore(settings.document_processing_metadata_path)
|
||||
|
||||
|
||||
@lru_cache
|
||||
def get_binary_store() -> DocumentBinaryStore:
|
||||
"""Return binary store."""
|
||||
return MinioDocumentBinaryStore()
|
||||
return LazyBinaryStore(_build_binary_store)
|
||||
|
||||
|
||||
@lru_cache
|
||||
@@ -75,9 +190,9 @@ def get_embedding_provider() -> OpenAICompatibleEmbeddingProvider:
|
||||
|
||||
|
||||
@lru_cache
|
||||
def get_vector_index() -> MilvusVectorIndex:
|
||||
def get_vector_index() -> VectorIndex:
|
||||
"""Return vector index."""
|
||||
return MilvusVectorIndex()
|
||||
return LazyVectorIndex(_build_vector_index)
|
||||
|
||||
|
||||
@lru_cache
|
||||
@@ -121,6 +236,7 @@ def get_document_command_service() -> DocumentCommandService:
|
||||
embedding_provider=get_embedding_provider(),
|
||||
vector_index=get_vector_index(),
|
||||
parse_artifact_store=get_parse_artifact_store(),
|
||||
document_processing_store=get_document_processing_store(),
|
||||
)
|
||||
|
||||
|
||||
@@ -151,3 +267,28 @@ def get_agent_conversation_service() -> AgentConversationService:
|
||||
answer_generator=OpenAICompatibleAnswerGenerator(),
|
||||
conversation_store=get_conversation_store(),
|
||||
)
|
||||
|
||||
|
||||
@lru_cache
|
||||
def get_perception_service() -> PerceptionService:
|
||||
"""Return perception service for regulatory intelligence."""
|
||||
return PerceptionService(
|
||||
event_store=MockEventStore(),
|
||||
retrieval_service=get_retrieval_service(),
|
||||
)
|
||||
|
||||
|
||||
@lru_cache
|
||||
def get_agent_session_service() -> AgentSessionService:
|
||||
"""Return agent session service."""
|
||||
return AgentSessionService(conversation_store=get_conversation_store())
|
||||
|
||||
|
||||
def preload_runtime_dependencies() -> None:
|
||||
"""Warm dependencies that are safe and useful to preload during startup."""
|
||||
LLMFactory.preload_clients(["qwen", "deepseek"])
|
||||
|
||||
|
||||
def cleanup_runtime_dependencies() -> None:
|
||||
"""Release runtime dependencies that expose explicit cleanup hooks."""
|
||||
LLMFactory.cleanup()
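
The `get_vector_index()` change above returns `LazyVectorIndex(_build_vector_index)` rather than a concrete `MilvusVectorIndex`. The diff does not show how `LazyVectorIndex` is implemented, so the following is only a minimal sketch of one way such a lazy proxy could defer construction until first use. The `search` method, the `FakeMilvusIndex` stand-in, and the proxy body are assumptions made for illustration.

```python
from __future__ import annotations

from functools import lru_cache
from typing import Callable, Protocol


class VectorIndex(Protocol):
    """Hypothetical port; the real interface is not shown in the diff."""

    def search(self, query: str) -> list[str]: ...


class FakeMilvusIndex:
    """Stand-in for the concrete index; counts how often it is constructed."""

    instances = 0

    def __init__(self) -> None:
        FakeMilvusIndex.instances += 1

    def search(self, query: str) -> list[str]:
        return [f"hit for {query}"]


class LazyVectorIndex:
    """Assumed proxy shape: keep the factory and call it only on use."""

    def __init__(self, factory: Callable[[], VectorIndex]) -> None:
        self._factory = factory

    def search(self, query: str) -> list[str]:
        # The factory is cached, so repeated calls reuse a single instance.
        return self._factory().search(query)


@lru_cache
def _build_vector_index() -> FakeMilvusIndex:
    """Build the concrete index once; later calls return the cached object."""
    return FakeMilvusIndex()


@lru_cache
def get_vector_index() -> VectorIndex:
    """Return a proxy without connecting to anything yet."""
    return LazyVectorIndex(_build_vector_index)
```

With this shape, calling `get_vector_index()` at import or startup time does not construct the underlying index; the first `search` call does.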
|
||||
|
||||
30
backend/app/shared/errors.py
Normal file
@@ -0,0 +1,30 @@
|
||||
"""Define shared backend exception types."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
|
||||
class VectorStoreSchemaError(RuntimeError):
|
||||
"""Signal that the active vector store schema does not match backend expectations."""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
*,
|
||||
message: str,
|
||||
host: str,
|
||||
db_name: str,
|
||||
collection_name: str,
|
||||
expected_fields: list[str],
|
||||
actual_fields: list[str],
|
||||
) -> None:
|
||||
"""Initialize the vector store schema error details."""
|
||||
self.host = host
|
||||
self.db_name = db_name
|
||||
self.collection_name = collection_name
|
||||
self.expected_fields = expected_fields
|
||||
self.actual_fields = actual_fields
|
||||
# Keep the message self-contained so runtime logs show the full mismatch context.
|
||||
details = (
|
||||
f"{message} | host={host} db={db_name} collection={collection_name} "
|
||||
f"expected_fields={expected_fields} actual_fields={actual_fields}"
|
||||
)
|
||||
super().__init__(details)
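
A short usage sketch for the exception above, assuming a caller that compares the fields it expects with the fields a collection actually has. The class body repeats the one in the diff so the block runs on its own; `ensure_schema`, `describe_failure`, and the host name are hypothetical and not part of the change.

```python
from __future__ import annotations


class VectorStoreSchemaError(RuntimeError):
    """Same shape as the exception added in backend/app/shared/errors.py."""

    def __init__(
        self,
        *,
        message: str,
        host: str,
        db_name: str,
        collection_name: str,
        expected_fields: list[str],
        actual_fields: list[str],
    ) -> None:
        self.host = host
        self.db_name = db_name
        self.collection_name = collection_name
        self.expected_fields = expected_fields
        self.actual_fields = actual_fields
        details = (
            f"{message} | host={host} db={db_name} collection={collection_name} "
            f"expected_fields={expected_fields} actual_fields={actual_fields}"
        )
        super().__init__(details)


def ensure_schema(actual_fields: list[str], expected_fields: list[str]) -> None:
    """Hypothetical check: raise when an expected field is missing."""
    missing = [name for name in expected_fields if name not in actual_fields]
    if missing:
        raise VectorStoreSchemaError(
            message=f"missing fields: {missing}",
            host="milvus.example.internal",  # placeholder host
            db_name="default",
            collection_name="regulations_dense_1024_v2",
            expected_fields=expected_fields,
            actual_fields=actual_fields,
        )


def describe_failure(actual_fields: list[str], expected_fields: list[str]) -> str:
    """Return the self-contained error text, or "ok" when the schema matches."""
    try:
        ensure_schema(actual_fields, expected_fields)
    except VectorStoreSchemaError as exc:
        return str(exc)
    return "ok"
```

Because the message already carries host, database, collection, and both field lists, a single log line is enough to see the mismatch.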
|
||||
@@ -1 +0,0 @@
|
||||
{}
|
||||
131
backend/data/document_processing.json
Normal file
@@ -0,0 +1,131 @@
|
||||
{
|
||||
"runs": {
|
||||
"8e722053-5009-40fe-a483-535b40ebbb16": {
|
||||
"run_id": "8e722053-5009-40fe-a483-535b40ebbb16",
|
||||
"doc_id": "7cbdfe3c",
|
||||
"trigger_type": "upload",
|
||||
"run_status": "succeeded",
|
||||
"parser_backend": "aliyun_docmind",
|
||||
"chunk_backend": "aliyun",
|
||||
"embedding_model": "text-embedding-v3",
|
||||
"index_name": "regulations_dense_1024_v2",
|
||||
"started_at": "2026-05-26T12:18:27.208692+00:00",
|
||||
"stored_at": "2026-05-26T12:18:27.712855+00:00",
|
||||
"parsed_at": "2026-05-26T12:18:42.989238+00:00",
|
||||
"indexed_at": "2026-05-26T12:18:51.172418+00:00",
|
||||
"finished_at": "2026-05-26T12:18:51.172418+00:00",
|
||||
"layout_count": 48,
|
||||
"structure_node_count": 6,
|
||||
"semantic_block_count": 33,
|
||||
"vector_chunk_count": 34,
|
||||
"chunk_count": 34,
|
||||
"failure_stage": "",
|
||||
"error_message": "",
|
||||
"metadata": {
|
||||
"generate_summary": true,
|
||||
"parse_task_id": "docmind-20260526-10b94713ccb348498b12180a5dcf32ff"
|
||||
}
|
||||
}
|
||||
},
|
||||
"status_events": {
|
||||
"d0532baf-0d65-4130-b282-ec51f04132fd": {
|
||||
"event_id": "d0532baf-0d65-4130-b282-ec51f04132fd",
|
||||
"doc_id": "7cbdfe3c",
|
||||
"run_id": "8e722053-5009-40fe-a483-535b40ebbb16",
|
||||
"from_status": "",
|
||||
"to_status": "pending",
|
||||
"stage": "document_created",
|
||||
"message": "Document record created",
|
||||
"metadata": {},
|
||||
"occurred_at": "2026-05-26T12:18:27.235921+00:00"
|
||||
},
|
||||
"a5e32db5-25c3-4c73-a987-7311f0e72a31": {
|
||||
"event_id": "a5e32db5-25c3-4c73-a987-7311f0e72a31",
|
||||
"doc_id": "7cbdfe3c",
|
||||
"run_id": "8e722053-5009-40fe-a483-535b40ebbb16",
|
||||
"from_status": "pending",
|
||||
"to_status": "stored",
|
||||
"stage": "store",
|
||||
"message": "Source file stored",
|
||||
"metadata": {},
|
||||
"occurred_at": "2026-05-26T12:18:27.741462+00:00"
|
||||
},
|
||||
"18e04ce7-9d7a-4008-8600-e2590100bd85": {
|
||||
"event_id": "18e04ce7-9d7a-4008-8600-e2590100bd85",
|
||||
"doc_id": "7cbdfe3c",
|
||||
"run_id": "8e722053-5009-40fe-a483-535b40ebbb16",
|
||||
"from_status": "stored",
|
||||
"to_status": "parsed",
|
||||
"stage": "parse",
|
||||
"message": "Document parsed",
|
||||
"metadata": {
|
||||
"artifact_count": 4
|
||||
},
|
||||
"occurred_at": "2026-05-26T12:18:43.218026+00:00"
|
||||
},
|
||||
"d3b06025-5c91-4a42-9e5f-dce1c5312b96": {
|
||||
"event_id": "d3b06025-5c91-4a42-9e5f-dce1c5312b96",
|
||||
"doc_id": "7cbdfe3c",
|
||||
"run_id": "8e722053-5009-40fe-a483-535b40ebbb16",
|
||||
"from_status": "parsed",
|
||||
"to_status": "indexed",
|
||||
"stage": "index",
|
||||
"message": "Document indexed",
|
||||
"metadata": {
|
||||
"chunk_count": 34,
|
||||
"index_name": "regulations_dense_1024_v2"
|
||||
},
|
||||
"occurred_at": "2026-05-26T12:18:51.195442+00:00"
|
||||
}
|
||||
},
|
||||
"artifacts": {
|
||||
"47fe2877-a8f5-4e1d-901b-80cd0194ba96": {
|
||||
"artifact_id": "47fe2877-a8f5-4e1d-901b-80cd0194ba96",
|
||||
"doc_id": "7cbdfe3c",
|
||||
"run_id": "8e722053-5009-40fe-a483-535b40ebbb16",
|
||||
"artifact_type": "layouts",
|
||||
"object_name": "artifacts/7cbdfe3c/layouts.json",
|
||||
"content_type": "application/json",
|
||||
"byte_size": 0,
|
||||
"checksum": "",
|
||||
"metadata": {},
|
||||
"created_at": "2026-05-26T12:18:43.188467+00:00"
|
||||
},
|
||||
"44aa075b-86b2-48a7-9d14-a2453bd53863": {
|
||||
"artifact_id": "44aa075b-86b2-48a7-9d14-a2453bd53863",
|
||||
"doc_id": "7cbdfe3c",
|
||||
"run_id": "8e722053-5009-40fe-a483-535b40ebbb16",
|
||||
"artifact_type": "structure_nodes",
|
||||
"object_name": "artifacts/7cbdfe3c/structure_nodes.json",
|
||||
"content_type": "application/json",
|
||||
"byte_size": 0,
|
||||
"checksum": "",
|
||||
"metadata": {},
|
||||
"created_at": "2026-05-26T12:18:43.188494+00:00"
|
||||
},
|
||||
"dedcc8fe-fa58-4de6-984d-f44332af5204": {
|
||||
"artifact_id": "dedcc8fe-fa58-4de6-984d-f44332af5204",
|
||||
"doc_id": "7cbdfe3c",
|
||||
"run_id": "8e722053-5009-40fe-a483-535b40ebbb16",
|
||||
"artifact_type": "semantic_blocks",
|
||||
"object_name": "artifacts/7cbdfe3c/semantic_blocks.json",
|
||||
"content_type": "application/json",
|
||||
"byte_size": 0,
|
||||
"checksum": "",
|
||||
"metadata": {},
|
||||
"created_at": "2026-05-26T12:18:43.188511+00:00"
|
||||
},
|
||||
"9b0d8bda-e69e-4a4e-ae06-a308afe43109": {
|
||||
"artifact_id": "9b0d8bda-e69e-4a4e-ae06-a308afe43109",
|
||||
"doc_id": "7cbdfe3c",
|
||||
"run_id": "8e722053-5009-40fe-a483-535b40ebbb16",
|
||||
"artifact_type": "vector_chunks",
|
||||
"object_name": "artifacts/7cbdfe3c/vector_chunks.json",
|
||||
"content_type": "application/json",
|
||||
"byte_size": 0,
|
||||
"checksum": "",
|
||||
"metadata": {},
|
||||
"created_at": "2026-05-26T12:18:43.188526+00:00"
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -1,385 +1,38 @@
|
||||
{
|
||||
"69280841": {
|
||||
"doc_id": "69280841",
|
||||
"doc_name": "TCT算法接口.pdf",
|
||||
"file_name": "TCT算法接口.pdf",
|
||||
"object_name": "69280841/TCT算法接口.pdf",
|
||||
"content_type": "application/pdf",
|
||||
"size_bytes": 165557,
|
||||
"status": "failed",
|
||||
"regulation_type": "",
|
||||
"version": "",
|
||||
"summary": "",
|
||||
"summary_latency_ms": 0,
|
||||
"chunk_count": 0,
|
||||
"parser_name": "local_markdown_parser",
|
||||
"index_name": "",
|
||||
"error_message": "embedding 维度不匹配,期望 1536",
|
||||
"created_at": "2026-05-18T07:12:16.668306+00:00",
|
||||
"updated_at": "2026-05-18T07:12:19.417142+00:00",
|
||||
"metadata": {
|
||||
"generate_summary": true,
|
||||
"structure_nodes": 0
|
||||
}
|
||||
},
|
||||
"44121fbb": {
|
||||
"doc_id": "44121fbb",
|
||||
"doc_name": "大众汽车手册.pdf",
|
||||
"file_name": "大众汽车手册.pdf",
|
||||
"object_name": "44121fbb/大众汽车手册.pdf",
|
||||
"content_type": "application/pdf",
|
||||
"size_bytes": 766565,
|
||||
"status": "failed",
|
||||
"regulation_type": "",
|
||||
"version": "",
|
||||
"summary": "",
|
||||
"summary_latency_ms": 0,
|
||||
"chunk_count": 0,
|
||||
"parser_name": "",
|
||||
"index_name": "",
|
||||
"error_message": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614a5cb9d0>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
|
||||
"created_at": "2026-05-18T09:53:47.996183+00:00",
|
||||
"updated_at": "2026-05-18T09:53:50.825868+00:00",
|
||||
"metadata": {
|
||||
"generate_summary": true,
|
||||
"failure_reason": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614a5cb9d0>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
|
||||
"processing_stage": "failed"
|
||||
}
|
||||
},
|
||||
"77debb4a": {
|
||||
"doc_id": "77debb4a",
|
||||
"doc_name": "大众汽车手册.pdf",
|
||||
"file_name": "大众汽车手册.pdf",
|
||||
"object_name": "77debb4a/大众汽车手册.pdf",
|
||||
"content_type": "application/pdf",
|
||||
"size_bytes": 766565,
|
||||
"status": "failed",
|
||||
"regulation_type": "",
|
||||
"version": "",
|
||||
"summary": "",
|
||||
"summary_latency_ms": 0,
|
||||
"chunk_count": 0,
|
||||
"parser_name": "",
|
||||
"index_name": "",
|
||||
"error_message": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614a6dd480>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
|
||||
"created_at": "2026-05-18T10:05:46.104259+00:00",
|
||||
"updated_at": "2026-05-18T10:05:48.704061+00:00",
|
||||
"metadata": {
|
||||
"generate_summary": true,
|
||||
"failure_reason": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614a6dd480>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
|
||||
"processing_stage": "failed"
|
||||
}
|
||||
},
|
||||
"d12bdcc8": {
|
||||
"doc_id": "d12bdcc8",
|
||||
"doc_name": "TCT算法接口.pdf",
|
||||
"file_name": "TCT算法接口.pdf",
|
||||
"object_name": "d12bdcc8/TCT算法接口.pdf",
|
||||
"content_type": "application/pdf",
|
||||
"size_bytes": 165557,
|
||||
"status": "failed",
|
||||
"regulation_type": "",
|
||||
"version": "",
|
||||
"summary": "",
|
||||
"summary_latency_ms": 0,
|
||||
"chunk_count": 0,
|
||||
"parser_name": "",
|
||||
"index_name": "",
|
||||
"error_message": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614a5bf570>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
|
||||
"created_at": "2026-05-18T10:07:22.199824+00:00",
|
||||
"updated_at": "2026-05-18T10:07:24.653751+00:00",
|
||||
"metadata": {
|
||||
"generate_summary": true,
|
||||
"failure_reason": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614a5bf570>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
|
||||
"processing_stage": "failed"
|
||||
}
|
||||
},
|
||||
"3c2e8c9c": {
|
||||
"doc_id": "3c2e8c9c",
|
||||
"doc_name": "20260415_Continental tire mobile app solution.pdf",
|
||||
"file_name": "20260415_Continental tire mobile app solution.pdf",
|
||||
"object_name": "3c2e8c9c/20260415_Continental tire mobile app solution.pdf",
|
||||
"content_type": "application/pdf",
|
||||
"size_bytes": 2178074,
|
||||
"status": "failed",
|
||||
"regulation_type": "",
|
||||
"version": "",
|
||||
"summary": "",
|
||||
"summary_latency_ms": 0,
|
||||
"chunk_count": 0,
|
||||
"parser_name": "",
|
||||
"index_name": "",
|
||||
"error_message": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614a5bc8d0>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
|
||||
"created_at": "2026-05-18T10:09:58.338274+00:00",
|
||||
"updated_at": "2026-05-18T10:10:01.295502+00:00",
|
||||
"metadata": {
|
||||
"generate_summary": true,
|
||||
"failure_reason": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614a5bc8d0>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
|
||||
"processing_stage": "failed"
|
||||
}
|
||||
},
|
||||
"d22d21a0": {
|
||||
"doc_id": "d22d21a0",
|
||||
"doc_name": "20260415_Continental tire mobile app solution.pdf",
|
||||
"file_name": "20260415_Continental tire mobile app solution.pdf",
|
||||
"object_name": "d22d21a0/20260415_Continental tire mobile app solution.pdf",
|
||||
"content_type": "application/pdf",
|
||||
"size_bytes": 2178074,
|
||||
"status": "failed",
|
||||
"regulation_type": "",
|
||||
"version": "",
|
||||
"summary": "",
|
||||
"summary_latency_ms": 0,
|
||||
"chunk_count": 0,
|
||||
"parser_name": "",
|
||||
"index_name": "",
|
||||
"error_message": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614b994160>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
|
||||
"created_at": "2026-05-18T10:12:20.078027+00:00",
|
||||
"updated_at": "2026-05-18T10:12:22.999843+00:00",
|
||||
"metadata": {
|
||||
"generate_summary": true,
|
||||
"failure_reason": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614b994160>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
|
||||
"processing_stage": "failed"
|
||||
}
|
||||
},
|
||||
"35f129d3": {
|
||||
"doc_id": "35f129d3",
|
||||
"doc_name": "大众汽车手册.pdf",
|
||||
"file_name": "大众汽车手册.pdf",
|
||||
"object_name": "35f129d3/大众汽车手册.pdf",
|
||||
"content_type": "application/pdf",
|
||||
"size_bytes": 766565,
|
||||
"status": "failed",
|
||||
"regulation_type": "",
|
||||
"version": "",
|
||||
"summary": "",
|
||||
"summary_latency_ms": 0,
|
||||
"chunk_count": 0,
|
||||
"parser_name": "",
|
||||
"index_name": "",
|
||||
"error_message": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614b995370>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
|
||||
"created_at": "2026-05-18T10:13:24.706512+00:00",
|
||||
"updated_at": "2026-05-18T10:13:27.180509+00:00",
|
||||
"metadata": {
|
||||
"generate_summary": true,
|
||||
"failure_reason": "unable to load credentials from any of the providers in the chain: ['EnvironmentVariableCredentialsProvider: Environment variable accessKeyId cannot be empty', 'CLIProfileCredentialsProvider: unable to open credentials file: C:\\\\Users\\\\A200477427\\\\.aliyun/config.json', 'ProfileCredentialsProvider: failed to get credential from credentials file: $C:\\\\Users\\\\A200477427\\\\.alibabacloud/credentials.ini', \"EcsRamRoleCredentialsProvider: HTTPConnectionPool(host='100.100.100.200', port=80): Max retries exceeded with url: /latest/meta-data/ram/security-credentials/ (Caused by ConnectTimeoutError(<HTTPConnection(host='100.100.100.200', port=80) at 0x2614b995370>, 'Connection to 100.100.100.200 timed out. (connect timeout=1.0)'))\"]",
|
||||
"processing_stage": "failed"
|
||||
}
|
||||
},
|
||||
"efc21515": {
|
||||
"doc_id": "efc21515",
|
||||
"doc_name": "大众汽车手册.pdf",
|
||||
"file_name": "大众汽车手册.pdf",
|
||||
"object_name": "efc21515/大众汽车手册.pdf",
|
||||
"content_type": "application/pdf",
|
||||
"size_bytes": 766565,
|
||||
"status": "failed",
|
||||
"regulation_type": "",
|
||||
"version": "",
|
||||
"summary": "",
|
||||
"summary_latency_ms": 0,
|
||||
"chunk_count": 0,
|
||||
"parser_name": "aliyun_docmind",
|
||||
"index_name": "",
|
||||
"error_message": "Client error '400 Bad Request' for url 'http://6.86.80.4:30080/v1/embeddings'\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400",
|
||||
"created_at": "2026-05-18T13:47:32.076786+00:00",
|
||||
"updated_at": "2026-05-18T13:47:57.998073+00:00",
|
||||
"metadata": {
|
||||
"generate_summary": true,
|
||||
"parser_backend": "aliyun_docmind",
|
||||
"parse_task_id": "docmind-20260518-a6e84447457f43cb85f95225cfc6495b",
|
||||
"layout_count": 87,
|
||||
"structure_node_count": 20,
|
||||
"semantic_block_count": 27,
|
||||
"vector_chunk_count": 27,
|
||||
"artifact_keys": {
|
||||
"layouts": "artifacts/efc21515/layouts.json",
|
||||
"structure_nodes": "artifacts/efc21515/structure_nodes.json",
|
||||
"semantic_blocks": "artifacts/efc21515/semantic_blocks.json",
|
||||
"vector_chunks": "artifacts/efc21515/vector_chunks.json"
|
||||
},
|
||||
"processing_stage": "failed",
|
||||
"failure_reason": "Client error '400 Bad Request' for url 'http://6.86.80.4:30080/v1/embeddings'\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400"
|
||||
}
|
||||
},
|
||||
"0d4b08bc": {
|
||||
"doc_id": "0d4b08bc",
|
||||
"doc_name": "大众汽车手册.pdf",
|
||||
"file_name": "大众汽车手册.pdf",
|
||||
"object_name": "0d4b08bc/大众汽车手册.pdf",
|
||||
"content_type": "application/pdf",
|
||||
"size_bytes": 766565,
|
||||
"status": "failed",
|
||||
"regulation_type": "",
|
||||
"version": "",
|
||||
"summary": "",
|
||||
"summary_latency_ms": 0,
|
||||
"chunk_count": 0,
|
||||
"parser_name": "aliyun_docmind",
|
||||
"index_name": "",
|
||||
"error_message": "Client error '404 Not Found' for url 'http://6.86.80.4:30080/v1/embeddings'\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404",
|
||||
"created_at": "2026-05-18T14:03:15.134344+00:00",
|
||||
"updated_at": "2026-05-18T14:03:34.843448+00:00",
|
||||
"metadata": {
|
||||
"generate_summary": true,
|
||||
"parser_backend": "aliyun_docmind",
|
||||
"parse_task_id": "docmind-20260518-78353d85daa24147b68d8fb71895179f",
|
||||
"layout_count": 87,
|
||||
"structure_node_count": 20,
|
||||
"semantic_block_count": 27,
|
||||
"vector_chunk_count": 27,
|
||||
"artifact_keys": {
|
||||
"layouts": "artifacts/0d4b08bc/layouts.json",
|
||||
"structure_nodes": "artifacts/0d4b08bc/structure_nodes.json",
|
||||
"semantic_blocks": "artifacts/0d4b08bc/semantic_blocks.json",
|
||||
"vector_chunks": "artifacts/0d4b08bc/vector_chunks.json"
|
||||
},
|
||||
"processing_stage": "failed",
|
||||
"failure_reason": "Client error '404 Not Found' for url 'http://6.86.80.4:30080/v1/embeddings'\nFor more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404"
|
||||
}
|
||||
},
|
||||
"4302f314": {
|
||||
"doc_id": "4302f314",
|
||||
"doc_name": "大众汽车手册.pdf",
|
||||
"file_name": "大众汽车手册.pdf",
|
||||
"object_name": "4302f314/大众汽车手册.pdf",
|
||||
"content_type": "application/pdf",
|
||||
"size_bytes": 766565,
|
||||
"status": "failed",
|
||||
"regulation_type": "",
|
||||
"version": "",
|
||||
"summary": "",
|
||||
"summary_latency_ms": 0,
|
||||
"chunk_count": 0,
|
||||
"parser_name": "aliyun_docmind",
|
||||
"index_name": "",
|
||||
"error_message": "embedding 维度不匹配,期望 1536",
|
||||
"created_at": "2026-05-18T14:11:29.943973+00:00",
|
||||
"updated_at": "2026-05-18T14:11:48.554500+00:00",
|
||||
"metadata": {
|
||||
"generate_summary": true,
|
||||
"parser_backend": "aliyun_docmind",
|
||||
"parse_task_id": "docmind-20260518-23935ee455ac4b26ac4201ac4781ee52",
|
||||
"layout_count": 87,
|
||||
"structure_node_count": 20,
|
||||
"semantic_block_count": 27,
|
||||
"vector_chunk_count": 27,
|
||||
"artifact_keys": {
|
||||
"layouts": "artifacts/4302f314/layouts.json",
|
||||
"structure_nodes": "artifacts/4302f314/structure_nodes.json",
|
||||
"semantic_blocks": "artifacts/4302f314/semantic_blocks.json",
|
||||
"vector_chunks": "artifacts/4302f314/vector_chunks.json"
|
||||
},
|
||||
"processing_stage": "failed",
|
||||
"failure_reason": "embedding 维度不匹配,期望 1536"
|
||||
}
|
||||
},
|
||||
"765ed1ee": {
|
||||
"doc_id": "765ed1ee",
|
||||
"doc_name": "大众汽车手册.pdf",
|
||||
"file_name": "大众汽车手册.pdf",
|
||||
"object_name": "765ed1ee/大众汽车手册.pdf",
|
||||
"content_type": "application/pdf",
|
||||
"size_bytes": 766565,
|
||||
"status": "failed",
|
||||
"regulation_type": "",
|
||||
"version": "",
|
||||
"summary": "",
|
||||
"summary_latency_ms": 0,
|
||||
"chunk_count": 0,
|
||||
"parser_name": "aliyun_docmind",
|
||||
"index_name": "",
|
||||
"error_message": "<MilvusException: (code=1100, message=the dim (1024) of field data(embedding) is not equal to schema dim (1536): invalid parameter[expected=1536][actual=1024])>",
|
||||
"created_at": "2026-05-18T14:18:28.875138+00:00",
|
||||
"updated_at": "2026-05-18T14:18:57.389110+00:00",
|
||||
"metadata": {
|
||||
"generate_summary": true,
|
||||
"parser_backend": "aliyun_docmind",
|
||||
"parse_task_id": "docmind-20260518-f116856bc29245baa2531b245078a701",
|
||||
"layout_count": 87,
|
||||
"structure_node_count": 20,
|
||||
"semantic_block_count": 27,
|
||||
"vector_chunk_count": 27,
|
||||
"artifact_keys": {
|
||||
"layouts": "artifacts/765ed1ee/layouts.json",
|
||||
"structure_nodes": "artifacts/765ed1ee/structure_nodes.json",
|
||||
"semantic_blocks": "artifacts/765ed1ee/semantic_blocks.json",
|
||||
"vector_chunks": "artifacts/765ed1ee/vector_chunks.json"
|
||||
},
|
||||
"processing_stage": "failed",
|
||||
"failure_reason": "<MilvusException: (code=1100, message=the dim (1024) of field data(embedding) is not equal to schema dim (1536): invalid parameter[expected=1536][actual=1024])>"
|
||||
}
|
||||
},
|
||||
"05cabe09": {
|
||||
"doc_id": "05cabe09",
|
||||
"doc_name": "大众汽车手册.pdf",
|
||||
"file_name": "大众汽车手册.pdf",
|
||||
"object_name": "05cabe09/大众汽车手册.pdf",
|
||||
"content_type": "application/pdf",
|
||||
"size_bytes": 766565,
|
||||
"status": "failed",
|
||||
"regulation_type": "",
|
||||
"version": "",
|
||||
"summary": "",
|
||||
"summary_latency_ms": 0,
|
||||
"chunk_count": 0,
|
||||
"parser_name": "aliyun_docmind",
|
||||
"index_name": "",
|
||||
"error_message": "embedding 维度不匹配,期望 1536",
|
||||
"created_at": "2026-05-18T14:24:32.156500+00:00",
|
||||
"updated_at": "2026-05-18T14:24:50.114138+00:00",
|
||||
"metadata": {
|
||||
"generate_summary": true,
|
||||
"parser_backend": "aliyun_docmind",
|
||||
"parse_task_id": "docmind-20260518-897d858983df48e28e9819e563d46208",
|
||||
"layout_count": 87,
|
||||
"structure_node_count": 20,
|
||||
"semantic_block_count": 27,
|
||||
"vector_chunk_count": 27,
|
||||
"artifact_keys": {
|
||||
"layouts": "artifacts/05cabe09/layouts.json",
|
||||
"structure_nodes": "artifacts/05cabe09/structure_nodes.json",
|
||||
"semantic_blocks": "artifacts/05cabe09/semantic_blocks.json",
|
||||
"vector_chunks": "artifacts/05cabe09/vector_chunks.json"
|
||||
},
|
||||
"processing_stage": "failed",
|
||||
"failure_reason": "embedding 维度不匹配,期望 1536"
|
||||
}
|
||||
},
|
||||
"9acb2ba0": {
|
||||
"doc_id": "9acb2ba0",
|
||||
"doc_name": "大众汽车手册.pdf",
|
||||
"file_name": "大众汽车手册.pdf",
|
||||
"object_name": "9acb2ba0/大众汽车手册.pdf",
|
||||
"content_type": "application/pdf",
|
||||
"size_bytes": 766565,
|
||||
"7cbdfe3c": {
|
||||
"doc_id": "7cbdfe3c",
|
||||
"doc_name": "使用RSA Token连接CheckPoint VPN及PIN码设置_220.181.114.93 or 10.25.134.3.docx",
|
||||
"file_name": "使用RSA Token连接CheckPoint VPN及PIN码设置_220.181.114.93 or 10.25.134.3.docx",
|
||||
"object_name": "7cbdfe3c/使用RSA Token连接CheckPoint VPN及PIN码设置_220.181.114.93 or 10.25.134.3.docx",
|
||||
"content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
|
||||
"size_bytes": 1199920,
|
||||
"status": "indexed",
|
||||
"regulation_type": "",
|
||||
"version": "",
|
||||
"summary": "",
|
||||
"summary_latency_ms": 0,
|
||||
"chunk_count": 27,
|
||||
"chunk_count": 34,
|
||||
"parser_name": "aliyun_docmind",
|
||||
"index_name": "regulations_dense_1024_v1",
|
||||
"index_name": "regulations_dense_1024_v2",
|
||||
"error_message": "",
|
||||
"created_at": "2026-05-18T14:29:01.368719+00:00",
|
||||
"updated_at": "2026-05-18T14:29:23.699068+00:00",
|
||||
"created_at": "2026-05-26T12:18:27.206125+00:00",
|
||||
"updated_at": "2026-05-26T12:18:51.171308+00:00",
|
||||
"metadata": {
|
||||
"generate_summary": true,
|
||||
"parser_backend": "aliyun_docmind",
|
||||
"parse_task_id": "docmind-20260518-e5fd4a5419e74d569c562e389e6ae72c",
|
||||
"layout_count": 87,
|
||||
"structure_node_count": 20,
|
||||
"semantic_block_count": 27,
|
||||
"vector_chunk_count": 27,
|
||||
"parse_task_id": "docmind-20260526-10b94713ccb348498b12180a5dcf32ff",
|
||||
"layout_count": 48,
|
||||
"structure_node_count": 6,
|
||||
"semantic_block_count": 33,
|
||||
"vector_chunk_count": 34,
|
||||
"artifact_keys": {
|
||||
"layouts": "artifacts/9acb2ba0/layouts.json",
|
||||
"structure_nodes": "artifacts/9acb2ba0/structure_nodes.json",
|
||||
"semantic_blocks": "artifacts/9acb2ba0/semantic_blocks.json",
|
||||
"vector_chunks": "artifacts/9acb2ba0/vector_chunks.json"
|
||||
"layouts": "artifacts/7cbdfe3c/layouts.json",
|
||||
"structure_nodes": "artifacts/7cbdfe3c/structure_nodes.json",
|
||||
"semantic_blocks": "artifacts/7cbdfe3c/semantic_blocks.json",
|
||||
"vector_chunks": "artifacts/7cbdfe3c/vector_chunks.json"
|
||||
},
|
||||
"processing_stage": "indexed",
|
||||
"index_collection": "regulations_dense_1024_v1"
|
||||
"index_collection": "regulations_dense_1024_v2"
|
||||
}
|
||||
}
|
||||
}
|
||||
@@ -7,7 +7,7 @@
|
||||
- The upload endpoint remains `/api/v1/documents/upload`
|
||||
- Default: `PARSER_BACKEND=aliyun`
|
||||
- Default: `CHUNK_BACKEND=aliyun`
|
||||
- The default Milvus collection is `regulations_dense_1536_v2`
|
||||
- The default Milvus collection is `regulations_dense_1024_v2`
|
||||
- Parse artifacts are written to MinIO under `artifacts/{doc_id}/`
|
||||
|
||||
The full main pipeline is as follows:
|
||||
@@ -19,7 +19,7 @@
|
||||
5. Convert to `structure_nodes / semantic_blocks / vector_chunks`
|
||||
6. Write the three-layer structure JSON back to MinIO
|
||||
7. Call the embedding API with `vector_chunks[*].embedding_text`
|
||||
8. Write to `regulations_dense_1536_v2`
|
||||
8. Write to `regulations_dense_1024_v2`
|
||||
9. Update the document status to `indexed`
|
||||
|
||||
The runtime conversion logic lives in `backend/app/infrastructure/parser/aliyun_layout_normalizer.py`.
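
Steps 7 to 9 of the pipeline above can be sketched as follows. This is only an illustration under stated assumptions: the expected dimension of 1024 is inferred from the collection name `regulations_dense_1024_v2`, and `Document`, `InMemoryIndex`, `index_chunks`, and the `embed` callable are hypothetical stand-ins rather than the project's real classes or the Milvus client API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable

EXPECTED_DIM = 1024  # assumption: inferred from the collection name
COLLECTION = "regulations_dense_1024_v2"


@dataclass
class Document:
    """Hypothetical document record with only the fields the sketch needs."""

    doc_id: str
    status: str = "parsed"
    chunk_count: int = 0


@dataclass
class InMemoryIndex:
    """Stand-in for the vector store; keeps vectors per collection name."""

    rows: dict[str, list[list[float]]] = field(default_factory=dict)

    def insert(self, collection: str, vectors: list[list[float]]) -> None:
        self.rows.setdefault(collection, []).extend(vectors)


def index_chunks(
    doc: Document,
    vector_chunks: list[dict],
    embed: Callable[[list[str]], list[list[float]]],
    index: InMemoryIndex,
) -> Document:
    """Embed each chunk's embedding_text, check dimensions, write, mark indexed."""
    texts = [chunk["embedding_text"] for chunk in vector_chunks]
    vectors = embed(texts)
    for vector in vectors:
        # Fail before writing, so a model/collection mismatch is caught early.
        if len(vector) != EXPECTED_DIM:
            raise ValueError(
                f"embedding dimension mismatch: expected {EXPECTED_DIM}, got {len(vector)}"
            )
    index.insert(COLLECTION, vectors)
    doc.status = "indexed"
    doc.chunk_count = len(vectors)
    return doc
```

Checking the dimension before the insert is a design choice in this sketch: it turns a store-side rejection into an explicit error raised by the caller.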
|
||||
|
||||
@@ -10,6 +10,31 @@
|
||||
- This document freezes the target module boundaries, dependency rules, and implementation organization.
|
||||
- Any subsequent code refactoring, capability replacement, or platform upgrade must satisfy both the RFC and this document.
|
||||
|
||||
## 1.1 Document Status And Authority
|
||||
|
||||
本文档不是仅供参考的“目标态草案”,而是当前 backend 持续开发的强制架构基线。
|
||||
|
||||
- 新增 backend 功能默认必须遵守本文档定义的模块边界与依赖方向。
|
||||
- 历史实现、迁移中代码和兼容 façade 的存在,不构成继续偏离本文档的理由。
|
||||
- 当现状与本文档冲突时,新增代码按本文档落位;旧代码按迁移计划逐步收口,但不允许继续扩大 legacy 边界。
|
||||
- 评审、重构验收和后续架构讨论,均以本文档作为 backend 内部结构的 authority。
|
||||
|
||||
## 1.2 Authoritative Scope
|
||||
|
||||
本文档约束的 backend 范围包括:
|
||||
|
||||
- `backend/app/api/*`
|
||||
- `backend/app/application/*`
|
||||
- `backend/app/domain/*`
|
||||
- `backend/app/infrastructure/*`
|
||||
- `backend/app/shared/*`
|
||||
|
||||
说明:
|
||||
|
||||
- `backend/app/services/*` 与 `backend/app/workflows/*` 当前属于迁移期 legacy 目录,不是新增业务逻辑的默认落点。
|
||||
- `backend/app/api/routes/docs.py` 与 `backend/app/api/routes/rag.py` 视为遗留或非主入口,除迁移、兼容或下线动作外,不应继续扩展。
|
||||
- `backend/app/api/routes/compliance.py` 当前仍对外暴露,但尚未完全满足本文档约束;在迁移到 application service 之前,应视为受控 legacy 入口,而不是新的架构样板。
|
||||
|
||||
## 2. Current-State Problems
|
||||
|
||||
Based on the current code, the backend already provides the following capabilities:
|
||||
@@ -22,6 +47,18 @@
|
||||
|
||||
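However, these capabilities are currently mostly "runnable" rather than "clearly structured, easy to replace, and easy to evolve". The core problems are as follows.

### 2.0 Current-State Verdict

Based on the current repository, the verdict is:

- Largely compliant: the `documents` upload/query main path is already routed through `DocumentCommandService` and `DocumentQueryService`.
- Largely compliant: `knowledge` retrieval is exposed uniformly through `KnowledgeRetrievalService`.
- Largely compliant: the `agent` Q&A main path is routed through `AgentConversationService`, and `shared/bootstrap.py` already acts as the composition root.
- Partially compliant: the Agent session detail, history, deletion, and feedback endpoints used to access `ConversationStore` directly and still need to be moved behind an application service.
- Not fully compliant: the `compliance` route still handles file writes, task status, and mock results directly, which violates `api -> application -> domain ports -> infrastructure`.
- Not fully compliant: some `infrastructure` adapters still depend on legacy implementations under `services/*`, which shows the migration is not yet complete.
- Not fully compliant: the lifecycle warm-up logic in `api/main.py` still depends directly on the old LLM factory and has not fully returned to the unified wiring boundary.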
|
||||
|
||||
### 2.1 `DocumentProcessor` Is Overloaded with Responsibilities

Current assessment:
|
||||
@@ -603,6 +640,7 @@ infrastructure -> external systems
|
||||
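- `application` may depend only on `domain`, port interfaces, and implementation instances injected through the composition root
- `domain` must not depend on `api` or `infrastructure`
- `infrastructure` may depend on the ports and data models defined by `domain`, but must not drive application logic in reverse
- Application entrypoints such as `api/main.py` may keep lightweight startup/shutdown lifecycle code, but should not depend directly on legacy service factories in the long term; warm-up and wiring logic should gradually converge into an explicit wiring boundary

Notes: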
|
||||
|
||||
@@ -739,6 +777,54 @@ infrastructure -> external systems
|
||||
- Internal DTO / VO / domain objects converge into `application` or `domain`
- API models must not leak directly into the domain

### 10.10 Application Entrypoint and Startup Lifecycle

Current:

- `backend/app/api/main.py`

Target:

- Keep the FastAPI app, middleware, and lifespan entrypoint responsibilities
- Gradually remove the direct dependency on the legacy LLM factory
- Warm-up, cleanup, and dependency wiring should stay inside an explicit wiring / bootstrap boundary instead of hard-wiring the old service factory as an entrypoint dependency

### 10.11 Compliance Route

Current:

- `backend/app/api/routes/compliance.py`

Target:

- If this capability is kept, migrate it to a dedicated application service and stable ports
- Until the migration is complete, treat this route as a controlled legacy entrypoint: bug fixes are allowed, but its business orchestration responsibilities should not be extended

### 10.12 Legacy Route Entrypoints

Current:

- `backend/app/api/routes/docs.py`
- `backend/app/api/routes/rag.py`

Target:

- Gradually archive, retire, or migrate them as legacy or demo entrypoints
- No longer use them as the development entrypoint for new backend capabilities

### 10.13 Legacy Workflow and Service Directories

Current:

- `backend/app/workflows/*`
- `backend/app/services/*`

Target:

- Keep them for migration-period compatibility, but do not host new long-lived business orchestration in them
- If a legacy implementation is still reused indirectly by an `infrastructure` adapter, treat it as a transitional dependency and gradually move it into `infrastructure` or a more stable lower-level support module
- No new backend business capability should use these directories as its default home
|
||||
|
||||
## 11. Technology Replacement Boundaries
|
||||
|
||||
### 11.1 Local Parsing / MinerU -> Alibaba Cloud Document Parsing
|
||||
@@ -790,6 +876,10 @@ infrastructure -> external systems
|
||||
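- Do not create a second "do-everything pipeline class" to replace `DocumentProcessor`
- Do not let `knowledge` and `agent` each maintain their own retrieval implementation
- Do not let replacements of the parser, embedding, vector index, or LLM provider leak into the API layer
- Do not add routes that access `ConversationStore` directly
- Do not add code that uses `backend/app/services/*` or `backend/app/workflows/*` as the default home for business logic
- Do not add new transitional `infrastructure -> services/*` dependencies; existing ones may only be removed gradually within the migration window and must not spread
- Do not describe legacy directories as the backend's current main structure in the README, development notes, or review conclusions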
|
||||
|
||||
## 13. Architecture Review Checklist
|
||||
|
||||
@@ -807,3 +897,7 @@ infrastructure -> external systems
|
||||
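10. Is it explicit that `knowledge` and `agent` share the same retrieval foundation?
11. Is it explicit that the API layer handles transport concerns only and no longer performs business orchestration directly?
12. When an implementation is replaced later, are the upper-layer application services and external API contracts protected from forced changes?
13. Does any route still access `ConversationStore`, the file system, object storage, or task status storage directly?
14. Has any new `infrastructure -> services/*` dependency been added?
15. Has any new backend business logic been written into `services/*` or `workflows/*`?
16. Are the README, the backend README, and the collaboration notes still consistent with the current authoritative architecture?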
|
||||
|
||||
623
docs/architecture/document-core-processing-flow.md
Normal file
@@ -0,0 +1,623 @@
|
||||
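# Core Document Processing Main Pipeline

This document describes the core document processing flow in the current default production pipeline, namely:

- `AliyunDocumentParser`
- `AliyunVectorChunkBuilder`
- `OpenAICompatibleEmbeddingProvider`
- `MilvusVectorIndex`

It aims to answer four core questions:

1. Why `ParsedDocument` is a multi-layer structure
2. Where each of these structures is stored
3. Which step actually performs vectorization
4. What exactly ends up stored in Milvus

Database table design, the relational model, DDL, and PostgreSQL's responsibility boundaries are documented separately in [document-processing-database-design.md](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/docs/architecture/document-processing-database-design.md:1). This document keeps the flow perspective, gives storage-related summaries only where necessary, and is no longer the authority on database design.

## 1. Main Pipeline Overview

The current default implementation is orchestrated by `DocumentCommandService.upload_and_process()`. It does not "parse and push straight into the vector store"; it first produces structured parse artifacts, then sends the retrieval-oriented layer to embedding and Milvus.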
|
||||
|
||||
```mermaid
sequenceDiagram
    participant API as API / Service
    participant MinIO as BinaryStore
    participant Parser as AliyunDocumentParser
    participant PG as DocumentRepository / ParseArtifactStore
    participant Embed as EmbeddingProvider
    participant Milvus as VectorIndex

    API->>MinIO: Save original file
    API->>Parser: parse(file_path, doc_id, doc_name)
    Parser-->>API: ParsedDocument
    API->>MinIO: Save layouts/structure_nodes/semantic_blocks/vector_chunks JSON
    API->>PG: Update documents.status=parsed
    API->>PG: Save structure_nodes / semantic_blocks
    API->>API: chunk_builder.build(parsed_document)
    API->>Embed: embed_texts([chunk.embedding_text])
    Embed-->>API: vectors
    API->>Milvus: upsert(chunks, vectors)
    API->>PG: Update documents.status=indexed
```
|
||||
|
||||
The main pipeline orchestration code is in [services.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/application/documents/services.py:83):

```python
# Excerpt: the construction of `document`, `temp_path`, and `health` is not shown here.
def upload_and_process(
    self,
    *,
    doc_id: str | None = None,
    file_name: str,
    content: bytes,
    content_type: str,
    doc_name: str | None,
    regulation_type: str,
    version: str,
    generate_summary: bool,
) -> DocumentProcessResult:
    doc_id = doc_id or str(uuid.uuid4())[:8]
    final_doc_name = doc_name or file_name
    object_name = f"{doc_id}/{file_name}"

    # 1. Create the document record and store the original file.
    self.document_repository.create(document)
    self.binary_store.save(object_name=object_name, data=content, content_type=content_type, metadata={"doc_id": doc_id})
    self.document_repository.update_status(doc_id, DocumentStatus.STORED)

    # 2. Parse the file and persist the parse artifacts as JSON.
    parsed_document = self.parser.parse(file_path=temp_path, doc_id=doc_id, doc_name=final_doc_name)
    artifact_keys = self._save_parse_artifacts(doc_id=doc_id, parsed_document=parsed_document)
    self.document_repository.update_status(doc_id, DocumentStatus.PARSED, parser_name=parsed_document.parser_name, metadata={...})

    # 3. Optionally save the structured snapshot.
    if self.parse_artifact_store:
        self.parse_artifact_store.save(doc_id, parsed_document.structure_nodes, parsed_document.semantic_blocks)

    # 4. Build chunks, embed them, and write them to the vector index.
    chunks = self.chunk_builder.build(parsed_document=parsed_document, regulation_type=regulation_type, version=version)
    vectors = self.embedding_provider.embed_texts([chunk.embedding_text for chunk in chunks])
    inserted = self.vector_index.upsert(chunks, vectors)

    self.document_repository.update_status(doc_id, DocumentStatus.INDEXED, chunk_count=len(chunks), index_name=health.get("collection_name", ""), metadata={...})
```
|
||||
|
||||
The default bindings are in [bootstrap.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/shared/bootstrap.py:157):

```python
def get_parser():
    if settings.parser_backend == "aliyun":
        return AliyunDocumentParser()
    return LocalDocumentParser()


def get_chunk_builder():
    if settings.chunk_backend == "aliyun":
        return AliyunVectorChunkBuilder()
    return LocalRegulationChunkBuilder(...)


def get_embedding_provider() -> OpenAICompatibleEmbeddingProvider:
    return OpenAICompatibleEmbeddingProvider()


def get_vector_index() -> VectorIndex:
    return LazyVectorIndex(_build_vector_index)
```

In other words, the current default main pipeline is:

- parser: `AliyunDocumentParser`
- chunk builder: `AliyunVectorChunkBuilder`
- embedding provider: `OpenAICompatibleEmbeddingProvider`
- vector index: `MilvusVectorIndex`
|
||||
|
||||
## 2. Why `ParsedDocument` Has Three Layers

`ParsedDocument` is not the final storage format; it is the unified intermediate structure that the parser hands to later processing steps. It splits "structural understanding" and "vector retrieval preparation" into three layers.

It is defined in [models.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/domain/documents/models.py:49):

```python
@dataclass
class ParsedDocument:
    doc_id: str
    doc_name: str
    structure_nodes: list[dict[str, Any]]
    semantic_blocks: list[dict[str, Any]]
    vector_chunks: list[dict[str, Any]]
    parser_name: str
    raw_text: str = ""
    raw_layouts: list[dict[str, Any]] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)
```

The three layers have different responsibilities:

- `structure_nodes`
  - The heading hierarchy skeleton
  - Describes which chapters, sections, and clauses the document has
  - Preserves structure; it is not embedded directly

- `semantic_blocks`
  - The semantic block layer
  - Organizes body text, tables, and figure captions into contiguous semantic units
  - The intermediate layer between raw layouts and retrieval chunks

- `vector_chunks`
  - The retrieval and vectorization layer
  - Already a chunk view suitable for sending to the embedding model
  - The later `ChunkBuilder` essentially maps this layer into the unified `Chunk`
|
||||
|
||||
### 2.1 How the Three Layers Are Produced from the Parser Result

`AliyunDocumentParser.parse()` first obtains the `layouts` returned by Alibaba Cloud through the gateway, then converts the `layouts` into the three-layer structure.

The code is in [aliyun_document_parser.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/parser/aliyun_document_parser.py:28):

```python
def parse(self, *, file_path: str, doc_id: str, doc_name: str) -> ParsedDocument:
    payload = self.gateway.parse_document(file_path=file_path)
    layouts = payload.layouts
    structure_nodes = build_structure_nodes(layouts)
    semantic_blocks = build_semantic_blocks(layouts)
    vector_chunks = build_vector_chunks(
        semantic_blocks,
        doc_id=doc_id,
        doc_title=doc_name,
        max_chars=MAX_CHARS,
        overlap_chars=OVERLAP_CHARS,
    )
    raw_text = "\n\n".join(
        block.get("text", "")
        for block in semantic_blocks
        if block.get("text")
    )
    return ParsedDocument(
        doc_id=doc_id,
        doc_name=doc_name,
        structure_nodes=structure_nodes,
        semantic_blocks=semantic_blocks,
        vector_chunks=vector_chunks,
        parser_name=self.parser_name,
        raw_text=raw_text,
        raw_layouts=layouts,
        metadata={...},
    )
```

In other words:

- The parser's raw output is `layouts`
- What the system actually consumes is `ParsedDocument`
- `ParsedDocument` is normalized from `layouts` by the normalizer
|
||||
|
||||
### 2.2 Layer 1: `structure_nodes`

This layer keeps only headings and their levels.

The code is in [aliyun_layout_normalizer.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/parser/aliyun_layout_normalizer.py:85):

```python
def build_structure_nodes(layouts: list[dict[str, Any]]) -> list[dict[str, Any]]:
    nodes: list[dict[str, Any]] = []
    for layout in layouts:
        if not is_title(layout):
            continue
        text = get_text(layout)
        # Skip empty titles and table-of-contents headings.
        if not text or text in TOC_TITLES:
            continue
        nodes.append(
            {
                "unique_id": layout.get("uniqueId"),
                "page": get_page(layout),
                "index": layout.get("index", 0),
                "level": layout.get("level", 0),
                "title": text,
                "type": layout.get("type"),
                "sub_type": layout.get("subType"),
            }
        )
    return nodes
```

Example:
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"unique_id": "l-title-001",
|
||||
"page": 2,
|
||||
"index": 11,
|
||||
"level": 1,
|
||||
"title": "1 范围",
|
||||
"type": "title",
|
||||
"sub_type": "para_title"
|
||||
},
|
||||
{
|
||||
"unique_id": "l-title-002",
|
||||
"page": 3,
|
||||
"index": 18,
|
||||
"level": 2,
|
||||
"title": "1.1 适用对象",
|
||||
"type": "title",
|
||||
"sub_type": "para_title"
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
The purpose of this layer is to preserve the outline tree; it is not used directly for retrieval.
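To make the outline-tree idea concrete, the flat `structure_nodes` list can be folded into a nested tree using each node's `level`. The sketch below is illustrative only: `build_outline_tree` and the `children` key are hypothetical names that do not appear in the repository, and it assumes the nodes arrive in reading order.

```python
from typing import Any


def build_outline_tree(nodes: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Fold flat title nodes into a nested outline (hypothetical helper)."""
    roots: list[dict[str, Any]] = []
    stack: list[dict[str, Any]] = []  # currently open ancestors, shallowest first
    for node in nodes:
        entry = {"title": node["title"], "level": node.get("level", 0), "children": []}
        # Close every open section that is not strictly shallower than this one.
        while stack and stack[-1]["level"] >= entry["level"]:
            stack.pop()
        (stack[-1]["children"] if stack else roots).append(entry)
        stack.append(entry)
    return roots


outline = build_outline_tree(
    [
        {"title": "1 范围", "level": 1},
        {"title": "1.1 适用对象", "level": 2},
        {"title": "2 术语", "level": 1},
    ]
)
```

A stack of open ancestors keeps the pass linear in the number of headings, which should be enough for outline display; it is not meant as a replacement for the stored flat list.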
|
||||
|
||||
### 2.3 Layer 2: `semantic_blocks`

This layer merges consecutive body text into a single semantic block and handles tables and figure captions separately.

The code is in [aliyun_layout_normalizer.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/parser/aliyun_layout_normalizer.py:163):
|
||||
|
||||
```python
|
||||
def build_semantic_blocks(layouts: list[dict[str, Any]]) -> list[dict[str, Any]]:
|
||||
semantic_blocks: list[dict[str, Any]] = []
|
||||
section_stack: list[dict[str, Any]] = []
|
||||
pending_text_blocks: list[dict[str, Any]] = []
|
||||
block_id = 1
|
||||
|
||||
for layout in layouts:
|
||||
text = get_text(layout)
|
||||
page = get_page(layout)
|
||||
|
||||
if is_title(layout):
|
||||
block_id = flush_text_block(pending_text_blocks, semantic_blocks, block_id)
|
||||
pending_text_blocks = []
|
||||
section_stack = update_section_path(section_stack, layout)
|
||||
continue
|
||||
|
||||
section_path = section_path_titles(section_stack)
|
||||
section_title = section_path[-1] if section_path else "未分类"
|
||||
section_level = len(section_path)
|
||||
|
||||
if is_table(layout):
|
||||
...
|
||||
semantic_blocks.append(
|
||||
{
|
||||
"semantic_id": f"semantic-{block_id}",
|
||||
"block_type": "table",
|
||||
"page_start": page,
|
||||
"page_end": page,
|
||||
"section_path": section_path,
|
||||
"section_level": section_level,
|
||||
"section_title": section_title,
|
||||
"source_ids": [layout.get("uniqueId")],
|
||||
"text": table_text,
|
||||
}
|
||||
)
|
||||
continue
|
||||
|
||||
if is_text(layout) and text:
|
||||
pending_text_blocks.append(
|
||||
{
|
||||
"page": page,
|
||||
"text": text,
|
||||
"unique_id": layout.get("uniqueId"),
|
||||
"section_path": section_path,
|
||||
"section_level": section_level,
|
||||
"section_title": section_title,
|
||||
}
|
||||
)
|
||||
```
|
||||
|
||||
After merging, body text forms a semantic block like this:
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"semantic_id": "semantic-1",
|
||||
"block_type": "section_text",
|
||||
"page_start": 2,
|
||||
"page_end": 2,
|
||||
"section_path": ["1 范围", "1.1 适用对象"],
|
||||
"section_level": 2,
|
||||
"section_title": "1.1 适用对象",
|
||||
"source_ids": ["l-text-001", "l-text-002"],
|
||||
"text": "本标准适用于道路车辆动力电池系统的安全要求。企业应建立一致的测试和验证方法。"
|
||||
}
|
||||
]
|
||||
```
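The `build_semantic_blocks` excerpt calls `flush_text_block` without showing it. The following is a minimal sketch of what such a helper might do, inferred only from its call site and the example block above. The real implementation in `aliyun_layout_normalizer.py` may differ, and joining paragraph texts without a separator is an assumption.

```python
from typing import Any


def flush_text_block(
    pending: list[dict[str, Any]],
    semantic_blocks: list[dict[str, Any]],
    block_id: int,
) -> int:
    """Merge buffered text layouts into one block and return the next block id."""
    if not pending:
        return block_id  # nothing buffered, so the id is unchanged
    first = pending[0]
    semantic_blocks.append(
        {
            "semantic_id": f"semantic-{block_id}",
            "block_type": "section_text",
            "page_start": min(item["page"] for item in pending),
            "page_end": max(item["page"] for item in pending),
            "section_path": first["section_path"],
            "section_level": first["section_level"],
            "section_title": first["section_title"],
            "source_ids": [item["unique_id"] for item in pending],
            # Assumption: consecutive paragraphs are concatenated directly.
            "text": "".join(item["text"] for item in pending),
        }
    )
    return block_id + 1


blocks: list[dict[str, Any]] = []
section = {"section_path": ["1"], "section_level": 1, "section_title": "1"}
next_id = flush_text_block(
    [
        {"page": 2, "text": "A.", "unique_id": "l-text-001", **section},
        {"page": 3, "text": "B.", "unique_id": "l-text-002", **section},
    ],
    blocks,
    1,
)
```

Returning the next id rather than mutating a counter matches how the excerpt reassigns `block_id` after each flush.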
|
||||
|
||||
## 3. Where Each Structure Is Stored

### 3.1 The Original File and Intermediate Artifacts Go to MinIO First

During upload, the pipeline first saves the original file to object storage; once parsing finishes, it also saves the structured intermediate artifacts as JSON.

The artifact-saving code is in [services.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/application/documents/services.py:62):

```python
def _save_parse_artifacts(self, *, doc_id: str, parsed_document: ParsedDocument) -> dict[str, str]:
    prefix = f"{parsed_document.metadata.get('artifact_prefix', 'artifacts').strip('/')}/{doc_id}"
    artifact_payloads = {
        "layouts": parsed_document.raw_layouts,
        "structure_nodes": parsed_document.structure_nodes,
        "semantic_blocks": parsed_document.semantic_blocks,
        "vector_chunks": parsed_document.vector_chunks,
    }
    artifact_keys: dict[str, str] = {}
    # One JSON object per artifact, stored under "<prefix>/<name>.json".
    for name, payload in artifact_payloads.items():
        object_name = f"{prefix}/{name}.json"
        self.binary_store.save(
            object_name=object_name,
            data=json.dumps(payload, ensure_ascii=False, indent=2).encode("utf-8"),
            content_type="application/json",
            metadata={"doc_id": doc_id, "artifact_type": name},
        )
        artifact_keys[name] = object_name
    return artifact_keys
```

The current default implementation of `DocumentBinaryStore` is `MinioDocumentBinaryStore`, which means:

- The original uploaded file goes to MinIO
- `layouts.json` goes to MinIO
- `structure_nodes.json` goes to MinIO
- `semantic_blocks.json` goes to MinIO
- `vector_chunks.json` goes to MinIO
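The object keys that `_save_parse_artifacts` produces follow a fixed pattern, which can be reproduced without touching MinIO. The sketch below mirrors only the key-building part of the excerpt; `artifact_object_names` and `ARTIFACT_NAMES` are hypothetical names introduced for illustration.

```python
ARTIFACT_NAMES = ("layouts", "structure_nodes", "semantic_blocks", "vector_chunks")


def artifact_object_names(doc_id: str, artifact_prefix: str = "artifacts") -> dict[str, str]:
    """Return the object key of each artifact JSON for a document (hypothetical helper)."""
    # Same normalization as the excerpt: strip slashes, then append the doc id.
    prefix = f"{artifact_prefix.strip('/')}/{doc_id}"
    return {name: f"{prefix}/{name}.json" for name in ARTIFACT_NAMES}


keys = artifact_object_names("9acb2ba0")
```

With the default prefix this yields keys such as `artifacts/9acb2ba0/layouts.json`, matching the `artifact_keys` values shown in the metadata diff earlier on this page.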
|
||||
|
||||
### 3.2 Summary of PostgreSQL's Role in the Flow

In the current flow, PostgreSQL holds "document metadata + structured snapshots"; it does not store vectors or large objects:

- `documents` stores the current document master record, status, statistics, and index information
- `structure_nodes` stores the outline structure of the latest parse snapshot
- `semantic_blocks` stores the semantic block structure of the latest parse snapshot

For the fuller PostgreSQL design, including:

- `documents`
- `document_processing_runs`
- `document_status_history`
- `document_artifacts`
- `structure_nodes`
- `semantic_blocks`

see [document-processing-database-design.md](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/docs/architecture/document-processing-database-design.md:1).

### 3.3 Storage Responsibilities at a Glance

| Data layer | Stored in | Used directly for embedding | Ends up in Milvus | Main purpose |
| --- | --- | --- | --- | --- |
| Original file | MinIO | No | No | Keep the original uploaded document |
| `raw_layouts` | MinIO `layouts.json` | No | No | Keep the parser's raw response |
| `structure_nodes` | MinIO + PostgreSQL | No | No | Outline tree, hierarchy |
| `semantic_blocks` | MinIO + PostgreSQL | No | Indirectly | Semantic units, intermediate layer |
| `vector_chunks` | MinIO | Yes | Indirectly | Retrieval chunks before embedding |
| `Chunk` | In memory + Milvus | Yes | Yes | Unified vector storage model |
| `documents` metadata | PostgreSQL | No | No | Processing status, statistics, index information |
|
||||
|
||||
## 4. Which Step Actually "Becomes a Vector"

This is the most important point of the whole flow.

The conclusion first:

- Parsing does not vectorize
- Saving artifacts does not vectorize
- `ChunkBuilder.build()` does not vectorize either
- Only `EmbeddingProvider.embed_texts()` actually calls the embedding model
- Only `VectorIndex.upsert()` actually writes vectors into the vector store

### 4.1 `vector_chunks` Are First Mapped to the Unified `Chunk`

`AliyunVectorChunkBuilder` does no embedding; it only converts `ParsedDocument.vector_chunks` into the domain layer's unified `Chunk` model.

The code is in [vector_chunk_builder.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/parser/vector_chunk_builder.py:12):
|
||||
|
||||
```python
|
||||
def build(
|
||||
self,
|
||||
*,
|
||||
parsed_document: ParsedDocument,
|
||||
regulation_type: str,
|
||||
version: str,
|
||||
) -> list[Chunk]:
|
||||
chunks: list[Chunk] = []
|
||||
for index, item in enumerate(parsed_document.vector_chunks):
|
||||
content = item.get("content") or item.get("text") or ""
|
||||
embedding_text = item.get("embedding_text") or content
|
||||
if not embedding_text.strip():
|
||||
continue
|
||||
section_path = item.get("section_path") or []
|
||||
section_title = item.get("section_title") or (section_path[-1] if section_path else "")
|
||||
page_number = item.get("page_start") or item.get("page") or 0
|
||||
chunk_id = item.get("chunk_id") or f"{parsed_document.doc_id}-chunk-{index}"
|
||||
metadata = {k: v for k, v in item.items() if k not in {"content", "embedding_text"}}
|
||||
chunks.append(
|
||||
Chunk(
|
||||
chunk_id=str(chunk_id),
|
||||
doc_id=parsed_document.doc_id,
|
||||
doc_name=parsed_document.doc_name,
|
||||
content=content,
|
||||
embedding_text=embedding_text,
|
||||
section_title=section_title,
|
||||
section_path=section_path,
|
||||
page_number=int(page_number or 0),
|
||||
regulation_type=regulation_type,
|
||||
version=version,
|
||||
semantic_id=item.get("semantic_id", ""),
|
||||
block_type=item.get("block_type", ""),
|
||||
metadata=metadata,
|
||||
)
|
||||
)
|
||||
return chunks
|
||||
```
|
||||
|
||||
`Chunk` is defined in [models.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/domain/documents/models.py:63):

```python
@dataclass
class Chunk:
    chunk_id: str
    doc_id: str
    doc_name: str
    content: str
    embedding_text: str
    section_title: str = ""
    section_path: list[str] = field(default_factory=list)
    page_number: int = 0
    regulation_type: str = ""
    version: str = ""
    semantic_id: str = ""
    block_type: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)
```

A typical `Chunk` looks like this:
|
||||
|
||||
```json
|
||||
{
|
||||
"chunk_id": "doc-001-chunk-1",
|
||||
"doc_id": "doc-001",
|
||||
"doc_name": "动力电池安全规范",
|
||||
"content": "本标准适用于道路车辆动力电池系统的安全要求。企业应建立一致的测试和验证方法。",
|
||||
"embedding_text": "标准:动力电池安全规范\n章节:1 范围 > 1.1 适用对象\n\n本标准适用于道路车辆动力电池系统的安全要求。企业应建立一致的测试和验证方法。",
|
||||
"section_title": "1.1 适用对象",
|
||||
"section_path": ["1 范围", "1.1 适用对象"],
|
||||
"page_number": 2,
|
||||
"regulation_type": "GB",
|
||||
"version": "2025",
|
||||
"semantic_id": "semantic-1",
|
||||
"block_type": "section_text",
|
||||
"metadata": {
|
||||
"chunk_index": 1,
|
||||
"piece_index": 1,
|
||||
"source_ids": ["l-text-001", "l-text-002"]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
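The key here is to distinguish two fields:

- `content`
  - The text displayed after a retrieval hit
  - Closer to the body passage the user finally sees

- `embedding_text`
  - The text sent to the embedding model
  - Adds "standard name + section path" context on top of `content`

So the vectorization input is not the bare body text but context-enriched text.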
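The relationship between `content` and `embedding_text` can be illustrated with a small sketch that reproduces the layout of the example `Chunk` above. `build_embedding_text` is a hypothetical helper; the actual construction happens in the normalizer's `build_vector_chunks`, which is not shown in this document, so treat the exact format as inferred from the sample rather than confirmed.

```python
def build_embedding_text(doc_title: str, section_path: list[str], content: str) -> str:
    """Prefix content with document and section context (hypothetical helper)."""
    header = [f"标准:{doc_title}"]
    if section_path:
        # Section titles are joined in outline order, outermost first.
        header.append("章节:" + " > ".join(section_path))
    return "\n".join(header) + "\n\n" + content


text = build_embedding_text(
    "动力电池安全规范",
    ["1 范围", "1.1 适用对象"],
    "本标准适用于道路车辆动力电池系统的安全要求。",
)
```

Keeping `content` free of this header means the text shown to users after a hit stays clean, while the embedding still sees which standard and section the passage belongs to.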
|
||||
|
||||
### 4.2 Where the Embedding API Is Actually Called

The step that actually turns text into vectors is `OpenAICompatibleEmbeddingProvider.embed_texts()`.

The code is in [openai_compatible_embedding_provider.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/embedding/openai_compatible_embedding_provider.py:64):

```python
def embed_texts(self, texts: list[str]) -> list[list[float]]:
    if not texts:
        return []
```

In other words, only at this step:

- Input: a `list[str]` of `embedding_text` values
- Output: a `list[list[float]]` of dense vectors

The earlier parse, normalizer, and chunk builder steps only prepare text; none of them produce vector values.
|
||||
|
||||
### 4.3 Where Vectors Are Actually Written to Milvus

Only after the vector values exist does `MilvusVectorIndex.upsert()` write `Chunk + vector` into the vector store.

The code is in [milvus_vector_index.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/vectorstore/milvus_vector_index.py:69):

```python
def upsert(self, chunks: list[Chunk], vectors: list[list[float]]) -> int:
    if len(chunks) != len(vectors):
        raise ValueError("chunks 与 vectors 数量不一致")
    data = []
    now = int(time.time())
    for chunk, vector in zip(chunks, vectors):
        # String fields are sliced to the max_length declared in the schema.
        data.append(
            {
                "id": chunk.chunk_id,
                "doc_id": chunk.doc_id,
                "doc_name": chunk.doc_name,
                "content": chunk.content[:65535],
                "embedding": vector,
                "section_title": chunk.section_title[:512],
                "section_path": json.dumps(chunk.section_path, ensure_ascii=False)[:4096],
                "page_number": chunk.page_number,
                "regulation_type": chunk.regulation_type[:128],
                "version": chunk.version[:64],
                "semantic_id": chunk.semantic_id[:128],
                "block_type": chunk.block_type[:64],
                "metadata_json": json.dumps(chunk.metadata, ensure_ascii=False)[:65535],
                "created_at": now,
            }
        )
    self.collection.insert(data)
    self.collection.flush()
    return len(data)
```

In other words, Milvus ultimately stores:

- Primary key: `chunk_id`
- Document-level fields: `doc_id`, `doc_name`
- Retrieval display field: `content`
- Vector field: `embedding`
- Filter/traceback fields: `section_title`, `section_path`, `page_number`, `regulation_type`, `version`, `semantic_id`, `block_type`
- Additional metadata: `metadata_json`
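One detail in the `upsert` excerpt may be worth checking: slices such as `chunk.content[:65535]` count characters, not bytes. If the collection's `VARCHAR` `max_length` is enforced in bytes (an assumption to verify against the Milvus version in use), Chinese text at roughly three UTF-8 bytes per character could still exceed the limit after slicing. A byte-aware truncation helper might look like the sketch below; `truncate_utf8` is a hypothetical name and not part of the repository.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text so its UTF-8 encoding fits in max_bytes without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a trailing partial multi-byte sequence, if any.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


short = truncate_utf8("动力电池", 7)
```

Note that truncating a `json.dumps` result by either characters or bytes can leave invalid JSON, so for `section_path` and `metadata_json` it may be safer to shrink the data before serializing.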
|
||||
|
||||
## 5. What Exactly Ends Up in Milvus

### 5.1 Collection schema

The schema that `MilvusVectorIndex` defines when initializing the collection is in [milvus_vector_index.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/infrastructure/vectorstore/milvus_vector_index.py:37):
|
||||
|
||||
```python
|
||||
schema = CollectionSchema(
|
||||
fields=[
|
||||
FieldSchema(name="id", dtype=DataType.VARCHAR, max_length=128, is_primary=True, auto_id=False),
|
||||
FieldSchema(name="doc_id", dtype=DataType.VARCHAR, max_length=64),
|
||||
FieldSchema(name="doc_name", dtype=DataType.VARCHAR, max_length=256),
|
||||
FieldSchema(name="content", dtype=DataType.VARCHAR, max_length=65535),
|
||||
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=settings.embedding_dim),
|
||||
FieldSchema(name="section_title", dtype=DataType.VARCHAR, max_length=512),
|
||||
FieldSchema(name="section_path", dtype=DataType.VARCHAR, max_length=4096),
|
||||
FieldSchema(name="page_number", dtype=DataType.INT64),
|
||||
FieldSchema(name="regulation_type", dtype=DataType.VARCHAR, max_length=128),
|
||||
FieldSchema(name="version", dtype=DataType.VARCHAR, max_length=64),
|
||||
FieldSchema(name="semantic_id", dtype=DataType.VARCHAR, max_length=128),
|
||||
FieldSchema(name="block_type", dtype=DataType.VARCHAR, max_length=64),
|
||||
FieldSchema(name="metadata_json", dtype=DataType.VARCHAR, max_length=65535),
|
||||
FieldSchema(name="created_at", dtype=DataType.INT64),
|
||||
],
|
||||
description="Dense-only regulations index",
|
||||
enable_dynamic_field=False,
|
||||
)
|
||||
```
|
||||
|
||||
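This shows that Milvus does not store a "minimal vector table with only embeddings"; it stores:

- One dense vector
- A set of structured fields to return or filter on at retrieval time

Note, however, that this does not make Milvus the business system of record. It still mainly serves retrieval and does not replace PostgreSQL's document management role.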
|
||||
|
||||
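### 5.2 Why `list_documents()` Checks Milvus First

The document list query is implemented in [services.py](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/backend/app/application/documents/services.py:271). It:

1. Queries Milvus for the documents that actually have vectors
2. Loads document records from the document metadata repository
3. Merges the two, treating Milvus as the source of truth for index state

The reason is not that "Milvus replaces PostgreSQL", but that:

- Whether the `indexed` status really holds depends on whether Milvus has the corresponding chunks
- Download, deletion, retry, file location, and error information still rely on the document metadata repository

Therefore:

- Milvus is the "source of truth for the index"
- PostgreSQL/JSON is the "source of truth for document metadata"

Their responsibilities differ, and neither can replace the other.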
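The merge described above can be sketched as follows. This is not the code in `services.py`, which is not shown here; `merge_document_listing`, `is_indexed`, and `indexed_chunk_count` are hypothetical names, and the sketch assumes the index side has already been reduced to a per-document chunk count.

```python
from typing import Any


def merge_document_listing(
    indexed_chunk_counts: dict[str, int],
    records: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Combine metadata records with index presence (hypothetical helper)."""
    merged = []
    for record in records:
        chunk_count = indexed_chunk_counts.get(record["doc_id"], 0)
        merged.append(
            {
                **record,
                # The index decides whether "indexed" really holds.
                "is_indexed": chunk_count > 0,
                "indexed_chunk_count": chunk_count,
            }
        )
    return merged


listing = merge_document_listing(
    {"doc-001": 27},
    [
        {"doc_id": "doc-001", "status": "indexed"},
        {"doc_id": "doc-002", "status": "indexed"},
    ],
)
```

In this sketch, `doc-002` keeps its metadata record but is flagged as not indexed, which is the kind of drift the Milvus-first check is meant to surface.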
|
||||
508
docs/architecture/document-processing-database-design.md
Normal file
@@ -0,0 +1,508 @@
|
||||
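# Document Processing Pipeline Database Design

## 1. Purpose

This document defines the PostgreSQL database design for the current document processing main pipeline. It covers the core path of upload, parsing, indexing, status query, retry, and deletion, together with the common operations and audit needs around that path.

The goal is not to replace the flow description in [document-core-processing-flow.md](/abs/path/C:/Users/A200477427/Developers/AIRegulation/AIRegulation-DocAnalysis/docs/architecture/document-core-processing-flow.md:1), but to supply the authority for relational storage, so that the later switch from JSON metadata to PostgreSQL has a clear, stable, and implementable design baseline.

## 1.1 Scope And Design Target

This document covers only:

- Document master records
- Document processing run records
- Document status history
- Parse artifact references
- The current latest structured parse snapshot

It does not cover:

- Agent sessions
- Feedback and manual review
- Compliance analysis tasks
- The detailed implementation of the Milvus collection schema

The design follows the `Compat First` principle:

- Stay compatible with the current `DocumentRepository` / `ParseArtifactStore` main flow
- Add relational tables to fill the gaps in operations and auditing
- Do not force a large-scale interface rewrite for the sake of an idealized model

## 2. Storage Responsibilities

The system currently uses three kinds of storage, and their responsibilities must be clearly separated:

| Storage | Contents | System of record | Notes |
| --- | --- | --- | --- |
| MinIO | Original files, `layouts.json`, `structure_nodes.json`, `semantic_blocks.json`, `vector_chunks.json` | No | Handles large objects and artifact archiving; not used for relational queries |
| Milvus | Chunk-level vectors and retrieval helper fields | No | Handles vector retrieval; not responsible for document lifecycle management |
| PostgreSQL | Document metadata, processing status, structured snapshots, processing history, artifact references | Yes | Handles document management, operational observability, and relational queries |

Constraints:

- PostgreSQL does not store embedding vectors.
- PostgreSQL does not add a `vector_chunks` content table.
- Milvus may store retrieval helper fields such as `doc_id`, `doc_name`, `regulation_type`, and `version`, but it is not the business source of truth.
- Document download, deletion, and retry still start from the document master record in PostgreSQL.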
|
||||
|
||||
## 3. Design Overview
|
||||
|
||||
### 3.1 Entity Responsibilities
|
||||
|
||||
The database uses a layered model of "current-state master record + current snapshot + historical process":

- `documents`
  - The current document master record
  - Stores the metadata and current status used directly by management, download, retry, and deletion
- `document_processing_runs`
  - One processing run per upload or retry
  - Stores run-level statistics, stage timestamps, and failure information
- `document_status_history`
  - An append-only status event stream
  - Stores the context of each status change
- `document_artifacts`
  - Stores reference information for MinIO artifacts
  - Does not store the artifact content itself
- `structure_nodes`
  - The outline structure in the current latest parse snapshot
- `semantic_blocks`
  - The semantic block structure in the current latest parse snapshot

### 3.2 Current Snapshot Vs Historical Records

This design explicitly distinguishes two kinds of data:

- Current snapshot
  - `documents`
  - `structure_nodes`
  - `semantic_blocks`
- Historical process
  - `document_processing_runs`
  - `document_status_history`
  - `document_artifacts`

Specifically:

- `structure_nodes` and `semantic_blocks` store only the current snapshot "after the latest successful parse"
- Tracing historical versions relies on `document_processing_runs`, `document_artifacts`, and the artifact files of the corresponding run in MinIO
|
||||
|
||||
## 4. Table Design
|
||||
|
||||
### 4.1 `documents`
|
||||
|
||||
Purpose:

- Serves as the master record table for the document lifecycle
- Provides the current-state truth for download, deletion, retry, management lists, and status queries

Field design:
|
||||
|
||||
```sql
|
||||
CREATE TABLE IF NOT EXISTS documents (
|
||||
doc_id VARCHAR(128) PRIMARY KEY,
|
||||
doc_name VARCHAR(512) NOT NULL DEFAULT '',
|
||||
file_name VARCHAR(512) NOT NULL DEFAULT '',
|
||||
object_name VARCHAR(1024) NOT NULL DEFAULT '',
|
||||
content_type VARCHAR(128) NOT NULL DEFAULT '',
|
||||
size_bytes BIGINT NOT NULL DEFAULT 0,
|
||||
status VARCHAR(32) NOT NULL DEFAULT 'pending',
|
||||
regulation_type VARCHAR(128) NOT NULL DEFAULT '',
|
||||
version VARCHAR(64) NOT NULL DEFAULT '',
|
||||
summary TEXT NOT NULL DEFAULT '',
|
||||
summary_latency_ms INTEGER NOT NULL DEFAULT 0,
|
||||
chunk_count INTEGER NOT NULL DEFAULT 0,
|
||||
parser_name VARCHAR(128) NOT NULL DEFAULT '',
|
||||
index_name VARCHAR(128) NOT NULL DEFAULT '',
|
||||
error_message TEXT NOT NULL DEFAULT '',
|
||||
metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
|
||||
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||
CONSTRAINT chk_documents_status
|
||||
CHECK (status IN ('pending', 'stored', 'parsed', 'indexed', 'failed'))
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_documents_status_updated_at
|
||||
ON documents(status, updated_at DESC);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_documents_regulation_version
|
||||
ON documents(regulation_type, version);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_documents_updated_at
|
||||
ON documents(updated_at DESC);
|
||||
```
|
||||
|
||||
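Field notes:

- `object_name`
  - The object path of the original uploaded file in MinIO
  - The current implementation relies on this field for download, retry, and deletion; v1 does not split it into a separate file table
- `status`
  - The current document processing status
  - Represents the current state only; it carries no historical audit role
- `metadata`
  - Stores lightweight, frequently changing extra information that is not yet worth modeling as columns
  - Typical contents include `parse_task_id`, `processing_stage`, `artifact_keys`, and statistical counts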
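Because `status` is guarded by the `chk_documents_status` CHECK constraint, application code may want to validate a value before attempting an update. The sketch below mirrors the five allowed values from the DDL above; `DocumentStatusSketch` and `is_valid_status` are hypothetical names and are not the repository's `DocumentStatus` type, whose definition is not shown in this document.

```python
from enum import Enum


class DocumentStatusSketch(str, Enum):
    """Mirror of the values allowed by chk_documents_status (illustrative only)."""

    PENDING = "pending"
    STORED = "stored"
    PARSED = "parsed"
    INDEXED = "indexed"
    FAILED = "failed"


def is_valid_status(value: str) -> bool:
    """Return True if value would pass the CHECK constraint on documents.status."""
    return value in {status.value for status in DocumentStatusSketch}


ok = is_valid_status("parsed")
bad = is_valid_status("archived")
```

Validating in the application gives a clearer error than a constraint violation, while the CHECK constraint remains the final guard in the database.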
|
||||
|
||||
### 4.2 `document_processing_runs`
|
||||
|
||||
Purpose:

- Records one complete processing run for an upload or a retry
- Explains "why this processing of the document succeeded or failed"

Field design:
|
||||
|
||||
```sql
|
||||
CREATE TABLE IF NOT EXISTS document_processing_runs (
|
||||
run_id BIGSERIAL PRIMARY KEY,
|
||||
doc_id VARCHAR(128) NOT NULL,
|
||||
trigger_type VARCHAR(16) NOT NULL,
|
||||
run_status VARCHAR(16) NOT NULL,
|
||||
parser_backend VARCHAR(64) NOT NULL DEFAULT '',
|
||||
chunk_backend VARCHAR(64) NOT NULL DEFAULT '',
|
||||
embedding_model VARCHAR(128) NOT NULL DEFAULT '',
|
||||
index_name VARCHAR(128) NOT NULL DEFAULT '',
|
||||
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||
stored_at TIMESTAMPTZ,
|
||||
parsed_at TIMESTAMPTZ,
|
||||
indexed_at TIMESTAMPTZ,
|
||||
finished_at TIMESTAMPTZ,
|
||||
layout_count INTEGER NOT NULL DEFAULT 0,
|
||||
structure_node_count INTEGER NOT NULL DEFAULT 0,
|
||||
semantic_block_count INTEGER NOT NULL DEFAULT 0,
|
||||
vector_chunk_count INTEGER NOT NULL DEFAULT 0,
|
||||
chunk_count INTEGER NOT NULL DEFAULT 0,
|
||||
failure_stage VARCHAR(32) NOT NULL DEFAULT '',
|
||||
error_message TEXT NOT NULL DEFAULT '',
|
||||
metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
|
||||
CONSTRAINT fk_runs_document
|
||||
FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE,
|
||||
CONSTRAINT chk_runs_trigger_type
|
||||
CHECK (trigger_type IN ('upload', 'retry')),
|
||||
CONSTRAINT chk_runs_status
|
||||
CHECK (run_status IN ('running', 'succeeded', 'failed'))
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_runs_doc_started_at
|
||||
ON document_processing_runs(doc_id, started_at DESC);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_runs_status_started_at
|
||||
ON document_processing_runs(run_status, started_at DESC);
|
||||
```
|
||||
|
||||
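Field notes:

- `trigger_type`
  - Identifies whether the run was triggered by the initial upload or by a retry
- `run_status`
  - Represents only the final outcome of the run
- `failure_stage`
  - Values should match the application's key stages, for example `store`, `parse`, `artifact_persist`, `embed`, `index`
- `metadata`
  - Holds run-level context, such as a configuration snapshot, backend implementation names, and a summary of provider responses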
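
A minimal sketch of how application code might model one run before persisting it to `document_processing_runs`. The `ProcessingRun` class and the `FAILURE_STAGES` set are hypothetical names used for illustration; only the column names and the suggested stage values come from the design above, and treating those stage values as an exhaustive set is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Suggested failure_stage values from the field notes (assumed exhaustive here).
FAILURE_STAGES = {"store", "parse", "artifact_persist", "embed", "index"}


@dataclass
class ProcessingRun:
    """In-memory stand-in for one document_processing_runs row (hypothetical)."""

    doc_id: str
    trigger_type: str  # 'upload' or 'retry', mirroring chk_runs_trigger_type
    run_status: str = "running"
    failure_stage: str = ""
    error_message: str = ""
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    finished_at: Optional[datetime] = None

    def __post_init__(self) -> None:
        if self.trigger_type not in ("upload", "retry"):
            raise ValueError(f"invalid trigger_type: {self.trigger_type!r}")

    def mark_failed(self, stage: str, error: str) -> None:
        """Record the failing stage and close the run."""
        if stage not in FAILURE_STAGES:
            raise ValueError(f"unknown failure_stage: {stage!r}")
        self.run_status = "failed"
        self.failure_stage = stage
        self.error_message = error
        self.finished_at = datetime.now(timezone.utc)

    def mark_succeeded(self) -> None:
        """Close the run with a successful outcome."""
        self.run_status = "succeeded"
        self.finished_at = datetime.now(timezone.utc)
```

Validating `failure_stage` in application code keeps the column queryable by stage without adding a database-level CHECK, which the DDL above deliberately leaves out.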
|
||||
|
||||
### 4.3 `document_status_history`
|
||||
|
||||
Purpose:

- Stores the stream of status-change events
- Used for troubleshooting, auditing, and run-trace analysis

Field design:
|
||||
|
||||
```sql
|
||||
CREATE TABLE IF NOT EXISTS document_status_history (
|
||||
event_id BIGSERIAL PRIMARY KEY,
|
||||
doc_id VARCHAR(128) NOT NULL,
|
||||
run_id BIGINT,
|
||||
from_status VARCHAR(32) NOT NULL DEFAULT '',
|
||||
to_status VARCHAR(32) NOT NULL,
|
||||
stage VARCHAR(32) NOT NULL DEFAULT '',
|
||||
message TEXT NOT NULL DEFAULT '',
|
||||
metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
|
||||
occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||
CONSTRAINT fk_status_document
|
||||
FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE,
|
||||
CONSTRAINT fk_status_run
|
||||
FOREIGN KEY (run_id) REFERENCES document_processing_runs(run_id) ON DELETE CASCADE,
|
||||
CONSTRAINT chk_status_history_to_status
|
||||
CHECK (to_status IN ('pending', 'stored', 'parsed', 'indexed', 'failed'))
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_status_history_doc_occurred_at
|
||||
ON document_status_history(doc_id, occurred_at DESC);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_status_history_run_occurred_at
|
||||
ON document_status_history(run_id, occurred_at DESC);
|
||||
```
|
||||
|
||||
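Field notes:

- `from_status` may be an empty string
  - Used for the first event, for example when a newly created document enters `pending`
- `stage`
  - Records the business stage that the status change corresponds to
- `message`
  - Records a human-readable note aimed at troubleshooting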
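
The shape of a status event can be sketched as follows. `build_status_event` is a hypothetical helper, and the `stage` strings in the usage lines are illustrative values, not names defined by this design; the allowed `to_status` values mirror `chk_status_history_to_status`.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

# Mirrors chk_status_history_to_status in the DDL above.
ALLOWED_STATUSES = ("pending", "stored", "parsed", "indexed", "failed")


def build_status_event(
    doc_id: str,
    to_status: str,
    from_status: str = "",
    run_id: Optional[int] = None,
    stage: str = "",
    message: str = "",
) -> Dict[str, Any]:
    """Return a dict shaped like one document_status_history row (hypothetical helper)."""
    if to_status not in ALLOWED_STATUSES:
        raise ValueError(f"invalid to_status: {to_status!r}")
    return {
        "doc_id": doc_id,
        "run_id": run_id,
        "from_status": from_status,  # empty string for the first event of a document
        "to_status": to_status,
        "stage": stage,
        "message": message,
        "metadata": {},
        "occurred_at": datetime.now(timezone.utc),
    }


# First event: the document is created and enters 'pending' with no previous status.
first = build_status_event("doc-1", "pending", run_id=1, stage="upload", message="document created")
# Later event: the previous status is carried in from_status.
second = build_status_event("doc-1", "stored", from_status="pending", run_id=1, stage="store")
```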
|
||||
|
||||
### 4.4 `document_artifacts`
|
||||
|
||||
Purpose:

- Stores the location and basic attributes of parse artifacts in MinIO
- Lets later lookups find the artifacts of a given run without scanning object storage

Field design:
|
||||
|
||||
```sql
|
||||
CREATE TABLE IF NOT EXISTS document_artifacts (
|
||||
artifact_id BIGSERIAL PRIMARY KEY,
|
||||
doc_id VARCHAR(128) NOT NULL,
|
||||
run_id BIGINT,
|
||||
artifact_type VARCHAR(32) NOT NULL,
|
||||
object_name VARCHAR(1024) NOT NULL,
|
||||
content_type VARCHAR(128) NOT NULL DEFAULT 'application/json',
|
||||
byte_size BIGINT NOT NULL DEFAULT 0,
|
||||
checksum VARCHAR(128) NOT NULL DEFAULT '',
|
||||
metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
|
||||
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||
CONSTRAINT fk_artifacts_document
|
||||
FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE,
|
||||
CONSTRAINT fk_artifacts_run
|
||||
FOREIGN KEY (run_id) REFERENCES document_processing_runs(run_id) ON DELETE CASCADE,
|
||||
CONSTRAINT chk_artifact_type
|
||||
CHECK (artifact_type IN ('layouts', 'structure_nodes', 'semantic_blocks', 'vector_chunks'))
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_artifacts_doc_created_at
|
||||
ON document_artifacts(doc_id, created_at DESC);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_artifacts_run_type
|
||||
ON document_artifacts(run_id, artifact_type);
|
||||
|
||||
CREATE UNIQUE INDEX IF NOT EXISTS uq_artifacts_run_type_object
|
||||
ON document_artifacts(run_id, artifact_type, object_name);
|
||||
```
|
||||
|
||||
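Field notes:

- This table records artifact references only, not the original file
- The original file is still represented by `documents.object_name`, which keeps the current download and retry logic compatible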
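
As an illustration of what an artifact reference might carry, the sketch below builds one row per artifact type for a run. `build_artifact_rows` is a hypothetical helper; the object key layout and the choice of SHA-256 for `checksum` are assumptions, since the design only fixes the column names and the allowed `artifact_type` values.

```python
import hashlib
import json
from typing import Any, Dict, List

# Mirrors chk_artifact_type in the DDL above.
ARTIFACT_TYPES = ("layouts", "structure_nodes", "semantic_blocks", "vector_chunks")


def build_artifact_rows(doc_id: str, run_id: int, payloads: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Serialize each payload and describe it as a document_artifacts row."""
    rows: List[Dict[str, Any]] = []
    for artifact_type, payload in payloads.items():
        if artifact_type not in ARTIFACT_TYPES:
            raise ValueError(f"invalid artifact_type: {artifact_type!r}")
        body = json.dumps(payload, ensure_ascii=False, sort_keys=True).encode("utf-8")
        rows.append(
            {
                "doc_id": doc_id,
                "run_id": run_id,
                "artifact_type": artifact_type,
                # Hypothetical key layout; the real MinIO naming scheme is not specified here.
                "object_name": f"artifacts/{doc_id}/runs/{run_id}/{artifact_type}.json",
                "content_type": "application/json",
                "byte_size": len(body),
                "checksum": hashlib.sha256(body).hexdigest(),  # assumed algorithm
            }
        )
    return rows


rows = build_artifact_rows("doc-1", 7, {"layouts": [{"page": 1}], "vector_chunks": []})
```

Including `run_id` in the object key is one way to keep `(run_id, artifact_type, object_name)` unique, which is what `uq_artifacts_run_type_object` enforces.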
|
||||
|
||||
### 4.5 `structure_nodes`
|
||||
|
||||
Purpose:

- Stores the heading hierarchy from the latest parse snapshot
- Serves table-of-contents queries, structured browsing, debugging, and auditing

Field design:
|
||||
|
||||
```sql
|
||||
CREATE TABLE IF NOT EXISTS structure_nodes (
|
||||
id BIGSERIAL PRIMARY KEY,
|
||||
doc_id VARCHAR(128) NOT NULL,
|
||||
unique_id VARCHAR(128),
|
||||
page INTEGER NOT NULL DEFAULT 0,
|
||||
idx INTEGER NOT NULL DEFAULT 0,
|
||||
level INTEGER NOT NULL DEFAULT 0,
|
||||
title TEXT NOT NULL DEFAULT '',
|
||||
type VARCHAR(64),
|
||||
sub_type VARCHAR(64),
|
||||
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||
CONSTRAINT fk_structure_nodes_document
|
||||
FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_structure_nodes_doc_idx
|
||||
ON structure_nodes(doc_id, idx);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_structure_nodes_doc_level
|
||||
ON structure_nodes(doc_id, level);
|
||||
```
|
||||
|
||||
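Design constraints:

- This table represents the current snapshot and does not model multiple versions
- A new successful parse overwrites the old snapshot for the same `doc_id`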
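
The overwrite rule can be sketched as a delete-then-insert inside one transaction. The example uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs anywhere; `replace_structure_snapshot` is a hypothetical helper and the table is reduced to a few columns.

```python
import sqlite3
from typing import Iterable, Tuple

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE structure_nodes ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " doc_id TEXT NOT NULL,"
    " idx INTEGER NOT NULL DEFAULT 0,"
    " level INTEGER NOT NULL DEFAULT 0,"
    " title TEXT NOT NULL DEFAULT '')"
)


def replace_structure_snapshot(
    conn: sqlite3.Connection, doc_id: str, nodes: Iterable[Tuple[int, int, str]]
) -> None:
    """Delete the old snapshot and insert the new one in a single transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM structure_nodes WHERE doc_id = ?", (doc_id,))
        conn.executemany(
            "INSERT INTO structure_nodes (doc_id, idx, level, title) VALUES (?, ?, ?, ?)",
            [(doc_id, idx, level, title) for idx, level, title in nodes],
        )


# First parse writes two nodes; a later successful parse replaces them.
replace_structure_snapshot(conn, "doc-1", [(0, 1, "Chapter 1"), (1, 2, "1.1 Scope")])
replace_structure_snapshot(conn, "doc-1", [(0, 1, "Chapter 1 (revised)")])
titles = [
    row[0]
    for row in conn.execute(
        "SELECT title FROM structure_nodes WHERE doc_id = ? ORDER BY idx", ("doc-1",)
    )
]
```

Running both statements in one transaction means readers never observe a document with an empty or half-written snapshot, and a failed insert leaves the previous snapshot in place.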
|
||||
|
||||
### 4.6 `semantic_blocks`
|
||||
|
||||
Purpose:

- Stores the semantic blocks from the latest parse snapshot
- Serves structure tracing, debugging, and later relational queries

Field design:
|
||||
|
||||
```sql
|
||||
CREATE TABLE IF NOT EXISTS semantic_blocks (
|
||||
id BIGSERIAL PRIMARY KEY,
|
||||
doc_id VARCHAR(128) NOT NULL,
|
||||
semantic_id VARCHAR(128) NOT NULL,
|
||||
block_type VARCHAR(64) NOT NULL DEFAULT '',
|
||||
page_start INTEGER NOT NULL DEFAULT 0,
|
||||
page_end INTEGER NOT NULL DEFAULT 0,
|
||||
section_path JSONB NOT NULL DEFAULT '[]'::jsonb,
|
||||
section_level INTEGER NOT NULL DEFAULT 0,
|
||||
section_title VARCHAR(512) NOT NULL DEFAULT '',
|
||||
source_ids JSONB NOT NULL DEFAULT '[]'::jsonb,
|
||||
text TEXT NOT NULL DEFAULT '',
|
||||
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
|
||||
CONSTRAINT fk_semantic_blocks_document
|
||||
FOREIGN KEY (doc_id) REFERENCES documents(doc_id) ON DELETE CASCADE,
|
||||
CONSTRAINT uq_semantic_blocks_doc_semantic
|
||||
UNIQUE (doc_id, semantic_id)
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_semantic_blocks_doc_id
|
||||
ON semantic_blocks(doc_id);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_semantic_blocks_doc_section_title
|
||||
ON semantic_blocks(doc_id, section_title);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS idx_semantic_blocks_doc_block_type
|
||||
ON semantic_blocks(doc_id, block_type);
|
||||
```
|
||||
|
||||
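Design constraints:

- This table represents the current snapshot and does not keep historical versions
- Historical lookups should go through the artifact files of the corresponding run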
|
||||
|
||||
## 5. Relationship Model
|
||||
|
||||
The entity relationships are as follows:
|
||||
|
||||
```mermaid
|
||||
erDiagram
|
||||
documents ||--o{ document_processing_runs : has
|
||||
documents ||--o{ document_status_history : has
|
||||
documents ||--o{ document_artifacts : has
|
||||
documents ||--o{ structure_nodes : has
|
||||
documents ||--o{ semantic_blocks : has
|
||||
document_processing_runs ||--o{ document_status_history : emits
|
||||
document_processing_runs ||--o{ document_artifacts : produces
|
||||
```
|
||||
|
||||
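Relationship semantics:

- `documents` is the aggregate root
- `document_processing_runs` records one complete processing attempt
- `document_status_history` records the trail of status transitions
- `document_artifacts` records the replayable structured artifacts stored in MinIO
- `structure_nodes` / `semantic_blocks` represent the relational snapshot of the "current version"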
|
||||
|
||||
## 6. Flow-To-Table Mapping
|
||||
|
||||
### 6.1 Upload
|
||||
|
||||
When an upload starts:

1. Create the `documents` row
2. Create one `document_processing_runs` row
3. Write one `document_status_history` row with `to_status='pending'`
|
||||
|
||||
### 6.2 Store Original File
|
||||
|
||||
After the original file is written to MinIO successfully:

1. Update `documents.status='stored'`
2. Update `stored_at` on the current run
3. Append a `document_status_history` row
|
||||
|
||||
### 6.3 Parse And Persist Artifacts
|
||||
|
||||
After parsing succeeds:

1. Update `parsed_at` on the current run
2. Update the run's `layout_count`, `structure_node_count`, `semantic_block_count`, and `vector_chunk_count`
3. Update `documents.status='parsed'`
4. Refresh `structure_nodes`
5. Refresh `semantic_blocks`
6. Write `document_artifacts` rows for `layouts`, `structure_nodes`, `semantic_blocks`, and `vector_chunks`
7. Append a `document_status_history` row
|
||||
|
||||
### 6.4 Embed And Index
|
||||
|
||||
After embedding and indexing succeed:

1. Update `indexed_at` and `finished_at` on the current run
2. Update `run_status='succeeded'` on the current run
3. Update `documents.status='indexed'`
4. Update `documents.chunk_count` and `index_name`
5. Append a `document_status_history` row
|
||||
|
||||
### 6.5 Failure
|
||||
|
||||
When any stage fails:

1. Update `run_status='failed'` on the current run
2. Record `failure_stage` and `error_message`
3. Update `finished_at`
4. Update `documents.status='failed'`
5. Update `documents.error_message`
6. Append a `document_status_history` row
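
The status progression described in sections 6.1 to 6.5 can be sketched as a small transition table. This is an application-level assumption: the DDL only constrains the set of status values, not their order, and `DocumentState` and `ALLOWED_TRANSITIONS` are hypothetical names. How a retry resets the status is not modeled here.

```python
from typing import Dict, List, Set

# Forward transitions implied by sections 6.1-6.5; every in-progress state may also fail.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "pending": {"stored", "failed"},
    "stored": {"parsed", "failed"},
    "parsed": {"indexed", "failed"},
    "indexed": set(),
    "failed": set(),
}


class DocumentState:
    """Tracks documents.status plus an append-only history (hypothetical helper)."""

    def __init__(self) -> None:
        self.status = "pending"
        # The first event has an empty from_status, as in document_status_history.
        self.history: List[Dict[str, str]] = [{"from_status": "", "to_status": "pending"}]

    def advance(self, to_status: str) -> None:
        """Move to the next status and append the matching history event."""
        if to_status not in ALLOWED_TRANSITIONS[self.status]:
            raise ValueError(f"illegal transition: {self.status} -> {to_status}")
        self.history.append({"from_status": self.status, "to_status": to_status})
        self.status = to_status


# Happy path: store, parse, then index.
doc = DocumentState()
for step in ("stored", "parsed", "indexed"):
    doc.advance(step)
```

Updating the current status and appending the history event in the same method keeps the two from drifting apart; in a real store both writes would belong to one database transaction.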
|
||||
|
||||
### 6.6 Retry
|
||||
|
||||
On retry:

1. Keep the existing `documents.doc_id`
2. Create a new `document_processing_runs` row
3. Write a fresh status history for this retry
4. After the retry succeeds, overwrite the current `structure_nodes` / `semantic_blocks` snapshot
5. Keep the historical run and artifact records
|
||||
|
||||
### 6.7 Delete
|
||||
|
||||
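When a document is deleted:

1. The application layer first deletes the original file and artifacts from MinIO
2. The application layer deletes the vectors associated with the `doc_id` from Milvus
3. Finally, delete the `documents` row
4. Rely on the `ON DELETE CASCADE` foreign keys to clean up runs, status history, artifacts, structure nodes, and semantic blocks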
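
The delete order above can be sketched as follows, again with in-memory SQLite standing in for PostgreSQL and the schema reduced to three tables. `delete_document` and its two callbacks are hypothetical; the callbacks represent the MinIO and Milvus cleanup, which this sketch does not implement.

```python
import sqlite3
from typing import Callable, List, Tuple

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is on; PostgreSQL always enforces them.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE documents (doc_id TEXT PRIMARY KEY);
    CREATE TABLE document_processing_runs (
        run_id INTEGER PRIMARY KEY,
        doc_id TEXT NOT NULL REFERENCES documents(doc_id) ON DELETE CASCADE
    );
    CREATE TABLE document_artifacts (
        artifact_id INTEGER PRIMARY KEY,
        doc_id TEXT NOT NULL REFERENCES documents(doc_id) ON DELETE CASCADE,
        run_id INTEGER REFERENCES document_processing_runs(run_id) ON DELETE CASCADE
    );
    INSERT INTO documents VALUES ('doc-1');
    INSERT INTO document_processing_runs VALUES (1, 'doc-1');
    INSERT INTO document_artifacts VALUES (1, 'doc-1', 1);
    """
)


def delete_document(
    conn: sqlite3.Connection,
    doc_id: str,
    delete_objects: Callable[[str], None],
    delete_vectors: Callable[[str], None],
) -> None:
    """Remove external data first, then the documents row; child rows cascade."""
    delete_objects(doc_id)  # MinIO original file and artifacts (caller-supplied)
    delete_vectors(doc_id)  # Milvus vectors keyed by doc_id (caller-supplied)
    with conn:
        conn.execute("DELETE FROM documents WHERE doc_id = ?", (doc_id,))


calls: List[Tuple[str, str]] = []
delete_document(
    conn,
    "doc-1",
    lambda d: calls.append(("minio", d)),
    lambda d: calls.append(("milvus", d)),
)
remaining_runs = conn.execute("SELECT COUNT(*) FROM document_processing_runs").fetchone()[0]
remaining_artifacts = conn.execute("SELECT COUNT(*) FROM document_artifacts").fetchone()[0]
```

Deleting the relational row last means a failure in the MinIO or Milvus step leaves the `documents` row in place, so the delete can be attempted again with the `object_name` and artifact references still available.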
|
||||
|
||||
## 7. Alignment With Current Backend
|
||||
|
||||
### 7.1 Compatible Parts
|
||||
|
||||
The current code is already compatible with the following parts of the design:

- `documents`
- `structure_nodes`
- `semantic_blocks`
- Overwrite-style updates of the current snapshot
- `doc_id` as the single correlation key across MinIO / Milvus / PostgreSQL
|
||||
|
||||
### 7.2 Required Future Additions
|
||||
|
||||
If PostgreSQL later becomes the default metadata backend, the following internal stores or repositories should be added:

- `DocumentProcessingRunStore`
- `DocumentStatusEventStore`
- `DocumentArtifactStore`

These additions are internal enhancements and do not require changes to the existing HTTP API.
|
||||
|
||||
### 7.3 Migration Guidance
|
||||
|
||||
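When switching from the current JSON metadata to PostgreSQL, the following order is recommended:

1. Migrate the existing primary document records in `documents.json` into `documents`
2. Switch `DOCUMENT_REPOSITORY_BACKEND` to `postgres`
3. Start writing run / status history / artifact records for newly uploaded or retried documents
4. Historical documents that lack run-level data may leave it empty; this does not block the switch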
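
Step 1 might look like the sketch below. It assumes that `documents.json` holds a list of objects whose keys match the `documents` column names, which this design does not confirm; `to_document_row` is a hypothetical helper, and parking unknown keys in `metadata` is one possible choice rather than a stated rule.

```python
import json
from typing import Any, Dict

# Column defaults taken from the documents DDL above.
DOCUMENT_DEFAULTS: Dict[str, Any] = {
    "status": "pending",
    "regulation_type": "",
    "version": "",
    "summary": "",
    "chunk_count": 0,
    "error_message": "",
}
ALLOWED_STATUSES = {"pending", "stored", "parsed", "indexed", "failed"}


def to_document_row(record: Dict[str, Any]) -> Dict[str, Any]:
    """Map one legacy JSON record to a documents row, filling column defaults."""
    if not record.get("doc_id"):
        raise ValueError("record is missing doc_id")
    row: Dict[str, Any] = {"doc_id": record["doc_id"]}
    for column, default in DOCUMENT_DEFAULTS.items():
        value = record.get(column)
        row[column] = default if value is None else value
    if row["status"] not in ALLOWED_STATUSES:
        raise ValueError(f"invalid status: {row['status']!r}")
    # Keys without a dedicated column are preserved in the JSONB metadata column.
    extra = dict(record.get("metadata") or {})
    extra.update({k: v for k, v in record.items() if k not in row and k != "metadata"})
    row["metadata"] = json.dumps(extra, ensure_ascii=False, sort_keys=True)
    return row


legacy = json.loads(
    '[{"doc_id": "doc-1", "status": "indexed", "chunk_count": 42, "uploader": "alice"}]'
)
rows = [to_document_row(r) for r in legacy]
```

No run, status history, or artifact rows are produced for legacy records, in line with step 4.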
|
||||
|
||||
## 8. Non-Goals
|
||||
|
||||
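The following capabilities are out of scope for v1 of this design:

- Replacing Milvus with PostgreSQL vector capabilities
- Storing vector fields in PostgreSQL
- Creating a separate relational table for `vector_chunks`
- Creating a historical version store for `structure_nodes` / `semantic_blocks`
- Abstracting the original file into a separate `document_files` table

These capabilities may be discussed in a later refactoring, but they should not affect the current main-path switch or compatibility with the existing application layer.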
|
||||
2
frontend/.env
Normal file
@@ -0,0 +1,2 @@
|
||||
VITE_API_PROXY_TARGET=http://6.86.80.8:8000
|
||||
FRONTEND_PORT=5173
|
||||
2
frontend/.env.development
Normal file
@@ -0,0 +1,2 @@
|
||||
VITE_API_PROXY_TARGET=http://127.0.0.1:8000
|
||||
FRONTEND_PORT=5173
|
||||
2
frontend/.env.example
Normal file
@@ -0,0 +1,2 @@
|
||||
VITE_API_PROXY_TARGET=http://127.0.0.1:8000
|
||||
FRONTEND_PORT=5173
|
||||
@@ -49,6 +49,12 @@ npm run dev
|
||||
|
||||
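Starts the local development server, available at `http://localhost:5173` by default.

The frontend environment files follow these conventions:

- `frontend/.env.development`: local development; proxies to `http://127.0.0.1:8000` by default
- `frontend/.env.production`: production build; proxies to `http://6.86.80.8:8000` by default
- `frontend/.env.local`: temporary overrides for the local machine; takes precedence over the two files above

### Build for Production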
|
||||
|
||||
```bash
|
||||
|
||||
25
frontend/components.json
Normal file
@@ -0,0 +1,25 @@
|
||||
{
|
||||
"$schema": "https://ui.shadcn.com/schema.json",
|
||||
"style": "radix-nova",
|
||||
"rsc": false,
|
||||
"tsx": true,
|
||||
"tailwind": {
|
||||
"config": "tailwind.config.js",
|
||||
"css": "src/styles/globals.css",
|
||||
"baseColor": "neutral",
|
||||
"cssVariables": true,
|
||||
"prefix": ""
|
||||
},
|
||||
"iconLibrary": "lucide",
|
||||
"rtl": false,
|
||||
"aliases": {
|
||||
"components": "@/components",
|
||||
"utils": "@/lib/utils",
|
||||
"ui": "@/components/shadcn/ui",
|
||||
"lib": "@/lib",
|
||||
"hooks": "@/hooks"
|
||||
},
|
||||
"menuColor": "default",
|
||||
"menuAccent": "subtle",
|
||||
"registries": {}
|
||||
}
|
||||
67
frontend/components/ui/button.tsx
Normal file
@@ -0,0 +1,67 @@
|
||||
import * as React from "react"
|
||||
import { cva, type VariantProps } from "class-variance-authority"
|
||||
import { Slot } from "radix-ui"
|
||||
|
||||
import { cn } from "@/lib/utils"
|
||||
|
||||
const buttonVariants = cva(
|
||||
"group/button inline-flex shrink-0 items-center justify-center rounded-lg border border-transparent bg-clip-padding text-sm font-medium whitespace-nowrap transition-all outline-none select-none focus-visible:border-ring focus-visible:ring-3 focus-visible:ring-ring/50 active:not-aria-[haspopup]:translate-y-px disabled:pointer-events-none disabled:opacity-50 aria-invalid:border-destructive aria-invalid:ring-3 aria-invalid:ring-destructive/20 dark:aria-invalid:border-destructive/50 dark:aria-invalid:ring-destructive/40 [&_svg]:pointer-events-none [&_svg]:shrink-0 [&_svg:not([class*='size-'])]:size-4",
|
||||
{
|
||||
variants: {
|
||||
variant: {
|
||||
default: "bg-primary text-primary-foreground [a]:hover:bg-primary/80",
|
||||
outline:
|
||||
"border-border bg-background hover:bg-muted hover:text-foreground aria-expanded:bg-muted aria-expanded:text-foreground dark:border-input dark:bg-input/30 dark:hover:bg-input/50",
|
||||
secondary:
|
||||
"bg-secondary text-secondary-foreground hover:bg-secondary/80 aria-expanded:bg-secondary aria-expanded:text-secondary-foreground",
|
||||
ghost:
|
||||
"hover:bg-muted hover:text-foreground aria-expanded:bg-muted aria-expanded:text-foreground dark:hover:bg-muted/50",
|
||||
destructive:
|
||||
"bg-destructive/10 text-destructive hover:bg-destructive/20 focus-visible:border-destructive/40 focus-visible:ring-destructive/20 dark:bg-destructive/20 dark:hover:bg-destructive/30 dark:focus-visible:ring-destructive/40",
|
||||
link: "text-primary underline-offset-4 hover:underline",
|
||||
},
|
||||
size: {
|
||||
default:
|
||||
"h-8 gap-1.5 px-2.5 has-data-[icon=inline-end]:pr-2 has-data-[icon=inline-start]:pl-2",
|
||||
xs: "h-6 gap-1 rounded-[min(var(--radius-md),10px)] px-2 text-xs in-data-[slot=button-group]:rounded-lg has-data-[icon=inline-end]:pr-1.5 has-data-[icon=inline-start]:pl-1.5 [&_svg:not([class*='size-'])]:size-3",
|
||||
sm: "h-7 gap-1 rounded-[min(var(--radius-md),12px)] px-2.5 text-[0.8rem] in-data-[slot=button-group]:rounded-lg has-data-[icon=inline-end]:pr-1.5 has-data-[icon=inline-start]:pl-1.5 [&_svg:not([class*='size-'])]:size-3.5",
|
||||
lg: "h-9 gap-1.5 px-2.5 has-data-[icon=inline-end]:pr-2 has-data-[icon=inline-start]:pl-2",
|
||||
icon: "size-8",
|
||||
"icon-xs":
|
||||
"size-6 rounded-[min(var(--radius-md),10px)] in-data-[slot=button-group]:rounded-lg [&_svg:not([class*='size-'])]:size-3",
|
||||
"icon-sm":
|
||||
"size-7 rounded-[min(var(--radius-md),12px)] in-data-[slot=button-group]:rounded-lg",
|
||||
"icon-lg": "size-9",
|
||||
},
|
||||
},
|
||||
defaultVariants: {
|
||||
variant: "default",
|
||||
size: "default",
|
||||
},
|
||||
}
|
||||
)
|
||||
|
||||
function Button({
|
||||
className,
|
||||
variant = "default",
|
||||
size = "default",
|
||||
asChild = false,
|
||||
...props
|
||||
}: React.ComponentProps<"button"> &
|
||||
VariantProps<typeof buttonVariants> & {
|
||||
asChild?: boolean
|
||||
}) {
|
||||
const Comp = asChild ? Slot.Root : "button"
|
||||
|
||||
return (
|
||||
<Comp
|
||||
data-slot="button"
|
||||
data-variant={variant}
|
||||
data-size={size}
|
||||
className={cn(buttonVariants({ variant, size, className }))}
|
||||
{...props}
|
||||
/>
|
||||
)
|
||||
}
|
||||
|
||||
export { Button }
|
||||
60
frontend/package-lock.json
generated
@@ -9,7 +9,8 @@
|
||||
"version": "0.0.0",
|
||||
"dependencies": {
|
||||
"react": "^19.2.5",
|
||||
"react-dom": "^19.2.5"
|
||||
"react-dom": "^19.2.5",
|
||||
"react-router-dom": "^7.9.6"
|
||||
},
|
||||
"devDependencies": {
|
||||
"@eslint/js": "^10.0.1",
|
||||
@@ -1631,6 +1632,19 @@
|
||||
"dev": true,
|
||||
"license": "MIT"
|
||||
},
|
||||
"node_modules/cookie": {
|
||||
"version": "1.1.1",
|
||||
"resolved": "https://registry.npmjs.org/cookie/-/cookie-1.1.1.tgz",
|
||||
"integrity": "sha512-ei8Aos7ja0weRpFzJnEA9UHJ/7XQmqglbRwnf2ATjcB9Wq874VKH9kfjjirM6UhU2/E5fFYadylyhFldcqSidQ==",
|
||||
"license": "MIT",
|
||||
"engines": {
|
||||
"node": ">=18"
|
||||
},
|
||||
"funding": {
|
||||
"type": "opencollective",
|
||||
"url": "https://opencollective.com/express"
|
||||
}
|
||||
},
|
||||
"node_modules/cross-spawn": {
|
||||
"version": "7.0.6",
|
||||
"resolved": "https://registry.npmjs.org/cross-spawn/-/cross-spawn-7.0.6.tgz",
|
||||
@@ -2751,6 +2765,44 @@
|
||||
"react": "^19.2.5"
|
||||
}
|
||||
},
|
||||
"node_modules/react-router": {
|
||||
"version": "7.15.1",
|
||||
"resolved": "https://registry.npmjs.org/react-router/-/react-router-7.15.1.tgz",
|
||||
"integrity": "sha512-R8rl9HhgikFYoPJymnUtPXWbnDb3oget6lQnfIoupbt61aT9aOhRkDsY2XRhZRyX1Z/8a5sL74fXmFNm3NRK5A==",
|
||||
"license": "MIT",
|
||||
"dependencies": {
|
||||
"cookie": "^1.0.1",
|
||||
"set-cookie-parser": "^2.6.0"
|
||||
},
|
||||
"engines": {
|
||||
"node": ">=20.0.0"
|
||||
},
|
||||
"peerDependencies": {
|
||||
"react": ">=18",
|
||||
"react-dom": ">=18"
|
||||
},
|
||||
"peerDependenciesMeta": {
|
||||
"react-dom": {
|
||||
"optional": true
|
||||
}
|
||||
}
|
||||
},
|
||||
"node_modules/react-router-dom": {
|
||||
"version": "7.15.1",
|
||||
"resolved": "https://registry.npmjs.org/react-router-dom/-/react-router-dom-7.15.1.tgz",
|
||||
"integrity": "sha512-AzF62gjY6U9rkMq4RfP/r2EVtQ7DMfNMjyOp/flLTCrtRylLiK4wT4pSq6O8rOXZ2eXdZYJPEYe+ifomiv+Igg==",
|
||||
"license": "MIT",
|
||||
"dependencies": {
|
||||
"react-router": "7.15.1"
|
||||
},
|
||||
"engines": {
|
||||
"node": ">=20.0.0"
|
||||
},
|
||||
"peerDependencies": {
|
||||
"react": ">=18",
|
||||
"react-dom": ">=18"
|
||||
}
|
||||
},
|
||||
"node_modules/rolldown": {
|
||||
"version": "1.0.0-rc.17",
|
||||
"resolved": "https://registry.npmjs.org/rolldown/-/rolldown-1.0.0-rc.17.tgz",
|
||||
@@ -2808,6 +2860,12 @@
|
||||
"semver": "bin/semver.js"
|
||||
}
|
||||
},
|
||||
"node_modules/set-cookie-parser": {
|
||||
"version": "2.7.2",
|
||||
"resolved": "https://registry.npmjs.org/set-cookie-parser/-/set-cookie-parser-2.7.2.tgz",
|
||||
"integrity": "sha512-oeM1lpU/UvhTxw+g3cIfxXHyJRc/uidd3yK1P242gzHds0udQBYzs3y8j4gCCW+ZJ7ad0yctld8RYO+bdurlvw==",
|
||||
"license": "MIT"
|
||||
},
|
||||
"node_modules/shebang-command": {
|
||||
"version": "2.0.0",
|
||||
"resolved": "https://registry.npmjs.org/shebang-command/-/shebang-command-2.0.0.tgz",
|
||||
|
||||
@@ -10,8 +10,17 @@
|
||||
"preview": "vite preview"
|
||||
},
|
||||
"dependencies": {
|
||||
"@fontsource-variable/geist": "^5.2.9",
|
||||
"class-variance-authority": "^0.7.1",
|
||||
"clsx": "^2.1.1",
|
||||
"lucide-react": "^1.16.0",
|
||||
"radix-ui": "^1.4.3",
|
||||
"react": "^19.2.5",
|
||||
"react-dom": "^19.2.5"
|
||||
"react-dom": "^19.2.5",
|
||||
"react-router-dom": "^7.9.6",
|
||||
"shadcn": "^4.8.0",
|
||||
"tailwind-merge": "^3.6.0",
|
||||
"tw-animate-css": "^1.4.0"
|
||||
},
|
||||
"devDependencies": {
|
||||
"@eslint/js": "^10.0.1",
|
||||
|
||||
4076
frontend/pnpm-lock.yaml
generated
File diff suppressed because it is too large
@@ -1,46 +1,11 @@
|
||||
import './styles/globals.css';
|
||||
import { ThemeProvider, AppProvider, useApp, useTheme } from './contexts';
|
||||
import { Header, Tabs } from './components/layout';
|
||||
import { CompliancePage } from './pages/Compliance';
|
||||
import { DocsPage } from './pages/Docs';
|
||||
import { StatusPage } from './pages/Status';
|
||||
import { RagChatPage } from './pages/RagChat';
|
||||
|
||||
const PageContent = () => {
|
||||
const { activeTab } = useApp();
|
||||
|
||||
switch (activeTab) {
|
||||
case 'docs':
|
||||
return <DocsPage />;
|
||||
case 'compliance':
|
||||
return <CompliancePage />;
|
||||
case 'status':
|
||||
return <StatusPage />;
|
||||
case 'rag':
|
||||
return <RagChatPage />;
|
||||
default:
|
||||
return <CompliancePage />;
|
||||
}
|
||||
};
|
||||
|
||||
const AppContent = () => {
|
||||
const { theme } = useTheme();
|
||||
|
||||
return (
|
||||
<div className="h-full flex flex-col min-h-screen" style={{ backgroundColor: theme.bg }}>
|
||||
<Header />
|
||||
<Tabs />
|
||||
<PageContent />
|
||||
</div>
|
||||
);
|
||||
};
|
||||
import { ThemeProvider } from './contexts';
|
||||
import { AppRouter } from './router/AppRouter';
|
||||
|
||||
function App() {
|
||||
return (
|
||||
<ThemeProvider>
|
||||
<AppProvider>
|
||||
<AppContent />
|
||||
</AppProvider>
|
||||
<AppRouter />
|
||||
</ThemeProvider>
|
||||
);
|
||||
}
|
||||
|
||||
128
frontend/src/api/perception.ts
Normal file
@@ -0,0 +1,128 @@
|
||||
const PERCEPTION_API_BASE = '/api/v1';
|
||||
|
||||
export type ImpactLevel = 'high' | 'medium' | 'low';
|
||||
export type EventStatus = 'enacted' | 'draft' | 'consultation';
|
||||
export type EventSource = 'MIIT' | 'UN-ECE' | 'ISO' | '国标委' | 'EUR-Lex' | 'IATF';
|
||||
|
||||
export interface RegulationEvent {
|
||||
id: string;
|
||||
source: EventSource;
|
||||
source_label: string;
|
||||
standard_code: string;
|
||||
title: string;
|
||||
summary: string;
|
||||
impact_level: ImpactLevel;
|
||||
published_at: string;
|
||||
effective_at: string | null;
|
||||
category: string;
|
||||
tags: string[];
|
||||
source_url: string;
|
||||
status: EventStatus;
|
||||
}
|
||||
|
||||
export interface PerceptionStats {
|
||||
total: number;
|
||||
high_impact: number;
|
||||
medium_impact: number;
|
||||
low_impact: number;
|
||||
recent_90d: number;
|
||||
}
|
||||
|
||||
export interface EventListResponse {
|
||||
events: RegulationEvent[];
|
||||
total: number;
|
||||
}
|
||||
|
||||
export interface AffectedDoc {
|
||||
doc_id: string;
|
||||
doc_name: string;
|
||||
score: number;
|
||||
snippet: string;
|
||||
clause: string;
|
||||
}
|
||||
|
||||
export interface AnalysisSSEMessage {
|
||||
type: 'sources' | 'content' | 'done' | 'error';
|
||||
docs?: AffectedDoc[];
|
||||
text?: string;
|
||||
}
|
||||
|
||||
export async function getPerceptionStats(): Promise<PerceptionStats> {
|
||||
const res = await fetch(`${PERCEPTION_API_BASE}/perception/stats`);
|
||||
if (!res.ok) throw new Error(`stats failed: ${res.status}`);
|
||||
return res.json() as Promise<PerceptionStats>;
|
||||
}
|
||||
|
||||
export async function listEvents(params?: {
|
||||
source?: string;
|
||||
impact_level?: string;
|
||||
limit?: number;
|
||||
}): Promise<EventListResponse> {
|
||||
const query = new URLSearchParams();
|
||||
if (params?.source) query.set('source', params.source);
|
||||
if (params?.impact_level) query.set('impact_level', params.impact_level);
|
||||
if (params?.limit) query.set('limit', String(params.limit));
|
||||
const res = await fetch(`${PERCEPTION_API_BASE}/perception/events?${query.toString()}`);
|
||||
if (!res.ok) throw new Error(`list events failed: ${res.status}`);
|
||||
return res.json() as Promise<EventListResponse>;
|
||||
}
|
||||
|
||||
export async function analyzeEvent(
|
||||
eventId: string,
|
||||
onMessage: (msg: AnalysisSSEMessage) => void,
|
||||
onComplete?: () => void,
|
||||
signal?: AbortSignal,
|
||||
): Promise<void> {
|
||||
try {
|
||||
const res = await fetch(`${PERCEPTION_API_BASE}/perception/events/${eventId}/analyze`, {
|
||||
method: 'POST',
|
||||
headers: { Accept: 'text/event-stream' },
|
||||
signal,
|
||||
});
|
||||
if (!res.ok || !res.body) throw new Error(`analyze failed: ${res.status}`);
|
||||
|
||||
const reader = res.body.getReader();
|
||||
const decoder = new TextDecoder();
|
||||
let buffer = '';
|
||||
|
||||
while (true) {
|
||||
const { done, value } = await reader.read();
|
||||
if (done) break;
|
||||
buffer += decoder.decode(value, { stream: true });
|
||||
const parts = buffer.split('\n\n');
|
||||
buffer = parts.pop() ?? '';
|
||||
for (const block of parts) {
|
||||
if (!block.trim()) continue;
|
||||
let eventName = 'message';
|
||||
const dataLines: string[] = [];
|
||||
for (const line of block.split('\n')) {
|
||||
if (line.startsWith('event:')) eventName = line.slice(6).trim();
|
||||
else if (line.startsWith('data:')) dataLines.push(line.slice(5).trim());
|
||||
}
|
||||
const payload = dataLines.join('\n');
|
||||
if (!payload) continue;
|
||||
|
||||
if (eventName === 'sources') {
|
||||
try {
|
||||
const docs = JSON.parse(payload) as AffectedDoc[];
|
||||
onMessage({ type: 'sources', docs });
|
||||
} catch { /* ignore */ }
|
||||
} else if (eventName === 'content') {
|
||||
onMessage({ type: 'content', text: payload });
|
||||
} else if (eventName === 'done') {
|
||||
onMessage({ type: 'done' });
|
||||
} else if (eventName === 'error') {
|
||||
onMessage({ type: 'error', text: payload });
|
||||
}
|
||||
}
|
||||
}
|
||||
if (buffer.trim()) {
|
||||
// flush remaining
|
||||
}
|
||||
onComplete?.();
|
||||
} catch (err) {
|
||||
if (err instanceof DOMException && err.name === 'AbortError') return;
|
||||
onMessage({ type: 'error', text: err instanceof Error ? err.message : String(err) });
|
||||
onComplete?.();
|
||||
}
|
||||
}
|
||||
@@ -1,35 +1,34 @@
|
||||
import React from 'react';
|
||||
import { Moon, Sun, SunMedium } from 'lucide-react';
|
||||
|
||||
import { useTheme } from '../../contexts';
|
||||
import { Button } from '../shadcn/ui/button';
|
||||
|
||||
const NEXT_LABELS: Record<string, string> = {
|
||||
dark: '过渡色模式',
|
||||
dim: '亮色模式',
|
||||
light: '暗色模式',
|
||||
};
|
||||
|
||||
export const ThemeToggle: React.FC = () => {
|
||||
const { isDark, toggleTheme, theme } = useTheme();
|
||||
const { themeMode, toggleTheme } = useTheme();
|
||||
|
||||
// Shows the NEXT state's icon: dark→SunMedium(dim next), dim→Sun(light next), light→Moon(dark next)
|
||||
const Icon =
|
||||
themeMode === 'dark' ? SunMedium :
|
||||
themeMode === 'dim' ? Sun :
|
||||
Moon;
|
||||
|
||||
return (
|
||||
<button
|
||||
<Button
|
||||
onClick={toggleTheme}
|
||||
style={{
|
||||
width: 44,
|
||||
height: 44,
|
||||
borderRadius: 10,
|
||||
background: isDark ? theme.bgHover : theme.bgCard,
|
||||
border: `1px solid ${theme.border}`,
|
||||
cursor: 'pointer',
|
||||
display: 'flex',
|
||||
alignItems: 'center',
|
||||
justifyContent: 'center',
|
||||
transition: 'all 0.3s ease',
|
||||
}}
|
||||
variant="outline"
|
||||
size="icon-lg"
|
||||
className="rounded-xl border-border bg-card text-muted-foreground hover:bg-muted hover:text-foreground"
|
||||
aria-label={`切换到${NEXT_LABELS[themeMode]}`}
|
||||
title={`切换到${NEXT_LABELS[themeMode]}`}
|
||||
>
|
||||
{isDark ? (
|
||||
<svg width="20" height="20" viewBox="0 0 24 24" fill="none">
|
||||
<circle cx="12" cy="12" r="4" fill={theme.accent}/>
|
||||
<path d="M12 2V4M12 20V22M4 12H2M22 12H20M6.34 6.34L4.93 4.93M19.07 19.07L17.66 17.66M6.34 17.66L4.93 19.07M19.07 4.93L17.66 6.34" stroke={theme.accent} strokeWidth="2" strokeLinecap="round"/>
|
||||
</svg>
|
||||
) : (
|
||||
<svg width="20" height="20" viewBox="0 0 24 24" fill="none">
|
||||
<path d="M21 12.79A9 9 0 1 1 11.21 3 7 7 0 0 0 21 12.79z" fill={theme.accent} stroke={theme.accent} strokeWidth="1"/>
|
||||
</svg>
|
||||
)}
|
||||
</button>
|
||||
<Icon />
|
||||
</Button>
|
||||
);
|
||||
};
|
||||
|
||||
22
frontend/src/components/layout/AppShell.tsx
Normal file
@@ -0,0 +1,22 @@
|
||||
import { useLocation } from 'react-router-dom';
|
||||
|
||||
import { FooterLayout } from './FooterLayout';
|
||||
import { HeaderLayout } from './HeaderLayout';
|
||||
import { ContentLayout } from './ContentLayout';
|
||||
import { KeepAliveViewport } from './KeepAliveViewport';
|
||||
import { getTabByPath } from '../../router/tabs';
|
||||
|
||||
export function AppShell() {
|
||||
const location = useLocation();
|
||||
const activeTab = getTabByPath(location.pathname);
|
||||
|
||||
return (
|
||||
<div className="flex min-h-screen flex-col bg-t-bg text-t-text">
|
||||
<HeaderLayout activeTab={activeTab} />
|
||||
<ContentLayout tab={activeTab}>
|
||||
<KeepAliveViewport activeTab={activeTab} />
|
||||
</ContentLayout>
|
||||
<FooterLayout />
|
||||
</div>
|
||||
);
|
||||
}
|
||||
@@ -1,27 +0,0 @@
|
||||
import React from 'react';
|
||||
import { useTheme } from '../../contexts';
|
||||
|
||||
interface ContentProps {
|
||||
children: React.ReactNode;
|
||||
wide?: boolean;
|
||||
}
|
||||
|
||||
export const Content: React.FC<ContentProps> = ({ children, wide = false }) => {
|
||||
const { theme } = useTheme();
|
||||
|
||||
return (
|
||||
<main
|
||||
style={{
|
||||
flex: 1,
|
||||
padding: '48px 56px',
|
||||
maxWidth: wide ? 1400 : 1100,
|
||||
margin: '0 auto',
|
||||
width: '100%',
|
||||
position: 'relative',
|
||||
backgroundColor: theme.bg,
|
||||
}}
|
||||
>
|
||||
{children}
|
||||
</main>
|
||||
);
|
||||
};
|
||||
40
frontend/src/components/layout/ContentLayout.tsx
Normal file
@@ -0,0 +1,40 @@
|
||||
import type { ReactNode } from 'react';
|
||||
|
||||
import type { AppTabConfig } from '../../router/tabs';
|
||||
import { shellFrameClassName } from './shell-config';
|
||||
|
||||
interface ContentLayoutProps {
|
||||
children: ReactNode;
|
||||
tab: AppTabConfig;
|
||||
}
|
||||
|
||||
const widthClassMap = {
|
||||
default: 'mx-auto w-full max-w-[1120px]',
|
||||
wide: 'mx-auto w-full max-w-[1440px]',
|
||||
full: 'w-full',
|
||||
} as const;
|
||||
|
||||
export function ContentLayout({ children, tab }: ContentLayoutProps) {
|
||||
const widthClass = widthClassMap[tab.contentWidth];
|
||||
|
||||
return (
|
||||
<main className="flex min-h-0 flex-1 bg-t-bg">
|
||||
<div
|
||||
className={[
|
||||
shellFrameClassName,
|
||||
'relative flex min-h-0 flex-1 justify-center py-8',
|
||||
].join(' ')}
|
||||
>
|
||||
<div
|
||||
className={[
|
||||
'relative flex min-h-0 w-full',
|
||||
widthClass,
|
||||
tab.fillHeight ? 'overflow-hidden' : '',
|
||||
].join(' ')}
|
||||
>
|
||||
{children}
|
||||
</div>
|
||||
</div>
|
||||
</main>
|
||||
);
|
||||
}
|
||||
38
frontend/src/components/layout/FooterLayout.tsx
Normal file
@@ -0,0 +1,38 @@
import { Badge } from '../shadcn/ui/badge';
import { Separator } from '../shadcn/ui/separator';

import { shellFrameClassName, shellMeta } from './shell-config';

export function FooterLayout() {
  return (
    <footer className="border-t border-t-border bg-t-bg">
      <div
        className={[
          shellFrameClassName,
          'flex items-center justify-between gap-6 py-4 text-xs text-t-text3',
        ].join(' ')}
      >
        <div className="min-w-0 max-w-[360px]">
          <div className="mono mb-1 tracking-[0.18em] text-t-text2">
            {shellMeta.productLabel}
          </div>
        </div>

        <div className="flex shrink-0 items-center gap-3 whitespace-nowrap rounded-xl border border-border bg-card px-3 py-2 shadow-sm">
          <Badge variant="secondary" className="mono border-0 bg-transparent px-0 py-0 text-[11px] tracking-[0.24em] text-muted-foreground">
            {shellMeta.version}
          </Badge>
          <Separator orientation="vertical" className="h-4 bg-border" />
          <Badge variant="outline" className="mono gap-2 border-0 bg-transparent px-0 py-0 text-[11px] tracking-[0.24em] text-[var(--t-green)]">
            <span className="size-2 rounded-full bg-[var(--t-green)]" />
            {shellMeta.status}
          </Badge>
          <Separator orientation="vertical" className="h-4 bg-border" />
          <span className="mono text-[11px] tracking-[0.18em] text-muted-foreground">
            {shellMeta.surface}
          </span>
        </div>
      </div>
    </footer>
  );
}
@@ -1,47 +0,0 @@
import React from 'react';
import { useTheme } from '../../contexts';
import { TLogo } from '../common/TLogo';
import { ThemeToggle } from '../common/ThemeToggle';

export const Header: React.FC = () => {
  const { theme } = useTheme();

  return (
    <header
      className="h-[72px] flex items-center justify-between sticky top-0 z-[100]"
      style={{
        padding: '0 48px',
        borderBottom: `1px solid ${theme.border}`,
        backgroundColor: theme.bg,
      }}
    >
      <div className="flex items-center" style={{ gap: 20 }}>
        <TLogo size={80} />
        <div className="flex items-baseline" style={{ gap: 12 }}>
          <span style={{ fontWeight: 700, fontSize: 20, letterSpacing: '-0.5px', color: theme.text }}>
            T-Systems
          </span>
          <span style={{ fontWeight: 300, fontSize: 16, color: theme.text2 }}>
            Regulation
          </span>
        </div>
      </div>
      <div className="flex items-center" style={{ gap: 16 }}>
        <ThemeToggle />
        <div
          className="flex items-center rounded-lg"
          style={{
            padding: '8px 16px',
            gap: 8,
            backgroundColor: theme.bgHover,
            borderRadius: 8,
          }}
        >
          <span className="mono" style={{ fontSize: 11, color: theme.text3 }}>v1.0.0</span>
          <div style={{ width: 1, height: 12, background: theme.border }} />
          <span className="mono" style={{ fontSize: 12, color: theme.green }}>● ONLINE</span>
        </div>
      </div>
    </header>
  );
};
19
frontend/src/components/layout/HeaderBrand.tsx
Normal file
@@ -0,0 +1,19 @@
import { TLogo } from '../common/TLogo';

export function HeaderBrand() {
  return (
    <div className="flex min-w-[280px] shrink-0 items-center gap-4 whitespace-nowrap">
      <div className="shrink-0">
        <TLogo size={46} />
      </div>
      <div className="flex min-w-0 items-center gap-2 whitespace-nowrap">
        <span className="text-[1.18rem] font-semibold tracking-[-0.04em] text-foreground">
          T-Systems
        </span>
        <span className="text-[1.02rem] font-light text-muted-foreground">
          Regulation
        </span>
      </div>
    </div>
  );
}
38
frontend/src/components/layout/HeaderLayout.tsx
Normal file
@@ -0,0 +1,38 @@
import type { AppTabConfig } from '../../router/tabs';
import { ThemeToggle } from '../common/ThemeToggle';
import { Badge } from '../shadcn/ui/badge';
import { Separator } from '../shadcn/ui/separator';

import { HeaderBrand } from './HeaderBrand';
import { shellFrameClassName, shellMeta } from './shell-config';
import { TabNav } from './TabNav';

interface HeaderLayoutProps {
  activeTab: AppTabConfig;
}

export function HeaderLayout({ activeTab }: HeaderLayoutProps) {
  return (
    <header className="sticky top-0 z-[100] border-b border-border bg-background/95 backdrop-blur supports-[backdrop-filter]:bg-background/80">
      <div className={[shellFrameClassName, 'flex h-20 items-center gap-8'].join(' ')}>
        <HeaderBrand />
        <div className="min-w-0 flex-1 self-stretch overflow-hidden">
          <TabNav activeTab={activeTab} />
        </div>
        <div className="ml-auto flex shrink-0 items-center gap-3 self-center">
          <ThemeToggle />
          <div className="flex h-11 shrink-0 items-center gap-3 whitespace-nowrap rounded-xl border border-border bg-card px-3 shadow-sm">
            <Badge variant="secondary" className="mono border-0 bg-transparent px-0 py-0 text-[11px] tracking-[0.24em] text-muted-foreground">
              {shellMeta.version}
            </Badge>
            <Separator orientation="vertical" className="h-4 bg-border" />
            <Badge variant="outline" className="mono gap-2 border-0 bg-transparent px-0 py-0 text-[11px] tracking-[0.24em] text-[var(--t-green)]">
              <span className="size-2 rounded-full bg-[var(--t-green)]" />
              {shellMeta.status}
            </Badge>
          </div>
        </div>
      </div>
    </header>
  );
}
45
frontend/src/components/layout/KeepAliveViewport.tsx
Normal file
@@ -0,0 +1,45 @@
import { useEffect, useState } from 'react';

import { appTabs, type AppTabConfig } from '../../router/tabs';

interface KeepAliveViewportProps {
  activeTab: AppTabConfig;
}

export function KeepAliveViewport({ activeTab }: KeepAliveViewportProps) {
  const [mountedTabIds, setMountedTabIds] = useState<string[]>([activeTab.id]);

  useEffect(() => {
    const timerId = window.setTimeout(() => {
      setMountedTabIds((prev) => (prev.includes(activeTab.id) ? prev : [...prev, activeTab.id]));
    }, 0);
    return () => window.clearTimeout(timerId);
  }, [activeTab.id]);

  return (
    <div className="flex min-h-0 flex-1">
      {appTabs.map((tab) => {
        const shouldRender = tab.keepAlive ? mountedTabIds.includes(tab.id) : tab.id === activeTab.id;
        if (!shouldRender) {
          return null;
        }

        const TabComponent = tab.component;
        const isActive = tab.id === activeTab.id;

        return (
          <div
            key={tab.id}
            aria-hidden={!isActive}
            className={[
              'min-h-0 flex-1',
              isActive ? 'flex' : 'hidden',
            ].join(' ')}
          >
            <TabComponent />
          </div>
        );
      })}
    </div>
  );
}
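Review note: the mount-tracking rule in `KeepAliveViewport.tsx` can be read as two pure functions. The sketch below is an illustration only, not code from this commit; `TabLike`, `rememberTab`, and `shouldRenderTab` are hypothetical names, and `keepAlive` is assumed to be a boolean flag on each tab config.

```typescript
// Hypothetical trimmed-down tab shape; the real AppTabConfig carries more fields.
interface TabLike {
  id: string;
  keepAlive: boolean;
}

// Record the active tab id once. Returns the same array when nothing changes,
// which lets React skip a re-render when this is used inside a state updater.
function rememberTab(mounted: readonly string[], activeId: string): readonly string[] {
  return mounted.includes(activeId) ? mounted : [...mounted, activeId];
}

// Keep-alive tabs stay rendered once they have been mounted;
// all other tabs render only while they are the active tab.
function shouldRenderTab(tab: TabLike, activeId: string, mounted: readonly string[]): boolean {
  return tab.keepAlive ? mounted.includes(tab.id) : tab.id === activeId;
}
```

Because the component appends to `mountedTabIds` from a zero-delay timer, a keep-alive tab that is activated for the first time is not yet in the list during that render, so it seems to appear one tick later.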
125
frontend/src/components/layout/TabNav.tsx
Normal file
@@ -0,0 +1,125 @@
import { useEffect, useLayoutEffect, useRef, useState } from 'react';
import { useNavigate } from 'react-router-dom';

import type { AppTabConfig, TabId } from '../../router/tabs';
import { appTabs } from '../../router/tabs';

interface TabNavProps {
  activeTab: AppTabConfig;
}

interface IndicatorStyle {
  opacity: number;
  transform: string;
  width: number;
}

const reducedMotionQuery = '(prefers-reduced-motion: reduce)';

export function TabNav({ activeTab }: TabNavProps) {
  const navigate = useNavigate();
  const trackRef = useRef<HTMLDivElement | null>(null);
  const buttonRefs = useRef<Record<TabId, HTMLButtonElement | null>>({
    perception: null,
    docs: null,
    compliance: null,
    status: null,
    rag: null,
  });
  const [indicatorStyle, setIndicatorStyle] = useState<IndicatorStyle>({
    opacity: 0,
    transform: 'translateX(0px)',
    width: 0,
  });
  const [reducedMotion, setReducedMotion] = useState(false);

  const handleValueChange = (value: string) => {
    const nextTab = appTabs.find((tab) => tab.id === value);
    if (nextTab && nextTab.path !== activeTab.path) {
      navigate(nextTab.path);
    }
  };

  useEffect(() => {
    const mediaQuery = window.matchMedia(reducedMotionQuery);
    const updateMotionPreference = () => {
      setReducedMotion(mediaQuery.matches);
    };

    updateMotionPreference();
    mediaQuery.addEventListener('change', updateMotionPreference);

    return () => {
      mediaQuery.removeEventListener('change', updateMotionPreference);
    };
  }, []);

  useLayoutEffect(() => {
    const updateIndicator = () => {
      const trackNode = trackRef.current;
      const activeNode = buttonRefs.current[activeTab.id];
      if (!trackNode || !activeNode) {
        return;
      }

      const trackRect = trackNode.getBoundingClientRect();
      const activeRect = activeNode.getBoundingClientRect();

      setIndicatorStyle({
        opacity: 1,
        transform: `translateX(${activeRect.left - trackRect.left}px)`,
        width: activeRect.width,
      });
    };

    updateIndicator();
    window.addEventListener('resize', updateIndicator);

    return () => {
      window.removeEventListener('resize', updateIndicator);
    };
  }, [activeTab.id]);

  return (
    <nav className="flex h-full min-w-0 items-stretch overflow-x-auto overflow-y-hidden">
      <div
        ref={trackRef}
        className="relative flex h-full min-w-max flex-nowrap items-stretch gap-3 pr-6"
      >
        <div
          aria-hidden="true"
          className={[
            'pointer-events-none absolute bottom-0 left-0 h-0.5 rounded-full bg-primary',
            reducedMotion
              ? 'transition-none'
              : 'transition-[transform,width,opacity] duration-220 ease-[cubic-bezier(0.22,1,0.36,1)]',
          ].join(' ')}
          style={indicatorStyle}
        />
        {appTabs.map((tab) => (
          <button
            key={tab.id}
            ref={(node) => {
              buttonRefs.current[tab.id] = node;
            }}
            data-shell-tab="true"
            type="button"
            onClick={() => handleValueChange(tab.id satisfies TabId)}
            aria-current={tab.id === activeTab.id ? 'page' : undefined}
            className={[
              'inline-flex h-full shrink-0 appearance-none items-center justify-center border-0 border-b-2 border-transparent bg-transparent px-5 pt-1 text-[0.95rem] font-medium tracking-[0.02em] outline-none',
              reducedMotion
                ? 'transition-none'
                : 'transition-[color,opacity] duration-200 ease-out',
              tab.id === activeTab.id
                ? 'text-foreground'
                : 'text-muted-foreground hover:text-foreground',
            ].join(' ')}
          >
            {tab.label}
          </button>
        ))}
      </div>
    </nav>
  );
}
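Review note: the sliding indicator in `TabNav.tsx` is positioned from two bounding rectangles. A minimal sketch of that arithmetic, assuming only the `left` and `width` fields matter; `RectLike` and `computeIndicatorStyle` are hypothetical names introduced here for illustration and are not part of the commit.

```typescript
// Hypothetical subset of DOMRect used for the calculation.
interface RectLike {
  left: number;
  width: number;
}

// Same shape as the IndicatorStyle interface in the diff above.
interface IndicatorStyle {
  opacity: number;
  transform: string;
  width: number;
}

// The indicator is absolutely positioned at the track's left edge, so its
// offset is the active button's left edge measured relative to the track.
function computeIndicatorStyle(track: RectLike, active: RectLike): IndicatorStyle {
  return {
    opacity: 1,
    transform: `translateX(${active.left - track.left}px)`,
    width: active.width,
  };
}

// Example: a track starting at x=100 and an active button at x=260, 96px wide.
const example = computeIndicatorStyle({ left: 100, width: 800 }, { left: 260, width: 96 });
console.log(example.transform, example.width);
```

Since both rectangles come from the same scrolling container, the difference between the two `left` values should stay the same when the tab strip is scrolled horizontally.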
@@ -1,48 +0,0 @@
import React from 'react';
import { useTheme, useApp } from '../../contexts';
import type { TabId } from '../../contexts';

const tabs: Array<{ id: TabId; label: string }> = [
  { id: 'docs', label: '文档管理' },
  { id: 'compliance', label: '合规分析' },
  { id: 'status', label: '系统状态' },
  { id: 'rag', label: '法规对话' },
];

export const Tabs: React.FC = () => {
  const { theme } = useTheme();
  const { activeTab, setActiveTab } = useApp();

  return (
    <nav
      className="h-[56px] flex items-center"
      style={{
        padding: '0 48px',
        borderBottom: `1px solid ${theme.border}`,
        backgroundColor: theme.bg,
      }}
    >
      {tabs.map((tab) => (
        <button
          key={tab.id}
          onClick={() => setActiveTab(tab.id)}
          style={{
            height: 56,
            padding: '0 32px',
            fontSize: 15,
            fontWeight: activeTab === tab.id ? 600 : 400,
            color: activeTab === tab.id ? theme.accent : theme.text3,
            background: 'transparent',
            border: 'none',
            borderBottom: activeTab === tab.id ? `3px solid ${theme.accent}` : '3px solid transparent',
            marginBottom: -1,
            cursor: 'pointer',
            transition: 'all 0.2s ease',
          }}
        >
          {tab.label}
        </button>
      ))}
    </nav>
  );
};
@@ -1,3 +1,4 @@
-export { Header } from './Header';
-export { Tabs } from './Tabs';
-export { Content } from './Content';
+export { AppShell } from './AppShell';
+export { ContentLayout } from './ContentLayout';
+export { FooterLayout } from './FooterLayout';
+export { HeaderLayout } from './HeaderLayout';
12
frontend/src/components/layout/shell-config.ts
Normal file
@@ -0,0 +1,12 @@
import { appTabs } from '../../router/tabs';

export const shellFrameClassName = 'mx-auto w-full max-w-[1680px] px-8';

export const shellMeta = {
  productLabel: 'T-Systems Regulation',
  version: 'v1.0.0',
  status: 'ONLINE',
  surface: 'Desktop Web',
} as const;

export const shellModuleSummary = appTabs.map((tab) => tab.label).join(' / ');
30
frontend/src/components/shadcn/ui/badge.tsx
Normal file
@@ -0,0 +1,30 @@
import { cva, type VariantProps } from 'class-variance-authority';
import type * as React from 'react';

import { cn } from '@/lib/utils';

const badgeVariants = cva(
  'inline-flex items-center rounded-md border px-2 py-1 text-[11px] font-medium tracking-[0.22em] uppercase transition-colors',
  {
    variants: {
      variant: {
        default: 'border-primary/30 bg-primary/10 text-primary',
        secondary: 'border-border bg-muted text-muted-foreground',
        outline: 'border-border bg-transparent text-foreground',
      },
    },
    defaultVariants: {
      variant: 'default',
    },
  },
);

function Badge({
  className,
  variant,
  ...props
}: React.ComponentProps<'span'> & VariantProps<typeof badgeVariants>) {
  return <span className={cn(badgeVariants({ variant }), className)} {...props} />;
}

export { Badge };
65
frontend/src/components/shadcn/ui/button.tsx
Normal file
@@ -0,0 +1,65 @@
import * as React from 'react';
import { cva, type VariantProps } from 'class-variance-authority';
import { Slot } from 'radix-ui';

import { cn } from '@/lib/utils';

const buttonVariants = cva(
  'group/button inline-flex shrink-0 items-center justify-center rounded-lg border border-transparent bg-clip-padding text-sm font-medium whitespace-nowrap transition-all outline-none select-none focus-visible:border-ring focus-visible:ring-3 focus-visible:ring-ring/50 active:not-aria-[haspopup]:translate-y-px disabled:pointer-events-none disabled:opacity-50 aria-invalid:border-destructive aria-invalid:ring-3 aria-invalid:ring-destructive/20 dark:aria-invalid:border-destructive/50 dark:aria-invalid:ring-destructive/40 [&_svg]:pointer-events-none [&_svg]:shrink-0 [&_svg:not([class*=size-])]:size-4',
  {
    variants: {
      variant: {
        default: 'bg-primary text-primary-foreground hover:bg-primary/90',
        outline:
          'border-border bg-background hover:bg-muted hover:text-foreground aria-expanded:bg-muted aria-expanded:text-foreground dark:bg-input/30 dark:hover:bg-input/50',
        secondary:
          'bg-secondary text-secondary-foreground hover:bg-secondary/80 aria-expanded:bg-secondary aria-expanded:text-secondary-foreground',
        ghost:
          'hover:bg-muted hover:text-foreground aria-expanded:bg-muted aria-expanded:text-foreground dark:hover:bg-muted/50',
        destructive:
          'bg-destructive/10 text-destructive hover:bg-destructive/20 focus-visible:border-destructive/40 focus-visible:ring-destructive/20 dark:bg-destructive/20 dark:hover:bg-destructive/30 dark:focus-visible:ring-destructive/40',
        link: 'text-primary underline-offset-4 hover:underline',
      },
      size: {
        default:
          'h-8 gap-1.5 px-2.5 has-data-[icon=inline-end]:pr-2 has-data-[icon=inline-start]:pl-2',
        xs: 'h-6 gap-1 rounded-[min(var(--radius-md),10px)] px-2 text-xs has-data-[icon=inline-end]:pr-1.5 has-data-[icon=inline-start]:pl-1.5 [&_svg:not([class*=size-])]:size-3',
        sm: 'h-7 gap-1 rounded-[min(var(--radius-md),12px)] px-2.5 text-[0.8rem] has-data-[icon=inline-end]:pr-1.5 has-data-[icon=inline-start]:pl-1.5 [&_svg:not([class*=size-])]:size-3.5',
        lg: 'h-9 gap-1.5 px-3 has-data-[icon=inline-end]:pr-2 has-data-[icon=inline-start]:pl-2',
        icon: 'size-8',
        'icon-xs': 'size-6 rounded-[min(var(--radius-md),10px)] [&_svg:not([class*=size-])]:size-3',
        'icon-sm': 'size-7 rounded-[min(var(--radius-md),12px)]',
        'icon-lg': 'size-9',
      },
    },
    defaultVariants: {
      variant: 'default',
      size: 'default',
    },
  },
);

function Button({
  className,
  variant,
  size,
  asChild = false,
  ...props
}: React.ComponentProps<'button'> &
  VariantProps<typeof buttonVariants> & {
    asChild?: boolean;
  }) {
  const Comp = asChild ? Slot.Root : 'button';

  return (
    <Comp
      data-slot="button"
      data-variant={variant}
      data-size={size}
      className={cn(buttonVariants({ variant, size, className }))}
      {...props}
    />
  );
}

export { Button };
27
frontend/src/components/shadcn/ui/separator.tsx
Normal file
@@ -0,0 +1,27 @@
import * as React from 'react';
import { Separator as SeparatorPrimitive } from 'radix-ui';

import { cn } from '@/lib/utils';

function Separator({
  className,
  orientation = 'horizontal',
  decorative = true,
  ...props
}: React.ComponentProps<typeof SeparatorPrimitive.Root>) {
  return (
    <SeparatorPrimitive.Root
      data-slot="separator-root"
      decorative={decorative}
      orientation={orientation}
      className={cn(
        'shrink-0 bg-border',
        orientation === 'horizontal' ? 'h-px w-full' : 'h-full w-px',
        className,
      )}
      {...props}
    />
  );
}

export { Separator };
48
frontend/src/components/shadcn/ui/tabs.tsx
Normal file
@@ -0,0 +1,48 @@
import * as React from 'react';
import { Tabs as TabsPrimitive } from 'radix-ui';

import { cn } from '@/lib/utils';

function Tabs({
  className,
  ...props
}: React.ComponentProps<typeof TabsPrimitive.Root>) {
  return (
    <TabsPrimitive.Root
      data-slot="tabs"
      className={cn('w-full', className)}
      {...props}
    />
  );
}

function TabsList({
  className,
  ...props
}: React.ComponentProps<typeof TabsPrimitive.List>) {
  return (
    <TabsPrimitive.List
      data-slot="tabs-list"
      className={cn('inline-flex items-center gap-2', className)}
      {...props}
    />
  );
}

function TabsTrigger({
  className,
  ...props
}: React.ComponentProps<typeof TabsPrimitive.Trigger>) {
  return (
    <TabsPrimitive.Trigger
      data-slot="tabs-trigger"
      className={cn(
        'inline-flex items-center justify-center whitespace-nowrap outline-none transition-colors focus-visible:ring-2 focus-visible:ring-ring focus-visible:ring-offset-2 disabled:pointer-events-none disabled:opacity-50',
        className,
      )}
      {...props}
    />
  );
}

export { Tabs, TabsList, TabsTrigger };
@@ -1,17 +0,0 @@
import { useState, type ReactNode } from 'react';

import { AppContext, type TabId } from './app-context';

interface AppProviderProps {
  children: ReactNode;
}

export const AppProvider: React.FC<AppProviderProps> = ({ children }) => {
  const [activeTab, setActiveTab] = useState<TabId>('compliance');

  return (
    <AppContext.Provider value={{ activeTab, setActiveTab }}>
      {children}
    </AppContext.Provider>
  );
};
@@ -1,33 +1,58 @@
 import React, { useEffect, useState, type ReactNode } from 'react';

-import { darkTheme, lightTheme } from '../types/theme';
+import { darkTheme, dimTheme, lightTheme, type ThemeMode } from '../types/theme';
 import { ThemeContext } from './theme-context';

+const STORAGE_KEY = 'app-theme-mode';
+
+function getInitialMode(): ThemeMode {
+  try {
+    const stored = localStorage.getItem(STORAGE_KEY);
+    if (stored === 'dark' || stored === 'dim' || stored === 'light') return stored;
+  } catch {
+    // ignore
+  }
+  return 'dark';
+}
+
+const THEME_MAP = { dark: darkTheme, dim: dimTheme, light: lightTheme };
+const BG_MAP: Record<ThemeMode, string> = {
+  dark: '#0a0a12',
+  dim: '#1e1b2e',
+  light: '#ffffff',
+};
+
 interface ThemeProviderProps {
   children: ReactNode;
 }

 export const ThemeProvider: React.FC<ThemeProviderProps> = ({ children }) => {
-  const [isDark, setIsDark] = useState<boolean>(true);
-  const theme = isDark ? darkTheme : lightTheme;
+  const [themeMode, setThemeMode] = useState<ThemeMode>(getInitialMode);
+  const theme = THEME_MAP[themeMode];
+  const isDark = themeMode === 'dark';

   const toggleTheme = () => {
-    setIsDark((prev) => !prev);
+    setThemeMode((prev) =>
+      prev === 'dark' ? 'dim' : prev === 'dim' ? 'light' : 'dark'
+    );
   };

   useEffect(() => {
-    if (isDark) {
-      document.documentElement.classList.add('dark');
-      document.body.style.background = '#0a0a12';
-      return;
-    }
+    const root = document.documentElement;
+    root.classList.remove('dark', 'dim');
+    if (themeMode !== 'light') root.classList.add(themeMode);

-    document.documentElement.classList.remove('dark');
-    document.body.style.background = '#ffffff';
-  }, [isDark]);
+    document.body.style.background = BG_MAP[themeMode];
+
+    try {
+      localStorage.setItem(STORAGE_KEY, themeMode);
+    } catch {
+      // ignore
+    }
+  }, [themeMode]);

   return (
-    <ThemeContext.Provider value={{ isDark, theme, toggleTheme }}>
+    <ThemeContext.Provider value={{ isDark, themeMode, theme, toggleTheme }}>
       {children}
     </ThemeContext.Provider>
   );
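Review note: the theme provider change above replaces a boolean toggle with a three-state cycle and validates the stored value before trusting it. A minimal sketch of those two rules as pure functions, assuming `ThemeMode` is the union `'dark' | 'dim' | 'light'` (inferred from the validation in `getInitialMode`; the real type lives in `../types/theme` and is not shown in this diff). `NEXT_MODE`, `nextThemeMode`, and `parseStoredMode` are hypothetical names.

```typescript
// Assumed definition; the actual ThemeMode type is declared in ../types/theme.
type ThemeMode = 'dark' | 'dim' | 'light';

// Lookup table equivalent to the nested ternary in toggleTheme:
// dark -> dim -> light -> dark.
const NEXT_MODE: Record<ThemeMode, ThemeMode> = {
  dark: 'dim',
  dim: 'light',
  light: 'dark',
};

function nextThemeMode(mode: ThemeMode): ThemeMode {
  return NEXT_MODE[mode];
}

// Accept only known modes from storage; anything else falls back to 'dark',
// matching the default in getInitialMode.
function parseStoredMode(stored: string | null): ThemeMode {
  return stored === 'dark' || stored === 'dim' || stored === 'light' ? stored : 'dark';
}
```

Note that `isDark` is now derived as `themeMode === 'dark'`, so any caller that branches on `!isDark` treats `dim` the same way as `light`.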
@@ -1,10 +0,0 @@
import { createContext } from 'react';

export type TabId = 'docs' | 'compliance' | 'status' | 'rag';

export interface AppContextValue {
  activeTab: TabId;
  setActiveTab: (tab: TabId) => void;
}

export const AppContext = createContext<AppContextValue | undefined>(undefined);
@@ -1,5 +1,2 @@
 export { ThemeProvider } from './ThemeContext';
 export { useTheme } from './useTheme';
-export { AppProvider } from './AppContext';
-export type { AppContextValue, TabId } from './app-context';
-export { useApp } from './useApp';
@@ -1,9 +1,10 @@
 import { createContext } from 'react';

-import type { ThemeColors } from '../types/theme';
+import type { ThemeColors, ThemeMode } from '../types/theme';

 export interface ThemeContextValue {
   isDark: boolean;
+  themeMode: ThemeMode;
   theme: ThemeColors;
   toggleTheme: () => void;
 }
@@ -1,11 +0,0 @@
import { useContext } from 'react';

import { AppContext, type AppContextValue } from './app-context';

export function useApp(): AppContextValue {
  const context = useContext(AppContext);
  if (!context) {
    throw new Error('useApp must be used within an AppProvider');
  }
  return context;
}
@@ -587,7 +587,7 @@ export const CompliancePage: React.FC = () => {
       flex: 1,
       display: 'flex',
       height: '100%',
-      minHeight: 'calc(100vh - 128px)',
+      minHeight: 0,
       position: 'relative',
     }}>
       {/* Main Content Area */}
@@ -1,6 +1,5 @@
 import React, { useEffect, useRef, useState } from 'react';
 import { useTheme } from '../../contexts';
-import { Content } from '../../components/layout/Content';
 import { TPattern } from '../../components/common/TPattern';
 import { getDocumentList, getDocumentStatus, searchRegulations, uploadDocument, deleteDocument, retryDocument, type RegulationSearchItem } from '../../api/docs';
 import type { Doc } from '../../types';
@@ -40,6 +39,7 @@ export const DocsPage: React.FC = () => {
   const [searchResults, setSearchResults] = useState<RegulationSearchItem[]>([]);
   const [searchLoading, setSearchLoading] = useState(false);
   const [searchError, setSearchError] = useState('');
+  const [batchQueueLength, setBatchQueueLength] = useState(0);

   // Upload metadata
   const [regulationType, setRegulationType] = useState('');
@@ -48,12 +48,17 @@ export const DocsPage: React.FC = () => {
   // Batch queue: files waiting to be uploaded after the current one finishes
   const batchQueueRef = useRef<File[]>([]);

+  const setBatchQueue = (files: File[]) => {
+    batchQueueRef.current = files;
+    setBatchQueueLength(files.length);
+  };
+
   async function loadDocuments() {
     setLoading(true);
     try {
       const response = await getDocumentList();
-      const apiDocs: Doc[] = response.docs.map((doc) => ({
-        id: parseInt(String(doc.id).replace('doc-', ''), 10) || Math.floor(Math.random() * 10000),
+      const apiDocs: Doc[] = response.docs.map((doc, index) => ({
+        id: Number.parseInt(String(doc.id).replace('doc-', ''), 10) || -(index + 1),
         name: doc.name,
         chunks: doc.chunks,
         size: doc.updated_at ? new Date(doc.updated_at).toLocaleString() : 'Indexed document',
@@ -209,6 +214,7 @@ export const DocsPage: React.FC = () => {

     // Process next file in batch queue
     const next = batchQueueRef.current.shift();
+    setBatchQueueLength(batchQueueRef.current.length);
     if (next) {
       const nextRunId = pipelineRunIdRef.current + 1;
       pipelineRunIdRef.current = nextRunId;
@@ -222,7 +228,7 @@ export const DocsPage: React.FC = () => {
     if (files.length === 0 || uploading) return;

     const [first, ...rest] = files;
-    batchQueueRef.current = rest;
+    setBatchQueue(rest);

     const runId = pipelineRunIdRef.current + 1;
     pipelineRunIdRef.current = runId;
@@ -262,7 +268,7 @@ export const DocsPage: React.FC = () => {
     if (files.length === 0 || uploading) return;

     const [first, ...rest] = files;
-    batchQueueRef.current = rest;
+    setBatchQueue(rest);
     const runId = pipelineRunIdRef.current + 1;
     pipelineRunIdRef.current = runId;
     void uploadSingleFile(first, runId);
@@ -282,8 +288,7 @@ export const DocsPage: React.FC = () => {

   const getPipelineHint = () => {
     if (pipelineStatus === 'running') {
-      const queueLen = batchQueueRef.current.length;
-      const suffix = queueLen > 0 ? ` (+${queueLen} 待上传)` : '';
+      const suffix = batchQueueLength > 0 ? ` (+${batchQueueLength} 待上传)` : '';
       return `${activeStep >= 0 ? PIPELINE_STEPS[activeStep].name : 'LOAD'} · ${uploadFileName}${suffix}`;
     }
     if (pipelineStatus === 'completed') return 'PIPELINE COMPLETE';
@@ -291,6 +296,11 @@ export const DocsPage: React.FC = () => {
     return 'WAITING FOR UPLOAD';
   };

+  const getDocKey = (doc: Doc) => {
+    // Prefer the backend document identifier because the numeric display id is not guaranteed unique.
+    return doc.docId ?? `local-${doc.id}-${doc.name}`;
+  };
+
   const inputStyle: React.CSSProperties = {
     padding: '8px 12px',
     fontSize: 13,
@@ -302,7 +312,7 @@ export const DocsPage: React.FC = () => {
   };

   return (
-    <Content>
+    <div className="relative w-full">
       <TPattern />

       <section style={{ marginBottom: 56 }}>
@@ -432,7 +442,7 @@ export const DocsPage: React.FC = () => {
           <div style={{ display: 'flex', flexDirection: 'column', gap: 12 }}>
             {docs.map((doc) => (
               <div
-                key={doc.id}
+                key={getDocKey(doc)}
                 style={{ display: 'flex', alignItems: 'center', justifyContent: 'space-between', padding: 20, background: theme.bgCard, borderRadius: 12, border: `1px solid ${doc.status === 'parsing' ? theme.accent : theme.border}`, transition: 'all 0.2s ease', boxShadow: !isDark ? '0 2px 8px rgba(226,0,116,0.04)' : 'none' }}
               >
                 <div style={{ display: 'flex', alignItems: 'flex-start', gap: 16 }}>
@@ -543,6 +553,6 @@ export const DocsPage: React.FC = () => {
           </div>
         </div>
       </section>
-    </Content>
+    </div>
   );
 };
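Review note: the `DocsPage` change swaps a random fallback id for a deterministic negative one and keys list rows by the backend identifier. The sketch below isolates both rules so their edge cases are easy to check. It is an illustration under assumptions: `DocLike` is a hypothetical trimmed-down stand-in for the project's `Doc` type, `toDisplayId` is a hypothetical name for the inline expression in `loadDocuments`, and `docId` is assumed to be an optional string.

```typescript
// Hypothetical minimal shape; the real Doc type has more fields.
interface DocLike {
  id: number;
  name: string;
  docId?: string;
}

// Parse the numeric part of ids such as "doc-42". When parsing yields NaN or 0,
// fall back to a negative id derived from the list position, so fallback ids
// never collide with each other or with positive parsed ids.
function toDisplayId(rawId: string | number, index: number): number {
  return Number.parseInt(String(rawId).replace('doc-', ''), 10) || -(index + 1);
}

// Prefer the backend identifier as the React key; otherwise build a local key.
function getDocKey(doc: DocLike): string {
  return doc.docId ?? `local-${doc.id}-${doc.name}`;
}
```

One consequence of the `||` fallback is that an id that parses to `0` (for example `doc-0`) is also replaced by the negative fallback, because `0` is falsy.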
207
frontend/src/pages/Perception/AnalysisPanel.tsx
Normal file
@@ -0,0 +1,207 @@
import React, { useRef } from 'react';
import { useTheme } from '../../contexts';
import type { RegulationEvent, AffectedDoc } from '../../api/perception';

interface AnalysisPanelProps {
  event: RegulationEvent | null;
  analyzing: boolean;
  analysisText: string;
  affectedDocs: AffectedDoc[];
  onAnalyze: () => void;
  onAbort: () => void;
}

// Minimal markdown renderer — handles ##/### headings, **bold**, bullet lists
function MarkdownText({ text, textColor, accent }: { text: string; textColor: string; accent: string }) {
  const lines = text.split('\n');
  return (
    <div style={{ fontSize: 14, lineHeight: 1.75, color: textColor }}>
      {lines.map((line, i) => {
        if (line.startsWith('## ')) {
          return <div key={i} style={{ fontSize: 15, fontWeight: 700, color: accent, marginTop: 18, marginBottom: 6 }}>{line.slice(3)}</div>;
        }
        if (line.startsWith('### ')) {
          return <div key={i} style={{ fontSize: 13, fontWeight: 700, marginTop: 12, marginBottom: 4 }}>{line.slice(4)}</div>;
        }
        if (line.startsWith('- ') || line.startsWith('* ')) {
          const content = line.slice(2);
          return (
            <div key={i} style={{ display: 'flex', gap: 8, marginBottom: 4, paddingLeft: 8 }}>
              <span style={{ color: accent, flexShrink: 0 }}>·</span>
              <span dangerouslySetInnerHTML={{ __html: content.replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>') }} />
            </div>
          );
        }
        if (/^\d+\./.test(line)) {
          return (
            <div key={i} style={{ marginBottom: 4, paddingLeft: 8 }}>
              <span dangerouslySetInnerHTML={{ __html: line.replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>') }} />
            </div>
          );
        }
        if (!line.trim()) return <div key={i} style={{ height: 8 }} />;
        return (
          <div key={i} style={{ marginBottom: 4 }}>
            <span dangerouslySetInnerHTML={{ __html: line.replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>') }} />
          </div>
        );
      })}
    </div>
  );
}

const IMPACT_COLORS = { high: '#d64545', medium: '#ff8800', low: '#00d4aa' };
const SOURCE_COLORS: Record<string, string> = {
  MIIT: '#e20074', 'UN-ECE': '#4a90d9', ISO: '#7b68ee',
  '国标委': '#00b89c', 'EUR-Lex': '#f5a623', IATF: '#9b59b6',
};
const STATUS_LABEL: Record<string, string> = { enacted: '已生效', draft: '征求意见', consultation: '公众咨询' };

export const AnalysisPanel: React.FC<AnalysisPanelProps> = ({
  event, analyzing, analysisText, affectedDocs, onAnalyze, onAbort,
}) => {
  const { theme, isDark } = useTheme();
  const analysisRef = useRef<HTMLDivElement>(null);

  if (!event) {
    return (
      <div style={{ height: '100%', display: 'flex', flexDirection: 'column', alignItems: 'center', justifyContent: 'center', gap: 12 }}>
        <div style={{ fontSize: 48, opacity: 0.15 }}>◈</div>
        <div style={{ fontSize: 14, color: theme.text3 }}>选择左侧法规动态以查看智能影响分析</div>
      </div>
    );
  }

  const impactColor = IMPACT_COLORS[event.impact_level];
  const srcColor = SOURCE_COLORS[event.source] || theme.accent;

  return (
    <div style={{ display: 'flex', flexDirection: 'column', height: '100%', gap: 0 }}>
      {/* Event header */}
      <div style={{
        padding: '20px 24px',
        background: theme.bgCard,
        borderRadius: 12,
        border: `1px solid ${theme.border}`,
        borderLeft: `4px solid ${impactColor}`,
        marginBottom: 16,
        flexShrink: 0,
        boxShadow: !isDark ? '0 2px 8px rgba(226,0,116,0.04)' : 'none',
      }}>
        {/* Source + status */}
        <div style={{ display: 'flex', alignItems: 'center', gap: 8, marginBottom: 10 }}>
          <span style={{ fontSize: 11, fontWeight: 700, color: srcColor, background: srcColor + '18', borderRadius: 4, padding: '3px 8px' }}>{event.source}</span>
          <span className="mono" style={{ fontSize: 11, color: theme.text3 }}>{event.standard_code}</span>
          <span style={{ marginLeft: 'auto', fontSize: 11, color: event.status === 'enacted' ? theme.green : '#ff8800', fontWeight: 600 }}>
            {STATUS_LABEL[event.status] ?? event.status}
          </span>
        </div>

        {/* Title */}
        <div style={{ fontSize: 16, fontWeight: 700, color: theme.text, lineHeight: 1.4, marginBottom: 10 }}>
          {event.title}
        </div>
|
||||
|
||||
{/* Summary */}
|
||||
<div style={{ fontSize: 13, color: theme.text2, lineHeight: 1.6, marginBottom: 12 }}>
|
||||
{event.summary}
|
||||
</div>
|
||||
|
||||
{/* Tags */}
|
||||
<div style={{ display: 'flex', flexWrap: 'wrap', gap: 6, marginBottom: 12 }}>
|
||||
{event.tags.map(tag => (
|
||||
<span key={tag} style={{ fontSize: 11, color: theme.text3, background: theme.bgHover, borderRadius: 4, padding: '2px 8px', border: `1px solid ${theme.border}` }}>
|
||||
{tag}
|
||||
</span>
|
||||
))}
|
||||
</div>
|
||||
|
||||
{/* Dates + Analyze button */}
|
||||
<div style={{ display: 'flex', alignItems: 'center', justifyContent: 'space-between' }}>
|
||||
<div className="mono" style={{ fontSize: 11, color: theme.text3 }}>
|
||||
发布:{event.published_at}
|
||||
{event.effective_at && <span style={{ marginLeft: 12 }}>生效:<span style={{ color: impactColor }}>{event.effective_at}</span></span>}
|
||||
</div>
|
||||
{analyzing ? (
|
||||
<button onClick={onAbort} style={{ padding: '7px 18px', borderRadius: 8, border: '1px solid #d64545', background: 'transparent', color: '#d64545', cursor: 'pointer', fontSize: 13, fontWeight: 600 }}>
|
||||
■ 停止
|
||||
</button>
|
||||
) : (
|
||||
<button onClick={onAnalyze} style={{ padding: '7px 18px', borderRadius: 8, border: 'none', background: theme.gradientAccent, color: '#fff', cursor: 'pointer', fontSize: 13, fontWeight: 600, boxShadow: '0 2px 8px rgba(226,0,116,0.3)' }}>
|
||||
⚡ 触发智能分析
|
||||
</button>
|
||||
)}
|
||||
</div>
|
||||
</div>
|
||||
|
||||
{/* Affected documents */}
|
||||
{affectedDocs.length > 0 && (
|
||||
<div style={{ marginBottom: 16, flexShrink: 0 }}>
|
||||
<div className="mono" style={{ fontSize: 11, color: theme.text3, letterSpacing: '1px', marginBottom: 8 }}>
|
||||
关联文档({affectedDocs.length})
|
||||
</div>
|
||||
<div style={{ display: 'flex', flexDirection: 'column', gap: 6 }}>
|
||||
{affectedDocs.map(doc => (
|
||||
<div key={doc.doc_id} style={{
|
||||
padding: '10px 14px',
|
||||
background: theme.bgCard,
|
||||
border: `1px solid ${theme.border}`,
|
||||
borderLeft: `3px solid ${theme.accent}`,
|
||||
borderRadius: 8,
|
||||
display: 'flex',
|
||||
alignItems: 'flex-start',
|
||||
gap: 10,
|
||||
}}>
|
||||
<div style={{ flex: 1, minWidth: 0 }}>
|
||||
<div style={{ fontSize: 13, fontWeight: 600, color: theme.text, overflow: 'hidden', textOverflow: 'ellipsis', whiteSpace: 'nowrap' }}>
|
||||
{doc.doc_name}
|
||||
</div>
|
||||
{doc.snippet && (
|
||||
<div style={{ fontSize: 12, color: theme.text3, marginTop: 3, overflow: 'hidden', textOverflow: 'ellipsis', whiteSpace: 'nowrap' }}>
|
||||
{doc.snippet}
|
||||
</div>
|
||||
)}
|
||||
</div>
|
||||
<span className="mono" style={{ fontSize: 11, color: theme.accent, flexShrink: 0 }}>
|
||||
{Math.round(doc.score * 100)}%
|
||||
</span>
|
||||
</div>
|
||||
))}
|
||||
</div>
|
||||
</div>
|
||||
)}
|
||||
|
||||
{/* Streaming analysis output */}
|
||||
{(analysisText || analyzing) && (
|
||||
<div ref={analysisRef} style={{
|
||||
flex: 1,
|
||||
overflowY: 'auto',
|
||||
padding: '20px 24px',
|
||||
background: theme.bgCard,
|
||||
border: `1px solid ${theme.border}`,
|
||||
borderRadius: 12,
|
||||
boxShadow: !isDark ? '0 2px 8px rgba(0,0,0,0.03)' : 'none',
|
||||
}}>
|
||||
<div className="mono" style={{ fontSize: 11, color: theme.accent, letterSpacing: '1px', marginBottom: 14 }}>
|
||||
ANALYSIS {analyzing && <span style={{ animation: 'blink 1s step-end infinite' }}>▌</span>}
|
||||
</div>
|
||||
{analysisText && (
|
||||
<MarkdownText text={analysisText} textColor={theme.text2} accent={theme.accent} />
|
||||
)}
|
||||
{analyzing && !analysisText && (
|
||||
<div style={{ color: theme.text3, fontSize: 13 }}>正在分析法规影响...</div>
|
||||
)}
|
||||
</div>
|
||||
)}
|
||||
|
||||
{/* Empty analysis state */}
|
||||
{!analysisText && !analyzing && (
|
||||
<div style={{ flex: 1, display: 'flex', alignItems: 'center', justifyContent: 'center' }}>
|
||||
<div style={{ textAlign: 'center', color: theme.text3, fontSize: 13 }}>
|
||||
点击「触发智能分析」查看 AI 影响评估
|
||||
</div>
|
||||
</div>
|
||||
)}
|
||||
</div>
|
||||
);
|
||||
};
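Review note on `MarkdownText`: the three `dangerouslySetInnerHTML` call sites pass the streamed analysis text through the `**bold**` regex without escaping it first, so any markup in the model output would be inserted as HTML. A minimal sketch of one way to close that gap is shown here, assuming it is acceptable to escape everything and re-introduce only `<strong>`. The names `escapeHtml` and `renderInlineBold` are hypothetical and do not appear in the diff.

```typescript
// Hypothetical helper (not part of the diff): escape HTML-significant characters.
function escapeHtml(raw: string): string {
  return raw
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Hypothetical helper: escape first, then apply the same **bold** substitution
// that MarkdownText uses, so <strong> is the only tag that can reach the DOM.
function renderInlineBold(line: string): string {
  return escapeHtml(line).replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>');
}
```

Under that assumption, each call site could pass `renderInlineBold(line)` (or `renderInlineBold(content)`) as the `__html` value in place of the bare `replace` call.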
157
frontend/src/pages/Perception/EventFeed.tsx
Normal file
@@ -0,0 +1,157 @@
import React from 'react';
import { useTheme } from '../../contexts';
import type { RegulationEvent, ImpactLevel, EventSource } from '../../api/perception';

const IMPACT_CONFIG: Record<ImpactLevel, { color: string; label: string; dot: string }> = {
  high: { color: '#d64545', label: '高影响', dot: '●' },
  medium: { color: '#ff8800', label: '中影响', dot: '●' },
  low: { color: '#00d4aa', label: '低影响', dot: '●' },
};

const STATUS_LABEL: Record<string, string> = {
  enacted: '已生效',
  draft: '征求意见',
  consultation: '公众咨询',
};

const SOURCE_COLORS: Record<string, string> = {
  MIIT: '#e20074',
  'UN-ECE': '#4a90d9',
  ISO: '#7b68ee',
  '国标委': '#00b89c',
  'EUR-Lex': '#f5a623',
  IATF: '#9b59b6',
};

interface EventFeedProps {
  events: RegulationEvent[];
  selectedId: string | null;
  onSelect: (id: string) => void;
  filterSource: string;
  filterImpact: string;
  onFilterSource: (v: string) => void;
  onFilterImpact: (v: string) => void;
  stats: { total: number; high_impact: number; medium_impact: number; low_impact: number; recent_90d: number } | null;
  loading: boolean;
}

export const EventFeed: React.FC<EventFeedProps> = ({
  events, selectedId, onSelect,
  filterSource, filterImpact, onFilterSource, onFilterImpact,
  stats, loading,
}) => {
  const { theme, isDark } = useTheme();

  const sources: EventSource[] = ['MIIT', 'UN-ECE', 'ISO', '国标委', 'EUR-Lex', 'IATF'];
  const impacts: ImpactLevel[] = ['high', 'medium', 'low'];

  return (
    <div style={{ display: 'flex', flexDirection: 'column', height: '100%', gap: 16 }}>

      {/* KPI mini-cards */}
      {stats && (
        <div style={{ display: 'grid', gridTemplateColumns: 'repeat(4, 1fr)', gap: 8 }}>
          {[
            { label: '总计', value: stats.total, color: theme.text },
            { label: '高影响', value: stats.high_impact, color: '#d64545' },
            { label: '中影响', value: stats.medium_impact, color: '#ff8800' },
            { label: '近90天', value: stats.recent_90d, color: theme.accent },
          ].map(({ label, value, color }) => (
            <div key={label} style={{
              padding: '10px 12px',
              background: theme.bgCard,
              border: `1px solid ${theme.border}`,
              borderRadius: 10,
              position: 'relative',
              overflow: 'hidden',
              boxShadow: 'none',
            }}>
              <div style={{ position: 'absolute', top: 0, left: 0, right: 0, height: 2, background: color }} />
              <div className="mono" style={{ fontSize: 10, color: theme.text3, letterSpacing: '0.5px' }}>{label}</div>
              <div className="mono" style={{ fontSize: 22, fontWeight: 700, color }}>{value}</div>
            </div>
          ))}
        </div>
      )}

      {/* Filter row */}
      <div style={{ display: 'flex', gap: 6, flexWrap: 'wrap' }}>
        <button
          onClick={() => onFilterSource('')}
          style={{ padding: '4px 10px', borderRadius: 20, border: `1px solid ${filterSource === '' ? theme.accent : theme.border}`, background: filterSource === '' ? theme.accent + '20' : 'transparent', color: filterSource === '' ? theme.accent : theme.text3, fontSize: 11, cursor: 'pointer' }}
        >全部来源</button>
        {sources.map(s => (
          <button key={s} onClick={() => onFilterSource(filterSource === s ? '' : s)}
            style={{ padding: '4px 10px', borderRadius: 20, border: `1px solid ${filterSource === s ? (SOURCE_COLORS[s] || theme.accent) : theme.border}`, background: filterSource === s ? (SOURCE_COLORS[s] || theme.accent) + '20' : 'transparent', color: filterSource === s ? (SOURCE_COLORS[s] || theme.accent) : theme.text3, fontSize: 11, cursor: 'pointer' }}>
            {s}
          </button>
        ))}
      </div>
      <div style={{ display: 'flex', gap: 6 }}>
        <button onClick={() => onFilterImpact('')}
          style={{ padding: '4px 10px', borderRadius: 20, border: `1px solid ${filterImpact === '' ? theme.accent : theme.border}`, background: filterImpact === '' ? theme.accent + '20' : 'transparent', color: filterImpact === '' ? theme.accent : theme.text3, fontSize: 11, cursor: 'pointer' }}>
          全部等级
        </button>
        {impacts.map(lvl => (
          <button key={lvl} onClick={() => onFilterImpact(filterImpact === lvl ? '' : lvl)}
            style={{ padding: '4px 10px', borderRadius: 20, border: `1px solid ${filterImpact === lvl ? IMPACT_CONFIG[lvl].color : theme.border}`, background: filterImpact === lvl ? IMPACT_CONFIG[lvl].color + '22' : 'transparent', color: filterImpact === lvl ? IMPACT_CONFIG[lvl].color : theme.text3, fontSize: 11, cursor: 'pointer' }}>
            {IMPACT_CONFIG[lvl].dot} {IMPACT_CONFIG[lvl].label}
          </button>
        ))}
      </div>

      {/* Event list */}
      <div style={{ flex: 1, overflowY: 'auto', display: 'flex', flexDirection: 'column', gap: 8 }}>
        {loading && (
          <div className="mono" style={{ fontSize: 12, color: theme.text3, padding: '16px 0' }}>加载中...</div>
        )}
        {!loading && events.length === 0 && (
          <div style={{ fontSize: 13, color: theme.text3, padding: '32px 0', textAlign: 'center' }}>暂无法规动态</div>
        )}
        {events.map(evt => {
          const cfg = IMPACT_CONFIG[evt.impact_level];
          const isSelected = evt.id === selectedId;
          const srcColor = SOURCE_COLORS[evt.source] || theme.accent;
          return (
            <div
              key={evt.id}
              onClick={() => onSelect(evt.id)}
              style={{
                padding: '14px 16px',
                background: isSelected ? theme.bgHover : theme.bgCard,
                borderRadius: 10,
                border: `1px solid ${isSelected ? theme.accent : theme.border}`,
                borderLeft: `4px solid ${cfg.color}`,
                cursor: 'pointer',
                transition: 'all 0.15s ease',
                boxShadow: isSelected ? `0 0 0 1px ${theme.accent}40` : 'none',
              }}
            >
              {/* Source + Status row */}
              <div style={{ display: 'flex', alignItems: 'center', gap: 6, marginBottom: 6 }}>
                <span style={{ fontSize: 10, fontWeight: 700, color: srcColor, background: srcColor + '18', borderRadius: 4, padding: '2px 7px' }}>{evt.source}</span>
                <span className="mono" style={{ fontSize: 10, color: theme.text3 }}>{evt.standard_code}</span>
                <span style={{ marginLeft: 'auto', fontSize: 10, color: evt.status === 'enacted' ? theme.green : '#ff8800', background: evt.status === 'enacted' ? theme.green + '18' : '#ff880018', borderRadius: 4, padding: '2px 6px', fontWeight: 600 }}>
                  {STATUS_LABEL[evt.status] ?? evt.status}
                </span>
              </div>

              {/* Title */}
              <div style={{ fontSize: 13, fontWeight: 600, color: theme.text, lineHeight: 1.4, marginBottom: 6, display: '-webkit-box', WebkitLineClamp: 2, WebkitBoxOrient: 'vertical', overflow: 'hidden' }}>
                {evt.title}
              </div>

              {/* Date + impact */}
              <div style={{ display: 'flex', alignItems: 'center', justifyContent: 'space-between' }}>
                <span className="mono" style={{ fontSize: 11, color: theme.text3 }}>
                  {evt.published_at}{evt.effective_at ? ` → ${evt.effective_at}` : ''}
                </span>
                <span style={{ fontSize: 10, color: cfg.color, fontWeight: 700 }}>{cfg.dot} {cfg.label}</span>
              </div>
            </div>
          );
        })}
      </div>
    </div>
  );
};
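Review note on `EventFeed`: the source and impact chips repeat the same two rules inline. Clicking the active chip clears the filter (the empty string means "all"), and the border, background and text colour switch on whether the chip is active. A small sketch of how those rules could be factored out follows. `toggleFilter` and `chipStyle` are hypothetical names, and the `alphaHex` default of `'20'` only mirrors the suffix the source chips append (the impact chips use `'22'`).

```typescript
// Hypothetical helper: clicking the active chip clears the filter ('' means "all").
function toggleFilter(current: string, clicked: string): string {
  return current === clicked ? '' : clicked;
}

// Hypothetical helper mirroring the inline chip styling in the diff.
// activeColor is a 6-digit hex colour; alphaHex is appended to form an 8-digit hex background.
function chipStyle(
  active: boolean,
  activeColor: string,
  idleBorder: string,
  idleText: string,
  alphaHex: string = '20',
): { border: string; background: string; color: string } {
  return {
    border: `1px solid ${active ? activeColor : idleBorder}`,
    background: active ? activeColor + alphaHex : 'transparent',
    color: active ? activeColor : idleText,
  };
}
```

With helpers like these, a source chip handler would reduce to `onFilterSource(toggleFilter(filterSource, s))`.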
146
frontend/src/pages/Perception/PerceptionPage.tsx
Normal file
@@ -0,0 +1,146 @@
import React, { useCallback, useEffect, useRef, useState } from 'react';
import { useTheme } from '../../contexts';
import { TPattern } from '../../components/common/TPattern';
import {
  listEvents,
  getPerceptionStats,
  analyzeEvent,
  type RegulationEvent,
  type PerceptionStats,
  type AffectedDoc,
} from '../../api/perception';
import { EventFeed } from './EventFeed';
import { AnalysisPanel } from './AnalysisPanel';

export const PerceptionPage: React.FC = () => {
  const { theme } = useTheme();

  // Feed state
  const [events, setEvents] = useState<RegulationEvent[]>([]);
  const [stats, setStats] = useState<PerceptionStats | null>(null);
  const [feedLoading, setFeedLoading] = useState(true);
  const [filterSource, setFilterSource] = useState('');
  const [filterImpact, setFilterImpact] = useState('');

  // Selected event
  const [selectedId, setSelectedId] = useState<string | null>(null);
  const selectedEvent = events.find(e => e.id === selectedId) ?? null;

  // Analysis state
  const [analyzing, setAnalyzing] = useState(false);
  const [analysisText, setAnalysisText] = useState('');
  const [affectedDocs, setAffectedDocs] = useState<AffectedDoc[]>([]);
  const abortRef = useRef<AbortController | null>(null);

  // Load events + stats
  const loadFeed = useCallback(async () => {
    setFeedLoading(true);
    try {
      const [evtRes, statsRes] = await Promise.all([
        listEvents({
          source: filterSource || undefined,
          impact_level: filterImpact || undefined,
        }),
        getPerceptionStats(),
      ]);
      setEvents(evtRes.events);
      setStats(statsRes);
    } catch {
      // silent
    } finally {
      setFeedLoading(false);
    }
  }, [filterSource, filterImpact]);

  useEffect(() => {
    const timerId = window.setTimeout(() => { void loadFeed(); }, 0);
    return () => window.clearTimeout(timerId);
  }, [loadFeed]);

  // When selecting a new event, clear previous analysis
  const handleSelectEvent = (id: string) => {
    if (id === selectedId) return;
    abortRef.current?.abort();
    setSelectedId(id);
    setAnalysisText('');
    setAffectedDocs([]);
    setAnalyzing(false);
  };

  const handleAnalyze = useCallback(() => {
    if (!selectedId || analyzing) return;
    abortRef.current?.abort();
    const ctrl = new AbortController();
    abortRef.current = ctrl;
    setAnalysisText('');
    setAffectedDocs([]);
    setAnalyzing(true);

    void analyzeEvent(
      selectedId,
      (msg) => {
        if (msg.type === 'sources' && msg.docs) {
          setAffectedDocs(msg.docs);
        } else if (msg.type === 'content' && msg.text) {
          setAnalysisText(prev => prev + msg.text);
        } else if (msg.type === 'error') {
          setAnalysisText(prev => prev + `\n\n⚠ 分析出错:${msg.text ?? '未知错误'}`);
        }
      },
      () => setAnalyzing(false),
      ctrl.signal,
    );
  }, [selectedId, analyzing]);

  const handleAbort = () => {
    abortRef.current?.abort();
    setAnalyzing(false);
  };

  return (
    <div className="relative flex min-h-0 flex-1 flex-col">
      <style>{`
        @keyframes blink { 0%,100%{opacity:1} 50%{opacity:0} }
      `}</style>
      <TPattern />

      {/* Page header */}
      <div style={{ display: 'flex', alignItems: 'baseline', gap: 16, marginBottom: 24 }}>
        <h1 style={{ fontSize: 20, fontWeight: 700, color: theme.text, margin: 0 }}>智能感知</h1>
        <span style={{ fontSize: 13, color: theme.text3 }}>法规动态实时追踪 · 知识库影响分析</span>
      </div>

      {/* Split layout */}
      <div style={{
        display: 'grid',
        gridTemplateColumns: '400px 1fr',
        gap: 24,
        flex: 1,
        minHeight: 560,
      }}>
        {/* Left: Event feed */}
        <EventFeed
          events={events}
          selectedId={selectedId}
          onSelect={handleSelectEvent}
          filterSource={filterSource}
          filterImpact={filterImpact}
          onFilterSource={setFilterSource}
          onFilterImpact={setFilterImpact}
          stats={stats}
          loading={feedLoading}
        />

        {/* Right: Analysis panel */}
        <AnalysisPanel
          event={selectedEvent}
          analyzing={analyzing}
          analysisText={analysisText}
          affectedDocs={affectedDocs}
          onAnalyze={handleAnalyze}
          onAbort={handleAbort}
        />
      </div>
    </div>
  );
};
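Review note on `handleAnalyze`: the streaming callback handles three message kinds (`sources`, `content`, `error`) and ignores anything else. The same rules can be written as a pure reducer, which is easier to test without React. The sketch below assumes the message and document shapes implied by how the diff uses them; the real types live in `api/perception`, which is not shown here, so `AffectedDoc`, `AnalysisMessage`, `AnalysisState` and `applyAnalysisMessage` as written are stand-ins and not the project's definitions.

```typescript
// Assumed shapes, inferred from how the diff uses them (not the real api/perception types).
interface AffectedDoc { doc_id: string; doc_name: string; snippet?: string; score: number }
interface AnalysisMessage { type: string; docs?: AffectedDoc[]; text?: string }
interface AnalysisState { text: string; docs: AffectedDoc[] }

// Hypothetical reducer mirroring the callback passed to analyzeEvent in the diff.
function applyAnalysisMessage(state: AnalysisState, msg: AnalysisMessage): AnalysisState {
  if (msg.type === 'sources' && msg.docs) {
    // The retrieved documents replace any previous list.
    return { ...state, docs: msg.docs };
  }
  if (msg.type === 'content' && msg.text) {
    // Streamed text chunks are appended in arrival order.
    return { ...state, text: state.text + msg.text };
  }
  if (msg.type === 'error') {
    // Errors are appended to the visible text, with the same fallback wording as the diff.
    return { ...state, text: state.text + `\n\n⚠ 分析出错:${msg.text ?? '未知错误'}` };
  }
  // Unknown message types leave the state untouched.
  return state;
}
```

In the component, a single `useReducer` over this function could stand in for the separate `setAffectedDocs` and `setAnalysisText` calls; that is a design option, not something the diff does.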
1
frontend/src/pages/Perception/index.ts
Normal file
@@ -0,0 +1 @@
export { PerceptionPage } from './PerceptionPage';
@@ -1,4 +1,4 @@
import React, { useRef } from 'react';
import React from 'react';
import { useTheme } from '../../contexts';
import type { RetrievalData } from '../../types';
@@ -133,7 +133,6 @@ export const RagChatPage: React.FC = () => {
      sessionId,
      abortRef.current.signal,
    );
    // eslint-disable-next-line react-hooks/exhaustive-deps
  }, [filterRegulationType, sessionId]);

  const sendMessage = (text: string) => {
@@ -173,7 +172,7 @@ export const RagChatPage: React.FC = () => {
  };

  return (
    <div style={{ flex: 1, display: 'flex', height: 'calc(100vh - 128px)' }}>
    <div style={{ flex: 1, display: 'flex', minHeight: 0, height: '100%' }}>
      {/* ── Left: chat panel ─────────────────────────────────── */}
      <div style={{
        flex: '0 0 60%',
@@ -1,6 +1,5 @@
import React, { useCallback, useEffect, useState } from 'react';
import { useTheme } from '../../contexts';
import { Content } from '../../components/layout/Content';
import { TPattern } from '../../components/common/TPattern';
import { getSystemStats, getSystemConfig, getSystemHealth, type SystemStats, type SystemConfig, type SystemHealth } from '../../api/status';
import { getDocumentList, type DocInfo } from '../../api/docs';
@@ -107,7 +106,8 @@ export const StatusPage: React.FC = () => {
  // Initial load
  useEffect(() => {
    void loadData();
    const timerId = window.setTimeout(() => { void loadData(); }, 0);
    return () => window.clearTimeout(timerId);
  }, [loadData]);

  // Auto-poll every 5 s while any document is still processing
@@ -119,7 +119,7 @@ export const StatusPage: React.FC = () => {
  }, [docs, loadData]);

  return (
    <Content>
    <div className="relative w-full">
      <style>{`@keyframes spin { to { transform: rotate(360deg); } }`}</style>
      <TPattern />
@@ -386,6 +386,6 @@ export const StatusPage: React.FC = () => {
        </div>
      ))}
      </section>
    </Content>
    </div>
  );
};
20
frontend/src/router/AppRouter.tsx
Normal file
@@ -0,0 +1,20 @@
import { BrowserRouter, Navigate, Route, Routes } from 'react-router-dom';

import { AppShell } from '../components/layout/AppShell';
import { appTabs, defaultTab } from './tabs';

export function AppRouter() {
  return (
    <BrowserRouter>
      <Routes>
        <Route element={<AppShell />}>
          <Route index element={<Navigate to={defaultTab.path} replace />} />
          {appTabs.map((tab) => (
            <Route key={tab.id} path={tab.path.slice(1)} element={null} />
          ))}
          <Route path="*" element={<Navigate to={defaultTab.path} replace />} />
        </Route>
      </Routes>
    </BrowserRouter>
  );
}
73
frontend/src/router/tabs.tsx
Normal file
@@ -0,0 +1,73 @@
import type { ComponentType } from 'react';

import { CompliancePage } from '../pages/Compliance';
import { DocsPage } from '../pages/Docs';
import { PerceptionPage } from '../pages/Perception';
import { RagChatPage } from '../pages/RagChat';
import { StatusPage } from '../pages/Status';

export type TabId = 'perception' | 'docs' | 'compliance' | 'status' | 'rag';

export type ContentWidth = 'default' | 'wide' | 'full';

export interface AppTabConfig {
  id: TabId;
  path: string;
  label: string;
  component: ComponentType;
  keepAlive: boolean;
  contentWidth: ContentWidth;
  fillHeight?: boolean;
}

export const appTabs: AppTabConfig[] = [
  {
    id: 'perception',
    path: '/perception',
    label: '智能感知',
    component: PerceptionPage,
    keepAlive: true,
    contentWidth: 'wide',
    fillHeight: true,
  },
  {
    id: 'docs',
    path: '/docs',
    label: '文档管理',
    component: DocsPage,
    keepAlive: true,
    contentWidth: 'default',
  },
  {
    id: 'compliance',
    path: '/compliance',
    label: '合规分析',
    component: CompliancePage,
    keepAlive: true,
    contentWidth: 'wide',
    fillHeight: true,
  },
  {
    id: 'status',
    path: '/status',
    label: '系统状态',
    component: StatusPage,
    keepAlive: true,
    contentWidth: 'default',
  },
  {
    id: 'rag',
    path: '/rag',
    label: '法规对话',
    component: RagChatPage,
    keepAlive: true,
    contentWidth: 'wide',
    fillHeight: true,
  },
];

export const defaultTab = appTabs.find((tab) => tab.id === 'compliance') ?? appTabs[0];

export function getTabByPath(pathname: string): AppTabConfig {
  return appTabs.find((tab) => tab.path === pathname) ?? defaultTab;
}
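Review note on `getTabByPath`: the lookup is an exact string comparison, so a pathname such as `/rag/` falls back to the default tab. If trailing slashes (or a stray query string or hash passed in by a caller) should still resolve to the right tab, the path can be normalised before comparing. This is a sketch under that assumption only. `TabLike`, `normalizePath` and `findTabByPath` are hypothetical names, and whether the shell ever sees such paths is not shown in the diff.

```typescript
// Minimal stand-in for AppTabConfig: only the fields the lookup needs.
interface TabLike { id: string; path: string }

// Hypothetical normaliser: drops any query/hash suffix and trailing slashes.
function normalizePath(pathname: string): string {
  const bare = pathname.split(/[?#]/)[0] ?? '';
  const trimmed = bare.replace(/\/+$/, '');
  return trimmed === '' ? '/' : trimmed;
}

// Same exact-match lookup as getTabByPath, applied to the normalised path.
function findTabByPath<T extends TabLike>(tabs: T[], pathname: string, fallback: T): T {
  const target = normalizePath(pathname);
  return tabs.find((tab) => tab.path === target) ?? fallback;
}
```

Passing the tab list and fallback as parameters keeps the function independent of the page components, so it can be tested in isolation.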
@@ -1,8 +1,9 @@
@import url('https://fonts.googleapis.com/css2?family=TeleNeo:wght@400;500;600;700&family=JetBrains+Mono:wght@400;500;600&display=swap');
@import "tw-animate-css";
@import "tailwindcss";

@custom-variant dark (&:is(.dark *));

@tailwind base;
@tailwind components;
@tailwind utilities;

/* Light mode (default) */
:root {
@@ -19,6 +20,38 @@
  --t-orange: #ff7700;
  --t-accent-glow: rgba(226,0,116,0.08);
  --t-pattern-opacity: 0.04;
  --background: var(--t-bg);
  --foreground: var(--t-text);
  --card: var(--t-bg-card);
  --card-foreground: var(--t-text);
  --popover: var(--t-bg-card);
  --popover-foreground: var(--t-text);
  --primary: #e20074;
  --primary-foreground: #ffffff;
  --secondary: var(--t-bg-hover);
  --secondary-foreground: var(--t-text);
  --muted: var(--t-bg-hover);
  --muted-foreground: var(--t-text3);
  --accent: rgba(226, 0, 116, 0.08);
  --accent-foreground: #e20074;
  --destructive: #ff4444;
  --border: var(--t-border);
  --input: var(--t-border);
  --ring: rgba(226, 0, 116, 0.35);
  --chart-1: #e20074;
  --chart-2: #be0060;
  --chart-3: #00b89c;
  --chart-4: #ff7700;
  --chart-5: #4a4a5a;
  --radius: 0.625rem;
  --sidebar: var(--t-bg-card);
  --sidebar-foreground: var(--t-text);
  --sidebar-primary: #e20074;
  --sidebar-primary-foreground: #ffffff;
  --sidebar-accent: var(--t-bg-hover);
  --sidebar-accent-foreground: var(--t-text);
  --sidebar-border: var(--t-border);
  --sidebar-ring: rgba(226, 0, 116, 0.35);
}

/* Dark mode */
@@ -36,6 +69,80 @@
  --t-orange: #ff8800;
  --t-accent-glow: rgba(226,0,116,0.12);
  --t-pattern-opacity: 0.03;
  --background: var(--t-bg);
  --foreground: var(--t-text);
  --card: var(--t-bg-card);
  --card-foreground: var(--t-text);
  --popover: var(--t-bg-card);
  --popover-foreground: var(--t-text);
  --primary: #e20074;
  --primary-foreground: #ffffff;
  --secondary: var(--t-bg-hover);
  --secondary-foreground: var(--t-text);
  --muted: var(--t-bg-hover);
  --muted-foreground: var(--t-text3);
  --accent: rgba(226, 0, 116, 0.14);
  --accent-foreground: #ff7abf;
  --destructive: #ff4444;
  --border: var(--t-border);
  --input: var(--t-border-light);
  --ring: rgba(226, 0, 116, 0.45);
  --chart-1: #e20074;
  --chart-2: #f04090;
  --chart-3: #00d4aa;
  --chart-4: #ff8800;
  --chart-5: #c0c0d0;
  --sidebar: var(--t-bg-card);
  --sidebar-foreground: var(--t-text);
  --sidebar-primary: #e20074;
  --sidebar-primary-foreground: #ffffff;
  --sidebar-accent: var(--t-bg-hover);
  --sidebar-accent-foreground: var(--t-text);
  --sidebar-border: var(--t-border);
  --sidebar-ring: rgba(226, 0, 116, 0.45);
}

/* Dim mode — Indigo Dusk: deep navy-purple mid-tone between dark and light */
.dim {
  --t-bg: #1e1b2e;
  --t-bg-card: #252237;
  --t-bg-hover: #2d2945;
  --t-bg-elevated: #292541;
  --t-border: #3a3650;
  --t-border-light: #504c6e;
  --t-text: #f0eeff;
  --t-text2: #b8b4d8;
  --t-text3: #7a7698;
  --t-green: #00c4a0;
  --t-orange: #ff8820;
  --t-accent-glow: rgba(226,0,116,0.14);
  --t-pattern-opacity: 0.04;
  --background: var(--t-bg);
  --foreground: var(--t-text);
  --card: var(--t-bg-card);
  --card-foreground: var(--t-text);
  --popover: var(--t-bg-card);
  --popover-foreground: var(--t-text);
  --primary: #e20074;
  --primary-foreground: #ffffff;
  --secondary: var(--t-bg-hover);
  --secondary-foreground: var(--t-text);
  --muted: var(--t-bg-hover);
  --muted-foreground: var(--t-text3);
  --accent: rgba(226, 0, 116, 0.12);
  --accent-foreground: #f04090;
  --destructive: #ff4444;
  --border: var(--t-border);
  --input: var(--t-border-light);
  --ring: rgba(226, 0, 116, 0.40);
  --sidebar: var(--t-bg-card);
  --sidebar-foreground: var(--t-text);
  --sidebar-primary: #e20074;
  --sidebar-primary-foreground: #ffffff;
  --sidebar-accent: var(--t-bg-hover);
  --sidebar-accent-foreground: var(--t-text);
  --sidebar-border: var(--t-border);
  --sidebar-ring: rgba(226, 0, 116, 0.40);
}

/* Base styles */
@@ -66,6 +173,13 @@ body.dark-mode {
  background: #0a0a12;
}

/* Dim mode body */
.dim body,
body.dim-mode {
  color: #f0eeff;
  background: #1e1b2e;
}

/* Selection */
::selection {
  background: rgba(226, 0, 116, 0.3);
@@ -100,6 +214,11 @@ button, input {
  transition: none;
}

/* Shell navigation manages its own transition timing. */
[data-shell-tab='true'] {
  transition: color 0.2s ease-out;
}

/* T-Systems Button Style */
.t-btn,
.t-btn:hover {
@@ -131,17 +250,22 @@ button, input {
  background: linear-gradient(180deg, #12121f, #0a0a12);
}

/* Card gradient for dim mode */
.dim .t-card-gradient {
  background: linear-gradient(180deg, #e8e5f4, #f0eef8);
}

/* Card gradient for light mode */
:not(.dark) .t-card-gradient {
:not(.dark):not(.dim) .t-card-gradient {
  background: linear-gradient(180deg, #ffffff, #fafafa);
}

/* Light mode shadow for cards */
:not(.dark) .t-card-shadow {
:not(.dark):not(.dim) .t-card-shadow {
  box-shadow: 0 2px 8px rgba(226,0,116,0.04);
}

:not(.dark) .t-card-shadow-lg {
:not(.dark):not(.dim) .t-card-shadow-lg {
  box-shadow: 0 4px 16px rgba(226,0,116,0.08);
}
@@ -272,3 +396,59 @@ button, input {
    background: linear-gradient(135deg, #f0208a 0%, #d01070 100%);
  }
}

@theme inline {
  --font-heading: 'TeleNeo', 'Segoe UI', system-ui, sans-serif;
  --font-sans: 'TeleNeo', 'Segoe UI', system-ui, sans-serif;
  --font-mono: 'JetBrains Mono', monospace;
  --color-sidebar-ring: var(--sidebar-ring);
  --color-sidebar-border: var(--sidebar-border);
  --color-sidebar-accent-foreground: var(--sidebar-accent-foreground);
  --color-sidebar-accent: var(--sidebar-accent);
  --color-sidebar-primary-foreground: var(--sidebar-primary-foreground);
  --color-sidebar-primary: var(--sidebar-primary);
  --color-sidebar-foreground: var(--sidebar-foreground);
  --color-sidebar: var(--sidebar);
  --color-chart-5: var(--chart-5);
  --color-chart-4: var(--chart-4);
  --color-chart-3: var(--chart-3);
  --color-chart-2: var(--chart-2);
  --color-chart-1: var(--chart-1);
  --color-ring: var(--ring);
  --color-input: var(--input);
  --color-border: var(--border);
  --color-destructive: var(--destructive);
  --color-accent-foreground: var(--accent-foreground);
  --color-accent: var(--accent);
  --color-muted-foreground: var(--muted-foreground);
  --color-muted: var(--muted);
  --color-secondary-foreground: var(--secondary-foreground);
  --color-secondary: var(--secondary);
  --color-primary-foreground: var(--primary-foreground);
  --color-primary: var(--primary);
  --color-popover-foreground: var(--popover-foreground);
  --color-popover: var(--popover);
  --color-card-foreground: var(--card-foreground);
  --color-card: var(--card);
  --color-foreground: var(--foreground);
  --color-background: var(--background);
  --radius-sm: calc(var(--radius) * 0.6);
  --radius-md: calc(var(--radius) * 0.8);
  --radius-lg: var(--radius);
  --radius-xl: calc(var(--radius) * 1.4);
  --radius-2xl: calc(var(--radius) * 1.8);
  --radius-3xl: calc(var(--radius) * 2.2);
  --radius-4xl: calc(var(--radius) * 2.6);
}

@layer base {
  * {
    @apply border-border outline-ring/50;
  }
  body {
    @apply bg-background text-foreground;
  }
  html {
    @apply font-sans;
  }
}
Some files were not shown because too many files have changed in this diff.