2026-04-28 11:29:33 +08:00
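# AI+ Compliance Intelligence Hub - Legal and Regulatory Document Parsing and Ingestion

A compliance intelligence platform for automakers and factories that parses, chunks, embeds, and stores regulatory documents in a vector database.
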
## MVP Features

Core features implemented in this release (minimum viable version):

- ✅ PDF/DOC/DOCX document parsing (Alibaba Cloud Document Intelligence)
- ✅ Unified chunking based on Alibaba Cloud `vector_chunks`
- ✅ OpenAI-compatible embedding (`text-embedding-v3`, 1536 dimensions)
- ✅ Milvus vector database storage with dense-only retrieval
- ✅ FastAPI endpoint wrappers

## Project Structure

```text
AIRegulation-DocAnalysis-Demo/
├── backend/
│   ├── app/
│   │   ├── api/              # FastAPI interface layer
│   │   ├── application/      # Use-case orchestration layer
│   │   ├── domain/           # Domain models and stable ports
│   │   ├── infrastructure/   # MinIO / Milvus / Alibaba Cloud / embedding / session adapters
│   │   ├── shared/           # Composition root, config-independent wiring, and cross-cutting support
│   │   ├── config/           # Configuration and logging
│   │   ├── services/         # Legacy façade kept during migration; not the default home for new business logic
│   │   ├── workflows/        # Legacy workflows kept during migration; not the default home for new business logic
│   │   └── workers/
│   ├── requirements.txt
│   └── main.py
├── frontend/                 # Vite React frontend
├── tests/                    # Root-level tests; they import backend/app
├── docker/
│   └── docker-compose.yml
├── pyproject.toml
└── .env.example
```

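The layering described in the tree can be illustrated with a minimal sketch. It assumes a ports-and-adapters style in which `domain/` declares a port, `infrastructure/` implements it, and `application/` orchestrates a use case against the port only. All names below (`EmbeddingPort`, `FakeEmbedder`, `EmbedChunksUseCase`) are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


class EmbeddingPort(Protocol):
    """Hypothetical domain port: anything that turns texts into vectors."""

    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class FakeEmbedder:
    """Hypothetical infrastructure adapter; a real one would call the embedding API."""

    def __init__(self, dim: int) -> None:
        self.dim = dim

    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        # Deterministic placeholder vectors so the sketch runs offline.
        return [[float(len(t))] * self.dim for t in texts]


@dataclass
class EmbedChunksUseCase:
    """Hypothetical application-layer use case; it depends only on the port."""

    embedder: EmbeddingPort

    def run(self, chunks: Sequence[str]) -> list[tuple[str, list[float]]]:
        vectors = self.embedder.embed(chunks)
        return list(zip(chunks, vectors))


# Wiring like this would live in a composition root such as shared/.
use_case = EmbedChunksUseCase(embedder=FakeEmbedder(dim=4))
pairs = use_case.run(["clause 1", "clause 22"])
```

Because the use case only sees `EmbeddingPort`, swapping the fake adapter for a real embedding client would not require touching the application layer.
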
## Quick Start

### 1. Install dependencies

```bash
./dev.sh setup
```

### 2. Start the Milvus vector database

```bash
cd docker
docker-compose up -d
```

Wait for Milvus to finish starting (about 30 seconds):

```bash
docker-compose logs -f milvus
```

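Instead of watching the logs, readiness can be approximated by polling the Milvus port. The sketch below only checks that a TCP connection succeeds (for example to `localhost:19530`, the host and port used in the configuration section); it does not verify that Milvus itself is healthy. The function name and the timeout values are illustrative assumptions.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


# Example call (not executed here): wait_for_port("localhost", 19530)
```
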
### 3. Start the API service

```bash
./dev.sh start api --foreground
```

API documentation: http://localhost:8000/docs

## Backend Architecture

- Backend architecture specification: `docs/architecture/backend-project-architecture.md`
- Backend migration RFC: `docs/rfc/backend-api-parsing-embedding-migration-requirements.md`
- All future backend features, refactors, and technology replacements must comply with both the RFC and the architecture document.
- `backend/app/services/*` and `backend/app/workflows/*` are currently legacy directories kept for the migration period; apart from migration work or compatibility fixes, they should not take on new business orchestration.

## API Endpoints

### Upload a document

```bash
curl -X POST http://localhost:8000/api/v1/documents/upload \
  -F "file=@your_regulation.pdf" \
  -F "doc_name=GB 7258-2017" \
  -F "regulation_type=车辆安全"
```

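The same upload can be prepared from Python. The sketch below builds the multipart body by hand using only the standard library; the endpoint and the form field names (`file`, `doc_name`, `regulation_type`) come from the curl command, while the helper names are illustrative and the response format is not assumed.

```python
import urllib.request
import uuid

UPLOAD_URL = "http://localhost:8000/api/v1/documents/upload"


def build_multipart(fields: dict[str, str], filename: str, file_bytes: bytes) -> tuple[bytes, str]:
    """Encode text fields plus one file part named "file" as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts: list[bytes] = []
    for name, value in fields.items():
        text_part = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        parts.append(text_part.encode("utf-8"))
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_upload_request(
    filename: str, file_bytes: bytes, doc_name: str, regulation_type: str
) -> urllib.request.Request:
    """Build (but do not send) the POST request for the upload endpoint."""
    body, content_type = build_multipart(
        {"doc_name": doc_name, "regulation_type": regulation_type}, filename, file_bytes
    )
    return urllib.request.Request(
        UPLOAD_URL, data=body, headers={"Content-Type": content_type}, method="POST"
    )
```

With the API running, the request can be sent by passing the result of `build_upload_request(...)` to `urllib.request.urlopen`.
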
### Search regulations

```bash
curl -X POST http://localhost:8000/api/v1/knowledge/search \
  -H "Content-Type: application/json" \
  -d '{"query": "机动车安全技术要求", "top_k": 10}'
```

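For callers that prefer Python, the search request above can be sketched with the standard library. The URL and the `query` / `top_k` fields come from the curl command; the function names are illustrative, and the response is returned as decoded JSON without assuming its shape.

```python
import json
import urllib.request

SEARCH_URL = "http://localhost:8000/api/v1/knowledge/search"


def build_search_request(query: str, top_k: int = 10) -> urllib.request.Request:
    """Build (but do not send) the JSON POST request for the search endpoint."""
    payload = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        SEARCH_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, top_k: int = 10) -> object:
    """Send the request and decode the JSON response (requires the API to be running)."""
    with urllib.request.urlopen(build_search_request(query, top_k)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```
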
## Tech Stack

| Category | Technology |
|------|------|
| Document parsing | Alibaba Cloud Document Intelligence + python-docx |
| Chunking strategy | Alibaba Cloud `vector_chunks` |
| Embedding model | `text-embedding-v3` (1536-dimensional dense) |
| Vector database | Milvus 2.4 (local Docker deployment) |
| Retrieval method | Dense-only retrieval |
| API framework | FastAPI |

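As a conceptual illustration of dense-only retrieval: the query and every chunk are each represented by one dense vector, and results are ranked by vector similarity alone, with no keyword or sparse signal. The pure-Python sketch below uses cosine similarity over toy 3-dimensional vectors. In the actual stack Milvus performs the search over 1536-dimensional embeddings; the similarity metric it is configured with is not stated in this README, so cosine is an assumption.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dense_top_k(
    query_vec: Sequence[float], chunks: dict[str, Sequence[float]], top_k: int
) -> list[tuple[str, float]]:
    """Rank chunk ids by similarity to the query vector and keep the best top_k."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in chunks.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data: c1 points almost the same way as the query, c2 is nearly orthogonal.
chunks = {"c1": [1.0, 0.0, 0.0], "c2": [0.0, 1.0, 0.0], "c3": [0.7, 0.7, 0.0]}
hits = dense_top_k([1.0, 0.1, 0.0], chunks, top_k=2)
```
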
## Configuration

Create a `.env` file (see `.env.example`):

```env
# Milvus configuration
MILVUS_HOST=localhost
MILVUS_PORT=19530

# Alibaba Cloud document parsing
ALIBABA_ACCESS_KEY_ID=your_aliyun_access_key_id
ALIBABA_ACCESS_KEY_SECRET=your_aliyun_access_key_secret
PARSER_BACKEND=aliyun
CHUNK_BACKEND=aliyun

# Embedding configuration
EMBEDDING_MODEL=text-embedding-v3
EMBEDDING_DIM=1536
EMBEDDING_API_KEY=your_embedding_api_key_here

# Chunking configuration
CHUNK_SIZE=512
```

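One way these variables might be read into typed settings is sketched below. The project's real configuration module under `backend/app/config/` is not shown in this README, so the `Settings` class, its field names, and the fallback defaults are assumptions that simply mirror the example values. The sketch reads from the process environment (or any mapping) and leaves loading the `.env` file itself to whatever tool the project uses.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical typed view over a subset of the variables in .env."""

    milvus_host: str
    milvus_port: int
    embedding_model: str
    embedding_dim: int
    chunk_size: int

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        # Fallbacks mirror the example values; they are not the project's confirmed defaults.
        return cls(
            milvus_host=env.get("MILVUS_HOST", "localhost"),
            milvus_port=int(env.get("MILVUS_PORT", "19530")),
            embedding_model=env.get("EMBEDDING_MODEL", "text-embedding-v3"),
            embedding_dim=int(env.get("EMBEDDING_DIM", "1536")),
            chunk_size=int(env.get("CHUNK_SIZE", "512")),
        )


# Passing a plain dict keeps the example independent of the real environment.
settings = Settings.from_env({"MILVUS_PORT": "19531"})
```
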
## Future Iterations (outside the scope of this MVP)

- LLM summary generation (not generated by default in the current upload pipeline)
- Document upload UI
- Hybrid-retrieval question answering
- Regulation change monitoring and automatic updates

## Parsing Artifacts

After a successful upload, the system persists the intermediate results of Alibaba Cloud parsing to MinIO:

- `artifacts/{doc_id}/layouts.json`
- `artifacts/{doc_id}/structure_nodes.json`
- `artifacts/{doc_id}/semantic_blocks.json`
- `artifacts/{doc_id}/vector_chunks.json`

The current default Milvus collection is `regulations_dense_1536_v2`.

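The artifact layout listed in this section can be expressed as a small helper that derives the four object keys for a document. The function name and the sample `doc-123` identifier are hypothetical; the MinIO bucket name is not stated in this README and is therefore left out.

```python
ARTIFACT_NAMES = ("layouts", "structure_nodes", "semantic_blocks", "vector_chunks")


def artifact_keys(doc_id: str) -> dict[str, str]:
    """Map each artifact name to its object key under artifacts/{doc_id}/."""
    return {name: f"artifacts/{doc_id}/{name}.json" for name in ARTIFACT_NAMES}


keys = artifact_keys("doc-123")
```
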
## License

MIT License