Files
AIRegulation-DocAnalysis/README.md
ash66 c22b03dc07 feat: Migrate document parsing to Aliyun and update embedding configurations
- Updated LocalDocumentParser to include raw_layouts and artifact_prefix from settings.
- Added new documents with failure reasons and metadata to documents.json for better error tracking.
- Created a new documentation file detailing the Aliyun ingest implementation process.
- Updated RFC to reflect changes in the parsing backend and embedding dimensions.
- Modified tests to accommodate the new embedding dimension of 1024 and updated parser and chunk builder assertions.
- Verified migration configurations to ensure correct settings for embedding model and backend.
2026-05-18 22:30:28 +08:00

140 lines
3.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# AI+合规智能中枢 - 法律法规文档解析入库
面向车企与工厂的合规智能平台,实现法规文档的解析、分块、嵌入和向量存储。
## MVP功能
本次实现的核心功能(最小可用版本):
- ✅ PDF/DOC/DOCX 文档解析(阿里云文档智能)
- ✅ 基于阿里云 `vector_chunks` 的统一切片
- ✅ OpenAI 兼容 embedding`text-embedding-v3`1536维
- ✅ Milvus 向量数据库存储与 dense-only 检索
- ✅ FastAPI接口封装
## 项目结构
```text
AIRegulation-DocAnalysis-Demo/
├── backend/
│ ├── app/
│ │ ├── api/ # FastAPI 接口层
│ │ ├── application/ # 用例编排层
│ │ ├── domain/ # 领域模型与稳定端口
│ │ ├── infrastructure/ # MinIO / Milvus / 阿里云 / embedding / session 适配
│ │ ├── config/ # 配置与日志
│ │ └── workers/
│ ├── requirements.txt
│ └── main.py
├── frontend/ # Vite React 前端
├── tests/ # 根级测试,导入 backend/app
├── docker/
│ └── docker-compose.yml
├── pyproject.toml
└── .env.example
```
## 快速开始
### 1. 安装依赖
```bash
./dev.sh setup
```
### 2. 启动Milvus向量数据库
```bash
cd docker
docker-compose up -d
```
等待Milvus启动完成约30秒
```bash
docker-compose logs -f milvus
```
### 3. 启动API服务
```bash
./dev.sh start api --foreground
```
访问API文档http://localhost:8000/docs
## API接口
### 上传文档
```bash
curl -X POST http://localhost:8000/api/v1/documents/upload \
-F "file=@your_regulation.pdf" \
-F "doc_name=GB 7258-2017" \
-F "regulation_type=车辆安全"
```
### 检索法规
```bash
curl -X POST http://localhost:8000/api/v1/knowledge/search \
-H "Content-Type: application/json" \
-d '{"query": "机动车安全技术要求", "top_k": 10}'
```
## 技术栈
| 类别 | 技术 |
|------|------|
| 文档解析 | 阿里云文档智能 + python-docx |
| 分块策略 | 阿里云 `vector_chunks` |
| 嵌入模型 | `text-embedding-v3`1536维 Dense |
| 向量数据库 | Milvus 2.4本地Docker部署 |
| 检索方式 | Dense-only 检索 |
| API框架 | FastAPI |
## 配置
创建 `.env` 文件(参考 `.env.example`
```env
# Milvus配置
MILVUS_HOST=localhost
MILVUS_PORT=19530
# 阿里云文档解析
ALIBABA_ACCESS_KEY_ID=your_aliyun_access_key_id
ALIBABA_ACCESS_KEY_SECRET=your_aliyun_access_key_secret
PARSER_BACKEND=aliyun
CHUNK_BACKEND=aliyun
# embedding 配置
EMBEDDING_MODEL=text-embedding-v3
EMBEDDING_DIM=1536
EMBEDDING_API_KEY=your_embedding_api_key_here
# 分块配置
CHUNK_SIZE=512
```
## 后续迭代不在本次MVP范围
- LLM摘要生成当前上传主链路默认不生成
- 文档上传UI界面
- 混合检索问答功能
- 法规变更监控与自动更新
## 解析产物
上传成功后,系统会把阿里云解析的中间结果持久化到 MinIO
- `artifacts/{doc_id}/layouts.json`
- `artifacts/{doc_id}/structure_nodes.json`
- `artifacts/{doc_id}/semantic_blocks.json`
- `artifacts/{doc_id}/vector_chunks.json`
当前默认 Milvus collection 为 `regulations_dense_1536_v2`
## 许可证
MIT License