Files
2025-09-26 17:15:54 +08:00

261 lines
7.8 KiB
Markdown

# Document AI Indexer
An intelligent document processing and indexing system based on Azure AI services, supporting content extraction, processing, and vectorized indexing for multiple document formats.
## Features
### 🚀 Core Features
- **Multi-format Document Support**: PDF, DOCX, image formats, etc.
- **Intelligent Content Extraction**: OCR and structured extraction using Azure Document Intelligence
- **Document Chunking**: Smart document chunking and vectorization
- **Azure AI Search Integration**: Automatically create search indexes and upload documents
- **Metadata Management**: Complete document metadata extraction and management
- **Hierarchy Structure Repair**: Automatically fix title hierarchy structure in Markdown documents
### 🔧 Technical Features
- **Asynchronous Processing**: High-performance async processing based on asyncio
- **Containerized Deployment**: Complete Docker and Kubernetes support
- **Configuration Management**: Flexible YAML configuration file management
- **Database Support**: SQLAlchemy ORM supporting multiple databases
- **Resilient Processing**: Built-in retry mechanisms and error handling
- **Monitoring & Logging**: Complete logging and progress monitoring
## System Architecture
```mermaid
graph LR
subgraph "Data Sources"
DS[Document Sources<br/>Blob Storage/Local]
MD[Metadata<br/>Extraction]
end
subgraph "Azure AI Services"
ADI[Azure Document<br/>Intelligence]
AAS[Azure AI Search<br/>Index]
EMB[Vector<br/>Embedding]
end
subgraph "Processing Pipeline"
HF[Hierarchy<br/>Fix]
CH[Content<br/>Chunking]
end
DS --> ADI
MD --> HF
ADI --> HF
HF --> CH
CH --> EMB
EMB --> AAS
style DS fill:#e1f5fe
style ADI fill:#e8f5e8
style AAS fill:#fff3e0
style EMB fill:#f3e5f5
style HF fill:#ffebee
style CH fill:#f1f8e9
```
### Document Processing Flow
```mermaid
flowchart TD
START([Document Input]) --> DOWNLOAD[Download Document]
DOWNLOAD --> EXTRACT[AI Content Extraction]
EXTRACT --> FIX[Hierarchy Structure Fix]
FIX --> CHUNK[Content Chunking]
CHUNK --> EMBED[Vector Embedding]
EMBED --> INDEX[Search Index Upload]
INDEX --> END([Processing Complete])
style START fill:#c8e6c9
style END fill:#c8e6c9
style EXTRACT fill:#e1f5fe
style FIX fill:#fff3e0
style CHUNK fill:#f3e5f5
```
## Quick Start
### Requirements
- Python 3.12+
- Azure subscription and related services
For detailed deployment guides, please refer to: [Deployment.md](Deployment.md)
### Install Dependencies
```bash
pip install -r requirements.txt
```
### Configuration Files
The system uses two main configuration files:
- `config.yaml` - Business configuration (data source, index configuration, etc.)
- `env.yaml` - Environment variable configuration (Azure service keys, etc.)
**Quick Start Configuration:**
```yaml
# env.yaml - Essential Azure services
search_service_name: "https://your-search-service.search.windows.net"
search_admin_key: "your-search-admin-key"
form_rec_resource: "https://your-di-service.cognitiveservices.azure.com/"
form_rec_key: "your-di-key"
embedding_model_endpoint: "https://your-openai.openai.azure.com/..."
embedding_model_key: "your-openai-key"
# config.yaml - Basic data source
data_configs:
- data_path: "https://your-blob-storage.blob.core.windows.net/container?sas-token"
index_schemas:
- index_name: "your-knowledge-index"
data_type: ["metadata", "document", "chunk"]
```
📖 **Detailed configuration instructions**: See the complete configuration parameters and examples [Deployment.md - Configuration file preparation](Deployment.md#Configuration-file-preparation)
### Run Application
```bash
# Direct execution
python main.py
# Or use predefined tasks
# (In VS Code, use Ctrl+Shift+P -> Run Task)
```
## 📚 Document Navigation
- **[Deployment Guide (Deployment.md)](Deployment.md)** - Complete deployment guide, including Docker and Kubernetes deployments
- **[Configuration instructions](Deployment.md#Configuration-file-preparation)** - Detailed configuration file description
## Project Structure
```
document-extractor/
├── main.py # Application entry point
├── app_config.py # Configuration management
├── business_layer.py # Business logic layer
├── document_task_processor.py # Document task processor
├── di_extractor.py # Document Intelligence extractor
├── azure_index_service.py # Azure Search service
├── blob_service.py # Blob storage service
├── chunk_service.py # Document chunking service
├── hierarchy_fix.py # Hierarchy structure repair
├── database.py # Database models
├── entity_models.py # Entity models
├── utils.py # Utility functions
├── config.yaml # Business configuration
├── env.yaml # Environment configuration
├── requirements.txt # Dependencies
├── Dockerfile # Docker build file
├── pyproject.toml # Project configuration
├── build-script/ # Build scripts
│ └── document-ai-indexer.sh
├── deploy/ # Deployment files
│ ├── document-ai-indexer.sh
│ ├── document-ai-indexer_k8s.yml
│ ├── document-ai-indexer_cronjob.yml
│ └── embedding-api-proxy_k8s.yml
└── doc/ # Documentation
```
## Core Components
### 1. Document Processing Pipeline
- **Document Loading**: Support loading from Azure Blob Storage or local file system
- **Content Extraction**: OCR and structured extraction using Azure Document Intelligence
- **Content Chunking**: Smart chunking algorithms maintaining semantic integrity
- **Vectorization**: Generate vector representations of document content
### 2. Index Management
- **Dynamic Index Creation**: Automatically create Azure AI Search indexes based on configuration
- **Batch Upload**: Efficient batch document upload
- **Metadata Management**: Complete document metadata indexing
- **Incremental Updates**: Support incremental document updates
### 3. Data Processing
- **Hierarchy Structure Repair**: Automatically fix title hierarchy in Markdown documents
- **Metadata Extraction**: Extract structured metadata from documents and filenames
- **Format Conversion**: Unified processing support for multiple document formats
## API and Integration
### Azure Service Integration
- **Azure Document Intelligence**: Document analysis and OCR
- **Azure AI Search**: Search indexing and querying
- **Azure Blob Storage**: Document storage
- **Azure OpenAI**: Vector embedding generation
### Database Support
- PostgreSQL (recommended)
- SQLite (development and testing)
- Other SQLAlchemy-supported databases
## Monitoring and Logging
The system provides comprehensive logging capabilities:
- Processing progress monitoring
- Error logging
- Performance statistics
- Task status tracking
View logs:
```bash
# Kubernetes environment
kubectl logs -f document-ai-indexer -n knowledge-agent
# Docker environment
docker logs -f <container-id>
```
## Development
### Development Mode
```bash
# Activate virtual environment
source .venv/bin/activate # Linux/Mac
# or
.venv\Scripts\activate # Windows
# Install development dependencies
pip install -e .[dev,test]
# Run code checks
mypy .
```
### Log Analysis
```bash
# View error logs
kubectl logs document-ai-indexer -n knowledge-agent | grep ERROR
# View processing progress
kubectl logs document-ai-indexer -n knowledge-agent | grep "Processing"
```
## Version Information
- **Current Version**: 0.20.4
- **Python Version**: 3.12+
- **Main Dependencies**:
- azure-ai-documentintelligence
- azure-search-documents
- SQLAlchemy 2.0.41
- openai 1.55.3
---
*Last updated: August 2025*