Document AI Indexer
An intelligent document processing and indexing system based on Azure AI services, supporting content extraction, processing, and vectorized indexing for multiple document formats.
Features
🚀 Core Features
- Multi-format Document Support: PDF, DOCX, image formats, etc.
- Intelligent Content Extraction: OCR and structured extraction using Azure Document Intelligence
- Document Chunking: Smart document chunking and vectorization
- Azure AI Search Integration: Automatically create search indexes and upload documents
- Metadata Management: Complete document metadata extraction and management
- Hierarchy Structure Repair: Automatically fix title hierarchy structure in Markdown documents
🔧 Technical Features
- Asynchronous Processing: High-performance async processing based on asyncio
- Containerized Deployment: Complete Docker and Kubernetes support
- Configuration Management: Flexible YAML configuration file management
- Database Support: SQLAlchemy ORM supporting multiple databases
- Resilient Processing: Built-in retry mechanisms and error handling
- Monitoring & Logging: Complete logging and progress monitoring
System Architecture
graph LR
subgraph "Data Sources"
DS[Document Sources<br/>Blob Storage/Local]
MD[Metadata<br/>Extraction]
end
subgraph "Azure AI Services"
ADI[Azure Document<br/>Intelligence]
AAS[Azure AI Search<br/>Index]
EMB[Vector<br/>Embedding]
end
subgraph "Processing Pipeline"
HF[Hierarchy<br/>Fix]
CH[Content<br/>Chunking]
end
DS --> ADI
MD --> HF
ADI --> HF
HF --> CH
CH --> EMB
EMB --> AAS
style DS fill:#e1f5fe
style ADI fill:#e8f5e8
style AAS fill:#fff3e0
style EMB fill:#f3e5f5
style HF fill:#ffebee
style CH fill:#f1f8e9
Document Processing Flow
flowchart TD
START([Document Input]) --> DOWNLOAD[Download Document]
DOWNLOAD --> EXTRACT[AI Content Extraction]
EXTRACT --> FIX[Hierarchy Structure Fix]
FIX --> CHUNK[Content Chunking]
CHUNK --> EMBED[Vector Embedding]
EMBED --> INDEX[Search Index Upload]
INDEX --> END([Processing Complete])
style START fill:#c8e6c9
style END fill:#c8e6c9
style EXTRACT fill:#e1f5fe
style FIX fill:#fff3e0
style CHUNK fill:#f3e5f5
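
The flow above can be read as a chain of async stages that each receive and return a shared state object. The sketch below only mirrors the order of the stages; every stage body is a placeholder, and `DocumentState`, `STAGES`, and `process_document` are hypothetical names rather than the project's actual API.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class DocumentState:
    """Data carried from one stage to the next (hypothetical shape)."""
    source: str
    content: str = ""
    chunks: list[str] = field(default_factory=list)
    vectors: list[list[float]] = field(default_factory=list)
    indexed: int = 0
    completed: list[str] = field(default_factory=list)


async def download(state):
    # Placeholder: a real stage would fetch the file from Blob Storage or disk.
    state.content = "## Intro\ntext one\n\n## Details\ntext two"
    return state


async def extract(state):
    # Placeholder for the Azure Document Intelligence call.
    state.content = state.content.strip()
    return state


async def fix_hierarchy(state):
    # Placeholder: heading levels would be repaired here.
    return state


async def chunk(state):
    # Placeholder: split on blank lines.
    state.chunks = [part for part in state.content.split("\n\n") if part]
    return state


async def embed(state):
    # Placeholder: a fake one-dimensional "vector" per chunk.
    state.vectors = [[float(len(text))] for text in state.chunks]
    return state


async def index(state):
    # Placeholder: count what would be uploaded to the search index.
    state.indexed = len(state.vectors)
    return state


STAGES = [download, extract, fix_hierarchy, chunk, embed, index]


async def process_document(source):
    state = DocumentState(source=source)
    for stage in STAGES:
        state = await stage(state)
        state.completed.append(stage.__name__)
    return state


state = asyncio.run(process_document("example.pdf"))
print(state.completed, state.indexed)
```

Keeping the stages in a list makes the order explicit and lets a single stage be swapped or tested in isolation.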
Quick Start
Requirements
- Python 3.12+
- Azure subscription and related services
For detailed deployment guides, please refer to: Deployment.md
Install Dependencies
pip install -r requirements.txt
Configuration Files
The system uses two main configuration files:
- config.yaml - Business configuration (data source, index configuration, etc.)
- env.yaml - Environment variable configuration (Azure service keys, etc.)
Quick Start Configuration:
# env.yaml - Essential Azure services
search_service_name: "https://your-search-service.search.windows.net"
search_admin_key: "your-search-admin-key"
form_rec_resource: "https://your-di-service.cognitiveservices.azure.com/"
form_rec_key: "your-di-key"
embedding_model_endpoint: "https://your-openai.openai.azure.com/..."
embedding_model_key: "your-openai-key"
# config.yaml - Basic data source
data_configs:
  - data_path: "https://your-blob-storage.blob.core.windows.net/container?sas-token"
    index_schemas:
      - index_name: "your-knowledge-index"
        data_type: ["metadata", "document", "chunk"]
📖 Detailed configuration instructions: For the complete list of configuration parameters and examples, see the "Configuration file preparation" section of Deployment.md.
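
Before the first run it can help to check that every key shown in the env.yaml example has been filled in. The sketch below works on an already parsed dictionary (for instance the result of `yaml.safe_load` from PyYAML, if that library is used). The function name `find_env_problems` and the rule that a value containing `your-` is an unedited placeholder are assumptions made for this example, not behavior of the application.

```python
# Keys taken from the env.yaml example above.
REQUIRED_ENV_KEYS = (
    "search_service_name",
    "search_admin_key",
    "form_rec_resource",
    "form_rec_key",
    "embedding_model_endpoint",
    "embedding_model_key",
)


def find_env_problems(env: dict) -> list[str]:
    """Return readable problems; an empty list means the settings look usable."""
    problems = []
    for key in REQUIRED_ENV_KEYS:
        value = env.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{key}: missing or empty")
        elif "your-" in value:
            # Assumed convention: template values contain "your-".
            problems.append(f"{key}: still contains a template placeholder")
    return problems


example_env = {
    "search_service_name": "https://contoso-search.search.windows.net",
    "search_admin_key": "your-search-admin-key",  # left unedited on purpose
    "form_rec_resource": "https://contoso-di.cognitiveservices.azure.com/",
    "form_rec_key": "abc123",
    "embedding_model_endpoint": "",
}

problems = find_env_problems(example_env)
for problem in problems:
    print(problem)
```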
Run Application
# Direct execution
python main.py
# Or use predefined tasks
# (In VS Code, press Ctrl+Shift+P and choose "Tasks: Run Task")
📚 Document Navigation
- Deployment Guide (Deployment.md) - Complete deployment guide, including Docker and Kubernetes deployments
- Configuration Instructions - Detailed description of the configuration files
Project Structure
document-extractor/
├── main.py # Application entry point
├── app_config.py # Configuration management
├── business_layer.py # Business logic layer
├── document_task_processor.py # Document task processor
├── di_extractor.py # Document Intelligence extractor
├── azure_index_service.py # Azure Search service
├── blob_service.py # Blob storage service
├── chunk_service.py # Document chunking service
├── hierarchy_fix.py # Hierarchy structure repair
├── database.py # Database models
├── entity_models.py # Entity models
├── utils.py # Utility functions
├── config.yaml # Business configuration
├── env.yaml # Environment configuration
├── requirements.txt # Dependencies
├── Dockerfile # Docker build file
├── pyproject.toml # Project configuration
├── build-script/ # Build scripts
│ └── document-ai-indexer.sh
├── deploy/ # Deployment files
│ ├── document-ai-indexer.sh
│ ├── document-ai-indexer_k8s.yml
│ ├── document-ai-indexer_cronjob.yml
│ └── embedding-api-proxy_k8s.yml
└── doc/ # Documentation
Core Components
1. Document Processing Pipeline
- Document Loading: Support loading from Azure Blob Storage or local file system
- Content Extraction: OCR and structured extraction using Azure Document Intelligence
- Content Chunking: Smart chunking algorithms maintaining semantic integrity
- Vectorization: Generate vector representations of document content
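
To make "chunking that maintains semantic integrity" concrete, the following sketch starts a new chunk at every Markdown heading and never splits inside a paragraph. It is one simple strategy under those two assumptions and is not the algorithm in chunk_service.py; `chunk_markdown` and `max_chars` are hypothetical names.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")


def chunk_markdown(text: str, max_chars: int = 200) -> list[str]:
    """Split Markdown into chunks along heading and paragraph boundaries.

    A paragraph longer than max_chars is kept whole as an oversized chunk.
    """
    chunks, current = [], []

    def flush():
        if current:
            chunks.append("\n\n".join(current))
            current.clear()

    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if not block:
            continue
        # Joined length of the current chunk plus the separator for the new block.
        size = sum(len(part) for part in current) + 2 * len(current)
        if HEADING.match(block) or (current and size + len(block) > max_chars):
            flush()
        current.append(block)
    flush()
    return chunks


sample = (
    "# Guide\n\nIntro paragraph.\n\n"
    "## Setup\n\nInstall the tool.\n\nConfigure it.\n\n"
    "## Usage\n\nRun it."
)
chunks = chunk_markdown(sample, max_chars=40)
for piece in chunks:
    print(repr(piece))
```

A production chunker would usually count tokens rather than characters and might add overlap between neighboring chunks; both are left out here to keep the boundary logic visible.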
2. Index Management
- Dynamic Index Creation: Automatically create Azure AI Search indexes based on configuration
- Batch Upload: Efficient batch document upload
- Metadata Management: Complete document metadata indexing
- Incremental Updates: Support incremental document updates
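
Batch upload amounts to grouping documents into fixed-size lists and sending each list in one request. The sketch below keeps the transport abstract: `upload` is any callable that accepts a list of documents. In practice it could wrap `SearchClient.upload_documents` from azure-search-documents, but that wiring, the helper names, and the batch size of 100 are assumptions for illustration.

```python
from itertools import islice


def batched(items, size):
    """Yield lists of at most `size` items, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def upload_in_batches(documents, upload, batch_size=100):
    """Send documents to `upload` one batch at a time; return the batch count."""
    sent = 0
    for batch in batched(documents, batch_size):
        upload(batch)
        sent += 1
    return sent


# Stand-in uploader that just records what it receives.
received = []
docs = [{"id": str(i), "content": f"chunk {i}"} for i in range(250)]
batches_sent = upload_in_batches(docs, received.append, batch_size=100)
print(batches_sent, [len(batch) for batch in received])
```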
3. Data Processing
- Hierarchy Structure Repair: Automatically fix title hierarchy in Markdown documents
- Metadata Extraction: Extract structured metadata from documents and filenames
- Format Conversion: Unified processing support for multiple document formats
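
One way to repair a title hierarchy is to renumber headings so that no level is skipped. The sketch below does this with a stack of the original levels: each heading is rewritten to the depth of that stack. It illustrates the idea only and is not the logic in hierarchy_fix.py; `fix_heading_levels` is a hypothetical name, and the sketch does not skip `#` lines inside code fences.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def fix_heading_levels(markdown: str) -> str:
    """Rewrite headings so each is at most one level deeper than its parent.

    The first heading always becomes level 1.
    """
    fixed_lines = []
    stack = []  # original levels of the headings that are currently "open"
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            # Close headings at the same or a deeper original level.
            while stack and stack[-1] >= level:
                stack.pop()
            stack.append(level)
            line = "#" * len(stack) + " " + match.group(2)
        fixed_lines.append(line)
    return "\n".join(fixed_lines)


sample = "# Title\n\n### Setup\n\ntext\n\n##### Details\n\n### Usage"
fixed = fix_heading_levels(sample)
print(fixed)
```

Headings that shared a level under the same parent stay siblings after the rewrite (here "Setup" and "Usage"), which is the property the stack is meant to preserve.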
API and Integration
Azure Service Integration
- Azure Document Intelligence: Document analysis and OCR
- Azure AI Search: Search indexing and querying
- Azure Blob Storage: Document storage
- Azure OpenAI: Vector embedding generation
Database Support
- PostgreSQL (recommended)
- SQLite (development and testing)
- Other SQLAlchemy-supported databases
Monitoring and Logging
The system provides comprehensive logging capabilities:
- Processing progress monitoring
- Error logging
- Performance statistics
- Task status tracking
View logs:
# Kubernetes environment
kubectl logs -f document-ai-indexer -n knowledge-agent
# Docker environment
docker logs -f <container-id>
Development
Development Mode
# Activate virtual environment
source .venv/bin/activate # Linux/Mac
# or
.venv\Scripts\activate # Windows
# Install development dependencies
pip install -e ".[dev,test]" # quotes keep shells such as zsh from expanding the brackets
# Run code checks
mypy .
Log Analysis
# View error logs
kubectl logs document-ai-indexer -n knowledge-agent | grep ERROR
# View processing progress
kubectl logs document-ai-indexer -n knowledge-agent | grep "Processing"
Version Information
- Current Version: 0.20.4
- Python Version: 3.12+
- Main Dependencies:
  - azure-ai-documentintelligence
  - azure-search-documents
  - SQLAlchemy 2.0.41
  - openai 1.55.3
Last updated: August 2025