# Document AI Indexer

An intelligent document processing and indexing system based on Azure AI services, supporting content extraction, processing, and vectorized indexing for multiple document formats.

## Features

### 🚀 Core Features

- **Multi-format Document Support**: PDF, DOCX, image formats, etc.
- **Intelligent Content Extraction**: OCR and structured extraction using Azure Document Intelligence
- **Document Chunking**: Smart document chunking and vectorization
- **Azure AI Search Integration**: Automatically create search indexes and upload documents
- **Metadata Management**: Complete document metadata extraction and management
- **Hierarchy Structure Repair**: Automatically fix title hierarchy structure in Markdown documents

### 🔧 Technical Features

- **Asynchronous Processing**: High-performance async processing based on asyncio
- **Containerized Deployment**: Complete Docker and Kubernetes support
- **Configuration Management**: Flexible YAML configuration file management
- **Database Support**: SQLAlchemy ORM supporting multiple databases
- **Resilient Processing**: Built-in retry mechanisms and error handling
- **Monitoring & Logging**: Complete logging and progress monitoring

## System Architecture

```mermaid
graph LR
    subgraph "Data Sources"
        DS[Document Sources<br/>Blob Storage/Local]
        MD[Metadata<br/>Extraction]
    end

    subgraph "Azure AI Services"
        ADI[Azure Document<br/>Intelligence]
        AAS[Azure AI Search<br/>Index]
        EMB[Vector<br/>Embedding]
    end

    subgraph "Processing Pipeline"
        HF[Hierarchy<br/>Fix]
        CH[Content<br/>Chunking]
    end

    DS --> ADI
    MD --> HF
    ADI --> HF
    HF --> CH
    CH --> EMB
    EMB --> AAS

    style DS fill:#e1f5fe
    style ADI fill:#e8f5e8
    style AAS fill:#fff3e0
    style EMB fill:#f3e5f5
    style HF fill:#ffebee
    style CH fill:#f1f8e9
```

### Document Processing Flow

```mermaid
flowchart TD
    START([Document Input]) --> DOWNLOAD[Download Document]
    DOWNLOAD --> EXTRACT[AI Content Extraction]
    EXTRACT --> FIX[Hierarchy Structure Fix]
    FIX --> CHUNK[Content Chunking]
    CHUNK --> EMBED[Vector Embedding]
    EMBED --> INDEX[Search Index Upload]
    INDEX --> END([Processing Complete])

    style START fill:#c8e6c9
    style END fill:#c8e6c9
    style EXTRACT fill:#e1f5fe
    style FIX fill:#fff3e0
    style CHUNK fill:#f3e5f5
```

## Quick Start

### Requirements

- Python 3.12+
- Azure subscription and related services

For detailed deployment guides, please refer to: [Deployment.md](Deployment.md)

### Install Dependencies

```bash
pip install -r requirements.txt
```

### Configuration Files

The system uses two main configuration files:

- `config.yaml` - Business configuration (data source, index configuration, etc.)
- `env.yaml` - Environment variable configuration (Azure service keys, etc.)

**Quick Start Configuration:**

```yaml
# env.yaml - Essential Azure services
search_service_name: "https://your-search-service.search.windows.net"
search_admin_key: "your-search-admin-key"
form_rec_resource: "https://your-di-service.cognitiveservices.azure.com/"
form_rec_key: "your-di-key"
embedding_model_endpoint: "https://your-openai.openai.azure.com/..."
embedding_model_key: "your-openai-key"

# config.yaml - Basic data source
data_configs:
  - data_path: "https://your-blob-storage.blob.core.windows.net/container?sas-token"
    index_schemas:
      - index_name: "your-knowledge-index"
        data_type: ["metadata", "document", "chunk"]
```

📖 **Detailed configuration instructions**: See the complete configuration parameters and examples [Deployment.md - Configuration file preparation](Deployment.md#Configuration-file-preparation)

### Run Application

```bash
# Direct execution
python main.py

# Or use predefined tasks
# (In VS Code, use Ctrl+Shift+P -> Run Task)
```

## 📚 Document Navigation

- **[Deployment Guide (Deployment.md)](Deployment.md)** - Complete deployment guide, including Docker and Kubernetes deployments
- **[Configuration instructions](Deployment.md#Configuration-file-preparation)** - Detailed configuration file description

## Project Structure

```
document-extractor/
├── main.py                      # Application entry point
├── app_config.py                # Configuration management
├── business_layer.py            # Business logic layer
├── document_task_processor.py   # Document task processor
├── di_extractor.py              # Document Intelligence extractor
├── azure_index_service.py       # Azure Search service
├── blob_service.py              # Blob storage service
├── chunk_service.py             # Document chunking service
├── hierarchy_fix.py             # Hierarchy structure repair
├── database.py                  # Database models
├── entity_models.py             # Entity models
├── utils.py                     # Utility functions
├── config.yaml                  # Business configuration
├── env.yaml                     # Environment configuration
├── requirements.txt             # Dependencies
├── Dockerfile                   # Docker build file
├── pyproject.toml               # Project configuration
├── build-script/                # Build scripts
│   └── document-ai-indexer.sh
├── deploy/                      # Deployment files
│   ├── document-ai-indexer.sh
│   ├── document-ai-indexer_k8s.yml
│   ├── document-ai-indexer_cronjob.yml
│   └── embedding-api-proxy_k8s.yml
└── doc/                         # Documentation
```

## Core Components

### 1. Document Processing Pipeline

- **Document Loading**: Support loading from Azure Blob Storage or local file system
- **Content Extraction**: OCR and structured extraction using Azure Document Intelligence
- **Content Chunking**: Smart chunking algorithms maintaining semantic integrity
- **Vectorization**: Generate vector representations of document content

### 2. Index Management

- **Dynamic Index Creation**: Automatically create Azure AI Search indexes based on configuration
- **Batch Upload**: Efficient batch document upload
- **Metadata Management**: Complete document metadata indexing
- **Incremental Updates**: Support incremental document updates

### 3. Data Processing

- **Hierarchy Structure Repair**: Automatically fix title hierarchy in Markdown documents
- **Metadata Extraction**: Extract structured metadata from documents and filenames
- **Format Conversion**: Unified processing support for multiple document formats

## API and Integration

### Azure Service Integration

- **Azure Document Intelligence**: Document analysis and OCR
- **Azure AI Search**: Search indexing and querying
- **Azure Blob Storage**: Document storage
- **Azure OpenAI**: Vector embedding generation

### Database Support

- PostgreSQL (recommended)
- SQLite (development and testing)
- Other SQLAlchemy-supported databases

## Monitoring and Logging

The system provides comprehensive logging capabilities:

- Processing progress monitoring
- Error logging
- Performance statistics
- Task status tracking

View logs:

```bash
# Kubernetes environment
kubectl logs -f document-ai-indexer -n knowledge-agent

# Docker environment
docker logs -f
```

## Development

### Development Mode

```bash
# Activate virtual environment
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\activate     # Windows

# Install development dependencies
pip install -e .[dev,test]

# Run code checks
mypy .
```

### Log Analysis

```bash
# View error logs
kubectl logs document-ai-indexer -n knowledge-agent | grep ERROR

# View processing progress
kubectl logs document-ai-indexer -n knowledge-agent | grep "Processing"
```

## Version Information

- **Current Version**: 0.20.4
- **Python Version**: 3.12+
- **Main Dependencies**:
  - azure-ai-documentintelligence
  - azure-search-documents
  - SQLAlchemy 2.0.41
  - openai 1.55.3

---

*Last updated: August 2025*
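
To illustrate the idea behind the Hierarchy Structure Repair feature listed under Core Components, here is a minimal sketch of one way heading levels in a Markdown document could be normalized. It is not the code in `hierarchy_fix.py`, whose behavior is not described here. The function name `fix_heading_hierarchy` and the rule it applies (renumber each heading by its nesting depth, so that a jump such as `#` to `###` becomes `#` to `##`) are assumptions made for this example only.

```python
import re

# Matches ATX-style Markdown headings such as "## Title".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def fix_heading_hierarchy(markdown: str) -> str:
    """Renumber headings by nesting depth so no level is skipped.

    Hypothetical helper for illustration; not the project's implementation.
    """
    fixed_lines = []
    stack = []        # original levels of the currently open ancestor headings
    in_fence = False  # True while inside a fenced code block

    for line in markdown.splitlines():
        # Toggle fence state so "# comment" lines in code blocks are left alone.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence

        match = None if in_fence else HEADING_RE.match(line)
        if match is None:
            fixed_lines.append(line)
            continue

        level = len(match.group(1))
        # Close every open heading at the same or a deeper original level.
        while stack and stack[-1] >= level:
            stack.pop()
        stack.append(level)

        # The new level is the nesting depth of this heading.
        fixed_lines.append("#" * len(stack) + " " + match.group(2))

    return "\n".join(fixed_lines)


if __name__ == "__main__":
    sample = "# Guide\n### Install\n## Usage\n#### Options"
    print(fix_heading_hierarchy(sample))
```

Note that this sketch always promotes the first heading to level 1 and ignores Setext-style (underlined) headings; a production implementation would likely need to handle both cases differently.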