catonline_ai/vw-document-ai-indexer
2025-09-26 17:15:54 +08:00

Document AI Indexer

An intelligent document processing and indexing system based on Azure AI services, supporting content extraction, processing, and vectorized indexing for multiple document formats.

Features

🚀 Core Features

  • Multi-format Document Support: PDF, DOCX, image formats, etc.
  • Intelligent Content Extraction: OCR and structured extraction using Azure Document Intelligence
  • Document Chunking: Split documents into chunks that preserve semantic integrity, then vectorize the chunks
  • Azure AI Search Integration: Automatically create search indexes and upload documents
  • Metadata Management: Complete document metadata extraction and management
  • Hierarchy Structure Repair: Automatically fix the heading hierarchy in Markdown documents

🔧 Technical Features

  • Asynchronous Processing: High-performance async processing based on asyncio
  • Containerized Deployment: Complete Docker and Kubernetes support
  • Configuration Management: Flexible YAML configuration file management
  • Database Support: SQLAlchemy ORM supporting multiple databases
  • Resilient Processing: Built-in retry mechanisms and error handling
  • Monitoring & Logging: Complete logging and progress monitoring

System Architecture

graph LR
    subgraph "Data Sources"
        DS[Document Sources<br/>Blob Storage/Local]
        MD[Metadata<br/>Extraction]
    end
    
    subgraph "Azure AI Services"
        ADI[Azure Document<br/>Intelligence]
        AAS[Azure AI Search<br/>Index]
        EMB[Vector<br/>Embedding]
    end
    
    subgraph "Processing Pipeline"
        HF[Hierarchy<br/>Fix]
        CH[Content<br/>Chunking]
    end
    
    DS --> ADI
    MD --> HF
    ADI --> HF
    HF --> CH
    CH --> EMB
    EMB --> AAS
    
    style DS fill:#e1f5fe
    style ADI fill:#e8f5e8
    style AAS fill:#fff3e0
    style EMB fill:#f3e5f5
    style HF fill:#ffebee
    style CH fill:#f1f8e9

Document Processing Flow

flowchart TD
    START([Document Input]) --> DOWNLOAD[Download Document]
    DOWNLOAD --> EXTRACT[AI Content Extraction]
    EXTRACT --> FIX[Hierarchy Structure Fix]
    FIX --> CHUNK[Content Chunking]
    CHUNK --> EMBED[Vector Embedding]
    EMBED --> INDEX[Search Index Upload]
    INDEX --> END([Processing Complete])
    
    style START fill:#c8e6c9
    style END fill:#c8e6c9
    style EXTRACT fill:#e1f5fe
    style FIX fill:#fff3e0
    style CHUNK fill:#f3e5f5
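
The flow above can be read as a chain of stages that each take the document state and hand it to the next. The sketch below illustrates that shape only: every function body is a stand-in (no Azure service is called), and `PipelineState`, `STAGES`, and `run_pipeline` are hypothetical names rather than the project's actual classes.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineState:
    """Carries one document through the stages; all field names are illustrative."""
    source: str
    raw: str = ""
    markdown: str = ""
    chunks: list = field(default_factory=list)
    vectors: list = field(default_factory=list)
    indexed: int = 0
    completed: list = field(default_factory=list)


def download(state):
    state.raw = f"bytes of {state.source}"  # stand-in for a Blob Storage or local read
    return state


def extract(state):
    # Stand-in for an extraction service returning Markdown with a skipped level.
    state.markdown = "# Title\n\n### Skipped level\n\nBody text."
    return state


def fix_hierarchy(state):
    state.markdown = state.markdown.replace("### ", "## ")  # placeholder repair
    return state


def chunk(state):
    state.chunks = [part for part in state.markdown.split("\n\n") if part.strip()]
    return state


def embed(state):
    state.vectors = [[float(len(c))] for c in state.chunks]  # fake 1-dimensional vectors
    return state


def index(state):
    state.indexed = len(state.vectors)  # stand-in for a search index upload
    return state


STAGES = [download, extract, fix_hierarchy, chunk, embed, index]


def run_pipeline(source):
    """Run every stage in order and record which ones completed."""
    state = PipelineState(source=source)
    for stage in STAGES:
        state = stage(state)
        state.completed.append(stage.__name__)
    return state


state = run_pipeline("report.pdf")
print(state.completed)
print(state.chunks)
```

Keeping the stages in a list makes the order explicit and lets a failed run report the last completed stage.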

Quick Start

Requirements

  • Python 3.12+
  • Azure subscription and related services

For the detailed deployment guide, see Deployment.md.

Install Dependencies

pip install -r requirements.txt

Configuration Files

The system uses two main configuration files:

  • config.yaml - Business configuration (data source, index configuration, etc.)
  • env.yaml - Environment variable configuration (Azure service keys, etc.)

Quick Start Configuration:

# env.yaml - Essential Azure services
search_service_name: "https://your-search-service.search.windows.net"
search_admin_key: "your-search-admin-key"
form_rec_resource: "https://your-di-service.cognitiveservices.azure.com/"
form_rec_key: "your-di-key"
embedding_model_endpoint: "https://your-openai.openai.azure.com/..."
embedding_model_key: "your-openai-key"

# config.yaml - Basic data source
data_configs:
  - data_path: "https://your-blob-storage.blob.core.windows.net/container?sas-token"
    index_schemas:
      - index_name: "your-knowledge-index"
        data_type: ["metadata", "document", "chunk"]

📖 Detailed configuration instructions: for the complete configuration parameters and examples, see Deployment.md - Configuration file preparation
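
Before a run starts, it can help to validate the `data_configs` structure shown above. The following is a sketch only: it assumes the YAML has already been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`), and it assumes the three `data_type` values in the example are the only valid ones, which the source does not confirm. `IndexSchema`, `DataConfig`, and `parse_data_configs` are hypothetical names.

```python
from dataclasses import dataclass

# Assumption: only the values shown in the example config are accepted.
ALLOWED_DATA_TYPES = {"metadata", "document", "chunk"}


@dataclass
class IndexSchema:
    index_name: str
    data_type: list


@dataclass
class DataConfig:
    data_path: str
    index_schemas: list


def parse_data_configs(raw):
    """Validate parsed config.yaml content and return typed objects."""
    configs = []
    for i, entry in enumerate(raw.get("data_configs", [])):
        if not entry.get("data_path"):
            raise ValueError(f"data_configs[{i}]: data_path is required")
        schemas = []
        for schema in entry.get("index_schemas", []):
            name = schema.get("index_name")
            if not name:
                raise ValueError(f"data_configs[{i}]: index_name is required")
            types = list(schema.get("data_type", []))
            unknown = set(types) - ALLOWED_DATA_TYPES
            if unknown:
                raise ValueError(f"{name}: unknown data_type {sorted(unknown)}")
            schemas.append(IndexSchema(name, types))
        configs.append(DataConfig(entry["data_path"], schemas))
    return configs


# Mirrors the config.yaml example; in practice this dict would come from the YAML parser.
raw_config = {
    "data_configs": [
        {
            "data_path": "https://your-blob-storage.blob.core.windows.net/container?sas-token",
            "index_schemas": [
                {"index_name": "your-knowledge-index",
                 "data_type": ["metadata", "document", "chunk"]},
            ],
        }
    ]
}

configs = parse_data_configs(raw_config)
print(configs[0].index_schemas[0].index_name)
```

Failing fast on a malformed entry is cheaper than discovering it after documents have already been downloaded and extracted.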

Run Application

# Direct execution
python main.py

# Or use predefined tasks
# (In VS Code, use Ctrl+Shift+P -> Run Task)

📚 Document Navigation

Project Structure

document-extractor/
├── main.py                     # Application entry point
├── app_config.py              # Configuration management
├── business_layer.py          # Business logic layer
├── document_task_processor.py # Document task processor
├── di_extractor.py           # Document Intelligence extractor
├── azure_index_service.py    # Azure Search service
├── blob_service.py           # Blob storage service
├── chunk_service.py          # Document chunking service
├── hierarchy_fix.py          # Hierarchy structure repair
├── database.py               # Database models
├── entity_models.py          # Entity models
├── utils.py                  # Utility functions
├── config.yaml               # Business configuration
├── env.yaml                  # Environment configuration
├── requirements.txt          # Dependencies
├── Dockerfile               # Docker build file
├── pyproject.toml           # Project configuration
├── build-script/            # Build scripts
│   └── document-ai-indexer.sh
├── deploy/                  # Deployment files
│   ├── document-ai-indexer.sh
│   ├── document-ai-indexer_k8s.yml
│   ├── document-ai-indexer_cronjob.yml
│   └── embedding-api-proxy_k8s.yml
└── doc/                     # Documentation

Core Components

1. Document Processing Pipeline

  • Document Loading: Support loading from Azure Blob Storage or local file system
  • Content Extraction: OCR and structured extraction using Azure Document Intelligence
  • Content Chunking: Smart chunking algorithms maintaining semantic integrity
  • Vectorization: Generate vector representations of document content
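
One common way to keep chunks semantically coherent is to never let a chunk cross a heading boundary and to pack whole paragraphs rather than cutting mid-sentence. The sketch below shows that idea; it is not the algorithm in `chunk_service.py`, and `chunk_markdown` and the `max_chars` limit are hypothetical. It assumes headings are separated from body text by blank lines.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(markdown, max_chars=200):
    """Split Markdown into chunks that never cross a heading boundary.

    Paragraphs under one heading are packed together until max_chars would be
    exceeded; each chunk records the heading it belongs to. A single paragraph
    longer than max_chars is kept whole as its own chunk.
    """
    chunks = []
    heading = ""
    buffer = []

    def flush():
        if buffer:
            chunks.append({"heading": heading, "text": "\n\n".join(buffer)})
            buffer.clear()

    for block in re.split(r"\n\s*\n", markdown.strip()):
        block = block.strip()
        if not block:
            continue
        match = HEADING.match(block)
        if match:
            flush()  # a new section starts: close the current chunk first
            heading = match.group(2)
            continue
        size = sum(len(b) for b in buffer) + len(block)
        if buffer and size > max_chars:
            flush()
        buffer.append(block)
    flush()
    return chunks


sample = (
    "# Intro\n\nFirst paragraph.\n\nSecond paragraph.\n\n"
    "## Details\n\nThird paragraph."
)
chunks = chunk_markdown(sample, max_chars=20)
for c in chunks:
    print(c["heading"], "|", c["text"])
```

Storing the heading with each chunk gives the search index some context even when the chunk text itself does not repeat the section title.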

2. Index Management

  • Dynamic Index Creation: Automatically create Azure AI Search indexes based on configuration
  • Batch Upload: Efficient batch document upload
  • Metadata Management: Complete document metadata indexing
  • Incremental Updates: Support incremental document updates
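
Batch upload and incremental updates can be combined by hashing each document's content and uploading only what changed since the previous run. This is a sketch under assumptions: the `id` and `content` field names, the in-memory `known_hashes` store, and the `upload` callable are all hypothetical. In a real deployment `upload` might wrap a call such as `SearchClient.upload_documents` from `azure-search-documents`, and the hashes would likely live in the database.

```python
import hashlib
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def content_hash(doc):
    """Return a stable fingerprint of the document content."""
    return hashlib.sha256(doc["content"].encode("utf-8")).hexdigest()


def upload_changed(documents, known_hashes, upload, batch_size=2):
    """Upload only new or modified documents, in batches.

    known_hashes maps document id -> hash from the previous run and is
    updated in place after each batch has been handed to `upload`.
    """
    changed = [d for d in documents if known_hashes.get(d["id"]) != content_hash(d)]
    for batch in batched(changed, batch_size):
        upload(batch)
        for doc in batch:
            known_hashes[doc["id"]] = content_hash(doc)
    return len(changed)


sent_batches = []
docs = [
    {"id": "1", "content": "alpha"},
    {"id": "2", "content": "beta"},
    {"id": "3", "content": "gamma"},
]
hashes = {}
first_run = upload_changed(docs, hashes, sent_batches.append)
docs[1]["content"] = "beta v2"  # simulate one modified document
second_run = upload_changed(docs, hashes, sent_batches.append)
print(first_run, second_run, [len(b) for b in sent_batches])
```

Recording the hash only after a batch has been sent means a crash mid-run causes at most a re-upload, never a silently skipped document.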

3. Data Processing

  • Hierarchy Structure Repair: Automatically fix the heading hierarchy in Markdown documents
  • Metadata Extraction: Extract structured metadata from documents and filenames
  • Format Conversion: Unified processing of multiple document formats
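
Extracted Markdown often skips heading levels, for example jumping from `#` straight to `###`. A repair step can renumber headings so that each one sits exactly one level below its parent. The sketch below is one plausible approach, not the logic in `hierarchy_fix.py`; `fix_heading_hierarchy` is a hypothetical name, and the function also normalises the outermost heading to level 1.

```python
import re

HEADING = re.compile(r"^(#{1,6})(\s+.*)$")
FENCE = "`" * 3  # marker that opens or closes a fenced code block


def fix_heading_hierarchy(markdown):
    """Rewrite heading levels so no heading skips a level below its parent."""
    fixed_lines = []
    stack = []  # original levels of the currently open ancestors, outermost first
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # leave '#' comments inside code fences alone
        match = None if in_fence else HEADING.match(line)
        if match:
            level = len(match.group(1))
            # Close sections whose original level is the same or deeper.
            while stack and stack[-1] >= level:
                stack.pop()
            stack.append(level)
            # The new level is the nesting depth, so gaps disappear.
            line = "#" * len(stack) + match.group(2)
        fixed_lines.append(line)
    return "\n".join(fixed_lines)


broken = "# A\n### B\n#### C\n## D\n### E"
repaired = fix_heading_hierarchy(broken)
print(repaired)
```

Tracking the original levels on a stack preserves the relative nesting the extractor detected while removing the gaps, which keeps heading-based chunking predictable.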

API and Integration

Azure Service Integration

  • Azure Document Intelligence: Document analysis and OCR
  • Azure AI Search: Search indexing and querying
  • Azure Blob Storage: Document storage
  • Azure OpenAI: Vector embedding generation

Database Support

  • PostgreSQL (recommended)
  • SQLite (development and testing)
  • Other SQLAlchemy-supported databases

Monitoring and Logging

The system provides comprehensive logging capabilities:

  • Processing progress monitoring
  • Error logging
  • Performance statistics
  • Task status tracking
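
The four capabilities above can be sketched with the standard `logging` module. The logger name, message wording, and status values below are assumptions made for illustration; the project's actual log format is not documented here.

```python
import io
import logging
import time


def build_logger(stream):
    """Create a logger that writes '<LEVEL> <message>' lines to stream."""
    logger = logging.getLogger("indexer_sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def process_with_progress(paths, worker, logger):
    """Run worker on each path, logging progress, errors, and elapsed time."""
    status = {}
    started = time.perf_counter()
    for position, path in enumerate(paths, start=1):
        logger.info("Processing %s (%d/%d)", path, position, len(paths))
        try:
            worker(path)
            status[path] = "succeeded"
        except Exception as exc:
            status[path] = "failed"
            logger.error("Failed %s: %s", path, exc)
    elapsed = time.perf_counter() - started
    logger.info("Finished %d documents in %.3fs", len(paths), elapsed)
    return status


def demo_worker(path):
    """Hypothetical worker that rejects files whose name starts with 'bad'."""
    if path.startswith("bad"):
        raise ValueError("unreadable file")


log_stream = io.StringIO()
status = process_with_progress(["a.pdf", "bad.pdf"], demo_worker, build_logger(log_stream))
log_text = log_stream.getvalue()
print(log_text)
print(status)
```

Putting the level name at the start of each line and using a fixed word such as "Processing" for progress messages keeps the output easy to filter with plain text tools.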

View logs:

# Kubernetes environment
kubectl logs -f document-ai-indexer -n knowledge-agent

# Docker environment
docker logs -f <container-id>

Development

Development Mode

# Activate virtual environment
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\activate     # Windows

# Install development dependencies
pip install -e ".[dev,test]"   # quotes keep shells such as zsh from expanding the brackets

# Run static type checks
mypy .

Log Analysis

# View error logs
kubectl logs document-ai-indexer -n knowledge-agent | grep ERROR

# View processing progress
kubectl logs document-ai-indexer -n knowledge-agent | grep "Processing"

Version Information

  • Current Version: 0.20.4
  • Python Version: 3.12+
  • Main Dependencies:
    • azure-ai-documentintelligence
    • azure-search-documents
    • SQLAlchemy 2.0.41
    • openai 1.55.3

Last updated: August 2025