Ye Shijie db0e5965ec init

2025-09-26 17:15:54 +08:00

Document AI Indexer

An intelligent document processing and indexing system based on Azure AI services, supporting content extraction, processing, and vectorized indexing for multiple document formats.

Features

🚀 Core Features

Multi-format Document Support: PDF, DOCX, image formats, etc.
Intelligent Content Extraction: OCR and structured extraction using Azure Document Intelligence
Document Chunking: Splits extracted content into chunks and generates a vector embedding for each chunk
Azure AI Search Integration: Automatically create search indexes and upload documents
Metadata Management: Complete document metadata extraction and management
Hierarchy Structure Repair: Automatically fix title hierarchy structure in Markdown documents

🔧 Technical Features

Asynchronous Processing: High-performance async processing based on asyncio
Containerized Deployment: Complete Docker and Kubernetes support
Configuration Management: Flexible YAML configuration file management
Database Support: SQLAlchemy ORM supporting multiple databases
Resilient Processing: Built-in retry mechanisms and error handling
Monitoring & Logging: Complete logging and progress monitoring
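
This README does not say how the built-in retry mechanism works. As a minimal sketch, assuming exponential backoff with a fixed attempt limit, an async retry helper could look like the following. `retry_async`, `max_attempts`, and `base_delay` are hypothetical names for illustration, not the project's API.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_async(
    operation: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Call `operation` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")
```

A real implementation would probably retry only on transient errors (timeouts, HTTP 429/503) rather than on every exception.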

System Architecture

graph LR
    subgraph "Data Sources"
        DS[Document Sources<br/>Blob Storage/Local]
        MD[Metadata<br/>Extraction]
    end
    
    subgraph "Azure AI Services"
        ADI[Azure Document<br/>Intelligence]
        AAS[Azure AI Search<br/>Index]
        EMB[Vector<br/>Embedding]
    end
    
    subgraph "Processing Pipeline"
        HF[Hierarchy<br/>Fix]
        CH[Content<br/>Chunking]
    end
    
    DS --> ADI
    MD --> HF
    ADI --> HF
    HF --> CH
    CH --> EMB
    EMB --> AAS
    
    style DS fill:#e1f5fe
    style ADI fill:#e8f5e8
    style AAS fill:#fff3e0
    style EMB fill:#f3e5f5
    style HF fill:#ffebee
    style CH fill:#f1f8e9

Document Processing Flow

flowchart TD
    START([Document Input]) --> DOWNLOAD[Download Document]
    DOWNLOAD --> EXTRACT[AI Content Extraction]
    EXTRACT --> FIX[Hierarchy Structure Fix]
    FIX --> CHUNK[Content Chunking]
    CHUNK --> EMBED[Vector Embedding]
    EMBED --> INDEX[Search Index Upload]
    INDEX --> END([Processing Complete])
    
    style START fill:#c8e6c9
    style END fill:#c8e6c9
    style EXTRACT fill:#e1f5fe
    style FIX fill:#fff3e0
    style CHUNK fill:#f3e5f5
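
The flow above can be sketched with asyncio as a fixed sequence of stages per document, with a semaphore limiting how many documents are in flight. This is an illustration only: the stage bodies are placeholders, and `process_document`, `process_all`, and `concurrency` are assumed names rather than the project's actual functions.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class PipelineResult:
    source: str
    stages: list[str] = field(default_factory=list)


# Stage names mirror the flowchart; the bodies are placeholders.
STAGES = ["download", "extract", "fix_hierarchy", "chunk", "embed", "index"]


async def run_stage(name: str, result: PipelineResult) -> None:
    await asyncio.sleep(0)  # stands in for real I/O (Blob Storage, Document Intelligence, ...)
    result.stages.append(name)


async def process_document(source: str) -> PipelineResult:
    result = PipelineResult(source)
    for name in STAGES:  # stages run strictly in order for one document
        await run_stage(name, result)
    return result


async def process_all(sources: list[str], concurrency: int = 4) -> list[PipelineResult]:
    semaphore = asyncio.Semaphore(concurrency)  # bound the number of concurrent documents

    async def guarded(source: str) -> PipelineResult:
        async with semaphore:
            return await process_document(source)

    # gather() returns results in the same order as `sources`.
    return await asyncio.gather(*(guarded(s) for s in sources))
```

For example, `asyncio.run(process_all(["a.pdf", "b.docx"]))` runs both documents through all six stages.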

Quick Start

Requirements

Python 3.12+
Azure subscription and related services

For a detailed deployment guide, see Deployment.md.

Install Dependencies

pip install -r requirements.txt

Configuration Files

The system uses two main configuration files:

config.yaml - Business configuration (data source, index configuration, etc.)
env.yaml - Environment variable configuration (Azure service keys, etc.)

Quick Start Configuration:

# env.yaml - Essential Azure services
search_service_name: "https://your-search-service.search.windows.net"
search_admin_key: "your-search-admin-key"
form_rec_resource: "https://your-di-service.cognitiveservices.azure.com/"
form_rec_key: "your-di-key"
embedding_model_endpoint: "https://your-openai.openai.azure.com/..."
embedding_model_key: "your-openai-key"

# config.yaml - Basic data source
data_configs:
  - data_path: "https://your-blob-storage.blob.core.windows.net/container?sas-token"
    index_schemas:
      - index_name: "your-knowledge-index"
        data_type: ["metadata", "document", "chunk"]

📖 Detailed configuration instructions: for the complete list of configuration parameters and examples, see the "Configuration file preparation" section of Deployment.md.
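
How app_config.py loads and validates these files is not shown here. As a rough sketch, assuming both YAML files have already been parsed into plain dictionaries, a pre-flight check for the keys used in the quick-start example might look like this. `REQUIRED_ENV_KEYS` and `validate_config` are hypothetical names.

```python
# Keys taken from the quick-start env.yaml example above.
REQUIRED_ENV_KEYS = (
    "search_service_name", "search_admin_key",
    "form_rec_resource", "form_rec_key",
    "embedding_model_endpoint", "embedding_model_key",
)


def validate_config(env: dict, config: dict) -> list[str]:
    """Return human-readable problems; an empty list means nothing obvious is missing."""
    problems = [f"env.yaml: missing '{key}'" for key in REQUIRED_ENV_KEYS if not env.get(key)]
    data_configs = config.get("data_configs") or []
    if not data_configs:
        problems.append("config.yaml: 'data_configs' is empty")
    for i, entry in enumerate(data_configs):
        if not entry.get("data_path"):
            problems.append(f"config.yaml: data_configs[{i}] has no 'data_path'")
        for schema in entry.get("index_schemas") or []:
            if not schema.get("index_name"):
                problems.append(f"config.yaml: data_configs[{i}] has a schema without 'index_name'")
    return problems
```

Collecting all problems in a list, instead of raising on the first one, lets a single run report every missing setting.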

Run Application

# Direct execution
python main.py

# Or use predefined tasks
# (In VS Code, press Ctrl+Shift+P and choose "Tasks: Run Task")

Deployment Guide (Deployment.md) - Covers full deployment, including Docker and Kubernetes
Configuration instructions - Describes the configuration files in detail

Project Structure

document-extractor/
├── main.py                     # Application entry point
├── app_config.py              # Configuration management
├── business_layer.py          # Business logic layer
├── document_task_processor.py # Document task processor
├── di_extractor.py           # Document Intelligence extractor
├── azure_index_service.py    # Azure Search service
├── blob_service.py           # Blob storage service
├── chunk_service.py          # Document chunking service
├── hierarchy_fix.py          # Hierarchy structure repair
├── database.py               # Database models
├── entity_models.py          # Entity models
├── utils.py                  # Utility functions
├── config.yaml               # Business configuration
├── env.yaml                  # Environment configuration
├── requirements.txt          # Dependencies
├── Dockerfile               # Docker build file
├── pyproject.toml           # Project configuration
├── build-script/            # Build scripts
│   └── document-ai-indexer.sh
├── deploy/                  # Deployment files
│   ├── document-ai-indexer.sh
│   ├── document-ai-indexer_k8s.yml
│   ├── document-ai-indexer_cronjob.yml
│   └── embedding-api-proxy_k8s.yml
└── doc/                     # Documentation

Core Components

1. Document Processing Pipeline

Document Loading: Support loading from Azure Blob Storage or local file system
Content Extraction: OCR and structured extraction using Azure Document Intelligence
Content Chunking: Smart chunking algorithms maintaining semantic integrity
Vectorization: Generate vector representations of document content
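
The chunking algorithm in chunk_service.py is not described in this README. One simple way to keep chunks semantically coherent, shown here purely as an assumption, is to never let a chunk cross a Markdown heading and to pack whole paragraphs up to a size limit. `chunk_markdown` and `max_chars` are illustrative names.

```python
import re

HEADING = re.compile(r"^#{1,6}\s")


def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown into chunks that never cross a heading boundary."""
    sections: list[list[str]] = [[]]
    for line in text.splitlines():
        if HEADING.match(line) and sections[-1]:
            sections.append([])  # a heading starts a new section
        sections[-1].append(line)

    chunks: list[str] = []
    for section in sections:
        current = ""
        # Pack whole paragraphs; an oversized paragraph becomes its own chunk.
        for paragraph in "\n".join(section).split("\n\n"):
            paragraph = paragraph.strip()
            if not paragraph:
                continue
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if len(candidate) > max_chars and current:
                chunks.append(current)
                current = paragraph
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

This sketch counts characters rather than tokens and treats any line starting with `#` as a heading, so comment lines inside code blocks would also split sections.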

2. Index Management

Dynamic Index Creation: Automatically create Azure AI Search indexes based on configuration
Batch Upload: Efficient batch document upload
Metadata Management: Complete document metadata indexing
Incremental Updates: Support incremental document updates
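
Batch upload and incremental updates can be illustrated with two small helpers: one that detects changed documents by comparing content hashes, and one that groups documents into fixed-size batches. The hash-based change detection is an assumption about how incremental updates might work, not a description of azure_index_service.py; all function names are hypothetical.

```python
import hashlib
from typing import Iterable, Iterator


def content_hash(text: str) -> str:
    """Stable fingerprint used to detect whether a document changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_changed(documents: dict[str, str], known_hashes: dict[str, str]) -> list[str]:
    """Return IDs whose content is new or differs from the stored hash."""
    return [
        doc_id for doc_id, text in documents.items()
        if known_hashes.get(doc_id) != content_hash(text)
    ]


def batched(items: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    """Yield lists of at most `batch_size` items, preserving order."""
    batch: list[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch
```

Each batch could then be passed to `SearchClient.upload_documents(documents=batch)` from azure-search-documents. Azure AI Search limits a single indexing request to 1,000 documents, so `batch_size` should stay at or below that.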

3. Data Processing

Hierarchy Structure Repair: Automatically fix title hierarchy in Markdown documents
Metadata Extraction: Extract structured metadata from documents and filenames
Format Conversion: Unified processing support for multiple document formats
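
The exact repair rules in hierarchy_fix.py are not documented here. A minimal sketch of one plausible rule, assuming the goal is to remove skipped heading levels (for example a `###` directly under a `#`), is to renumber every heading to one level below its nearest shallower ancestor. `fix_heading_levels` is a hypothetical name.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def fix_heading_levels(markdown: str) -> str:
    """Renumber headings so each sits exactly one level below its nearest ancestor."""
    fixed_lines: list[str] = []
    stack: list[tuple[int, int]] = []  # (original level, corrected level) of open ancestors
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if not match:
            fixed_lines.append(line)  # body text is passed through unchanged
            continue
        level = len(match.group(1))
        # Drop headings that are not ancestors of the current one.
        while stack and stack[-1][0] >= level:
            stack.pop()
        corrected = stack[-1][1] + 1 if stack else 1
        stack.append((level, corrected))
        fixed_lines.append(f"{'#' * corrected} {match.group(2)}")
    return "\n".join(fixed_lines)
```

For instance, `"# A\n### B\n#### C\n## D"` becomes `"# A\n## B\n### C\n## D"`. Headings produced by OCR often need more context than this (font size, numbering), so treat it as a baseline only.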

API and Integration

Azure Service Integration

Azure Document Intelligence: Document analysis and OCR
Azure AI Search: Search indexing and querying
Azure Blob Storage: Document storage
Azure OpenAI: Vector embedding generation

Database Support

PostgreSQL (recommended)
SQLite (development and testing)
Other SQLAlchemy-supported databases

Monitoring and Logging

The system provides comprehensive logging capabilities:

Processing progress monitoring
Error logging
Performance statistics
Task status tracking

View logs:

# Kubernetes environment
kubectl logs -f document-ai-indexer -n knowledge-agent

# Docker environment
docker logs -f <container-id>

Development

Development Mode

# Activate virtual environment
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\activate     # Windows

# Install development dependencies
pip install -e ".[dev,test]"  # quotes keep shells such as zsh from expanding the brackets

# Run static type checks
mypy .

Log Analysis

# View error logs
kubectl logs document-ai-indexer -n knowledge-agent | grep ERROR

# View processing progress
kubectl logs document-ai-indexer -n knowledge-agent | grep "Processing"

Version Information

Current Version: 0.20.4
Python Version: 3.12+
Main Dependencies:
- azure-ai-documentintelligence
- azure-search-documents
- SQLAlchemy 2.0.41
- openai 1.55.3

Last updated: August 2025
