# Document AI Indexer

Document AI Indexer is a document processing and indexing system built on Azure AI services. It extracts content from multiple document formats, processes it, and builds vectorized search indexes.

## Features

### 🚀 Core Features
- **Multi-format Document Support**: PDF, DOCX, image formats, and others
- **Intelligent Content Extraction**: OCR and structured extraction using Azure Document Intelligence
- **Document Chunking**: Splits documents into chunks and vectorizes them
- **Azure AI Search Integration**: Automatically creates search indexes and uploads documents
- **Metadata Management**: Extracts and manages document metadata
- **Hierarchy Structure Repair**: Automatically fixes the heading hierarchy in Markdown documents

### 🔧 Technical Features
- **Asynchronous Processing**: High-performance async processing based on asyncio
- **Containerized Deployment**: Complete Docker and Kubernetes support
- **Configuration Management**: Flexible YAML configuration file management
- **Database Support**: SQLAlchemy ORM supporting multiple databases
- **Resilient Processing**: Built-in retry mechanisms and error handling
- **Monitoring & Logging**: Complete logging and progress monitoring

## System Architecture

```mermaid
graph LR
    subgraph "Data Sources"
        DS[Document Sources<br/>Blob Storage/Local]
        MD[Metadata<br/>Extraction]
    end

    subgraph "Azure AI Services"
        ADI[Azure Document<br/>Intelligence]
        AAS[Azure AI Search<br/>Index]
        EMB[Vector<br/>Embedding]
    end

    subgraph "Processing Pipeline"
        HF[Hierarchy<br/>Fix]
        CH[Content<br/>Chunking]
    end

    DS --> ADI
    MD --> HF
    ADI --> HF
    HF --> CH
    CH --> EMB
    EMB --> AAS

    style DS fill:#e1f5fe
    style ADI fill:#e8f5e8
    style AAS fill:#fff3e0
    style EMB fill:#f3e5f5
    style HF fill:#ffebee
    style CH fill:#f1f8e9
```

### Document Processing Flow

```mermaid
flowchart TD
    START([Document Input]) --> DOWNLOAD[Download Document]
    DOWNLOAD --> EXTRACT[AI Content Extraction]
    EXTRACT --> FIX[Hierarchy Structure Fix]
    FIX --> CHUNK[Content Chunking]
    CHUNK --> EMBED[Vector Embedding]
    EMBED --> INDEX[Search Index Upload]
    INDEX --> END([Processing Complete])

    style START fill:#c8e6c9
    style END fill:#c8e6c9
    style EXTRACT fill:#e1f5fe
    style FIX fill:#fff3e0
    style CHUNK fill:#f3e5f5
```
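
The flow above can be read as a chain of async stages applied to each document. The sketch below shows that shape only: `DocumentState`, `run_pipeline`, `process_many`, and the two placeholder stages are hypothetical names, and the real stages would call Azure services rather than manipulate strings.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable


@dataclass
class DocumentState:
    """Carries one document through the stages (hypothetical container)."""
    name: str
    content: str = ""
    chunks: list = field(default_factory=list)
    completed: list = field(default_factory=list)


Stage = Callable[[DocumentState], Awaitable[None]]


async def extract(state: DocumentState) -> None:
    # Placeholder: a real stage would call Azure Document Intelligence here.
    state.content = f"# {state.name}\n\nExtracted text."


async def chunk(state: DocumentState) -> None:
    # Placeholder: split on blank lines instead of real semantic chunking.
    state.chunks = [part for part in state.content.split("\n\n") if part]


STAGES = [("extract", extract), ("chunk", chunk)]


async def run_pipeline(state: DocumentState, stages) -> DocumentState:
    """Run the stages in order, recording each one that finishes."""
    for label, stage in stages:
        await stage(state)
        state.completed.append(label)
    return state


async def process_many(names, stages, concurrency: int = 4):
    """Process several documents at once, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(name: str) -> DocumentState:
        async with semaphore:
            return await run_pipeline(DocumentState(name), stages)

    return await asyncio.gather(*(worker(name) for name in names))


if __name__ == "__main__":
    for result in asyncio.run(process_many(["a.pdf", "b.pdf"], STAGES)):
        print(result.name, result.completed, len(result.chunks))
```

Recording completed stages on the state object makes it easy to report where a document stopped if a stage raises.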

## Quick Start

### Requirements

- Python 3.12+
- Azure subscription and related services

For a detailed deployment guide, see [Deployment.md](Deployment.md).

### Install Dependencies

```bash
pip install -r requirements.txt
```

### Configuration Files

The system uses two main configuration files:

- `config.yaml` - Business configuration (data source, index configuration, etc.)
- `env.yaml` - Environment variable configuration (Azure service keys, etc.)

**Quick Start Configuration:**

```yaml
# env.yaml - Essential Azure services
search_service_name: "https://your-search-service.search.windows.net"
search_admin_key: "your-search-admin-key"
form_rec_resource: "https://your-di-service.cognitiveservices.azure.com/"
form_rec_key: "your-di-key"
embedding_model_endpoint: "https://your-openai.openai.azure.com/..."
embedding_model_key: "your-openai-key"

# config.yaml - Basic data source
data_configs:
  - data_path: "https://your-blob-storage.blob.core.windows.net/container?sas-token"
    index_schemas:
      - index_name: "your-knowledge-index"
        data_type: ["metadata", "document", "chunk"]
```
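
Before running the indexer, it can help to check that both files contain the keys shown above. The following is a minimal sketch that validates already-parsed dictionaries (for example, loaded with a YAML parser). `validate_settings` is a hypothetical helper, and treating `metadata`, `document`, and `chunk` as the only valid `data_type` values is an assumption based on the example, not a documented rule.

```python
# Keys taken from the env.yaml example above.
REQUIRED_ENV_KEYS = (
    "search_service_name",
    "search_admin_key",
    "form_rec_resource",
    "form_rec_key",
    "embedding_model_endpoint",
    "embedding_model_key",
)

# Assumption: the three values in the example are the full set.
KNOWN_DATA_TYPES = {"metadata", "document", "chunk"}


def validate_settings(env: dict, config: dict) -> list:
    """Return human-readable problems; an empty list means nothing was found."""
    problems = []
    for key in REQUIRED_ENV_KEYS:
        if not str(env.get(key) or "").strip():
            problems.append(f"env.yaml: missing value for '{key}'")

    data_configs = config.get("data_configs") or []
    if not data_configs:
        problems.append("config.yaml: 'data_configs' is empty")
    for position, data_config in enumerate(data_configs):
        if not data_config.get("data_path"):
            problems.append(f"config.yaml: data_configs[{position}] has no 'data_path'")
        for schema in data_config.get("index_schemas") or []:
            name = schema.get("index_name", "<unnamed>")
            unknown = set(schema.get("data_type") or []) - KNOWN_DATA_TYPES
            if unknown:
                problems.append(
                    f"config.yaml: index '{name}' has unknown data_type {sorted(unknown)}"
                )
    return problems


if __name__ == "__main__":
    env = {key: "placeholder" for key in REQUIRED_ENV_KEYS}
    config = {
        "data_configs": [
            {
                "data_path": "https://your-blob-storage.blob.core.windows.net/container?sas-token",
                "index_schemas": [
                    {"index_name": "your-knowledge-index", "data_type": ["metadata", "chunk"]}
                ],
            }
        ]
    }
    print(validate_settings(env, config))
```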

📖 **Detailed configuration instructions**: For the complete list of configuration parameters and examples, see [Deployment.md - Configuration file preparation](Deployment.md#Configuration-file-preparation).

### Run Application

```bash
# Direct execution
python main.py

# Or use predefined tasks
# (In VS Code, use Ctrl+Shift+P -> Run Task)
```

## 📚 Document Navigation

- **[Deployment Guide (Deployment.md)](Deployment.md)** - Complete deployment guide, including Docker and Kubernetes deployments
- **[Configuration instructions](Deployment.md#Configuration-file-preparation)** - Detailed configuration file description

## Project Structure

```
document-extractor/
├── main.py                     # Application entry point
├── app_config.py               # Configuration management
├── business_layer.py           # Business logic layer
├── document_task_processor.py  # Document task processor
├── di_extractor.py             # Document Intelligence extractor
├── azure_index_service.py      # Azure Search service
├── blob_service.py             # Blob storage service
├── chunk_service.py            # Document chunking service
├── hierarchy_fix.py            # Hierarchy structure repair
├── database.py                 # Database models
├── entity_models.py            # Entity models
├── utils.py                    # Utility functions
├── config.yaml                 # Business configuration
├── env.yaml                    # Environment configuration
├── requirements.txt            # Dependencies
├── Dockerfile                  # Docker build file
├── pyproject.toml              # Project configuration
├── build-script/               # Build scripts
│   └── document-ai-indexer.sh
├── deploy/                     # Deployment files
│   ├── document-ai-indexer.sh
│   ├── document-ai-indexer_k8s.yml
│   ├── document-ai-indexer_cronjob.yml
│   └── embedding-api-proxy_k8s.yml
└── doc/                        # Documentation
```

## Core Components

### 1. Document Processing Pipeline

- **Document Loading**: Support loading from Azure Blob Storage or local file system
- **Content Extraction**: OCR and structured extraction using Azure Document Intelligence
- **Content Chunking**: Smart chunking algorithms maintaining semantic integrity
- **Vectorization**: Generate vector representations of document content
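
One common way to keep chunks semantically coherent is to split Markdown at headings first and fall back to paragraph boundaries only for oversized sections. The sketch below illustrates that idea; `chunk_markdown` and `max_chars` are hypothetical, and the algorithm in `chunk_service.py` may differ.

```python
import re

HEADING = re.compile(r"^#{1,6}\s")


def chunk_markdown(text: str, max_chars: int = 1000) -> list:
    """Split Markdown into heading-aligned chunks of roughly `max_chars` or less."""
    # Pass 1: start a new section at every heading line.
    sections, current = [], []
    for line in text.splitlines():
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    # Pass 2: break oversized sections at paragraph boundaries.
    chunks = []
    for section in sections:
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        buffer = ""
        for paragraph in re.split(r"\n\s*\n", section):
            candidate = f"{buffer}\n\n{paragraph}" if buffer else paragraph
            if len(candidate) > max_chars and buffer:
                chunks.append(buffer)
                buffer = paragraph
            else:
                buffer = candidate
        if buffer:
            chunks.append(buffer)
    return chunks


if __name__ == "__main__":
    sample = "# Intro\n\nFirst paragraph.\n\n## Details\n\nSecond paragraph."
    for piece in chunk_markdown(sample, max_chars=200):
        print(repr(piece))
```

A single paragraph longer than `max_chars` is kept whole in this sketch; a production chunker would likely split it further by sentence or token count.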

### 2. Index Management

- **Dynamic Index Creation**: Automatically create Azure AI Search indexes based on configuration
- **Batch Upload**: Efficient batch document upload
- **Metadata Management**: Complete document metadata indexing
- **Incremental Updates**: Support incremental document updates
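
Batch upload and incremental updates can be sketched with two small helpers: one groups documents into fixed-size batches, the other selects only documents whose content changed since the last run. Detecting changes by SHA-256 content hash is an assumption made for illustration; the names `batched`, `content_hash`, and `select_changed` are hypothetical, and the project may track changes differently (for example, by modification time).

```python
import hashlib
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be positive")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def content_hash(data: bytes) -> str:
    """Fingerprint document bytes so unchanged files can be skipped."""
    return hashlib.sha256(data).hexdigest()


def select_changed(documents: dict, known_hashes: dict) -> list:
    """Return names of documents that are new or whose content changed."""
    return [
        name
        for name, data in documents.items()
        if known_hashes.get(name) != content_hash(data)
    ]


if __name__ == "__main__":
    previous = {"a.pdf": content_hash(b"version 1")}
    current = {"a.pdf": b"version 1", "b.pdf": b"new file"}
    to_process = select_changed(current, previous)
    for batch in batched(to_process, size=100):
        print("would upload", batch)
```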

### 3. Data Processing

- **Hierarchy Structure Repair**: Automatically fixes the heading hierarchy in Markdown documents
- **Metadata Extraction**: Extracts structured metadata from documents and filenames
- **Format Conversion**: Handles multiple document formats in a unified way
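
Hierarchy repair can be pictured as renumbering headings so that no level is skipped. The sketch below is one possible approach, assuming ATX-style (`#`) headings; `fix_heading_hierarchy` is a hypothetical function, and the rules in `hierarchy_fix.py` may be different.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def fix_heading_hierarchy(markdown: str) -> str:
    """Renumber headings so each is at most one level deeper than its parent."""
    fixed_lines = []
    open_levels = []  # Original levels of the headings that are still "open".
    in_fence = False
    for line in markdown.splitlines():
        # Lines inside fenced code blocks are never treated as headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if not match:
            fixed_lines.append(line)
            continue
        original_level = len(match.group(1))
        # Close every open heading at the same or a deeper original level.
        while open_levels and open_levels[-1] >= original_level:
            open_levels.pop()
        open_levels.append(original_level)
        # The new level is the nesting depth, capped at Markdown's maximum of 6.
        new_level = min(len(open_levels), 6)
        fixed_lines.append("#" * new_level + " " + match.group(2))
    return "\n".join(fixed_lines)


if __name__ == "__main__":
    broken = "# Title\n### Skipped level\n#### Deeper\n## Back up"
    print(fix_heading_hierarchy(broken))
```

Note that this sketch also promotes the first heading to level 1, so a document that starts at `##` is shifted up as a whole.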


## API and Integration

### Azure Service Integration
- **Azure Document Intelligence**: Document analysis and OCR
- **Azure AI Search**: Search indexing and querying
- **Azure Blob Storage**: Document storage
- **Azure OpenAI**: Vector embedding generation

### Database Support
- PostgreSQL (recommended)
- SQLite (development and testing)
- Other SQLAlchemy-supported databases

## Monitoring and Logging

The system provides comprehensive logging capabilities:
- Processing progress monitoring
- Error logging
- Performance statistics
- Task status tracking

View logs:
```bash
# Kubernetes environment
kubectl logs -f document-ai-indexer -n knowledge-agent

# Docker environment
docker logs -f <container-id>
```


## Development

### Development Mode

```bash
# Activate virtual environment
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\activate     # Windows

# Install development dependencies (quotes keep shells such as zsh from expanding the brackets)
pip install -e ".[dev,test]"

# Run static type checks
mypy .
```


### Log Analysis
```bash
# View error logs
kubectl logs document-ai-indexer -n knowledge-agent | grep ERROR

# View processing progress
kubectl logs document-ai-indexer -n knowledge-agent | grep "Processing"
```

## Version Information

- **Current Version**: 0.20.4
- **Python Version**: 3.12+
- **Main Dependencies**:
  - azure-ai-documentintelligence
  - azure-search-documents
  - SQLAlchemy 2.0.41
  - openai 1.55.3


---

*Last updated: August 2025*