Document AI Indexer - Design Document

Overview

The Document AI Indexer is an intelligent document processing and indexing system built on Azure AI services. It provides comprehensive document extraction, processing, and vectorized indexing capabilities for multiple document formats, enabling advanced search and retrieval functionality.

Design Philosophy

The system is designed with several key principles in mind:

Modularity and Separation of Concerns: The architecture follows a layered approach with clear separation between application logic, business logic, service layer, and data access. This ensures maintainability and allows for easy testing and modification of individual components.

Scalability and Performance: Built with asynchronous processing capabilities and horizontal scaling in mind. The system can handle large volumes of documents through configurable parallel processing and efficient resource utilization.

Resilience and Fault Tolerance: Implements comprehensive error handling, retry mechanisms, and graceful degradation to ensure reliable operation even when external services experience issues.

Configuration-Driven Architecture: Utilizes YAML-based configuration management that allows for flexible deployment across different environments without code changes.

Cloud-Native Design: Leverages Azure services for AI processing, storage, and search capabilities while maintaining vendor independence through abstraction layers.

Features

🚀 Core Features

  • Multi-format Document Support: Handles PDF, DOCX, images (JPEG, PNG, TIFF, etc.), and other document formats
  • Intelligent Content Extraction: Leverages Azure Document Intelligence for OCR and structured data extraction
  • Smart Document Chunking: Implements hierarchy-aware chunking with configurable token limits and overlap
  • Vector Search Integration: Automatic Azure AI Search index creation and document vectorization
  • Metadata Management: Complete extraction and management of document metadata and custom fields
  • Hierarchy Structure Repair: Automatic correction of title hierarchy structure in Markdown documents
  • Figure and Formula Extraction: Advanced extraction of visual elements and mathematical formulas

🔧 Technical Features

  • Asynchronous Processing: High-performance async processing using asyncio and task queues
  • Containerized Deployment: Complete Docker and Kubernetes support with configurable environments
  • Configuration Management: Flexible YAML-based configuration for different deployment scenarios
  • Database Support: SQLAlchemy ORM with support for multiple database backends
  • Resilient Processing: Built-in retry mechanisms, error handling, and fault tolerance
  • Monitoring & Logging: Comprehensive logging, progress monitoring, and processing statistics
  • Scalable Architecture: Horizontal scaling support through containerization and task distribution

System Architecture

The Document AI Indexer follows a multi-layered architecture designed for scalability, maintainability, and robust error handling. The system processes documents through a well-defined pipeline that transforms raw documents into searchable, vectorized content.

Architectural Patterns

Service Factory Pattern: The system uses a centralized ServiceFactory to manage dependencies and service creation. This pattern ensures consistent configuration across all services and enables easy testing through dependency injection.
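
As an illustrative sketch (only the `ServiceFactory` name comes from this document; the registration API and service names are hypothetical), a factory of this kind might look like:

```python
# Hypothetical sketch of a ServiceFactory: services are built lazily from
# a shared configuration dict and cached, so every consumer sees the same
# configured instance. Builders can be swapped for mocks in tests.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ServiceFactory:
    """Creates and caches services from a shared configuration."""
    config: Dict[str, Any]
    _registry: Dict[str, Callable[["ServiceFactory"], Any]] = field(default_factory=dict)
    _instances: Dict[str, Any] = field(default_factory=dict)

    def register(self, name: str, builder: Callable[["ServiceFactory"], Any]) -> None:
        self._registry[name] = builder

    def get(self, name: str) -> Any:
        # One instance per service name; builders receive the factory so
        # they can read shared config or resolve other services.
        if name not in self._instances:
            self._instances[name] = self._registry[name](self)
        return self._instances[name]
```

For testing, a builder can simply be re-registered with a mock before the first `get`, which is the dependency-injection benefit the pattern description refers to.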

Repository Pattern: Data access is abstracted through repository interfaces, allowing for different storage backends and simplified testing with mock implementations.

Command Pattern: Document processing tasks are encapsulated as commands that can be queued, retried, and executed asynchronously.

Pipeline Pattern: The document processing workflow follows a clear pipeline with distinct stages: extraction, hierarchy fixing, chunking, vectorization, and indexing.

High-Level Architecture

The high-level architecture represents a distributed, service-oriented system designed for scalable document processing and intelligent content extraction. The architecture emphasizes separation of concerns, fault tolerance, and cloud-native principles to handle enterprise-scale document processing workloads.

Architectural Overview

Multi-Layered Design: The system is organized into distinct functional layers that separate data ingestion, processing logic, AI services, and storage concerns. This layered approach enables independent scaling, testing, and maintenance of different system components.

Service-Oriented Architecture: Each major functional area is implemented as a distinct service or component group, enabling independent deployment, scaling, and maintenance. Services communicate through well-defined interfaces and can be replaced or upgraded independently.

Cloud-Native Integration: The architecture leverages Azure cloud services for AI processing, storage, and search capabilities while maintaining abstraction layers that enable portability and testing flexibility.

Event-Driven Processing: The system follows an event-driven model where document processing is triggered by events (new documents, configuration changes, etc.) and progresses through a series of processing stages with clear state transitions.

System Components and Responsibilities

Data Sources Layer: Manages document ingestion from various sources including Azure Blob Storage and local file systems. This layer handles authentication, access control, and metadata extraction from source systems. It provides a unified interface for document discovery regardless of the underlying storage mechanism.

Processing Engine Layer: Orchestrates the entire document processing workflow through a hierarchical task management system. The Main Application serves as the central coordinator, while the Task Processor manages work distribution and the Document Task Processor handles individual document processing operations with full state tracking and error recovery.

AI Services Layer: Provides intelligent document processing capabilities through integration with Azure AI services and optional Vision LLM systems. These services handle complex operations like OCR, layout analysis, content extraction, and embedding generation. The modular design allows for easy integration of additional AI services or replacement of existing ones.

Processing Pipeline Layer: Implements the core document transformation logic through a series of processing stages. Each stage has specific responsibilities: content extraction converts raw documents to structured text, hierarchy fixing normalizes document structure, chunking creates manageable content segments, and vector generation produces searchable embeddings.

Storage & Search Layer: Manages persistent data storage and search capabilities through a combination of relational database storage for metadata and state management, Azure AI Search for vector-based content search, and blob storage for processed content and temporary files.

Data Flow and Integration Patterns

Asynchronous Processing Flow: Documents flow through the system asynchronously, enabling high throughput and efficient resource utilization. Each processing stage can operate independently, with clear handoff points and state persistence between stages.

Fault-Tolerant Design: The architecture includes comprehensive error handling and recovery mechanisms at every level. Failed operations are tracked, logged, and can be retried with exponential backoff. The system maintains processing state to enable recovery from failures without losing work.

Scalability Patterns: The architecture supports both vertical and horizontal scaling through stateless processing components, connection pooling, and queue-based work distribution. Different components can be scaled independently based on their specific resource requirements and bottlenecks.

Configuration-Driven Behavior: System behavior is controlled through configuration rather than code, so the system can be adapted to different environments and use cases without modification or redeployment.

graph TB
    subgraph "Data Sources"
        DS[Document Sources<br/>Azure Blob Storage/Local Files]
        META[Metadata<br/>Configuration]
    end
    
    subgraph "Processing Engine"
        MAIN[Main Application<br/>Orchestrator]
        TP[Task Processor<br/>Queue Management]
        DTP[Document Task<br/>Processor]
    end
    
    subgraph "AI Services"
        ADI[Azure Document<br/>Intelligence]
        EMBED[Embedding<br/>Service]
        VLLM[Vision LLM<br/>Optional]
    end
    
    subgraph "Processing Pipeline"
        EXTRACT[Content<br/>Extraction]
        HIERARCHY[Hierarchy<br/>Fix]
        CHUNK[Document<br/>Chunking]
        VECTOR[Vector<br/>Generation]
    end
    
    subgraph "Storage & Search"
        DB[(Database<br/>SQLAlchemy)]
        AAS[Azure AI Search<br/>Index]
        BLOB[Azure Blob<br/>Storage]
    end
    
    DS --> MAIN
    META --> MAIN
    MAIN --> TP
    TP --> DTP
    DTP --> EXTRACT
    
    EXTRACT --> ADI
    EXTRACT --> VLLM
    ADI --> HIERARCHY
    HIERARCHY --> CHUNK
    CHUNK --> VECTOR
    VECTOR --> EMBED
    
    DTP --> DB
    VECTOR --> AAS
    EXTRACT --> BLOB
    
    style DS fill:#e1f5fe
    style ADI fill:#f3e5f5
    style AAS fill:#e8f5e8

Component Architecture

The component architecture illustrates the internal structure and dependencies between different layers of the system. Each layer has specific responsibilities and communicates through well-defined interfaces.

Application Layer: Handles application initialization, configuration loading, and high-level orchestration. The ApplicationContext manages the overall application state and provides access to configuration and services.

Business Layer: Contains the core business logic for document processing. The DocumentProcessingOrchestrator coordinates the entire processing workflow, while the DocumentProcessor handles individual document processing tasks.

Service Layer: Provides abstracted access to external services and resources. The ServiceFactory manages service creation and configuration, ensuring consistent behavior across the application.

Data Layer: Manages data persistence and retrieval through repository patterns and entity models. This layer abstracts database operations and provides a clean interface for data access.

graph LR
    subgraph "Application Layer"
        APP[DocumentProcessingApplication]
        CTX[ApplicationContext]
        CONFIG[ApplicationConfig]
    end
    
    subgraph "Business Layer"
        BL[Business Layer]
        ORCH[DocumentProcessingOrchestrator]
        PROC[DocumentProcessor]
        FACTORY[DocumentProcessingFactory]
    end
    
    subgraph "Service Layer"
        SF[ServiceFactory]
        DI[DocumentIntelligenceService]
        CHUNK[ChunkService]
        INDEX[AzureIndexService]
        BLOB[BlobService]
    end
    
    subgraph "Data Layer"
        DB[DatabaseInterface]
        REPO[DocumentRepository]
        MODELS[Entity Models]
    end
    
    APP --> BL
    CTX --> CONFIG
    APP --> CTX
    
    BL --> SF
    ORCH --> PROC
    FACTORY --> ORCH
    
    SF --> DI
    SF --> CHUNK
    SF --> INDEX
    SF --> BLOB
    
    PROC --> DB
    DB --> REPO
    REPO --> MODELS
    
    style APP fill:#bbdefb
    style BL fill:#c8e6c9
    style SF fill:#ffecb3
    style DB fill:#f8bbd9

Workflow

The document processing workflow is designed to handle large-scale document processing with fault tolerance and efficient resource utilization. The system processes documents asynchronously through a task-based architecture.

Processing Strategy

Asynchronous Task Processing: Documents are processed as individual tasks that can be executed in parallel. This approach maximizes throughput and allows for efficient resource utilization across multiple processing nodes.
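
A minimal sketch of this pattern with asyncio, using a semaphore to bound parallelism (`process_document` and the document list are placeholders, not the project's actual API):

```python
# Bounded parallel task processing: each document is an independent task,
# and a semaphore caps how many run concurrently.
import asyncio


async def process_document(doc: str) -> str:
    await asyncio.sleep(0)  # stand-in for extraction/chunking/indexing I/O
    return f"indexed:{doc}"


async def process_all(docs, max_parallel: int = 4):
    sem = asyncio.Semaphore(max_parallel)  # cap concurrent tasks

    async def bounded(doc):
        async with sem:
            return await process_document(doc)

    # gather runs the tasks concurrently and preserves input order.
    return await asyncio.gather(*(bounded(d) for d in docs))
```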

Stateful Processing: Each document's processing state is tracked in the database, enabling recovery from failures and preventing duplicate processing. The system maintains detailed status information and processing history.

Batch Operations: Where possible, operations are batched to improve efficiency. This is particularly important for operations like embedding generation and search index uploads.

Retry Logic: Failed operations are automatically retried with exponential backoff. The system distinguishes between transient failures (which should be retried) and permanent failures (which should be logged and skipped).
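
The retry policy above can be sketched as follows (the exception class names are illustrative; the real code presumably maps specific service errors onto these two categories):

```python
# Transient errors are retried with exponential backoff; permanent errors
# propagate immediately so they can be logged and skipped.
import time


class TransientError(Exception):
    pass


class PermanentError(Exception):
    pass


def with_retry(op, max_retries=3, base_delay=0.01):
    for attempt in range(max_retries + 1):
        try:
            return op()
        except TransientError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
        # PermanentError is not caught here, so it raises on first failure.
```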

Document Processing Workflow

sequenceDiagram
    participant USER as User/Scheduler
    participant MAIN as Main App
    participant TP as Task Processor
    participant DTP as Document Task Processor
    participant ORCH as Orchestrator
    participant ADI as Azure DI
    participant CHUNK as Chunk Service
    participant INDEX as Index Service
    participant DB as Database
    
    USER->>MAIN: Start Processing
    MAIN->>MAIN: Initialize Configuration
    MAIN->>DB: Initialize Database
    MAIN->>TP: Create Task Processor
    
    loop For Each Document
        MAIN->>TP: Submit Document Task
        TP->>DTP: Process Task
        DTP->>DB: Create/Update IndexObject
        DTP->>ORCH: Execute Processing
        
        ORCH->>ADI: Extract Document Content
        ADI-->>ORCH: Return Extracted Content
        
        ORCH->>ORCH: Fix Hierarchy
        ORCH->>CHUNK: Chunk Document
        CHUNK-->>ORCH: Return Chunks
        
        ORCH->>INDEX: Generate Embeddings
        INDEX-->>ORCH: Return Vectors
        
        ORCH->>INDEX: Upload to Search Index
        INDEX-->>ORCH: Confirm Upload
        
        ORCH-->>DTP: Return Processing Result
        DTP->>DB: Update IndexObject Status
        DTP-->>TP: Return Result
    end
    
    TP-->>MAIN: Processing Complete
    MAIN-->>USER: Return Statistics

Data Flow Architecture

The data flow architecture represents the end-to-end processing pipeline from document ingestion to search index publication. This design emphasizes fault tolerance, scalability, and efficient resource utilization throughout the processing lifecycle.

Design Principles for Data Flow

Pipeline-Based Processing: The data flow follows a clear pipeline pattern where each stage has specific responsibilities and well-defined inputs and outputs. This design enables parallel processing, easier debugging, and modular testing of individual stages.

Decision Points and Routing: The architecture includes intelligent decision points that route documents through appropriate processing paths based on their characteristics. This ensures optimal processing strategies for different document types while maintaining a unified interface.

State Management: Processing state is carefully managed throughout the pipeline, with persistent state stored in the database and transient state maintained in memory. This approach enables recovery from failures at any point in the pipeline.

Resource Optimization: The flow is designed to minimize resource usage through efficient batching, connection reuse, and memory management. Processing stages are optimized to balance throughput with resource consumption.

Processing Flow Stages

Initialization Phase: The system performs comprehensive initialization including configuration validation, database connectivity checks, and service authentication. This phase ensures that all dependencies are available before processing begins.

Discovery and Task Creation: Document sources are scanned to identify new or modified documents that require processing. Tasks are created based on configured criteria such as file modification dates and processing history.

Format Detection and Routing: Documents are analyzed to determine their format and complexity, enabling the system to select the most appropriate extraction method. This intelligent routing ensures optimal processing quality and efficiency.

Content Extraction: Multiple extraction paths are available depending on document characteristics. The system can leverage Azure Document Intelligence for complex documents, Vision LLM for advanced image analysis, or direct processing for simple text documents.

Content Enhancement: Extracted content undergoes enhancement through hierarchy fixing and structure normalization. This stage ensures that the processed content maintains logical structure and is suitable for effective chunking.

Vectorization and Indexing: The final stages convert processed content into searchable vectors and upload them to the search index. These operations are batched for efficiency and include comprehensive error handling and retry logic.

flowchart TD
    START([Start Processing]) --> INIT[Initialize Application]
    INIT --> LOAD_CONFIG[Load Configuration]
    LOAD_CONFIG --> INIT_DB[Initialize Database]
    INIT_DB --> SCAN_DOCS[Scan Document Sources]
    
    SCAN_DOCS --> CREATE_TASKS[Create Processing Tasks]
    CREATE_TASKS --> PROCESS_TASK{Process Each Task}
    
    PROCESS_TASK --> EXTRACT[Extract Content]
    EXTRACT --> CHECK_FORMAT{Check Document Format}
    
    CHECK_FORMAT -->|PDF/Images| USE_DI[Use Azure Document Intelligence]
    CHECK_FORMAT -->|Vision Mode| USE_VLLM[Use Vision LLM]
    CHECK_FORMAT -->|Text| DIRECT_PROCESS[Direct Processing]
    
    USE_DI --> EXTRACT_RESULT[Content + Metadata]
    USE_VLLM --> EXTRACT_RESULT
    DIRECT_PROCESS --> EXTRACT_RESULT
    
    EXTRACT_RESULT --> FIX_HIERARCHY[Fix Document Hierarchy]
    FIX_HIERARCHY --> CHUNK_DOC[Chunk Document]
    CHUNK_DOC --> GENERATE_VECTORS[Generate Embeddings]
    GENERATE_VECTORS --> UPLOAD_INDEX[Upload to Search Index]
    
    UPLOAD_INDEX --> UPDATE_DB[Update Database Status]
    UPDATE_DB --> MORE_TASKS{More Tasks?}
    
    MORE_TASKS -->|Yes| PROCESS_TASK
    MORE_TASKS -->|No| COMPLETE[Processing Complete]
    
    COMPLETE --> STATS[Generate Statistics]
    STATS --> END([End])
    
    style START fill:#c8e6c9
    style END fill:#ffcdd2
    style EXTRACT fill:#fff3e0
    style GENERATE_VECTORS fill:#e1f5fe
    style UPLOAD_INDEX fill:#f3e5f5

Functional Logic

The functional logic of the Document AI Indexer encompasses three main processing areas: document extraction, content chunking, and search indexing. Each area implements sophisticated algorithms to ensure high-quality output.

Design Principles for Document Processing

Format-Agnostic Processing: The system handles multiple document formats through a unified interface. Different extractors are used based on document type, but all produce a standardized Document object.

Intelligent Content Analysis: Before processing, the system analyzes document structure to determine the optimal processing strategy. This includes detecting header hierarchies, identifying figures and tables, and understanding document layout.

Quality Assurance: Each processing stage includes validation and quality checks. For example, the hierarchy fixer validates that document structure is logical and coherent before proceeding to chunking.

Metadata Preservation: Throughout the processing pipeline, important metadata is preserved and enriched. This includes document properties, processing timestamps, and structural information.

Document Extraction Logic

The document extraction logic is the foundation of the processing pipeline. It handles the complex task of converting various document formats into structured, searchable content while preserving important layout and formatting information.

Multi-Modal Processing: The system supports both traditional OCR-based extraction and advanced vision-language model processing. The choice of extraction method depends on document complexity and available resources.

Feature Detection: Azure Document Intelligence features are selectively enabled based on document characteristics and configuration. This includes high-resolution OCR for detailed documents, formula extraction for technical content, and figure extraction for visual elements.

Content Structure Preservation: The extraction process maintains document structure through markdown formatting, preserving headers, lists, tables, and other formatting elements that provide context for the content.

Error Handling and Fallbacks: If advanced extraction features fail, the system falls back to basic extraction methods to ensure that content is not lost due to processing errors.

flowchart TD
    DOC[Document Input] --> DETECT[Detect Format]
    
    DETECT --> PDF{PDF?}
    DETECT --> IMG{Image?}
    DETECT --> OFFICE{Office Doc?}
    DETECT --> TEXT{Text File?}
    
    PDF -->|Yes| DI_PDF[Azure DI Layout Model]
    IMG -->|Yes| RESIZE[Resize if Needed]
    OFFICE -->|Yes| CONVERT[Convert to Supported Format]
    TEXT -->|Yes| DIRECT[Direct Content Read]
    
    RESIZE --> DI_IMG[Azure DI OCR + Layout]
    CONVERT --> DI_OFFICE[Azure DI Document Analysis]
    
    DI_PDF --> FEATURES[Apply DI Features]
    DI_IMG --> FEATURES
    DI_OFFICE --> FEATURES
    
    FEATURES --> HIGH_RES{High Resolution OCR?}
    FEATURES --> FORMULAS{Extract Formulas?}
    FEATURES --> FIGURES{Extract Figures?}
    
    HIGH_RES -->|Yes| ENABLE_HIRES[Enable High-Res OCR]
    FORMULAS -->|Yes| ENABLE_FORMULAS[Enable Formula Extraction]
    FIGURES -->|Yes| ENABLE_FIGURES[Enable Figure Extraction]
    
    ENABLE_HIRES --> PROCESS_DI[Process with Azure DI]
    ENABLE_FORMULAS --> PROCESS_DI
    ENABLE_FIGURES --> PROCESS_DI
    HIGH_RES -->|No| PROCESS_DI
    FORMULAS -->|No| PROCESS_DI
    FIGURES -->|No| PROCESS_DI
    
    DIRECT --> EXTRACT_META[Extract Metadata]
    PROCESS_DI --> EXTRACT_CONTENT[Extract Content + Structure]
    
    EXTRACT_CONTENT --> EXTRACT_META
    EXTRACT_META --> RESULT[Document Object]
    
    style DOC fill:#e3f2fd
    style RESULT fill:#c8e6c9
    style PROCESS_DI fill:#fff3e0

Chunking Strategy

The chunking strategy is critical for creating meaningful, searchable segments from large documents. The system implements intelligent chunking that respects document structure while maintaining optimal chunk sizes for search and retrieval.

Hierarchy-Aware Chunking: The system analyzes document structure and uses markdown headers to create logical chunks. This ensures that related content stays together and that chunks maintain contextual coherence.

Adaptive Chunking: Chunk boundaries are determined by both content structure and token limits. The system balances the need for complete thoughts with search engine constraints.

Overlap Strategy: Configurable token overlap between chunks ensures that important information at chunk boundaries is not lost during retrieval operations.

Token Management: Precise token counting using tiktoken ensures that chunks stay within specified limits while maximizing content density.
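
A minimal sketch of hierarchy-aware chunking under these constraints. A whitespace word count stands in for tiktoken's token counting, and the defaults mirror the limits cited here (2048-token chunks, 128-token overlap); the actual splitter implementation may differ:

```python
# Split on markdown headers first so sections stay together; fall back to
# a sliding window with overlap only when a section exceeds the limit.
import re


def split_by_headers(markdown: str):
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6} ", line) and current:
            chunks.append("\n".join(current))  # start a new chunk at each header
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


def chunk_document(markdown: str, max_tokens: int = 2048, overlap: int = 128):
    result = []
    for section in split_by_headers(markdown):
        words = section.split()  # stand-in for tiktoken-based token counting
        if len(words) <= max_tokens:
            result.append(section)
            continue
        # Oversized section: sliding window, overlapping by `overlap` tokens.
        step = max_tokens - overlap
        for start in range(0, len(words), step):
            result.append(" ".join(words[start:start + max_tokens]))
            if start + max_tokens >= len(words):
                break
    return result
```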

flowchart TD
    CONTENT[Extracted Content] --> HIERARCHY_FIX{Apply Hierarchy Fix?}
    
    HIERARCHY_FIX -->|Yes| FIX[Fix Header Hierarchy]
    HIERARCHY_FIX -->|No| CHUNK_STRATEGY[Determine Chunking Strategy]
    
    FIX --> ANALYZE[Analyze Document Structure]
    ANALYZE --> CHUNK_STRATEGY
    
    CHUNK_STRATEGY --> MARKDOWN{Markdown Headers?}
    CHUNK_STRATEGY --> RECURSIVE{Use Recursive Split?}
    
    MARKDOWN -->|Yes| HEADER_SPLIT[Markdown Header Splitter]
    MARKDOWN -->|No| RECURSIVE
    RECURSIVE -->|Yes| CHAR_SPLIT[Recursive Character Splitter]
    
    HEADER_SPLIT --> CONFIG[Apply Chunk Configuration]
    CHAR_SPLIT --> CONFIG
    
    CONFIG --> SIZE[Chunk Size: 2048 tokens]
    CONFIG --> OVERLAP[Token Overlap: 128]
    
    SIZE --> SPLIT[Split Document]
    OVERLAP --> SPLIT
    
    SPLIT --> VALIDATE[Validate Chunk Sizes]
    VALIDATE --> METADATA[Add Chunk Metadata]
    
    METADATA --> RESULT[Chunked Documents]
    
    style CONTENT fill:#e3f2fd
    style RESULT fill:#c8e6c9
    style FIX fill:#fff3e0
    style SPLIT fill:#f3e5f5

Indexing and Search Integration

The indexing and search integration component handles the final stage of the processing pipeline, converting processed documents into searchable vector representations and uploading them to Azure AI Search.

Vector Generation: The system generates high-quality embeddings using Azure OpenAI services. Multiple vector fields can be configured to support different search scenarios (content-based, metadata-based, etc.).

Batch Processing: Documents are processed in configurable batches to optimize upload performance and manage API rate limits effectively.
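
The batching itself reduces to a small helper like the following (a sketch; the default of 50 matches the batch size cited in this document, and the upload call around it would be the Azure AI Search client):

```python
# Group documents into fixed-size batches for upload; the final batch
# may be smaller than the configured size.
from typing import Iterable, Iterator, List


def batched(items: Iterable, size: int = 50) -> Iterator[List]:
    """Yield lists of up to `size` items, preserving order."""
    batch: List = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # remainder
```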

Schema Management: The system automatically creates and manages search index schemas based on configuration files, ensuring that all required fields and vector configurations are properly set up.

Error Recovery: Failed uploads are tracked and retried, with detailed logging to help diagnose and resolve issues. The system can recover from partial batch failures without losing processed content.

flowchart TD
    CHUNKS[Document Chunks] --> EMBED[Generate Embeddings]
    
    EMBED --> OPENAI[Azure OpenAI API]
    OPENAI --> VECTORS[Vector Embeddings]
    
    VECTORS --> PREPARE[Prepare Index Documents]
    PREPARE --> METADATA[Add Metadata Fields]
    
    METADATA --> CUSTOM[Add Custom Fields]
    CUSTOM --> BATCH[Create Upload Batches]
    
    BATCH --> SIZE[Batch Size: 50 docs]
    SIZE --> UPLOAD[Upload to Azure AI Search]
    
    UPLOAD --> SUCCESS{Upload Successful?}
    SUCCESS -->|Yes| UPDATE_STATUS[Update Success Status]
    SUCCESS -->|No| RETRY[Retry Upload]
    
    RETRY --> MAX_RETRIES{Max Retries Reached?}
    MAX_RETRIES -->|No| UPLOAD
    MAX_RETRIES -->|Yes| ERROR[Mark as Failed]
    
    UPDATE_STATUS --> NEXT_BATCH{More Batches?}
    NEXT_BATCH -->|Yes| BATCH
    NEXT_BATCH -->|No| COMPLETE[Index Complete]
    
    ERROR --> LOG[Log Error Details]
    LOG --> COMPLETE
    
    style CHUNKS fill:#e3f2fd
    style COMPLETE fill:#c8e6c9
    style EMBED fill:#fff3e0
    style UPLOAD fill:#f3e5f5
    style ERROR fill:#ffcdd2

Database Schema

The database schema is designed to support scalable document processing operations while maintaining data integrity and enabling efficient querying. The schema tracks processing state, manages job coordination, and provides audit trails.

Design Rationale

Composite Primary Keys: The IndexObject table uses composite primary keys (object_key, datasource_name) to support multi-tenant scenarios where the same document might exist in different data sources.
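
As a sketch, the IndexObject entity with its composite key might be declared with SQLAlchemy's declarative API like this (column list abridged from the ER diagram below; the actual model definitions and column types may differ):

```python
# Composite primary key: the same object_key can exist under different
# data sources without colliding.
from sqlalchemy import Column, DateTime, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class IndexObject(Base):
    __tablename__ = "index_object"

    # Both columns marked primary_key=True form the composite key.
    object_key = Column(String(512), primary_key=True)
    datasource_name = Column(String(128), primary_key=True)

    type = Column(String(64))
    status = Column(String(32))
    created_time = Column(DateTime)
    updated_time = Column(DateTime)
    try_count = Column(Integer, default=0)
    last_run_id = Column(Integer)
    error_message = Column(Text)
```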

State Tracking: Detailed status tracking allows the system to resume processing after failures and provides visibility into processing progress and issues.

Audit Trail: Comprehensive timestamp tracking and detailed message logging provide full audit trails for compliance and debugging purposes.

Job Coordination: The IndexJob table enables coordination of processing jobs across multiple instances and provides reporting on job completion and success rates.

Core Entities

erDiagram
    IndexObject {
        string object_key PK
        string datasource_name PK
        string type
        string status
        datetime created_time
        datetime updated_time
        datetime last_start_time
        datetime last_finished_time
        int try_count
        int last_run_id
        text detailed_message
        text error_message
        text last_message
    }
    
    IndexJob {
        int id PK
        string datasource_name
        string status
        datetime start_time
        datetime end_time
        int total_files
        int processed_files
        int failed_files
        int skipped_files
        text config_snapshot
        text error_message
    }
    
    IndexObject }o--|| IndexJob : belongs_to

Configuration Management

The configuration management system is designed to support flexible deployment across different environments while maintaining security and ease of management. The system separates business configuration from sensitive credentials and provides environment-specific overrides.

Configuration Strategy

Separation of Concerns: Business logic configuration (data sources, processing parameters) is separated from sensitive credentials (API keys, connection strings) to enable secure deployment practices.

Environment-Specific Configuration: The system supports multiple configuration files that can be combined to create environment-specific deployments without duplicating common settings.

Validation and Defaults: Configuration values are validated at startup, and sensible defaults are provided to minimize required configuration while ensuring the system operates correctly.

Dynamic Reconfiguration: Many configuration parameters can be modified without requiring application restarts, enabling operational flexibility and optimization.
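
A stdlib-only sketch of this layering, with illustrative key names (the real system loads YAML files, and the validation rules here are assumptions): defaults are merged with a base configuration and an environment-specific override, then validated at startup.

```python
# Layered configuration: defaults < base config < environment override.
# Validation fails fast at startup rather than mid-pipeline.
def deep_merge(base: dict, override: dict) -> dict:
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)  # recurse into sections
        else:
            merged[key] = value
    return merged


DEFAULTS = {"processing": {"chunk_size": 2048, "token_overlap": 128}}


def load_config(base: dict, env_override: dict) -> dict:
    config = deep_merge(DEFAULTS, deep_merge(base, env_override))
    # Example validation: overlap must be smaller than the chunk size.
    assert config["processing"]["chunk_size"] > config["processing"]["token_overlap"]
    return config
```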

Configuration Structure

mindmap
  root((Configuration))
    Data Sources
      Blob Storage
        SAS Tokens
        Container Paths
      Local Files
        Directory Paths
        File Filters
    Processing
      Chunk Size
      Token Overlap
      Batch Sizes
      Retry Limits
    AI Services
      Azure Document Intelligence
        Endpoint
        API Key
        Features
      Azure OpenAI
        Endpoint
        API Key
        Model Settings
    Database
      Connection String
      Connection Pool
    Index Schemas
      Field Mappings
      Vector Configurations
      Search Index Settings

Deployment Architecture

The deployment architecture is designed for cloud-native operations with support for both batch processing and continuous operation modes. The system leverages Kubernetes for orchestration and scaling while maintaining compatibility with various deployment scenarios.

Cloud-Native Design Principles

Containerization: The application is fully containerized, enabling consistent deployment across different environments and easy scaling based on demand.

Stateless Processing: Processing pods are designed to be stateless, with all persistent state managed through external databases and storage services. This enables horizontal scaling and fault tolerance.

Configuration Externalization: All configuration is externalized through ConfigMaps and Secrets, allowing for environment-specific configuration without rebuilding container images.

Resource Management: The deployment configuration includes resource limits and requests to ensure proper resource allocation and prevent resource contention in multi-tenant environments.

Scaling Strategy

Horizontal Pod Autoscaling: The system can automatically scale the number of processing pods based on CPU utilization, memory usage, or custom metrics like queue depth.

Job-Based Processing: For batch operations, the system uses Kubernetes Jobs and CronJobs to ensure processing completion and automatic cleanup of completed jobs.

Load Distribution: Multiple pods process documents in parallel, with work distribution managed through the database-backed task queue system.
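
A hypothetical CronJob manifest illustrating this job-based model (the name, image, schedule, and ConfigMap/Secret names are placeholders, not the project's actual manifests):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: document-ai-indexer
  namespace: document-ai
spec:
  schedule: "0 2 * * *"          # nightly batch run
  concurrencyPolicy: Forbid      # avoid overlapping runs
  jobTemplate:
    spec:
      backoffLimit: 3                 # Kubernetes-level retry on pod failure
      ttlSecondsAfterFinished: 3600   # automatic cleanup of completed jobs
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: indexer
              image: document-ai-indexer:latest
              envFrom:
                - secretRef:
                    name: env-secrets      # credentials (env.yaml equivalent)
              volumeMounts:
                - name: config
                  mountPath: /app/config   # business configuration
          volumes:
            - name: config
              configMap:
                name: indexer-config
```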

### Kubernetes Deployment

```mermaid
graph TB
    subgraph "Kubernetes Cluster"
        subgraph "Namespace: document-ai"
            POD1[Document Processor Pod 1]
            POD2[Document Processor Pod 2]
            POD3[Document Processor Pod N]
            
            CM[ConfigMap<br/>config.yaml]
            SECRET[Secret<br/>env.yaml]
            
            PVC[PersistentVolumeClaim<br/>Temp Storage]
        end
        
        subgraph "Services"
            SVC[LoadBalancer Service]
            CRON[CronJob Controller]
        end
    end
    
    subgraph "External Services"
        AZURE_DI[Azure Document Intelligence]
        AZURE_OPENAI[Azure OpenAI]
        AZURE_SEARCH[Azure AI Search]
        AZURE_STORAGE[Azure Blob Storage]
        DATABASE[(Database)]
    end
    
    CM --> POD1
    CM --> POD2
    CM --> POD3
    
    SECRET --> POD1
    SECRET --> POD2
    SECRET --> POD3
    
    PVC --> POD1
    PVC --> POD2
    PVC --> POD3
    
    SVC --> POD1
    SVC --> POD2
    SVC --> POD3
    
    CRON --> POD1
    
    POD1 --> AZURE_DI
    POD1 --> AZURE_OPENAI
    POD1 --> AZURE_SEARCH
    POD1 --> AZURE_STORAGE
    POD1 --> DATABASE
    
    POD2 --> AZURE_DI
    POD2 --> AZURE_OPENAI
    POD2 --> AZURE_SEARCH
    POD2 --> AZURE_STORAGE
    POD2 --> DATABASE
    
    POD3 --> AZURE_DI
    POD3 --> AZURE_OPENAI
    POD3 --> AZURE_SEARCH
    POD3 --> AZURE_STORAGE
    POD3 --> DATABASE
    
    style POD1 fill:#e1f5fe
    style POD2 fill:#e1f5fe
    style POD3 fill:#e1f5fe
    style CM fill:#fff3e0
    style SECRET fill:#ffebee
```

## Performance and Scalability

The system is designed to handle large-scale document processing operations efficiently while maintaining high-quality output. Performance optimization occurs at multiple levels: application design, resource utilization, and operational practices.

### Performance Optimization Strategies

Asynchronous Processing: All I/O-bound operations are implemented asynchronously to maximize throughput and resource utilization. This is particularly important for operations involving external API calls and database operations.
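A minimal sketch of this pattern using Python's asyncio, with a semaphore capping the number of in-flight requests (the sleep stands in for a real call such as a Document Intelligence request):

```python
import asyncio

async def process_document(doc_id, semaphore):
    """Placeholder for an I/O-bound processing step."""
    async with semaphore:
        await asyncio.sleep(0.01)  # simulated network latency
        return f"{doc_id}:indexed"

async def process_batch(doc_ids, max_concurrency=4):
    # The semaphore caps concurrent requests so throughput scales
    # without overwhelming downstream services.
    semaphore = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(
        *(process_document(d, semaphore) for d in doc_ids)
    )

results = asyncio.run(process_batch([f"doc-{i}" for i in range(8)]))
```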

Connection Pooling: Database and HTTP connections are pooled and reused to minimize connection overhead and improve response times.

Caching Strategies: Frequently accessed configuration data and metadata are cached in memory to reduce database load and improve response times.
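One way to sketch such a cache is a small TTL wrapper around a loader function; this is a simplified illustration, not the system's actual cache implementation:

```python
import time

class TTLCache:
    """Minimal in-memory cache with per-entry time-to-live."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, insertion_time)

    def get(self, key, loader):
        """Return a cached value, invoking loader only on miss or expiry."""
        entry = self._store.get(key)
        now = time.monotonic()
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]
        value = loader()
        self._store[key] = (value, now)
        return value

loads = []
def load_config():
    # Stands in for a database read of configuration data.
    loads.append(1)
    return {"chunk_size": 512}

cache = TTLCache(ttl_seconds=60)
first = cache.get("config", load_config)
second = cache.get("config", load_config)  # served from cache
```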

Batch Operations: Operations that can be batched (such as database writes and API calls) are grouped together to reduce overhead and improve efficiency.
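The idea can be sketched as a batching helper wrapped around a stubbed embedding call, so one request covers many chunks instead of one request per chunk; the function names are illustrative:

```python
def batched(items, batch_size):
    """Yield successive fixed-size slices of a list."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_all(chunks, embed_fn, batch_size=16):
    """Embed chunks in batches, making one API call per batch."""
    vectors = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

# Stub standing in for a real embeddings endpoint; records call sizes.
calls = []
def fake_embed(batch):
    calls.append(len(batch))
    return [[0.0] * 3 for _ in batch]

vectors = embed_all([f"chunk-{i}" for i in range(40)], fake_embed)
```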

### Scalability Considerations

Horizontal Scaling: The stateless design of processing components enables horizontal scaling by adding more processing instances without architectural changes.

Database Optimization: Database operations are optimized through proper indexing, connection pooling, and efficient query patterns to support high-concurrency operations.

Rate Limiting and Throttling: The system implements rate limiting and throttling mechanisms to respect external service limits while maintaining optimal throughput.
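A token bucket is one common way to implement such throttling; the sketch below uses illustrative rates rather than any real service quota:

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter (illustrative parameters)."""

    def __init__(self, rate_per_sec, capacity):
        self.rate = rate_per_sec      # tokens regenerated per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self):
        """Consume one token if available; return False when throttled."""
        now = time.monotonic()
        self.tokens = min(
            self.capacity, self.tokens + (now - self.last) * self.rate
        )
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_sec=0.1, capacity=2)
a = bucket.try_acquire()
b = bucket.try_acquire()
c = bucket.try_acquire()  # bucket exhausted, request is throttled
```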

Resource Monitoring: Comprehensive monitoring of resource utilization enables proactive scaling decisions and performance optimization.

### Processing Pipeline Performance

```mermaid
graph LR
    subgraph "Performance Metrics"
        TPS[Throughput<br/>Documents/Second]
        LAT[Latency<br/>Processing Time]
        ERR[Error Rate<br/>Failed Documents]
        RES[Resource Usage<br/>CPU/Memory]
    end
    
    subgraph "Optimization Strategies"
        ASYNC[Async Processing]
        BATCH[Batch Operations]
        CACHE[Caching Layer]
        RETRY[Retry Logic]
    end
    
    subgraph "Scaling Options"
        HSCALE[Horizontal Scaling<br/>More Pods]
        VSCALE[Vertical Scaling<br/>Larger Pods]
        QUEUE[Queue Management<br/>Task Distribution]
    end
    
    TPS --> ASYNC
    LAT --> BATCH
    ERR --> RETRY
    RES --> CACHE
    
    ASYNC --> HSCALE
    BATCH --> QUEUE
    CACHE --> VSCALE
    
    style TPS fill:#c8e6c9
    style LAT fill:#fff3e0
    style ERR fill:#ffcdd2
    style RES fill:#e1f5fe
```

## Error Handling and Monitoring

The error handling and monitoring system is designed to provide comprehensive visibility into system operations while implementing robust recovery mechanisms. The system distinguishes between different types of errors and responds appropriately to each.

### Error Classification and Response

Transient Errors: Network timeouts, temporary service unavailability, and rate limiting are handled through exponential backoff retry mechanisms. These errors are expected in distributed systems and are handled automatically.
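Sketched in Python, an exponential-backoff retry with jitter might look like this; the error type and delay values are illustrative:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a timeout or an HTTP 429/503 from an external service."""

def retry_with_backoff(operation, max_attempts=5, base_delay=0.01):
    """Retry an operation with exponential backoff plus random jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # exhausted retries; caller marks the task failed
            # Delay doubles each attempt; jitter avoids thundering herds.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)

attempts = []
def flaky_call():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("service busy")
    return "ok"

result = retry_with_backoff(flaky_call)
```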

Configuration Errors: Invalid configuration values, missing credentials, and similar issues are detected at startup and cause immediate failure with clear error messages to facilitate quick resolution.

Resource Errors: Insufficient disk space, memory exhaustion, and similar resource constraints are detected and handled gracefully, often by pausing processing until resources become available.

Service Errors: Failures in external services (Azure Document Intelligence, Azure OpenAI, etc.) are handled through fallback mechanisms where possible, or graceful degradation when fallbacks are not available.

### Monitoring and Observability

Structured Logging: All log messages follow a structured format that enables efficient searching and analysis. Log levels are used appropriately to balance information content with log volume.
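As an illustration of the structured format, a minimal JSON log formatter might look like this; the field names are hypothetical, not the system's actual schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line for machine-searchable logs."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional per-record context (e.g. document id) merged in.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)

record = logging.LogRecord(
    "indexer", logging.INFO, "indexer.py", 0, "processed %s", ("doc-1",), None
)
record.context = {"document_id": "doc-1", "stage": "chunking"}
line = JsonFormatter().format(record)
```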

Processing Metrics: Key performance indicators such as processing rates, error rates, and resource utilization are tracked and can be exported to monitoring systems.

Health Checks: The system implements health check endpoints that can be used by orchestration systems to determine system health and restart unhealthy instances.
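A health endpoint typically aggregates per-dependency probes into a single verdict; the sketch below is illustrative, and the probe names are hypothetical:

```python
def health_check(probes):
    """Run each dependency probe and aggregate into one health report.

    A real deployment would probe the database, blob storage, and the
    search service, and expose this via an HTTP endpoint for Kubernetes
    liveness/readiness checks.
    """
    checks = {}
    for name, probe in probes.items():
        try:
            checks[name] = bool(probe())
        except Exception:
            checks[name] = False  # a raising probe counts as unhealthy
    status = "healthy" if all(checks.values()) else "unhealthy"
    return {"status": status, "checks": checks}

def failing_probe():
    raise RuntimeError("search service unreachable")

report = health_check({
    "database": lambda: True,
    "blob_storage": lambda: True,
    "search_index": failing_probe,
})
```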

Audit Trails: Complete audit trails of document processing operations are maintained for compliance and debugging purposes.

### Error Handling Strategy

```mermaid
flowchart TD
    ERROR[Error Detected] --> CLASSIFY[Classify Error Type]
    
    CLASSIFY --> TRANSIENT{Transient Error?}
    CLASSIFY --> CONFIG{Configuration Error?}
    CLASSIFY --> RESOURCE{Resource Error?}
    CLASSIFY --> SERVICE{Service Error?}
    
    TRANSIENT -->|Yes| RETRY[Retry with Backoff]
    CONFIG -->|Yes| LOG_FATAL[Log Fatal Error]
    RESOURCE -->|Yes| WAIT[Wait for Resources]
    SERVICE -->|Yes| CHECK_SERVICE[Check Service Status]
    
    RETRY --> MAX_RETRY{Max Retries?}
    MAX_RETRY -->|No| ATTEMPT[Retry Attempt]
    MAX_RETRY -->|Yes| MARK_FAILED[Mark as Failed]
    
    ATTEMPT --> SUCCESS{Success?}
    SUCCESS -->|Yes| UPDATE_SUCCESS[Update Success]
    SUCCESS -->|No| RETRY
    
    WAIT --> RESOURCE_CHECK{Resources Available?}
    RESOURCE_CHECK -->|Yes| RETRY
    RESOURCE_CHECK -->|No| WAIT
    
    CHECK_SERVICE --> SERVICE_OK{Service OK?}
    SERVICE_OK -->|Yes| RETRY
    SERVICE_OK -->|No| ESCALATE[Escalate Error]
    
    LOG_FATAL --> STOP[Stop Processing]
    MARK_FAILED --> LOG_ERROR[Log Detailed Error]
    ESCALATE --> LOG_ERROR
    
    UPDATE_SUCCESS --> CONTINUE[Continue Processing]
    LOG_ERROR --> CONTINUE
    
    style ERROR fill:#ffcdd2
    style UPDATE_SUCCESS fill:#c8e6c9
    style CONTINUE fill:#e8f5e8
```

## Conclusion

The Document AI Indexer provides a comprehensive, scalable solution for intelligent document processing and indexing. Its modular architecture, robust error handling, and integration with Azure AI services make it suitable for enterprise-scale document processing workflows. The system's flexibility allows for easy customization and extension to meet specific business requirements while maintaining high performance and reliability.