# Document AI Indexer - Design Document

## Overview

The Document AI Indexer is an intelligent document processing and indexing system built on Azure AI services. It provides comprehensive document extraction, processing, and vectorized indexing capabilities for multiple document formats, enabling advanced search and retrieval functionality.

### Design Philosophy

The system is designed with several key principles in mind:

**Modularity and Separation of Concerns**: The architecture follows a layered approach with clear separation between the application, business, service, and data access layers. This ensures maintainability and allows for easy testing and modification of individual components.

**Scalability and Performance**: Built with asynchronous processing capabilities and horizontal scaling in mind. The system can handle large volumes of documents through configurable parallel processing and efficient resource utilization.

**Resilience and Fault Tolerance**: Implements comprehensive error handling, retry mechanisms, and graceful degradation to ensure reliable operation even when external services experience issues.

**Configuration-Driven Architecture**: Utilizes YAML-based configuration management that allows for flexible deployment across different environments without code changes.

**Cloud-Native Design**: Leverages Azure services for AI processing, storage, and search capabilities while maintaining vendor independence through abstraction layers.

## Features

### 🚀 Core Features

- **Multi-format Document Support**: Handles PDF, DOCX, images (JPEG, PNG, TIFF, etc.), and other document formats
- **Intelligent Content Extraction**: Leverages Azure Document Intelligence for OCR and structured data extraction
- **Smart Document Chunking**: Implements hierarchy-aware chunking with configurable token limits and overlap
- **Vector Search Integration**: Automatic Azure AI Search index creation and document vectorization
- **Metadata Management**: Complete extraction and management of document metadata and custom fields
- **Hierarchy Structure Repair**: Automatic correction of heading hierarchy in Markdown documents
- **Figure and Formula Extraction**: Advanced extraction of visual elements and mathematical formulas

### 🔧 Technical Features

- **Asynchronous Processing**: High-performance async processing using asyncio and task queues
- **Containerized Deployment**: Complete Docker and Kubernetes support with configurable environments
- **Configuration Management**: Flexible YAML-based configuration for different deployment scenarios
- **Database Support**: SQLAlchemy ORM with support for multiple database backends
- **Resilient Processing**: Built-in retry mechanisms, error handling, and fault tolerance
- **Monitoring & Logging**: Comprehensive logging, progress monitoring, and processing statistics
- **Scalable Architecture**: Horizontal scaling support through containerization and task distribution

## System Architecture

The Document AI Indexer follows a multi-layered architecture designed for scalability, maintainability, and robust error handling. The system processes documents through a well-defined pipeline that transforms raw documents into searchable, vectorized content.

### Architectural Patterns

**Service Factory Pattern**: The system uses a centralized ServiceFactory to manage dependencies and service creation. This pattern ensures consistent configuration across all services and enables easy testing through dependency injection.

**Repository Pattern**: Data access is abstracted through repository interfaces, allowing for different storage backends and simplified testing with mock implementations.

**Command Pattern**: Document processing tasks are encapsulated as commands that can be queued, retried, and executed asynchronously.

**Pipeline Pattern**: The document processing workflow follows a clear pipeline with distinct stages: extraction, hierarchy fixing, chunking, vectorization, and indexing.

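To ground these patterns, the following minimal Python sketch shows the Service Factory; the service names echo the component diagram later in this document, but the constructor signatures and configuration keys are illustrative assumptions rather than the actual implementation.

```python
# Minimal Service Factory sketch; constructor signatures and config
# keys are illustrative assumptions, not the actual implementation.
class DocumentIntelligenceService:
    def __init__(self, endpoint: str, api_key: str) -> None:
        self.endpoint, self.api_key = endpoint, api_key

class ChunkService:
    def __init__(self, chunk_size: int, token_overlap: int) -> None:
        self.chunk_size, self.token_overlap = chunk_size, token_overlap

class ServiceFactory:
    """Creates services from one shared configuration so every component
    sees consistent settings; tests can subclass it and return fakes."""

    def __init__(self, config: dict) -> None:
        self._config = config

    def create_document_intelligence(self) -> DocumentIntelligenceService:
        di = self._config["ai_services"]["document_intelligence"]
        return DocumentIntelligenceService(di["endpoint"], di["api_key"])

    def create_chunk_service(self) -> ChunkService:
        proc = self._config["processing"]
        return ChunkService(proc["chunk_size"], proc["token_overlap"])
```
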
### High-Level Architecture

The high-level architecture represents a distributed, service-oriented system designed for scalable document processing and intelligent content extraction. The architecture emphasizes separation of concerns, fault tolerance, and cloud-native principles to handle enterprise-scale document processing workloads.

#### Architectural Overview

**Multi-Layered Design**: The system is organized into distinct functional layers that separate data ingestion, processing logic, AI services, and storage concerns. This layered approach enables independent scaling, testing, and maintenance of different system components.

**Service-Oriented Architecture**: Each major functional area is implemented as a distinct service or component group, enabling independent deployment, scaling, and maintenance. Services communicate through well-defined interfaces and can be replaced or upgraded independently.

**Cloud-Native Integration**: The architecture leverages Azure cloud services for AI processing, storage, and search capabilities while maintaining abstraction layers that enable portability and testing flexibility.

**Event-Driven Processing**: The system follows an event-driven model where document processing is triggered by events (new documents, configuration changes, etc.) and progresses through a series of processing stages with clear state transitions.

#### System Components and Responsibilities

**Data Sources Layer**: Manages document ingestion from various sources including Azure Blob Storage and local file systems. This layer handles authentication, access control, and metadata extraction from source systems. It provides a unified interface for document discovery regardless of the underlying storage mechanism.

**Processing Engine Layer**: Orchestrates the entire document processing workflow through a hierarchical task management system. The Main Application serves as the central coordinator, while the Task Processor manages work distribution and the Document Task Processor handles individual document processing operations with full state tracking and error recovery.

**AI Services Layer**: Provides intelligent document processing capabilities through integration with Azure AI services and optional Vision LLM systems. These services handle complex operations like OCR, layout analysis, content extraction, and embedding generation. The modular design allows for easy integration of additional AI services or replacement of existing ones.

**Processing Pipeline Layer**: Implements the core document transformation logic through a series of processing stages. Each stage has specific responsibilities: content extraction converts raw documents to structured text, hierarchy fixing normalizes document structure, chunking creates manageable content segments, and vector generation produces searchable embeddings.

**Storage & Search Layer**: Manages persistent data storage and search capabilities through a combination of relational database storage for metadata and state management, Azure AI Search for vector-based content search, and blob storage for processed content and temporary files.

#### Data Flow and Integration Patterns

**Asynchronous Processing Flow**: Documents flow through the system asynchronously, enabling high throughput and efficient resource utilization. Each processing stage can operate independently, with clear handoff points and state persistence between stages.

**Fault-Tolerant Design**: The architecture includes comprehensive error handling and recovery mechanisms at every level. Failed operations are tracked, logged, and can be retried with exponential backoff. The system maintains processing state to enable recovery from failures without losing work.

**Scalability Patterns**: The architecture supports both vertical and horizontal scaling through stateless processing components, connection pooling, and queue-based work distribution. Different components can be scaled independently based on their specific resource requirements and bottlenecks.

**Configuration-Driven Behavior**: System behavior is largely controlled through configuration rather than code, enabling flexible deployment across different environments and use cases without redeployment.

```mermaid
graph TB
    subgraph "Data Sources"
        DS[Document Sources<br/>Azure Blob Storage/Local Files]
        META[Metadata<br/>Configuration]
    end

    subgraph "Processing Engine"
        MAIN[Main Application<br/>Orchestrator]
        TP[Task Processor<br/>Queue Management]
        DTP[Document Task<br/>Processor]
    end

    subgraph AI["AI Services"]
        ADI[Azure Document<br/>Intelligence]
        EMBED[Embedding<br/>Service]
        VLLM[Vision LLM<br/>Optional]
    end

    subgraph "Processing Pipeline"
        EXTRACT[Content<br/>Extraction]
        HIERARCHY[Hierarchy<br/>Fix]
        CHUNK[Document<br/>Chunking]
        VECTOR[Vector<br/>Generation]
    end

    subgraph STORAGE["Storage & Search"]
        DB[(Database<br/>SQLAlchemy)]
        AAS[Azure AI Search<br/>Index]
        BLOB[Azure Blob<br/>Storage]
    end

    DS --> MAIN
    META --> MAIN
    MAIN --> TP
    TP --> DTP
    DTP --> EXTRACT

    EXTRACT --> ADI
    EXTRACT --> VLLM
    ADI --> HIERARCHY
    HIERARCHY --> CHUNK
    CHUNK --> VECTOR
    VECTOR --> EMBED

    DTP --> DB
    VECTOR --> AAS
    EXTRACT --> BLOB

    style DS fill:#e1f5fe
    style AI fill:#f3e5f5
    style STORAGE fill:#e8f5e8
```

### Component Architecture

The component architecture illustrates the internal structure and dependencies between different layers of the system. Each layer has specific responsibilities and communicates through well-defined interfaces.

**Application Layer**: Handles application initialization, configuration loading, and high-level orchestration. The ApplicationContext manages the overall application state and provides access to configuration and services.

**Business Layer**: Contains the core business logic for document processing. The DocumentProcessingOrchestrator coordinates the entire processing workflow, while the DocumentProcessor handles individual document processing tasks.

**Service Layer**: Provides abstracted access to external services and resources. The ServiceFactory manages service creation and configuration, ensuring consistent behavior across the application.

**Data Layer**: Manages data persistence and retrieval through repository patterns and entity models. This layer abstracts database operations and provides a clean interface for data access.

```mermaid
graph LR
    subgraph "Application Layer"
        APP[DocumentProcessingApplication]
        CTX[ApplicationContext]
        CONFIG[ApplicationConfig]
    end

    subgraph "Business Layer"
        BL[Business Layer]
        ORCH[DocumentProcessingOrchestrator]
        PROC[DocumentProcessor]
        FACTORY[DocumentProcessingFactory]
    end

    subgraph "Service Layer"
        SF[ServiceFactory]
        DI[DocumentIntelligenceService]
        CHUNK[ChunkService]
        INDEX[AzureIndexService]
        BLOB[BlobService]
    end

    subgraph "Data Layer"
        DB[DatabaseInterface]
        REPO[DocumentRepository]
        MODELS[Entity Models]
    end

    APP --> BL
    CTX --> CONFIG
    APP --> CTX

    BL --> SF
    ORCH --> PROC
    FACTORY --> ORCH

    SF --> DI
    SF --> CHUNK
    SF --> INDEX
    SF --> BLOB

    PROC --> DB
    DB --> REPO
    REPO --> MODELS

    style APP fill:#bbdefb
    style BL fill:#c8e6c9
    style SF fill:#ffecb3
    style DB fill:#f8bbd9
```

## Workflow

The document processing workflow is designed to handle large-scale document processing with fault tolerance and efficient resource utilization. The system processes documents asynchronously through a task-based architecture.

### Processing Strategy

**Asynchronous Task Processing**: Documents are processed as individual tasks that can be executed in parallel. This approach maximizes throughput and allows for efficient resource utilization across multiple processing nodes.

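The sketch below illustrates bounded parallel task execution with asyncio; the concurrency limit and the `process_document` placeholder are assumptions for illustration, not the actual task processor.

```python
import asyncio

MAX_CONCURRENCY = 8  # illustrative; the real limit is configurable

async def process_document(doc_key: str) -> str:
    """Placeholder for the real per-document pipeline."""
    await asyncio.sleep(0.1)  # stands in for extraction/chunking/indexing I/O
    return f"{doc_key}: done"

async def run_tasks(doc_keys: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def bounded(key: str) -> str:
        async with semaphore:  # cap the number of in-flight documents
            return await process_document(key)

    return await asyncio.gather(*(bounded(k) for k in doc_keys))

if __name__ == "__main__":
    print(asyncio.run(run_tasks([f"doc-{i}" for i in range(20)])))
```
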
**Stateful Processing**: Each document's processing state is tracked in the database, enabling recovery from failures and preventing duplicate processing. The system maintains detailed status information and processing history.

**Batch Operations**: Where possible, operations are batched to improve efficiency. This is particularly important for operations like embedding generation and search index uploads.

**Retry Logic**: Failed operations are automatically retried with exponential backoff. The system distinguishes between transient failures (which should be retried) and permanent failures (which should be logged and skipped).

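As a sketch of that retry policy, the helper below retries only errors marked transient, with exponential backoff and jitter; the `TransientError` type, delay values, and retry count are illustrative assumptions.

```python
import asyncio
import random

class TransientError(Exception):
    """Marks failures worth retrying (timeouts, throttling)."""

async def with_retry(op, max_retries: int = 5, base_delay: float = 1.0):
    for attempt in range(max_retries + 1):
        try:
            return await op()
        except TransientError:
            if attempt == max_retries:
                raise  # treated as permanent upstream: logged and skipped
            # exponential backoff with jitter: ~1s, 2s, 4s, ...
            await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```
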
### Document Processing Workflow

```mermaid
sequenceDiagram
    participant USER as User/Scheduler
    participant MAIN as Main App
    participant TP as Task Processor
    participant DTP as Document Task Processor
    participant ORCH as Orchestrator
    participant ADI as Azure DI
    participant CHUNK as Chunk Service
    participant INDEX as Index Service
    participant DB as Database

    USER->>MAIN: Start Processing
    MAIN->>MAIN: Initialize Configuration
    MAIN->>DB: Initialize Database
    MAIN->>TP: Create Task Processor

    loop For Each Document
        MAIN->>TP: Submit Document Task
        TP->>DTP: Process Task
        DTP->>DB: Create/Update IndexObject
        DTP->>ORCH: Execute Processing

        ORCH->>ADI: Extract Document Content
        ADI-->>ORCH: Return Extracted Content

        ORCH->>ORCH: Fix Hierarchy
        ORCH->>CHUNK: Chunk Document
        CHUNK-->>ORCH: Return Chunks

        ORCH->>INDEX: Generate Embeddings
        INDEX-->>ORCH: Return Vectors

        ORCH->>INDEX: Upload to Search Index
        INDEX-->>ORCH: Confirm Upload

        ORCH-->>DTP: Return Processing Result
        DTP->>DB: Update IndexObject Status
        DTP-->>TP: Return Result
    end

    TP-->>MAIN: Processing Complete
    MAIN-->>USER: Return Statistics
```

### Data Flow Architecture

The data flow architecture represents the end-to-end processing pipeline from document ingestion to search index publication. This design emphasizes fault tolerance, scalability, and efficient resource utilization throughout the processing lifecycle.

#### Design Principles for Data Flow

**Pipeline-Based Processing**: The data flow follows a clear pipeline pattern where each stage has specific responsibilities and well-defined inputs and outputs. This design enables parallel processing, easier debugging, and modular testing of individual stages.

**Decision Points and Routing**: The architecture includes intelligent decision points that route documents through appropriate processing paths based on their characteristics. This ensures optimal processing strategies for different document types while maintaining a unified interface.

**State Management**: Processing state is carefully managed throughout the pipeline, with persistent state stored in the database and transient state maintained in memory. This approach enables recovery from failures at any point in the pipeline.

**Resource Optimization**: The flow is designed to minimize resource usage through efficient batching, connection reuse, and memory management. Processing stages are optimized to balance throughput with resource consumption.

#### Processing Flow Stages

**Initialization Phase**: The system performs comprehensive initialization including configuration validation, database connectivity checks, and service authentication. This phase ensures that all dependencies are available before processing begins.

**Discovery and Task Creation**: Document sources are scanned to identify new or modified documents that require processing. Tasks are created based on configured criteria such as file modification dates and processing history.

**Format Detection and Routing**: Documents are analyzed to determine their format and complexity, enabling the system to select the most appropriate extraction method. This intelligent routing ensures optimal processing quality and efficiency.

**Content Extraction**: Multiple extraction paths are available depending on document characteristics. The system can leverage Azure Document Intelligence for complex documents, Vision LLM for advanced image analysis, or direct processing for simple text documents.

**Content Enhancement**: Extracted content undergoes enhancement through hierarchy fixing and structure normalization. This stage ensures that the processed content maintains logical structure and is suitable for effective chunking.

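To make the hierarchy-fixing step concrete, here is a minimal sketch that clamps Markdown heading levels so they never skip more than one level; it deliberately ignores edge cases such as `#` lines inside code fences, which a production fixer would need to handle.

```python
import re

def fix_heading_hierarchy(markdown: str) -> str:
    """Clamp heading levels so each heading is at most one level deeper
    than the previous one (e.g. '#' followed by '###' becomes '##')."""
    fixed, last_level = [], 0
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            level = min(len(match.group(1)), last_level + 1)
            last_level = level
            line = "#" * level + " " + match.group(2)
        fixed.append(line)
    return "\n".join(fixed)
```
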
**Vectorization and Indexing**: The final stages convert processed content into searchable vectors and upload them to the search index. These operations are batched for efficiency and include comprehensive error handling and retry logic.

```mermaid
flowchart TD
    START([Start Processing]) --> INIT[Initialize Application]
    INIT --> LOAD_CONFIG[Load Configuration]
    LOAD_CONFIG --> INIT_DB[Initialize Database]
    INIT_DB --> SCAN_DOCS[Scan Document Sources]

    SCAN_DOCS --> CREATE_TASKS[Create Processing Tasks]
    CREATE_TASKS --> PROCESS_TASK{Process Each Task}

    PROCESS_TASK --> EXTRACT[Extract Content]
    EXTRACT --> CHECK_FORMAT{Check Document Format}

    CHECK_FORMAT -->|PDF/Images| USE_DI[Use Azure Document Intelligence]
    CHECK_FORMAT -->|Vision Mode| USE_VLLM[Use Vision LLM]
    CHECK_FORMAT -->|Text| DIRECT_PROCESS[Direct Processing]

    USE_DI --> EXTRACT_RESULT[Content + Metadata]
    USE_VLLM --> EXTRACT_RESULT
    DIRECT_PROCESS --> EXTRACT_RESULT

    EXTRACT_RESULT --> FIX_HIERARCHY[Fix Document Hierarchy]
    FIX_HIERARCHY --> CHUNK_DOC[Chunk Document]
    CHUNK_DOC --> GENERATE_VECTORS[Generate Embeddings]
    GENERATE_VECTORS --> UPLOAD_INDEX[Upload to Search Index]

    UPLOAD_INDEX --> UPDATE_DB[Update Database Status]
    UPDATE_DB --> MORE_TASKS{More Tasks?}

    MORE_TASKS -->|Yes| PROCESS_TASK
    MORE_TASKS -->|No| COMPLETE[Processing Complete]

    COMPLETE --> STATS[Generate Statistics]
    STATS --> END([End])

    style START fill:#c8e6c9
    style END fill:#ffcdd2
    style EXTRACT fill:#fff3e0
    style GENERATE_VECTORS fill:#e1f5fe
    style UPLOAD_INDEX fill:#f3e5f5
```

## Functional Logic

The functional logic of the Document AI Indexer encompasses three main processing areas: document extraction, content chunking, and search indexing. Each area implements sophisticated algorithms to ensure high-quality output.

### Design Principles for Document Processing

**Format-Agnostic Processing**: The system handles multiple document formats through a unified interface. Different extractors are used based on document type, but all produce a standardized Document object.

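As a sketch of what this unified interface could look like, the `Document` fields and protocol methods below are illustrative assumptions rather than the project's actual types.

```python
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class Document:
    """Standardized extractor output (illustrative fields)."""
    content: str                      # markdown-formatted text
    metadata: dict = field(default_factory=dict)

class DocumentExtractor(Protocol):
    def supports(self, file_name: str) -> bool: ...
    def extract(self, data: bytes, file_name: str) -> Document: ...

class TextExtractor:
    """Trivial extractor for plain-text files."""
    def supports(self, file_name: str) -> bool:
        return file_name.lower().endswith((".txt", ".md"))

    def extract(self, data: bytes, file_name: str) -> Document:
        return Document(content=data.decode("utf-8"),
                        metadata={"source": file_name})
```
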
**Intelligent Content Analysis**: Before processing, the system analyzes document structure to determine the optimal processing strategy. This includes detecting header hierarchies, identifying figures and tables, and understanding document layout.

**Quality Assurance**: Each processing stage includes validation and quality checks. For example, the hierarchy fixer validates that document structure is logical and coherent before proceeding to chunking.

**Metadata Preservation**: Throughout the processing pipeline, important metadata is preserved and enriched. This includes document properties, processing timestamps, and structural information.

### Document Extraction Logic

The document extraction logic is the foundation of the processing pipeline. It handles the complex task of converting various document formats into structured, searchable content while preserving important layout and formatting information.

**Multi-Modal Processing**: The system supports both traditional OCR-based extraction and advanced vision-language model processing. The choice of extraction method depends on document complexity and available resources.

**Feature Detection**: Azure Document Intelligence features are selectively enabled based on document characteristics and configuration. This includes high-resolution OCR for detailed documents, formula extraction for technical content, and figure extraction for visual elements.

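The sketch below shows how config-driven feature selection might be wired up; `analyze_document` stands in for the actual Azure Document Intelligence client call, and the feature strings are illustrative, since the exact flag names and SDK surface vary by API version.

```python
# Illustrative sketch; feature names and the analyze_document callable
# are assumptions, not a specific SDK's API.
def build_features(config: dict) -> list[str]:
    features = []
    if config.get("high_resolution_ocr"):
        features.append("ocrHighResolution")
    if config.get("extract_formulas"):
        features.append("formulas")
    if config.get("extract_figures"):
        features.append("figures")
    return features

def extract_with_fallback(document_bytes: bytes, config: dict, analyze_document):
    try:
        # First attempt: full feature set from configuration.
        return analyze_document(document_bytes, features=build_features(config))
    except Exception:
        # Fallback: basic extraction so content is not lost (see below).
        return analyze_document(document_bytes, features=[])
```
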
**Content Structure Preservation**: The extraction process maintains document structure through markdown formatting, preserving headers, lists, tables, and other formatting elements that provide context for the content.

**Error Handling and Fallbacks**: If advanced extraction features fail, the system falls back to basic extraction methods to ensure that content is not lost due to processing errors.

```mermaid
flowchart TD
    DOC[Document Input] --> DETECT[Detect Format]

    DETECT --> PDF{PDF?}
    DETECT --> IMG{Image?}
    DETECT --> OFFICE{Office Doc?}
    DETECT --> TEXT{Text File?}

    PDF -->|Yes| DI_PDF[Azure DI Layout Model]
    IMG -->|Yes| RESIZE[Resize if Needed]
    OFFICE -->|Yes| CONVERT[Convert to Supported Format]
    TEXT -->|Yes| DIRECT[Direct Content Read]

    RESIZE --> DI_IMG[Azure DI OCR + Layout]
    CONVERT --> DI_OFFICE[Azure DI Document Analysis]

    DI_PDF --> FEATURES[Apply DI Features]
    DI_IMG --> FEATURES
    DI_OFFICE --> FEATURES

    FEATURES --> HIGH_RES{High Resolution OCR?}
    FEATURES --> FORMULAS{Extract Formulas?}
    FEATURES --> FIGURES{Extract Figures?}

    HIGH_RES -->|Yes| ENABLE_HIRES[Enable High-Res OCR]
    FORMULAS -->|Yes| ENABLE_FORMULAS[Enable Formula Extraction]
    FIGURES -->|Yes| ENABLE_FIGURES[Enable Figure Extraction]

    ENABLE_HIRES --> PROCESS_DI[Process with Azure DI]
    ENABLE_FORMULAS --> PROCESS_DI
    ENABLE_FIGURES --> PROCESS_DI
    HIGH_RES -->|No| PROCESS_DI
    FORMULAS -->|No| PROCESS_DI
    FIGURES -->|No| PROCESS_DI

    DIRECT --> EXTRACT_META[Extract Metadata]
    PROCESS_DI --> EXTRACT_CONTENT[Extract Content + Structure]

    EXTRACT_CONTENT --> EXTRACT_META
    EXTRACT_META --> RESULT[Document Object]

    style DOC fill:#e3f2fd
    style RESULT fill:#c8e6c9
    style PROCESS_DI fill:#fff3e0
```

### Chunking Strategy

The chunking strategy is critical for creating meaningful, searchable segments from large documents. The system implements intelligent chunking that respects document structure while maintaining optimal chunk sizes for search and retrieval.

**Hierarchy-Aware Chunking**: The system analyzes document structure and uses markdown headers to create logical chunks. This ensures that related content stays together and that chunks maintain contextual coherence.

**Adaptive Chunking**: Chunk boundaries are determined by both content structure and token limits. The system balances the need for complete thoughts with search engine constraints.

**Overlap Strategy**: Configurable token overlap between chunks ensures that important information at chunk boundaries is not lost during retrieval operations.

**Token Management**: Precise token counting using tiktoken ensures that chunks stay within specified limits while maximizing content density.

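A minimal token-window sketch using tiktoken is shown below, with the chunk size and overlap matching the configuration in the diagram that follows; the real chunker is hierarchy-aware and splits on markdown headers first, which this sketch omits, and the `cl100k_base` encoding is an assumption.

```python
import tiktoken

def chunk_by_tokens(text: str, chunk_size: int = 2048, overlap: int = 128) -> list[str]:
    """Split text into windows of at most chunk_size tokens, repeating
    `overlap` tokens at each boundary so context is not lost."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks, step = [], chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(encoding.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```
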
```mermaid
flowchart TD
    CONTENT[Extracted Content] --> HIERARCHY_FIX{Apply Hierarchy Fix?}

    HIERARCHY_FIX -->|Yes| FIX[Fix Header Hierarchy]
    HIERARCHY_FIX -->|No| CHUNK_STRATEGY[Determine Chunking Strategy]

    FIX --> ANALYZE[Analyze Document Structure]
    ANALYZE --> CHUNK_STRATEGY

    CHUNK_STRATEGY --> MARKDOWN{Markdown Headers?}
    CHUNK_STRATEGY --> RECURSIVE{Use Recursive Split?}

    MARKDOWN -->|Yes| HEADER_SPLIT[Markdown Header Splitter]
    MARKDOWN -->|No| RECURSIVE
    RECURSIVE -->|Yes| CHAR_SPLIT[Recursive Character Splitter]

    HEADER_SPLIT --> CONFIG[Apply Chunk Configuration]
    CHAR_SPLIT --> CONFIG

    CONFIG --> SIZE[Chunk Size: 2048 tokens]
    CONFIG --> OVERLAP[Token Overlap: 128]

    SIZE --> SPLIT[Split Document]
    OVERLAP --> SPLIT

    SPLIT --> VALIDATE[Validate Chunk Sizes]
    VALIDATE --> METADATA[Add Chunk Metadata]

    METADATA --> RESULT[Chunked Documents]

    style CONTENT fill:#e3f2fd
    style RESULT fill:#c8e6c9
    style FIX fill:#fff3e0
    style SPLIT fill:#f3e5f5
```

### Indexing and Search Integration

The indexing and search integration component handles the final stage of the processing pipeline, converting processed documents into searchable vector representations and uploading them to Azure AI Search.

**Vector Generation**: The system generates high-quality embeddings using Azure OpenAI services. Multiple vector fields can be configured to support different search scenarios (content-based, metadata-based, etc.).

**Batch Processing**: Documents are processed in configurable batches to optimize upload performance and manage API rate limits effectively.

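A sketch of batched embedding generation with the Azure OpenAI Python SDK (openai >= 1.x) is shown below; the endpoint, API version, deployment name, and batch size are illustrative placeholders that come from configuration in the real system.

```python
from openai import AzureOpenAI  # openai >= 1.x

# Illustrative values; the real ones come from configuration.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-02-01",
)

def embed_chunks(chunks: list[str], batch_size: int = 16) -> list[list[float]]:
    """Generate embeddings in batches to respect API rate limits."""
    vectors: list[list[float]] = []
    for i in range(0, len(chunks), batch_size):
        response = client.embeddings.create(
            model="text-embedding-ada-002",  # deployment name is an assumption
            input=chunks[i:i + batch_size],
        )
        vectors.extend(item.embedding for item in response.data)
    return vectors
```
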
**Schema Management**: The system automatically creates and manages search index schemas based on configuration files, ensuring that all required fields and vector configurations are properly set up.

**Error Recovery**: Failed uploads are tracked and retried, with detailed logging to help diagnose and resolve issues. The system can recover from partial batch failures without losing processed content.

```mermaid
flowchart TD
    CHUNKS[Document Chunks] --> EMBED[Generate Embeddings]

    EMBED --> OPENAI[Azure OpenAI API]
    OPENAI --> VECTORS[Vector Embeddings]

    VECTORS --> PREPARE[Prepare Index Documents]
    PREPARE --> METADATA[Add Metadata Fields]

    METADATA --> CUSTOM[Add Custom Fields]
    CUSTOM --> BATCH[Create Upload Batches]

    BATCH --> SIZE[Batch Size: 50 docs]
    SIZE --> UPLOAD[Upload to Azure AI Search]

    UPLOAD --> SUCCESS{Upload Successful?}
    SUCCESS -->|Yes| UPDATE_STATUS[Update Success Status]
    SUCCESS -->|No| RETRY[Retry Upload]

    RETRY --> MAX_RETRIES{Max Retries Reached?}
    MAX_RETRIES -->|No| UPLOAD
    MAX_RETRIES -->|Yes| ERROR[Mark as Failed]

    UPDATE_STATUS --> NEXT_BATCH{More Batches?}
    NEXT_BATCH -->|Yes| BATCH
    NEXT_BATCH -->|No| COMPLETE[Index Complete]

    ERROR --> LOG[Log Error Details]
    LOG --> COMPLETE

    style CHUNKS fill:#e3f2fd
    style COMPLETE fill:#c8e6c9
    style EMBED fill:#fff3e0
    style UPLOAD fill:#f3e5f5
    style ERROR fill:#ffcdd2
```

## Database Schema

The database schema is designed to support scalable document processing operations while maintaining data integrity and enabling efficient querying. The schema tracks processing state, manages job coordination, and provides audit trails.

### Design Rationale

**Composite Primary Keys**: The IndexObject table uses a composite primary key (object_key, datasource_name) to support multi-tenant scenarios where the same document might exist in different data sources.

**State Tracking**: Detailed status tracking allows the system to resume processing after failures and provides visibility into processing progress and issues.

**Audit Trail**: Comprehensive timestamp tracking and detailed message logging provide full audit trails for compliance and debugging purposes.

**Job Coordination**: The IndexJob table enables coordination of processing jobs across multiple instances and provides reporting on job completion and success rates.

### Core Entities

```mermaid
erDiagram
    IndexObject {
        string object_key PK
        string datasource_name PK
        string type
        string status
        datetime created_time
        datetime updated_time
        datetime last_start_time
        datetime last_finished_time
        int try_count
        int last_run_id
        text detailed_message
        text error_message
        text last_message
    }

    IndexJob {
        int id PK
        string datasource_name
        string status
        datetime start_time
        datetime end_time
        int total_files
        int processed_files
        int failed_files
        int skipped_files
        text config_snapshot
        text error_message
    }

    IndexObject }o--|| IndexJob : belongs_to
```

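A sketch of how the composite primary key could be declared with SQLAlchemy's declarative mapping, mirroring the ER diagram above; the table name and column details are illustrative and may differ from the actual models.

```python
from sqlalchemy import Column, DateTime, Integer, String, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class IndexObject(Base):
    __tablename__ = "index_object"  # table name is an assumption

    # Composite primary key: the same document may exist in
    # different data sources (see the rationale above).
    object_key = Column(String(1024), primary_key=True)
    datasource_name = Column(String(255), primary_key=True)

    type = Column(String(64))
    status = Column(String(32))
    created_time = Column(DateTime)
    updated_time = Column(DateTime)
    last_start_time = Column(DateTime)
    last_finished_time = Column(DateTime)
    try_count = Column(Integer)
    last_run_id = Column(Integer)
    detailed_message = Column(Text)
    error_message = Column(Text)
    last_message = Column(Text)
```
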
## Configuration Management

The configuration management system is designed to support flexible deployment across different environments while maintaining security and ease of management. The system separates business configuration from sensitive credentials and provides environment-specific overrides.

### Configuration Strategy

**Separation of Concerns**: Business logic configuration (data sources, processing parameters) is separated from sensitive credentials (API keys, connection strings) to enable secure deployment practices.

**Environment-Specific Configuration**: The system supports multiple configuration files that can be combined to create environment-specific deployments without duplicating common settings.

**Validation and Defaults**: Configuration values are validated at startup, and sensible defaults are provided to minimize required configuration while ensuring the system operates correctly.

**Dynamic Reconfiguration**: Many configuration parameters can be modified without requiring application restarts, enabling operational flexibility and optimization.

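A minimal sketch of the two-file configuration split is shown below, assuming PyYAML and the config.yaml / env.yaml names used in the deployment section; the top-level merge shown here is the simplest possible strategy.

```python
import os
import yaml  # PyYAML

def load_config(config_path: str = "config.yaml",
                env_path: str = "env.yaml") -> dict:
    """Merge business configuration with credentials kept in a separate
    file (mounted as a Kubernetes Secret in the deployment below)."""
    with open(config_path) as f:
        config = yaml.safe_load(f) or {}
    if os.path.exists(env_path):
        with open(env_path) as f:
            config.update(yaml.safe_load(f) or {})  # secrets override base keys
    return config
```
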
### Configuration Structure

```mermaid
mindmap
  root((Configuration))
    Data Sources
      Blob Storage
        SAS Tokens
        Container Paths
      Local Files
        Directory Paths
        File Filters
    Processing
      Chunk Size
      Token Overlap
      Batch Sizes
      Retry Limits
    AI Services
      Azure Document Intelligence
        Endpoint
        API Key
        Features
      Azure OpenAI
        Endpoint
        API Key
        Model Settings
    Database
      Connection String
      Connection Pool
    Index Schemas
      Field Mappings
      Vector Configurations
      Search Index Settings
```

## Deployment Architecture

The deployment architecture is designed for cloud-native operations with support for both batch processing and continuous operation modes. The system leverages Kubernetes for orchestration and scaling while maintaining compatibility with various deployment scenarios.

### Cloud-Native Design Principles

**Containerization**: The application is fully containerized, enabling consistent deployment across different environments and easy scaling based on demand.

**Stateless Processing**: Processing pods are designed to be stateless, with all persistent state managed through external databases and storage services. This enables horizontal scaling and fault tolerance.

**Configuration Externalization**: All configuration is externalized through ConfigMaps and Secrets, allowing for environment-specific configuration without rebuilding container images.

**Resource Management**: The deployment configuration includes resource limits and requests to ensure proper resource allocation and prevent resource contention in multi-tenant environments.

### Scaling Strategy

**Horizontal Pod Autoscaling**: The system can automatically scale the number of processing pods based on CPU utilization, memory usage, or custom metrics like queue depth.

**Job-Based Processing**: For batch operations, the system uses Kubernetes Jobs and CronJobs to ensure processing completion and automatic cleanup of completed jobs.

**Load Distribution**: Multiple pods process documents in parallel, with work distribution managed through the database-backed task queue system.

### Kubernetes Deployment

```mermaid
graph TB
    subgraph "Kubernetes Cluster"
        subgraph "Namespace: document-ai"
            POD1[Document Processor Pod 1]
            POD2[Document Processor Pod 2]
            POD3[Document Processor Pod N]

            CM[ConfigMap<br/>config.yaml]
            SECRET[Secret<br/>env.yaml]

            PVC[PersistentVolumeClaim<br/>Temp Storage]
        end

        subgraph "Services"
            SVC[LoadBalancer Service]
            CRON[CronJob Controller]
        end
    end

    subgraph "External Services"
        AZURE_DI[Azure Document Intelligence]
        AZURE_OPENAI[Azure OpenAI]
        AZURE_SEARCH[Azure AI Search]
        AZURE_STORAGE[Azure Blob Storage]
        DATABASE[(Database)]
    end

    CM --> POD1
    CM --> POD2
    CM --> POD3

    SECRET --> POD1
    SECRET --> POD2
    SECRET --> POD3

    PVC --> POD1
    PVC --> POD2
    PVC --> POD3

    SVC --> POD1
    SVC --> POD2
    SVC --> POD3

    CRON --> POD1

    POD1 --> AZURE_DI
    POD1 --> AZURE_OPENAI
    POD1 --> AZURE_SEARCH
    POD1 --> AZURE_STORAGE
    POD1 --> DATABASE

    POD2 --> AZURE_DI
    POD2 --> AZURE_OPENAI
    POD2 --> AZURE_SEARCH
    POD2 --> AZURE_STORAGE
    POD2 --> DATABASE

    POD3 --> AZURE_DI
    POD3 --> AZURE_OPENAI
    POD3 --> AZURE_SEARCH
    POD3 --> AZURE_STORAGE
    POD3 --> DATABASE

    style POD1 fill:#e1f5fe
    style POD2 fill:#e1f5fe
    style POD3 fill:#e1f5fe
    style CM fill:#fff3e0
    style SECRET fill:#ffebee
```

## Performance and Scalability

The system is designed to handle large-scale document processing operations efficiently while maintaining high quality output. Performance optimization occurs at multiple levels: application design, resource utilization, and operational practices.

### Performance Optimization Strategies

**Asynchronous Processing**: All I/O-bound operations are implemented asynchronously to maximize throughput and resource utilization. This is particularly important for operations involving external API calls and database operations.

**Connection Pooling**: Database and HTTP connections are pooled and reused to minimize connection overhead and improve response times.

**Caching Strategies**: Frequently accessed configuration data and metadata are cached in memory to reduce database load and improve response times.

**Batch Operations**: Operations that can be batched (such as database writes and API calls) are grouped together to reduce overhead and improve efficiency.

### Scalability Considerations

**Horizontal Scaling**: The stateless design of processing components enables horizontal scaling by adding more processing instances without architectural changes.

**Database Optimization**: Database operations are optimized through proper indexing, connection pooling, and efficient query patterns to support high-concurrency operations.

**Rate Limiting and Throttling**: The system implements rate limiting and throttling mechanisms to respect external service limits while maintaining optimal throughput.

**Resource Monitoring**: Comprehensive monitoring of resource utilization enables proactive scaling decisions and performance optimization.

### Processing Pipeline Performance

```mermaid
graph LR
    subgraph "Performance Metrics"
        TPS[Throughput<br/>Documents/Second]
        LAT[Latency<br/>Processing Time]
        ERR[Error Rate<br/>Failed Documents]
        RES[Resource Usage<br/>CPU/Memory]
    end

    subgraph "Optimization Strategies"
        ASYNC[Async Processing]
        BATCH[Batch Operations]
        CACHE[Caching Layer]
        RETRY[Retry Logic]
    end

    subgraph "Scaling Options"
        HSCALE[Horizontal Scaling<br/>More Pods]
        VSCALE[Vertical Scaling<br/>Larger Pods]
        QUEUE[Queue Management<br/>Task Distribution]
    end

    TPS --> ASYNC
    LAT --> BATCH
    ERR --> RETRY
    RES --> CACHE

    ASYNC --> HSCALE
    BATCH --> QUEUE
    CACHE --> VSCALE

    style TPS fill:#c8e6c9
    style LAT fill:#fff3e0
    style ERR fill:#ffcdd2
    style RES fill:#e1f5fe
```

## Error Handling and Monitoring

The error handling and monitoring system is designed to provide comprehensive visibility into system operations while implementing robust recovery mechanisms. The system distinguishes between different types of errors and responds appropriately to each.

### Error Classification and Response

**Transient Errors**: Network timeouts, temporary service unavailability, and rate limiting are handled through exponential backoff retry mechanisms. These errors are expected in distributed systems and are handled automatically.

**Configuration Errors**: Invalid configuration values, missing credentials, and similar issues are detected at startup and cause immediate failure with clear error messages to facilitate quick resolution.

**Resource Errors**: Insufficient disk space, memory exhaustion, and similar resource constraints are detected and handled gracefully, often by pausing processing until resources become available.

**Service Errors**: Failures in external services (Azure Document Intelligence, Azure OpenAI, etc.) are handled through fallback mechanisms where possible, or graceful degradation when fallbacks are not available.

### Monitoring and Observability

**Structured Logging**: All log messages follow a structured format that enables efficient searching and analysis. Log levels are used appropriately to balance information content with log volume.

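A minimal stdlib-only sketch of structured (JSON) log output is shown below; the logger name and fields are illustrative, and the project's actual logging setup may differ.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line for efficient searching."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("document_ai_indexer")  # name is an assumption
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Processed document %s with status %s", "docs/sample.pdf", "succeeded")
```
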
**Processing Metrics**: Key performance indicators such as processing rates, error rates, and resource utilization are tracked and can be exported to monitoring systems.

**Health Checks**: The system implements health check endpoints that can be used by orchestration systems to determine system health and restart unhealthy instances.

**Audit Trails**: Complete audit trails of document processing operations are maintained for compliance and debugging purposes.

### Error Handling Strategy

```mermaid
flowchart TD
    ERROR[Error Detected] --> CLASSIFY[Classify Error Type]

    CLASSIFY --> TRANSIENT{Transient Error?}
    CLASSIFY --> CONFIG{Configuration Error?}
    CLASSIFY --> RESOURCE{Resource Error?}
    CLASSIFY --> SERVICE{Service Error?}

    TRANSIENT -->|Yes| RETRY[Retry with Backoff]
    CONFIG -->|Yes| LOG_FATAL[Log Fatal Error]
    RESOURCE -->|Yes| WAIT[Wait for Resources]
    SERVICE -->|Yes| CHECK_SERVICE[Check Service Status]

    RETRY --> MAX_RETRY{Max Retries?}
    MAX_RETRY -->|No| ATTEMPT[Retry Attempt]
    MAX_RETRY -->|Yes| MARK_FAILED[Mark as Failed]

    ATTEMPT --> SUCCESS{Success?}
    SUCCESS -->|Yes| UPDATE_SUCCESS[Update Success]
    SUCCESS -->|No| RETRY

    WAIT --> RESOURCE_CHECK{Resources Available?}
    RESOURCE_CHECK -->|Yes| RETRY
    RESOURCE_CHECK -->|No| WAIT

    CHECK_SERVICE --> SERVICE_OK{Service OK?}
    SERVICE_OK -->|Yes| RETRY
    SERVICE_OK -->|No| ESCALATE[Escalate Error]

    LOG_FATAL --> STOP[Stop Processing]
    MARK_FAILED --> LOG_ERROR[Log Detailed Error]
    ESCALATE --> LOG_ERROR

    UPDATE_SUCCESS --> CONTINUE[Continue Processing]
    LOG_ERROR --> CONTINUE

    style ERROR fill:#ffcdd2
    style UPDATE_SUCCESS fill:#c8e6c9
    style CONTINUE fill:#e8f5e8
```

## Conclusion

The Document AI Indexer provides a comprehensive, scalable solution for intelligent document processing and indexing. Its modular architecture, robust error handling, and integration with Azure AI services make it suitable for enterprise-scale document processing workflows. The system's flexibility allows for easy customization and extension to meet specific business requirements while maintaining high performance and reliability.