# Document AI Indexer

An intelligent document processing and indexing system based on Azure AI services, supporting content extraction, processing, and vectorized indexing for multiple document formats.

## Features

### 🚀 Core Features

- **Multi-format Document Support**: PDF, DOCX, image formats, etc.
- **Intelligent Content Extraction**: OCR and structured extraction using Azure Document Intelligence
- **Document Chunking**: Smart document chunking and vectorization
- **Azure AI Search Integration**: Automatically creates search indexes and uploads documents
- **Metadata Management**: Complete document metadata extraction and management
- **Hierarchy Structure Repair**: Automatically fixes the heading hierarchy in Markdown documents

### 🔧 Technical Features

- **Asynchronous Processing**: High-performance async processing based on asyncio
- **Containerized Deployment**: Complete Docker and Kubernetes support
- **Configuration Management**: Flexible YAML configuration file management
- **Database Support**: SQLAlchemy ORM supporting multiple databases
- **Resilient Processing**: Built-in retry mechanisms and error handling
- **Monitoring & Logging**: Complete logging and progress monitoring

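The retry behavior listed under **Resilient Processing** can be pictured with a small asyncio helper. This is a minimal sketch rather than the project's actual code: `retry_async`, `flaky_extract`, and the backoff parameters are hypothetical names chosen for illustration.

```python
import asyncio
import random


async def retry_async(operation, *, attempts=3, base_delay=0.5, retry_on=(Exception,)):
    """Await operation() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff plus a little jitter to avoid synchronized retries.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))


async def demo():
    calls = {"count": 0}

    async def flaky_extract():
        # Hypothetical stand-in for a remote call that fails twice, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("transient failure")
        return "extracted text"

    result = await retry_async(
        flaky_extract, attempts=5, base_delay=0.01, retry_on=(ConnectionError,)
    )
    return result, calls["count"]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Restricting `retry_on` to transient error types keeps permanent failures, such as invalid credentials, from being retried.
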
## System Architecture

```mermaid
graph LR
    subgraph "Data Sources"
        DS[Document Sources<br/>Blob Storage/Local]
        MD[Metadata<br/>Extraction]
    end

    subgraph "Azure AI Services"
        ADI[Azure Document<br/>Intelligence]
        AAS[Azure AI Search<br/>Index]
        EMB[Vector<br/>Embedding]
    end

    subgraph "Processing Pipeline"
        HF[Hierarchy<br/>Fix]
        CH[Content<br/>Chunking]
    end

    DS --> ADI
    MD --> HF
    ADI --> HF
    HF --> CH
    CH --> EMB
    EMB --> AAS

    style DS fill:#e1f5fe
    style ADI fill:#e8f5e8
    style AAS fill:#fff3e0
    style EMB fill:#f3e5f5
    style HF fill:#ffebee
    style CH fill:#f1f8e9
```

### Document Processing Flow

```mermaid
flowchart TD
    START([Document Input]) --> DOWNLOAD[Download Document]
    DOWNLOAD --> EXTRACT[AI Content Extraction]
    EXTRACT --> FIX[Hierarchy Structure Fix]
    FIX --> CHUNK[Content Chunking]
    CHUNK --> EMBED[Vector Embedding]
    EMBED --> INDEX[Search Index Upload]
    INDEX --> END([Processing Complete])

    style START fill:#c8e6c9
    style END fill:#c8e6c9
    style EXTRACT fill:#e1f5fe
    style FIX fill:#fff3e0
    style CHUNK fill:#f3e5f5
```

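The flow above maps naturally onto a chain of coroutines. The sketch below is illustrative only: every stage function is a hypothetical placeholder that calls no Azure service, and `process_all` with its `max_concurrency` parameter is an assumed way to bound parallelism, not the project's actual scheduler.

```python
import asyncio


# Placeholder stages: each one only transforms strings locally.
async def download(source: str) -> str:
    return f"content of {source}"


async def extract(raw: str) -> str:
    return f"# Title\n\n{raw}"


async def fix_hierarchy(markdown: str) -> str:
    return markdown


async def chunk(markdown: str) -> list[str]:
    return [part for part in markdown.split("\n\n") if part]


async def embed(chunks: list[str]) -> list[tuple[str, list[float]]]:
    # Fake one-dimensional "embedding": the chunk length.
    return [(text, [float(len(text))]) for text in chunks]


async def upload(vectors: list[tuple[str, list[float]]]) -> int:
    return len(vectors)  # number of chunks "indexed"


async def process_document(source: str) -> int:
    """Run one document through the stages in the order shown in the flowchart."""
    raw = await download(source)
    markdown = await fix_hierarchy(await extract(raw))
    vectors = await embed(await chunk(markdown))
    return await upload(vectors)


async def process_all(sources: list[str], max_concurrency: int = 4) -> list[int]:
    """Process many documents concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(source: str) -> int:
        async with semaphore:
            return await process_document(source)

    return await asyncio.gather(*(bounded(s) for s in sources))


if __name__ == "__main__":
    print(asyncio.run(process_all(["a.pdf", "b.docx"])))
```

The semaphore keeps the number of in-flight documents bounded, which matters when the real stages call rate-limited services.
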
## Quick Start

### Requirements

- Python 3.12+
- Azure subscription and related services

For detailed deployment guides, please refer to: [Deployment.md](Deployment.md)

### Install Dependencies

```bash
pip install -r requirements.txt
```

### Configuration Files

The system uses two main configuration files:

- `config.yaml` - Business configuration (data source, index configuration, etc.)
- `env.yaml` - Environment variable configuration (Azure service keys, etc.)

**Quick Start Configuration:**

```yaml
# env.yaml - Essential Azure services
search_service_name: "https://your-search-service.search.windows.net"
search_admin_key: "your-search-admin-key"
form_rec_resource: "https://your-di-service.cognitiveservices.azure.com/"
form_rec_key: "your-di-key"
embedding_model_endpoint: "https://your-openai.openai.azure.com/..."
embedding_model_key: "your-openai-key"

# config.yaml - Basic data source
data_configs:
  - data_path: "https://your-blob-storage.blob.core.windows.net/container?sas-token"
    index_schemas:
      - index_name: "your-knowledge-index"
        data_type: ["metadata", "document", "chunk"]
```

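Before running the indexer, it can help to check that the essential keys are present. The following is a minimal sketch, assuming both YAML files have already been parsed into plain dictionaries by a YAML library; `missing_env_keys` and `validate_config` are hypothetical helpers, not functions from `app_config.py`.

```python
# Keys taken from the env.yaml example in this README.
REQUIRED_ENV_KEYS = (
    "search_service_name",
    "search_admin_key",
    "form_rec_resource",
    "form_rec_key",
    "embedding_model_endpoint",
    "embedding_model_key",
)


def missing_env_keys(env: dict) -> list[str]:
    """Return the required env.yaml keys that are absent or empty."""
    return [key for key in REQUIRED_ENV_KEYS if not env.get(key)]


def validate_config(config: dict) -> list[str]:
    """Return human-readable problems found in a parsed config.yaml."""
    problems = []
    data_configs = config.get("data_configs") or []
    if not data_configs:
        problems.append("data_configs is empty")
    for i, data_config in enumerate(data_configs):
        if not data_config.get("data_path"):
            problems.append(f"data_configs[{i}]: data_path is missing")
    return problems


if __name__ == "__main__":
    env = {key: "placeholder" for key in REQUIRED_ENV_KEYS}
    env.pop("form_rec_key")
    print(missing_env_keys(env))
    print(validate_config({"data_configs": [{"data_path": ""}]}))
```

Failing fast on a missing key is usually cheaper than discovering it halfway through a long indexing run.
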
📖 **Detailed configuration instructions**: See the complete configuration parameters and examples in [Deployment.md - Configuration file preparation](Deployment.md#Configuration-file-preparation).

### Run Application

```bash
# Direct execution
python main.py

# Or use predefined tasks
# (In VS Code, use Ctrl+Shift+P -> Run Task)
```

## 📚 Document Navigation

- **[Deployment Guide (Deployment.md)](Deployment.md)** - Complete deployment guide, including Docker and Kubernetes deployments
- **[Configuration instructions](Deployment.md#Configuration-file-preparation)** - Detailed configuration file description

## Project Structure

```
document-extractor/
├── main.py                     # Application entry point
├── app_config.py               # Configuration management
├── business_layer.py           # Business logic layer
├── document_task_processor.py  # Document task processor
├── di_extractor.py             # Document Intelligence extractor
├── azure_index_service.py      # Azure Search service
├── blob_service.py             # Blob storage service
├── chunk_service.py            # Document chunking service
├── hierarchy_fix.py            # Hierarchy structure repair
├── database.py                 # Database models
├── entity_models.py            # Entity models
├── utils.py                    # Utility functions
├── config.yaml                 # Business configuration
├── env.yaml                    # Environment configuration
├── requirements.txt            # Dependencies
├── Dockerfile                  # Docker build file
├── pyproject.toml              # Project configuration
├── build-script/               # Build scripts
│   └── document-ai-indexer.sh
├── deploy/                     # Deployment files
│   ├── document-ai-indexer.sh
│   ├── document-ai-indexer_k8s.yml
│   ├── document-ai-indexer_cronjob.yml
│   └── embedding-api-proxy_k8s.yml
└── doc/                        # Documentation
```

## Core Components

### 1. Document Processing Pipeline

- **Document Loading**: Loads documents from Azure Blob Storage or the local file system
- **Content Extraction**: OCR and structured extraction using Azure Document Intelligence
- **Content Chunking**: Smart chunking algorithms that maintain semantic integrity
- **Vectorization**: Generates vector representations of document content

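One common way to keep chunks semantically coherent is to split at Markdown headings and cap the chunk size. The sketch below illustrates that idea under stated assumptions: `chunk_markdown` and its `max_chars` parameter are hypothetical and do not describe what `chunk_service.py` actually does. It also treats any block starting with `#` as a heading, so it would need extra handling for fenced code.

```python
import re

HEADING = re.compile(r"^#{1,6}\s")


def chunk_markdown(markdown: str, max_chars: int = 500) -> list[str]:
    """Group blank-line-separated blocks into chunks.

    A new chunk starts at every heading, or when adding the next block
    would push the current chunk past max_chars.
    """
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for block in markdown.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        starts_section = bool(HEADING.match(block))
        too_big = size + len(block) > max_chars
        if current and (starts_section or too_big):
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(block)
        size += len(block)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


if __name__ == "__main__":
    doc = "# A\n\npara one\n\n## B\n\npara two\n\npara three"
    for piece in chunk_markdown(doc, max_chars=1000):
        print(repr(piece))
```

Keeping each heading with the text that follows it means every chunk carries its own context when it is embedded and retrieved.
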
### 2. Index Management

- **Dynamic Index Creation**: Automatically creates Azure AI Search indexes based on configuration
- **Batch Upload**: Efficient batch document upload
- **Metadata Management**: Complete document metadata indexing
- **Incremental Updates**: Supports incremental document updates

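Batch upload and incremental updates can be sketched with two small helpers: one that groups documents into fixed-size batches, and one that uses content hashes to pick out documents that changed since the last run. This is an illustration under assumptions, not the project's mechanism: `batched`, `content_hash`, and `select_changed` are hypothetical names, and the README does not say how changes are actually detected.

```python
import hashlib
from itertools import islice


def batched(items, batch_size: int):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_changed(documents: dict[str, str], known_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents that are new or whose content hash differs."""
    return [
        doc_id
        for doc_id, text in documents.items()
        if known_hashes.get(doc_id) != content_hash(text)
    ]


if __name__ == "__main__":
    known = {"a": content_hash("old text"), "b": content_hash("same text")}
    docs = {"a": "new text", "b": "same text", "c": "brand new"}
    changed = select_changed(docs, known)
    for batch in batched(changed, 1):
        print("would upload:", batch)
```

In a real run, the stored hashes would live in the database and each batch would be sent to the search index in a single request.
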
### 3. Data Processing

- **Hierarchy Structure Repair**: Automatically fixes the heading hierarchy in Markdown documents
- **Metadata Extraction**: Extracts structured metadata from documents and filenames
- **Format Conversion**: Unified processing support for multiple document formats

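To make the heading repair concrete, here is one possible interpretation: rewrite heading levels so that no level is skipped, for example `#` followed directly by `###`. This is a sketch under that assumption; `fix_heading_levels` is a hypothetical function and may differ from what `hierarchy_fix.py` implements.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def fix_heading_levels(markdown: str) -> str:
    """Renumber headings so each one is at most one level deeper than its parent."""
    fixed_lines = []
    stack: list[int] = []  # original levels of the headings currently "open"
    in_fence = False
    for line in markdown.splitlines():
        if line.startswith("```"):
            in_fence = not in_fence  # leave '#' lines inside code fences alone
        match = None if in_fence else HEADING.match(line)
        if match:
            level = len(match.group(1))
            # Close every open heading at the same or a deeper original level.
            while stack and stack[-1] >= level:
                stack.pop()
            stack.append(level)
            # The repaired level is the nesting depth, not the original level.
            line = "#" * len(stack) + " " + match.group(2)
        fixed_lines.append(line)
    return "\n".join(fixed_lines)


if __name__ == "__main__":
    print(fix_heading_levels("# A\n### B\n#### C\n### D\n# E"))
```

Note that this approach also promotes a document whose first heading is `##` to `#`, which may or may not be desirable for a given corpus.
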
## API and Integration

### Azure Service Integration
- **Azure Document Intelligence**: Document analysis and OCR
- **Azure AI Search**: Search indexing and querying
- **Azure Blob Storage**: Document storage
- **Azure OpenAI**: Vector embedding generation

### Database Support
- PostgreSQL (recommended)
- SQLite (development and testing)
- Other SQLAlchemy-supported databases

## Monitoring and Logging

The system provides comprehensive logging capabilities:
- Processing progress monitoring
- Error logging
- Performance statistics
- Task status tracking

View logs:
```bash
# Kubernetes environment
kubectl logs -f document-ai-indexer -n knowledge-agent

# Docker environment
docker logs -f <container-id>
```

## Development

### Development Mode

```bash
# Activate virtual environment
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\activate     # Windows

# Install development dependencies
# (quoted so that shells such as zsh do not expand the brackets)
pip install -e ".[dev,test]"

# Run code checks
mypy .
```

### Log Analysis
```bash
# View error logs
kubectl logs document-ai-indexer -n knowledge-agent | grep ERROR

# View processing progress
kubectl logs document-ai-indexer -n knowledge-agent | grep "Processing"
```

## Version Information

- **Current Version**: 0.20.4
- **Python Version**: 3.12+
- **Main Dependencies**:
  - azure-ai-documentintelligence
  - azure-search-documents
  - SQLAlchemy 2.0.41
  - openai 1.55.3

---

*Last updated: August 2025*