# Document AI Indexer

An intelligent document processing and indexing system based on Azure AI services, supporting content extraction, processing, and vectorized indexing for multiple document formats.

## Features

### 🚀 Core Features

- **Multi-format Document Support**: PDF, DOCX, image formats, etc.
- **Intelligent Content Extraction**: OCR and structured extraction using Azure Document Intelligence
- **Document Chunking**: Smart document chunking and vectorization
- **Azure AI Search Integration**: Automatically create search indexes and upload documents
- **Metadata Management**: Complete document metadata extraction and management
- **Hierarchy Structure Repair**: Automatically fix title hierarchy structure in Markdown documents

### 🔧 Technical Features

- **Asynchronous Processing**: High-performance async processing based on asyncio
- **Containerized Deployment**: Complete Docker and Kubernetes support
- **Configuration Management**: Flexible YAML configuration file management
- **Database Support**: SQLAlchemy ORM supporting multiple databases
- **Resilient Processing**: Built-in retry mechanisms and error handling
- **Monitoring & Logging**: Complete logging and progress monitoring
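
The asyncio-based processing and retry behaviour listed above can be pictured with a minimal sketch. It is not taken from the project's code: `retry_async`, `process_all`, and all parameter names are hypothetical, and exponential backoff with jitter is an assumed retry policy.

```python
import asyncio
import random


async def retry_async(operation, *, attempts=3, base_delay=0.5, max_delay=8.0):
    """Await operation() and retry on failure with exponential backoff and jitter."""
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            await asyncio.sleep(delay + random.uniform(0, delay / 10))


async def process_all(documents, handler, *, concurrency=4):
    """Run handler(doc) for every document with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(doc):
        async with semaphore:
            return await retry_async(lambda: handler(doc))

    # gather() returns results in the same order as the input documents.
    return await asyncio.gather(*(guarded(doc) for doc in documents))
```

The semaphore caps how many documents are processed at once, which is one common way to stay within service rate limits while still overlapping I/O.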

## System Architecture

```mermaid
graph LR
    subgraph "Data Sources"
        DS[Document Sources<br/>Blob Storage/Local]
        MD[Metadata<br/>Extraction]
    end

    subgraph "Azure AI Services"
        ADI[Azure Document<br/>Intelligence]
        AAS[Azure AI Search<br/>Index]
        EMB[Vector<br/>Embedding]
    end

    subgraph "Processing Pipeline"
        HF[Hierarchy<br/>Fix]
        CH[Content<br/>Chunking]
    end

    DS --> ADI
    MD --> HF
    ADI --> HF
    HF --> CH
    CH --> EMB
    EMB --> AAS

    style DS fill:#e1f5fe
    style ADI fill:#e8f5e8
    style AAS fill:#fff3e0
    style EMB fill:#f3e5f5
    style HF fill:#ffebee
    style CH fill:#f1f8e9
```

### Document Processing Flow

```mermaid
flowchart TD
    START([Document Input]) --> DOWNLOAD[Download Document]
    DOWNLOAD --> EXTRACT[AI Content Extraction]
    EXTRACT --> FIX[Hierarchy Structure Fix]
    FIX --> CHUNK[Content Chunking]
    CHUNK --> EMBED[Vector Embedding]
    EMBED --> INDEX[Search Index Upload]
    INDEX --> END([Processing Complete])

    style START fill:#c8e6c9
    style END fill:#c8e6c9
    style EXTRACT fill:#e1f5fe
    style FIX fill:#fff3e0
    style CHUNK fill:#f3e5f5
```
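
The flow above is a linear chain of stages, which can be sketched in Python as follows. This is only an illustration under stated assumptions: `run_pipeline` and the lambda stages are hypothetical placeholders, not the project's actual functions, and each lambda merely stands in for the real service call.

```python
from typing import Any, Callable

Stage = tuple[str, Callable[[Any], Any]]


def run_pipeline(document: Any, stages: list[Stage]) -> Any:
    """Pass the document through each stage in order, naming the stage on failure."""
    result = document
    for name, stage in stages:
        try:
            result = stage(result)
        except Exception as exc:
            raise RuntimeError(f"stage '{name}' failed") from exc
    return result


# Placeholder stages mirroring the flowchart; none of them calls a real service.
stages: list[Stage] = [
    ("download", lambda url: f"bytes of {url}"),
    ("extract", lambda raw: f"# Title\n\n{raw}"),
    ("fix_hierarchy", lambda md: md),
    ("chunk", lambda md: md.split("\n\n")),
    ("embed", lambda chunks: [(c, [float(len(c))]) for c in chunks]),
    ("index", lambda vectors: len(vectors)),
]

# With these placeholders the final stage reports how many chunks were "uploaded".
uploaded = run_pipeline("https://example.invalid/doc.pdf", stages)
```

Wrapping each stage error with the stage name makes it easier to tell from a log line which step of the flow failed.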

## Quick Start

### Requirements

- Python 3.12+
- An Azure subscription with the services this project uses (Document Intelligence, AI Search, Azure OpenAI, and Blob Storage if documents are read from blobs)

For detailed deployment guides, see [Deployment.md](Deployment.md).

### Install Dependencies

```bash
pip install -r requirements.txt
```

### Configuration Files

The system uses two main configuration files:

- `config.yaml` - Business configuration (data source, index configuration, etc.)
- `env.yaml` - Environment configuration (Azure service endpoints, keys, etc.)

**Quick Start Configuration:**

```yaml
# env.yaml - Essential Azure services
search_service_name: "https://your-search-service.search.windows.net"
search_admin_key: "your-search-admin-key"
form_rec_resource: "https://your-di-service.cognitiveservices.azure.com/"
form_rec_key: "your-di-key"
embedding_model_endpoint: "https://your-openai.openai.azure.com/..."
embedding_model_key: "your-openai-key"

# config.yaml - Basic data source
data_configs:
  - data_path: "https://your-blob-storage.blob.core.windows.net/container?sas-token"
    index_schemas:
      - index_name: "your-knowledge-index"
        data_type: ["metadata", "document", "chunk"]
```

📖 **Detailed configuration instructions**: For the complete configuration parameters and examples, see [Deployment.md - Configuration file preparation](Deployment.md#Configuration-file-preparation)
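
Before the first run it can help to check that the placeholder values have been replaced. The sketch below is a hypothetical helper, not part of the project: `find_config_problems` and `REQUIRED_ENV_KEYS` are made-up names, the key list simply mirrors the quick-start example, and it assumes `index_schemas` is nested under each `data_configs` entry.

```python
# Keys copied from the quick-start env.yaml example above.
REQUIRED_ENV_KEYS = (
    "search_service_name",
    "search_admin_key",
    "form_rec_resource",
    "form_rec_key",
    "embedding_model_endpoint",
    "embedding_model_key",
)


def find_config_problems(env: dict, config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the basics look complete."""
    problems = []
    for key in REQUIRED_ENV_KEYS:
        value = env.get(key)
        # The example placeholders all contain "your-", so treat that as unfilled.
        if not value or "your-" in str(value):
            problems.append(f"env.yaml: '{key}' is missing or still a placeholder")

    data_configs = config.get("data_configs") or []
    if not data_configs:
        problems.append("config.yaml: 'data_configs' is empty")
    for i, entry in enumerate(data_configs):
        if not entry.get("data_path"):
            problems.append(f"config.yaml: data_configs[{i}] has no 'data_path'")
        # Assumption: index_schemas is a list nested under each data source.
        for j, schema in enumerate(entry.get("index_schemas") or []):
            if not schema.get("index_name"):
                problems.append(
                    f"config.yaml: data_configs[{i}].index_schemas[{j}] has no 'index_name'"
                )
    return problems
```

The two dictionaries could come from parsing each file with a YAML library (for example PyYAML's `yaml.safe_load`); the check itself only needs plain dictionaries.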

### Run Application

```bash
# Direct execution
python main.py

# Or use predefined tasks
# (In VS Code, use Ctrl+Shift+P -> Run Task)
```

## 📚 Document Navigation

- **[Deployment Guide (Deployment.md)](Deployment.md)** - Complete deployment guide, including Docker and Kubernetes deployments
- **[Configuration instructions](Deployment.md#Configuration-file-preparation)** - Detailed configuration file description

## Project Structure

```
document-extractor/
├── main.py                          # Application entry point
├── app_config.py                    # Configuration management
├── business_layer.py                # Business logic layer
├── document_task_processor.py       # Document task processor
├── di_extractor.py                  # Document Intelligence extractor
├── azure_index_service.py           # Azure Search service
├── blob_service.py                  # Blob storage service
├── chunk_service.py                 # Document chunking service
├── hierarchy_fix.py                 # Hierarchy structure repair
├── database.py                      # Database models
├── entity_models.py                 # Entity models
├── utils.py                         # Utility functions
├── config.yaml                      # Business configuration
├── env.yaml                         # Environment configuration
├── requirements.txt                 # Dependencies
├── Dockerfile                       # Docker build file
├── pyproject.toml                   # Project configuration
├── build-script/                    # Build scripts
│   └── document-ai-indexer.sh
├── deploy/                          # Deployment files
│   ├── document-ai-indexer.sh
│   ├── document-ai-indexer_k8s.yml
│   ├── document-ai-indexer_cronjob.yml
│   └── embedding-api-proxy_k8s.yml
└── doc/                             # Documentation
```

## Core Components

### 1. Document Processing Pipeline

- **Document Loading**: Support loading from Azure Blob Storage or local file system
- **Content Extraction**: OCR and structured extraction using Azure Document Intelligence
- **Content Chunking**: Smart chunking algorithms maintaining semantic integrity
- **Vectorization**: Generate vector representations of document content
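
One way to keep chunks semantically coherent is to split Markdown at headings first and only then pack paragraphs up to a size limit. The sketch below illustrates that idea; it is an assumption about a reasonable approach, not the algorithm in `chunk_service.py`, and `chunk_markdown` and `max_chars` are hypothetical names.

```python
import re


def chunk_markdown(text: str, max_chars: int = 1000) -> list[str]:
    """Split Markdown at headings, then pack paragraphs into chunks of about max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    # Split before every ATX heading so a heading stays with the text that follows it.
    sections = [s.strip() for s in re.split(r"(?m)^(?=#{1,6}\s)", text) if s.strip()]
    chunks = []
    for section in sections:
        current = ""
        for paragraph in section.split("\n\n"):
            candidate = f"{current}\n\n{paragraph}" if current else paragraph
            if current and len(candidate) > max_chars:
                chunks.append(current)  # adding this paragraph would overflow
                current = paragraph
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks
```

Because a chunk never spans two headed sections, each chunk can carry its section heading as context when it is embedded.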

### 2. Index Management

- **Dynamic Index Creation**: Automatically create Azure AI Search indexes based on configuration
- **Batch Upload**: Efficient batch document upload
- **Metadata Management**: Complete document metadata indexing
- **Incremental Updates**: Support incremental document updates
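
Batch upload and incremental updates can be illustrated with two small helpers. How the project actually detects changed documents is not described here, so comparing content hashes is purely an assumption, and `in_batches`, `content_fingerprint`, and `select_changed` are hypothetical names.

```python
import hashlib
from itertools import islice


def in_batches(items, size):
    """Yield successive lists of at most `size` items, e.g. one list per upload call."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def content_fingerprint(data: bytes) -> str:
    """Return a stable SHA-256 hex digest of the document bytes."""
    return hashlib.sha256(data).hexdigest()


def select_changed(documents: dict[str, bytes], known: dict[str, str]) -> list[str]:
    """Return names whose fingerprint differs from the one recorded on a previous run."""
    return [
        name
        for name, data in documents.items()
        if known.get(name) != content_fingerprint(data)
    ]
```

Only the names returned by `select_changed` would need to be reprocessed, and `in_batches` keeps each upload request to a bounded size.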

### 3. Data Processing

- **Hierarchy Structure Repair**: Automatically fix title hierarchy in Markdown documents
- **Metadata Extraction**: Extract structured metadata from documents and filenames
- **Format Conversion**: Unified processing support for multiple document formats
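
As an illustration of what repairing a title hierarchy can mean, the sketch below clamps each Markdown heading so it is never more than one level deeper than the heading before it. This is a simplified assumption, not the logic in `hierarchy_fix.py`; `fix_heading_levels` is a hypothetical name.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def fix_heading_levels(markdown: str) -> str:
    """Rewrite ATX headings so each is at most one level deeper than the one before it."""
    fixed_lines = []
    previous_level = 0
    in_fence = False
    for line in markdown.splitlines():
        # Lines inside fenced code blocks are never treated as headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            level = min(len(match.group(1)), previous_level + 1)
            previous_level = level
            line = f"{'#' * level} {match.group(2)}"
        fixed_lines.append(line)
    return "\n".join(fixed_lines)
```

For example, a document whose headings go `#`, `###`, `####` would come out as `#`, `##`, `###`, which gives downstream chunking a consistent outline to work with.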

## API and Integration

### Azure Service Integration
- **Azure Document Intelligence**: Document analysis and OCR
- **Azure AI Search**: Search indexing and querying
- **Azure Blob Storage**: Document storage
- **Azure OpenAI**: Vector embedding generation

### Database Support
- PostgreSQL (recommended)
- SQLite (development and testing)
- Other SQLAlchemy-supported databases

## Monitoring and Logging

The system provides comprehensive logging capabilities:
- Processing progress monitoring
- Error logging
- Performance statistics
- Task status tracking

View logs:
```bash
# Kubernetes environment
kubectl logs -f document-ai-indexer -n knowledge-agent

# Docker environment
docker logs -f <container-id>
```

## Development

### Development Mode

```bash
# Activate virtual environment
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\activate  # Windows

# Install development dependencies (quoted so shells such as zsh do not expand the brackets)
pip install -e ".[dev,test]"

# Run static type checks
mypy .
```

### Log Analysis
```bash
# View error logs
kubectl logs document-ai-indexer -n knowledge-agent | grep ERROR

# View processing progress
kubectl logs document-ai-indexer -n knowledge-agent | grep "Processing"
```

## Version Information

- **Current Version**: 0.20.4
- **Python Version**: 3.12+
- **Main Dependencies**:
  - azure-ai-documentintelligence
  - azure-search-documents
  - SQLAlchemy 2.0.41
  - openai 1.55.3

---

*Last updated: August 2025*