init

2025-09-26 17:15:54 +08:00
commit db0e5965ec
211 changed files with 40437 additions and 0 deletions
--- a/vw-document-ai-indexer/README.md
+++ b/vw-document-ai-indexer/README.md
@@ -0,0 +1,260 @@
+# Document AI Indexer
+
+An intelligent document processing and indexing system based on Azure AI services, supporting content extraction, processing, and vectorized indexing for multiple document formats.
+
+## Features
+
+### 🚀 Core Features
+- **Multi-format Document Support**: PDF, DOCX, image formats, etc.
+- **Intelligent Content Extraction**: OCR and structured extraction using Azure Document Intelligence
+- **Document Chunking**: Smart document chunking and vectorization
+- **Azure AI Search Integration**: Automatically create search indexes and upload documents
+- **Metadata Management**: Complete document metadata extraction and management
+- **Hierarchy Structure Repair**: Automatically fix title hierarchy structure in Markdown documents
+
+### 🔧 Technical Features
+- **Asynchronous Processing**: High-performance async processing based on asyncio
+- **Containerized Deployment**: Complete Docker and Kubernetes support
+- **Configuration Management**: Flexible YAML configuration file management
+- **Database Support**: SQLAlchemy ORM supporting multiple databases
+- **Resilient Processing**: Built-in retry mechanisms and error handling
+- **Monitoring & Logging**: Complete logging and progress monitoring
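The README names built-in retry mechanisms and asyncio-based processing but does not say how retries are scheduled. The sketch below shows one common pattern, exponential backoff with a little jitter, as an illustration only. `retry_async` and its parameters are hypothetical names, not taken from the project.

```python
import asyncio
import random


async def retry_async(operation, *, attempts=3, base_delay=0.5, retry_on=(Exception,)):
    """Await operation() up to `attempts` times, backing off between failures.

    `operation` is a zero-argument callable returning an awaitable
    (hypothetical helper, not part of the project).
    """
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            # The delay doubles on every failed attempt; the jitter spreads out
            # retries that would otherwise fire at the same moment.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
```

In practice `retry_on` would be narrowed to the transient errors of the service being called, so that configuration mistakes fail immediately instead of being retried.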
+
+## System Architecture
+
+```mermaid
+graph LR
+    subgraph "Data Sources"
+        DS[Document Sources<br/>Blob Storage/Local]
+        MD[Metadata<br/>Extraction]
+    end
+    
+    subgraph "Azure AI Services"
+        ADI[Azure Document<br/>Intelligence]
+        AAS[Azure AI Search<br/>Index]
+        EMB[Vector<br/>Embedding]
+    end
+    
+    subgraph "Processing Pipeline"
+        HF[Hierarchy<br/>Fix]
+        CH[Content<br/>Chunking]
+    end
+    
+    DS --> ADI
+    MD --> HF
+    ADI --> HF
+    HF --> CH
+    CH --> EMB
+    EMB --> AAS
+    
+    style DS fill:#e1f5fe
+    style ADI fill:#e8f5e8
+    style AAS fill:#fff3e0
+    style EMB fill:#f3e5f5
+    style HF fill:#ffebee
+    style CH fill:#f1f8e9
+```
+
+### Document Processing Flow
+
+```mermaid
+flowchart TD
+    START([Document Input]) --> DOWNLOAD[Download Document]
+    DOWNLOAD --> EXTRACT[AI Content Extraction]
+    EXTRACT --> FIX[Hierarchy Structure Fix]
+    FIX --> CHUNK[Content Chunking]
+    CHUNK --> EMBED[Vector Embedding]
+    EMBED --> INDEX[Search Index Upload]
+    INDEX --> END([Processing Complete])
+    
+    style START fill:#c8e6c9
+    style END fill:#c8e6c9
+    style EXTRACT fill:#e1f5fe
+    style FIX fill:#fff3e0
+    style CHUNK fill:#f3e5f5
+```
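The flow above can be read as a chain of async stages applied to each document. The sketch below mirrors the stage order from the diagram. Every function body is a placeholder and all names (`DocumentState`, `process_all`, `max_concurrency`) are assumptions made for illustration; the project's real modules are not shown in this README.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class DocumentState:
    """Carries one document through the stages (hypothetical structure)."""
    source: str
    markdown: str = ""
    chunks: list[str] = field(default_factory=list)
    vectors: list[list[float]] = field(default_factory=list)
    indexed: bool = False


async def download(doc):
    await asyncio.sleep(0)  # stands in for Blob Storage or local file I/O
    return doc

async def extract(doc):
    doc.markdown = f"# {doc.source}\n\nplaceholder text"  # stands in for the extraction call
    return doc

async def fix_hierarchy(doc):
    return doc  # placeholder: heading repair would happen here

async def chunk(doc):
    doc.chunks = [part for part in doc.markdown.split("\n\n") if part]
    return doc

async def embed(doc):
    doc.vectors = [[float(len(c))] for c in doc.chunks]  # fake one-dimensional vectors
    return doc

async def index(doc):
    doc.indexed = True  # stands in for the search index upload
    return doc


STAGES = (download, extract, fix_hierarchy, chunk, embed, index)


async def process_document(source):
    """Run one document through every stage in diagram order."""
    doc = DocumentState(source)
    for stage in STAGES:
        doc = await stage(doc)
    return doc


async def process_all(sources, max_concurrency=4):
    """Process documents concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(source):
        async with semaphore:
            return await process_document(source)

    return await asyncio.gather(*(bounded(s) for s in sources))
```

Calling `asyncio.run(process_all(["a.pdf", "b.docx"]))` returns one `DocumentState` per input, in input order. The semaphore is one simple way to keep concurrent calls to rate-limited services bounded.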
+
+## Quick Start
+
+### Requirements
+
+- Python 3.12+
+- Azure subscription and related services
+
+For a detailed deployment guide, see [Deployment.md](Deployment.md).
+
+### Install Dependencies
+
+```bash
+pip install -r requirements.txt
+```
+
+### Configuration Files
+
+The system uses two main configuration files:
+
+- `config.yaml` - Business configuration (data source, index configuration, etc.)
+- `env.yaml` - Environment variable configuration (Azure service keys, etc.)
+
+**Quick Start Configuration:**
+
+```yaml
+# env.yaml - Essential Azure services
+search_service_name: "https://your-search-service.search.windows.net"
+search_admin_key: "your-search-admin-key"
+form_rec_resource: "https://your-di-service.cognitiveservices.azure.com/"
+form_rec_key: "your-di-key"
+embedding_model_endpoint: "https://your-openai.openai.azure.com/..."
+embedding_model_key: "your-openai-key"
+
+# config.yaml - Basic data source
+data_configs:
+  - data_path: "https://your-blob-storage.blob.core.windows.net/container?sas-token"
+    index_schemas:
+      - index_name: "your-knowledge-index"
+        data_type: ["metadata", "document", "chunk"]
+```
+
+📖 **Detailed configuration instructions**: See the complete configuration parameters and examples in [Deployment.md - Configuration file preparation](Deployment.md#Configuration-file-preparation).
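A quick check of the two files before starting a run can catch missing keys early. The sketch below assumes both files have already been parsed into plain dictionaries by a YAML parser, and it assumes that every key in the quick start example is required. The project itself may treat some of them as optional. `validate_settings` is a hypothetical helper.

```python
# Keys taken from the quick start example; treating all of them as required is an assumption.
REQUIRED_ENV_KEYS = (
    "search_service_name",
    "search_admin_key",
    "form_rec_resource",
    "form_rec_key",
    "embedding_model_endpoint",
    "embedding_model_key",
)


def validate_settings(env, config):
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = [
        f"env.yaml: missing or empty '{key}'"
        for key in REQUIRED_ENV_KEYS
        if not env.get(key)
    ]

    data_configs = config.get("data_configs") or []
    if not data_configs:
        problems.append("config.yaml: 'data_configs' is missing or empty")

    for i, entry in enumerate(data_configs):
        if not entry.get("data_path"):
            problems.append(f"config.yaml: data_configs[{i}] has no 'data_path'")
        for j, schema in enumerate(entry.get("index_schemas") or []):
            if not schema.get("index_name"):
                problems.append(
                    f"config.yaml: data_configs[{i}].index_schemas[{j}] has no 'index_name'"
                )
    return problems
```

Returning a list instead of raising on the first problem lets all issues be reported in a single pass.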
+
+### Run Application
+
+```bash
+# Direct execution
+python main.py
+
+# Or use predefined tasks
+# (In VS Code, use Ctrl+Shift+P -> Run Task)
+```
+
+## 📚 Document Navigation
+
+- **[Deployment Guide (Deployment.md)](Deployment.md)** - Complete deployment guide, including Docker and Kubernetes deployments
+- **[Configuration instructions](Deployment.md#Configuration-file-preparation)** - Detailed configuration file description
+
+## Project Structure
+
+```
+document-extractor/
+├── main.py                     # Application entry point
+├── app_config.py              # Configuration management
+├── business_layer.py          # Business logic layer
+├── document_task_processor.py # Document task processor
+├── di_extractor.py           # Document Intelligence extractor
+├── azure_index_service.py    # Azure Search service
+├── blob_service.py           # Blob storage service
+├── chunk_service.py          # Document chunking service
+├── hierarchy_fix.py          # Hierarchy structure repair
+├── database.py               # Database models
+├── entity_models.py          # Entity models
+├── utils.py                  # Utility functions
+├── config.yaml               # Business configuration
+├── env.yaml                  # Environment configuration
+├── requirements.txt          # Dependencies
+├── Dockerfile               # Docker build file
+├── pyproject.toml           # Project configuration
+├── build-script/            # Build scripts
+│   └── document-ai-indexer.sh
+├── deploy/                  # Deployment files
+│   ├── document-ai-indexer.sh
+│   ├── document-ai-indexer_k8s.yml
+│   ├── document-ai-indexer_cronjob.yml
+│   └── embedding-api-proxy_k8s.yml
+└── doc/                     # Documentation
+```
+
+## Core Components
+
+### 1. Document Processing Pipeline
+
+- **Document Loading**: Support loading from Azure Blob Storage or local file system
+- **Content Extraction**: OCR and structured extraction using Azure Document Intelligence
+- **Content Chunking**: Smart chunking algorithms maintaining semantic integrity
+- **Vectorization**: Generate vector representations of document content
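The README does not describe the chunking algorithm. One simple way to keep chunks semantically coherent is to cut only at Markdown headings and pack whole sections up to a size limit. The sketch below shows that idea under those assumptions. `split_sections`, `chunk_markdown`, and `max_chars` are hypothetical, and the heading test is naive (a `#` line inside a fenced code block would be treated as a heading).

```python
import re

HEADING = re.compile(r"^#{1,6}\s")


def split_sections(markdown):
    """Split Markdown into sections, each starting at a heading line."""
    sections, current = [], []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def chunk_markdown(markdown, max_chars=1000):
    """Pack whole sections into chunks of at most max_chars characters.

    A section that is longer than max_chars on its own is kept intact
    as a single oversized chunk rather than being cut mid-section.
    """
    chunks, buffer = [], ""
    for section in split_sections(markdown):
        candidate = f"{buffer}\n\n{section}" if buffer else section
        if len(candidate) <= max_chars:
            buffer = candidate
        else:
            if buffer:
                chunks.append(buffer)
            buffer = section
    if buffer:
        chunks.append(buffer)
    return chunks
```

A character limit is used here for simplicity. A token-based limit that matches the embedding model would be a natural replacement.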
+
+### 2. Index Management
+
+- **Dynamic Index Creation**: Automatically create Azure AI Search indexes based on configuration
+- **Batch Upload**: Efficient batch document upload
+- **Metadata Management**: Complete document metadata indexing
+- **Incremental Updates**: Support incremental document updates
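Batch upload can be sketched as splitting the document list into fixed-size groups and sending each group in one call. In the sketch below `upload` is a hypothetical callable supplied by the caller. With the `azure-search-documents` SDK it might wrap `SearchClient.upload_documents`, but how the project actually uploads is not shown in this README, and the default batch size is an arbitrary choice.

```python
from itertools import islice


def iter_batches(items, size):
    """Yield successive lists containing at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def upload_in_batches(documents, upload, batch_size=100):
    """Call upload(batch) once per batch and return how many documents were sent."""
    sent = 0
    for batch in iter_batches(documents, batch_size):
        upload(batch)
        sent += len(batch)
    return sent
```

Passing `received.append` as `upload` is an easy way to inspect the batches in a test without touching a real search service.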
+
+### 3. Data Processing
+
+- **Hierarchy Structure Repair**: Automatically fix title hierarchy in Markdown documents
+- **Metadata Extraction**: Extract structured metadata from documents and filenames
+- **Format Conversion**: Unified processing support for multiple document formats
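The README does not specify the repair rules for title hierarchy. One plausible rule is that a heading may never sit more than one level below its parent. The sketch below applies that rule by tracking the currently open headings on a stack. It is an illustration under that assumption, not the project's `hierarchy_fix.py`, and `fix_heading_levels` is a hypothetical name.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def fix_heading_levels(markdown):
    """Renumber Markdown headings so nesting depth never skips a level."""
    fixed = []
    open_levels = []  # original levels of the headings that are currently open
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if not match:
            fixed.append(line)
            continue
        level = len(match.group(1))
        # Close every open heading at the same or a deeper original level.
        while open_levels and open_levels[-1] >= level:
            open_levels.pop()
        open_levels.append(level)
        # The new level is the nesting depth, so it grows by at most one per step.
        fixed.append("#" * len(open_levels) + " " + match.group(2))
    return "\n".join(fixed)
```

With this rule a document whose first heading is `##` has that heading promoted to `#`, and a heading that jumps from `#` straight to `###` is emitted as `##`. Documents whose levels are already consistent pass through unchanged.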
+
+
+## API and Integration
+
+### Azure Service Integration
+- **Azure Document Intelligence**: Document analysis and OCR
+- **Azure AI Search**: Search indexing and querying
+- **Azure Blob Storage**: Document storage
+- **Azure OpenAI**: Vector embedding generation
+
+### Database Support
+- PostgreSQL (recommended)
+- SQLite (development and testing)
+- Other SQLAlchemy-supported databases
+
+## Monitoring and Logging
+
+The system provides comprehensive logging capabilities:
+- Processing progress monitoring
+- Error logging
+- Performance statistics
+- Task status tracking
+
+View logs:
+```bash
+# Kubernetes environment
+kubectl logs -f document-ai-indexer -n knowledge-agent
+
+# Docker environment
+docker logs -f <container-id>
+```
+
+
+## Development
+
+### Development Mode
+
+```bash
+# Activate virtual environment
+source .venv/bin/activate  # Linux/Mac
+# or
+.venv\Scripts\activate     # Windows
+
+# Install development dependencies (quoted so shells such as zsh do not expand the brackets)
+pip install -e ".[dev,test]"
+
+# Run static type checks
+mypy .
+```
+
+
+### Log Analysis
+```bash
+# View error logs
+kubectl logs document-ai-indexer -n knowledge-agent | grep ERROR
+
+# View processing progress
+kubectl logs document-ai-indexer -n knowledge-agent | grep "Processing"
+```
+
+## Version Information
+
+- **Current Version**: 0.20.4
+- **Python Version**: 3.12+
+- **Main Dependencies**: 
+  - azure-ai-documentintelligence
+  - azure-search-documents
+  - SQLAlchemy 2.0.41
+  - openai 1.55.3
+
+
+---
+
+*Last updated: August 2025*