# Document Extractor - Deployment Guide

This document is a complete deployment guide for Document Extractor, covering local development, Docker containerized deployment, and deployment to a production Kubernetes environment.

## 📋 Pre-deployment Preparation

### System Requirements
- Python 3.12+
- Docker (optional, for containerized deployment)
- Kubernetes (for production deployment)
- An Azure subscription and the related services

### Azure Service Preparation
Ensure that the following Azure services are configured:
- Azure Document Intelligence
- Azure AI Search
- Azure Blob Storage
- Azure OpenAI (for vector embeddings)

## 🔧 Configuration File Preparation

### 1. Environment Configuration (env.yaml)
```yaml
# Configuration file reference
config: config.yaml

# Processing settings
njobs: 8  # Number of parallel processing jobs

# Azure AI Search configuration
search_service_name: "https://your-search-service.search.windows.net"
search_admin_key: "your-search-admin-key"

# Azure OpenAI embedding service
embedding_model_endpoint: "https://your-openai.openai.azure.com/openai/deployments/text-embedding-3-small/embeddings?api-version=2024-12-01-preview"
embedding_model_key: "your-openai-key"
VECTOR_DIMENSION: 1536
FLAG_AOAI: "V3"               # Azure OpenAI version
FLAG_EMBEDDING_MODEL: "AOAI"  # Embedding model type: "AOAI" or "qwen3-embedding-8b"

# Document Intelligence configuration
extract_method: "di+vision-llm"  # Extraction method: "di+vision-llm", "vision-llm", or "di"
form_rec_resource: "https://your-di-service.cognitiveservices.azure.com/"
form_rec_key: "your-di-key"

# Document Intelligence features
di-hiRes: true     # High-resolution OCR
di-Formulas: true  # Mathematical expression detection
di_allow_features_ext: "pdf;jpeg;jpg;png;bmp;tiff;heif"  # Supported file extensions

# Vision and captioning models
captioning_model_endpoint: "https://your-openai.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-08-01-preview"
captioning_model_key: "your-openai-key"
vision_max_images: 200         # Maximum images to process per document (0 = no limit)
vision_image_method: "openai"  # Image processing method: "openai" or "newapi"
FIGURE_CONTENT_CLEAR: true     # Clear image content recognized by DI

# Blob storage for figures and DI results
FIGURE_BLOB_ACCOUNT_URL: "https://your-storage.blob.core.windows.net/container?sas-token"
DI_BLOB_ACCOUNT_URL: "https://your-storage.blob.core.windows.net/container?sas-token"

# Database configuration
DB_URI: "postgresql://user:password@host:port/database_name"

# Processing flags
header_fix: false  # Enable/disable header fixing
```

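A missing or placeholder value in env.yaml typically only surfaces at runtime. A small pre-flight check along these lines can catch gaps before launch; this is a hypothetical helper, not part of the project, and the key list simply mirrors the sample above.

```python
# Hypothetical pre-flight check for env.yaml; not part of the project.
# Key names mirror the sample configuration above.
REQUIRED_ENV_KEYS = [
    "config", "search_service_name", "search_admin_key",
    "embedding_model_endpoint", "embedding_model_key",
    "form_rec_resource", "form_rec_key", "DB_URI",
]

def missing_env_keys(env: dict) -> list[str]:
    """Return required keys that are absent, empty, or left at a 'your-…' placeholder."""
    missing = []
    for key in REQUIRED_ENV_KEYS:
        value = env.get(key)
        if value in (None, "") or (isinstance(value, str) and value.startswith("your-")):
            missing.append(key)
    return missing
```

Run it against the parsed YAML (e.g. the result of `yaml.safe_load`) before starting the indexer.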
### 2. Business Configuration (config.yaml)

```yaml
# Main data configuration (array format)
- data_path: "https://your-blob-storage.blob.core.windows.net/container?sas-token"
  datasource_name: "CATOnline-cn"  # Data source name
  data_dir: ""                     # Optional local data directory
  base_path: "/app/run_tmp"        # Temporary processing directory

  # File processing limits
  process_file_num: 0  # 0 = process all files
  process_file_last_modify: "2025-06-24 00:00:00"  # Only process files modified after this date

  # Chunking configuration
  chunk_size: 2048    # Maximum tokens per chunk
  token_overlap: 128  # Overlap between chunks

  # Index schemas configuration
  index_schemas:
    # Chunk-level index for search
    - index_name: "your-knowledge-chunk-index"
      data_type: ["metadata", "document", "chunk"]
      field_type: "append"   # How to handle existing data
      upload_batch_size: 50  # Documents per batch upload

      # Metadata fields to include
      fields: [
        "filepath", "timestamp", "title", "publisher", "publish_date",
        "document_category", "document_code", "language_code",
        "x_Standard_Regulation_Id", "x_Attachment_Type",
        "x_Standard_Title_CN", "x_Standard_Title_EN",
        "x_Standard_Published_State", "x_Standard_Drafting_Status",
        "x_Standard_Range", "x_Standard_Kind", "x_Standard_No",
        "x_Standard_Code", "x_Standard_Technical_Committee",
        "x_Standard_Vehicle_Type", "x_Standard_Power_Type",
        "x_Standard_CCS", "x_Standard_ICS",
        "x_Standard_Published_Date", "x_Standard_Effective_Date",
        "x_Regulation_Status", "x_Regulation_Title_CN",
        "x_Regulation_Title_EN", "x_Regulation_Document_No",
        "x_Regulation_Issued_Date", "x_Classification",
        "x_Work_Group", "x_Reference_Standard",
        "x_Replaced_by", "x_Refer_To", "func_uuid",
        "update_time", "status"
      ]

      # Vector configuration
      vector_fields:
        - field: "contentVector"
          append_fields: ["content"]  # Fields to vectorize for content
        - field: "full_metadata_vector"
          append_fields: ["full_headers", "doc_metadata"]  # Metadata vectorization

      # Azure AI Search configuration
      semantic_config_name: "default"
      vector_config_name: "vectorSearchProfile"
      update_by_field: "filepath"  # Field to use for updates
      full_metadata_vector_fields: ["full_headers", "doc_metadata"]

    # Document-level index
    - index_name: "your-knowledge-document-index"
      data_type: ["document", "metadata"]
      field_type: "full"        # Replace entire documents
      key_fields: ["filepath"]  # Primary key fields
      upload_batch_size: 1

      fields: [
        # Same field list as the chunk index
        "filepath", "timestamp", "title", "publisher"
        # ... (same as above)
      ]

      merge_content_fields: ["content"]  # Fields to merge from chunks
      vector_fields:
        - field: "full_metadata_vector"
          append_fields: ["doc_metadata"]

      semantic_config_name: "default"
      vector_config_name: "vectorSearchProfile"
      update_by_field: "filepath"

    # Regulation-specific index
    - index_name: "your-regulation-index"
      data_type: ["metadata"]
      field_type: "full"
      key_fields: ["x_Standard_Regulation_Id"]  # Regulation ID as key
      upload_batch_size: 50

      fields: [
        # Regulation-specific fields
        "x_Standard_Regulation_Id", "x_Standard_Title_CN",
        "x_Standard_Title_EN", "x_Regulation_Status"
        # ... (regulation metadata fields)
      ]

      vector_fields:
        - field: "full_metadata_vector"
          append_fields: ["doc_metadata"]

      update_by_field: "x_Standard_Regulation_Id"

  # Field merging configuration
  merge_fields:
    - key: "doc_metadata"  # Combined metadata field
      fields: [
        "title", "publisher", "document_category", "document_code",
        "x_Standard_Title_CN", "x_Standard_Title_EN",
        "x_Standard_Published_State", "x_Standard_Drafting_Status"
        # ... (all metadata fields to combine)
      ]

  # Vector field configuration
  full_metadata_vector_fields: ["full_headers", "doc_metadata"]
```

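To see how `chunk_size` and `token_overlap` interact, here is a minimal sketch (not the project's actual chunker) of how overlapping token windows are derived: each chunk starts `chunk_size - token_overlap` tokens after the previous one, so with the values above consecutive chunks share 128 tokens.

```python
def chunk_spans(n_tokens: int, chunk_size: int = 2048, token_overlap: int = 128):
    """Return (start, end) token offsets for overlapping chunks.

    Illustrative only: shows the window arithmetic implied by the
    chunk_size / token_overlap settings, not the project's tokenizer.
    """
    step = chunk_size - token_overlap  # stride between chunk starts
    spans = []
    start = 0
    while start < n_tokens:
        spans.append((start, min(start + chunk_size, n_tokens)))
        if start + chunk_size >= n_tokens:
            break  # last window already reaches the end
        start += step
    return spans
```

For a 5,000-token document this yields three chunks, the second starting at token 1920 so that it overlaps the first by 128 tokens.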
## 🚀 Deployment Methods

### Method 1: Local Development Deployment

#### 1. Environment Preparation
```bash
# Clone the repository
git clone <repository-url>
cd document-extractor

# Create a virtual environment
python -m venv .venv

# Activate the virtual environment
# Linux/macOS:
source .venv/bin/activate
# Windows:
.venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

#### 2. Configuration File Setup
```bash
# Copy the configuration templates
cp config.yaml.example config.yaml
cp env.yaml.example env.yaml

# Edit config.yaml and env.yaml with your actual values
```

#### 3. Run the Application
```bash
# Run directly
python main.py --config config.yaml --env env.yaml
```

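For orientation, the command above implies an entry point that accepts `--config` and `--env` flags. A hypothetical sketch of that CLI surface (the real `main.py` may parse its arguments differently):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch of the CLI that main.py appears to expose;
    # the actual entry point may differ.
    parser = argparse.ArgumentParser(description="Document Extractor indexer")
    parser.add_argument("--config", required=True,
                        help="Business configuration file (config.yaml)")
    parser.add_argument("--env", required=True,
                        help="Environment configuration file (env.yaml)")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"config={args.config} env={args.env}")
```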
### Method 2: Kubernetes Production Deployment

#### 1. Build and Push the Image
```bash
docker build . -t document-ai-indexer:latest

docker tag document-ai-indexer:latest acrsales2caiprd.azurecr.cn/document-ai-indexer:latest

# Log in (prefer --password-stdin over -p to keep the password out of shell history)
echo "$ACR_PASSWORD" | docker login acrsales2caiprd.azurecr.cn -u username --password-stdin

docker push acrsales2caiprd.azurecr.cn/document-ai-indexer:latest
```

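The repository's Dockerfile is not reproduced in this guide. As a rough sketch, an image for a pip-based Python 3.12 service of this shape typically looks like the following; the base image, file layout, and entrypoint here are assumptions, not the project's actual Dockerfile:

```dockerfile
# Hypothetical sketch; see the repository's actual Dockerfile.
FROM python:3.12-slim

WORKDIR /app

# Install dependencies first so the layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

ENTRYPOINT ["python", "main.py", "--config", "config.yaml", "--env", "env.yaml"]
```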
#### 2. Prepare Configuration Files
```bash
# Create the namespace (if it does not exist)
kubectl create namespace knowledge-agent

# Create the ConfigMap
kubectl create configmap document-ai-indexer-config \
  --from-file=config.yaml \
  --from-file=env.yaml \
  -n knowledge-agent
```

#### 3. One-time Task Deployment
```bash
# Deploy the Pod
kubectl apply -f deploy/document-ai-indexer_k8s.yml -n knowledge-agent

# Check status
kubectl get pods -n knowledge-agent
kubectl logs -f document-ai-indexer -n knowledge-agent
```

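`deploy/document-ai-indexer_k8s.yml` is referenced above but not shown. A plausible minimal shape, assuming the ConfigMap created in step 2 is mounted over the two configuration files (mount paths and container details are assumptions, not the actual manifest):

```yaml
# Hypothetical sketch of deploy/document-ai-indexer_k8s.yml
apiVersion: v1
kind: Pod
metadata:
  name: document-ai-indexer
  namespace: knowledge-agent
spec:
  restartPolicy: Never          # one-shot processing task
  containers:
    - name: document-ai-indexer
      image: acrsales2caiprd.azurecr.cn/document-ai-indexer:latest
      volumeMounts:
        - name: config
          mountPath: /app/config.yaml
          subPath: config.yaml
        - name: config
          mountPath: /app/env.yaml
          subPath: env.yaml
  volumes:
    - name: config
      configMap:
        name: document-ai-indexer-config
```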
#### 4. CronJob Deployment
```bash
# Deploy the CronJob
kubectl apply -f deploy/document-ai-indexer-cronjob.yml -n knowledge-agent

# Check CronJob status
kubectl get cronjobs -n knowledge-agent

# Check job history
kubectl get jobs -n knowledge-agent

# Trigger a run manually
kubectl create job --from=cronjob/document-ai-indexer-cronjob manual-test -n knowledge-agent
```

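Likewise, `deploy/document-ai-indexer-cronjob.yml` is not shown. A minimal sketch of its likely structure; the schedule, history limits, and concurrency policy below are assumptions, not the actual manifest:

```yaml
# Hypothetical sketch of deploy/document-ai-indexer-cronjob.yml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: document-ai-indexer-cronjob
  namespace: knowledge-agent
spec:
  schedule: "0 2 * * *"        # assumed: daily at 02:00
  concurrencyPolicy: Forbid    # don't start a run while one is still active
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: document-ai-indexer
              image: acrsales2caiprd.azurecr.cn/document-ai-indexer:latest
```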
## 📊 Deployment Architecture Diagram

```mermaid
graph TB
    subgraph "Azure Cloud Services"
        ABS[Azure Blob Storage]
        ADI[Azure Document Intelligence]
        AAS[Azure AI Search]
        AOI[Azure OpenAI]
    end

    subgraph "Kubernetes Cluster"
        subgraph "Namespace: knowledge-agent"
            CM[ConfigMap<br/>Configuration files]
            CJ[CronJob<br/>Scheduled runs]
            POD[Pod<br/>Processing container]
        end
    end

    subgraph "Container Registry"
        ACR[Azure Container Registry<br/>acrsales2caiprd.azurecr.cn]
    end

    CM --> POD
    CJ --> POD
    ACR --> POD

    POD --> ABS
    POD --> ADI
    POD --> AAS
    POD --> AOI

    style POD fill:#e1f5fe
    style CM fill:#e8f5e8
    style CJ fill:#fff3e0
```

## 📈 Monitoring and Logging

### Viewing Logs
```bash
# Kubernetes environment
kubectl logs -f document-ai-indexer -n knowledge-agent

# Filter error logs
kubectl logs document-ai-indexer -n knowledge-agent | grep ERROR

# Check processing progress
kubectl logs document-ai-indexer -n knowledge-agent | grep "Processing"
```

## 🔍 Troubleshooting

### Kubernetes Deployment Issues
**Symptoms**: the Pod fails to start or keeps restarting

**Solutions**:
```bash
# Check Pod status
kubectl describe pod document-ai-indexer -n knowledge-agent

# Check events
kubectl get events -n knowledge-agent

# Check the ConfigMap
kubectl get configmap document-ai-indexer-config -n knowledge-agent -o yaml
```

### Debugging Commands
```bash
# Inspect the configuration inside the container
kubectl exec -it document-ai-indexer -n knowledge-agent -- cat /app/config.yaml

# Open a shell in the container for debugging
kubectl exec -it document-ai-indexer -n knowledge-agent -- /bin/bash

# Run processing manually
kubectl exec -it document-ai-indexer -n knowledge-agent -- python main.py --config config.yaml --env env.yaml
```

## 🔄 Updating the Deployment

### Application Update
```bash
# Build a new image
docker build -t document-ai-indexer:v0.21.0 .

# Tag and push to the registry
docker tag document-ai-indexer:v0.21.0 acrsales2caiprd.azurecr.cn/document-ai-indexer:v0.21.0
docker push acrsales2caiprd.azurecr.cn/document-ai-indexer:v0.21.0

# Update the Kubernetes CronJob image
kubectl set image cronjob/document-ai-indexer-cronjob \
  document-ai-indexer=acrsales2caiprd.azurecr.cn/document-ai-indexer:v0.21.0 \
  -n knowledge-agent
```

### Configuration Update
```bash
# Update the ConfigMap
kubectl create configmap document-ai-indexer-config \
  --from-file=config.yaml \
  --from-file=env.yaml \
  -n knowledge-agent \
  --dry-run=client -o yaml | kubectl apply -f -

# A CronJob has no rollout to restart; the next scheduled run picks up the
# updated ConfigMap automatically. To apply it immediately, trigger a manual job:
kubectl create job --from=cronjob/document-ai-indexer-cronjob config-refresh -n knowledge-agent
```

---

*Last updated: August 2025*