Document Extractor - Deployment Guide

This document provides a complete deployment guide for Document Extractor, covering local development, Docker image builds, and Kubernetes production deployment.

📋 Pre-deployment preparation

System Requirements

  • Python 3.12+
  • Docker (optional, for containerized deployment)
  • Kubernetes (production environment deployment)
  • Azure subscription and related services

Azure Service Preparation

Ensure that you have configured the following Azure services:

  • Azure Document Intelligence
  • Azure AI Search
  • Azure Blob Storage
  • Azure OpenAI (for vector embeddings)

🔧 Configuration File Preparation

1. Environment Configuration (env.yaml)

# Configuration file reference
config: config.yaml

# Processing settings
njobs: 8  # Number of parallel processing jobs

# Azure AI Search configuration
search_service_name: "https://your-search-service.search.windows.net"
search_admin_key: "your-search-admin-key"

# Azure OpenAI Embedding service
embedding_model_endpoint: "https://your-openai.openai.azure.com/openai/deployments/text-embedding-3-small/embeddings?api-version=2024-12-01-preview"
embedding_model_key: "your-openai-key"
VECTOR_DIMENSION: 1536
FLAG_AOAI: "V3"  # Azure OpenAI version
FLAG_EMBEDDING_MODEL: "AOAI"  # Embedding model type: "AOAI" or "qwen3-embedding-8b"

# Document Intelligence configuration
extract_method: "di+vision-llm"  # Extraction method: "di+vision-llm", "vision-llm", "di"
form_rec_resource: "https://your-di-service.cognitiveservices.azure.com/"
form_rec_key: "your-di-key"

# Document Intelligence features
di-hiRes: true  # High resolution OCR
di-Formulas: true  # Mathematical expression detection
di_allow_features_ext: "pdf;jpeg;jpg;png;bmp;tiff;heif"  # Supported file extensions

# Vision and captioning models
captioning_model_endpoint: "https://your-openai.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-08-01-preview"
captioning_model_key: "your-openai-key"
vision_max_images: 200  # Maximum images to process per document (0 = no limit)
vision_image_method: "openai"  # Image processing method: "openai" or "newapi"
FIGURE_CONTENT_CLEAR: true  # Clear DI recognized image content



# Blob storage for figures and DI results
FIGURE_BLOB_ACCOUNT_URL: "https://your-storage.blob.core.windows.net/container?sas-token"
DI_BLOB_ACCOUNT_URL: "https://your-storage.blob.core.windows.net/container?sas-token"

# Database configuration
DB_URI: "postgresql://user:password@host:port/database_name"

# Processing flags
header_fix: false  # Enable/disable header fixing
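Before the first run, it is worth checking that env.yaml actually carries the keys the pipeline needs. A minimal sketch of such a check (the helper and the key list are illustrative, not part of the project; the inline dict stands in for the result of `yaml.safe_load(open("env.yaml"))`):

```python
# Required keys are taken from the env.yaml sample above.
REQUIRED_ENV_KEYS = [
    "config", "search_service_name", "search_admin_key",
    "embedding_model_endpoint", "embedding_model_key",
    "form_rec_resource", "form_rec_key",
]

def missing_keys(env):
    """Return the required keys that are absent or empty."""
    return [k for k in REQUIRED_ENV_KEYS if not env.get(k)]

env = {
    "config": "config.yaml",
    "search_service_name": "https://your-search-service.search.windows.net",
    "search_admin_key": "your-search-admin-key",
    "embedding_model_endpoint": "https://your-openai.openai.azure.com/",
    "embedding_model_key": "your-openai-key",
    "form_rec_resource": "https://your-di-service.cognitiveservices.azure.com/",
    # form_rec_key intentionally left out to show detection
}
print(missing_keys(env))  # → ['form_rec_key']
```

Running this before deployment catches empty placeholders early, instead of failing mid-pipeline with an authentication error.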

2. Business Configuration (config.yaml)

# Main data configuration (array format)
- data_path: "https://your-blob-storage.blob.core.windows.net/container?sas-token"
  datasource_name: "CATOnline-cn" # data source name
  data_dir: ""  # Optional local data directory
  base_path: "/app/run_tmp"  # Temporary processing directory
  
  # File processing limits
  process_file_num: 0  # 0 = process all files
  process_file_last_modify: "2025-06-24 00:00:00"  # Only process files modified after this date
  
  # Chunking configuration
  chunk_size: 2048  # Maximum tokens per chunk
  token_overlap: 128  # Overlap between chunks
  
  # Index schemas configuration
  index_schemas:
    # Chunk-level index for search
    - index_name: "your-knowledge-chunk-index"
      data_type: ["metadata", "document", "chunk"]
      field_type: "append"  # How to handle existing data
      upload_batch_size: 50  # Documents per batch upload
      
      # Metadata fields to include
      fields: [
        "filepath", "timestamp", "title", "publisher", "publish_date",
        "document_category", "document_code", "language_code",
        "x_Standard_Regulation_Id", "x_Attachment_Type",
        "x_Standard_Title_CN", "x_Standard_Title_EN",
        "x_Standard_Published_State", "x_Standard_Drafting_Status",
        "x_Standard_Range", "x_Standard_Kind", "x_Standard_No",
        "x_Standard_Code", "x_Standard_Technical_Committee",
        "x_Standard_Vehicle_Type", "x_Standard_Power_Type",
        "x_Standard_CCS", "x_Standard_ICS",
        "x_Standard_Published_Date", "x_Standard_Effective_Date",
        "x_Regulation_Status", "x_Regulation_Title_CN",
        "x_Regulation_Title_EN", "x_Regulation_Document_No",
        "x_Regulation_Issued_Date", "x_Classification",
        "x_Work_Group", "x_Reference_Standard",
        "x_Replaced_by", "x_Refer_To", "func_uuid",
        "update_time", "status"
      ]
      
      # Vector configuration
      vector_fields:
        - field: "contentVector"
          append_fields: ["content"]  # Fields to vectorize for content
        - field: "full_metadata_vector"
          append_fields: ["full_headers", "doc_metadata"]  # Metadata vectorization
      
      # Azure AI Search configuration
      semantic_config_name: "default"
      vector_config_name: "vectorSearchProfile"
      update_by_field: "filepath"  # Field to use for updates
      full_metadata_vector_fields: ["full_headers", "doc_metadata"]

    # Document-level index
    - index_name: "your-knowledge-document-index"
      data_type: ["document", "metadata"]
      field_type: "full"  # Replace entire documents
      key_fields: ["filepath"]  # Primary key fields
      upload_batch_size: 1
      
      fields: [
        # Same field list as chunk index
        "filepath", "timestamp", "title", "publisher"
        # ... (same as above)
      ]
      
      merge_content_fields: ["content"]  # Fields to merge from chunks
      vector_fields:
        - field: "full_metadata_vector"
          append_fields: ["doc_metadata"]
      
      semantic_config_name: "default"
      vector_config_name: "vectorSearchProfile"
      update_by_field: "filepath"

    # Regulation-specific index
    - index_name: "your-regulation-index"
      data_type: ["metadata"]
      field_type: "full"
      key_fields: ["x_Standard_Regulation_Id"]  # Regulation ID as key
      upload_batch_size: 50
      
      fields: [
        # Regulation-specific fields
        "x_Standard_Regulation_Id", "x_Standard_Title_CN",
        "x_Standard_Title_EN", "x_Regulation_Status"
        # ... (regulation metadata fields)
      ]
      
      vector_fields:
        - field: "full_metadata_vector"
          append_fields: ["doc_metadata"]
      
      update_by_field: "x_Standard_Regulation_Id"

  # Field merging configuration
  merge_fields:
    - key: "doc_metadata"  # Combined metadata field
      fields: [
        "title", "publisher", "document_category", "document_code",
        "x_Standard_Title_CN", "x_Standard_Title_EN",
        "x_Standard_Published_State", "x_Standard_Drafting_Status"
        # ... (all metadata fields to combine)
      ]
  
  # Vector field configuration
  full_metadata_vector_fields: ["full_headers", "doc_metadata"]
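The `chunk_size` and `token_overlap` settings above describe a sliding window over the document's token stream. A simplified sketch of the arithmetic (plain integers stand in for tokens; the real pipeline presumably uses a model tokenizer, but the window math is the same):

```python
def chunk_tokens(tokens, chunk_size=2048, token_overlap=128):
    """Split a token list into windows of chunk_size that overlap by token_overlap."""
    step = chunk_size - token_overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

tokens = list(range(5000))  # stand-in for a tokenized document
chunks = chunk_tokens(tokens)
print(len(chunks))                          # → 3
print(chunks[1][:128] == chunks[0][-128:])  # → True (the 128-token overlap)
```

The overlap ensures a sentence cut at a chunk boundary is still fully contained in the neighboring chunk, at the cost of some duplicated tokens in the index.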

🚀 Deployment Methods

Method 1: Local Development Deployment

1. Environment Preparation

# Clone the repository
git clone <repository-url>
cd document-extractor

# Create a virtual environment
python -m venv .venv

# Activate the virtual environment
# Linux/Mac:
source .venv/bin/activate
# Windows:
.venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Configuration File Setup

# Copy configuration templates
cp config.yaml.example config.yaml
cp env.yaml.example env.yaml

# Edit config.yaml and env.yaml with your actual configuration values

3. Run the application

# Directly run
python main.py --config config.yaml --env env.yaml
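main.py itself is not reproduced in this guide; assuming it exposes exactly the two flags used above, its argument handling would look roughly like this hypothetical sketch (defaults and help strings are assumptions):

```python
import argparse

parser = argparse.ArgumentParser(description="Document Extractor indexing pipeline")
parser.add_argument("--config", default="config.yaml",
                    help="business configuration (data sources, index schemas)")
parser.add_argument("--env", default="env.yaml",
                    help="environment configuration (service endpoints, keys)")

# Parse the same arguments the command above passes
args = parser.parse_args(["--config", "config.yaml", "--env", "env.yaml"])
print(args.config, args.env)  # → config.yaml env.yaml
```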

Method 2: Kubernetes Production Deployment

1. Build the image

docker build . -t document-ai-indexer:latest
 
docker tag document-ai-indexer:latest acrsales2caiprd.azurecr.cn/document-ai-indexer:latest

docker login acrsales2caiprd.azurecr.cn -u username -p password

docker push acrsales2caiprd.azurecr.cn/document-ai-indexer:latest

2. Prepare Configuration Files

# Create namespace (if not exists)
kubectl create namespace knowledge-agent

# Create ConfigMap
kubectl create configmap document-ai-indexer-config \
  --from-file=config.yaml \
  --from-file=env.yaml \
  -n knowledge-agent

3. One-time Task Deployment

# Deploy Pod
kubectl apply -f deploy/document-ai-indexer_k8s.yml -n knowledge-agent

# Check status
kubectl get pods -n knowledge-agent
kubectl logs -f document-ai-indexer -n knowledge-agent

4. CronJob Deployment

# Deploy CronJob
kubectl apply -f deploy/document-ai-indexer-cronjob.yml -n knowledge-agent

# Check CronJob status
kubectl get cronjobs -n knowledge-agent

# Check job history
kubectl get jobs -n knowledge-agent

# Trigger execution manually
kubectl create job --from=cronjob/document-ai-indexer-cronjob manual-test -n knowledge-agent
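deploy/document-ai-indexer-cronjob.yml is not reproduced in this guide; a rough sketch of what such a manifest typically contains (the schedule, mount paths, and container args here are assumptions, not the actual file):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: document-ai-indexer-cronjob
  namespace: knowledge-agent
spec:
  schedule: "0 2 * * *"       # assumption: nightly at 02:00
  concurrencyPolicy: Forbid   # avoid overlapping indexing runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: document-ai-indexer
              image: acrsales2caiprd.azurecr.cn/document-ai-indexer:latest
              args: ["--config", "/app/config.yaml", "--env", "/app/env.yaml"]
              volumeMounts:
                - name: config
                  mountPath: /app/config.yaml
                  subPath: config.yaml
                - name: config
                  mountPath: /app/env.yaml
                  subPath: env.yaml
          volumes:
            - name: config
              configMap:
                name: document-ai-indexer-config
```

`concurrencyPolicy: Forbid` is a sensible default for an indexer, since two overlapping runs could upload conflicting batches to the same index.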

📊 Deployment architecture diagram

graph TB
    subgraph "Azure Cloud Services"
        ABS[Azure Blob Storage]
        ADI[Azure Document Intelligence]
        AAS[Azure AI Search]
        AOI[Azure OpenAI]
    end
    
    subgraph "Kubernetes Cluster"
        subgraph "Namespace: knowledge-agent"
            CM[ConfigMap<br/>Configuration File]
            CJ[CronJob<br/>Timing tasks]
            POD[Pod<br/>Processing container]
        end
    end
    
    subgraph "Container Registry"
        ACR[Azure Container Registry<br/>acrsales2caiprd.azurecr.cn]
    end
    
    CM --> POD
    CJ --> POD
    ACR --> POD
    
    POD --> ABS
    POD --> ADI
    POD --> AAS
    POD --> AOI
    
    style POD fill:#e1f5fe
    style CM fill:#e8f5e8
    style CJ fill:#fff3e0

📈 Monitoring and logging

View logs

# Kubernetes environment
kubectl logs -f document-ai-indexer -n knowledge-agent

# Filter error logs
kubectl logs document-ai-indexer -n knowledge-agent | grep ERROR

# Check the processing progress
kubectl logs document-ai-indexer -n knowledge-agent | grep "Processing"

Kubernetes Deployment Issues

Symptoms: the Pod fails to start or keeps restarting.

Solutions:

# Check Pod Status
kubectl describe pod document-ai-indexer -n knowledge-agent

# Check Events
kubectl get events -n knowledge-agent

# Check ConfigMap
kubectl get configmap document-ai-indexer-config -n knowledge-agent -o yaml

Debugging Commands

# Check Configuration
kubectl exec -it document-ai-indexer -n knowledge-agent -- cat /app/config.yaml

# Enter Container for Debugging
kubectl exec -it document-ai-indexer -n knowledge-agent -- /bin/bash

# Manually run processing
kubectl exec -it document-ai-indexer -n knowledge-agent -- python main.py --config config.yaml --env env.yaml

🔄 Update deployment

Application update

# Build new image
docker build -t document-ai-indexer:v0.21.0 .

# Push to repository
docker tag document-ai-indexer:v0.21.0 acrsales2caiprd.azurecr.cn/document-ai-indexer:v0.21.0
docker push acrsales2caiprd.azurecr.cn/document-ai-indexer:v0.21.0

# Update Kubernetes deployment
kubectl set image cronjob/document-ai-indexer-cronjob \
  document-ai-indexer=acrsales2caiprd.azurecr.cn/document-ai-indexer:v0.21.0 \
  -n knowledge-agent

Configuration update

# Update ConfigMap
kubectl create configmap document-ai-indexer-config \
  --from-file=config.yaml \
  --from-file=env.yaml \
  -n knowledge-agent \
  --dry-run=client -o yaml | kubectl apply -f -

# CronJobs mount the ConfigMap fresh on each run, so the change takes effect on the
# next scheduled run (kubectl rollout restart does not support CronJobs); to apply
# it immediately, trigger a manual job
kubectl create job --from=cronjob/document-ai-indexer-cronjob config-refresh -n knowledge-agent

Last updated: August 2025