# Document Extractor - Deployment Guide

This document provides a complete deployment guide for Document Extractor, covering local development, Docker image builds, and Kubernetes production deployment.

## 📋 Pre-deployment Preparation

### System Requirements

- Python 3.12+
- Docker (optional, for containerized deployment)
- Kubernetes (for production deployment)
- An Azure subscription and the related services below

### Azure Service Preparation

Ensure that you have provisioned the following Azure services:

- Azure Document Intelligence
- Azure AI Search
- Azure Blob Storage
- Azure OpenAI (for vector embeddings)

## 🔧 Configuration File Preparation

### 1. Environment Configuration (env.yaml)

```yaml
# Configuration file reference
config: config.yaml

# Processing settings
njobs: 8  # Number of parallel processing jobs

# Azure AI Search configuration
search_service_name: "https://your-search-service.search.windows.net"
search_admin_key: "your-search-admin-key"

# Azure OpenAI embedding service
embedding_model_endpoint: "https://your-openai.openai.azure.com/openai/deployments/text-embedding-3-small/embeddings?api-version=2024-12-01-preview"
embedding_model_key: "your-openai-key"
VECTOR_DIMENSION: 1536
FLAG_AOAI: "V3"               # Azure OpenAI version
FLAG_EMBEDDING_MODEL: "AOAI"  # Embedding model type: "AOAI" or "qwen3-embedding-8b"

# Document Intelligence configuration
extract_method: "di+vision-llm"  # Extraction method: "di+vision-llm", "vision-llm", "di"
form_rec_resource: "https://your-di-service.cognitiveservices.azure.com/"
form_rec_key: "your-di-key"

# Document Intelligence features
di-hiRes: true     # High-resolution OCR
di-Formulas: true  # Mathematical expression detection
di_allow_features_ext: "pdf;jpeg;jpg;png;bmp;tiff;heif"  # Supported file extensions

# Vision and captioning models
captioning_model_endpoint: "https://your-openai.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-08-01-preview"
captioning_model_key: "your-openai-key"
vision_max_images: 200         # Maximum images to process per document (0 = no limit)
vision_image_method: "openai"  # Image processing method: "openai" or "newapi"
FIGURE_CONTENT_CLEAR: true     # Clear DI-recognized image content

# Blob storage for figures and DI results
FIGURE_BLOB_ACCOUNT_URL: "https://your-storage.blob.core.windows.net/container?sas-token"
DI_BLOB_ACCOUNT_URL: "https://your-storage.blob.core.windows.net/container?sas-token"

# Database configuration
DB_URI: "postgresql://user:password@host:port/database_name"

# Processing flags
header_fix: false  # Enable/disable header fixing
```

### 2. Business Configuration (config.yaml)

```yaml
# Main data configuration (array format)
- data_path: "https://your-blob-storage.blob.core.windows.net/container?sas-token"
  datasource_name: "CATOnline-cn"  # Data source name
  data_dir: ""                     # Optional local data directory
  base_path: "/app/run_tmp"        # Temporary processing directory

  # File processing limits
  process_file_num: 0  # 0 = process all files
  process_file_last_modify: "2025-06-24 00:00:00"  # Only process files modified after this date

  # Chunking configuration
  chunk_size: 2048    # Maximum tokens per chunk
  token_overlap: 128  # Overlap between chunks

  # Index schemas configuration
  index_schemas:
    # Chunk-level index for search
    - index_name: "your-knowledge-chunk-index"
      data_type: ["metadata", "document", "chunk"]
      field_type: "append"   # How to handle existing data
      upload_batch_size: 50  # Documents per batch upload

      # Metadata fields to include
      fields: [
        "filepath", "timestamp", "title", "publisher", "publish_date",
        "document_category", "document_code", "language_code",
        "x_Standard_Regulation_Id", "x_Attachment_Type",
        "x_Standard_Title_CN", "x_Standard_Title_EN",
        "x_Standard_Published_State", "x_Standard_Drafting_Status",
        "x_Standard_Range", "x_Standard_Kind", "x_Standard_No",
        "x_Standard_Code", "x_Standard_Technical_Committee",
        "x_Standard_Vehicle_Type", "x_Standard_Power_Type",
        "x_Standard_CCS", "x_Standard_ICS",
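        # Note (assumption): every field listed here must also be defined in the
        # target index schema, since Azure AI Search rejects uploaded documents
        # that reference fields the index does not define.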
        "x_Standard_Published_Date", "x_Standard_Effective_Date",
        "x_Regulation_Status", "x_Regulation_Title_CN", "x_Regulation_Title_EN",
        "x_Regulation_Document_No", "x_Regulation_Issued_Date",
        "x_Classification", "x_Work_Group", "x_Reference_Standard",
        "x_Replaced_by", "x_Refer_To", "func_uuid", "update_time", "status"
      ]

      # Vector configuration
      vector_fields:
        - field: "contentVector"
          append_fields: ["content"]                       # Fields to vectorize for content
        - field: "full_metadata_vector"
          append_fields: ["full_headers", "doc_metadata"]  # Metadata vectorization

      # Azure AI Search configuration
      semantic_config_name: "default"
      vector_config_name: "vectorSearchProfile"
      update_by_field: "filepath"  # Field to use for updates
      full_metadata_vector_fields: ["full_headers", "doc_metadata"]

    # Document-level index
    - index_name: "your-knowledge-document-index"
      data_type: ["document", "metadata"]
      field_type: "full"        # Replace entire documents
      key_fields: ["filepath"]  # Primary key fields
      upload_batch_size: 1
      fields: [
        # Same field list as the chunk index
        "filepath", "timestamp", "title", "publisher"
        # ... (same as above)
      ]
      merge_content_fields: ["content"]  # Fields to merge from chunks
      vector_fields:
        - field: "full_metadata_vector"
          append_fields: ["doc_metadata"]
      semantic_config_name: "default"
      vector_config_name: "vectorSearchProfile"
      update_by_field: "filepath"

    # Regulation-specific index
    - index_name: "your-regulation-index"
      data_type: ["metadata"]
      field_type: "full"
      key_fields: ["x_Standard_Regulation_Id"]  # Regulation ID as key
      upload_batch_size: 50
      fields: [
        # Regulation-specific fields
        "x_Standard_Regulation_Id", "x_Standard_Title_CN",
        "x_Standard_Title_EN", "x_Regulation_Status"
        # ... (regulation metadata fields)
      ]
      vector_fields:
        - field: "full_metadata_vector"
          append_fields: ["doc_metadata"]
      update_by_field: "x_Standard_Regulation_Id"

  # Field merging configuration
  merge_fields:
    - key: "doc_metadata"  # Combined metadata field
      fields: [
        "title", "publisher", "document_category", "document_code",
        "x_Standard_Title_CN", "x_Standard_Title_EN",
        "x_Standard_Published_State", "x_Standard_Drafting_Status"
        # ... (all metadata fields to combine)
      ]

  # Vector field configuration
  full_metadata_vector_fields: ["full_headers", "doc_metadata"]
```

## 🚀 Deployment Methods

### Method 1: Local Development Deployment

#### 1. Environment Preparation

```bash
# Clone the repository
git clone <repository-url>
cd document-extractor

# Create a virtual environment
python -m venv .venv

# Activate the virtual environment
# Linux/macOS:
source .venv/bin/activate
# Windows:
.venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

#### 2. Configuration File Setup

```bash
# Copy configuration templates
cp config.yaml.example config.yaml
cp env.yaml.example env.yaml

# Edit config.yaml and env.yaml with your actual configuration
```

#### 3. Run the Application

```bash
python main.py --config config.yaml --env env.yaml
```

### Method 2: Kubernetes Production Deployment

#### 1. Build the Image

```bash
docker build . -t document-ai-indexer:latest
docker tag document-ai-indexer:latest acrsales2caiprd.azurecr.cn/document-ai-indexer:latest
docker login acrsales2caiprd.azurecr.cn -u username -p password
docker push acrsales2caiprd.azurecr.cn/document-ai-indexer:latest
```

#### 2. Prepare Configuration Files

```bash
# Create the namespace (if it does not exist)
kubectl create namespace knowledge-agent

# Create the ConfigMap
kubectl create configmap document-ai-indexer-config \
  --from-file=config.yaml \
  --from-file=env.yaml \
  -n knowledge-agent
```

#### 3. One-time Task Deployment

```bash
# Deploy the Pod
kubectl apply -f deploy/document-ai-indexer_k8s.yml -n knowledge-agent

# Check status
kubectl get pods -n knowledge-agent
kubectl logs -f document-ai-indexer -n knowledge-agent
```

#### 4. CronJob Deployment

```bash
# Deploy the CronJob
kubectl apply -f deploy/document-ai-indexer-cronjob.yml -n knowledge-agent

# Check CronJob status
kubectl get cronjobs -n knowledge-agent

# Check job history
kubectl get jobs -n knowledge-agent

# Trigger an execution manually
kubectl create job --from=cronjob/document-ai-indexer-cronjob manual-test -n knowledge-agent
```

## 📊 Deployment Architecture Diagram

```mermaid
graph TB
    subgraph "Azure Cloud Services"
        ABS[Azure Blob Storage]
        ADI[Azure Document Intelligence]
        AAS[Azure AI Search]
        AOI[Azure OpenAI]
    end

    subgraph "Kubernetes Cluster"
        subgraph "Namespace: knowledge-agent"
            CM[ConfigMap<br/>Configuration files]
            CJ[CronJob<br/>Scheduled tasks]
            POD[Pod<br/>Processing container]
        end
    end

    subgraph "Container Registry"
        ACR[Azure Container Registry<br/>acrsales2caiprd.azurecr.cn]
    end

    CM --> POD
    CJ --> POD
    ACR --> POD
    POD --> ABS
    POD --> ADI
    POD --> AAS
    POD --> AOI

    style POD fill:#e1f5fe
    style CM fill:#e8f5e8
    style CJ fill:#fff3e0
```

## 📈 Monitoring and Logging

### Viewing Logs

```bash
# Kubernetes environment
kubectl logs -f document-ai-indexer -n knowledge-agent

# Filter error logs
kubectl logs document-ai-indexer -n knowledge-agent | grep ERROR

# Check processing progress
kubectl logs document-ai-indexer -n knowledge-agent | grep "Processing"
```

## 🔍 Troubleshooting

### Kubernetes Deployment Issues

**Symptoms**: Pod fails to start or keeps restarting

**Solutions**:

```bash
# Check Pod status
kubectl describe pod document-ai-indexer -n knowledge-agent

# Check events
kubectl get events -n knowledge-agent

# Check the ConfigMap
kubectl get configmap document-ai-indexer-config -n knowledge-agent -o yaml
```

### Debugging Commands

```bash
# Check the configuration
kubectl exec -it document-ai-indexer -n knowledge-agent -- cat /app/config.yaml

# Enter the container for debugging
kubectl exec -it document-ai-indexer -n knowledge-agent -- /bin/bash

# Manually run processing
kubectl exec -it document-ai-indexer -n knowledge-agent -- python main.py --config config.yaml --env env.yaml
```

## 🔄 Update Deployment

### Application Update

```bash
# Build a new image
docker build -t document-ai-indexer:v0.21.0 .
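
# Optional: smoke-test the new image locally before pushing
# (assumption: the container ships python and main.py as in the build above,
# and main.py prints usage for --help)
docker run --rm document-ai-indexer:v0.21.0 python main.py --help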
# Push to the repository
docker tag document-ai-indexer:v0.21.0 acrsales2caiprd.azurecr.cn/document-ai-indexer:v0.21.0
docker push acrsales2caiprd.azurecr.cn/document-ai-indexer:v0.21.0

# Update the Kubernetes CronJob to use the new image
kubectl set image cronjob/document-ai-indexer-cronjob \
  document-ai-indexer=acrsales2caiprd.azurecr.cn/document-ai-indexer:v0.21.0 \
  -n knowledge-agent
```

### Configuration Update

```bash
# Update the ConfigMap
kubectl create configmap document-ai-indexer-config \
  --from-file=config.yaml \
  --from-file=env.yaml \
  -n knowledge-agent \
  --dry-run=client -o yaml | kubectl apply -f -

# CronJobs do not support `kubectl rollout restart`; the next scheduled Job
# picks up the updated ConfigMap automatically. To apply it immediately,
# trigger a manual run:
kubectl create job --from=cronjob/document-ai-indexer-cronjob config-refresh -n knowledge-agent
```

---

*Last updated: August 2025*