# Document Extractor - Deployment Guide

This document is a complete deployment guide for Document Extractor, covering local development, Docker containerized deployment, and deployment to a production Kubernetes environment.

## 📋 Pre-deployment Preparation

### System Requirements
- Python 3.12+
- Docker (optional, for containerized deployment)
- Kubernetes (for production deployment)
- An Azure subscription and the related services

### Azure Service Preparation
Ensure that the following Azure services are configured:
- Azure Document Intelligence
- Azure AI Search
- Azure Blob Storage
- Azure OpenAI (for vector embeddings)

## 🔧 Configuration File Preparation

### 1. Environment Configuration (env.yaml)
```yaml
# Configuration file reference
config: config.yaml

# Processing settings
njobs: 8  # Number of parallel processing jobs

# Azure AI Search configuration
search_service_name: "https://your-search-service.search.windows.net"
search_admin_key: "your-search-admin-key"

# Azure OpenAI embedding service
embedding_model_endpoint: "https://your-openai.openai.azure.com/openai/deployments/text-embedding-3-small/embeddings?api-version=2024-12-01-preview"
embedding_model_key: "your-openai-key"
VECTOR_DIMENSION: 1536
FLAG_AOAI: "V3"               # Azure OpenAI version
FLAG_EMBEDDING_MODEL: "AOAI"  # Embedding model type: "AOAI" or "qwen3-embedding-8b"

# Document Intelligence configuration
extract_method: "di+vision-llm"  # Extraction method: "di+vision-llm", "vision-llm", or "di"
form_rec_resource: "https://your-di-service.cognitiveservices.azure.com/"
form_rec_key: "your-di-key"

# Document Intelligence features
di-hiRes: true     # High-resolution OCR
di-Formulas: true  # Mathematical expression detection
di_allow_features_ext: "pdf;jpeg;jpg;png;bmp;tiff;heif"  # Supported file extensions

# Vision and captioning models
captioning_model_endpoint: "https://your-openai.openai.azure.com/openai/deployments/gpt-4o/chat/completions?api-version=2024-08-01-preview"
captioning_model_key: "your-openai-key"
vision_max_images: 200         # Maximum images to process per document (0 = no limit)
vision_image_method: "openai"  # Image processing method: "openai" or "newapi"
FIGURE_CONTENT_CLEAR: true     # Clear image content recognized by DI

# Blob storage for figures and DI results
FIGURE_BLOB_ACCOUNT_URL: "https://your-storage.blob.core.windows.net/container?sas-token"
DI_BLOB_ACCOUNT_URL: "https://your-storage.blob.core.windows.net/container?sas-token"

# Database configuration
DB_URI: "postgresql://user:password@host:port/database_name"

# Processing flags
header_fix: false  # Enable/disable header fixing
```

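A missing or placeholder value in env.yaml typically only surfaces at runtime. A small pre-flight check along these lines can catch gaps before launch; this is a hypothetical helper, not part of the project, and the key list simply mirrors the sample above.

```python
# Hypothetical pre-flight check for env.yaml; not part of the project.
# Key names mirror the sample configuration above.
REQUIRED_ENV_KEYS = [
    "config", "search_service_name", "search_admin_key",
    "embedding_model_endpoint", "embedding_model_key",
    "form_rec_resource", "form_rec_key", "DB_URI",
]

def missing_env_keys(env: dict) -> list[str]:
    """Return required keys that are absent, empty, or left at a 'your-…' placeholder."""
    missing = []
    for key in REQUIRED_ENV_KEYS:
        value = env.get(key)
        if value in (None, "") or (isinstance(value, str) and value.startswith("your-")):
            missing.append(key)
    return missing
```

Run it against the parsed YAML (e.g. the result of `yaml.safe_load`) before starting the indexer.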
### 2. Business Configuration (config.yaml)

```yaml
# Main data configuration (array format)
- data_path: "https://your-blob-storage.blob.core.windows.net/container?sas-token"
  datasource_name: "CATOnline-cn"  # Data source name
  data_dir: ""                     # Optional local data directory
  base_path: "/app/run_tmp"        # Temporary processing directory

  # File processing limits
  process_file_num: 0  # 0 = process all files
  process_file_last_modify: "2025-06-24 00:00:00"  # Only process files modified after this date

  # Chunking configuration
  chunk_size: 2048    # Maximum tokens per chunk
  token_overlap: 128  # Overlap between chunks

  # Index schemas configuration
  index_schemas:
    # Chunk-level index for search
    - index_name: "your-knowledge-chunk-index"
      data_type: ["metadata", "document", "chunk"]
      field_type: "append"   # How to handle existing data
      upload_batch_size: 50  # Documents per batch upload

      # Metadata fields to include
      fields: [
        "filepath", "timestamp", "title", "publisher", "publish_date",
        "document_category", "document_code", "language_code",
        "x_Standard_Regulation_Id", "x_Attachment_Type",
        "x_Standard_Title_CN", "x_Standard_Title_EN",
        "x_Standard_Published_State", "x_Standard_Drafting_Status",
        "x_Standard_Range", "x_Standard_Kind", "x_Standard_No",
        "x_Standard_Code", "x_Standard_Technical_Committee",
        "x_Standard_Vehicle_Type", "x_Standard_Power_Type",
        "x_Standard_CCS", "x_Standard_ICS",
        "x_Standard_Published_Date", "x_Standard_Effective_Date",
        "x_Regulation_Status", "x_Regulation_Title_CN",
        "x_Regulation_Title_EN", "x_Regulation_Document_No",
        "x_Regulation_Issued_Date", "x_Classification",
        "x_Work_Group", "x_Reference_Standard",
        "x_Replaced_by", "x_Refer_To", "func_uuid",
        "update_time", "status"
      ]

      # Vector configuration
      vector_fields:
        - field: "contentVector"
          append_fields: ["content"]  # Fields to vectorize for content
        - field: "full_metadata_vector"
          append_fields: ["full_headers", "doc_metadata"]  # Metadata vectorization

      # Azure AI Search configuration
      semantic_config_name: "default"
      vector_config_name: "vectorSearchProfile"
      update_by_field: "filepath"  # Field to use for updates
      full_metadata_vector_fields: ["full_headers", "doc_metadata"]

    # Document-level index
    - index_name: "your-knowledge-document-index"
      data_type: ["document", "metadata"]
      field_type: "full"        # Replace entire documents
      key_fields: ["filepath"]  # Primary key fields
      upload_batch_size: 1

      fields: [
        # Same field list as the chunk index
        "filepath", "timestamp", "title", "publisher"
        # ... (same as above)
      ]

      merge_content_fields: ["content"]  # Fields to merge from chunks
      vector_fields:
        - field: "full_metadata_vector"
          append_fields: ["doc_metadata"]

      semantic_config_name: "default"
      vector_config_name: "vectorSearchProfile"
      update_by_field: "filepath"

    # Regulation-specific index
    - index_name: "your-regulation-index"
      data_type: ["metadata"]
      field_type: "full"
      key_fields: ["x_Standard_Regulation_Id"]  # Regulation ID as key
      upload_batch_size: 50

      fields: [
        # Regulation-specific fields
        "x_Standard_Regulation_Id", "x_Standard_Title_CN",
        "x_Standard_Title_EN", "x_Regulation_Status"
        # ... (regulation metadata fields)
      ]

      vector_fields:
        - field: "full_metadata_vector"
          append_fields: ["doc_metadata"]

      update_by_field: "x_Standard_Regulation_Id"

  # Field merging configuration
  merge_fields:
    - key: "doc_metadata"  # Combined metadata field
      fields: [
        "title", "publisher", "document_category", "document_code",
        "x_Standard_Title_CN", "x_Standard_Title_EN",
        "x_Standard_Published_State", "x_Standard_Drafting_Status"
        # ... (all metadata fields to combine)
      ]

  # Vector field configuration
  full_metadata_vector_fields: ["full_headers", "doc_metadata"]
```

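To see how `chunk_size` and `token_overlap` interact, here is a minimal sketch (not the project's actual chunker) of how overlapping token windows are derived: each chunk starts `chunk_size - token_overlap` tokens after the previous one, so with the values above consecutive chunks share 128 tokens.

```python
def chunk_spans(n_tokens: int, chunk_size: int = 2048, token_overlap: int = 128):
    """Return (start, end) token offsets for overlapping chunks.

    Illustrative only: shows the window arithmetic implied by the
    chunk_size / token_overlap settings, not the project's tokenizer.
    """
    step = chunk_size - token_overlap  # stride between chunk starts
    spans = []
    start = 0
    while start < n_tokens:
        spans.append((start, min(start + chunk_size, n_tokens)))
        if start + chunk_size >= n_tokens:
            break  # last window already reaches the end
        start += step
    return spans
```

For a 5,000-token document this yields three chunks, the second starting at token 1920 so that it overlaps the first by 128 tokens.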
## 🚀 Deployment Methods

### Method 1: Local Development Deployment

#### 1. Environment Preparation
```bash
# Clone the repository
git clone <repository-url>
cd document-extractor

# Create a virtual environment
python -m venv .venv

# Activate the virtual environment
# Linux/macOS:
source .venv/bin/activate
# Windows:
.venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

#### 2. Configuration File Setup
```bash
# Copy the configuration templates
cp config.yaml.example config.yaml
cp env.yaml.example env.yaml

# Edit config.yaml and env.yaml with your actual values
```

#### 3. Run the Application
```bash
# Run directly
python main.py --config config.yaml --env env.yaml
```

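For orientation, the command above implies an entry point that accepts `--config` and `--env` flags. A hypothetical sketch of that CLI surface (the real `main.py` may parse its arguments differently):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical sketch of the CLI that main.py appears to expose;
    # the actual entry point may differ.
    parser = argparse.ArgumentParser(description="Document Extractor indexer")
    parser.add_argument("--config", required=True,
                        help="Business configuration file (config.yaml)")
    parser.add_argument("--env", required=True,
                        help="Environment configuration file (env.yaml)")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"config={args.config} env={args.env}")
```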
### Method 2: Kubernetes Production Deployment

#### 1. Build and Push the Image
```bash
docker build . -t document-ai-indexer:latest

docker tag document-ai-indexer:latest acrsales2caiprd.azurecr.cn/document-ai-indexer:latest

# Log in (prefer --password-stdin over -p to keep the password out of shell history)
echo "$ACR_PASSWORD" | docker login acrsales2caiprd.azurecr.cn -u username --password-stdin

docker push acrsales2caiprd.azurecr.cn/document-ai-indexer:latest
```

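The repository's Dockerfile is not reproduced in this guide. As a rough sketch, an image for a pip-based Python 3.12 service of this shape typically looks like the following; the base image, file layout, and entrypoint here are assumptions, not the project's actual Dockerfile:

```dockerfile
# Hypothetical sketch; see the repository's actual Dockerfile.
FROM python:3.12-slim

WORKDIR /app

# Install dependencies first so the layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

ENTRYPOINT ["python", "main.py", "--config", "config.yaml", "--env", "env.yaml"]
```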
#### 2. Prepare Configuration Files
```bash
# Create the namespace (if it does not exist)
kubectl create namespace knowledge-agent

# Create the ConfigMap
kubectl create configmap document-ai-indexer-config \
  --from-file=config.yaml \
  --from-file=env.yaml \
  -n knowledge-agent
```

#### 3. One-time Task Deployment
```bash
# Deploy the Pod
kubectl apply -f deploy/document-ai-indexer_k8s.yml -n knowledge-agent

# Check status
kubectl get pods -n knowledge-agent
kubectl logs -f document-ai-indexer -n knowledge-agent
```

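`deploy/document-ai-indexer_k8s.yml` is referenced above but not shown. A plausible minimal shape, assuming the ConfigMap created in step 2 is mounted over the two configuration files (mount paths and container details are assumptions, not the actual manifest):

```yaml
# Hypothetical sketch of deploy/document-ai-indexer_k8s.yml
apiVersion: v1
kind: Pod
metadata:
  name: document-ai-indexer
  namespace: knowledge-agent
spec:
  restartPolicy: Never          # one-shot processing task
  containers:
    - name: document-ai-indexer
      image: acrsales2caiprd.azurecr.cn/document-ai-indexer:latest
      volumeMounts:
        - name: config
          mountPath: /app/config.yaml
          subPath: config.yaml
        - name: config
          mountPath: /app/env.yaml
          subPath: env.yaml
  volumes:
    - name: config
      configMap:
        name: document-ai-indexer-config
```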
#### 4. CronJob Deployment
```bash
# Deploy the CronJob
kubectl apply -f deploy/document-ai-indexer-cronjob.yml -n knowledge-agent

# Check CronJob status
kubectl get cronjobs -n knowledge-agent

# Check job history
kubectl get jobs -n knowledge-agent

# Trigger a run manually
kubectl create job --from=cronjob/document-ai-indexer-cronjob manual-test -n knowledge-agent
```

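Likewise, `deploy/document-ai-indexer-cronjob.yml` is not shown. A minimal sketch of its likely structure; the schedule, history limits, and concurrency policy below are assumptions, not the actual manifest:

```yaml
# Hypothetical sketch of deploy/document-ai-indexer-cronjob.yml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: document-ai-indexer-cronjob
  namespace: knowledge-agent
spec:
  schedule: "0 2 * * *"        # assumed: daily at 02:00
  concurrencyPolicy: Forbid    # don't start a run while one is still active
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: document-ai-indexer
              image: acrsales2caiprd.azurecr.cn/document-ai-indexer:latest
```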
## 📊 Deployment Architecture Diagram

```mermaid
graph TB
    subgraph "Azure Cloud Services"
        ABS[Azure Blob Storage]
        ADI[Azure Document Intelligence]
        AAS[Azure AI Search]
        AOI[Azure OpenAI]
    end

    subgraph "Kubernetes Cluster"
        subgraph "Namespace: knowledge-agent"
            CM[ConfigMap<br/>Configuration files]
            CJ[CronJob<br/>Scheduled runs]
            POD[Pod<br/>Processing container]
        end
    end

    subgraph "Container Registry"
        ACR[Azure Container Registry<br/>acrsales2caiprd.azurecr.cn]
    end

    CM --> POD
    CJ --> POD
    ACR --> POD

    POD --> ABS
    POD --> ADI
    POD --> AAS
    POD --> AOI

    style POD fill:#e1f5fe
    style CM fill:#e8f5e8
    style CJ fill:#fff3e0
```

## 📈 Monitoring and Logging

### Viewing Logs
```bash
# Kubernetes environment
kubectl logs -f document-ai-indexer -n knowledge-agent

# Filter error logs
kubectl logs document-ai-indexer -n knowledge-agent | grep ERROR

# Check processing progress
kubectl logs document-ai-indexer -n knowledge-agent | grep "Processing"
```

## 🔍 Troubleshooting

### Kubernetes Deployment Issues
**Symptoms**: the Pod fails to start or keeps restarting

**Solutions**:
```bash
# Check Pod status
kubectl describe pod document-ai-indexer -n knowledge-agent

# Check events
kubectl get events -n knowledge-agent

# Check the ConfigMap
kubectl get configmap document-ai-indexer-config -n knowledge-agent -o yaml
```

### Debugging Commands
```bash
# Inspect the configuration inside the container
kubectl exec -it document-ai-indexer -n knowledge-agent -- cat /app/config.yaml

# Open a shell in the container for debugging
kubectl exec -it document-ai-indexer -n knowledge-agent -- /bin/bash

# Run processing manually
kubectl exec -it document-ai-indexer -n knowledge-agent -- python main.py --config config.yaml --env env.yaml
```

## 🔄 Updating the Deployment

### Application Update
```bash
# Build a new image
docker build -t document-ai-indexer:v0.21.0 .

# Tag and push to the registry
docker tag document-ai-indexer:v0.21.0 acrsales2caiprd.azurecr.cn/document-ai-indexer:v0.21.0
docker push acrsales2caiprd.azurecr.cn/document-ai-indexer:v0.21.0

# Update the Kubernetes CronJob image
kubectl set image cronjob/document-ai-indexer-cronjob \
  document-ai-indexer=acrsales2caiprd.azurecr.cn/document-ai-indexer:v0.21.0 \
  -n knowledge-agent
```

### Configuration Update
```bash
# Update the ConfigMap
kubectl create configmap document-ai-indexer-config \
  --from-file=config.yaml \
  --from-file=env.yaml \
  -n knowledge-agent \
  --dry-run=client -o yaml | kubectl apply -f -

# A CronJob has no rollout to restart; the next scheduled run picks up the
# updated ConfigMap automatically. To apply it immediately, trigger a manual job:
kubectl create job --from=cronjob/document-ai-indexer-cronjob config-refresh -n knowledge-agent
```

---

*Last updated: August 2025*