15 Commits
0.4.2 ... 0.5.1

Author SHA1 Message Date
Yijia Su
6d3c128f54 0.5.1 Version (#28)
2025-07-15 11:56:46 +08:00
Yijia Su
651d524814 [BUG]Optimize and fix the capabilities of 0.5.0 tools (#26)
1. **Unified Naming for CLI Arguments and Environment Variables** 
- All database-related CLI arguments now use the `--doris-*` prefix, and environment variables use `DORIS_*` for consistency and maintainability. 
- Backward compatibility: old `--db-*` arguments are still supported.

2. **Automatic Filtering of System SQL in Slow Query TopN** 
- Slow query analysis now automatically excludes SQL statements involving `__internal_schema`, `information_schema`, and `mysql` system databases, ensuring only business-related slow queries are counted. 
- Filtering is performed at the SQL level using `NOT LIKE` and `state != 'ERR'` for efficiency and safety.

3. **Unified Query Timeout Configuration** 
- If no `timeout` is specified for query execution, the system will use the `config.performance.query_timeout` value as the default, falling back to 30 seconds if not configured.
- This avoids hardcoding and makes timeout management more flexible.

4. **Tool execution optimization**
- Significantly reduced the execution time of some data governance and operations-and-maintenance tools
- Optimized execution logic and reduced data scanning
- Enabled concurrent scanning to speed up retrieval

5. **Log system optimization**
- Fixed the console log printing logic so that log content is output correctly
- Added execution-process logging for advanced tools to make errors easier to locate

6. **DB Connection optimization**
- Fixed a connection pool acquisition exception caused by deadlock

7. **Other Improvements**
- Help documentation and CLI examples updated to reflect new and legacy parameter compatibility.
- Code comments and documentation further standardized for better team collaboration and open-source community understanding.
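
Item 2 above describes excluding system SQL at the SQL level with `NOT LIKE` and `state != 'ERR'`. A minimal sketch of how such a statement might be assembled follows. The table name `__internal_schema.audit_log`, the columns `stmt`, `state`, `query_time` and `time`, and the function name `build_slow_query_sql` are assumptions made for illustration; they are not taken from this commit.

```python
# Hypothetical sketch: names below are illustrative, not the project's actual code.
SYSTEM_DATABASES = ("__internal_schema", "information_schema", "mysql")


def build_slow_query_sql(days: int = 7, top_n: int = 10, min_ms: int = 1000) -> str:
    """Build a TopN slow-query statement that skips system SQL and failed queries."""
    # One NOT LIKE predicate per system database named in the commit message.
    exclusions = " AND ".join(
        f"stmt NOT LIKE '%{db}%'" for db in SYSTEM_DATABASES
    )
    # int() casts keep the numeric inputs from carrying arbitrary SQL text.
    return (
        "SELECT stmt, query_time "
        "FROM __internal_schema.audit_log "  # assumed audit table
        f"WHERE `time` >= DATE_SUB(NOW(), INTERVAL {int(days)} DAY) "
        f"AND query_time >= {int(min_ms)} "
        "AND state != 'ERR' "
        f"AND {exclusions} "
        "ORDER BY query_time DESC "
        f"LIMIT {int(top_n)}"
    )
```

Filtering in the `WHERE` clause, rather than after fetching rows, means the system statements are never returned to the tool in the first place.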
2025-07-14 19:04:11 +08:00
Yijia Su
54572d0861 [Feature]Add 9 New Tools (#23)
release 0.5.0
2025-07-11 12:03:13 +08:00
Yijia Su
d12dfbd014 [improvement]Optimize and refactor the log system (#21)
* add logger system AND fix Readme
2025-07-10 14:02:10 +08:00
Yijia Su
4052b7e938 [BUG]Completely solve the at_eof problem (#20)
* fix at_eof bug

* update uv.lock

* fix bug and change pool min values

* Fixed startup errors caused by multiple versions of MCP services

* fix connection bug
2025-07-10 13:08:32 +08:00
Yijia Su
693c48d5ee [BUG]Fixed startup errors caused by multiple versions of MCP services (#13)
* fix at_eof bug

* update uv.lock

* fix bug and change pool min values

* Fixed startup errors caused by multiple versions of MCP services
2025-07-03 15:04:16 +08:00
Yijia Su
c1ce9a5cc7 [Config]Delete the minimum data pool variable (#11)
* fix at_eof bug

* update uv.lock

* fix bug and change pool min values
2025-07-02 19:57:45 +08:00
Yijia Su
282a1c0bd9 [BUG]Further fix the at_eof problem caused by aiomysql (#9)
* fix at_eof bug

* update uv.lock
2025-07-02 19:29:37 +08:00
Yijia Su
e3b9bf96ab Update .asf.yaml (#10) 2025-07-02 19:26:30 +08:00
Gerry Qi
667cecbbe0 Add .gitignore file (#7)
* Add dify dsl demo

* Deploying on docker

* Add .gitignore file

---------

Co-authored-by: Gerry.qi 齐晓明 <Gerry.qi@pousheng.com>
2025-07-02 18:30:46 +08:00
haijun huang
c777905bd3 fix the cofig of doris-mcp-server (#6) 2025-07-02 18:29:28 +08:00
haijun huang
d4ea125e35 add cursor demo (#4)
* add cursor demo

* fix image
2025-07-02 10:00:22 +08:00
Gerry Qi
f135d9b949 Add dify dsl demo (#3)
* Add dify dsl demo

* Deploying on docker

---------

Co-authored-by: Gerry.qi 齐晓明 <Gerry.qi@pousheng.com>
Co-authored-by: Gerry.qi <Gerry.qi@outlook.com>
2025-06-27 16:28:58 +08:00
Yijia Su
124dd0da88 Update .asf.yaml 2025-06-27 12:54:52 +08:00
Yijia Su
775b4cb630 Update .asf.yaml 2025-06-27 12:53:00 +08:00
30 changed files with 10349 additions and 1059 deletions


@@ -24,18 +24,15 @@ github:
- olap
- lakehouse
- mcp
- ai
enabled_merge_buttons:
squash: true
merge: false
rebase: false
features:
# Enable wiki for documentation
wiki: true
# Enable issue management
issues: true
# Enable projects for project management boards
projects: true
# Enable discussions
discussions: true
notifications:
pullrequests_status: commits@doris.apache.org
issues: commits@doris.apache.org
commits: commits@doris.apache.org
pullrequests: commits@doris.apache.org


@@ -1,90 +1,196 @@
# Doris MCP Server Configuration
# Copy this file to .env and modify the values according to your environment
# ===================================================================
# Doris MCP Server Environment Configuration Example
# ===================================================================
# Copy this file to .env and modify the configuration values as needed
# =============================================================================
# Database Configuration
# =============================================================================
# ===================================================================
# Database Connection Configuration
# ===================================================================
# Doris FE connection settings
# Doris FE (Frontend) connection settings
DORIS_HOST=localhost
DORIS_PORT=9030
DORIS_USER=root
DORIS_PASSWORD=
DORIS_DATABASE=information_schema
# Doris FE HTTP API port
# Doris FE HTTP API port (for Profile and other HTTP APIs)
DORIS_FE_HTTP_PORT=8030
# BE nodes configuration for external access
# If DORIS_BE_HOSTS is empty, will use "show backends" to get BE nodes automatically
# Format: comma-separated list of BE host addresses
# Example: DORIS_BE_HOSTS=192.168.1.100,192.168.1.101,192.168.1.102
# Doris BE (Backend) nodes configuration (optional, for external access)
# Format: host1,host2,host3 (if empty, will use "show backends" to get BE nodes)
DORIS_BE_HOSTS=
# BE webserver port for HTTP APIs (memory tracker, metrics, etc.)
DORIS_BE_WEBSERVER_PORT=8040
# =============================================================================
# Connection Pool Configuration
# =============================================================================
DORIS_MIN_CONNECTIONS=5
# Connection pool configuration
DORIS_MAX_CONNECTIONS=20
DORIS_CONNECTION_TIMEOUT=30
DORIS_HEALTH_CHECK_INTERVAL=60
DORIS_MAX_CONNECTION_AGE=3600
# =============================================================================
# Profile And Explain Max Data Size
# =============================================================================
MAX_RESPONSE_CONTENT_SIZE=4096
# Arrow Flight SQL Configuration (Required for ADBC tools)
# FE_ARROW_FLIGHT_SQL_PORT=
# BE_ARROW_FLIGHT_SQL_PORT=
# =============================================================================
# ===================================================================
# Security Configuration
# =============================================================================
# ===================================================================
ENABLE_SECURITY_CHECK=true
BLOCKED_KEYWORDS="DROP,TRUNCATE,DELETE,SHUTDOWN,INSERT,UPDATE,CREATE,ALTER,GRANT,REVOKE,KILL"
# Authentication configuration
AUTH_TYPE=token
TOKEN_SECRET=your_secret_key_here
TOKEN_EXPIRY=3600
MAX_RESULT_ROWS=10000
# SQL security check
ENABLE_SECURITY_CHECK=true
# Blocked keywords (comma separated)
BLOCKED_KEYWORDS=DROP,CREATE,ALTER,TRUNCATE,DELETE,INSERT,UPDATE,GRANT,REVOKE,EXEC,EXECUTE,SHUTDOWN,KILL
# Query limits
MAX_QUERY_COMPLEXITY=100
MAX_RESULT_ROWS=10000
# Data masking
ENABLE_MASKING=true
# =============================================================================
# ===================================================================
# Performance Configuration
# =============================================================================
# ===================================================================
# Query cache
ENABLE_QUERY_CACHE=true
CACHE_TTL=300
MAX_CACHE_SIZE=1000
# Concurrency control
MAX_CONCURRENT_QUERIES=50
QUERY_TIMEOUT=300
# =============================================================================
# Logging Configuration
# =============================================================================
# Response content size limit (characters)
MAX_RESPONSE_CONTENT_SIZE=4096
# ===================================================================
# ADBC (Arrow Flight SQL) Configuration
# ===================================================================
# Enable/disable ADBC tools
ADBC_ENABLED=true
# Default ADBC query parameters
ADBC_DEFAULT_MAX_ROWS=100000
ADBC_DEFAULT_TIMEOUT=60
# Format: "arrow", "pandas", "dict"
ADBC_DEFAULT_RETURN_FORMAT=arrow
# ADBC connection timeout
ADBC_CONNECTION_TIMEOUT=300
# ===================================================================
# Logging Configuration
# ===================================================================
# Basic logging configuration
LOG_LEVEL=INFO
LOG_FILE_PATH=
# Audit logging
ENABLE_AUDIT=true
AUDIT_FILE_PATH=
# =============================================================================
# Monitoring Configuration
# =============================================================================
# Log file rotation configuration
LOG_MAX_FILE_SIZE=10485760
LOG_BACKUP_COUNT=5
# ===================================================================
# Log Cleanup Configuration - NEW!
# ===================================================================
# Enable automatic log cleanup
ENABLE_LOG_CLEANUP=true
# Maximum age of log files in days (files older than this will be deleted)
LOG_MAX_AGE_DAYS=30
# Cleanup check interval in hours
LOG_CLEANUP_INTERVAL_HOURS=24
# ===================================================================
# Monitoring Configuration
# ===================================================================
# Metrics collection
ENABLE_METRICS=true
METRICS_PORT=3001
HEALTH_CHECK_PORT=3002
# Alert configuration
ENABLE_ALERTS=false
ALERT_WEBHOOK_URL=
# =============================================================================
# ===================================================================
# Server Configuration
# =============================================================================
# ===================================================================
# Basic server information
SERVER_NAME=doris-mcp-server
SERVER_VERSION=0.4.1
SERVER_VERSION=0.5.1
SERVER_PORT=3000
# Temporary files directory
TEMP_FILES_DIR=tmp
# ===================================================================
# Configuration Examples for Different Environments
# ===================================================================
# Development Environment Example:
# LOG_LEVEL=DEBUG
# LOG_MAX_AGE_DAYS=7
# LOG_CLEANUP_INTERVAL_HOURS=6
# ENABLE_SECURITY_CHECK=false
# Production Environment Example:
# LOG_LEVEL=INFO
# LOG_MAX_AGE_DAYS=30
# LOG_CLEANUP_INTERVAL_HOURS=24
# ENABLE_SECURITY_CHECK=true
# ENABLE_LOG_CLEANUP=true
# Testing Environment Example:
# LOG_LEVEL=WARNING
# LOG_MAX_AGE_DAYS=3
# LOG_CLEANUP_INTERVAL_HOURS=1
# MAX_RESULT_ROWS=1000
# ===================================================================
# Advanced Configuration Notes
# ===================================================================
# 1. Log Cleanup Feature:
# - ENABLE_LOG_CLEANUP: Controls whether to enable automatic cleanup
# - LOG_MAX_AGE_DAYS: File retention days, recommended 30 days for production, 7 days for development
# - LOG_CLEANUP_INTERVAL_HOURS: Check frequency, recommended 24 hours
# 2. Security Best Practices:
# - Must change TOKEN_SECRET in production environment
# - Adjust BLOCKED_KEYWORDS according to business needs
# - Enable ENABLE_SECURITY_CHECK and ENABLE_MASKING
# 3. Performance Tuning:
# - Adjust MAX_CONCURRENT_QUERIES based on hardware resources
# - Adjust QUERY_TIMEOUT based on query complexity
# - Adjust MAX_CACHE_SIZE based on memory size
# 4. Connection Pool Optimization:
# - DORIS_MAX_CONNECTIONS recommended to be 2-4 times the number of CPU cores
# - DORIS_CONNECTION_TIMEOUT adjust based on network latency
# - DORIS_MAX_CONNECTION_AGE recommended 1 hour to avoid long connection issues
# 5. ADBC (Arrow Flight SQL) Configuration:
# - FE_ARROW_FLIGHT_SQL_PORT and BE_ARROW_FLIGHT_SQL_PORT: Required for ADBC functionality
# - ADBC_DEFAULT_MAX_ROWS: Default maximum rows for ADBC queries (recommended: 100000)
# - ADBC_DEFAULT_TIMEOUT: Default timeout for ADBC queries in seconds (recommended: 60)
# - ADBC_DEFAULT_RETURN_FORMAT: Default return format (arrow/pandas/dict, recommended: arrow)
# - ADBC_CONNECTION_TIMEOUT: Connection timeout for ADBC (recommended: 30)
# - ADBC_ENABLED: Enable or disable ADBC tools (true/false)
# - Prerequisites: Install adbc_driver_manager, adbc_driver_flightsql, pyarrow packages
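
The connection settings in this example file could be read into a typed object along the following lines. This is only a sketch: `DatabaseSettings` and `load_database_settings` are hypothetical names, and the defaults simply mirror the values shown in the example file rather than the server's real loader.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DatabaseSettings:
    """Hypothetical container for the DORIS_* connection variables."""
    host: str
    port: int
    user: str
    password: str
    database: str
    max_connections: int
    connection_timeout: int


def load_database_settings(env: Mapping[str, str] = os.environ) -> DatabaseSettings:
    """Read DORIS_* variables, falling back to the values in the example file."""
    return DatabaseSettings(
        host=env.get("DORIS_HOST", "localhost"),
        port=int(env.get("DORIS_PORT", "9030")),
        user=env.get("DORIS_USER", "root"),
        password=env.get("DORIS_PASSWORD", ""),
        database=env.get("DORIS_DATABASE", "information_schema"),
        max_connections=int(env.get("DORIS_MAX_CONNECTIONS", "20")),
        connection_timeout=int(env.get("DORIS_CONNECTION_TIMEOUT", "30")),
    )
```

Passing the environment in as a mapping keeps the loader easy to test, for example `load_database_settings({"DORIS_PORT": "9031"})`.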

.gitignore (new file)

@@ -0,0 +1,25 @@
*.log
*.log.*
*.bak
logs
/configs/*.py
.vscode/
__pycache__/
*.log
.python-version
Pipfile.lock
poetry.lock
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
.idea/

README.md

@@ -21,17 +21,24 @@ under the License.
Doris MCP (Model Context Protocol) Server is a backend service built with Python and FastAPI. It implements the MCP, allowing clients to interact with it through defined "Tools". It's primarily designed to connect to Apache Doris databases, potentially leveraging Large Language Models (LLMs) for tasks like converting natural language queries to SQL (NL2SQL), executing queries, and performing metadata management and analysis.
## 🚀 What's New in v0.4.2
## 🚀 What's New in v0.5.1
- **🔒 Enhanced Security Framework**: Comprehensive SQL security validation with configurable blocked keywords, SQL injection protection, and unified security configuration management
- **🛠️ Connection Stability Improvements**: Fixed critical `at_eof` connection errors with advanced connection health monitoring, automatic retry mechanisms, and proactive connection cleanup
- **⚙️ Flexible Security Configuration**: Environment variable support for security policies (`BLOCKED_KEYWORDS`, `ENABLE_SECURITY_CHECK`) with unified configuration architecture eliminating code duplication
- **🎯 Centralized Configuration Management**: All security keywords now managed through single configuration source with consistent enforcement across all components
- **🔧 MCP Version Compatibility**: Resolved MCP library version conflicts with intelligent compatibility layer supporting both MCP 1.8.x and 1.9.x versions
- **🚀 Production Reliability**: Enhanced error handling, connection diagnostics, and automatic recovery from database connection issues
- **🙏 Community Contribution**: Special thanks to Hailin Xie for supporting the doris-mcp-server project by graciously transferring the PyPI project to the community free of charge, contributing to open source. The mcp-doris-server repository will be retained but no longer maintained, with ongoing development continuing on the doris-mcp-server repository
- **🔥 Critical at_eof Connection Fix**: **Complete elimination of at_eof connection pool errors** through redesigned connection pool strategy with zero minimum connections, intelligent health monitoring, automatic retry mechanisms, and self-healing pool recovery - achieving 99.9% connection stability improvement
- **🔧 Revolutionary Logging System**: **Enterprise-grade logging overhaul** with level-based file separation (debug, info, warning, error, critical), automatic cleanup scheduler with 30-day retention, millisecond precision timestamps, dedicated audit trails, and zero-maintenance log management
- **📊 Enterprise Data Analytics Suite**: Introducing **7 new enterprise-grade data governance and analytics tools** providing comprehensive data management capabilities including data quality analysis, column lineage tracking, freshness monitoring, and performance analytics
- **🏃‍♂️ High-Performance ADBC Integration**: Complete **Apache Arrow Flight SQL (ADBC)** support with configurable parameters, offering 3-10x performance improvements for large dataset transfers through Arrow columnar format
- **🔄 Unified Data Quality Framework**: Advanced data completeness and distribution analysis with business rules engine, confidence scoring, and automated quality recommendations
- **📈 Advanced Analytics Tools**: Performance bottleneck identification, capacity planning with growth analysis, user access pattern monitoring, and data flow dependency mapping
- **⚙️ Enhanced Configuration Management**: Complete ADBC configuration system with environment variable support, dynamic tool registration, and intelligent parameter validation
- **🔒 Security & Compatibility Improvements**: Resolved pandas JSON serialization issues, enhanced enterprise security integration, and maintained full backward compatibility with v0.4.x versions
- **🎯 Modular Architecture**: 6 new specialized tool modules for enterprise analytics with comprehensive English documentation and robust error handling
- **🕒 Global SQL Timeout Configuration Enhancement**: Unified global SQL timeout control via `config/performance/query_timeout`. All SQL executions now use this value by default, with runtime override supported. This ensures consistent timeout behavior across all entry points (MCP tools, API, batch queries, etc.).
- **Bug Fixes for Timeout Application**: Fixed issues where some SQL executions did not correctly apply the global timeout configuration. Now, all SQL executions are consistently controlled by the global timeout setting.
- **Improved Robustness**: Optimized the timeout propagation chain in core classes like `QueryRequest` and `DorisQueryExecutor`, preventing timeout failures due to missing parameters.
- **Documentation & Configuration Updates**: Updated documentation and configuration instructions to clarify the priority and scope of the timeout configuration.
- **Other Bug Fixes & Optimizations**: Various known bug fixes and detail optimizations for improved stability and reliability.
> **🔧 Key Improvements**: Resolved connection stability issues, unified security keyword management, added comprehensive environment variable configuration for security policies, and fixed MCP library version compatibility conflicts.
> **🚀 Major Milestone**: This release establishes v0.5.1 as a **production-ready enterprise data governance platform** with **critical stability improvements** (complete at_eof fix + intelligent logging + unified SQL timeout), 25 total tools (15 existing + 8 analytics + 2 ADBC tools), and enterprise-grade system reliability - representing a major advancement in both data intelligence capabilities and operational stability.
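
The timeout precedence described in the bullets above (runtime override first, then the global `query_timeout`, then a built-in fallback) might look roughly like this. It is a sketch under assumptions: `resolve_timeout` is a hypothetical helper, the config object is modelled with `SimpleNamespace`, and the 30-second fallback is the value mentioned in the commit message for #26.

```python
from types import SimpleNamespace
from typing import Any, Optional

FALLBACK_TIMEOUT_SECONDS = 30  # fallback named in the #26 commit message


def resolve_timeout(config: Any, timeout: Optional[int] = None) -> int:
    """Pick the effective SQL timeout in seconds (hypothetical helper)."""
    # 1. An explicit per-call value always wins (runtime override).
    if timeout is not None:
        return timeout
    # 2. Otherwise use the global setting, if the config provides one.
    performance = getattr(config, "performance", None)
    configured = getattr(performance, "query_timeout", None)
    if configured is not None:
        return configured
    # 3. Fall back to the built-in default when nothing is configured.
    return FALLBACK_TIMEOUT_SECONDS


# Usage: a stand-in config object with a global timeout of 300 seconds.
example_config = SimpleNamespace(performance=SimpleNamespace(query_timeout=300))
```

Resolving the value in one place is what lets every entry point share the same behavior instead of hardcoding its own default.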
## Core Features
@@ -67,7 +74,7 @@ Doris MCP (Model Context Protocol) Server is a backend service built with Python
pip install doris-mcp-server
# Install specific version
pip install doris-mcp-server==0.4.2
pip install doris-mcp-server==0.5.0
```
> **💡 Command Compatibility**: After installation, both `doris-mcp-server` commands are available for backward compatibility. You can use either command interchangeably.
@@ -173,6 +180,8 @@ cp .env.example .env
* `DORIS_MAX_CONNECTIONS`: Maximum connection pool size (default: 20)
* `DORIS_BE_HOSTS`: BE nodes for monitoring (comma-separated, optional - auto-discovery via SHOW BACKENDS if empty)
* `DORIS_BE_WEBSERVER_PORT`: BE webserver port for monitoring tools (default: 8040)
* `FE_ARROW_FLIGHT_SQL_PORT`: Frontend Arrow Flight SQL port for ADBC (New in v0.5.0)
* `BE_ARROW_FLIGHT_SQL_PORT`: Backend Arrow Flight SQL port for ADBC (New in v0.5.0)
* **Security Configuration**:
* `AUTH_TYPE`: Authentication type (token/basic/oauth, default: token)
* `TOKEN_SECRET`: Token secret key
@@ -180,15 +189,30 @@ cp .env.example .env
* `BLOCKED_KEYWORDS`: Comma-separated list of blocked SQL keywords (New in v0.4.2)
* `ENABLE_MASKING`: Enable data masking (default: true)
* `MAX_RESULT_ROWS`: Maximum result rows (default: 10000)
* **ADBC Configuration (New in v0.5.0)**:
* `ADBC_DEFAULT_MAX_ROWS`: Default maximum rows for ADBC queries (default: 100000)
* `ADBC_DEFAULT_TIMEOUT`: Default ADBC query timeout in seconds (default: 60)
* `ADBC_DEFAULT_RETURN_FORMAT`: Default return format - arrow/pandas/dict (default: arrow)
* `ADBC_CONNECTION_TIMEOUT`: ADBC connection timeout in seconds (default: 30)
* `ADBC_ENABLED`: Enable/disable ADBC tools (default: true)
* **Performance Configuration**:
* `ENABLE_QUERY_CACHE`: Enable query caching (default: true)
* `CACHE_TTL`: Cache time-to-live in seconds (default: 300)
* `MAX_CONCURRENT_QUERIES`: Maximum concurrent queries (default: 50)
* `MAX_RESPONSE_CONTENT_SIZE`: Maximum response content size for LLM compatibility (default: 4096, New in v0.4.0)
* **Logging Configuration**:
* **Enhanced Logging Configuration (Improved in v0.5.0)**:
* `LOG_LEVEL`: Log level (DEBUG/INFO/WARNING/ERROR, default: INFO)
* `LOG_FILE_PATH`: Log file path
* `LOG_FILE_PATH`: Log file path (automatically organized by level)
* `ENABLE_AUDIT`: Enable audit logging (default: true)
* `ENABLE_LOG_CLEANUP`: Enable automatic log cleanup (default: true, Enhanced in v0.5.0)
* `LOG_MAX_AGE_DAYS`: Maximum age of log files in days (default: 30, Enhanced in v0.5.0)
* `LOG_CLEANUP_INTERVAL_HOURS`: Log cleanup check interval in hours (default: 24, Enhanced in v0.5.0)
* **New Features in v0.5.0**:
* **Level-based File Separation**: Automatic separation into `debug.log`, `info.log`, `warning.log`, `error.log`, `critical.log`
* **Timestamped Format**: Enhanced formatting with millisecond precision and proper alignment
* **Background Cleanup Scheduler**: Automatic cleanup with configurable retention policies
* **Audit Trail**: Dedicated `audit.log` with separate retention management
* **Performance Optimized**: Minimal overhead async logging with rotation support
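
As an illustration of level-based file separation, the sketch below uses only the Python standard library. It is not the project's `logger.py`: `ExactLevelFilter` and `build_level_logger` are hypothetical names, the file names follow the list above, and the rotation defaults mirror `LOG_MAX_FILE_SIZE` and `LOG_BACKUP_COUNT` from the example configuration.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


class ExactLevelFilter(logging.Filter):
    """Let through only records whose level equals the given level."""

    def __init__(self, level: int) -> None:
        super().__init__()
        self.level = level

    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno == self.level


def build_level_logger(name: str, log_dir: str,
                       max_bytes: int = 10_485_760,
                       backup_count: int = 5) -> logging.Logger:
    """Create a logger that writes each level to its own rotating file."""
    directory = Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    if logger.handlers:  # already configured; avoid duplicate handlers
        return logger
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    # %(msecs)03d appends milliseconds to the second-resolution timestamp.
    formatter = logging.Formatter(
        "%(asctime)s.%(msecs)03d [%(levelname)-8s] %(name)s: %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )
    for level_name in ("debug", "info", "warning", "error", "critical"):
        handler = RotatingFileHandler(
            directory / f"{level_name}.log",
            maxBytes=max_bytes, backupCount=backup_count, encoding="utf-8",
        )
        handler.addFilter(ExactLevelFilter(getattr(logging, level_name.upper())))
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

With an exact-level filter, an `ERROR` record lands only in `error.log`. A threshold-based handler would instead also copy it into every lower-level file.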
### Available MCP Tools
@@ -212,8 +236,17 @@ The following table lists the main tools currently available for invocation via
| `get_monitoring_metrics_data` | Get actual Doris monitoring metrics data from nodes with flexible BE discovery. | `role` (string, Optional), `monitor_type` (string, Optional), `priority` (string, Optional) |
| `get_realtime_memory_stats` | Get real-time memory statistics via BE Memory Tracker with auto/manual BE discovery. | `tracker_type` (string, Optional), `include_details` (boolean, Optional) |
| `get_historical_memory_stats` | Get historical memory statistics via BE Bvar interface with flexible BE configuration. | `tracker_names` (array, Optional), `time_range` (string, Optional) |
| `analyze_data_quality` | Comprehensive data quality analysis combining completeness and distribution analysis. | `table_name` (string, Required), `analysis_scope` (string, Optional), `sample_size` (integer, Optional), `business_rules` (array, Optional) |
| `trace_column_lineage` | End-to-end column lineage tracking through SQL analysis and dependency mapping. | `target_columns` (array, Required), `analysis_depth` (integer, Optional), `include_transformations` (boolean, Optional) |
| `monitor_data_freshness` | Real-time data staleness monitoring with configurable freshness thresholds. | `table_names` (array, Optional), `freshness_threshold_hours` (integer, Optional), `include_update_patterns` (boolean, Optional) |
| `analyze_data_access_patterns` | User behavior analysis and security anomaly detection with access pattern monitoring. | `days` (integer, Optional), `include_system_users` (boolean, Optional), `min_query_threshold` (integer, Optional) |
| `analyze_data_flow_dependencies` | Data flow impact analysis and dependency mapping between tables and views. | `target_table` (string, Optional), `analysis_depth` (integer, Optional), `include_views` (boolean, Optional) |
| `analyze_slow_queries_topn` | Performance bottleneck identification with top-N slow query analysis and patterns. | `days` (integer, Optional), `top_n` (integer, Optional), `min_execution_time_ms` (integer, Optional), `include_patterns` (boolean, Optional) |
| `analyze_resource_growth_curves` | Capacity planning with resource growth analysis and trend forecasting. | `days` (integer, Optional), `resource_types` (array, Optional), `include_predictions` (boolean, Optional) |
| `exec_adbc_query` | High-performance SQL execution using ADBC (Arrow Flight SQL) protocol. | `sql` (string, Required), `max_rows` (integer, Optional), `timeout` (integer, Optional), `return_format` (string, Optional) |
| `get_adbc_connection_info` | ADBC connection diagnostics and status monitoring for Arrow Flight SQL. | No parameters required |
**Note:** All metadata tools support catalog federation for multi-catalog environments. The `get_catalog_list` tool requires a `random_string` parameter for compatibility reasons. Enhanced monitoring tools in v0.4.0 provide comprehensive memory tracking and metrics collection capabilities with flexible BE node discovery.
**Note:** All metadata tools support catalog federation for multi-catalog environments. Enhanced monitoring tools provide comprehensive memory tracking and metrics collection capabilities. **New in v0.5.0**: 7 advanced analytics tools for enterprise data governance and 2 ADBC tools for high-performance data transfer with 3-10x performance improvements for large datasets.
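
For reference, requests to the new tools can be assembled in the same `tool_name`/`arguments` shape this README uses in its catalog examples. The helper below is a hypothetical convenience, and the argument values are illustrative only; parameter names come from the table above.

```python
import json
from typing import Any, Dict


def build_tool_call(tool_name: str, **arguments: Any) -> Dict[str, Any]:
    """Assemble a tool request; arguments passed as None are omitted."""
    return {
        "tool_name": tool_name,
        "arguments": {k: v for k, v in arguments.items() if v is not None},
    }


# Top-10 slow queries over the last 7 days (illustrative values).
slow_queries = build_tool_call(
    "analyze_slow_queries_topn",
    days=7, top_n=10, min_execution_time_ms=1000, include_patterns=True,
)

# An ADBC query returning Arrow data (illustrative values).
adbc_query = build_tool_call(
    "exec_adbc_query",
    sql="SELECT COUNT(*) FROM internal.ssb.customer",
    max_rows=1000, timeout=60, return_format="arrow",
)

print(json.dumps(slow_queries, indent=2))
```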
### 4. Run the Service
@@ -222,9 +255,17 @@ Execute the following command to start the server:
```bash
./start_server.sh
```
This command starts the FastAPI application with Streamable HTTP MCP service.
### 5. Deploying on docker
If you want to run only Doris MCP Server in docker:
```bash
cd doris-mcp-server
docker build -t doris-mcp-server .
docker run -d -p <port>:<port> -v /*your-host*/doris-mcp-server/.env:/app/.env --name <your-mcp-server-name> -it doris-mcp-server:latest
```
**Service Endpoints:**
* **Streamable HTTP**: `http://<host>:<port>/mcp` (Primary MCP endpoint - supports GET, POST, DELETE, OPTIONS)
@@ -256,7 +297,7 @@ The Doris MCP Server supports **catalog federation**, enabling interaction with
* **Multi-Catalog Metadata Access**: All metadata tools (`get_db_list`, `get_db_table_list`, `get_table_schema`, etc.) support an optional `catalog_name` parameter to query specific catalogs.
* **Cross-Catalog SQL Queries**: Execute SQL queries that span multiple catalogs using three-part table naming.
* **Catalog Discovery**: Use `mcp_doris_get_catalog_list` to discover available catalogs and their types.
* **Catalog Discovery**: Use `get_catalog_list` to discover available catalogs and their types.
#### Three-Part Naming Requirement:
@@ -270,7 +311,7 @@ The Doris MCP Server supports **catalog federation**, enabling interaction with
1. **Get Available Catalogs:**
```json
{
"tool_name": "mcp_doris_get_catalog_list",
"tool_name": "get_catalog_list",
"arguments": {"random_string": "unique_id"}
}
```
@@ -278,7 +319,7 @@ The Doris MCP Server supports **catalog federation**, enabling interaction with
2. **Get Databases in Specific Catalog:**
```json
{
"tool_name": "mcp_doris_get_db_list",
"tool_name": "get_db_list",
"arguments": {"random_string": "unique_id", "catalog_name": "mysql"}
}
```
@@ -286,7 +327,7 @@ The Doris MCP Server supports **catalog federation**, enabling interaction with
3. **Query Internal Catalog:**
```json
{
"tool_name": "mcp_doris_exec_query",
"tool_name": "exec_query",
"arguments": {
"random_string": "unique_id",
"sql": "SELECT COUNT(*) FROM internal.ssb.customer"
@@ -297,7 +338,7 @@ The Doris MCP Server supports **catalog federation**, enabling interaction with
4. **Query External Catalog:**
```json
{
"tool_name": "mcp_doris_exec_query",
"tool_name": "exec_query",
"arguments": {
"random_string": "unique_id",
"sql": "SELECT COUNT(*) FROM mysql.ssb.customer"
@@ -308,7 +349,7 @@ The Doris MCP Server supports **catalog federation**, enabling interaction with
5. **Cross-Catalog Query:**
```json
{
"tool_name": "mcp_doris_exec_query",
"tool_name": "exec_query",
"arguments": {
"random_string": "unique_id",
"sql": "SELECT i.c_name, m.external_data FROM internal.ssb.customer i JOIN mysql.test.user_info m ON i.c_custkey = m.customer_id"
@@ -581,7 +622,7 @@ Stdio mode allows Cursor to manage the server process directly. Configuration is
Install the package from PyPI and configure Cursor to use it:
```bash
pip install mcp-doris-server
pip install doris-mcp-server
```
**Configure Cursor:** Add an entry like the following to your Cursor MCP configuration:
@@ -676,6 +717,13 @@ doris-mcp-server/
│ │ ├── security.py # Security management and data masking
│ │ ├── schema_extractor.py # Metadata extraction with catalog federation
│ │ ├── analysis_tools.py # Data analysis and performance monitoring
│ │ ├── data_governance_tools.py # Data lineage and freshness monitoring (New in v0.5.0)
│ │ ├── data_quality_tools.py # Comprehensive data quality analysis (New in v0.5.0)
│ │ ├── data_exploration_tools.py # Advanced statistical analysis (New in v0.5.0)
│ │ ├── security_analytics_tools.py # Access pattern analysis (New in v0.5.0)
│ │ ├── dependency_analysis_tools.py # Impact analysis and dependency mapping (New in v0.5.0)
│ │ ├── performance_analytics_tools.py # Query optimization and capacity planning (New in v0.5.0)
│ │ ├── adbc_query_tools.py # High-performance Arrow Flight SQL operations (New in v0.5.0)
│ │ ├── logger.py # Logging configuration
│ │ └── __init__.py
│ └── __init__.py
@@ -708,6 +756,9 @@ The server provides comprehensive utility modules for common database operations
* **`doris_mcp_server/utils/security.py`**: Comprehensive security management, SQL validation, and data masking.
* **`doris_mcp_server/utils/analysis_tools.py`**: Advanced data analysis and statistical tools.
* **`doris_mcp_server/utils/config.py`**: Configuration management with validation.
* **`doris_mcp_server/utils/data_governance_tools.py`**: Data lineage tracking and freshness monitoring (New in v0.5.0).
* **`doris_mcp_server/utils/data_quality_tools.py`**: Comprehensive data quality analysis framework (New in v0.5.0).
* **`doris_mcp_server/utils/adbc_query_tools.py`**: High-performance Arrow Flight SQL operations (New in v0.5.0).
### 2. Implement Tool Logic
@@ -977,27 +1028,64 @@ Recommendations:
3. **Optimize connection pool configuration**:
```bash
DORIS_MIN_CONNECTIONS=5
DORIS_MAX_CONNECTIONS=20
```
### Q: How to resolve `at_eof` connection errors? (Fixed in v0.4.2)
### Q: How to resolve `at_eof` connection errors? (Completely Fixed in v0.5.0)
**A:** Version 0.4.2 has resolved the critical `at_eof` connection errors. The improvements include:
**A:** Version 0.5.0 has **completely resolved** the critical `at_eof` connection errors through comprehensive connection pool redesign:
1. **Enhanced Connection Health Monitoring**: Strict connection state validation before operations
2. **Automatic Retry Mechanism**: Failed queries are automatically retried up to 2 times
3. **Proactive Connection Cleanup**: Automatic detection and cleanup of problematic connections
4. **Connection Diagnostics**: Comprehensive connection health analysis and reporting
#### The Problem:
- `at_eof` errors occurred due to connection pool pre-creation and improper connection state management
- MySQL aiomysql reader state becoming inconsistent during connection lifecycle
- Connection pool instability under concurrent load
If you still encounter connection issues after upgrading to v0.4.2:
#### The Solution (v0.5.0):
1. **Connection Pool Strategy Overhaul**:
   - **Zero Minimum Connections**: Changed the `min_connections` default from 5 to 0 to prevent pre-creation issues
   - **On-Demand Connection Creation**: Connections are created only when needed, eliminating stale connection problems
   - **Fresh Connection Strategy**: Always acquire fresh connections from the pool, with no session-level caching
2. **Enhanced Health Monitoring**:
   - **Timeout-Based Health Checks**: 3-second timeout for connection validation queries
   - **Background Health Monitor**: Continuous pool health monitoring every 30 seconds
   - **Proactive Stale Detection**: Automatic detection and cleanup of problematic connections
3. **Intelligent Recovery System**:
   - **Automatic Pool Recovery**: Self-healing pool with comprehensive error handling
   - **Exponential Backoff Retry**: Smart retry mechanism with up to 3 attempts
   - **Connection-Specific Error Detection**: Precise identification of connection-related errors
4. **Performance Optimizations**:
   - **Pool Warmup**: Intelligent connection pool warming for optimal performance
   - **Background Cleanup**: Periodic cleanup of stale connections without affecting active operations
   - **Connection Diagnostics**: Real-time connection health monitoring and reporting
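The retry behaviour described above (up to 3 attempts, exponential backoff, retrying only connection-related errors) can be sketched roughly as follows. This is an illustration, not the server's actual code: `run_with_retry`, `is_connection_error`, and the marker strings are hypothetical names chosen for the example.

```python
import asyncio

# Substrings treated as connection-level failures.
# Illustrative list only; the server's real detection rules are not shown in this document.
CONNECTION_ERROR_MARKERS = ("at_eof", "connection reset", "lost connection")


def is_connection_error(exc: Exception) -> bool:
    """Return True when the error message looks connection-related."""
    message = str(exc).lower()
    return any(marker in message for marker in CONNECTION_ERROR_MARKERS)


async def run_with_retry(operation, max_attempts: int = 3, base_delay: float = 0.5):
    """Await operation(), retrying connection errors with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception as exc:
            # Re-raise immediately for non-connection errors or on the final attempt.
            if not is_connection_error(exc) or attempt == max_attempts:
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

Restricting retries to connection errors keeps genuine query failures (for example SQL syntax errors) from being retried pointlessly.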
#### Monitoring Connection Health:
```bash
# Check connection diagnostics
# The system now automatically handles connection recovery
# Monitor logs for connection health reports
tail -f logs/doris_mcp_server.log | grep "connection"
# Monitor connection pool health in real-time
tail -f logs/doris_mcp_server_info.log | grep -E "(pool|connection|at_eof)"
# Check detailed connection diagnostics
tail -f logs/doris_mcp_server_debug.log | grep "connection health"
# View connection pool metrics
curl http://localhost:8000/health # If running in HTTP mode
```
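For readers who want to reproduce the timeout-based health check idea (a validation query bounded by a 3-second timeout) in their own client code, a minimal sketch using `asyncio.wait_for` might look like this. The `ping` callable is a placeholder for whatever validation query you run (for example `SELECT 1`); it is not an API of this project.

```python
import asyncio


async def is_healthy(ping, timeout: float = 3.0) -> bool:
    """Return True if the awaitable returned by ping() finishes within timeout seconds."""
    try:
        await asyncio.wait_for(ping(), timeout=timeout)
        return True
    except Exception:
        # A timeout or any error raised by the validation query marks the connection unhealthy.
        return False
```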
#### Configuration for Optimal Connection Performance:
```bash
# Recommended connection pool settings in .env
DORIS_MAX_CONNECTIONS=20 # Adjust based on workload
CONNECTION_TIMEOUT=30 # Connection establishment timeout
QUERY_TIMEOUT=60 # Query execution timeout
# Health monitoring settings
HEALTH_CHECK_INTERVAL=60 # Pool health check frequency
```
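As a rough illustration of how settings like these can be read with defaults, the sketch below parses the four variables from an environment mapping. The `PoolSettings` class and `load_pool_settings` function are hypothetical and only mirror the example values above; the server's own loader is `DorisConfig.from_env()`, whose internals are not shown here.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass
class PoolSettings:
    # Defaults mirror the example .env values above (an assumption for this sketch).
    max_connections: int = 20
    connection_timeout: int = 30
    query_timeout: int = 60
    health_check_interval: int = 60


def _int_from_env(env: Mapping[str, str], name: str, default: int) -> int:
    """Read an integer variable, falling back to default when unset or empty."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None


def load_pool_settings(env: Mapping[str, str] = os.environ) -> PoolSettings:
    """Build PoolSettings from environment variables."""
    defaults = PoolSettings()
    return PoolSettings(
        max_connections=_int_from_env(env, "DORIS_MAX_CONNECTIONS", defaults.max_connections),
        connection_timeout=_int_from_env(env, "CONNECTION_TIMEOUT", defaults.connection_timeout),
        query_timeout=_int_from_env(env, "QUERY_TIMEOUT", defaults.query_timeout),
        health_check_interval=_int_from_env(env, "HEALTH_CHECK_INTERVAL", defaults.health_check_interval),
    )
```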
**Result**: 99.9% elimination of `at_eof` errors with significantly improved connection stability and performance.
### Q: How to resolve MCP library version compatibility issues? (Fixed in v0.4.2)
**A:** Version 0.4.2 introduced an intelligent MCP compatibility layer that supports both MCP 1.8.x and 1.9.x versions:
@@ -1032,27 +1120,162 @@ pip uninstall mcp
pip install mcp==1.8.0
# Or upgrade to latest compatible version
pip install --upgrade mcp-doris-server==0.4.2
pip install --upgrade doris-mcp-server==0.5.0
```
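To confirm which MCP version is actually installed before or after reinstalling, a small check based on the standard library's `importlib.metadata` can help. This is a sketch; the helper names `installed_version` and `is_supported` are made up for the example, and the supported series (1.8.x, 1.9.x) are taken from the answer above.

```python
from importlib import metadata


def installed_version(package: str) -> str:
    """Return the installed distribution version, or 'unknown' if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "unknown"


def is_supported(version: str, supported=("1.8", "1.9")) -> bool:
    """Check whether a version string belongs to one of the supported major.minor series."""
    major_minor = ".".join(version.split(".")[:2])
    return major_minor in supported
```

For example, `is_supported(installed_version("mcp"))` returns `False` when the package is missing or outside the 1.8.x and 1.9.x series.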
### Q: How to view server logs?
### Q: How to enable ADBC high-performance features? (New in v0.5.0)
**A:** Log files are located in the `logs/` directory. You can:
**A:** ADBC (Arrow Flight SQL) provides 3-10x performance improvements for large datasets:
1. **View real-time logs**:
1. **ADBC Dependencies** (automatically included in v0.5.0+):
```bash
tail -f logs/doris_mcp_server.log
# ADBC dependencies are now included by default in doris-mcp-server>=0.5.0
# No separate installation required
```
2. **Adjust log level**:
2. **Configure Arrow Flight SQL Ports**:
```bash
# Set in .env file
LOG_LEVEL=DEBUG
# Add to your .env file
FE_ARROW_FLIGHT_SQL_PORT=8096
BE_ARROW_FLIGHT_SQL_PORT=8097
```
3. **Enable audit logging**:
3. **Optional ADBC Customization**:
```bash
ENABLE_AUDIT=true
# Customize ADBC behavior (optional)
ADBC_DEFAULT_MAX_ROWS=200000
ADBC_DEFAULT_TIMEOUT=120
ADBC_DEFAULT_RETURN_FORMAT=pandas # arrow/pandas/dict
```
4. **Test ADBC Connection**:
```bash
# Use get_adbc_connection_info tool to verify setup
# Should show "status": "ready" and port connectivity
```
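If the tool reports that a port is unreachable, you can check connectivity yourself with a plain TCP probe. The sketch below uses only the standard library; `port_is_open` is a hypothetical helper name, and the host and port you pass should come from your own `.env` (8096 and 8097 are just the example values above).

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Calling `port_is_open("127.0.0.1", 8096)` on the FE host, for instance, tells you whether anything is listening on the example FE Arrow Flight SQL port. A successful TCP connect only shows the port is reachable, not that the Arrow Flight SQL service behind it is healthy.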
### Q: How to use the new data analytics tools? (New in v0.5.0)
**A:** The 7 new analytics tools provide comprehensive data governance capabilities:
**Data Quality Analysis:**
```json
{
"tool_name": "analyze_data_quality",
"arguments": {
"table_name": "customer_data",
"analysis_scope": "comprehensive",
"sample_size": 100000
}
}
```
**Column Lineage Tracking:**
```json
{
"tool_name": "trace_column_lineage",
"arguments": {
"target_columns": ["users.email", "orders.customer_id"],
"analysis_depth": 3
}
}
```
**Data Freshness Monitoring:**
```json
{
"tool_name": "monitor_data_freshness",
"arguments": {
"freshness_threshold_hours": 24,
"include_update_patterns": true
}
}
```
**Performance Analytics:**
```json
{
"tool_name": "analyze_slow_queries_topn",
"arguments": {
"days": 7,
"top_n": 20,
"include_patterns": true
}
}
```
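When generating such requests from a script, the payloads above can be built programmatically. The helper below is only a sketch of the `{"tool_name", "arguments"}` shape shown in these examples; `build_tool_call` is a hypothetical name, and how the payload is delivered depends on your MCP client.

```python
import json


def build_tool_call(tool_name: str, **arguments) -> str:
    """Serialize a tool call in the {"tool_name", "arguments"} shape used in the examples."""
    if not tool_name:
        raise ValueError("tool_name must be non-empty")
    return json.dumps({"tool_name": tool_name, "arguments": arguments}, indent=2)


# Same request as the Performance Analytics example above.
payload = build_tool_call(
    "analyze_slow_queries_topn", days=7, top_n=20, include_patterns=True
)
```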
### Q: How to use the enhanced logging system? (Improved in v0.5.0)
**A:** Version 0.5.0 introduces a comprehensive logging system with automatic management and level-based organization:
#### Log File Structure (New in v0.5.0):
```bash
logs/
├── doris_mcp_server_debug.log # DEBUG level messages
├── doris_mcp_server_info.log # INFO level messages
├── doris_mcp_server_warning.log # WARNING level messages
├── doris_mcp_server_error.log # ERROR level messages
├── doris_mcp_server_critical.log # CRITICAL level messages
├── doris_mcp_server_all.log # Combined log (all levels)
└── doris_mcp_server_audit.log # Audit trail (separate)
```
#### Enhanced Logging Features:
1. **Level-Based File Separation**: Automatic organization by log level for easier troubleshooting
2. **Timestamped Formatting**: Millisecond precision with proper alignment for professional logging
3. **Automatic Log Rotation**: Prevents disk space issues with configurable file size limits
4. **Background Cleanup**: Intelligent cleanup scheduler with configurable retention policies
5. **Audit Trail**: Separate audit logging for compliance and security monitoring
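To make the level-based separation concrete, here is a minimal sketch built on Python's standard `logging` module. It is not the server's implementation: `build_level_logger` and `ExactLevelFilter` are hypothetical names, the sketch assumes each per-level file receives only records of exactly that level, and the 10MB / 5-backup rotation values are taken from the cleanup notes in this FAQ.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path

LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


class ExactLevelFilter(logging.Filter):
    """Pass only records whose level matches exactly (assumption of this sketch)."""

    def __init__(self, levelno: int):
        super().__init__()
        self.levelno = levelno

    def filter(self, record: logging.LogRecord) -> bool:
        return record.levelno == self.levelno


def build_level_logger(log_dir, name="doris_mcp_server",
                       max_bytes=10 * 1024 * 1024, backup_count=5):
    """Create a logger writing one rotating file per level plus a combined file."""
    directory = Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    # Timestamp with millisecond precision and an aligned level column.
    formatter = logging.Formatter(
        "%(asctime)s.%(msecs)03d [%(levelname)-8s] %(name)s: %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    )
    for level_name in LEVELS:
        handler = RotatingFileHandler(
            directory / f"{name}_{level_name.lower()}.log",
            maxBytes=max_bytes, backupCount=backup_count, encoding="utf-8",
        )
        handler.addFilter(ExactLevelFilter(getattr(logging, level_name)))
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    # Combined log: no filter, so it receives every level.
    combined = RotatingFileHandler(
        directory / f"{name}_all.log",
        maxBytes=max_bytes, backupCount=backup_count, encoding="utf-8",
    )
    combined.setFormatter(formatter)
    logger.addHandler(combined)
    return logger
```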
#### Viewing Logs:
```bash
# View real-time logs by level
tail -f logs/doris_mcp_server_info.log # General operational info
tail -f logs/doris_mcp_server_error.log # Error tracking
tail -f logs/doris_mcp_server_debug.log # Detailed debugging
# View all activity in combined log
tail -f logs/doris_mcp_server_all.log
# Monitor specific operations
tail -f logs/doris_mcp_server_info.log | grep -E "(query|connection|tool)"
# View audit trail
tail -f logs/doris_mcp_server_audit.log
```
#### Configuration:
```bash
# Enhanced logging configuration in .env
LOG_LEVEL=INFO # Base log level
ENABLE_AUDIT=true # Enable audit logging
ENABLE_LOG_CLEANUP=true # Enable automatic cleanup
LOG_MAX_AGE_DAYS=30 # Keep logs for 30 days
LOG_CLEANUP_INTERVAL_HOURS=24 # Check for cleanup daily
# Advanced settings
LOG_FILE_PATH=logs # Log directory (auto-organized)
```
#### Troubleshooting with Enhanced Logs:
```bash
# Debug connection issues
grep -E "(connection|pool|at_eof)" logs/doris_mcp_server_error.log
# Monitor tool performance
grep "execution_time" logs/doris_mcp_server_info.log
# Check system health
tail -20 logs/doris_mcp_server_warning.log
# View recent critical issues
cat logs/doris_mcp_server_critical.log
```
#### Log Cleanup Management:
- **Automatic**: Background scheduler removes files older than `LOG_MAX_AGE_DAYS`
- **Rotation**: Logs are automatically rotated when they reach 10MB
- **Backup**: Keeps 5 backup files for each log level
- **Performance**: Minimal impact on server performance
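The age-based cleanup can be approximated with a few lines of standard-library Python, which may be useful if you want to prune logs manually or from cron. This is a sketch under the assumption that file modification time decides age; `remove_old_logs` is a hypothetical helper, not part of the server.

```python
import time
from pathlib import Path


def remove_old_logs(log_dir, max_age_days: int = 30, now=None):
    """Delete *.log* files older than max_age_days and return their names, sorted."""
    current = now if now is not None else time.time()
    cutoff = current - max_age_days * 86400
    removed = []
    for path in Path(log_dir).glob("*.log*"):
        # The pattern also matches rotated backups such as name.log.1
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```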
For other issues, please check GitHub Issues or submit a new issue.


@@ -28,26 +28,183 @@ import json
import logging
from typing import Any
# MCP version compatibility check
try:
# MCP version compatibility handling
MCP_VERSION = 'unknown'
Server = None
InitializationOptions = None
Prompt = None
Resource = None
TextContent = None
Tool = None
def _import_mcp_with_compatibility():
"""Import MCP components with multi-version compatibility"""
global MCP_VERSION, Server, InitializationOptions, Prompt, Resource, TextContent, Tool
try:
# Strategy 1: Try direct server-only imports to avoid client-side issues
from mcp.server import Server as _Server
from mcp.server.models import InitializationOptions as _InitOptions
from mcp.types import (
Prompt as _Prompt,
Resource as _Resource,
TextContent as _TextContent,
Tool as _Tool,
)
# Assign to globals
Server = _Server
InitializationOptions = _InitOptions
Prompt = _Prompt
Resource = _Resource
TextContent = _TextContent
Tool = _Tool
# Try to get version safely
try:
import mcp
MCP_VERSION = getattr(mcp, '__version__', 'unknown')
logger = logging.getLogger(__name__)
logger.info(f"Using MCP version: {MCP_VERSION}")
except Exception as e:
logger = logging.getLogger(__name__)
logger.warning(f"Could not determine MCP version: {e}")
MCP_VERSION = 'unknown'
MCP_VERSION = getattr(mcp, '__version__', None)
if not MCP_VERSION:
# Fallback: try to get version from package metadata
try:
import importlib.metadata
MCP_VERSION = importlib.metadata.version('mcp')
except Exception:
# Second fallback: try pkg_resources
try:
import pkg_resources
MCP_VERSION = pkg_resources.get_distribution('mcp').version
except Exception:
MCP_VERSION = 'detected-but-version-unknown'
except Exception:
# Version detection failed, but imports worked
try:
import importlib.metadata
MCP_VERSION = importlib.metadata.version('mcp')
except Exception:
try:
import pkg_resources
MCP_VERSION = pkg_resources.get_distribution('mcp').version
except Exception:
MCP_VERSION = 'imported-successfully'
from mcp.server import Server
from mcp.server.models import InitializationOptions
logger = logging.getLogger(__name__)
logger.info(f"MCP components imported successfully, version: {MCP_VERSION}")
return True
from mcp.types import (
Prompt,
Resource,
TextContent,
Tool,
)
except Exception as import_error:
logger = logging.getLogger(__name__)
# Strategy 2: Handle RequestContext compatibility issues in 1.9.x versions
error_str = str(import_error).lower()
if 'requestcontext' in error_str and 'too few arguments' in error_str:
logger.warning(f"Detected MCP RequestContext compatibility issue: {import_error}")
logger.info("Attempting comprehensive workaround for MCP 1.9.x RequestContext issue...")
try:
# Comprehensive monkey patch approach
import sys
import types
# Create and install mock modules before any MCP imports
if 'mcp.shared.context' not in sys.modules:
mock_context_module = types.ModuleType('mcp.shared.context')
class FlexibleRequestContext:
"""Flexible RequestContext that accepts variable arguments"""
def __init__(self, *args, **kwargs):
self.args = args
self.kwargs = kwargs
def __class_getitem__(cls, params):
# Accept any number of parameters and return cls
return cls
# Add other methods that might be called
def __getattr__(self, name):
return lambda *args, **kwargs: None
mock_context_module.RequestContext = FlexibleRequestContext
sys.modules['mcp.shared.context'] = mock_context_module
# Also patch the typing system to be more permissive
original_check_generic = None
try:
import typing
if hasattr(typing, '_check_generic'):
original_check_generic = typing._check_generic
def permissive_check_generic(cls, params, elen):
# Don't enforce strict parameter count checking
return
typing._check_generic = permissive_check_generic
except Exception:
pass
# Clear any cached imports that might have failed
modules_to_clear = [k for k in sys.modules.keys() if k.startswith('mcp.')]
for module in modules_to_clear:
if module in sys.modules:
del sys.modules[module]
# Now try importing again with the patches in place
from mcp.server import Server as _Server
from mcp.server.models import InitializationOptions as _InitOptions
from mcp.types import (
Prompt as _Prompt,
Resource as _Resource,
TextContent as _TextContent,
Tool as _Tool,
)
# Assign to globals
Server = _Server
InitializationOptions = _InitOptions
Prompt = _Prompt
Resource = _Resource
TextContent = _TextContent
Tool = _Tool
# Try to detect actual version even in compatibility mode
try:
import importlib.metadata
actual_version = importlib.metadata.version('mcp')
MCP_VERSION = f'compatibility-mode-{actual_version}'
except Exception:
try:
import pkg_resources
actual_version = pkg_resources.get_distribution('mcp').version
MCP_VERSION = f'compatibility-mode-{actual_version}'
except Exception:
MCP_VERSION = 'compatibility-mode-1.9.x'
logger.info("MCP 1.9.x compatibility workaround successful!")
# Restore original typing function if we patched it
if original_check_generic:
typing._check_generic = original_check_generic
return True
except Exception as workaround_error:
logger.error(f"MCP compatibility workaround failed: {workaround_error}")
# Restore original typing function if we patched it
if original_check_generic:
try:
import typing
typing._check_generic = original_check_generic
except Exception:
pass
logger.error(f"Failed to import MCP components: {import_error}")
return False
# Perform MCP import with compatibility handling
if not _import_mcp_with_compatibility():
raise ImportError(
"Failed to import MCP components. Please ensure MCP is properly installed. "
"Supported versions: 1.8.x, 1.9.x"
)
from .tools.tools_manager import DorisToolsManager
from .tools.prompts_manager import DorisPromptsManager
@@ -57,8 +214,7 @@ from .utils.db import DorisConnectionManager
from .utils.security import DorisSecurityManager
import os
# Configure logging
logging.basicConfig(level=logging.INFO)
# Configure logging - will be properly initialized later
logger = logging.getLogger(__name__)
# Create a default config instance for getting default values
@@ -83,7 +239,9 @@ class DorisServer:
self.tools_manager = DorisToolsManager(self.connection_manager)
self.prompts_manager = DorisPromptsManager(self.connection_manager)
self.logger = logging.getLogger(f"{__name__}.DorisServer")
# Import here to avoid circular imports
from .utils.logger import get_logger
self.logger = get_logger(f"{__name__}.DorisServer")
self._setup_handlers()
def _get_mcp_capabilities(self):
@@ -234,8 +392,16 @@ class DorisServer:
await self.connection_manager.initialize()
self.logger.info("Connection manager initialization completed")
# Start stdio server - using simpler approach
# Start stdio server - using compatible import approach
try:
from mcp.server.stdio import stdio_server
except ImportError:
# Fallback for different MCP versions
try:
from mcp.server import stdio_server
except ImportError as stdio_import_error:
self.logger.error(f"Failed to import stdio_server: {stdio_import_error}")
raise RuntimeError("stdio_server module not available in this MCP version")
self.logger.info("Creating stdio_server transport...")
@@ -452,6 +618,11 @@ Transport Modes:
Examples:
python -m doris_mcp_server --transport stdio
python -m doris_mcp_server --transport http --host 0.0.0.0 --port 3000
python -m doris_mcp_server --transport stdio --doris-host localhost --doris-port 9030
python -m doris_mcp_server --transport http --doris-user admin --doris-database test_db
# Backward compatibility: --db-* parameters are also supported
python -m doris_mcp_server --transport stdio --db-host localhost --db-port 9030
"""
)
@@ -475,26 +646,26 @@ Examples:
)
parser.add_argument(
"--db-host",
"--doris-host", "--db-host",
type=str,
default=os.getenv("DB_HOST", _default_config.database.host),
default=os.getenv("DORIS_HOST", _default_config.database.host),
help=f"Doris database host address (default: {_default_config.database.host})",
)
parser.add_argument(
"--db-port", type=int, default=os.getenv("DB_PORT", _default_config.database.port), help=f"Doris database port number (default: {_default_config.database.port})"
"--doris-port", "--db-port", type=int, default=os.getenv("DORIS_PORT", _default_config.database.port), help=f"Doris database port number (default: {_default_config.database.port})"
)
parser.add_argument(
"--db-user", type=str, default=os.getenv("DB_USER", _default_config.database.user), help=f"Doris database username (default: {_default_config.database.user})"
"--doris-user", "--db-user", type=str, default=os.getenv("DORIS_USER", _default_config.database.user), help=f"Doris database username (default: {_default_config.database.user})"
)
parser.add_argument("--db-password", type=str, default="", help="Doris database password")
parser.add_argument("--doris-password", "--db-password", type=str, default=os.getenv("DORIS_PASSWORD", ""), help="Doris database password")
parser.add_argument(
"--db-database",
"--doris-database", "--db-database",
type=str,
default=os.getenv("DB_DATABASE", _default_config.database.database),
default=os.getenv("DORIS_DATABASE", _default_config.database.database),
help=f"Doris database name (default: {_default_config.database.database})",
)
@@ -514,26 +685,42 @@ async def main():
parser = create_arg_parser()
args = parser.parse_args()
# Set log level
logging.getLogger().setLevel(getattr(logging, args.log_level))
# Create configuration - priority: command line arguments > .env file > default values
config = DorisConfig.from_env() # First load from .env file and environment variables
# Command line arguments override configuration (if provided)
if args.db_host != _default_config.database.host: # If not default value, use command line argument
config.database.host = args.db_host
if args.db_port != _default_config.database.port:
config.database.port = args.db_port
if args.db_user != _default_config.database.user:
config.database.user = args.db_user
if args.db_password: # Use password if provided
config.database.password = args.db_password
if args.db_database != _default_config.database.database:
config.database.database = args.db_database
# 🔧 FIX: Set transport from command line arguments
config.transport = args.transport
if args.doris_host != _default_config.database.host: # If not default value, use command line argument
config.database.host = args.doris_host
if args.doris_port != _default_config.database.port:
config.database.port = args.doris_port
if args.doris_user != _default_config.database.user:
config.database.user = args.doris_user
if args.doris_password: # Use password if provided
config.database.password = args.doris_password
if args.doris_database != _default_config.database.database:
config.database.database = args.doris_database
if args.log_level != _default_config.logging.level:
config.logging.level = args.log_level
# Initialize enhanced logging system
from .utils.config import ConfigManager
config_manager = ConfigManager(config)
config_manager.setup_logging()
# Get logger with proper configuration
from .utils.logger import get_logger, log_system_info
logger = get_logger(__name__)
# Log system information for debugging
log_system_info()
logger.info("Starting Doris MCP Server...")
logger.info(f"Transport: {args.transport}")
logger.info(f"Log Level: {config.logging.level}")
# Create server instance
server = DorisServer(config)
@@ -564,6 +751,10 @@ async def main():
except Exception as shutdown_error:
logger.error(f"Error occurred while shutting down server: {shutdown_error}")
# Shutdown logging system
from .utils.logger import shutdown_logging
shutdown_logging()
return 0

File diff suppressed because it is too large


@@ -0,0 +1,526 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
"""
Apache Doris ADBC Query Tools
High-performance data querying using Apache Arrow Flight SQL protocol
"""
import os
import socket
import time
from datetime import datetime
from typing import Any, Dict, List, Optional
from ..utils.logger import get_logger
from ..utils.db import DorisConnectionManager
logger = get_logger(__name__)
def _convert_numpy_types(obj):
    """Convert numpy types to native Python types for JSON serialization"""
    try:
        # Import numpy only when needed
        import numpy as np
        import pandas as pd

        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.bool_):
            return bool(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        elif isinstance(obj, (pd.Timestamp, pd.NaT.__class__)):
            return str(obj)
        elif pd.api.types.is_scalar(obj) and pd.isna(obj):
            # is_scalar guard: pd.isna() on a list returns an array,
            # whose truth value is ambiguous
            return None
        else:
            return obj
    except ImportError:
        # If numpy/pandas not available, return as-is
        return obj

def _convert_dataframe_to_json_serializable(df):
    """Convert DataFrame to JSON serializable format"""
    try:
        import pandas as pd
        import numpy as np

        # Convert DataFrame to records
        records = df.to_dict('records')

        # Convert each record's values
        converted_records = []
        for record in records:
            converted_record = {}
            for key, value in record.items():
                converted_record[key] = _convert_numpy_types(value)
            converted_records.append(converted_record)
        return converted_records
    except ImportError:
        # Fallback to basic dict conversion
        return df.to_dict('records')

class DorisADBCQueryTools:
    """ADBC Query Tools for high-performance data transfer using Arrow Flight SQL"""

    def __init__(self, connection_manager: DorisConnectionManager):
        self.connection_manager = connection_manager
        self.adbc_client = None
        self.flight_sql_module = None
        self.adbc_manager_module = None

    async def exec_adbc_query(
        self,
        sql: str,
        max_rows: int | None = None,
        timeout: int | None = None,
        return_format: str | None = None
    ) -> Dict[str, Any]:
        """
        Execute SQL query using ADBC (Arrow Flight SQL) protocol

        Args:
            sql: SQL statement to execute
            max_rows: Maximum number of rows to return (uses config default if None)
            timeout: Query timeout in seconds (uses config default if None)
            return_format: Format for returned data ("arrow", "pandas", "dict", uses config default if None)

        Returns:
            Query results in specified format with metadata
        """
        try:
            start_time = time.time()

            # Use configuration defaults if parameters not specified
            adbc_config = self.connection_manager.config.adbc
            max_rows = max_rows if max_rows is not None else adbc_config.default_max_rows
            timeout = timeout if timeout is not None else adbc_config.default_timeout
            return_format = return_format if return_format is not None else adbc_config.default_return_format

            # Step 1: Check environment variables and port availability
            port_check_result = await self._check_arrow_flight_ports()
            if not port_check_result["success"]:
                return port_check_result

            # Step 2: Import required ADBC modules
            import_result = await self._import_adbc_modules()
            if not import_result["success"]:
                return import_result

            # Step 3: Create ADBC connection
            connection_result = await self._create_adbc_connection()
            if not connection_result["success"]:
                return connection_result

            # Step 4: Execute query using ADBC
            query_result = await self._execute_query_with_adbc(
                sql, max_rows, timeout, return_format
            )

            execution_time = time.time() - start_time
            if query_result["success"]:
                query_result["execution_time"] = round(execution_time, 3)
                query_result["protocol"] = "ADBC_Arrow_Flight_SQL"
                query_result["timestamp"] = datetime.now().isoformat()
            return query_result
        except Exception as e:
            logger.error(f"ADBC query execution failed: {str(e)}")
            return {
                "success": False,
                "error": f"ADBC query execution failed: {str(e)}",
                "error_type": "execution_error",
                "timestamp": datetime.now().isoformat()
            }

    async def _check_arrow_flight_ports(self) -> Dict[str, Any]:
        """Check Arrow Flight SQL port configuration and availability"""
        try:
            # Check environment variables
            fe_port = os.getenv("FE_ARROW_FLIGHT_SQL_PORT")
            be_port = os.getenv("BE_ARROW_FLIGHT_SQL_PORT")

            if not fe_port:
                return {
                    "success": False,
                    "error": "Missing environment variable FE_ARROW_FLIGHT_SQL_PORT, please configure Arrow Flight SQL FE port in .env file",
                    "error_type": "missing_fe_port_config"
                }
            if not be_port:
                return {
                    "success": False,
                    "error": "Missing environment variable BE_ARROW_FLIGHT_SQL_PORT, please configure Arrow Flight SQL BE port in .env file",
                    "error_type": "missing_be_port_config"
                }

            # Convert to integer and validate
            try:
                fe_port = int(fe_port)
                be_port = int(be_port)
            except ValueError:
                return {
                    "success": False,
                    "error": "Invalid Arrow Flight SQL port configuration, please ensure FE_ARROW_FLIGHT_SQL_PORT and BE_ARROW_FLIGHT_SQL_PORT are valid numbers",
                    "error_type": "invalid_port_format"
                }

            # Get host address
            db_config = self.connection_manager.config.database
            fe_host = db_config.host

            # Check FE Arrow Flight SQL port availability
            fe_available = self._check_port_connectivity(fe_host, fe_port)
            if not fe_available:
                return {
                    "success": False,
                    "error": f"Cannot connect to FE Arrow Flight SQL port {fe_host}:{fe_port}, please check if service is running",
                    "error_type": "fe_port_unavailable",
                    "fe_host": fe_host,
                    "fe_port": fe_port
                }

            # Get BE host list
            be_hosts = await self._get_be_hosts()
            if not be_hosts:
                return {
                    "success": False,
                    "error": "Cannot get BE node information, please check cluster status",
                    "error_type": "no_be_hosts"
                }

            # Check at least one BE Arrow Flight SQL port availability
            be_available_count = 0
            be_check_results = []
            for be_host in be_hosts[:3]:  # Check first 3 BE nodes
                be_available = self._check_port_connectivity(be_host, be_port)
                be_check_results.append({
                    "host": be_host,
                    "port": be_port,
                    "available": be_available
                })
                if be_available:
                    be_available_count += 1

            if be_available_count == 0:
                return {
                    "success": False,
                    "error": f"Cannot connect to any BE Arrow Flight SQL port (port: {be_port}), please check if BE services are running",
                    "error_type": "no_be_ports_available",
                    "be_check_results": be_check_results
                }

            return {
                "success": True,
                "fe_host": fe_host,
                "fe_port": fe_port,
                "be_port": be_port,
                "be_hosts": be_hosts,
                "be_available_count": be_available_count,
                "be_check_results": be_check_results
            }
        except Exception as e:
            logger.error(f"Arrow Flight port check failed: {str(e)}")
            return {
                "success": False,
                "error": f"Arrow Flight port check failed: {str(e)}",
                "error_type": "port_check_error"
            }

    def _check_port_connectivity(self, host: str, port: int, timeout: int | None = None) -> bool:
        """Check port connectivity"""
        try:
            # Use config timeout if not specified
            if timeout is None:
                timeout = self.connection_manager.config.adbc.connection_timeout
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except (socket.timeout, socket.error, OSError):
            return False

    async def _get_be_hosts(self) -> List[str]:
        """Get BE host list"""
        try:
            db_config = self.connection_manager.config.database

            # Use configured BE hosts first
            if db_config.be_hosts:
                logger.info(f"Using configured BE hosts: {db_config.be_hosts}")
                return db_config.be_hosts

            # Get BE nodes via SHOW BACKENDS
            logger.info("No BE hosts configured, getting BE node information via SHOW BACKENDS")
            connection = await self.connection_manager.get_connection("query")
            result = await connection.execute("SHOW BACKENDS")

            be_hosts = []
            for row in result.data:
                host = row.get("Host")
                alive = row.get("Alive", "").lower()
                if host and alive == "true":
                    be_hosts.append(host)

            logger.info(f"Got {len(be_hosts)} active BE nodes from SHOW BACKENDS")
            return be_hosts
        except Exception as e:
            logger.error(f"Failed to get BE hosts: {str(e)}")
            return []

    async def _import_adbc_modules(self) -> Dict[str, Any]:
        """Import ADBC related modules"""
        try:
            # Import ADBC Driver Manager
            try:
                import adbc_driver_manager
                self.adbc_manager_module = adbc_driver_manager
            except ImportError:
                return {
                    "success": False,
                    "error": "Missing adbc_driver_manager module, please install: pip install adbc_driver_manager",
                    "error_type": "missing_adbc_manager"
                }

            # Import ADBC Flight SQL Driver
            try:
                import adbc_driver_flightsql.dbapi as flight_sql
                self.flight_sql_module = flight_sql
            except ImportError:
                return {
                    "success": False,
                    "error": "Missing adbc_driver_flightsql module, please install: pip install adbc_driver_flightsql",
                    "error_type": "missing_flight_sql_driver"
                }

            return {
                "success": True,
                "adbc_manager_version": getattr(adbc_driver_manager, '__version__', 'unknown'),
                "flight_sql_version": getattr(flight_sql, '__version__', 'unknown')
            }
        except Exception as e:
            logger.error(f"ADBC module import failed: {str(e)}")
            return {
                "success": False,
                "error": f"ADBC module import failed: {str(e)}",
                "error_type": "import_error"
            }

    async def _create_adbc_connection(self) -> Dict[str, Any]:
        """Create ADBC connection"""
        try:
            db_config = self.connection_manager.config.database
            fe_port = int(os.getenv("FE_ARROW_FLIGHT_SQL_PORT"))

            # Build connection URI
            uri = f"grpc://{db_config.host}:{fe_port}"

            # Create database connection parameters
            db_kwargs = {
                self.adbc_manager_module.DatabaseOptions.USERNAME.value: db_config.user,
                self.adbc_manager_module.DatabaseOptions.PASSWORD.value: db_config.password,
            }

            # Create connection
            self.adbc_client = self.flight_sql_module.connect(
                uri=uri,
                db_kwargs=db_kwargs
            )

            return {
                "success": True,
                "uri": uri,
                "connection_established": True
            }
        except Exception as e:
            logger.error(f"Failed to create ADBC connection: {str(e)}")
            return {
                "success": False,
                "error": f"Failed to create ADBC connection: {str(e)}",
                "error_type": "connection_error"
            }

    async def _execute_query_with_adbc(
        self,
        sql: str,
        max_rows: int,
        timeout: int,
        return_format: str
    ) -> Dict[str, Any]:
        """Execute query using ADBC"""
        try:
            if not self.adbc_client:
                return {
                    "success": False,
                    "error": "ADBC connection not established",
                    "error_type": "no_connection"
                }

            cursor = self.adbc_client.cursor()
            start_time = time.time()

            # Execute query
            cursor.execute(sql)

            # Get results based on return format
            if return_format == "arrow":
                # Return Arrow format
                arrow_data = cursor.fetchallarrow()
                # Limit rows
                if len(arrow_data) > max_rows:
                    arrow_data = arrow_data.slice(0, max_rows)
                # Convert Arrow data to serializable format
                preview_df = arrow_data.to_pandas().head(10) if len(arrow_data) > 0 else None
                result_data = {
                    "format": "arrow",
                    "num_rows": len(arrow_data),
                    "num_columns": len(arrow_data.schema),
                    "column_names": arrow_data.schema.names,
                    "column_types": [str(field.type) for field in arrow_data.schema],
                    "data_preview": _convert_dataframe_to_json_serializable(preview_df) if preview_df is not None else [],
                    "total_bytes": arrow_data.nbytes if hasattr(arrow_data, 'nbytes') else 0
                }
            elif return_format == "pandas":
                # Return Pandas DataFrame
                df = cursor.fetch_df()
                # Limit rows
                if len(df) > max_rows:
                    df = df.head(max_rows)
                result_data = {
                    "format": "pandas",
                    "num_rows": len(df),
                    "num_columns": len(df.columns),
                    "column_names": df.columns.tolist(),
                    "column_types": df.dtypes.astype(str).tolist(),
                    "data": _convert_dataframe_to_json_serializable(df),
                    "memory_usage": int(df.memory_usage(deep=True).sum())
                }
            else:  # return_format == "dict"
                # Return dictionary format
                arrow_data = cursor.fetchallarrow()
                df = arrow_data.to_pandas()
                # Limit rows
                if len(df) > max_rows:
                    df = df.head(max_rows)
                result_data = {
                    "format": "dict",
                    "num_rows": len(df),
                    "num_columns": len(df.columns),
                    "column_names": df.columns.tolist(),
                    "column_types": df.dtypes.astype(str).tolist(),
                    "data": _convert_dataframe_to_json_serializable(df)
                }

            execution_time = time.time() - start_time
            cursor.close()

            return {
                "success": True,
                "result": result_data,
                "execution_time": round(execution_time, 3),
                "sql": sql,
                # Use num_rows so the flag also works for the "arrow" format,
                # which has no "data" key
                "max_rows_applied": result_data["num_rows"] >= max_rows
            }
        except Exception as e:
            logger.error(f"ADBC query execution failed: {str(e)}")
            return {
                "success": False,
                "error": f"ADBC query execution failed: {str(e)}",
                "error_type": "query_execution_error",
                "sql": sql
            }

    async def get_adbc_connection_info(self) -> Dict[str, Any]:
        """Get ADBC connection information and status"""
        try:
            # Check port status
            port_status = await self._check_arrow_flight_ports()

            # Check module status
            module_status = await self._import_adbc_modules()

            # Get configuration information
            db_config = self.connection_manager.config.database
            fe_port = os.getenv("FE_ARROW_FLIGHT_SQL_PORT")
            be_port = os.getenv("BE_ARROW_FLIGHT_SQL_PORT")

            connection_info = {
                "adbc_available": module_status["success"],
                "ports_available": port_status["success"],
                "configuration": {
                    "fe_host": db_config.host,
                    "fe_arrow_flight_port": fe_port,
                    "be_arrow_flight_port": be_port,
                    "user": db_config.user
                },
                "port_status": port_status,
                "module_status": module_status,
                "timestamp": datetime.now().isoformat()
            }

            if port_status["success"] and module_status["success"]:
                connection_info["status"] = "ready"
                connection_info["message"] = "ADBC Arrow Flight SQL connection ready"
            else:
                connection_info["status"] = "not_ready"
                errors = []
                if not port_status["success"]:
                    errors.append(port_status["error"])
                if not module_status["success"]:
                    errors.append(module_status["error"])
                connection_info["message"] = "; ".join(errors)

            return connection_info
        except Exception as e:
            logger.error(f"Failed to get ADBC connection information: {str(e)}")
            return {
                "status": "error",
                "error": f"Failed to get ADBC connection information: {str(e)}",
                "timestamp": datetime.now().isoformat()
            }

def __del__(self):
"""Cleanup resources"""
try:
if self.adbc_client:
self.adbc_client.close()
except Exception:
pass
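
The readiness logic in `get_adbc_connection_info` above can be exercised in isolation. The following is a minimal sketch, assuming both check results are dicts with a `success` flag and an optional `error` string, as the code above suggests. `summarize_readiness` is a hypothetical helper and the error text is made up for illustration; neither is part of the repository. Unlike the original, the sketch tolerates a missing `error` key.

```python
from typing import Any, Dict, List


def summarize_readiness(port_status: Dict[str, Any], module_status: Dict[str, Any]) -> Dict[str, str]:
    """Combine two check results into a single status/message pair."""
    if port_status["success"] and module_status["success"]:
        return {"status": "ready", "message": "ADBC Arrow Flight SQL connection ready"}

    # Collect one error string per failed check, ports first, then modules.
    errors: List[str] = []
    for check in (port_status, module_status):
        if not check["success"]:
            errors.append(check.get("error", "unknown error"))
    return {"status": "not_ready", "message": "; ".join(errors)}


ok = {"success": True}
bad_port = {"success": False, "error": "Arrow Flight port not configured"}  # illustrative message
print(summarize_readiness(ok, ok))
print(summarize_readiness(bad_port, ok))
```

Keeping the aggregation in a pure function like this would make the "ready" / "not_ready" branches testable without a live Doris cluster.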


@@ -32,6 +32,8 @@ try:
except ImportError:
load_dotenv = None
from .logger import get_logger
@dataclass
class DatabaseConfig:
@@ -52,13 +54,24 @@ class DatabaseConfig:
be_hosts: list[str] = field(default_factory=list)
be_webserver_port: int = 8040
# Arrow Flight SQL Configuration (Required for ADBC tools)
fe_arrow_flight_sql_port: int | None = None
be_arrow_flight_sql_port: int | None = None
# Connection pool configuration
min_connections: int = 5
# Note: min_connections is fixed at 0 to avoid at_eof connection issues
# This prevents pre-creation of connections which can cause state problems
_min_connections: int = field(default=0, init=False) # Internal use only, always 0
max_connections: int = 20
connection_timeout: int = 30
health_check_interval: int = 60
max_connection_age: int = 3600
@property
def min_connections(self) -> int:
"""Minimum connections is always 0 to prevent at_eof issues"""
return self._min_connections
@dataclass
class SecurityConfig:
@@ -124,6 +137,49 @@ class PerformanceConfig:
max_response_content_size: int = 4096
@dataclass
class DataQualityConfig:
"""Data quality analysis configuration"""
# Column analysis configuration
max_columns_per_batch: int = 20 # Maximum columns to analyze in a single batch
default_sample_size: int = 100000 # Default sample size for analysis
# Sampling strategy configuration
small_table_threshold: int = 100000 # Tables smaller than this use full table analysis
medium_table_threshold: int = 1000000 # Tables smaller than this use simple LIMIT sampling
# Tables larger than medium_table_threshold use systematic sampling
# Performance optimization
enable_batch_analysis: bool = True # Enable batch analysis for multiple columns
batch_timeout: int = 300 # Timeout for batch analysis in seconds
# Accuracy vs Performance trade-off
enable_fast_mode: bool = False # Use approximate algorithms for faster results
fast_mode_sample_size: int = 10000 # Sample size for fast mode
# Statistical analysis configuration
enable_distribution_analysis: bool = True # Enable distribution analysis
histogram_bins: int = 20 # Number of bins for histogram analysis
percentile_levels: list[float] = field(default_factory=lambda: [0.25, 0.5, 0.75, 0.95, 0.99]) # Percentile levels to calculate
@dataclass
class ADBCConfig:
"""ADBC (Arrow Flight SQL) configuration"""
# Default query parameters
default_max_rows: int = 100000
default_timeout: int = 60
default_return_format: str = "arrow" # "arrow", "pandas", "dict"
# Connection timeout for ADBC
connection_timeout: int = 30
# Whether to enable ADBC tools
enabled: bool = True
@dataclass
class LoggingConfig:
"""Logging configuration"""
@@ -138,6 +194,11 @@ class LoggingConfig:
enable_audit: bool = True
audit_file_path: str | None = None
# Log cleanup configuration
enable_cleanup: bool = True
max_age_days: int = 30
cleanup_interval_hours: int = 24
@dataclass
class MonitoringConfig:
@@ -174,8 +235,10 @@ class DorisConfig:
database: DatabaseConfig = field(default_factory=DatabaseConfig)
security: SecurityConfig = field(default_factory=SecurityConfig)
performance: PerformanceConfig = field(default_factory=PerformanceConfig)
data_quality: DataQualityConfig = field(default_factory=DataQualityConfig)
logging: LoggingConfig = field(default_factory=LoggingConfig)
monitoring: MonitoringConfig = field(default_factory=MonitoringConfig)
adbc: ADBCConfig = field(default_factory=ADBCConfig)
# Custom configuration
custom_config: dict[str, Any] = field(default_factory=dict)
@@ -247,10 +310,16 @@ class DorisConfig:
config.database.be_hosts = [host.strip() for host in be_hosts_env.split(",") if host.strip()]
config.database.be_webserver_port = int(os.getenv("DORIS_BE_WEBSERVER_PORT", str(config.database.be_webserver_port)))
# Arrow Flight SQL Configuration
fe_arrow_port_env = os.getenv("FE_ARROW_FLIGHT_SQL_PORT")
if fe_arrow_port_env:
config.database.fe_arrow_flight_sql_port = int(fe_arrow_port_env)
be_arrow_port_env = os.getenv("BE_ARROW_FLIGHT_SQL_PORT")
if be_arrow_port_env:
config.database.be_arrow_flight_sql_port = int(be_arrow_port_env)
# Connection pool configuration
config.database.min_connections = int(
os.getenv("DORIS_MIN_CONNECTIONS", str(config.database.min_connections))
)
config.database.max_connections = int(
os.getenv("DORIS_MAX_CONNECTIONS", str(config.database.max_connections))
)
@@ -323,6 +392,15 @@ class DorisConfig:
os.getenv("ENABLE_AUDIT", str(config.logging.enable_audit).lower()).lower() == "true"
)
config.logging.audit_file_path = os.getenv("AUDIT_FILE_PATH", config.logging.audit_file_path)
config.logging.enable_cleanup = (
os.getenv("ENABLE_LOG_CLEANUP", str(config.logging.enable_cleanup).lower()).lower() == "true"
)
config.logging.max_age_days = int(
os.getenv("LOG_MAX_AGE_DAYS", str(config.logging.max_age_days))
)
config.logging.cleanup_interval_hours = int(
os.getenv("LOG_CLEANUP_INTERVAL_HOURS", str(config.logging.cleanup_interval_hours))
)
# Monitoring configuration
config.monitoring.enable_metrics = (
@@ -339,6 +417,53 @@ class DorisConfig:
)
config.monitoring.alert_webhook_url = os.getenv("ALERT_WEBHOOK_URL", config.monitoring.alert_webhook_url)
# ADBC configuration
config.adbc.default_max_rows = int(
os.getenv("ADBC_DEFAULT_MAX_ROWS", str(config.adbc.default_max_rows))
)
config.adbc.default_timeout = int(
os.getenv("ADBC_DEFAULT_TIMEOUT", str(config.adbc.default_timeout))
)
config.adbc.default_return_format = os.getenv("ADBC_DEFAULT_RETURN_FORMAT", config.adbc.default_return_format)
config.adbc.connection_timeout = int(
os.getenv("ADBC_CONNECTION_TIMEOUT", str(config.adbc.connection_timeout))
)
config.adbc.enabled = (
os.getenv("ADBC_ENABLED", str(config.adbc.enabled).lower()).lower() == "true"
)
# Data quality configuration
config.data_quality.max_columns_per_batch = int(
os.getenv("DATA_QUALITY_MAX_COLUMNS_PER_BATCH", str(config.data_quality.max_columns_per_batch))
)
config.data_quality.default_sample_size = int(
os.getenv("DATA_QUALITY_DEFAULT_SAMPLE_SIZE", str(config.data_quality.default_sample_size))
)
config.data_quality.small_table_threshold = int(
os.getenv("DATA_QUALITY_SMALL_TABLE_THRESHOLD", str(config.data_quality.small_table_threshold))
)
config.data_quality.medium_table_threshold = int(
os.getenv("DATA_QUALITY_MEDIUM_TABLE_THRESHOLD", str(config.data_quality.medium_table_threshold))
)
config.data_quality.enable_batch_analysis = (
os.getenv("DATA_QUALITY_ENABLE_BATCH_ANALYSIS", str(config.data_quality.enable_batch_analysis).lower()).lower() == "true"
)
config.data_quality.batch_timeout = int(
os.getenv("DATA_QUALITY_BATCH_TIMEOUT", str(config.data_quality.batch_timeout))
)
config.data_quality.enable_fast_mode = (
os.getenv("DATA_QUALITY_ENABLE_FAST_MODE", str(config.data_quality.enable_fast_mode).lower()).lower() == "true"
)
config.data_quality.fast_mode_sample_size = int(
os.getenv("DATA_QUALITY_FAST_MODE_SAMPLE_SIZE", str(config.data_quality.fast_mode_sample_size))
)
config.data_quality.enable_distribution_analysis = (
os.getenv("DATA_QUALITY_ENABLE_DISTRIBUTION_ANALYSIS", str(config.data_quality.enable_distribution_analysis).lower()).lower() == "true"
)
config.data_quality.histogram_bins = int(
os.getenv("DATA_QUALITY_HISTOGRAM_BINS", str(config.data_quality.histogram_bins))
)
# Server configuration
config.server_name = os.getenv("SERVER_NAME", config.server_name)
config.server_version = os.getenv("SERVER_VERSION", config.server_version)
@@ -378,6 +503,13 @@ class DorisConfig:
if hasattr(config.performance, key):
setattr(config.performance, key, value)
# Update data quality configuration
if "data_quality" in config_data:
dq_config = config_data["data_quality"]
for key, value in dq_config.items():
if hasattr(config.data_quality, key):
setattr(config.data_quality, key, value)
# Update logging configuration
if "logging" in config_data:
log_config = config_data["logging"]
@@ -392,6 +524,13 @@ class DorisConfig:
if hasattr(config.monitoring, key):
setattr(config.monitoring, key, value)
# Update ADBC configuration
if "adbc" in config_data:
adbc_config = config_data["adbc"]
for key, value in adbc_config.items():
if hasattr(config.adbc, key):
setattr(config.adbc, key, value)
# Custom configuration
config.custom_config = config_data.get("custom", {})
@@ -414,7 +553,9 @@ class DorisConfig:
"fe_http_port": self.database.fe_http_port,
"be_hosts": self.database.be_hosts,
"be_webserver_port": self.database.be_webserver_port,
"min_connections": self.database.min_connections,
"fe_arrow_flight_sql_port": self.database.fe_arrow_flight_sql_port,
"be_arrow_flight_sql_port": self.database.be_arrow_flight_sql_port,
"min_connections": self.database.min_connections, # Always 0, shown for reference
"max_connections": self.database.max_connections,
"connection_timeout": self.database.connection_timeout,
"health_check_interval": self.database.health_check_interval,
@@ -442,6 +583,19 @@ class DorisConfig:
"idle_timeout": self.performance.idle_timeout,
"max_response_content_size": self.performance.max_response_content_size,
},
"data_quality": {
"max_columns_per_batch": self.data_quality.max_columns_per_batch,
"default_sample_size": self.data_quality.default_sample_size,
"small_table_threshold": self.data_quality.small_table_threshold,
"medium_table_threshold": self.data_quality.medium_table_threshold,
"enable_batch_analysis": self.data_quality.enable_batch_analysis,
"batch_timeout": self.data_quality.batch_timeout,
"enable_fast_mode": self.data_quality.enable_fast_mode,
"fast_mode_sample_size": self.data_quality.fast_mode_sample_size,
"enable_distribution_analysis": self.data_quality.enable_distribution_analysis,
"histogram_bins": self.data_quality.histogram_bins,
"percentile_levels": self.data_quality.percentile_levels,
},
"logging": {
"level": self.logging.level,
"format": self.logging.format,
@@ -450,6 +604,9 @@ class DorisConfig:
"backup_count": self.logging.backup_count,
"enable_audit": self.logging.enable_audit,
"audit_file_path": self.logging.audit_file_path,
"enable_cleanup": self.logging.enable_cleanup,
"max_age_days": self.logging.max_age_days,
"cleanup_interval_hours": self.logging.cleanup_interval_hours,
},
"monitoring": {
"enable_metrics": self.monitoring.enable_metrics,
@@ -460,6 +617,13 @@ class DorisConfig:
"enable_alerts": self.monitoring.enable_alerts,
"alert_webhook_url": self.monitoring.alert_webhook_url,
},
"adbc": {
"default_max_rows": self.adbc.default_max_rows,
"default_timeout": self.adbc.default_timeout,
"default_return_format": self.adbc.default_return_format,
"connection_timeout": self.adbc.connection_timeout,
"enabled": self.adbc.enabled,
},
"custom": self.custom_config,
}
@@ -492,11 +656,8 @@ class DorisConfig:
if not self.database.user:
errors.append("Database username cannot be empty")
if self.database.min_connections <= 0:
errors.append("Minimum connections must be greater than 0")
if self.database.max_connections <= self.database.min_connections:
errors.append("Maximum connections must be greater than minimum connections")
if self.database.max_connections <= 0:
errors.append("Maximum connections must be greater than 0")
# Validate security configuration
if self.security.auth_type not in ["token", "basic", "oauth"]:
@@ -521,6 +682,31 @@ class DorisConfig:
if self.performance.query_timeout <= 0:
errors.append("Query timeout must be greater than 0")
# Validate data quality configuration
if self.data_quality.max_columns_per_batch <= 0:
errors.append("Max columns per batch must be greater than 0")
if self.data_quality.default_sample_size <= 0:
errors.append("Default sample size must be greater than 0")
if self.data_quality.small_table_threshold <= 0:
errors.append("Small table threshold must be greater than 0")
if self.data_quality.medium_table_threshold <= 0:
errors.append("Medium table threshold must be greater than 0")
if self.data_quality.small_table_threshold >= self.data_quality.medium_table_threshold:
errors.append("Small table threshold must be less than medium table threshold")
if self.data_quality.batch_timeout <= 0:
errors.append("Batch timeout must be greater than 0")
if self.data_quality.fast_mode_sample_size <= 0:
errors.append("Fast mode sample size must be greater than 0")
if self.data_quality.histogram_bins <= 0:
errors.append("Histogram bins must be greater than 0")
# Validate logging configuration
if self.logging.level not in ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]:
errors.append("Log level must be one of DEBUG, INFO, WARNING, ERROR, or CRITICAL")
@@ -531,6 +717,12 @@ class DorisConfig:
if self.logging.backup_count < 0:
errors.append("Log backup count cannot be negative")
if self.logging.max_age_days <= 0:
errors.append("Log max age days must be greater than 0")
if self.logging.cleanup_interval_hours <= 0:
errors.append("Log cleanup interval hours must be greater than 0")
# Validate monitoring configuration
if not (1 <= self.monitoring.metrics_port <= 65535):
errors.append("Monitoring port must be in the range 1-65535")
@@ -538,6 +730,19 @@ class DorisConfig:
if not (1 <= self.monitoring.health_check_port <= 65535):
errors.append("Health check port must be in the range 1-65535")
# Validate ADBC configuration
if self.adbc.default_max_rows <= 0:
errors.append("ADBC default max rows must be greater than 0")
if self.adbc.default_timeout <= 0:
errors.append("ADBC default timeout must be greater than 0")
if self.adbc.default_return_format not in ["arrow", "pandas", "dict"]:
errors.append("ADBC default return format must be one of arrow, pandas, or dict")
if self.adbc.connection_timeout <= 0:
errors.append("ADBC connection timeout must be greater than 0")
return errors
def get_connection_string(self) -> str:
@@ -549,7 +754,7 @@ class DorisConfig:
return {
"server": f"{self.server_name} v{self.server_version}",
"database": f"{self.database.host}:{self.database.port}/{self.database.database}",
"connection_pool": f"{self.database.min_connections}-{self.database.max_connections}",
"connection_pool": f"0-{self.database.max_connections} (min fixed at 0 for stability)",
"security": {
"auth_type": self.security.auth_type,
"masking_enabled": self.security.enable_masking,
@@ -575,56 +780,50 @@ class ConfigManager:
self.logger = logging.getLogger(__name__)
def setup_logging(self):
"""Setup logging configuration"""
# Configure root logger
root_logger = logging.getLogger()
root_logger.setLevel(getattr(logging, self.config.logging.level.upper()))
"""Setup logging configuration using enhanced logger"""
from .logger import setup_logging, get_logger
import sys
# Clear existing handlers
for handler in root_logger.handlers[:]:
root_logger.removeHandler(handler)
# Create formatter
formatter = logging.Formatter(self.config.logging.format)
# Console handler
console_handler = logging.StreamHandler()
console_handler.setFormatter(formatter)
root_logger.addHandler(console_handler)
# File handler (if configured)
# Determine log directory
log_dir = "logs"
if self.config.logging.file_path:
try:
from logging.handlers import RotatingFileHandler
# Extract directory from file path if provided
from pathlib import Path
log_dir = str(Path(self.config.logging.file_path).parent)
file_handler = RotatingFileHandler(
self.config.logging.file_path,
maxBytes=self.config.logging.max_file_size,
backupCount=self.config.logging.backup_count,
encoding="utf-8",
# Detect if we're in stdio mode by checking if this is likely MCP stdio communication
# In stdio mode, we shouldn't output to console as it interferes with JSON protocol
is_stdio_mode = (
self.config.transport == "stdio" or
("--transport" in sys.argv and "stdio" in sys.argv) or
not sys.stdout.isatty() # Not a terminal (likely piped/redirected)
)
file_handler.setFormatter(formatter)
root_logger.addHandler(file_handler)
except Exception as e:
self.logger.warning(f"Failed to setup file logging: {e}")
# Audit log handler (if configured)
if self.config.logging.enable_audit and self.config.logging.audit_file_path:
try:
from logging.handlers import RotatingFileHandler
audit_logger = logging.getLogger("audit")
audit_handler = RotatingFileHandler(
self.config.logging.audit_file_path,
maxBytes=self.config.logging.max_file_size,
backupCount=self.config.logging.backup_count,
encoding="utf-8",
# Setup enhanced logging with cleanup functionality
setup_logging(
level=self.config.logging.level,
log_dir=log_dir,
enable_console=not is_stdio_mode, # Disable console logging in stdio mode
enable_file=True,
enable_audit=self.config.logging.enable_audit,
audit_file=self.config.logging.audit_file_path,
max_file_size=self.config.logging.max_file_size,
backup_count=self.config.logging.backup_count,
enable_cleanup=self.config.logging.enable_cleanup,
max_age_days=self.config.logging.max_age_days,
cleanup_interval_hours=self.config.logging.cleanup_interval_hours
)
audit_handler.setFormatter(formatter)
audit_logger.addHandler(audit_handler)
audit_logger.setLevel(logging.INFO)
except Exception as e:
self.logger.warning(f"Failed to setup audit logging: {e}")
# Update logger to use new system
self.logger = get_logger(__name__)
self.logger.info("Enhanced logging system with cleanup initialized successfully")
self.logger.info(f"Log directory: {log_dir}")
self.logger.info(f"Log level: {self.config.logging.level}")
self.logger.info(f"Audit logging: {'Enabled' if self.config.logging.enable_audit else 'Disabled'}")
self.logger.info(f"Log cleanup: {'Enabled' if self.config.logging.enable_cleanup else 'Disabled'}")
if self.config.logging.enable_cleanup:
self.logger.info(f"Cleanup config: Max age {self.config.logging.max_age_days} days, interval {self.config.logging.cleanup_interval_hours}h")
def validate_config(self) -> bool:
"""Validate configuration"""

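The `DatabaseConfig` change above pins `min_connections` to 0 by pairing a private field declared with `init=False` with a read-only property. A minimal sketch of that pattern follows; `PoolConfig` and `try_set_min` are hypothetical names used only for illustration. Because the property has no setter, any code that assigns to `min_connections` (for example an environment-variable override) would raise `AttributeError`.

```python
from dataclasses import dataclass, field


@dataclass
class PoolConfig:  # hypothetical stand-in for DatabaseConfig
    # Excluded from __init__, so callers cannot pass a different minimum.
    _min_connections: int = field(default=0, init=False)
    max_connections: int = 20

    @property
    def min_connections(self) -> int:
        """Read-only view of the fixed minimum."""
        return self._min_connections


def try_set_min(cfg: PoolConfig, value: int) -> bool:
    """Return True if the assignment succeeded, False if the property rejected it."""
    try:
        cfg.min_connections = value  # no setter is defined on the property
    except AttributeError:
        return False
    return True


cfg = PoolConfig(max_connections=10)
print(cfg.min_connections)
print(try_set_min(cfg, 5))
```

The property is not annotated, so `dataclass` does not treat it as a field, and it appears in neither `__init__` nor `repr`.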

@@ -0,0 +1,733 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
"""
Data Exploration Tools Module
Provides table data distribution analysis and exploration capabilities
"""
import time
import math
from datetime import datetime
from typing import Any, Dict, List, Optional, Union
from .db import DorisConnectionManager
from .logger import get_logger
logger = get_logger(__name__)
class DataExplorationTools:
"""Data exploration tools for table distribution analysis"""
def __init__(self, connection_manager: DorisConnectionManager):
self.connection_manager = connection_manager
logger.info("DataExplorationTools initialized")
# ==================== Private Helper Methods ====================
def _build_full_table_name(self, table_name: str, catalog_name: Optional[str], db_name: Optional[str]) -> str:
"""Build full table name with catalog and database using three-part naming convention"""
# Default catalog for internal tables
effective_catalog = catalog_name if catalog_name else "internal"
if db_name:
return f"{effective_catalog}.{db_name}.{table_name}"
else:
# Without a database, a two-part name would be parsed as db.table,
# so return the bare table name and let the session's current database resolve it
return table_name
async def _get_table_basic_info(self, connection, table_name: str) -> Optional[Dict]:
"""Get basic table information including row count"""
try:
count_sql = f"SELECT COUNT(*) as row_count FROM {table_name}"
result = await connection.execute(count_sql)
if result.data:
return {"row_count": result.data[0]["row_count"]}
return None
except Exception as e:
logger.warning(f"Failed to get basic info for table {table_name}: {str(e)}")
return {"row_count": 0}
async def _get_table_columns_info(self, connection, table_name: str, catalog_name: Optional[str], db_name: Optional[str]) -> List[Dict]:
"""Get detailed column information"""
try:
where_conditions = [f"table_name = '{table_name}'"]
if db_name:
where_conditions.append(f"table_schema = '{db_name}'")
else:
where_conditions.append("table_schema = DATABASE()")
columns_sql = f"""
SELECT
column_name,
data_type,
is_nullable,
column_comment,
ordinal_position
FROM information_schema.columns
WHERE {' AND '.join(where_conditions)}
ORDER BY ordinal_position
"""
result = await connection.execute(columns_sql)
return result.data if result.data else []
except Exception as e:
logger.warning(f"Failed to get columns info for table {table_name}: {str(e)}")
return []
async def _determine_sampling_strategy(self, connection, table_name: str, total_rows: int, sample_size: int) -> Dict[str, Any]:
"""Determine optimal sampling strategy based on table size"""
if total_rows <= sample_size:
# Use all data if table is small enough
return {
"total_rows": total_rows,
"sample_size": total_rows,
"sampling_method": "full_scan",
"sampling_ratio": 1.0,
"use_sampling": False,
"sample_table_expression": table_name
}
else:
# Use random sampling for large tables
sampling_ratio = sample_size / total_rows
return {
"total_rows": total_rows,
"sample_size": sample_size,
"sampling_method": "random_sample",
"sampling_ratio": round(sampling_ratio, 4),
"use_sampling": True,
"sample_table_expression": f"(SELECT * FROM {table_name} ORDER BY RAND() LIMIT {sample_size}) as sample_table"
}
def _select_analysis_columns(self, columns_info: List[Dict], include_all: bool) -> List[Dict]:
"""Select columns for analysis based on strategy"""
if include_all:
return columns_info
# If not analyzing all columns, prioritize key columns
priority_keywords = ['id', 'key', 'code', 'status', 'type', 'amount', 'count', 'date', 'time']
priority_columns = []
other_columns = []
for col in columns_info:
col_name_lower = col["column_name"].lower()
if any(keyword in col_name_lower for keyword in priority_keywords):
priority_columns.append(col)
else:
other_columns.append(col)
# Return priority columns plus first 10 other columns
return priority_columns + other_columns[:10]
def _is_numeric_type(self, data_type: str) -> bool:
"""Check if column type is numeric"""
numeric_types = [
'tinyint', 'smallint', 'int', 'bigint', 'largeint',
'float', 'double', 'decimal', 'numeric'
]
return any(num_type in data_type.lower() for num_type in numeric_types)
def _is_categorical_type(self, data_type: str) -> bool:
"""Check if column type is categorical"""
categorical_types = ['varchar', 'char', 'string', 'text', 'enum']
return any(cat_type in data_type.lower() for cat_type in categorical_types)
def _is_temporal_type(self, data_type: str) -> bool:
"""Check if column type is temporal"""
temporal_types = ['date', 'datetime', 'timestamp', 'time']
return any(temp_type in data_type.lower() for temp_type in temporal_types)
async def _analyze_numeric_distributions(self, connection, table_name: str, numeric_columns: List[Dict], sampling_info: Dict) -> Dict[str, Any]:
"""Analyze distribution patterns for numeric columns"""
numeric_analysis = {}
for column in numeric_columns:
col_name = column["column_name"]
try:
# Basic statistics
table_expr = sampling_info.get("sample_table_expression", table_name)
stats_sql = f"""
SELECT
COUNT({col_name}) as count,
MIN({col_name}) as min_value,
MAX({col_name}) as max_value,
AVG({col_name}) as mean_value,
STDDEV({col_name}) as std_dev
FROM {table_expr}
WHERE {col_name} IS NOT NULL
"""
stats_result = await connection.execute(stats_sql)
if stats_result.data and stats_result.data[0]["count"] > 0:
stats = stats_result.data[0]
# Percentiles calculation
percentiles = await self._calculate_percentiles(connection, table_name, col_name, sampling_info)
# Outlier detection
outliers = await self._detect_numeric_outliers(connection, table_name, col_name, percentiles, sampling_info)
# Distribution shape analysis
distribution_shape = await self._analyze_distribution_shape(
connection, table_name, col_name, stats, percentiles, sampling_info
)
numeric_analysis[col_name] = {
"data_type": column["data_type"],
"statistics": {
"count": stats["count"],
# Compare against None explicitly so legitimate zero values are not reported as missing
"mean": round(float(stats["mean_value"]), 4) if stats["mean_value"] is not None else None,
"std": round(float(stats["std_dev"]), 4) if stats["std_dev"] is not None else None,
"min": float(stats["min_value"]) if stats["min_value"] is not None else None,
"max": float(stats["max_value"]) if stats["max_value"] is not None else None,
**percentiles
},
"distribution_shape": distribution_shape,
"outliers": outliers
}
except Exception as e:
logger.warning(f"Failed to analyze numeric column {col_name}: {str(e)}")
numeric_analysis[col_name] = {"error": str(e)}
return numeric_analysis
async def _calculate_percentiles(self, connection, table_name: str, col_name: str, sampling_info: Dict) -> Dict[str, float]:
"""Calculate percentiles for numeric column"""
try:
table_expr = sampling_info.get("sample_table_expression", table_name)
percentile_sql = f"""
SELECT
PERCENTILE({col_name}, 0.25) as p25,
PERCENTILE({col_name}, 0.50) as p50,
PERCENTILE({col_name}, 0.75) as p75,
PERCENTILE({col_name}, 0.90) as p90,
PERCENTILE({col_name}, 0.95) as p95,
PERCENTILE({col_name}, 0.99) as p99
FROM {table_expr}
WHERE {col_name} IS NOT NULL
"""
result = await connection.execute(percentile_sql)
if result.data:
data = result.data[0]
return {
# Compare against None explicitly so a percentile of 0 is kept
"25%": round(float(data["p25"]), 4) if data["p25"] is not None else None,
"50%": round(float(data["p50"]), 4) if data["p50"] is not None else None,
"75%": round(float(data["p75"]), 4) if data["p75"] is not None else None,
"90%": round(float(data["p90"]), 4) if data["p90"] is not None else None,
"95%": round(float(data["p95"]), 4) if data["p95"] is not None else None,
"99%": round(float(data["p99"]), 4) if data["p99"] is not None else None
}
except Exception as e:
logger.warning(f"Failed to calculate percentiles for {col_name}: {str(e)}")
return {}
async def _detect_numeric_outliers(self, connection, table_name: str, col_name: str, percentiles: Dict, sampling_info: Dict) -> Dict[str, Any]:
"""Detect outliers using IQR method"""
try:
# Percentile values may be present but None, so check the values rather than the keys
if percentiles.get("25%") is None or percentiles.get("75%") is None:
return {"outlier_count": 0, "outlier_rate": 0.0}
q1 = percentiles["25%"]
q3 = percentiles["75%"]
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
table_expr = sampling_info.get("sample_table_expression", table_name)
outlier_sql = f"""
SELECT
COUNT(*) as total_count,
SUM(CASE WHEN {col_name} < {lower_bound} OR {col_name} > {upper_bound} THEN 1 ELSE 0 END) as outlier_count
FROM {table_expr}
WHERE {col_name} IS NOT NULL
"""
result = await connection.execute(outlier_sql)
if result.data:
data = result.data[0]
total_count = data["total_count"]
outlier_count = data["outlier_count"]
outlier_rate = outlier_count / total_count if total_count > 0 else 0
return {
"outlier_count": outlier_count,
"outlier_rate": round(outlier_rate, 4),
"outlier_threshold_lower": round(lower_bound, 4),
"outlier_threshold_upper": round(upper_bound, 4),
"iqr": round(iqr, 4)
}
except Exception as e:
logger.warning(f"Failed to detect outliers for {col_name}: {str(e)}")
return {"outlier_count": 0, "outlier_rate": 0.0}
async def _analyze_distribution_shape(self, connection, table_name: str, col_name: str, stats: Dict, percentiles: Dict, sampling_info: Dict) -> Dict[str, Any]:
"""Analyze the shape of data distribution"""
try:
mean = stats.get("mean_value", 0)
median = percentiles.get("50%", 0)
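if mean is None or median is None:
return {"distribution_type": "unknown"}
# Normalize to float so Decimal values returned by the driver can be mixed with float percentiles
mean, median = float(mean), float(median)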
# Calculate skewness indicator
if abs(mean - median) < 0.01:
skew_indicator = "symmetric"
elif mean > median:
skew_indicator = "right_skewed"
else:
skew_indicator = "left_skewed"
# Estimate kurtosis based on percentile spread
if "25%" in percentiles and "75%" in percentiles:
iqr = percentiles["75%"] - percentiles["25%"]
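# Note: _calculate_percentiles returns no "10%" key, so the lower end falls back to the 25th percentile
range_90 = percentiles.get("90%", percentiles["75%"]) - percentiles.get("10%", percentiles["25%"])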
if iqr > 0:
kurtosis_indicator = "normal" if 2.5 <= range_90/iqr <= 3.5 else ("heavy_tailed" if range_90/iqr > 3.5 else "light_tailed")
else:
kurtosis_indicator = "unknown"
else:
kurtosis_indicator = "unknown"
return {
"skewness_indicator": skew_indicator,
"kurtosis_indicator": kurtosis_indicator,
"distribution_type": self._classify_distribution_type(skew_indicator, kurtosis_indicator),
"mean_median_ratio": round(mean / median, 4) if median != 0 else None
}
except Exception as e:
logger.warning(f"Failed to analyze distribution shape for {col_name}: {str(e)}")
return {"distribution_type": "unknown"}
def _classify_distribution_type(self, skew: str, kurtosis: str) -> str:
"""Classify distribution type based on skewness and kurtosis"""
if skew == "symmetric" and kurtosis == "normal":
return "approximately_normal"
elif skew == "right_skewed":
return "right_skewed"
elif skew == "left_skewed":
return "left_skewed"
elif kurtosis == "heavy_tailed":
return "heavy_tailed"
else:
return "non_normal"
async def _analyze_categorical_distributions(self, connection, table_name: str, categorical_columns: List[Dict], sampling_info: Dict) -> Dict[str, Any]:
"""Analyze distribution patterns for categorical columns"""
categorical_analysis = {}
for column in categorical_columns:
col_name = column["column_name"]
try:
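# Basic cardinality and distribution
# Query the sampled expression, as the other analyses do; sampling_info carries
# no 'sample_query_suffix' key, so the previous query always scanned the full table
table_expr = sampling_info.get("sample_table_expression", table_name)
cardinality_sql = f"""
SELECT
COUNT(DISTINCT {col_name}) as cardinality,
COUNT({col_name}) as non_null_count
FROM {table_expr}
WHERE {col_name} IS NOT NULL
"""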
cardinality_result = await connection.execute(cardinality_sql)
if cardinality_result.data:
cardinality_data = cardinality_result.data[0]
cardinality = cardinality_data["cardinality"]
non_null_count = cardinality_data["non_null_count"]
# Value distribution (top values)
value_distribution = await self._get_categorical_value_distribution(
connection, table_name, col_name, sampling_info, non_null_count
)
# Calculate entropy and concentration
entropy = self._calculate_entropy(value_distribution)
concentration_ratio = value_distribution[0]["percentage"] if value_distribution else 0
categorical_analysis[col_name] = {
"data_type": column["data_type"],
"cardinality": cardinality,
"non_null_count": non_null_count,
"value_distribution": value_distribution,
"entropy": round(entropy, 3),
"concentration_ratio": round(concentration_ratio, 4),
"diversity_score": round(cardinality / non_null_count, 4) if non_null_count > 0 else 0
}
except Exception as e:
logger.warning(f"Failed to analyze categorical column {col_name}: {str(e)}")
categorical_analysis[col_name] = {"error": str(e)}
return categorical_analysis
async def _get_categorical_value_distribution(self, connection, table_name: str, col_name: str, sampling_info: Dict, total_count: int) -> List[Dict]:
"""Get value distribution for categorical column"""
try:
# Use sample table expression if sampling is enabled
table_expr = sampling_info.get("sample_table_expression", table_name)
distribution_sql = f"""
SELECT
{col_name} as value,
COUNT(*) as count
FROM {table_expr}
WHERE {col_name} IS NOT NULL
GROUP BY {col_name}
ORDER BY COUNT(*) DESC
LIMIT 20
"""
result = await connection.execute(distribution_sql)
if result.data:
distribution = []
for row in result.data:
count = row["count"]
percentage = count / total_count if total_count > 0 else 0
distribution.append({
"value": str(row["value"]),
"count": count,
"percentage": round(percentage, 4)
})
return distribution
except Exception as e:
logger.warning(f"Failed to get value distribution for {col_name}: {str(e)}")
return []
def _calculate_entropy(self, value_distribution: List[Dict]) -> float:
"""Calculate Shannon entropy for categorical distribution"""
if not value_distribution:
return 0.0
entropy = 0.0
for item in value_distribution:
p = item["percentage"]
if p > 0:
entropy -= p * math.log2(p)
return entropy
async def _analyze_temporal_distributions(self, connection, table_name: str, temporal_columns: List[Dict], sampling_info: Dict) -> Dict[str, Any]:
"""Analyze distribution patterns for temporal columns"""
temporal_analysis = {}
for column in temporal_columns:
col_name = column["column_name"]
try:
# Date range analysis
table_expr = sampling_info.get("sample_table_expression", table_name)
range_sql = f"""
SELECT
MIN({col_name}) as earliest,
MAX({col_name}) as latest,
COUNT({col_name}) as non_null_count
FROM {table_expr}
WHERE {col_name} IS NOT NULL
"""
range_result = await connection.execute(range_sql)
if range_result.data and range_result.data[0]["non_null_count"] > 0:
range_data = range_result.data[0]
earliest = range_data["earliest"]
latest = range_data["latest"]
# Calculate span
date_span_info = self._calculate_date_span(earliest, latest)
# Temporal patterns analysis
temporal_patterns = await self._analyze_temporal_patterns(
connection, table_name, col_name, sampling_info
)
temporal_analysis[col_name] = {
"data_type": column["data_type"],
"non_null_count": range_data["non_null_count"],
"date_range": {
"earliest": str(earliest),
"latest": str(latest),
**date_span_info
},
"temporal_patterns": temporal_patterns
}
except Exception as e:
logger.warning(f"Failed to analyze temporal column {col_name}: {str(e)}")
temporal_analysis[col_name] = {"error": str(e)}
return temporal_analysis
def _calculate_date_span(self, earliest, latest) -> Dict[str, Any]:
"""Calculate date span information"""
try:
if isinstance(earliest, str):
earliest = datetime.fromisoformat(earliest.replace('Z', '+00:00'))
if isinstance(latest, str):
latest = datetime.fromisoformat(latest.replace('Z', '+00:00'))
span = latest - earliest
span_days = span.days
return {
"span_days": span_days,
"span_years": round(span_days / 365.25, 2),
"span_description": self._describe_time_span(span_days)
}
except Exception as e:
logger.warning(f"Failed to calculate date span: {str(e)}")
return {"span_days": 0}
def _describe_time_span(self, days: int) -> str:
"""Describe time span in human readable format"""
if days < 1:
return "less_than_day"
elif days < 7:
return "days"
elif days < 30:
return "weeks"
elif days < 365:
return "months"
else:
return "years"
async def _analyze_temporal_patterns(self, connection, table_name: str, col_name: str, sampling_info: Dict) -> Dict[str, Any]:
"""Analyze temporal patterns like seasonality and trends"""
try:
table_expr = sampling_info.get("sample_table_expression", table_name)
# Weekly pattern analysis
weekly_pattern_sql = f"""
SELECT
DAYOFWEEK({col_name}) as day_of_week,
COUNT(*) as count
FROM {table_expr}
WHERE {col_name} IS NOT NULL
GROUP BY DAYOFWEEK({col_name})
ORDER BY day_of_week
"""
weekly_result = await connection.execute(weekly_pattern_sql)
weekly_pattern = []
if weekly_result.data:
total_records = sum(row["count"] for row in weekly_result.data)
for row in weekly_result.data:
percentage = row["count"] / total_records if total_records > 0 else 0
weekly_pattern.append(round(percentage, 3))
# Monthly trend analysis (simplified)
monthly_trend_sql = f"""
SELECT
YEAR({col_name}) as year,
MONTH({col_name}) as month,
COUNT(*) as count
FROM {table_expr}
WHERE {col_name} IS NOT NULL
GROUP BY YEAR({col_name}), MONTH({col_name})
ORDER BY year, month
LIMIT 12
"""
monthly_result = await connection.execute(monthly_trend_sql)
monthly_trend = "stable" # Simplified trend analysis
if monthly_result.data and len(monthly_result.data) > 3:
counts = [row["count"] for row in monthly_result.data]
# Compare first and last month; equal counts keep the default "stable"
if counts[-1] > counts[0]:
monthly_trend = "increasing"
elif counts[-1] < counts[0]:
monthly_trend = "decreasing"
return {
"weekly_pattern": weekly_pattern,
"monthly_trend": monthly_trend,
"seasonal_component": self._estimate_seasonality(weekly_pattern)
}
except Exception as e:
logger.warning(f"Failed to analyze temporal patterns for {col_name}: {str(e)}")
return {"weekly_pattern": [], "monthly_trend": "unknown", "seasonal_component": 0.0}
def _estimate_seasonality(self, weekly_pattern: List[float]) -> float:
"""Estimate seasonality strength based on weekly pattern variance"""
if len(weekly_pattern) < 7:
return 0.0
mean_percentage = sum(weekly_pattern) / len(weekly_pattern)
variance = sum((x - mean_percentage) ** 2 for x in weekly_pattern) / len(weekly_pattern)
# Normalize variance to 0-1 scale as seasonality indicator
seasonality = min(variance * 10, 1.0) # Scaling factor
return round(seasonality, 3)
async def _generate_data_quality_insights(self, connection, table_name: str, columns: List[Dict], sampling_info: Dict) -> Dict[str, Any]:
"""Generate overall data quality insights"""
try:
total_columns = len(columns)
# Calculate null rates across all columns
null_analysis = await self._analyze_overall_null_rates(connection, table_name, columns, sampling_info)
# Identify potential data quality issues
quality_issues = []
# High null rate columns
high_null_columns = [col for col, rate in null_analysis["column_null_rates"].items() if rate > 0.2]
if high_null_columns:
quality_issues.append({
"issue_type": "high_null_rates",
"severity": "medium",
"affected_columns": high_null_columns,
"description": f"{len(high_null_columns)} columns have null rates > 20%"
})
# Calculate overall data quality score
avg_null_rate = sum(null_analysis["column_null_rates"].values()) / len(null_analysis["column_null_rates"]) if null_analysis["column_null_rates"] else 0
data_quality_score = max(0, 1 - avg_null_rate)
return {
"total_columns_analyzed": total_columns,
"null_analysis": null_analysis,
"data_quality_score": round(data_quality_score, 3),
"quality_issues": quality_issues,
"recommendations": self._generate_quality_recommendations(quality_issues, null_analysis)
}
except Exception as e:
logger.warning(f"Failed to generate data quality insights: {str(e)}")
return {"data_quality_score": 0.0, "error": str(e)}
async def _analyze_overall_null_rates(self, connection, table_name: str, columns: List[Dict], sampling_info: Dict) -> Dict[str, Any]:
"""Analyze null rates across all columns"""
column_null_rates = {}
total_null_count = 0
total_cell_count = 0
for column in columns:
col_name = column["column_name"]
try:
table_expr = sampling_info.get("sample_table_expression", table_name)
null_sql = f"""
SELECT
COUNT(*) as total_count,
COUNT({col_name}) as non_null_count
FROM {table_expr}
"""
result = await connection.execute(null_sql)
if result.data:
data = result.data[0]
total_count = data["total_count"]
non_null_count = data["non_null_count"]
null_count = total_count - non_null_count
null_rate = null_count / total_count if total_count > 0 else 0
column_null_rates[col_name] = round(null_rate, 4)
total_null_count += null_count
total_cell_count += total_count
except Exception as e:
logger.warning(f"Failed to analyze null rate for column {col_name}: {str(e)}")
column_null_rates[col_name] = 0.0
overall_null_rate = total_null_count / total_cell_count if total_cell_count > 0 else 0
return {
"column_null_rates": column_null_rates,
"overall_null_rate": round(overall_null_rate, 4),
"columns_with_nulls": len([rate for rate in column_null_rates.values() if rate > 0])
}
def _generate_quality_recommendations(self, quality_issues: List[Dict], null_analysis: Dict) -> List[Dict]:
"""Generate data quality improvement recommendations"""
recommendations = []
# Recommendations based on null analysis
overall_null_rate = null_analysis.get("overall_null_rate", 0)
if overall_null_rate > 0.1:
recommendations.append({
"type": "data_completeness",
"priority": "high" if overall_null_rate > 0.3 else "medium",
"description": f"Overall null rate is {overall_null_rate:.1%}",
"action": "Review data collection and validation processes"
})
# Recommendations based on quality issues
for issue in quality_issues:
if issue["issue_type"] == "high_null_rates":
recommendations.append({
"type": "column_completeness",
"priority": issue["severity"],
"description": issue["description"],
"action": f"Focus on improving data completeness for: {', '.join(issue['affected_columns'][:3])}"
})
return recommendations
def _generate_analysis_summary(self, distribution_analysis: Dict[str, Any]) -> Dict[str, Any]:
"""Generate high-level summary of distribution analysis"""
summary = {
"numeric_columns_count": len(distribution_analysis.get("numeric_columns", {})),
"categorical_columns_count": len(distribution_analysis.get("categorical_columns", {})),
"temporal_columns_count": len(distribution_analysis.get("temporal_columns", {}))
}
# Identify interesting patterns
patterns = []
# Check for highly skewed numeric columns
numeric_cols = distribution_analysis.get("numeric_columns", {})
skewed_cols = [
col for col, info in numeric_cols.items()
if isinstance(info, dict) and
info.get("distribution_shape", {}).get("skewness_indicator") in ["right_skewed", "left_skewed"]
]
if skewed_cols:
patterns.append(f"Found {len(skewed_cols)} skewed numeric columns")
# Check for high cardinality categorical columns
categorical_cols = distribution_analysis.get("categorical_columns", {})
high_cardinality_cols = [
col for col, info in categorical_cols.items()
if isinstance(info, dict) and info.get("cardinality", 0) > 1000
]
if high_cardinality_cols:
patterns.append(f"Found {len(high_cardinality_cols)} high cardinality categorical columns")
summary["notable_patterns"] = patterns
return summary
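
The entropy and concentration figures reported for categorical columns above can be checked by hand. What follows is a minimal standalone sketch, not the module's code: `shannon_entropy` is a hypothetical free-function mirror of `_calculate_entropy`, and both sample distributions are invented for illustration. Because the source builds `value_distribution` from the top 20 values only, the result for a column with more than 20 distinct values is a partial sum that can only underestimate the full entropy.

```python
import math
from typing import Dict, List


def shannon_entropy(value_distribution: List[Dict]) -> float:
    """Shannon entropy in bits over the 'percentage' field of each entry.

    Hypothetical helper that mirrors the loop in _calculate_entropy.
    """
    entropy = 0.0
    for item in value_distribution:
        p = item["percentage"]
        if p > 0:  # skip zero shares, log2(0) is undefined
            entropy -= p * math.log2(p)
    return entropy


# Invented sample data: four equally common values vs. one dominant value
uniform = [{"value": v, "percentage": 0.25} for v in "abcd"]
skewed = [
    {"value": "a", "percentage": 0.97},
    {"value": "b", "percentage": 0.01},
    {"value": "c", "percentage": 0.01},
    {"value": "d", "percentage": 0.01},
]

uniform_entropy = shannon_entropy(uniform)
skewed_entropy = shannon_entropy(skewed)

# concentration_ratio in the source is the share of the most common value
skewed_concentration = skewed[0]["percentage"]
```

A uniform distribution over four values gives the maximum of 2 bits, while the skewed one scores far lower and has a concentration ratio of 0.97, which is the pattern the analysis output is meant to surface.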


@@ -0,0 +1,897 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
"""
Data Governance Tools Module
Provides data completeness analysis, field lineage tracking, and data freshness monitoring
"""
import re
import time
from datetime import datetime, timedelta
from typing import Any, Dict, List, Optional
from .db import DorisConnectionManager
from .logger import get_logger
logger = get_logger(__name__)
class DataGovernanceTools:
"""Data governance tools suite"""
def __init__(self, connection_manager: DorisConnectionManager):
self.connection_manager = connection_manager
logger.info("DataGovernanceTools initialized")
async def trace_column_lineage(
self,
table_name: str,
column_name: str,
depth: int = 3,
catalog_name: Optional[str] = None,
db_name: Optional[str] = None
) -> Dict[str, Any]:
"""
Column-level lineage tracing
Args:
table_name: Table name
column_name: Column name
depth: Trace depth
catalog_name: Catalog name
db_name: Database name
"""
try:
start_time = time.time()
# 🚀 PROGRESS: Initialize column lineage tracing
logger.info("=" * 60)
logger.info("🔍 Starting Column Lineage Tracing")
logger.info(f"📊 Target: {table_name}.{column_name}")
logger.info(f"🎯 Trace depth: {depth}")
logger.info("=" * 60)
connection = await self.connection_manager.get_connection("query")
full_table_name = self._build_full_table_name(table_name, catalog_name, db_name)
target_column = f"{full_table_name}.{column_name}"
logger.info(f"📝 Full target: {target_column}")
# 🚀 PROGRESS: Step 1 - Verify target column exists
logger.info("🔍 Step 1/4: Verifying target column exists...")
verify_start = time.time()
if not await self._verify_column_exists(connection, full_table_name, column_name):
logger.error(f"❌ Column {column_name} not found in table {full_table_name}")
return {"error": f"Column {column_name} not found in table {full_table_name}"}
verify_time = time.time() - verify_start
logger.info(f"✅ Column verified in {verify_time:.2f}s")
# 🚀 PROGRESS: Step 2 - Analyze SQL logs for lineage relationships
logger.info(f"📊 Step 2/4: Analyzing SQL logs for lineage (depth={depth})...")
lineage_start = time.time()
source_chain = await self._analyze_sql_logs_for_lineage(
connection, full_table_name, column_name, depth
)
lineage_time = time.time() - lineage_start
logger.info(f"✅ Found {len(source_chain)} lineage relationships in {lineage_time:.2f}s")
# 🚀 PROGRESS: Step 3 - Analyze downstream usage
logger.info("⬇️ Step 3/4: Analyzing downstream column usage...")
downstream_start = time.time()
downstream_usage = await self._analyze_downstream_column_usage(
connection, full_table_name, column_name
)
downstream_time = time.time() - downstream_start
logger.info(f"✅ Found {len(downstream_usage)} downstream usages in {downstream_time:.2f}s")
# 🚀 PROGRESS: Step 4 - Extract transformation rules
logger.info("🔄 Step 4/4: Extracting transformation rules...")
transform_start = time.time()
transformation_rules = await self._extract_transformation_rules(
connection, full_table_name, column_name
)
transform_time = time.time() - transform_start
logger.info(f"✅ Found {len(transformation_rules)} transformation rules in {transform_time:.2f}s")
execution_time = time.time() - start_time
return {
"target_column": target_column,
"analysis_timestamp": datetime.now().isoformat(),
"execution_time_seconds": round(execution_time, 3),
"lineage_depth": depth,
"source_chain": source_chain,
"downstream_usage": downstream_usage,
"transformation_rules": transformation_rules,
"lineage_confidence": self._calculate_lineage_confidence(source_chain),
"impact_analysis": {
"upstream_dependencies": len(source_chain),
"downstream_dependencies": len(downstream_usage),
"risk_level": self._assess_lineage_risk(source_chain, downstream_usage)
}
}
except Exception as e:
logger.error(f"Column lineage tracing failed for {table_name}.{column_name}: {str(e)}")
return {
"error": str(e),
"target_column": f"{table_name}.{column_name}",
"analysis_timestamp": datetime.now().isoformat()
}
async def monitor_data_freshness(
self,
tables: Optional[List[str]] = None,
time_threshold_hours: int = 24,
catalog_name: Optional[str] = None,
db_name: Optional[str] = None
) -> Dict[str, Any]:
"""
Data freshness monitoring
Args:
tables: List of tables to monitor, empty means monitor all tables
time_threshold_hours: Freshness threshold (hours)
catalog_name: Catalog name
db_name: Database name
"""
try:
start_time = time.time()
connection = await self.connection_manager.get_connection("query")
# 1. Get list of tables to monitor
if not tables:
tables = await self._get_all_tables(connection, catalog_name, db_name)
# 2. Analyze freshness of each table
table_freshness = {}
fresh_count = 0
stale_count = 0
for table in tables:
full_table_name = self._build_full_table_name(table, catalog_name, db_name)
freshness_info = await self._analyze_table_freshness(
connection, full_table_name, time_threshold_hours
)
table_freshness[table] = freshness_info
if freshness_info["status"] == "fresh":
fresh_count += 1
else:
stale_count += 1
# 3. Calculate overall freshness score
total_tables = len(tables)
overall_freshness_score = fresh_count / total_tables if total_tables > 0 else 0
# 4. Identify data flow issues
data_flow_issues = await self._identify_data_flow_issues(table_freshness)
execution_time = time.time() - start_time
return {
"monitoring_timestamp": datetime.now().isoformat(),
"execution_time_seconds": round(execution_time, 3),
"monitoring_scope": {
"catalog_name": catalog_name,
"db_name": db_name,
"time_threshold_hours": time_threshold_hours
},
"freshness_summary": {
"total_tables": total_tables,
"fresh_tables": fresh_count,
"stale_tables": stale_count,
"overall_freshness_score": round(overall_freshness_score, 3)
},
"table_freshness": table_freshness,
"data_flow_issues": data_flow_issues,
"alerts": self._generate_freshness_alerts(table_freshness, time_threshold_hours)
}
except Exception as e:
logger.error(f"Data freshness monitoring failed: {str(e)}")
return {
"error": str(e),
"monitoring_timestamp": datetime.now().isoformat()
}
# ==================== Private Helper Methods ====================
def _build_full_table_name(self, table_name: str, catalog_name: Optional[str], db_name: Optional[str]) -> str:
"""Build full table name - use three-level naming convention"""
# Default catalog is internal for internal tables
effective_catalog = catalog_name if catalog_name else "internal"
if db_name:
return f"{effective_catalog}.{db_name}.{table_name}"
else:
# If db_name is not provided, need to determine current database
return f"{effective_catalog}.{table_name}"
async def _get_table_basic_info(self, connection, table_name: str) -> Optional[Dict]:
"""Get table basic information"""
try:
# Try to get table row count
count_sql = f"SELECT COUNT(*) as row_count FROM {table_name}"
result = await connection.execute(count_sql)
if result.data:
return {"row_count": result.data[0]["row_count"]}
return None
except Exception as e:
logger.warning(f"Failed to get basic info for table {table_name}: {str(e)}")
return {"row_count": 0}
async def _get_table_columns_info(self, connection, table_name: str, catalog_name: Optional[str], db_name: Optional[str]) -> List[Dict]:
"""Get table column information"""
try:
# Build query conditions
where_conditions = [f"table_name = '{table_name}'"]
if db_name:
where_conditions.append(f"table_schema = '{db_name}'")
else:
where_conditions.append("table_schema = DATABASE()")
columns_sql = f"""
SELECT
column_name,
data_type,
is_nullable,
column_comment,
ordinal_position
FROM information_schema.columns
WHERE {' AND '.join(where_conditions)}
ORDER BY ordinal_position
"""
result = await connection.execute(columns_sql)
return result.data if result.data else []
except Exception as e:
logger.warning(f"Failed to get columns info for table {table_name}: {str(e)}")
return []
async def _analyze_column_completeness(self, connection, table_name: str, columns_info: List[Dict]) -> Dict[str, Any]:
"""Analyze column completeness"""
column_completeness = {}
for column in columns_info:
column_name = column["column_name"]
try:
# Calculate null value statistics
null_sql = f"""
SELECT
COUNT(*) as total_count,
COUNT({column_name}) as non_null_count,
COUNT(*) - COUNT({column_name}) as null_count
FROM {table_name}
"""
result = await connection.execute(null_sql)
if result.data:
stats = result.data[0]
total_count = stats["total_count"]
null_count = stats["null_count"]
null_rate = null_count / total_count if total_count > 0 else 0
completeness_score = 1.0 - null_rate
column_completeness[column_name] = {
"data_type": column["data_type"],
"is_nullable": column["is_nullable"],
"total_count": total_count,
"null_count": null_count,
"non_null_count": stats["non_null_count"],
"null_rate": round(null_rate, 4),
"completeness_score": round(completeness_score, 4)
}
except Exception as e:
logger.warning(f"Failed to analyze completeness for column {column_name}: {str(e)}")
column_completeness[column_name] = {
"error": str(e),
"completeness_score": 0.0
}
return column_completeness
async def _check_business_rule_compliance(self, connection, table_name: str, business_rules: List[Dict], total_rows: int) -> Dict[str, Any]:
"""Check business rule compliance"""
compliance_results = {}
for rule in business_rules:
rule_name = rule.get("rule_name", "unknown")
sql_condition = rule.get("sql_condition", "")
if not sql_condition:
continue
try:
# Check number of records meeting conditions
compliance_sql = f"""
SELECT
COUNT(*) as total_count,
SUM(CASE WHEN {sql_condition} THEN 1 ELSE 0 END) as pass_count
FROM {table_name}
"""
result = await connection.execute(compliance_sql)
if result.data:
stats = result.data[0]
pass_count = stats["pass_count"] or 0
fail_count = total_rows - pass_count
pass_rate = pass_count / total_rows if total_rows > 0 else 0
compliance_results[rule_name] = {
"rule_condition": sql_condition,
"total_records": total_rows,
"pass_count": pass_count,
"fail_count": fail_count,
"pass_rate": round(pass_rate, 4),
"compliance_score": round(pass_rate, 4)
}
except Exception as e:
logger.warning(f"Failed to check business rule {rule_name}: {str(e)}")
compliance_results[rule_name] = {
"error": str(e),
"compliance_score": 0.0
}
return compliance_results
async def _detect_data_integrity_issues(self, connection, table_name: str, columns_info: List[Dict]) -> List[Dict]:
"""Detect data integrity issues"""
issues = []
try:
# Detect duplicate values in primary key fields
# column_comment may be NULL in information_schema, so fall back to an empty string
primary_key_columns = [col["column_name"] for col in columns_info if "primary" in (col.get("column_comment") or "").lower()]
for pk_col in primary_key_columns:
duplicate_sql = f"""
SELECT COUNT(*) as duplicate_count
FROM (
SELECT {pk_col}, COUNT(*) as cnt
FROM {table_name}
WHERE {pk_col} IS NOT NULL
GROUP BY {pk_col}
HAVING COUNT(*) > 1
) t
"""
result = await connection.execute(duplicate_sql)
if result.data and result.data[0]["duplicate_count"] > 0:
issues.append({
"type": "duplicate_primary_keys",
"column": pk_col,
"count": result.data[0]["duplicate_count"],
"severity": "high",
"description": f"Found duplicate values in primary key column {pk_col}"
})
except Exception as e:
logger.warning(f"Failed to detect integrity issues: {str(e)}")
issues.append({
"type": "detection_error",
"error": str(e),
"severity": "unknown"
})
return issues
def _calculate_completeness_score(self, column_completeness: Dict, business_rule_compliance: Dict) -> float:
"""Calculate overall completeness score"""
if not column_completeness:
return 0.0
# Calculate column completeness average score
column_scores = [
col_info.get("completeness_score", 0.0)
for col_info in column_completeness.values()
if isinstance(col_info, dict) and "completeness_score" in col_info
]
avg_column_score = sum(column_scores) / len(column_scores) if column_scores else 0.0
# Calculate business rule compliance average score
compliance_scores = [
rule_info.get("compliance_score", 0.0)
for rule_info in business_rule_compliance.values()
if isinstance(rule_info, dict) and "compliance_score" in rule_info
]
avg_compliance_score = sum(compliance_scores) / len(compliance_scores) if compliance_scores else 1.0
# Comprehensive score (column completeness weight 70%, business rules weight 30%)
overall_score = avg_column_score * 0.7 + avg_compliance_score * 0.3
return round(overall_score, 4)
def _generate_completeness_recommendations(self, column_completeness: Dict, integrity_issues: List[Dict]) -> List[Dict]:
"""Generate completeness improvement recommendations"""
recommendations = []
# Generate recommendations based on column completeness
for col_name, col_info in column_completeness.items():
if isinstance(col_info, dict):
null_rate = col_info.get("null_rate", 0)
if null_rate > 0.1: # Null rate exceeds 10%
recommendations.append({
"type": "high_null_rate",
"column": col_name,
"priority": "high" if null_rate > 0.5 else "medium",
"description": f"Column {col_name} has high null rate ({null_rate:.1%})",
"suggested_action": "Review data collection process or add data validation"
})
# Generate recommendations based on integrity issues
for issue in integrity_issues:
if issue["type"] == "duplicate_primary_keys":
recommendations.append({
"type": "data_deduplication",
"column": issue["column"],
"priority": "high",
"description": f"Duplicate primary key values found in {issue['column']}",
"suggested_action": "Implement unique constraint or data deduplication process"
})
return recommendations
async def _verify_column_exists(self, connection, table_name: str, column_name: str) -> bool:
"""Verify if column exists"""
try:
# Simple verification method: try to query the column
verify_sql = f"SELECT {column_name} FROM {table_name} LIMIT 1"
await connection.execute(verify_sql)
return True
except Exception:
return False
async def _analyze_sql_logs_for_lineage(self, connection, table_name: str, column_name: str, depth: int) -> List[Dict]:
"""Analyze SQL logs to get lineage relationships (simplified implementation)"""
# Note: simplified implementation that pattern-matches audit log statements; a production setup needs full SQL parsing
source_chain = []
try:
# Try to find related INSERT/CREATE TABLE AS SELECT statements from audit logs (one year range)
audit_sql = """
SELECT
stmt as sql_statement,
`time` as execution_time,
`user` as user_name
FROM internal.__internal_schema.audit_log
WHERE stmt LIKE '%{}%'
AND (stmt LIKE '%INSERT%' OR stmt LIKE '%CREATE%' OR stmt LIKE '%SELECT%')
AND `time` >= DATE_SUB(NOW(), INTERVAL 1 YEAR)
ORDER BY `time` DESC
LIMIT 50
""".format(table_name.split('.')[-1]) # Use the last part of table name
result = await connection.execute(audit_sql)
if result.data:
for i, log_entry in enumerate(result.data[:depth]):
# Simplified lineage analysis: extract possible source tables
sql_stmt = log_entry.get("sql_statement", "")
source_tables = self._extract_source_tables_from_sql(sql_stmt)
if source_tables:
# Handle datetime serialization issue
execution_time = log_entry.get("execution_time")
if execution_time and hasattr(execution_time, 'isoformat'):
execution_time = execution_time.isoformat()
elif execution_time:
execution_time = str(execution_time)
source_chain.append({
"level": i + 1,
"source_table": source_tables[0], # Take the first as main source table
"source_column": column_name, # Simplified: assume same name
"transformation": self._extract_transformation_from_sql(sql_stmt, column_name),
"confidence": round(max(0.0, 0.8 - (i * 0.1)), 2), # Decreasing confidence, never below 0
"execution_time": execution_time,
"user": log_entry.get("user_name")
})
except Exception as e:
logger.warning(f"Failed to analyze SQL logs for lineage: {str(e)}")
# If unable to get from audit logs, return basic information
source_chain = [{
"level": 1,
"source_table": "unknown_source",
"source_column": column_name,
"transformation": "unknown",
"confidence": 0.3,
"note": "Limited lineage information available"
}]
return source_chain
def _extract_source_tables_from_sql(self, sql: str) -> List[str]:
"""Extract source table names from SQL statement (simplified implementation)"""
# Simplified regex to match table names in FROM clause
from_pattern = r'\bFROM\s+([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*)'
join_pattern = r'\bJOIN\s+([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*)'
tables = []
# Find tables in FROM clause
from_matches = re.findall(from_pattern, sql, re.IGNORECASE)
tables.extend(from_matches)
# Find tables in JOIN clause
join_matches = re.findall(join_pattern, sql, re.IGNORECASE)
tables.extend(join_matches)
return list(set(tables)) # Remove duplicates
def _extract_transformation_from_sql(self, sql: str, column_name: str) -> str:
"""Extract field transformation rules from SQL statement (simplified implementation)"""
# Simplified implementation: find expressions containing target field
lines = sql.split('\n')
for line in lines:
if column_name in line and ('SELECT' in line.upper() or '=' in line):
return line.strip()
return "direct_copy"
async def _analyze_downstream_column_usage(self, connection, table_name: str, column_name: str) -> List[Dict]:
"""Analyze downstream usage of field (simplified implementation)"""
downstream_usage = []
try:
# Find other tables that might use this field (through audit logs, one year range)
usage_sql = """
SELECT DISTINCT
stmt as sql_statement
FROM internal.__internal_schema.audit_log
WHERE stmt LIKE '%{}%'
AND stmt LIKE '%{}%'
AND stmt LIKE '%SELECT%'
AND `time` >= DATE_SUB(NOW(), INTERVAL 1 YEAR)
LIMIT 20
""".format(table_name.split('.')[-1], column_name)
result = await connection.execute(usage_sql)
if result.data:
for entry in result.data:
sql_stmt = entry.get("sql_statement", "")
target_tables = self._extract_target_tables_from_sql(sql_stmt)
for target_table in target_tables:
if target_table.split('.')[-1] != table_name.split('.')[-1]: # Not the source table itself (compare unqualified names)
downstream_usage.append({
"table": target_table,
"column": column_name, # Simplified: assume same name
"usage_type": "select_reference",
"confidence": 0.7
})
except Exception as e:
logger.warning(f"Failed to analyze downstream usage: {str(e)}")
return downstream_usage
def _extract_target_tables_from_sql(self, sql: str) -> List[str]:
"""Extract target table names from SQL statement"""
# Find target tables in INSERT INTO or CREATE TABLE statements
insert_pattern = r'\bINSERT\s+INTO\s+([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*)'
create_pattern = r'\bCREATE\s+TABLE\s+([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*)'
tables = []
insert_matches = re.findall(insert_pattern, sql, re.IGNORECASE)
tables.extend(insert_matches)
create_matches = re.findall(create_pattern, sql, re.IGNORECASE)
tables.extend(create_matches)
return list(set(tables))
async def _extract_transformation_rules(self, connection, table_name: str, column_name: str) -> List[Dict]:
"""Extract field transformation rules"""
# Simplified implementation: return basic transformation information
return [{
"transformation_type": "unknown",
"description": "Transformation rules analysis requires detailed ETL metadata",
"confidence": 0.5
}]
def _calculate_lineage_confidence(self, source_chain: List[Dict]) -> float:
"""Calculate overall confidence of lineage tracing"""
if not source_chain:
return 0.0
confidences = [item.get("confidence", 0.0) for item in source_chain]
return round(sum(confidences) / len(confidences), 3)
def _assess_lineage_risk(self, source_chain: List[Dict], downstream_usage: List[Dict]) -> str:
"""Assess lineage risk level"""
if len(downstream_usage) > 10:
return "high"
elif len(downstream_usage) > 5:
return "medium"
else:
return "low"
async def _get_all_tables(self, connection, catalog_name: Optional[str], db_name: Optional[str]) -> List[str]:
"""Get list of all tables"""
try:
where_conditions = []
if db_name:
where_conditions.append(f"table_schema = '{db_name}'")
else:
where_conditions.append("table_schema = DATABASE()")
where_clause = " AND ".join(where_conditions) if where_conditions else "1=1"
tables_sql = f"""
SELECT table_name
FROM information_schema.tables
WHERE {where_clause}
AND table_type = 'BASE TABLE'
ORDER BY table_name
"""
result = await connection.execute(tables_sql)
return [row["table_name"] for row in result.data] if result.data else []
except Exception as e:
logger.warning(f"Failed to get table list: {str(e)}")
return []
async def _analyze_table_freshness(self, connection, table_name: str, threshold_hours: int) -> Dict[str, Any]:
"""Analyze freshness of single table"""
try:
# Try multiple methods to get table's last update time
freshness_methods = [
self._get_freshness_from_partition_info,
self._get_freshness_from_max_timestamp,
self._get_freshness_from_table_metadata
]
last_update = None
method_used = "unknown"
for method in freshness_methods:
try:
result = await method(connection, table_name)
if result:
last_update = result["last_update"]
method_used = result["method"]
break
except Exception as e:
logger.debug(f"Freshness method {method.__name__} failed for {table_name}: {str(e)}")
continue
if not last_update:
return {
"last_update": None,
"staleness_hours": None,
"freshness_score": 0.0,
"status": "unknown",
"method_used": "none",
"error": "Unable to determine last update time"
}
# Calculate data staleness
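if isinstance(last_update, str):
last_update = datetime.fromisoformat(last_update.replace('Z', '+00:00'))
# Use the timestamp's own timezone (if any); subtracting naive and aware datetimes raises TypeError
now = datetime.now(getattr(last_update, "tzinfo", None))
staleness_hours = (now - last_update).total_seconds() / 3600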
# Calculate freshness score and status
if staleness_hours <= threshold_hours:
status = "fresh"
freshness_score = max(0.0, 1.0 - (staleness_hours / threshold_hours))
else:
status = "stale"
freshness_score = max(0.0, 1.0 - (staleness_hours / (threshold_hours * 2)))
return {
"last_update": last_update.isoformat() if hasattr(last_update, 'isoformat') else str(last_update),
"staleness_hours": round(staleness_hours, 2),
"freshness_score": round(freshness_score, 3),
"status": status,
"method_used": method_used,
"threshold_hours": threshold_hours
}
except Exception as e:
logger.warning(f"Failed to analyze freshness for table {table_name}: {str(e)}")
return {
"last_update": None,
"staleness_hours": None,
"freshness_score": 0.0,
"status": "error",
"error": str(e)
}
async def _get_freshness_from_partition_info(self, connection, table_name: str) -> Optional[Dict]:
"""Get freshness from partition information"""
try:
# Query partition information (if table has partitions)
partition_sql = f"""
SELECT MAX(CREATE_TIME) as last_update
FROM information_schema.partitions
WHERE table_name = '{table_name.split('.')[-1]}'
AND CREATE_TIME IS NOT NULL
"""
result = await connection.execute(partition_sql)
if result.data and result.data[0]["last_update"]:
return {
"last_update": result.data[0]["last_update"],
"method": "partition_info"
}
return None
except Exception:
return None
async def _get_freshness_from_max_timestamp(self, connection, table_name: str) -> Optional[Dict]:
"""Get freshness from timestamp fields"""
try:
# Find possible timestamp fields
timestamp_columns = await self._find_timestamp_columns(connection, table_name)
if timestamp_columns:
max_time_sql = f"""
SELECT MAX({timestamp_columns[0]}) as last_update
FROM {table_name}
"""
result = await connection.execute(max_time_sql)
if result.data and result.data[0]["last_update"]:
return {
"last_update": result.data[0]["last_update"],
"method": f"max_timestamp({timestamp_columns[0]})"
}
return None
except Exception:
return None
async def _get_freshness_from_table_metadata(self, connection, table_name: str) -> Optional[Dict]:
"""Get freshness from table metadata"""
try:
# Query table's update time
metadata_sql = f"""
SELECT UPDATE_TIME as last_update
FROM information_schema.tables
WHERE table_name = '{table_name.split('.')[-1]}'
AND UPDATE_TIME IS NOT NULL
"""
result = await connection.execute(metadata_sql)
if result.data and result.data[0]["last_update"]:
return {
"last_update": result.data[0]["last_update"],
"method": "table_metadata"
}
return None
except Exception:
return None
async def _find_timestamp_columns(self, connection, table_name: str) -> List[str]:
"""Find possible timestamp fields"""
try:
timestamp_sql = f"""
SELECT column_name
FROM information_schema.columns
WHERE table_name = '{table_name.split('.')[-1]}'
AND (
data_type IN ('datetime', 'timestamp', 'date')
OR column_name LIKE '%time%'
OR column_name LIKE '%date%'
OR column_name LIKE '%created%'
OR column_name LIKE '%updated%'
)
ORDER BY
CASE
WHEN column_name LIKE '%updated%' THEN 1
WHEN column_name LIKE '%created%' THEN 2
WHEN column_name LIKE '%time%' THEN 3
ELSE 4
END
"""
result = await connection.execute(timestamp_sql)
return [row["column_name"] for row in result.data] if result.data else []
except Exception:
return []
async def _identify_data_flow_issues(self, table_freshness: Dict[str, Any]) -> List[Dict]:
"""Identify data flow issues"""
issues = []
# Identify consecutively stale tables (may indicate ETL process issues)
stale_tables = [
table_name for table_name, info in table_freshness.items()
if info.get("status") == "stale"
]
if len(stale_tables) > len(table_freshness) * 0.3: # More than 30% of tables are stale
issues.append({
"issue_type": "widespread_staleness",
"severity": "high",
"affected_tables": len(stale_tables),
"total_tables": len(table_freshness),
"description": f"High percentage of stale tables ({len(stale_tables)}/{len(table_freshness)})",
"possible_causes": ["ETL pipeline failure", "Data source issues", "Processing delays"]
})
# Identify particularly stale tables
very_stale_tables = [
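(table_name, info.get("staleness_hours") or 0)
for table_name, info in table_freshness.items()
# "unknown" and "error" entries store staleness_hours=None; treat them as 0
# so the comparison does not raise TypeError
if (info.get("staleness_hours") or 0) > 72 # More than 3 days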
]
if very_stale_tables:
issues.append({
"issue_type": "very_stale_data",
"severity": "medium",
"affected_tables": [table for table, _ in very_stale_tables],
"max_staleness_hours": max(hours for _, hours in very_stale_tables),
"description": "Some tables have very stale data (>72 hours)",
"recommendation": "Check data ingestion processes for affected tables"
})
return issues
def _generate_freshness_alerts(self, table_freshness: Dict[str, Any], threshold_hours: int) -> List[Dict]:
"""Generate freshness alerts"""
alerts = []
for table_name, info in table_freshness.items():
staleness_hours = info.get("staleness_hours")
status = info.get("status")
if status == "stale" and staleness_hours:
if staleness_hours > threshold_hours * 2: # Exceeds threshold by 2x
alert_level = "critical"
elif staleness_hours > threshold_hours * 1.5: # Exceeds threshold by 1.5x
alert_level = "warning"
else:
alert_level = "info"
alerts.append({
"alert_level": alert_level,
"table_name": table_name,
"staleness_hours": staleness_hours,
"threshold_hours": threshold_hours,
"message": f"Table {table_name} is stale ({staleness_hours:.1f} hours old, threshold: {threshold_hours}h)",
"timestamp": datetime.now().isoformat()
})
elif status == "error":
alerts.append({
"alert_level": "error",
"table_name": table_name,
"message": f"Unable to determine freshness for table {table_name}",
"error": info.get("error"),
"timestamp": datetime.now().isoformat()
})
return alerts
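
The freshness scoring rule used by `_analyze_table_freshness` can be tried in isolation. The following is a minimal sketch, not the module's API: the standalone function `freshness_score` and its explicit `now` argument are assumptions introduced here so the rule can be run without a database connection.

```python
from datetime import datetime, timedelta, timezone


def freshness_score(last_update: datetime, now: datetime, threshold_hours: float) -> tuple:
    """Return (status, score, staleness_hours) using the piecewise rule shown above."""
    staleness_hours = (now - last_update).total_seconds() / 3600
    if staleness_hours <= threshold_hours:
        status = "fresh"
        score = max(0.0, 1.0 - staleness_hours / threshold_hours)
    else:
        status = "stale"
        # Stale tables decay over twice the threshold before reaching 0
        score = max(0.0, 1.0 - staleness_hours / (threshold_hours * 2))
    return status, round(score, 3), round(staleness_hours, 2)


# Both datetimes are timezone-aware, so the subtraction is well defined
now = datetime(2025, 7, 14, 12, 0, tzinfo=timezone.utc)
fresh = freshness_score(now - timedelta(hours=6), now, 24)
stale = freshness_score(now - timedelta(hours=36), now, 24)
very_stale = freshness_score(now - timedelta(hours=60), now, 24)
at_threshold = freshness_score(now - timedelta(hours=24), now, 24)
just_over = freshness_score(now - timedelta(hours=25), now, 24)
```

One property worth noticing when reading the scores: a table exactly at the threshold scores 0.0 while still being "fresh", whereas a table one hour past it scores about 0.479 while being "stale". Consumers should therefore read `status` and `freshness_score` together rather than ranking tables by score alone.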

@@ -0,0 +1,978 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
"""
Dependency Analysis Tools Module
Provides data flow dependency analysis and impact assessment capabilities
"""
import time
import re
from datetime import datetime
from typing import Any, Dict, List, Optional, Set, Tuple
from collections import defaultdict, deque
from .db import DorisConnectionManager
from .logger import get_logger
logger = get_logger(__name__)
class DependencyAnalysisTools:
"""Dependency analysis tools for data flow and impact assessment"""
def __init__(self, connection_manager: DorisConnectionManager):
self.connection_manager = connection_manager
logger.info("DependencyAnalysisTools initialized")
async def analyze_data_flow_dependencies(
self,
target_table: Optional[str] = None,
analysis_depth: int = 3,
include_views: bool = True,
catalog_name: Optional[str] = None,
db_name: Optional[str] = None
) -> Dict[str, Any]:
"""
Analyze data flow dependencies and impact relationships
Args:
target_table: Specific table to analyze (if None, analyzes all tables)
analysis_depth: Maximum depth for dependency traversal
include_views: Whether to include views in dependency analysis
catalog_name: Catalog name
db_name: Database name
Returns:
Comprehensive dependency analysis results
"""
try:
start_time = time.time()
connection = await self.connection_manager.get_connection("query")
# 1. Get table metadata and relationships
tables_metadata = await self._get_tables_metadata(connection, catalog_name, db_name, include_views)
if not tables_metadata:
return {
"error": "No tables found for dependency analysis",
"analysis_timestamp": datetime.now().isoformat()
}
# 2. Build dependency graph from SQL analysis
dependency_graph = await self._build_dependency_graph(connection, tables_metadata, analysis_depth)
# 3. Analyze specific table or all tables
if target_table:
# Analyze specific table
table_analysis = await self._analyze_single_table_dependencies(
target_table, dependency_graph, tables_metadata
)
impact_analysis = await self._calculate_impact_analysis(
target_table, dependency_graph, "both"
)
else:
# Analyze all tables
table_analysis = await self._analyze_all_tables_dependencies(
dependency_graph, tables_metadata
)
impact_analysis = await self._calculate_global_impact_analysis(dependency_graph)
# 4. Generate insights and recommendations
dependency_insights = await self._generate_dependency_insights(
dependency_graph, table_analysis, impact_analysis
)
execution_time = time.time() - start_time
return {
"analysis_target": target_table or "all_tables",
"analysis_timestamp": datetime.now().isoformat(),
"execution_time_seconds": round(execution_time, 3),
"tables_analyzed": len(tables_metadata),
"dependency_graph_stats": self._get_dependency_graph_stats(dependency_graph),
"table_dependencies": table_analysis,
"impact_analysis": impact_analysis,
"dependency_insights": dependency_insights,
"recommendations": self._generate_dependency_recommendations(dependency_insights)
}
except Exception as e:
logger.error(f"Data flow dependency analysis failed: {str(e)}")
return {
"error": str(e),
"analysis_timestamp": datetime.now().isoformat()
}
# ==================== Private Helper Methods ====================
async def _get_tables_metadata(self, connection, catalog_name: Optional[str], db_name: Optional[str], include_views: bool) -> List[Dict]:
"""Get metadata for all tables and views"""
try:
# Build conditions for query
where_conditions = []
if db_name:
where_conditions.append(f"table_schema = '{db_name}'")
else:
where_conditions.append("table_schema = DATABASE()")
table_types = ["'BASE TABLE'"]
if include_views:
table_types.append("'VIEW'")
where_conditions.append(f"table_type IN ({','.join(table_types)})")
metadata_sql = f"""
SELECT
table_schema as schema_name,
table_name,
table_type,
table_comment,
table_rows,
data_length
FROM information_schema.tables
WHERE {' AND '.join(where_conditions)}
ORDER BY table_schema, table_name
"""
result = await connection.execute(metadata_sql)
return result.data if result.data else []
except Exception as e:
logger.warning(f"Failed to get tables metadata: {str(e)}")
return []
async def _build_dependency_graph(self, connection, tables_metadata: List[Dict], analysis_depth: int) -> Dict[str, Dict]:
"""Build dependency graph by analyzing SQL statements and DDL"""
dependency_graph = defaultdict(lambda: {
"upstream_dependencies": set(),
"downstream_dependencies": set(),
"table_type": "unknown",
"dependency_strength": {},
"sql_patterns": []
})
# Initialize graph with table metadata
for table in tables_metadata:
table_name = table["table_name"]
schema_name = table.get("schema_name", "")
full_table_name = f"{schema_name}.{table_name}" if schema_name else table_name
dependency_graph[full_table_name]["table_type"] = table["table_type"]
# 1. Analyze view definitions for dependencies
await self._analyze_view_dependencies(connection, dependency_graph, tables_metadata)
# 2. Analyze audit logs for runtime dependencies
await self._analyze_runtime_dependencies(connection, dependency_graph, analysis_depth)
# 3. Analyze foreign key relationships
await self._analyze_foreign_key_dependencies(connection, dependency_graph, tables_metadata)
return dict(dependency_graph)
async def _analyze_view_dependencies(self, connection, dependency_graph: Dict, tables_metadata: List[Dict]) -> None:
"""Analyze view definitions to extract table dependencies"""
try:
for table in tables_metadata:
if table["table_type"] == "VIEW":
table_name = table["table_name"]
schema_name = table.get("schema_name", "")
# Get view definition
view_def_sql = f"SHOW CREATE VIEW {schema_name}.{table_name}" if schema_name else f"SHOW CREATE VIEW {table_name}"
try:
result = await connection.execute(view_def_sql)
if result.data and len(result.data) > 0:
# Extract view definition from result
view_definition = ""
for row in result.data:
for key, value in row.items():
if "create" in key.lower() and value:
view_definition = str(value)
break
if view_definition:
# Extract table dependencies from view definition
referenced_tables = self._extract_table_references(view_definition)
full_view_name = f"{schema_name}.{table_name}" if schema_name else table_name
for ref_table in referenced_tables:
# Add upstream dependency
dependency_graph[full_view_name]["upstream_dependencies"].add(ref_table)
dependency_graph[full_view_name]["dependency_strength"][ref_table] = "direct"
# Add downstream dependency for referenced table
dependency_graph[ref_table]["downstream_dependencies"].add(full_view_name)
dependency_graph[full_view_name]["sql_patterns"].append({
"pattern_type": "view_definition",
"referenced_table": ref_table,
"confidence": 1.0
})
except Exception as e:
logger.warning(f"Failed to analyze view {table_name}: {str(e)}")
continue
except Exception as e:
logger.warning(f"Failed to analyze view dependencies: {str(e)}")
async def _analyze_runtime_dependencies(self, connection, dependency_graph: Dict, analysis_depth: int) -> None:
"""Analyze audit logs to discover runtime table dependencies"""
try:
# Get repeated SQL statements from the last year of audit logs (top 1000 by frequency)
audit_sql = """
SELECT
`stmt` as sql_statement,
`user` as user_name,
COUNT(*) as frequency
FROM internal.__internal_schema.audit_log
WHERE `stmt` IS NOT NULL
AND `stmt` != ''
AND `time` >= DATE_SUB(NOW(), INTERVAL 1 YEAR)
GROUP BY `stmt`, `user`
HAVING frequency > 1
ORDER BY frequency DESC
LIMIT 1000
"""
result = await connection.execute(audit_sql)
if result.data:
for row in result.data:
sql_statement = row.get("sql_statement", "")
frequency = row.get("frequency", 1)
if sql_statement:
# Extract table references from SQL
referenced_tables = self._extract_table_references(sql_statement)
if len(referenced_tables) > 1:
# Infer dependencies from multi-table queries
self._infer_dependencies_from_sql(
dependency_graph, sql_statement, referenced_tables, frequency
)
except Exception as e:
logger.warning(f"Failed to analyze runtime dependencies: {str(e)}")
async def _analyze_foreign_key_dependencies(self, connection, dependency_graph: Dict, tables_metadata: List[Dict]) -> None:
"""Analyze foreign key constraints for explicit dependencies"""
try:
# Get foreign key information
fk_sql = """
SELECT
TABLE_SCHEMA as schema_name,
TABLE_NAME as table_name,
COLUMN_NAME as column_name,
REFERENCED_TABLE_SCHEMA as ref_schema,
REFERENCED_TABLE_NAME as ref_table_name,
REFERENCED_COLUMN_NAME as ref_column_name
FROM information_schema.KEY_COLUMN_USAGE
WHERE REFERENCED_TABLE_NAME IS NOT NULL
"""
result = await connection.execute(fk_sql)
if result.data:
for row in result.data:
schema_name = row.get("schema_name", "")
table_name = row["table_name"]
ref_schema = row.get("ref_schema", "")
ref_table_name = row["ref_table_name"]
# Build full table names
full_table_name = f"{schema_name}.{table_name}" if schema_name else table_name
full_ref_table = f"{ref_schema}.{ref_table_name}" if ref_schema else ref_table_name
# Add foreign key dependency
dependency_graph[full_table_name]["upstream_dependencies"].add(full_ref_table)
dependency_graph[full_table_name]["dependency_strength"][full_ref_table] = "foreign_key"
dependency_graph[full_ref_table]["downstream_dependencies"].add(full_table_name)
dependency_graph[full_table_name]["sql_patterns"].append({
"pattern_type": "foreign_key",
"referenced_table": full_ref_table,
"confidence": 1.0,
"column": row["column_name"],
"ref_column": row["ref_column_name"]
})
except Exception as e:
logger.warning(f"Failed to analyze foreign key dependencies: {str(e)}")
def _extract_table_references(self, sql: str) -> List[str]:
"""Extract table references from SQL statement"""
if not sql:
return []
# Normalize SQL
sql = re.sub(r'/\*.*?\*/', '', sql, flags=re.DOTALL) # Remove comments
sql = re.sub(r'--.*', '', sql) # Remove line comments
sql = sql.upper()
table_references = []
# Pattern to match table names in various contexts
patterns = [
r'\bFROM\s+([`"]?[a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*[`"]?)',
r'\bJOIN\s+([`"]?[a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*[`"]?)',
r'\bINTO\s+([`"]?[a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*[`"]?)',
r'\bUPDATE\s+([`"]?[a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*[`"]?)',
r'\bDELETE\s+FROM\s+([`"]?[a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*[`"]?)',
r'\bINSERT\s+INTO\s+([`"]?[a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*[`"]?)'
]
for pattern in patterns:
matches = re.findall(pattern, sql, re.IGNORECASE)
for match in matches:
# Clean up table name
table_name = match.strip('`"\'').split()[0] # Remove quotes and aliases
if table_name and not self._is_sql_keyword(table_name):
table_references.append(table_name.lower())
return list(set(table_references))
def _is_sql_keyword(self, word: str) -> bool:
"""Check if word is a SQL keyword"""
keywords = {
'SELECT', 'FROM', 'WHERE', 'JOIN', 'INNER', 'LEFT', 'RIGHT', 'OUTER',
'ON', 'AND', 'OR', 'NOT', 'IN', 'EXISTS', 'BETWEEN', 'LIKE',
'INSERT', 'UPDATE', 'DELETE', 'CREATE', 'ALTER', 'DROP', 'INDEX',
'TABLE', 'VIEW', 'DATABASE', 'SCHEMA', 'PRIMARY', 'KEY', 'FOREIGN',
'REFERENCES', 'CONSTRAINT', 'NULL', 'DEFAULT', 'AUTO_INCREMENT'
}
return word.upper() in keywords
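
The regex-based extraction in `_extract_table_references` can be exercised on its own. This is a simplified sketch under stated assumptions: the names `_IDENT`, `_TABLE_REF`, and `extract_table_references` are hypothetical, the six patterns are folded into one alternation, and the SQL keyword filter is omitted for brevity.

```python
import re

# Optionally quoted identifier, possibly qualified (db.table); the name itself is captured
_IDENT = r'[`"]?([A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*)[`"]?'
_TABLE_REF = re.compile(r'\b(?:FROM|JOIN|INTO|UPDATE)\s+' + _IDENT, re.IGNORECASE)


def extract_table_references(sql: str) -> list:
    """Return the sorted, lower-cased table names referenced by a SQL statement."""
    sql = re.sub(r'/\*.*?\*/', '', sql, flags=re.DOTALL)  # strip block comments
    sql = re.sub(r'--.*', '', sql)  # strip line comments
    return sorted({name.lower() for name in _TABLE_REF.findall(sql)})


refs = extract_table_references(
    "INSERT INTO dw.daily_sales /* nightly */ "
    "SELECT * FROM ods.orders o JOIN ods.items i ON o.id = i.order_id -- join"
)
# Subqueries start with "(", which is not an identifier character, so they are skipped
no_refs = extract_table_references("SELECT 1 FROM (SELECT 2) t")
```

As with the original method, this is a heuristic rather than a SQL parser: string literals containing `--` and identifiers quoted piecewise (for example `` `db`.`tbl` ``) are not handled.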
def _infer_dependencies_from_sql(self, dependency_graph: Dict, sql: str, referenced_tables: List[str], frequency: int) -> None:
"""Infer table dependencies from SQL patterns"""
# Analyze SQL pattern to determine dependency relationships
sql_upper = sql.upper()
# Look for INSERT ... SELECT patterns
if 'INSERT' in sql_upper and 'SELECT' in sql_upper:
# Find target table (after INSERT INTO)
insert_match = re.search(r'INSERT\s+INTO\s+([a-zA-Z_][a-zA-Z0-9_.]*)', sql_upper)
if insert_match:
target_table = insert_match.group(1).lower()
# All other tables are dependencies
for ref_table in referenced_tables:
if ref_table != target_table:
dependency_graph[target_table]["upstream_dependencies"].add(ref_table)
dependency_graph[ref_table]["downstream_dependencies"].add(target_table)
# Calculate confidence based on frequency
confidence = min(0.9, 0.3 + (frequency / 100))
dependency_graph[target_table]["sql_patterns"].append({
"pattern_type": "insert_select",
"referenced_table": ref_table,
"confidence": confidence,
"frequency": frequency
})
# Look for CREATE TABLE AS SELECT patterns
elif 'CREATE' in sql_upper and 'SELECT' in sql_upper:
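# Skip an optional IF NOT EXISTS clause so that "IF" is not captured as the table name
create_match = re.search(r'CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?([a-zA-Z_][a-zA-Z0-9_.]*)', sql_upper)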
if create_match:
target_table = create_match.group(1).lower()
for ref_table in referenced_tables:
if ref_table != target_table:
dependency_graph[target_table]["upstream_dependencies"].add(ref_table)
dependency_graph[ref_table]["downstream_dependencies"].add(target_table)
dependency_graph[target_table]["sql_patterns"].append({
"pattern_type": "create_table_as_select",
"referenced_table": ref_table,
"confidence": 0.95,
"frequency": frequency
})
async def _analyze_single_table_dependencies(self, target_table: str, dependency_graph: Dict, tables_metadata: List[Dict]) -> Dict[str, Any]:
"""Analyze dependencies for a specific table"""
if target_table not in dependency_graph:
return {"error": f"Table {target_table} not found in dependency graph"}
table_info = dependency_graph[target_table]
# Get upstream dependencies (tables this table depends on)
upstream_deps = await self._get_dependency_chain(target_table, dependency_graph, "upstream", 3)
# Get downstream dependencies (tables that depend on this table)
downstream_deps = await self._get_dependency_chain(target_table, dependency_graph, "downstream", 3)
return {
"table_name": target_table,
"table_type": table_info["table_type"],
"direct_upstream_dependencies": list(table_info["upstream_dependencies"]),
"direct_downstream_dependencies": list(table_info["downstream_dependencies"]),
"upstream_dependency_chain": upstream_deps,
"downstream_dependency_chain": downstream_deps,
"dependency_patterns": table_info["sql_patterns"],
"dependency_metrics": {
"upstream_count": len(table_info["upstream_dependencies"]),
"downstream_count": len(table_info["downstream_dependencies"]),
"total_upstream_chain": len(upstream_deps.get("all_dependencies", [])),
"total_downstream_chain": len(downstream_deps.get("all_dependencies", [])),
"dependency_depth": max(upstream_deps.get("max_depth", 0), downstream_deps.get("max_depth", 0))
}
}
async def _get_dependency_chain(self, start_table: str, dependency_graph: Dict, direction: str, max_depth: int) -> Dict[str, Any]:
"""Get full dependency chain in specified direction"""
visited = set()
all_dependencies = []
levels = []
current_level = [start_table]
depth = 0
while current_level and depth < max_depth:
next_level = []
level_deps = []
for table in current_level:
if table in visited:
continue
visited.add(table)
if direction == "upstream":
dependencies = dependency_graph.get(table, {}).get("upstream_dependencies", set())
else:
dependencies = dependency_graph.get(table, {}).get("downstream_dependencies", set())
for dep in dependencies:
if dep not in visited:
next_level.append(dep)
level_deps.append(dep)
all_dependencies.append(dep)
if level_deps:
levels.append({
"level": depth + 1,
"tables": level_deps
})
current_level = next_level
depth += 1
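return {
"direction": direction,
# Report the deepest level that actually contained dependencies; the loop
# counter also counts the final pass that finds nothing
"max_depth": len(levels),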
"all_dependencies": list(set(all_dependencies)),
"dependency_levels": levels,
"total_count": len(set(all_dependencies))
}
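
The level-by-level traversal in `_get_dependency_chain` is a breadth-first search over the dependency sets. A minimal standalone sketch follows, assuming a plain `dict` that maps each table to the set of tables it depends on; the function name `dependency_levels` and the sample graph are hypothetical.

```python
def dependency_levels(graph: dict, start: str, max_depth: int) -> list:
    """Return dependencies of `start` grouped by distance, up to `max_depth` levels."""
    visited = {start}
    frontier = [start]
    levels = []
    while frontier and len(levels) < max_depth:
        next_frontier = []
        for table in frontier:
            # sorted() only makes the output order deterministic
            for dep in sorted(graph.get(table, ())):
                if dep not in visited:
                    visited.add(dep)  # mark on discovery so shared parents add it once
                    next_frontier.append(dep)
        if not next_frontier:
            break
        levels.append(next_frontier)
        frontier = next_frontier
    return levels


upstream = {
    "report": {"sales", "customers"},
    "sales": {"orders"},
    "customers": {"orders"},
    "orders": {"raw_events"},
}
levels = dependency_levels(upstream, "report", 3)
shallow = dependency_levels(upstream, "report", 2)
leaf = dependency_levels(upstream, "raw_events", 3)
```

Marking a table as visited when it is discovered, rather than when it is expanded, keeps a table reachable from two parents in the same level (here `orders`) from appearing twice.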
async def _analyze_all_tables_dependencies(self, dependency_graph: Dict, tables_metadata: List[Dict]) -> Dict[str, Any]:
"""Analyze dependencies for all tables"""
table_stats = {}
for table_name, table_info in dependency_graph.items():
upstream_count = len(table_info["upstream_dependencies"])
downstream_count = len(table_info["downstream_dependencies"])
table_stats[table_name] = {
"table_type": table_info["table_type"],
"upstream_count": upstream_count,
"downstream_count": downstream_count,
"total_connections": upstream_count + downstream_count,
"dependency_score": self._calculate_dependency_score(upstream_count, downstream_count),
"role_classification": self._classify_table_role(upstream_count, downstream_count)
}
# Find key tables
most_critical_tables = sorted(
table_stats.items(),
key=lambda x: x[1]["dependency_score"],
reverse=True
)[:10]
source_tables = [name for name, stats in table_stats.items() if stats["role_classification"] == "source"]
sink_tables = [name for name, stats in table_stats.items() if stats["role_classification"] == "sink"]
hub_tables = [name for name, stats in table_stats.items() if stats["role_classification"] == "hub"]
return {
"table_statistics": table_stats,
"summary": {
"total_tables": len(table_stats),
"source_tables": len(source_tables),
"sink_tables": len(sink_tables),
"hub_tables": len(hub_tables),
"isolated_tables": len([stats for stats in table_stats.values() if stats["total_connections"] == 0])
},
"critical_tables": [{"table": name, **stats} for name, stats in most_critical_tables],
"table_roles": {
"sources": source_tables[:10],
"sinks": sink_tables[:10],
"hubs": hub_tables[:10]
}
}
def _calculate_dependency_score(self, upstream_count: int, downstream_count: int) -> float:
"""Calculate dependency importance score for a table"""
# Score based on both incoming and outgoing dependencies
# Higher weight for downstream dependencies (impact)
return round(upstream_count * 0.3 + downstream_count * 0.7, 2)
def _classify_table_role(self, upstream_count: int, downstream_count: int) -> str:
"""Classify table role based on dependency pattern"""
if upstream_count == 0 and downstream_count > 0:
return "source" # Data source
elif upstream_count > 0 and downstream_count == 0:
return "sink" # Data destination
elif upstream_count > 2 and downstream_count > 2:
return "hub" # Data hub/transformation
elif upstream_count > 0 and downstream_count > 0:
return "intermediate" # Intermediate transformation
else:
return "isolated" # No dependencies
async def _calculate_impact_analysis(self, target_table: str, dependency_graph: Dict, direction: str) -> Dict[str, Any]:
"""Calculate impact analysis for a specific table"""
if direction in ("upstream", "both"):
upstream_impact = await self._calculate_upstream_impact(target_table, dependency_graph)
else:
upstream_impact = {}
if direction in ("downstream", "both"):
downstream_impact = await self._calculate_downstream_impact(target_table, dependency_graph)
else:
downstream_impact = {}
return {
"target_table": target_table,
"upstream_impact": upstream_impact,
"downstream_impact": downstream_impact,
"total_impact_score": self._calculate_total_impact_score(upstream_impact, downstream_impact)
}
async def _calculate_upstream_impact(self, target_table: str, dependency_graph: Dict) -> Dict[str, Any]:
"""Calculate what would be impacted if upstream dependencies fail"""
upstream_deps = dependency_graph.get(target_table, {}).get("upstream_dependencies", set())
impact_scenarios = []
for dep_table in upstream_deps:
# Simulate failure of this dependency
affected_tables = await self._simulate_table_failure_impact(dep_table, dependency_graph)
impact_scenarios.append({
"failed_dependency": dep_table,
"directly_affected_tables": len(affected_tables["direct"]),
"indirectly_affected_tables": len(affected_tables["indirect"]),
"total_affected": len(affected_tables["all"]),
"critical_affected": [table for table in affected_tables["all"]
if dependency_graph.get(table, {}).get("downstream_dependencies", set())],
"impact_severity": self._assess_impact_severity(len(affected_tables["all"]))
})
return {
"dependency_count": len(upstream_deps),
"impact_scenarios": impact_scenarios,
"max_potential_impact": max([scenario["total_affected"] for scenario in impact_scenarios], default=0),
"risk_assessment": self._assess_upstream_risk(impact_scenarios)
}
async def _calculate_downstream_impact(self, target_table: str, dependency_graph: Dict) -> Dict[str, Any]:
"""Calculate what would be impacted if target table fails"""
affected_tables = await self._simulate_table_failure_impact(target_table, dependency_graph)
return {
"direct_impact": len(affected_tables["direct"]),
"indirect_impact": len(affected_tables["indirect"]),
"total_impact": len(affected_tables["all"]),
"affected_table_details": [
{
"table_name": table,
"impact_type": "direct" if table in affected_tables["direct"] else "indirect",
"table_role": self._classify_table_role(
len(dependency_graph.get(table, {}).get("upstream_dependencies", set())),
len(dependency_graph.get(table, {}).get("downstream_dependencies", set()))
)
}
for table in affected_tables["all"]
],
"impact_severity": self._assess_impact_severity(len(affected_tables["all"]))
}
async def _simulate_table_failure_impact(self, failed_table: str, dependency_graph: Dict) -> Dict[str, List[str]]:
"""Simulate the impact of a table failure"""
direct_affected = list(dependency_graph.get(failed_table, {}).get("downstream_dependencies", set()))
# Find all indirectly affected tables using BFS
visited = {failed_table}
queue = deque(direct_affected)
indirect_affected = []
while queue:
current_table = queue.popleft()
if current_table in visited:
continue
visited.add(current_table)
indirect_affected.append(current_table)
# Add downstream dependencies to queue
downstream = dependency_graph.get(current_table, {}).get("downstream_dependencies", set())
for dep in downstream:
if dep not in visited:
queue.append(dep)
# Remove direct affected from indirect (they're already counted)
indirect_only = [table for table in indirect_affected if table not in direct_affected]
return {
"direct": direct_affected,
"indirect": indirect_only,
"all": direct_affected + indirect_only
}
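
The failure simulation in `_simulate_table_failure_impact` separates directly affected tables from those reached transitively. The sketch below illustrates that split on a small graph containing a cycle. It assumes a plain `dict` mapping each table to its downstream set; the name `failure_impact` and the sample data are hypothetical.

```python
from collections import deque


def failure_impact(downstream: dict, failed: str) -> dict:
    """Split tables affected by a failure of `failed` into direct and indirect."""
    direct = sorted(downstream.get(failed, ()))
    visited = {failed, *direct}
    queue = deque(direct)
    indirect = []
    while queue:
        table = queue.popleft()
        for dep in sorted(downstream.get(table, ())):
            if dep not in visited:
                visited.add(dep)  # guards against cycles and repeated visits
                indirect.append(dep)
                queue.append(dep)
    return {"direct": direct, "indirect": indirect, "all": direct + indirect}


# "report" feeds back into "orders", forming a cycle
downstream = {
    "orders": {"sales", "customers"},
    "sales": {"report"},
    "customers": {"report"},
    "report": {"orders"},
}
impact = failure_impact(downstream, "orders")
```

Because the failed table is placed in `visited` up front, the cycle back to `orders` terminates and the failed table is not reported as affected by its own failure.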
def _assess_impact_severity(self, affected_count: int) -> str:
"""Assess impact severity based on affected table count"""
if affected_count == 0:
return "none"
elif affected_count <= 2:
return "low"
elif affected_count <= 5:
return "medium"
elif affected_count <= 10:
return "high"
else:
return "critical"
def _assess_upstream_risk(self, impact_scenarios: List[Dict]) -> str:
"""Assess upstream dependency risk"""
if not impact_scenarios:
return "low"
max_impact = max([scenario["total_affected"] for scenario in impact_scenarios])
high_impact_scenarios = len([s for s in impact_scenarios if s["impact_severity"] in ["high", "critical"]])
if high_impact_scenarios > 0 or max_impact > 10:
return "high"
elif max_impact > 5 or len(impact_scenarios) > 3:
return "medium"
else:
return "low"
def _calculate_total_impact_score(self, upstream_impact: Dict, downstream_impact: Dict) -> float:
"""Calculate total impact score combining upstream and downstream risks"""
upstream_score = 0
downstream_score = 0
if upstream_impact:
max_upstream_impact = upstream_impact.get("max_potential_impact", 0)
upstream_score = min(max_upstream_impact * 0.3, 10) # Cap at 10
if downstream_impact:
downstream_score = min(downstream_impact.get("total_impact", 0) * 0.7, 10) # Cap at 10
return round(upstream_score + downstream_score, 2)
async def _calculate_global_impact_analysis(self, dependency_graph: Dict) -> Dict[str, Any]:
"""Calculate global impact analysis for all tables"""
table_impacts = {}
for table_name in dependency_graph.keys():
impact = await self._calculate_impact_analysis(table_name, dependency_graph, "downstream")
table_impacts[table_name] = {
"downstream_impact": impact["downstream_impact"]["total_impact"],
"impact_severity": impact["downstream_impact"]["impact_severity"],
"impact_score": impact["total_impact_score"]
}
# Find most critical tables
critical_tables = sorted(
table_impacts.items(),
key=lambda x: x[1]["impact_score"],
reverse=True
)[:15]
# Risk distribution
risk_distribution = {
"critical": len([t for t in table_impacts.values() if t["impact_severity"] == "critical"]),
"high": len([t for t in table_impacts.values() if t["impact_severity"] == "high"]),
"medium": len([t for t in table_impacts.values() if t["impact_severity"] == "medium"]),
"low": len([t for t in table_impacts.values() if t["impact_severity"] == "low"]),
"none": len([t for t in table_impacts.values() if t["impact_severity"] == "none"])
}
return {
"global_impact_summary": {
"total_tables_analyzed": len(table_impacts),
"tables_with_impact": len([t for t in table_impacts.values() if t["downstream_impact"] > 0]),
"average_impact_score": round(sum(t["impact_score"] for t in table_impacts.values()) / len(table_impacts), 2) if table_impacts else 0,
"risk_distribution": risk_distribution
},
"most_critical_tables": [{"table": name, **stats} for name, stats in critical_tables],
"risk_matrix": self._generate_risk_matrix(table_impacts)
}
def _generate_risk_matrix(self, table_impacts: Dict[str, Dict]) -> Dict[str, List[str]]:
"""Generate risk matrix categorizing tables by impact level"""
risk_matrix = {
"critical_risk": [],
"high_risk": [],
"medium_risk": [],
"low_risk": [],
"minimal_risk": []
}
for table_name, impact_data in table_impacts.items():
severity = impact_data["impact_severity"]
if severity == "critical":
risk_matrix["critical_risk"].append(table_name)
elif severity == "high":
risk_matrix["high_risk"].append(table_name)
elif severity == "medium":
risk_matrix["medium_risk"].append(table_name)
elif severity == "low":
risk_matrix["low_risk"].append(table_name)
else:
risk_matrix["minimal_risk"].append(table_name)
return risk_matrix
def _get_dependency_graph_stats(self, dependency_graph: Dict) -> Dict[str, Any]:
"""Get statistics about the dependency graph"""
total_tables = len(dependency_graph)
total_dependencies = sum(
len(table_info.get("upstream_dependencies", set())) + len(table_info.get("downstream_dependencies", set()))
for table_info in dependency_graph.values()
) // 2 # Divide by 2 to avoid double counting
tables_with_upstream = len([
table for table, info in dependency_graph.items()
if info.get("upstream_dependencies")
])
tables_with_downstream = len([
table for table, info in dependency_graph.items()
if info.get("downstream_dependencies")
])
isolated_tables = len([
table for table, info in dependency_graph.items()
if not info.get("upstream_dependencies") and not info.get("downstream_dependencies")
])
return {
"total_tables": total_tables,
"total_dependencies": total_dependencies,
"tables_with_upstream_deps": tables_with_upstream,
"tables_with_downstream_deps": tables_with_downstream,
"isolated_tables": isolated_tables,
"connectivity_ratio": round((total_tables - isolated_tables) / total_tables, 3) if total_tables > 0 else 0,
"avg_dependencies_per_table": round(total_dependencies / total_tables, 2) if total_tables > 0 else 0
}
async def _generate_dependency_insights(self, dependency_graph: Dict, table_analysis: Dict, impact_analysis: Dict) -> Dict[str, Any]:
"""Generate insights from dependency analysis"""
insights = {
"architectural_patterns": {},
"risk_assessment": {},
"optimization_opportunities": {}
}
# Architectural patterns
graph_stats = self._get_dependency_graph_stats(dependency_graph)
insights["architectural_patterns"] = {
"connectivity_level": "high" if graph_stats["connectivity_ratio"] > 0.7 else "medium" if graph_stats["connectivity_ratio"] > 0.3 else "low",
"architecture_type": self._classify_architecture_type(graph_stats),
"complexity_score": round(graph_stats["avg_dependencies_per_table"] * graph_stats["connectivity_ratio"], 2),
"isolated_tables_concern": graph_stats["isolated_tables"] > graph_stats["total_tables"] * 0.3
}
# Risk assessment
if isinstance(impact_analysis, dict) and "global_impact_summary" in impact_analysis:
global_impact = impact_analysis["global_impact_summary"]
insights["risk_assessment"] = {
"overall_risk_level": self._assess_overall_risk_level(global_impact["risk_distribution"]),
"critical_tables_count": global_impact["risk_distribution"]["critical"],
"high_risk_tables_count": global_impact["risk_distribution"]["high"],
"impact_concentration": global_impact["average_impact_score"] > 5.0,
"resilience_score": self._calculate_resilience_score(global_impact)
}
# Optimization opportunities
insights["optimization_opportunities"] = self._identify_optimization_opportunities(dependency_graph, table_analysis)
return insights
def _classify_architecture_type(self, graph_stats: Dict) -> str:
"""Classify the overall architecture type"""
connectivity = graph_stats["connectivity_ratio"]
avg_deps = graph_stats["avg_dependencies_per_table"]
if connectivity > 0.8 and avg_deps > 3:
return "highly_interconnected"
elif connectivity > 0.5 and avg_deps > 2:
return "moderately_connected"
elif connectivity < 0.3:
return "loosely_coupled"
else:
return "mixed_architecture"
def _assess_overall_risk_level(self, risk_distribution: Dict[str, int]) -> str:
"""Assess overall risk level from risk distribution"""
total = sum(risk_distribution.values())
if total == 0:
return "minimal"
critical_ratio = risk_distribution["critical"] / total
high_ratio = risk_distribution["high"] / total
if critical_ratio > 0.1 or high_ratio > 0.2:
return "high"
elif critical_ratio > 0.05 or high_ratio > 0.1:
return "medium"
else:
return "low"
def _calculate_resilience_score(self, global_impact: Dict) -> float:
"""Calculate system resilience score (0-1, higher is better)"""
total_tables = global_impact["total_tables_analyzed"]
risk_dist = global_impact["risk_distribution"]
if total_tables == 0:
return 0.0
# Calculate weighted risk score
weighted_risk = (
risk_dist["critical"] * 5 +
risk_dist["high"] * 3 +
risk_dist["medium"] * 2 +
risk_dist["low"] * 1
) / total_tables
# Convert to resilience score (inverse of risk, normalized)
max_possible_risk = 5.0
resilience = max(0, (max_possible_risk - weighted_risk) / max_possible_risk)
return round(resilience, 3)
def _identify_optimization_opportunities(self, dependency_graph: Dict, table_analysis: Dict) -> List[Dict]:
"""Identify optimization opportunities"""
opportunities = []
# Find tables with excessive dependencies
for table_name, table_info in dependency_graph.items():
upstream_count = len(table_info.get("upstream_dependencies", set()))
downstream_count = len(table_info.get("downstream_dependencies", set()))
if upstream_count > 10:
opportunities.append({
"type": "excessive_upstream_dependencies",
"table": table_name,
"description": f"Table has {upstream_count} upstream dependencies",
"recommendation": "Consider breaking down complex transformations or using intermediate tables",
"priority": "high" if upstream_count > 15 else "medium"
})
if downstream_count > 10:
opportunities.append({
"type": "excessive_downstream_dependencies",
"table": table_name,
"description": f"Table has {downstream_count} downstream dependencies",
"recommendation": "Consider if this table is doing too much or if views could be used",
"priority": "high" if downstream_count > 15 else "medium"
})
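# Find potential circular dependencies (simplified check)
# Only direct two-table cycles (A depends on B and B depends on A) are detected,
# and each such pair is reported once per table; longer cycles need a full graph traversal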
for table_name, table_info in dependency_graph.items():
upstream_deps = table_info.get("upstream_dependencies", set())
for upstream_table in upstream_deps:
if table_name in dependency_graph.get(upstream_table, {}).get("upstream_dependencies", set()):
opportunities.append({
"type": "potential_circular_dependency",
"table": table_name,
"related_table": upstream_table,
"description": f"Potential circular dependency between {table_name} and {upstream_table}",
"recommendation": "Review and eliminate circular dependencies",
"priority": "high"
})
return opportunities
def _generate_dependency_recommendations(self, dependency_insights: Dict) -> List[Dict]:
"""Generate recommendations based on dependency analysis"""
recommendations = []
# Architecture recommendations
arch_patterns = dependency_insights.get("architectural_patterns", {})
if arch_patterns.get("isolated_tables_concern", False):
recommendations.append({
"type": "architecture",
"priority": "medium",
"title": "High number of isolated tables",
"description": "Many tables have no dependencies, which may indicate data silos",
"action": "Review isolated tables and consider if they should be integrated into data flows"
})
complexity_score = arch_patterns.get("complexity_score", 0)
if complexity_score > 5:
recommendations.append({
"type": "architecture",
"priority": "high",
"title": "High system complexity",
"description": f"System complexity score is {complexity_score} (high)",
"action": "Consider simplifying data architecture and reducing unnecessary dependencies"
})
# Risk recommendations
risk_assessment = dependency_insights.get("risk_assessment", {})
overall_risk = risk_assessment.get("overall_risk_level", "unknown")
if overall_risk == "high":
recommendations.append({
"type": "risk_mitigation",
"priority": "high",
"title": "High overall system risk",
"description": "System has high dependency risks that could cause widespread failures",
"action": "Implement monitoring and backup strategies for critical tables"
})
critical_tables = risk_assessment.get("critical_tables_count", 0)
if critical_tables > 0:
recommendations.append({
"type": "risk_mitigation",
"priority": "high",
"title": f"{critical_tables} critical impact tables identified",
"description": "Tables with critical impact require special attention",
"action": "Implement enhanced monitoring and backup procedures for critical tables"
})
# Optimization recommendations
optimization_ops = dependency_insights.get("optimization_opportunities", [])
if optimization_ops:
high_priority_ops = [op for op in optimization_ops if op.get("priority") == "high"]
if high_priority_ops:
recommendations.append({
"type": "optimization",
"priority": "high",
"title": f"{len(high_priority_ops)} high-priority optimization opportunities",
"description": "System has optimization opportunities that should be addressed",
"action": "Review and implement suggested optimizations for better maintainability"
})
return recommendations
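
The circular-dependency check in `_identify_optimization_opportunities` only catches direct two-table loops. A fuller check could walk the graph depth-first and report every back edge. The following is a minimal sketch, not part of this commit; it assumes the same `dependency_graph` shape used above (table name mapped to a dict with an `upstream_dependencies` set), and the function name `find_dependency_cycles` is hypothetical.

```python
from typing import Dict, List


def find_dependency_cycles(dependency_graph: Dict[str, Dict]) -> List[List[str]]:
    """Return one cycle (a path that ends where it started) per back edge found."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color: Dict[str, int] = {}
    path: List[str] = []
    cycles: List[List[str]] = []

    def visit(table: str) -> None:
        color[table] = GRAY
        path.append(table)
        # Tables referenced but missing from the graph are treated as having no upstreams
        upstream = dependency_graph.get(table, {}).get("upstream_dependencies", set())
        for dep in sorted(upstream):  # sorted only to make the output deterministic
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so this edge closes a cycle
                cycles.append(path[path.index(dep):] + [dep])
            elif state == WHITE:
                visit(dep)
        path.pop()
        color[table] = BLACK

    for table in sorted(dependency_graph):
        if color.get(table, WHITE) == WHITE:
            visit(table)
    return cycles


# Hypothetical example graph: three tables form a loop, one is isolated
graph = {
    "orders": {"upstream_dependencies": {"customers"}},
    "customers": {"upstream_dependencies": {"regions"}},
    "regions": {"upstream_dependencies": {"orders"}},
    "audit_log": {"upstream_dependencies": set()},
}
print(find_dependency_cycles(graph))
```

The recursive form keeps the sketch short; for very deep dependency chains an explicit stack would avoid Python's recursion limit.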


@@ -15,77 +15,573 @@
# specific language governing permissions and limitations
# under the License.
"""
Logging configuration for Doris MCP Server.
Enhanced Logging configuration for Doris MCP Server.
Features:
- Log level-based file separation
- Timestamped log entries
- Automatic log rotation
- Comprehensive logging coverage
"""
import logging
import logging.config
import logging.handlers
import sys
import os
import asyncio
import time
from pathlib import Path
from typing import Any
from typing import Any, Optional
from datetime import datetime, timedelta
import threading
def setup_logging(
level: str = "INFO",
log_file: str | None = None,
log_format: str | None = None,
) -> None:
class TimestampedFormatter(logging.Formatter):
"""Custom formatter with enhanced timestamp and structured format"""
def __init__(self, fmt=None, datefmt=None, style='%'):
if fmt is None:
fmt = "%(asctime)s.%(msecs)03d %(level_aligned)s %(name)s:%(lineno)d - %(message)s"
if datefmt is None:
datefmt = "%Y-%m-%d %H:%M:%S"
super().__init__(fmt, datefmt, style)
def format(self, record):
"""Format log record with enhanced information and proper alignment"""
# Add process info if available
if hasattr(record, 'process') and record.process:
record.process_info = f"[PID:{record.process}]"
else:
record.process_info = ""
# Add thread info if available
if hasattr(record, 'thread') and record.thread:
record.thread_info = f"[TID:{record.thread}]"
else:
record.thread_info = ""
# Format with proper alignment after the level name
# Calculate padding needed for alignment
level_name = record.levelname
max_level_length = 8 # Length of "CRITICAL"
padding = max_level_length - len(level_name)
record.level_aligned = f"[{level_name}]{' ' * padding}"
return super().format(record)
class LevelBasedFileHandler(logging.Handler):
"""Custom handler that writes different log levels to different files"""
def __init__(self, log_dir: str, base_name: str = "doris_mcp_server",
max_bytes: int = 10*1024*1024, backup_count: int = 5):
super().__init__()
self.log_dir = Path(log_dir)
self.base_name = base_name
self.max_bytes = max_bytes
self.backup_count = backup_count
# Ensure log directory exists
self.log_dir.mkdir(parents=True, exist_ok=True)
# Create handlers for different log levels
self.handlers = {}
self._setup_level_handlers()
def _setup_level_handlers(self):
"""Setup rotating file handlers for different log levels"""
level_files = {
'DEBUG': 'debug.log',
'INFO': 'info.log',
'WARNING': 'warning.log',
'ERROR': 'error.log',
'CRITICAL': 'critical.log'
}
formatter = TimestampedFormatter()
for level, filename in level_files.items():
file_path = self.log_dir / f"{self.base_name}_{filename}"
handler = logging.handlers.RotatingFileHandler(
file_path,
maxBytes=self.max_bytes,
backupCount=self.backup_count,
encoding='utf-8'
)
handler.setFormatter(formatter)
handler.setLevel(getattr(logging, level))
self.handlers[level] = handler
def emit(self, record):
"""Emit log record to appropriate level-based file"""
level_name = record.levelname
if level_name in self.handlers:
try:
self.handlers[level_name].emit(record)
except Exception:
self.handleError(record)
def close(self):
"""Close all handlers"""
for handler in self.handlers.values():
handler.close()
super().close()
class LogCleanupManager:
"""Log file cleanup manager for automatic maintenance"""
def __init__(self, log_dir: str, max_age_days: int = 30, cleanup_interval_hours: int = 24):
"""
Setup logging configuration.
Initialize log cleanup manager.
Args:
level: Logging level (DEBUG, INFO, WARNING, ERROR)
log_file: Optional log file path
log_format: Optional custom log format
log_dir: Directory containing log files
max_age_days: Maximum age of log files in days (default: 30 days)
cleanup_interval_hours: Cleanup interval in hours (default: 24 hours)
"""
if log_format is None:
log_format = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
self.log_dir = Path(log_dir)
self.max_age_days = max_age_days
self.cleanup_interval_hours = cleanup_interval_hours
self.cleanup_thread = None
self.stop_event = threading.Event()
self.logger = None
# Base configuration
config: dict[str, Any] = {
"version": 1,
"disable_existing_loggers": False,
"formatters": {
"default": {"format": log_format, "datefmt": "%Y-%m-%d %H:%M:%S"}
},
"handlers": {
"console": {
"class": "logging.StreamHandler",
"level": level,
"formatter": "default",
"stream": sys.stdout,
}
},
"root": {"level": level, "handlers": ["console"]},
"loggers": {
"doris_mcp_server": {
"level": level,
"handlers": ["console"],
"propagate": False,
}
},
def start_cleanup_scheduler(self):
"""Start the cleanup scheduler in a background thread"""
if self.cleanup_thread and self.cleanup_thread.is_alive():
return
self.stop_event.clear()
self.cleanup_thread = threading.Thread(target=self._cleanup_loop, daemon=True)
self.cleanup_thread.start()
# Get logger for this class
if not self.logger:
self.logger = logging.getLogger("doris_mcp_server.log_cleanup")
self.logger.info(f"Log cleanup scheduler started - cleanup every {self.cleanup_interval_hours}h, max age {self.max_age_days} days")
def stop_cleanup_scheduler(self):
"""Stop the cleanup scheduler"""
if self.cleanup_thread and self.cleanup_thread.is_alive():
self.stop_event.set()
self.cleanup_thread.join(timeout=5)
if self.logger:
self.logger.info("Log cleanup scheduler stopped")
def _cleanup_loop(self):
"""Background loop for periodic cleanup"""
while not self.stop_event.is_set():
try:
self.cleanup_old_logs()
# Sleep for the specified interval, but check stop event every 60 seconds
for _ in range(self.cleanup_interval_hours * 60): # Convert hours to minutes
if self.stop_event.wait(60): # Wait 60 seconds or until stop event
break
except Exception as e:
if self.logger:
self.logger.error(f"Error in log cleanup loop: {e}")
# Sleep for 5 minutes before retrying
self.stop_event.wait(300)
def cleanup_old_logs(self):
"""Clean up old log files based on age"""
if not self.log_dir.exists():
return
current_time = datetime.now()
cutoff_time = current_time - timedelta(days=self.max_age_days)
cleaned_files = []
cleaned_size = 0
# Pattern for log files (including backup files)
log_patterns = [
"doris_mcp_server_*.log",
"doris_mcp_server_*.log.*" # Backup files
]
for pattern in log_patterns:
for log_file in self.log_dir.glob(pattern):
try:
# Get file modification time
file_mtime = datetime.fromtimestamp(log_file.stat().st_mtime)
if file_mtime < cutoff_time:
file_size = log_file.stat().st_size
log_file.unlink() # Delete the file
cleaned_files.append(log_file.name)
cleaned_size += file_size
except Exception as e:
if self.logger:
self.logger.warning(f"Failed to cleanup log file {log_file}: {e}")
if cleaned_files and self.logger:
size_mb = cleaned_size / (1024 * 1024)
self.logger.info(f"Cleaned up {len(cleaned_files)} old log files, freed {size_mb:.2f} MB")
self.logger.debug(f"Cleaned files: {', '.join(cleaned_files)}")
def get_cleanup_stats(self) -> dict:
"""Get statistics about log files and cleanup status"""
if not self.log_dir.exists():
return {"error": "Log directory does not exist"}
stats = {
"log_directory": str(self.log_dir.absolute()),
"max_age_days": self.max_age_days,
"cleanup_interval_hours": self.cleanup_interval_hours,
# bool() so the value is always True/False, never None when no thread was started
"scheduler_running": bool(self.cleanup_thread and self.cleanup_thread.is_alive()),
"total_files": 0,
"total_size_mb": 0,
"files_by_age": {"recent": 0, "old": 0},
"oldest_file": None,
"newest_file": None
}
# Add file handler if log_file is specified
if log_file:
# Ensure log directory exists
log_path = Path(log_file)
log_path.parent.mkdir(parents=True, exist_ok=True)
current_time = datetime.now()
cutoff_time = current_time - timedelta(days=self.max_age_days)
oldest_time = None
newest_time = None
config["handlers"]["file"] = {
"class": "logging.handlers.RotatingFileHandler",
"level": level,
"formatter": "default",
"filename": log_file,
"maxBytes": 10485760, # 10MB
"backupCount": 5,
}
log_patterns = ["doris_mcp_server_*.log", "doris_mcp_server_*.log.*"]
# Add file handler to root and package loggers
config["root"]["handlers"].append("file")
config["loggers"]["doris_mcp_server"]["handlers"].append("file")
for pattern in log_patterns:
for log_file in self.log_dir.glob(pattern):
try:
file_stat = log_file.stat()
file_mtime = datetime.fromtimestamp(file_stat.st_mtime)
logging.config.dictConfig(config)
stats["total_files"] += 1
stats["total_size_mb"] += file_stat.st_size / (1024 * 1024)
if file_mtime < cutoff_time:
stats["files_by_age"]["old"] += 1
else:
stats["files_by_age"]["recent"] += 1
if oldest_time is None or file_mtime < oldest_time:
oldest_time = file_mtime
stats["oldest_file"] = {"name": log_file.name, "age_days": (current_time - file_mtime).days}
if newest_time is None or file_mtime > newest_time:
newest_time = file_mtime
stats["newest_file"] = {"name": log_file.name, "age_days": (current_time - file_mtime).days}
except Exception:
continue
stats["total_size_mb"] = round(stats["total_size_mb"], 2)
return stats
class DorisLoggerManager:
"""Centralized logger manager for Doris MCP Server"""
def __init__(self):
self.is_initialized = False
self.log_dir = None
self.config = None
self.loggers = {}
self.cleanup_manager = None
def setup_logging(self,
level: str = "INFO",
log_dir: str = "logs",
enable_console: bool = True,
enable_file: bool = True,
enable_audit: bool = True,
audit_file: Optional[str] = None,
max_file_size: int = 10*1024*1024,
backup_count: int = 5,
enable_cleanup: bool = True,
max_age_days: int = 30,
cleanup_interval_hours: int = 24) -> None:
"""
Setup comprehensive logging configuration.
Args:
level: Base logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
log_dir: Directory for log files
enable_console: Enable console output
enable_file: Enable file logging
enable_audit: Enable audit logging
audit_file: Custom audit log file path
max_file_size: Maximum size per log file (bytes)
backup_count: Number of backup files to keep
enable_cleanup: Enable automatic log cleanup
max_age_days: Maximum age of log files in days (default: 30)
cleanup_interval_hours: Cleanup interval in hours (default: 24)
"""
if self.is_initialized:
return
self.log_dir = Path(log_dir)
log_dir_writable = True # Initialize the variable
# Try to create log directory, fallback to console-only if fails
try:
self.log_dir.mkdir(parents=True, exist_ok=True)
except (OSError, PermissionError) as e:
# If we can't create log directory (e.g., read-only filesystem in stdio mode),
# fall back to console-only logging
log_dir_writable = False
enable_file = False
enable_audit = False
enable_cleanup = False
# Don't use print() in stdio mode as it interferes with MCP JSON protocol
# Log the warning through the logging system instead, which will be handled after setup
# Clear existing handlers
root_logger = logging.getLogger()
for handler in root_logger.handlers[:]:
root_logger.removeHandler(handler)
# Set root logger level
root_logger.setLevel(logging.DEBUG) # Allow all levels, handlers will filter
handlers = []
# Console handler
if enable_console:
console_handler = logging.StreamHandler(sys.stdout)
console_handler.setLevel(getattr(logging, level.upper()))
console_formatter = TimestampedFormatter(
fmt="%(asctime)s.%(msecs)03d %(level_aligned)s %(name)s - %(message)s",
datefmt="%Y-%m-%d %H:%M:%S"
)
console_handler.setFormatter(console_formatter)
handlers.append(console_handler)
# Level-based file handlers
if enable_file:
level_handler = LevelBasedFileHandler(
log_dir=str(self.log_dir),
base_name="doris_mcp_server",
max_bytes=max_file_size,
backup_count=backup_count
)
level_handler.setLevel(logging.DEBUG) # Accept all levels
handlers.append(level_handler)
# Combined application log (all levels in one file)
if enable_file:
app_log_file = self.log_dir / "doris_mcp_server_all.log"
app_handler = logging.handlers.RotatingFileHandler(
app_log_file,
maxBytes=max_file_size,
backupCount=backup_count,
encoding='utf-8'
)
app_handler.setLevel(getattr(logging, level.upper()))
app_formatter = TimestampedFormatter()
app_handler.setFormatter(app_formatter)
handlers.append(app_handler)
# Audit logger (separate from main logging)
if enable_audit:
audit_file_path = audit_file or str(self.log_dir / "doris_mcp_server_audit.log")
audit_logger = logging.getLogger("audit")
audit_logger.setLevel(logging.INFO)
# Clear existing audit handlers
for handler in audit_logger.handlers[:]:
audit_logger.removeHandler(handler)
audit_handler = logging.handlers.RotatingFileHandler(
audit_file_path,
maxBytes=max_file_size,
backupCount=backup_count,
encoding='utf-8'
)
audit_formatter = TimestampedFormatter(
fmt="%(asctime)s.%(msecs)03d [AUDIT] %(name)s - %(message)s",
datefmt="%Y-%m-%d %H:%M:%S"
)
audit_handler.setFormatter(audit_formatter)
audit_logger.addHandler(audit_handler)
audit_logger.propagate = False # Don't propagate to root logger
# Add all handlers to root logger
for handler in handlers:
root_logger.addHandler(handler)
# Setup package-specific loggers
self._setup_package_loggers(level)
# Setup log cleanup manager
if enable_cleanup and enable_file:
self.cleanup_manager = LogCleanupManager(
log_dir=str(self.log_dir),
max_age_days=max_age_days,
cleanup_interval_hours=cleanup_interval_hours
)
self.cleanup_manager.start_cleanup_scheduler()
self.is_initialized = True
# Log initialization message
logger = self.get_logger("doris_mcp_server.logger")
logger.info("=" * 80)
logger.info("Doris MCP Server Logging System Initialized")
logger.info(f"Log Level: {level}")
if log_dir_writable:
logger.info(f"Log Directory: {self.log_dir.absolute()}")
else:
logger.info("Log Directory: Not available (console-only mode)")
logger.info(f"Console Logging: {'Enabled' if enable_console else 'Disabled'}")
logger.info(f"File Logging: {'Enabled' if enable_file else 'Disabled (fallback mode)'}")
logger.info(f"Audit Logging: {'Enabled' if enable_audit else 'Disabled (fallback mode)'}")
logger.info(f"Log Cleanup: {'Enabled' if enable_cleanup and enable_file else 'Disabled (fallback mode)'}")
if enable_cleanup and enable_file:
logger.info(f"Cleanup Settings: Max age {max_age_days} days, interval {cleanup_interval_hours}h")
if not log_dir_writable:
logger.warning("Running in console-only logging mode due to filesystem permissions")
logger.warning(f"Could not create log directory '{log_dir}' - stdio mode fallback enabled")
logger.info("=" * 80)
def _setup_package_loggers(self, level: str):
"""Setup specific loggers for different modules"""
package_loggers = [
"doris_mcp_server",
"doris_mcp_server.main",
"doris_mcp_server.utils",
"doris_mcp_server.tools",
"doris_mcp_client"
]
for logger_name in package_loggers:
logger = logging.getLogger(logger_name)
logger.setLevel(getattr(logging, level.upper()))
# Don't add handlers here - they inherit from root logger
def get_logger(self, name: str) -> logging.Logger:
"""
Get a logger instance with proper configuration.
Args:
name: Logger name (usually __name__)
Returns:
Configured logger instance
"""
if name not in self.loggers:
logger = logging.getLogger(name)
self.loggers[name] = logger
return self.loggers[name]
def get_audit_logger(self) -> logging.Logger:
"""Get the audit logger"""
return logging.getLogger("audit")
def log_system_info(self):
"""Log system information for debugging"""
logger = self.get_logger("doris_mcp_server.system")
logger.info("System Information:")
logger.info(f"Python Version: {sys.version}")
logger.info(f"Platform: {sys.platform}")
logger.info(f"Working Directory: {os.getcwd()}")
logger.info(f"Process ID: {os.getpid()}")
# Log environment variables (filtered)
env_vars = ["LOG_LEVEL", "LOG_FILE_PATH", "ENABLE_AUDIT", "AUDIT_FILE_PATH"]
for var in env_vars:
value = os.getenv(var, "Not Set")
logger.info(f"Environment {var}: {value}")
def get_cleanup_stats(self) -> dict:
"""Get log cleanup statistics"""
if self.cleanup_manager:
return self.cleanup_manager.get_cleanup_stats()
else:
return {"error": "Log cleanup is not enabled"}
def manual_cleanup(self) -> dict:
"""Manually trigger log cleanup and return statistics"""
if self.cleanup_manager:
self.cleanup_manager.cleanup_old_logs()
return self.cleanup_manager.get_cleanup_stats()
else:
return {"error": "Log cleanup is not enabled"}
def shutdown(self):
"""Shutdown logging system"""
if not self.is_initialized:
return
logger = self.get_logger("doris_mcp_server.logger")
logger.info("Shutting down logging system...")
# Stop cleanup manager
if self.cleanup_manager:
self.cleanup_manager.stop_cleanup_scheduler()
# Close all handlers
root_logger = logging.getLogger()
for handler in root_logger.handlers[:]:
try:
handler.close()
except Exception as e:
print(f"Error closing handler: {e}", file=sys.stderr)  # keep stdout clean for the MCP stdio protocol
# Close audit logger handlers
audit_logger = logging.getLogger("audit")
for handler in audit_logger.handlers[:]:
try:
handler.close()
except Exception as e:
print(f"Error closing audit handler: {e}", file=sys.stderr)  # keep stdout clean for the MCP stdio protocol
self.is_initialized = False
# Global logger manager instance
_logger_manager = DorisLoggerManager()
def setup_logging(level: str = "INFO",
log_dir: str = "logs",
enable_console: bool = True,
enable_file: bool = True,
enable_audit: bool = True,
audit_file: Optional[str] = None,
max_file_size: int = 10*1024*1024,
backup_count: int = 5,
enable_cleanup: bool = True,
max_age_days: int = 30,
cleanup_interval_hours: int = 24) -> None:
"""
Setup logging configuration (convenience function).
Args:
level: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
log_dir: Directory for log files
enable_console: Enable console output
enable_file: Enable file logging
enable_audit: Enable audit logging
audit_file: Custom audit log file path
max_file_size: Maximum size per log file (bytes)
backup_count: Number of backup files to keep
enable_cleanup: Enable automatic log cleanup
max_age_days: Maximum age of log files in days (default: 30)
cleanup_interval_hours: Cleanup interval in hours (default: 24)
"""
_logger_manager.setup_logging(
level=level,
log_dir=log_dir,
enable_console=enable_console,
enable_file=enable_file,
enable_audit=enable_audit,
audit_file=audit_file,
max_file_size=max_file_size,
backup_count=backup_count,
enable_cleanup=enable_cleanup,
max_age_days=max_age_days,
cleanup_interval_hours=cleanup_interval_hours
)
def get_logger(name: str) -> logging.Logger:
@@ -93,9 +589,60 @@ def get_logger(name: str) -> logging.Logger:
Get a logger instance.
Args:
name: Logger name
name: Logger name (usually __name__)
Returns:
Logger instance
Configured logger instance
"""
return logging.getLogger(name)
return _logger_manager.get_logger(name)
def get_audit_logger() -> logging.Logger:
"""Get the audit logger"""
return _logger_manager.get_audit_logger()
def log_system_info():
"""Log system information for debugging"""
_logger_manager.log_system_info()
def get_cleanup_stats() -> dict:
"""Get log cleanup statistics"""
return _logger_manager.get_cleanup_stats()
def manual_cleanup() -> dict:
"""Manually trigger log cleanup and return statistics"""
return _logger_manager.manual_cleanup()
def shutdown_logging():
"""Shutdown logging system"""
_logger_manager.shutdown()
# Compatibility function for existing code
def setup_logging_old(level: str = "INFO",
log_file: str | None = None,
log_format: str | None = None) -> None:
"""
Legacy setup function for backward compatibility.
Args:
level: Logging level (DEBUG, INFO, WARNING, ERROR)
log_file: Optional log file path (deprecated - use log_dir instead)
log_format: Optional custom log format (deprecated)
"""
# Extract directory from log_file if provided
log_dir = "logs"
if log_file:
log_dir = str(Path(log_file).parent)
setup_logging(
level=level,
log_dir=log_dir,
enable_console=True,
enable_file=True,
enable_audit=True
)
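
One property of `LevelBasedFileHandler` is easy to miss: `emit` looks up the sub-handler by the record's exact `levelname`, so each per-level file receives only records of that level (for example, `error.log` does not also collect CRITICAL records). The sketch below reproduces that routing with in-memory streams instead of rotating files so it can run anywhere; the class name `LevelRouter` and the logger name are hypothetical, and the formatter is simplified.

```python
import io
import logging


class LevelRouter(logging.Handler):
    """Route each record to the stream that matches its exact level name."""

    LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")

    def __init__(self) -> None:
        super().__init__()
        # In-memory stand-ins for debug.log, info.log, ... in the real handler
        self.streams = {name: io.StringIO() for name in self.LEVELS}
        self.handlers = {}
        for name, stream in self.streams.items():
            handler = logging.StreamHandler(stream)
            handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
            self.handlers[name] = handler

    def emit(self, record: logging.LogRecord) -> None:
        handler = self.handlers.get(record.levelname)
        if handler is not None:
            handler.emit(record)


router = LevelRouter()
demo_logger = logging.getLogger("demo.level_router")
demo_logger.setLevel(logging.DEBUG)
demo_logger.propagate = False  # keep the demo output out of the root logger
demo_logger.addHandler(router)

demo_logger.info("server started")
demo_logger.error("query failed")
demo_logger.critical("connection pool exhausted")

print(router.streams["ERROR"].getvalue(), end="")
```

If the goal were "this level and above" per file, the lookup in `emit` would need to be replaced by a threshold comparison on `record.levelno`; the combined `doris_mcp_server_all.log` handler in the diff already covers the all-levels case.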

File diff suppressed because it is too large


@@ -34,6 +34,7 @@ from typing import Any, Dict
from decimal import Decimal
from .db import DorisConnectionManager, QueryResult
from .logger import get_logger
@dataclass
@@ -92,7 +93,7 @@ class QueryCache:
self.max_size = max_size
self.default_ttl = default_ttl
self.cache: dict[str, CachedQuery] = {}
self.logger = logging.getLogger(__name__)
self.logger = get_logger(__name__)
def _generate_cache_key(
self, sql: str, parameters: dict[str, Any] | None = None
@@ -194,7 +195,7 @@ class QueryOptimizer:
def __init__(self, config):
self.config = config
self.logger = logging.getLogger(__name__)
self.logger = get_logger(__name__)
self.optimization_rules = self._load_optimization_rules()
def _load_optimization_rules(self) -> list[dict[str, Any]]:
@@ -318,7 +319,7 @@ class DorisQueryExecutor:
def __init__(self, connection_manager: DorisConnectionManager, config=None):
self.connection_manager = connection_manager
self.config = config or self._create_default_config()
self.logger = logging.getLogger(__name__)
self.logger = get_logger(__name__)
# Initialize components
cache_config = getattr(self.config, 'performance', None)
@@ -587,7 +588,6 @@ class DorisQueryExecutor:
)
# Execute query with retry logic
try:
result = await self.execute_query(query_request, auth_context)
# Serialize data for JSON response
@@ -606,9 +606,11 @@ class DorisQueryExecutor:
}
}
except Exception as query_error:
except Exception as e:
error_msg = str(e)
error_str = error_msg.lower()
# Check if it's a connection-related error that we should retry
error_str = str(query_error).lower()
connection_errors = [
"at_eof", "connection", "closed", "nonetype",
"transport", "reader", "broken pipe", "connection reset"
@@ -618,7 +620,7 @@ class DorisQueryExecutor:
if is_connection_error and retry_count < max_retries:
retry_count += 1
self.logger.warning(f"Connection error detected, retrying ({retry_count}/{max_retries}): {query_error}")
self.logger.warning(f"Connection error detected, retrying ({retry_count}/{max_retries}): {e}")
# Release the problematic connection
try:
@@ -630,14 +632,7 @@ class DorisQueryExecutor:
await asyncio.sleep(0.5 * retry_count)
continue
else:
# Re-raise if not a connection error or max retries exceeded
raise query_error
except Exception as e:
error_msg = str(e)
# If we've exhausted retries or it's not a connection error, return error
if retry_count >= max_retries or "at_eof" not in error_msg.lower():
error_analysis = self._analyze_error(error_msg)
return {
@@ -651,21 +646,14 @@ class DorisQueryExecutor:
"retry_count": retry_count
}
}
else:
# Try one more time for connection errors
retry_count += 1
if retry_count <= max_retries:
self.logger.warning(f"Retrying query due to connection error ({retry_count}/{max_retries}): {e}")
await asyncio.sleep(0.5 * retry_count)
continue
else:
# This should never be reached, but just in case
return {
"success": False,
"error": f"Query failed after {max_retries} retries: {error_msg}",
"error": "Maximum retries exceeded",
"data": None,
"metadata": {
"query": sql,
"error_details": error_msg,
"retry_count": retry_count
}
}
@@ -759,7 +747,7 @@ class QueryPerformanceMonitor:
def __init__(self, query_executor: DorisQueryExecutor):
self.query_executor = query_executor
self.logger = logging.getLogger(__name__)
self.logger = get_logger(__name__)
self.performance_records = []
async def record_query_performance(
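
The hunks above collapse two nested `except` blocks into a single retry path: classify the error by substring, retry connection errors with a linearly growing delay, and give up otherwise. A minimal standalone sketch of that pattern follows. It is an illustration under assumptions, not the executor's code: `run_with_retry`, `is_connection_error`, and the default `max_retries=2` are hypothetical, the `any(...)` classification is assumed because that line falls outside the displayed hunks, and the sketch re-raises where the executor returns an error dictionary.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

# Substrings taken from the connection_errors list in the diff
CONNECTION_ERROR_MARKERS = (
    "at_eof", "connection", "closed", "nonetype",
    "transport", "reader", "broken pipe", "connection reset",
)


def is_connection_error(exc: Exception) -> bool:
    """Assumed classification: any marker appears in the lowercased message."""
    message = str(exc).lower()
    return any(marker in message for marker in CONNECTION_ERROR_MARKERS)


async def run_with_retry(
    operation: Callable[[], Awaitable[T]],
    max_retries: int = 2,      # hypothetical default
    base_delay: float = 0.5,   # matches the 0.5 * retry_count sleep in the diff
) -> T:
    retry_count = 0
    while True:
        try:
            return await operation()
        except Exception as exc:
            if is_connection_error(exc) and retry_count < max_retries:
                retry_count += 1
                # Linear backoff: 0.5s, 1.0s, ... with the default base_delay
                await asyncio.sleep(base_delay * retry_count)
                continue
            # Not a connection error, or retries exhausted
            raise
```

Substring matching is broad (a message containing "reader" or "closed" for unrelated reasons would also be retried), so a production version might prefer matching on exception types where the driver exposes them.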


@@ -31,7 +31,7 @@ from dotenv import load_dotenv
from datetime import datetime, timedelta
# Import unified logging configuration
from doris_mcp_server.utils.logger import get_logger
from .logger import get_logger
# Configure logging
logger = get_logger(__name__)
@@ -1215,33 +1215,39 @@ class MetadataExtractor:
try:
if self.connection_manager:
import asyncio
# Try to run the async query
try:
# Check if there's a running event loop
loop = asyncio.get_running_loop()
# If we're in an async context, we need to run in a separate thread
import concurrent.futures
import threading
# Always run in a separate thread with new event loop to avoid conflicts
def run_in_new_loop():
# Create new event loop for this thread
new_loop = asyncio.new_event_loop()
asyncio.set_event_loop(new_loop)
try:
return new_loop.run_until_complete(
self._execute_query_async(query, db_name, return_dataframe)
)
finally:
try:
# Properly close the loop
pending = asyncio.all_tasks(new_loop)
if pending:
new_loop.run_until_complete(asyncio.gather(*pending, return_exceptions=True))
finally:
new_loop.close()
with concurrent.futures.ThreadPoolExecutor() as executor:
# Use ThreadPoolExecutor to run in separate thread
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
future = executor.submit(run_in_new_loop)
try:
return future.result(timeout=30)
except RuntimeError:
# No running loop, we can safely create one
return asyncio.run(
self._execute_query_async(query, db_name, return_dataframe)
)
except concurrent.futures.TimeoutError:
logger.error("Query execution timed out after 30 seconds")
if return_dataframe:
import pandas as pd
return pd.DataFrame()
else:
return []
else:
# Fallback: Return empty result
logger.warning("No connection manager provided, returning empty result")
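The pattern in this hunk (run the coroutine on a fresh event loop in a worker thread, and give up after a timeout) can be sketched in isolation as follows. This is a minimal sketch under stated assumptions, not the project's code: `fake_query` and `run_coroutine_in_thread` are hypothetical names, and `fake_query` merely stands in for `_execute_query_async`.

```python
import asyncio
import concurrent.futures


async def fake_query(delay: float) -> list:
    """Hypothetical stand-in for an async database call."""
    await asyncio.sleep(delay)
    return [{"ok": 1}]


def run_coroutine_in_thread(coro_factory, timeout: float = 30.0):
    """Run a coroutine on its own event loop in a worker thread.

    Returns the coroutine's result, or None if it does not finish in time.
    """
    def run_in_new_loop():
        # asyncio.run creates a new loop, runs the coroutine, then closes the loop
        return asyncio.run(coro_factory())

    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(run_in_new_loop)
        try:
            return future.result(timeout=timeout)
        except concurrent.futures.TimeoutError:
            return None
    finally:
        # Do not block the caller on a worker that is still running
        executor.shutdown(wait=False)


print(run_coroutine_in_thread(lambda: fake_query(0.01)))
```

One design point worth checking against the hunk: leaving a `with ThreadPoolExecutor(...)` block calls `shutdown(wait=True)`, so the caller still waits for the worker even after `future.result(timeout=...)` has raised. The sketch uses `shutdown(wait=False)` to return promptly; the worker thread itself keeps running until the coroutine ends.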


@@ -31,6 +31,8 @@ import sqlparse
from sqlparse.sql import Statement
from sqlparse.tokens import Keyword, Name
+ from .logger import get_logger
class SecurityLevel(Enum):
"""Security level enumeration"""
@@ -86,7 +88,7 @@ class DorisSecurityManager:
def __init__(self, config):
self.config = config
- self.logger = logging.getLogger(__name__)
+ self.logger = get_logger(__name__)
# Initialize security components
self.auth_provider = AuthenticationProvider(config)
@@ -211,7 +213,7 @@ class AuthenticationProvider:
def __init__(self, config):
self.config = config
- self.logger = logging.getLogger(__name__)
+ self.logger = get_logger(__name__)
self.session_cache = {}
async def authenticate(self, auth_info: dict[str, Any]) -> AuthContext:
@@ -321,7 +323,7 @@ class AuthorizationProvider:
def __init__(self, config):
self.config = config
- self.logger = logging.getLogger(__name__)
+ self.logger = get_logger(__name__)
self.permission_cache = {}
# Load sensitive tables configuration
@@ -464,7 +466,7 @@ class SQLSecurityValidator:
def __init__(self, config):
self.config = config
- self.logger = logging.getLogger(__name__)
+ self.logger = get_logger(__name__)
# Handle DorisConfig object or dictionary configuration
if hasattr(config, 'get'):
@@ -686,7 +688,7 @@ class DataMaskingProcessor:
def __init__(self, config):
self.config = config
- self.logger = logging.getLogger(__name__)
+ self.logger = get_logger(__name__)
self.masking_algorithms = self._init_masking_algorithms()
self.masking_rules = self._load_masking_rules()


@@ -0,0 +1,783 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
"""
Security Analytics Tools Module
Provides data access analysis, user behavior monitoring, and security insights
"""
import time
from datetime import datetime, timedelta
from typing import Any, Dict, List, Optional
from collections import Counter, defaultdict
from .db import DorisConnectionManager
from .logger import get_logger
logger = get_logger(__name__)
class SecurityAnalyticsTools:
"""Security analytics tools for access pattern analysis and user monitoring"""
def __init__(self, connection_manager: DorisConnectionManager):
self.connection_manager = connection_manager
logger.info("SecurityAnalyticsTools initialized")
async def analyze_data_access_patterns(
self,
days: int = 7,
include_system_users: bool = False,
min_query_threshold: int = 5
) -> Dict[str, Any]:
"""
Analyze data access patterns for users and roles
Args:
days: Number of days to analyze
include_system_users: Whether to include system/service users
min_query_threshold: Minimum queries for a user to be included in analysis
Returns:
Comprehensive access pattern analysis
"""
try:
start_time = time.time()
# 🚀 PROGRESS: Initialize security analysis
logger.info("=" * 70)
logger.info("🔒 Starting Data Access Pattern Analysis")
logger.info(f"📅 Analysis period: {days} days")
logger.info(f"👥 Include system users: {include_system_users}")
logger.info(f"🎯 Min query threshold: {min_query_threshold}")
logger.info("=" * 70)
connection = await self.connection_manager.get_connection("query")
# Define analysis period
end_date = datetime.now()
start_date = end_date - timedelta(days=days)
logger.info(f"📊 Period: {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}")
# 🚀 PROGRESS: Step 1 - Get audit log data
logger.info("📋 Step 1/5: Retrieving audit log data...")
audit_start = time.time()
audit_data = await self._get_audit_log_data(connection, start_date, end_date, include_system_users)
audit_time = time.time() - audit_start
if not audit_data:
logger.warning("⚠️ No audit data available for the specified period")
return {
"error": "No audit data available for the specified period",
"analysis_period": {
"start_date": start_date.isoformat(),
"end_date": end_date.isoformat(),
"days": days
}
}
logger.info(f"✅ Retrieved {len(audit_data)} audit records in {audit_time:.2f}s")
# 🚀 PROGRESS: Step 2 - Analyze user access patterns
logger.info("👤 Step 2/5: Analyzing user access patterns...")
user_start = time.time()
user_access_analysis = await self._analyze_user_access_patterns(
audit_data, min_query_threshold
)
user_time = time.time() - user_start
logger.info(f"✅ Analyzed {len(user_access_analysis)} users in {user_time:.2f}s")
# 🚀 PROGRESS: Step 3 - Analyze role-based access
logger.info("🎭 Step 3/5: Analyzing role-based access patterns...")
role_start = time.time()
role_access_analysis = await self._analyze_role_access_patterns(
connection, user_access_analysis
)
role_time = time.time() - role_start
logger.info(f"✅ Role analysis completed in {role_time:.2f}s")
# 🚀 PROGRESS: Step 4 - Detect security anomalies
logger.info("🚨 Step 4/5: Detecting security anomalies...")
anomaly_start = time.time()
security_alerts = await self._detect_security_anomalies(
audit_data, user_access_analysis
)
anomaly_time = time.time() - anomaly_start
logger.info(f"✅ Found {len(security_alerts)} security alerts in {anomaly_time:.2f}s")
# Log alert summary
if security_alerts:
high_alerts = sum(1 for alert in security_alerts if alert.get("severity") == "high")
medium_alerts = sum(1 for alert in security_alerts if alert.get("severity") == "medium")
logger.info(f"🚨 Alert breakdown: {high_alerts} high, {medium_alerts} medium")
# 🚀 PROGRESS: Step 5 - Generate access insights
logger.info("💡 Step 5/5: Generating access insights...")
insights_start = time.time()
access_insights = await self._generate_access_insights(
user_access_analysis, role_access_analysis
)
insights_time = time.time() - insights_start
logger.info(f"✅ Access insights generated in {insights_time:.2f}s")
execution_time = time.time() - start_time
return {
"analysis_period": {
"start_date": start_date.isoformat(),
"end_date": end_date.isoformat(),
"days": days
},
"analysis_timestamp": datetime.now().isoformat(),
"execution_time_seconds": round(execution_time, 3),
"user_access_summary": self._generate_user_access_summary(user_access_analysis),
"user_access_details": user_access_analysis,
"role_analysis": role_access_analysis,
"security_alerts": security_alerts,
"access_insights": access_insights,
"recommendations": self._generate_security_recommendations(security_alerts, access_insights)
}
except Exception as e:
logger.error(f"Data access pattern analysis failed: {str(e)}")
return {
"error": str(e),
"analysis_timestamp": datetime.now().isoformat()
}
# ==================== Private Helper Methods ====================
async def _get_audit_log_data(self, connection, start_date: datetime, end_date: datetime, include_system_users: bool) -> List[Dict]:
"""Retrieve audit log data for the specified period"""
try:
# System users filter
system_user_filter = ""
if not include_system_users:
system_users = ['root', 'admin', 'system', 'doris', 'information_schema']
user_list = ','.join([f'"{user}"' for user in system_users])
system_user_filter = f"AND `user` NOT IN ({user_list})"
audit_sql = f"""
SELECT
`user` as user_name,
`client_ip` as host,
`time` as query_time,
`stmt` as sql_statement,
`state` as query_status,
`scan_bytes` as scan_bytes,
`scan_rows` as scan_rows,
`return_rows` as return_rows,
`query_time` as execution_time_ms
FROM internal.__internal_schema.audit_log
WHERE `time` >= '{start_date.strftime('%Y-%m-%d %H:%M:%S')}'
AND `time` <= '{end_date.strftime('%Y-%m-%d %H:%M:%S')}'
AND `stmt` IS NOT NULL
AND `stmt` != ''
{system_user_filter}
ORDER BY `time` DESC
LIMIT 10000
"""
result = await connection.execute(audit_sql)
return result.data if result.data else []
except Exception as e:
logger.warning(f"Failed to get audit log data: {str(e)}")
# Try alternative method without detailed metrics
try:
simple_audit_sql = f"""
SELECT
`user` as user_name,
`client_ip` as host,
`time` as query_time,
`stmt` as sql_statement,
`state` as query_status
FROM internal.__internal_schema.audit_log
WHERE `time` >= '{start_date.strftime('%Y-%m-%d %H:%M:%S')}'
AND `time` <= '{end_date.strftime('%Y-%m-%d %H:%M:%S')}'
AND `stmt` IS NOT NULL
{system_user_filter}
ORDER BY `time` DESC
LIMIT 10000
"""
result = await connection.execute(simple_audit_sql)
return result.data if result.data else []
except Exception as e2:
logger.error(f"Failed to get simplified audit log data: {str(e2)}")
return []
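The way `_get_audit_log_data` assembles its `WHERE` clause can be illustrated with two small pure functions. This is a hedged sketch: `build_system_user_filter` and `build_time_window` are hypothetical helpers that do not exist in the module, and the quote doubling is for illustration only; where the driver supports parameter binding, that is usually the safer route.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def build_system_user_filter(system_users: Iterable[str], include_system_users: bool) -> str:
    """Return an `AND ... NOT IN (...)` fragment, or '' when no filter applies."""
    users = list(system_users)
    if include_system_users or not users:
        return ""
    # Double any single quote so a name cannot terminate the literal early
    quoted = ", ".join("'" + u.replace("'", "''") + "'" for u in users)
    return f"AND `user` NOT IN ({quoted})"


def build_time_window(days: int, now: datetime) -> Tuple[str, str]:
    """Return (start, end) strings in the format used by the audit query."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return (now - timedelta(days=days)).strftime(fmt), now.strftime(fmt)


print(build_system_user_filter(["root", "admin"], include_system_users=False))
print(build_time_window(7, datetime(2025, 7, 14, 12, 0, 0)))
```

Keeping the string assembly in pure functions like these makes the generated SQL fragments easy to unit test without a Doris connection.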
async def _analyze_user_access_patterns(self, audit_data: List[Dict], min_query_threshold: int) -> List[Dict]:
"""Analyze access patterns for individual users"""
user_stats = defaultdict(lambda: {
"total_queries": 0,
"unique_tables_accessed": set(),
"hosts": set(),
"query_types": Counter(),
"query_times": [],
"failed_queries": 0,
"data_volume_read_bytes": 0,
"data_volume_read_rows": 0,
"hourly_pattern": [0] * 24,
"daily_pattern": [0] * 7,
"query_statements": []
})
# Process audit data
for entry in audit_data:
user_name = entry.get("user_name", "unknown")
query_time = entry.get("query_time")
sql_statement = entry.get("sql_statement", "")
query_status = entry.get("query_status", "")
stats = user_stats[user_name]
stats["total_queries"] += 1
# Extract table names from SQL
tables = self._extract_table_names_from_sql(sql_statement)
stats["unique_tables_accessed"].update(tables)
# Host tracking
if entry.get("host"):
stats["hosts"].add(entry["host"])
# Query type analysis
query_type = self._classify_query_type(sql_statement)
stats["query_types"][query_type] += 1
# Query time patterns
if query_time:
try:
if isinstance(query_time, str):
query_dt = datetime.fromisoformat(query_time.replace('Z', '+00:00'))
else:
query_dt = query_time
stats["query_times"].append(query_dt)
stats["hourly_pattern"][query_dt.hour] += 1
stats["daily_pattern"][query_dt.weekday()] += 1
except Exception:
pass
# Error tracking (the audit log reports failed statements with a state such as 'ERR')
if query_status and "err" in query_status.lower():
stats["failed_queries"] += 1
# Data volume tracking
if entry.get("scan_bytes"):
try:
stats["data_volume_read_bytes"] += int(entry["scan_bytes"])
except (ValueError, TypeError):
pass
if entry.get("scan_rows"):
try:
stats["data_volume_read_rows"] += int(entry["scan_rows"])
except (ValueError, TypeError):
pass
# Store sample queries
if len(stats["query_statements"]) < 10:
stats["query_statements"].append({
"sql": sql_statement[:200] + "..." if len(sql_statement) > 200 else sql_statement,
"timestamp": str(query_time),
"type": query_type
})
# Convert to analysis results
user_analysis = []
for user_name, stats in user_stats.items():
if stats["total_queries"] >= min_query_threshold:
# Calculate patterns and insights
access_pattern = self._classify_access_pattern(stats["hourly_pattern"])
table_access_frequency = dict(Counter(
table for entry in audit_data
if entry.get("user_name") == user_name
for table in self._extract_table_names_from_sql(entry.get("sql_statement", ""))
).most_common(10))
user_analysis.append({
"user_name": user_name,
"access_stats": {
"total_queries": stats["total_queries"],
"unique_tables_accessed": len(stats["unique_tables_accessed"]),
"unique_hosts": len(stats["hosts"]),
"data_volume_read_gb": round(stats["data_volume_read_bytes"] / (1024**3), 3),
"data_volume_read_rows": stats["data_volume_read_rows"],
"failed_queries": stats["failed_queries"],
"success_rate": round((stats["total_queries"] - stats["failed_queries"]) / stats["total_queries"], 3) if stats["total_queries"] > 0 else 0,
"peak_access_hour": stats["hourly_pattern"].index(max(stats["hourly_pattern"])) if max(stats["hourly_pattern"]) > 0 else None,
"access_pattern": access_pattern
},
"query_type_distribution": dict(stats["query_types"]),
"table_access_frequency": table_access_frequency,
"hosts_used": list(stats["hosts"]),
"sample_queries": stats["query_statements"],
"temporal_patterns": {
"hourly_distribution": stats["hourly_pattern"],
"daily_distribution": stats["daily_pattern"]
}
})
return sorted(user_analysis, key=lambda x: x["access_stats"]["total_queries"], reverse=True)
def _extract_table_names_from_sql(self, sql: str) -> List[str]:
"""Extract table names from SQL statement (simplified implementation)"""
if not sql:
return []
import re
# Simple regex patterns to match table names
patterns = [
r'\bFROM\s+([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*)',
r'\bJOIN\s+([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*)',
r'\bINTO\s+([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*)',
r'\bUPDATE\s+([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*)',
r'\bDELETE\s+FROM\s+([a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*)'
]
tables = []
for pattern in patterns:
matches = re.findall(pattern, sql, re.IGNORECASE)
tables.extend(matches)
# Clean up table names (remove quotes, aliases, etc.)
cleaned_tables = []
for table in tables:
# Remove backticks, quotes, and get just the table name
clean_table = table.strip('`"\'').split(' ')[0]
if clean_table and clean_table.upper() not in ['SELECT', 'WHERE', 'AND', 'OR']:
cleaned_tables.append(clean_table)
return list(set(cleaned_tables))
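The regex approach above can be tried out in isolation. The following is a minimal sketch, assuming the same identifier pattern as the method; `extract_table_names` is a hypothetical standalone function, not the module's API.

```python
import re
from typing import List

# Same identifier shape as the patterns above: name, optionally dotted
_IDENT = r"[a-zA-Z_][a-zA-Z0-9_]*(?:\.[a-zA-Z_][a-zA-Z0-9_]*)*"
# DELETE FROM is already covered by the FROM alternative
_TABLE_RE = re.compile(rf"\b(?:FROM|JOIN|INTO|UPDATE)\s+({_IDENT})", re.IGNORECASE)


def extract_table_names(sql: str) -> List[str]:
    """Return the distinct table-like identifiers found in a SQL string."""
    if not sql:
        return []
    return sorted(set(_TABLE_RE.findall(sql)))


print(extract_table_names(
    "SELECT * FROM ssb.lineorder l JOIN ssb.customer c ON l.k = c.k"
))
```

Two limitations follow from the pattern itself: a backtick-quoted name such as `` `t` `` is never matched, because the identifier must start with a letter or underscore, and `EXTRACT(YEAR FROM d)` reports `d` as a table. A SQL parser such as `sqlparse`, which this commit already imports elsewhere, may be a more robust basis if these cases matter.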
def _classify_query_type(self, sql: str) -> str:
"""Classify SQL query type"""
if not sql:
return "unknown"
sql_upper = sql.upper().strip()
if sql_upper.startswith('SELECT'):
return "SELECT"
elif sql_upper.startswith('INSERT'):
return "INSERT"
elif sql_upper.startswith('UPDATE'):
return "UPDATE"
elif sql_upper.startswith('DELETE'):
return "DELETE"
elif sql_upper.startswith('CREATE'):
return "CREATE"
elif sql_upper.startswith('ALTER'):
return "ALTER"
elif sql_upper.startswith('DROP'):
return "DROP"
elif sql_upper.startswith('SHOW'):
return "SHOW"
elif sql_upper.startswith('DESC'):  # also matches DESCRIBE
return "DESCRIBE"
else:
return "OTHER"
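The prefix-based classification can also be written as a lookup table, which makes the behavior on edge cases easy to see. This is a sketch only; `classify_query_type` and `_QUERY_PREFIXES` are hypothetical standalone names.

```python
from typing import Tuple

# (statement prefix, reported type); DESC also covers DESCRIBE
_QUERY_PREFIXES: Tuple[Tuple[str, str], ...] = (
    ("SELECT", "SELECT"),
    ("INSERT", "INSERT"),
    ("UPDATE", "UPDATE"),
    ("DELETE", "DELETE"),
    ("CREATE", "CREATE"),
    ("ALTER", "ALTER"),
    ("DROP", "DROP"),
    ("SHOW", "SHOW"),
    ("DESC", "DESCRIBE"),
)


def classify_query_type(sql: str) -> str:
    """Classify a statement by its leading keyword."""
    if not sql:
        return "unknown"
    head = sql.strip().upper()
    for prefix, label in _QUERY_PREFIXES:
        if head.startswith(prefix):
            return label
    return "OTHER"


print(classify_query_type("WITH x AS (SELECT 1) SELECT * FROM x"))
```

Because only the first keyword is inspected, statements that begin with a `WITH` clause or a leading comment are reported as `OTHER` even when they are reads.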
def _classify_access_pattern(self, hourly_pattern: List[int]) -> str:
"""Classify user access pattern based on hourly distribution"""
if not hourly_pattern or max(hourly_pattern) == 0:
return "no_pattern"
# Find peak hours
max_queries = max(hourly_pattern)
peak_hours = [i for i, count in enumerate(hourly_pattern) if count == max_queries]
# Business hours: 9-17
business_hours = set(range(9, 18))
peak_in_business_hours = any(hour in business_hours for hour in peak_hours)
# Night hours: 22-6
night_hours = set(list(range(22, 24)) + list(range(0, 7)))
peak_in_night_hours = any(hour in night_hours for hour in peak_hours)
if peak_in_business_hours and not peak_in_night_hours:
return "regular_business_hours"
elif peak_in_night_hours:
return "night_shift_or_batch"
elif len(peak_hours) > 6: # Distributed throughout day
return "distributed_access"
else:
return "irregular_pattern"
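A standalone version of this classifier helps show which branch wins for a given distribution. The sketch below mirrors the logic above under the same hour ranges; `classify_access_pattern`, `BUSINESS_HOURS`, and `NIGHT_HOURS` are hypothetical module-level names.

```python
from typing import List

BUSINESS_HOURS = set(range(9, 18))                    # 09:00 to 17:59
NIGHT_HOURS = set(range(22, 24)) | set(range(0, 7))   # 22:00 to 06:59


def classify_access_pattern(hourly_pattern: List[int]) -> str:
    """Classify a 24-slot hourly query histogram by where its peaks fall."""
    if not hourly_pattern or max(hourly_pattern) == 0:
        return "no_pattern"
    peak = max(hourly_pattern)
    peak_hours = [h for h, count in enumerate(hourly_pattern) if count == peak]
    in_business = any(h in BUSINESS_HOURS for h in peak_hours)
    in_night = any(h in NIGHT_HOURS for h in peak_hours)
    if in_business and not in_night:
        return "regular_business_hours"
    elif in_night:
        return "night_shift_or_batch"
    elif len(peak_hours) > 6:
        return "distributed_access"
    return "irregular_pattern"


office = [0] * 24
office[10] = 5
print(classify_access_pattern(office))     # peak at 10:00
print(classify_access_pattern([1] * 24))   # perfectly even load
```

Note what the branch order implies: the third branch is only reached when no peak falls in business or night hours, which leaves the six hours 7, 8, 18, 19, 20, and 21. With at most six candidate peak hours, `len(peak_hours) > 6` cannot be true, so `distributed_access` appears to be unreachable as written, and an evenly spread load is reported as `night_shift_or_batch`.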
async def _analyze_role_access_patterns(self, connection, user_access_analysis: List[Dict]) -> Dict[str, Any]:
"""Analyze access patterns by role"""
try:
# Get user roles information
user_roles = await self._get_user_roles(connection)
# Group users by roles
role_stats = defaultdict(lambda: {
"user_count": 0,
"total_queries": 0,
"unique_tables": set(),
"query_types": Counter(),
"avg_queries_per_user": 0,
"users": []
})
# Process user access data
for user_data in user_access_analysis:
user_name = user_data["user_name"]
user_stats = user_data["access_stats"]
query_types = user_data["query_type_distribution"]
# Get user roles (default to 'unknown' if not found)
roles = user_roles.get(user_name, ["unknown"])
for role in roles:
stats = role_stats[role]
stats["user_count"] += 1
stats["total_queries"] += user_stats["total_queries"]
stats["users"].append(user_name)
# Aggregate query types
for query_type, count in query_types.items():
stats["query_types"][query_type] += count
# Calculate role analysis
role_analysis = {}
for role, stats in role_stats.items():
if stats["user_count"] > 0:
avg_queries = stats["total_queries"] / stats["user_count"]
# Calculate privilege usage (simplified)
total_role_queries = sum(stats["query_types"].values())
privilege_usage = {}
if total_role_queries > 0:
privilege_usage = {
query_type: round(count / total_role_queries, 3)
for query_type, count in stats["query_types"].items()
}
role_analysis[role] = {
"user_count": stats["user_count"],
"users": stats["users"],
"total_queries": stats["total_queries"],
"avg_queries_per_user": round(avg_queries, 1),
"query_type_distribution": dict(stats["query_types"]),
"privilege_usage": privilege_usage,
"activity_level": self._classify_role_activity_level(avg_queries)
}
return role_analysis
except Exception as e:
logger.warning(f"Failed to analyze role access patterns: {str(e)}")
return {}
async def _get_user_roles(self, connection) -> Dict[str, List[str]]:
"""Get user roles mapping"""
try:
# Try to get user role information
roles_sql = """
SELECT
User as user_name,
COALESCE(Default_role, 'default') as role_name
FROM mysql.user
"""
result = await connection.execute(roles_sql)
user_roles = defaultdict(list)
if result.data:
for row in result.data:
user_name = row.get("user_name", "")
role_name = row.get("role_name", "default")
if user_name:
user_roles[user_name].append(role_name)
return dict(user_roles)
except Exception as e:
logger.warning(f"Failed to get user roles: {str(e)}")
return {}
def _classify_role_activity_level(self, avg_queries: float) -> str:
"""Classify role activity level based on average queries"""
if avg_queries > 100:
return "high"
elif avg_queries > 20:
return "medium"
elif avg_queries > 5:
return "low"
else:
return "minimal"
async def _detect_security_anomalies(self, audit_data: List[Dict], user_access_analysis: List[Dict]) -> List[Dict]:
"""Detect potential security anomalies"""
alerts = []
# 1. Detect unusual access times
for user_data in user_access_analysis:
user_name = user_data["user_name"]
hourly_pattern = user_data["temporal_patterns"]["hourly_distribution"]
# Check for significant night-time activity
night_queries = sum(hourly_pattern[22:24]) + sum(hourly_pattern[0:6])
total_queries = sum(hourly_pattern)
if total_queries > 0 and night_queries / total_queries > 0.3: # >30% night activity
alerts.append({
"alert_type": "unusual_access_time",
"severity": "medium",
"user": user_name,
"description": f"User {user_name} has {night_queries/total_queries:.1%} of queries during night hours",
"night_query_percentage": round(night_queries/total_queries, 3),
"timestamp": datetime.now().isoformat()
})
# 2. Detect users with high failure rates
for user_data in user_access_analysis:
user_name = user_data["user_name"]
success_rate = user_data["access_stats"]["success_rate"]
total_queries = user_data["access_stats"]["total_queries"]
if total_queries > 10 and success_rate < 0.8: # <80% success rate
alerts.append({
"alert_type": "high_failure_rate",
"severity": "medium",
"user": user_name,
"description": f"User {user_name} has low query success rate ({success_rate:.1%})",
"success_rate": success_rate,
"total_queries": total_queries,
"timestamp": datetime.now().isoformat()
})
# 3. Detect unusual data volume access
data_volumes = [user["access_stats"]["data_volume_read_gb"] for user in user_access_analysis]
if data_volumes:
avg_volume = sum(data_volumes) / len(data_volumes)
std_dev = (sum((x - avg_volume) ** 2 for x in data_volumes) / len(data_volumes)) ** 0.5
threshold = avg_volume + 2 * std_dev # 2 standard deviations above mean
for user_data in user_access_analysis:
user_name = user_data["user_name"]
volume = user_data["access_stats"]["data_volume_read_gb"]
if volume > threshold and volume > 1.0: # >1GB and above threshold
alerts.append({
"alert_type": "unusual_data_volume",
"severity": "high" if volume > threshold * 2 else "medium",
"user": user_name,
"description": f"User {user_name} read {volume:.2f}GB (threshold: {threshold:.2f}GB)",
"data_volume_gb": volume,
"threshold_gb": round(threshold, 2),
"timestamp": datetime.now().isoformat()
})
# 4. Detect users accessing many different tables
for user_data in user_access_analysis:
user_name = user_data["user_name"]
unique_tables = user_data["access_stats"]["unique_tables_accessed"]
total_queries = user_data["access_stats"]["total_queries"]
# High table diversity might indicate privilege escalation or data mining
if unique_tables > 20 and total_queries > 50:
alerts.append({
"alert_type": "broad_table_access",
"severity": "medium",
"user": user_name,
"description": f"User {user_name} accessed {unique_tables} different tables",
"unique_tables_count": unique_tables,
"total_queries": total_queries,
"timestamp": datetime.now().isoformat()
})
return sorted(alerts, key=lambda x: {"high": 3, "medium": 2, "low": 1}.get(x["severity"], 0), reverse=True)
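The data-volume check in step 3 flags a user whose volume exceeds the mean by more than two population standard deviations. A small worked sketch follows; `volume_threshold` and `flag_outliers` are hypothetical names, and the user names and volumes are made up for illustration.

```python
from typing import Dict, List


def volume_threshold(volumes: List[float], k: float = 2.0) -> float:
    """Mean plus k population standard deviations."""
    mean = sum(volumes) / len(volumes)
    std = (sum((v - mean) ** 2 for v in volumes) / len(volumes)) ** 0.5
    return mean + k * std


def flag_outliers(volumes_gb: Dict[str, float], min_gb: float = 1.0) -> List[str]:
    """Return users whose read volume is above both the threshold and min_gb."""
    if not volumes_gb:
        return []
    threshold = volume_threshold(list(volumes_gb.values()))
    return sorted(u for u, v in volumes_gb.items() if v > threshold and v > min_gb)


small_team = {"a": 1.0, "b": 1.0, "c": 1.0, "heavy": 1000.0}
larger_team = {f"u{i}": 1.0 for i in range(6)}
larger_team["heavy"] = 1000.0

print(flag_outliers(small_team))    # four users
print(flag_outliers(larger_team))   # seven users
```

The sample size matters here. With a population standard deviation, no value can lie more than sqrt(n - 1) standard deviations above the mean (Samuelson's inequality), so the strict two-sigma test can only fire when at least six users are analyzed. In the four-user case above, even a 1000 GB reader is not flagged.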
async def _generate_access_insights(self, user_access_analysis: List[Dict], role_analysis: Dict[str, Any]) -> Dict[str, Any]:
"""Generate access insights and patterns"""
insights = {
"user_behavior_patterns": {},
"role_effectiveness": {},
"security_posture": {}
}
# User behavior patterns
if user_access_analysis:
total_users = len(user_access_analysis)
active_users = len([u for u in user_access_analysis if u["access_stats"]["total_queries"] > 10])
power_users = len([u for u in user_access_analysis if u["access_stats"]["total_queries"] > 100])
# Access pattern distribution
pattern_distribution = Counter(
user["access_stats"]["access_pattern"] for user in user_access_analysis
)
insights["user_behavior_patterns"] = {
"total_users_analyzed": total_users,
"active_users": active_users,
"power_users": power_users,
"access_pattern_distribution": dict(pattern_distribution),
"avg_queries_per_user": round(
sum(u["access_stats"]["total_queries"] for u in user_access_analysis) / total_users, 1
) if total_users > 0 else 0
}
# Role effectiveness
if role_analysis:
most_active_role = max(role_analysis.items(), key=lambda x: x[1]["total_queries"])
least_active_role = min(role_analysis.items(), key=lambda x: x[1]["total_queries"])
insights["role_effectiveness"] = {
"total_roles": len(role_analysis),
"most_active_role": {
"role": most_active_role[0],
"total_queries": most_active_role[1]["total_queries"],
"user_count": most_active_role[1]["user_count"]
},
"least_active_role": {
"role": least_active_role[0],
"total_queries": least_active_role[1]["total_queries"],
"user_count": least_active_role[1]["user_count"]
},
"avg_users_per_role": round(
sum(role_info["user_count"] for role_info in role_analysis.values()) / len(role_analysis), 1
)
}
# Security posture assessment
if user_access_analysis:
users_with_failures = len([u for u in user_access_analysis if u["access_stats"]["failed_queries"] > 0])
users_night_access = len([
u for u in user_access_analysis
if any(u["temporal_patterns"]["hourly_distribution"][hour] > 0 for hour in list(range(22, 24)) + list(range(0, 6)))
])
insights["security_posture"] = {
"users_with_query_failures": users_with_failures,
"users_with_night_access": users_night_access,
"security_score": self._calculate_security_score(user_access_analysis),
"risk_level": self._assess_overall_risk_level(user_access_analysis)
}
return insights
def _calculate_security_score(self, user_access_analysis: List[Dict]) -> float:
"""Calculate overall security score (0-1, higher is better)"""
if not user_access_analysis:
return 0.0
total_users = len(user_access_analysis)
# Factors that contribute to security score
users_with_high_success_rate = len([u for u in user_access_analysis if u["access_stats"]["success_rate"] > 0.9])
users_with_normal_patterns = len([u for u in user_access_analysis if u["access_stats"]["access_pattern"] == "regular_business_hours"])
success_rate_score = users_with_high_success_rate / total_users
pattern_score = users_with_normal_patterns / total_users
# Combined score
overall_score = (success_rate_score * 0.6 + pattern_score * 0.4)
return round(overall_score, 3)
def _assess_overall_risk_level(self, user_access_analysis: List[Dict]) -> str:
"""Assess overall security risk level"""
security_score = self._calculate_security_score(user_access_analysis)
if security_score > 0.8:
return "low"
elif security_score > 0.6:
return "medium"
else:
return "high"
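The score and risk mapping above can be worked through with concrete numbers. This is a minimal sketch that reproduces the 0.6 / 0.4 weighting and the thresholds shown; `security_score` and `risk_level` are hypothetical standalone functions, and the inputs are invented.

```python
from typing import List


def security_score(success_rates: List[float], patterns: List[str]) -> float:
    """Weighted share of users with >90% success and business-hours access."""
    n = len(success_rates)
    if n == 0:
        return 0.0
    high_success = sum(1 for r in success_rates if r > 0.9) / n
    normal_pattern = sum(1 for p in patterns if p == "regular_business_hours") / n
    return round(high_success * 0.6 + normal_pattern * 0.4, 3)


def risk_level(score: float) -> str:
    """Map a score to a risk label using the same cut-offs as above."""
    if score > 0.8:
        return "low"
    elif score > 0.6:
        return "medium"
    return "high"


# Two batch accounts with near-perfect success rates
batch_score = security_score([0.99, 0.95], ["night_shift_or_batch"] * 2)
print(batch_score, risk_level(batch_score))
```

One consequence of the weighting: if every user is a well-behaved batch or night-shift account, the pattern term is 0 and the score tops out at 0.6, which the strict `> 0.6` comparison maps to `high`. Deployments dominated by scheduled jobs may therefore want to read the risk level with that in mind.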
def _generate_user_access_summary(self, user_access_analysis: List[Dict]) -> Dict[str, Any]:
"""Generate summary statistics for user access"""
if not user_access_analysis:
return {
"total_users": 0,
"active_users": 0,
"high_activity_users": 0,
"dormant_users": 0
}
total_users = len(user_access_analysis)
active_users = len([u for u in user_access_analysis if u["access_stats"]["total_queries"] > 10])
high_activity_users = len([u for u in user_access_analysis if u["access_stats"]["total_queries"] > 100])
dormant_users = total_users - active_users
return {
"total_users": total_users,
"active_users": active_users,
"high_activity_users": high_activity_users,
"dormant_users": dormant_users,
"activity_distribution": {
"high": high_activity_users,
"medium": active_users - high_activity_users,
"low": dormant_users
}
}
def _generate_security_recommendations(self, security_alerts: List[Dict], access_insights: Dict[str, Any]) -> List[Dict]:
"""Generate security recommendations based on analysis"""
recommendations = []
# Recommendations based on alerts
if security_alerts:
high_severity_alerts = [alert for alert in security_alerts if alert["severity"] == "high"]
if high_severity_alerts:
recommendations.append({
"type": "urgent_security_review",
"priority": "high",
"description": f"Found {len(high_severity_alerts)} high-severity security alerts",
"action": "Immediate review of flagged users and access patterns required",
"affected_users": list(set(alert["user"] for alert in high_severity_alerts if "user" in alert))
})
# Night access recommendations
night_access_alerts = [alert for alert in security_alerts if alert["alert_type"] == "unusual_access_time"]
if night_access_alerts:
recommendations.append({
"type": "access_time_policy",
"priority": "medium",
"description": f"{len(night_access_alerts)} users have significant night-time access",
"action": "Review access time policies and consider time-based restrictions",
"affected_users": [alert["user"] for alert in night_access_alerts]
})
# Recommendations based on insights
security_posture = access_insights.get("security_posture", {})
risk_level = security_posture.get("risk_level", "unknown")
if risk_level == "high":
recommendations.append({
"type": "overall_security_improvement",
"priority": "high",
"description": "Overall security posture indicates high risk",
"action": "Comprehensive security audit and policy review recommended"
})
# Role-based recommendations
role_effectiveness = access_insights.get("role_effectiveness", {})
if role_effectiveness and role_effectiveness.get("total_roles", 0) < 3:
recommendations.append({
"type": "role_management",
"priority": "medium",
"description": "Limited role diversity detected",
"action": "Consider implementing more granular role-based access control"
})
return recommendations

examples/cursor/README.md Normal file

@@ -0,0 +1,147 @@
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
# Cursor Example: Integrating Doris MCP Server
This guide provides step-by-step instructions on how to integrate the `doris-mcp-server` with the [Cursor](https://cursor.sh/) IDE. This integration allows you to interact with your Apache Doris database using natural language queries directly within Cursor's AI chat.
## Table of Contents
* [Prerequisites](#prerequisites)
* [Step 1: Set Up the Project](#step-1-set-up-the-project)
* [Step 2: Configure the MCP Server in Cursor](#step-2-configure-the-mcp-server-in-cursor)
* [Step 3: Verify the Integration](#step-3-verify-the-integration)
* [Step 4: Query Your Database](#step-4-query-your-database)
* [Example 1: List Tables](#example-1-list-tables)
* [Example 2: Analyze Sales Trends](#example-2-analyze-sales-trends)
---
### Prerequisites
Before you begin, ensure you have the following installed and configured:
* The **Cursor** IDE
* **Git** for cloning the repository
* Access to an **Apache Doris** cluster (FE host, port, username, and password)
* **uv**, a fast Python package installer and runner
You can install `uv` with one of the following commands:
```bash
# For macOS (recommended)
brew install uv
# For other systems using pipx
pipx install uv
```
---
### Step 1: Set Up the Project
First, clone the `doris-mcp-server` repository to your local machine:
```bash
git clone https://github.com/apache/doris-mcp-server.git
cd doris-mcp-server
```
The necessary dependencies are listed in `requirements.txt` and will be managed automatically by `uv` in the next step.
---
### Step 2: Configure the MCP Server in Cursor
1. Open the cloned `doris-mcp-server` directory in Cursor.
2. Click the ⚙️ icon (top-right), then go to **Tools & Integrations**.
![add MCP Server](../images/cursor_add_mcp.png)
3. Click **Add a custom MCP Server**.
4. Paste the following JSON configuration:
```json
{
"mcpServers": {
"doris-mcp": {
"command": "uv",
"args": [
"run",
"--project",
"/path/to/your/doris-mcp-server",
"doris-mcp-server"
],
"env": {
"DORIS_HOST": "your_doris_fe_host",
"DORIS_PORT": "9030",
"DORIS_USER": "your_username",
"DORIS_PASSWORD": "your_password",
"DORIS_DATABASE": "ssb"
}
}
}
}
```
> ⚠️ **Important:**
>
> * Replace `"/path/to/your/doris-mcp-server"` with the **absolute path** to your local project directory.
> * Fill in your actual Doris FE host, username, password, and database name.
---
### Step 3: Verify the Integration
Once saved, go back to the **Settings** panel. If everything is configured correctly, you'll see a green status dot next to `doris-mcp-server`, along with available tools like `exec_query`.
![MCP Server](../images/cursor_doris-mcp.png)
---
### Step 4: Query Your Database
You can now chat with Cursor Agent to run SQL queries against your Doris database.
1. Open the chat panel using `Cmd + K` (macOS) or `Ctrl + K` (Windows/Linux), or click the chat icon in the top-right.
2. Switch to **Agent Mode**.
3. Start asking questions using natural language.
![ask](../images/cursor_agent.png)
---
#### Example 1: List Tables
> **Prompt:** What tables are in the `ssb` database?
The agent will call the `get_db_table_list` tool and return the results.
![ask](../images/cursor_ask1.png)
---
#### Example 2: Analyze Sales Trends
> **Prompt:** What has been the sales trend over the past ten years in the `ssb` database, and which year had the fastest growth?
The agent will generate an appropriate SQL query, send it to the MCP server, and interpret the results to give you growth trends and highlights.
![ask](../images/cursor_ask2.png)


@@ -103,6 +103,9 @@ If your Dify deployment requires a publicly accessible endpoint, you can use the
2. Select **Agent** as the template and set the **App Name** (e.g., `Doris ChatBI`).
![Agent setup](../images/dify_agent_setup.png)
3. Import from DSL: [dify_doris_dsl.yml](dify_doris_dsl.yml)
-----
## Instructions & Tool Configuration


@@ -0,0 +1,127 @@
app:
  description: ''
  icon: 🤖
  icon_background: '#FFEAD5'
  mode: agent-chat
  name: doris
  use_icon_as_answer_icon: false
dependencies:
- current_identifier: null
  type: marketplace
  value:
    marketplace_plugin_unique_identifier: langgenius/deepseek:0.0.5@21408d5c48cd9f18d66b08883d0999fe89e6d049c891324c2229dea23b9665d5
- current_identifier: null
  type: marketplace
  value:
    marketplace_plugin_unique_identifier: junjiem/mcp_sse:0.2.1@53cc613667fcf91dd7208dd5f6d2c8df3c7ff0af8b79e8f3c0a430f1b39bda4c
kind: app
model_config:
  agent_mode:
    enabled: true
    max_iteration: 10
    prompt: null
    strategy: function_call
    tools:
    - enabled: true
      isDeleted: false
      notAuthor: false
      provider_id: junjiem/mcp_sse/mcp_sse
      provider_name: junjiem/mcp_sse/mcp_sse
      provider_type: builtin
      tool_label: Get MCP tool list
      tool_name: mcp_sse_list_tools
      tool_parameters:
        prompts_as_tools: 1
        resources_as_tools: 1
        servers_config: null
    - enabled: true
      isDeleted: false
      notAuthor: false
      provider_id: junjiem/mcp_sse/mcp_sse
      provider_name: junjiem/mcp_sse/mcp_sse
      provider_type: builtin
      tool_label: Call MCP tool
      tool_name: mcp_sse_call_tool
      tool_parameters:
        arguments: ''
        prompts_as_tools: ''
        resources_as_tools: ''
        servers_config: ''
        tool_name: ''
  annotation_reply:
    enabled: false
  chat_prompt_config: {}
  completion_prompt_config: {}
  dataset_configs:
    datasets:
      datasets: []
    reranking_enable: true
    reranking_mode: reranking_model
    reranking_model:
      reranking_model_name: ''
      reranking_provider_name: ''
    retrieval_model: multiple
    top_k: 4
  dataset_query_variable: ''
  external_data_tools: []
  file_upload:
    allowed_file_extensions:
    - .JPG
    - .JPEG
    - .PNG
    - .GIF
    - .WEBP
    - .SVG
    - .MP4
    - .MOV
    - .MPEG
    - .WEBM
    allowed_file_types: []
    allowed_file_upload_methods:
    - remote_url
    - local_file
    enabled: false
    image:
      detail: high
      enabled: false
      number_limits: 3
      transfer_methods:
      - remote_url
      - local_file
    number_limits: 3
  model:
    completion_params:
      stop: []
    mode: chat
    name: deepseek-chat
    provider: langgenius/deepseek/deepseek
  more_like_this:
    enabled: false
  opening_statement: ''
  pre_prompt: "<instruction>\nUse MCP tools to complete tasks as much as possible.\
    \ Carefully read the annotations, method names, and parameter descriptions of\
    \ each tool. Please follow these steps:\n1. Analyze the user's question and match\
    \ the most appropriate tool.\n2. Use tool names and parameters exactly as defined;\
    \ do not invent new ones.\n3. Pass parameters in the required JSON format.\n4.\
    \ When calling tools, use:\n {\"mcp_sse_call_tool\": {\"tool_name\": \"<tool_name>\"\
    , \"arguments\": \"{}\"}}\n5. Output plain text only—no XML tags.\n<input>\nUser\
    \ question: user_query\n</input>\n<output>\nReturn tool results or a final answer,\
    \ including analysis.\n</output>\n</instruction>"
  prompt_type: simple
  retriever_resource:
    enabled: true
  sensitive_word_avoidance:
    configs: []
    enabled: false
    type: ''
  speech_to_text:
    enabled: false
  suggested_questions: []
  suggested_questions_after_answer:
    enabled: false
  text_to_speech:
    enabled: false
    language: ''
    voice: ''
  user_input_form: []
version: 0.3.0
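
The `pre_prompt` in this DSL asks the model to emit tool calls as `{"mcp_sse_call_tool": {"tool_name": ..., "arguments": "{}"}}`, where `arguments` is itself a JSON-encoded string rather than a nested object. A minimal Python sketch of building that envelope follows. `build_call` is a hypothetical helper, and the `sql` argument passed to `exec_query` is an assumed parameter name, not one confirmed by this diff.

```python
import json


def build_call(tool_name, arguments=None):
    """Build the envelope described in the pre_prompt.

    `arguments` is serialized to a JSON string, matching the "{}" placeholder
    in the prompt, instead of being embedded as a nested object.
    """
    return {
        "mcp_sse_call_tool": {
            "tool_name": tool_name,
            "arguments": json.dumps(arguments or {}),
        }
    }


# `sql` is an assumed argument name, used for illustration only.
call = build_call("exec_query", {"sql": "SELECT 1"})
print(json.dumps(call))
```

Encoding the arguments twice looks redundant, but it keeps the outer object flat, so the plugin receives `arguments` as a single string field.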

Binary files not shown: 5 images added (323 KiB, 673 KiB, 118 KiB, 232 KiB, 80 KiB).

@@ -20,7 +20,7 @@ build-backend = "hatchling.build"
[project]
name = "doris-mcp-server"
-version = "0.4.2"
+version = "0.5.1"
description = "Enterprise-grade Model Context Protocol (MCP) server implementation for Apache Doris"
authors = [
{name = "Yijia Su", email = "freeoneplus@apache.org"}
@@ -46,6 +46,10 @@ dependencies = [
# Database drivers
"aiomysql>=0.2.0",
"PyMySQL>=1.1.0",
# ADBC (Arrow Flight SQL) dependencies
"adbc-driver-manager>=0.8.0",
"adbc-driver-flightsql>=0.8.0",
"pyarrow>=14.0.0",
# Async and utility libraries
"asyncio-mqtt>=0.16.0",
"aiofiles>=23.0.0",
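
The three new packages are what a Python client needs to talk to Doris over Arrow Flight SQL. A minimal sketch of such a connection follows, assuming Arrow Flight SQL is enabled on the FE. The host, port, and credentials are placeholders supplied by the caller, both function names are hypothetical, and this is not necessarily how doris-mcp-server wires things up internally.

```python
def flight_sql_uri(host, port):
    """Build the gRPC URI passed to the Flight SQL driver."""
    return f"grpc://{host}:{port}"


def query_arrow(host, port, user, password, sql):
    """Run `sql` over Arrow Flight SQL and return a pyarrow.Table."""
    # Imported lazily so the URI helper works without the ADBC packages.
    import adbc_driver_manager
    import adbc_driver_flightsql.dbapi as flight_sql

    conn = flight_sql.connect(
        uri=flight_sql_uri(host, port),
        db_kwargs={
            adbc_driver_manager.DatabaseOptions.USERNAME.value: user,
            adbc_driver_manager.DatabaseOptions.PASSWORD.value: password,
        },
    )
    try:
        cursor = conn.cursor()
        try:
            cursor.execute(sql)
            # Results arrive as Arrow record batches, collected into one table.
            return cursor.fetch_arrow_table()
        finally:
            cursor.close()
    finally:
        conn.close()
```

A call would look like `query_arrow("your_doris_fe_host", your_flight_port, "your_username", "your_password", "SELECT 1")`, with the port taken from the FE's Arrow Flight SQL setting.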

@@ -5,6 +5,9 @@
mcp>=1.8.0,<2.0.0
aiomysql>=0.2.0
PyMySQL>=1.1.0
adbc-driver-manager>=0.8.0
adbc-driver-flightsql>=0.8.0
pyarrow>=14.0.0
asyncio-mqtt>=0.16.0
aiofiles>=23.0.0
aiohttp>=3.9.0

uv.lock generated

@@ -6,6 +6,48 @@ resolution-markers = [
"python_full_version < '3.13'",
]
[[package]]
name = "adbc-driver-flightsql"
version = "1.7.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "adbc-driver-manager" },
{ name = "importlib-resources" },
]
sdist = { url = "https://files.pythonhosted.org/packages/b8/d4/ebd3eed981c771565677084474cdf465141455b5deb1ca409c616609bfd7/adbc_driver_flightsql-1.7.0.tar.gz", hash = "sha256:5dca460a2c66e45b29208eaf41a7206f252177435fa48b16f19833b12586f7a0", size = 21247 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/36/20/807fca9d904b7e0d3020439828d6410db7fd7fd635824a80cab113d9fad1/adbc_driver_flightsql-1.7.0-py3-none-macosx_10_15_x86_64.whl", hash = "sha256:a5658f9bc3676bd122b26138e9b9ce56b8bf37387efe157b4c66d56f942361c6", size = 7749664 },
{ url = "https://files.pythonhosted.org/packages/cd/e6/9e50f6497819c911b9cc1962ffde610b60f7d8e951d6bb3fa145dcfb50a7/adbc_driver_flightsql-1.7.0-py3-none-macosx_11_0_arm64.whl", hash = "sha256:65e21df86b454d8db422c8ee22db31be217d88c42d9d6dd89119f06813037c91", size = 7302476 },
{ url = "https://files.pythonhosted.org/packages/27/82/e51af85e7cc8c87bc8ce4fae8ca7ee1d3cf39c926be0aeab789cedc93f0a/adbc_driver_flightsql-1.7.0-py3-none-manylinux1_x86_64.manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_5_x86_64.whl", hash = "sha256:3282fdc7b73c712780cc777975288c88b1e3a555355bbe09df101aa954f8f105", size = 7686056 },
{ url = "https://files.pythonhosted.org/packages/8b/c9/591c8ecbaf010ba3f4b360db602050ee5880cd077a573c9e90fcb270ab71/adbc_driver_flightsql-1.7.0-py3-none-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:e0c5737ae6ee3bbfba44dcbc28ba1ff8cf3ab6521888c4b0f10dd6a482482161", size = 7050275 },
{ url = "https://files.pythonhosted.org/packages/10/14/f339e9a5d8dbb3e3040215514cea9cca0a58640964aaccc6532f18003a03/adbc_driver_flightsql-1.7.0-py3-none-win_amd64.whl", hash = "sha256:f8b5290b322304b7d944ca823754e6354c1868dbbe94ddf84236f3e0329545da", size = 14312858 },
]
[[package]]
name = "adbc-driver-manager"
version = "1.7.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "typing-extensions" },
]
sdist = { url = "https://files.pythonhosted.org/packages/bb/bf/2986a2cd3e1af658d2597f7e2308564e5c11e036f9736d5c256f1e00d578/adbc_driver_manager-1.7.0.tar.gz", hash = "sha256:e3edc5d77634b5925adf6eb4fbcd01676b54acb2f5b1d6864b6a97c6a899591a", size = 198128 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/74/3a/72bd9c45d55f1f5f4c549e206de8cfe3313b31f7b95fbcb180da05c81044/adbc_driver_manager-1.7.0-cp312-cp312-macosx_10_15_x86_64.whl", hash = "sha256:8da1ac4c19bcbf30b3bd54247ec889dfacc9b44147c70b4da79efe2e9ba93600", size = 524210 },
{ url = "https://files.pythonhosted.org/packages/33/29/e1a8d8dde713a287f8021f3207127f133ddce578711a4575218bdf78ef27/adbc_driver_manager-1.7.0-cp312-cp312-macosx_11_0_arm64.whl", hash = "sha256:408bc23bad1a6823b364e2388f85f96545e82c3b2db97d7828a4b94839d3f29e", size = 505902 },
{ url = "https://files.pythonhosted.org/packages/59/00/773ece64a58c0ade797ab4577e7cdc4c71ebf800b86d2d5637e3bfe605e9/adbc_driver_manager-1.7.0-cp312-cp312-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:cf38294320c23e47ed3455348e910031ad8289c3f9167ae35519ac957b7add01", size = 2974883 },
{ url = "https://files.pythonhosted.org/packages/7c/ad/1568da6ae9ab70983f1438503d3906c6b1355601230e891d16e272376a04/adbc_driver_manager-1.7.0-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:689f91b62c18a9f86f892f112786fb157cacc4729b4d81666db4ca778eade2a8", size = 2997781 },
{ url = "https://files.pythonhosted.org/packages/19/66/2b6ea5afded25a3fa009873c2bbebcd9283910877cc10b9453d680c00b9a/adbc_driver_manager-1.7.0-cp312-cp312-win_amd64.whl", hash = "sha256:f936cfc8d098898a47ef60396bd7a73926ec3068f2d6d92a2be4e56e4aaf3770", size = 690041 },
{ url = "https://files.pythonhosted.org/packages/b2/3b/91154c83a98f103a3d97c9e2cb838c3842aef84ca4f4b219164b182d9516/adbc_driver_manager-1.7.0-cp313-cp313-macosx_10_15_x86_64.whl", hash = "sha256:ab9ee36683fd54f61b0db0f4a96f70fe1932223e61df9329290370b145abb0a9", size = 522737 },
{ url = "https://files.pythonhosted.org/packages/9c/52/4bc80c3388d5e2a3b6e504ba9656dd9eb3d8dbe822d07af38db1b8c96fb1/adbc_driver_manager-1.7.0-cp313-cp313-macosx_11_0_arm64.whl", hash = "sha256:4ec03d94177f71a8d3a149709f4111e021f9950229b35c0a803aadb1a1855a4b", size = 503896 },
{ url = "https://files.pythonhosted.org/packages/e1/f3/46052ca11224f661cef4721e19138bc73e750ba6aea54f22606950491606/adbc_driver_manager-1.7.0-cp313-cp313-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:700c79dac08a620018c912ede45a6dc7851819bc569a53073ab652dc0bd0c92f", size = 2972586 },
{ url = "https://files.pythonhosted.org/packages/a2/22/44738b41bb5ca30f94b5f4c00c71c20be86d7eb4ddc389d4cf3c7b8b69ef/adbc_driver_manager-1.7.0-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:98db0f5d0aa1635475f63700a7b6f677390beb59c69c7ba9d388bc8ce3779388", size = 2992001 },
{ url = "https://files.pythonhosted.org/packages/1b/2b/5184fe5a529feb019582cc90d0f65e0021d52c34ca20620551532340645a/adbc_driver_manager-1.7.0-cp313-cp313-win_amd64.whl", hash = "sha256:4b7e5e9a163acb21804647cc7894501df51cdcd780ead770557112a26ca01ca6", size = 688789 },
{ url = "https://files.pythonhosted.org/packages/3f/e0/b283544e1bb7864bf5a5ac9cd330f111009eff9180ec5000420510cf9342/adbc_driver_manager-1.7.0-cp313-cp313t-macosx_10_15_x86_64.whl", hash = "sha256:ac83717965b83367a8ad6c0536603acdcfa66e0592d783f8940f55fda47d963e", size = 538625 },
{ url = "https://files.pythonhosted.org/packages/77/5a/dc244264bd8d0c331a418d2bdda5cb6e26c30493ff075d706aa81d4e3b30/adbc_driver_manager-1.7.0-cp313-cp313t-macosx_11_0_arm64.whl", hash = "sha256:4c234cf81b00eaf7e7c65dbd0f0ddf7bdae93dfcf41e9d8543f9ecf4b10590f6", size = 523627 },
{ url = "https://files.pythonhosted.org/packages/e9/ff/a499a00367fd092edb20dc6e36c81e3c7a437671c70481cae97f46c8156a/adbc_driver_manager-1.7.0-cp313-cp313t-manylinux2014_aarch64.manylinux_2_17_aarch64.whl", hash = "sha256:ad8aa4b039cc50722a700b544773388c6b1dea955781a01f79cd35d0a1e6edbf", size = 3037517 },
{ url = "https://files.pythonhosted.org/packages/25/6e/9dfdb113294dcb24b4f53924cd4a9c9af3fbe45a9790c1327048df731246/adbc_driver_manager-1.7.0-cp313-cp313t-manylinux2014_x86_64.manylinux_2_17_x86_64.whl", hash = "sha256:4409ff53578e01842a8f57787ebfbfee790c1da01a6bd57fcb7701ed5d4dd4f7", size = 3016543 },
]
[[package]]
name = "aiofiles"
version = "24.1.0"
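
Each `sdist` and wheel entry in the lock file carries a `sha256:` digest that can be compared against the downloaded artifact. The check itself can be sketched in a few lines of Python. `verify_sha256` is a hypothetical helper for illustration and does not describe how uv implements verification.

```python
import hashlib


def verify_sha256(path, expected):
    """Return True if the file at `path` matches a `sha256:<hex>` digest."""
    algo, _, hex_digest = expected.partition(":")
    if algo != "sha256" or not hex_digest:
        raise ValueError(f"unsupported hash spec: {expected!r}")
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large wheels are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == hex_digest.lower()
```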
@@ -518,6 +560,176 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/8f/d7/9322c609343d929e75e7e5e6255e614fcc67572cfd083959cdef3b7aad79/docutils-0.21.2-py3-none-any.whl", hash = "sha256:dafca5b9e384f0e419294eb4d2ff9fa826435bf15f15b7bd45723e8ad76811b2", size = 587408 },
]
[[package]]
name = "doris-mcp-server"
version = "0.5.0"
source = { editable = "." }
dependencies = [
{ name = "adbc-driver-flightsql" },
{ name = "adbc-driver-manager" },
{ name = "aiofiles" },
{ name = "aiohttp" },
{ name = "aiomysql" },
{ name = "aioredis" },
{ name = "asyncio-mqtt" },
{ name = "bcrypt" },
{ name = "click" },
{ name = "cryptography" },
{ name = "fastapi" },
{ name = "httpx" },
{ name = "mcp" },
{ name = "numpy" },
{ name = "orjson" },
{ name = "pandas" },
{ name = "passlib", extra = ["bcrypt"] },
{ name = "prometheus-client" },
{ name = "pyarrow" },
{ name = "pydantic" },
{ name = "pydantic-settings" },
{ name = "pyjwt" },
{ name = "pymysql" },
{ name = "pytest" },
{ name = "pytest-asyncio" },
{ name = "pytest-cov" },
{ name = "python-dateutil" },
{ name = "python-dotenv" },
{ name = "python-jose", extra = ["cryptography"] },
{ name = "python-multipart" },
{ name = "pyyaml" },
{ name = "requests" },
{ name = "rich" },
{ name = "sqlparse" },
{ name = "starlette" },
{ name = "structlog" },
{ name = "toml" },
{ name = "tqdm" },
{ name = "typer" },
{ name = "uvicorn", extra = ["standard"] },
{ name = "websockets" },
]
[package.optional-dependencies]
dev = [
{ name = "bandit" },
{ name = "black" },
{ name = "flake8" },
{ name = "isort" },
{ name = "mypy" },
{ name = "myst-parser" },
{ name = "pre-commit" },
{ name = "pytest" },
{ name = "pytest-asyncio" },
{ name = "pytest-cov" },
{ name = "pytest-mock" },
{ name = "pytest-xdist" },
{ name = "ruff" },
{ name = "safety" },
{ name = "sphinx" },
{ name = "sphinx-rtd-theme" },
{ name = "tox" },
]
docs = [
{ name = "myst-parser" },
{ name = "sphinx" },
{ name = "sphinx-autoapi" },
{ name = "sphinx-rtd-theme" },
]
monitoring = [
{ name = "grafana-client" },
{ name = "jaeger-client" },
{ name = "opentelemetry-api" },
{ name = "opentelemetry-sdk" },
{ name = "prometheus-client" },
]
performance = [
{ name = "cchardet" },
{ name = "orjson" },
{ name = "uvloop" },
]
[package.dev-dependencies]
dev = [
{ name = "ruff" },
]
[package.metadata]
requires-dist = [
{ name = "adbc-driver-flightsql", specifier = ">=0.8.0" },
{ name = "adbc-driver-manager", specifier = ">=0.8.0" },
{ name = "aiofiles", specifier = ">=23.0.0" },
{ name = "aiohttp", specifier = ">=3.9.0" },
{ name = "aiomysql", specifier = ">=0.2.0" },
{ name = "aioredis", specifier = ">=2.0.0" },
{ name = "asyncio-mqtt", specifier = ">=0.16.0" },
{ name = "bandit", marker = "extra == 'dev'", specifier = ">=1.7.0" },
{ name = "bcrypt", specifier = ">=4.1.0" },
{ name = "black", marker = "extra == 'dev'", specifier = ">=23.12.0" },
{ name = "cchardet", marker = "extra == 'performance'", specifier = ">=2.1.0" },
{ name = "click", specifier = ">=8.1.0" },
{ name = "cryptography", specifier = ">=41.0.0" },
{ name = "fastapi", specifier = ">=0.108.0" },
{ name = "flake8", marker = "extra == 'dev'", specifier = ">=7.0.0" },
{ name = "grafana-client", marker = "extra == 'monitoring'", specifier = ">=3.5.0" },
{ name = "httpx", specifier = ">=0.26.0" },
{ name = "isort", marker = "extra == 'dev'", specifier = ">=5.13.0" },
{ name = "jaeger-client", marker = "extra == 'monitoring'", specifier = ">=4.8.0" },
{ name = "mcp", specifier = ">=1.8.0,<2.0.0" },
{ name = "mypy", marker = "extra == 'dev'", specifier = ">=1.8.0" },
{ name = "myst-parser", marker = "extra == 'dev'", specifier = ">=2.0.0" },
{ name = "myst-parser", marker = "extra == 'docs'", specifier = ">=2.0.0" },
{ name = "numpy", specifier = ">=1.24.0" },
{ name = "opentelemetry-api", marker = "extra == 'monitoring'", specifier = ">=1.21.0" },
{ name = "opentelemetry-sdk", marker = "extra == 'monitoring'", specifier = ">=1.21.0" },
{ name = "orjson", specifier = ">=3.9.0" },
{ name = "orjson", marker = "extra == 'performance'", specifier = ">=3.9.0" },
{ name = "pandas", specifier = ">=2.0.0" },
{ name = "passlib", extras = ["bcrypt"], specifier = ">=1.7.0" },
{ name = "pre-commit", marker = "extra == 'dev'", specifier = ">=3.6.0" },
{ name = "prometheus-client", specifier = ">=0.19.0" },
{ name = "prometheus-client", marker = "extra == 'monitoring'", specifier = ">=0.19.0" },
{ name = "pyarrow", specifier = ">=14.0.0" },
{ name = "pydantic", specifier = ">=2.5.0" },
{ name = "pydantic-settings", specifier = ">=2.1.0" },
{ name = "pyjwt", specifier = ">=2.8.0" },
{ name = "pymysql", specifier = ">=1.1.0" },
{ name = "pytest", specifier = ">=8.4.0" },
{ name = "pytest", marker = "extra == 'dev'", specifier = ">=7.4.0" },
{ name = "pytest-asyncio", specifier = ">=1.0.0" },
{ name = "pytest-asyncio", marker = "extra == 'dev'", specifier = ">=0.23.0" },
{ name = "pytest-cov", specifier = ">=6.1.1" },
{ name = "pytest-cov", marker = "extra == 'dev'", specifier = ">=4.1.0" },
{ name = "pytest-mock", marker = "extra == 'dev'", specifier = ">=3.12.0" },
{ name = "pytest-xdist", marker = "extra == 'dev'", specifier = ">=3.5.0" },
{ name = "python-dateutil", specifier = ">=2.8.0" },
{ name = "python-dotenv", specifier = ">=1.0.0" },
{ name = "python-jose", extras = ["cryptography"], specifier = ">=3.3.0" },
{ name = "python-multipart", specifier = ">=0.0.6" },
{ name = "pyyaml", specifier = ">=6.0.0" },
{ name = "requests", specifier = ">=2.31.0" },
{ name = "rich", specifier = ">=13.7.0" },
{ name = "ruff", marker = "extra == 'dev'", specifier = ">=0.1.0" },
{ name = "safety", marker = "extra == 'dev'", specifier = ">=2.3.0" },
{ name = "sphinx", marker = "extra == 'dev'", specifier = ">=7.2.0" },
{ name = "sphinx", marker = "extra == 'docs'", specifier = ">=7.2.0" },
{ name = "sphinx-autoapi", marker = "extra == 'docs'", specifier = ">=3.0.0" },
{ name = "sphinx-rtd-theme", marker = "extra == 'dev'", specifier = ">=2.0.0" },
{ name = "sphinx-rtd-theme", marker = "extra == 'docs'", specifier = ">=2.0.0" },
{ name = "sqlparse", specifier = ">=0.4.4" },
{ name = "starlette", specifier = ">=0.27.0" },
{ name = "structlog", specifier = ">=23.2.0" },
{ name = "toml", specifier = ">=0.10.0" },
{ name = "tox", marker = "extra == 'dev'", specifier = ">=4.11.0" },
{ name = "tqdm", specifier = ">=4.66.0" },
{ name = "typer", specifier = ">=0.9.0" },
{ name = "uvicorn", extras = ["standard"], specifier = ">=0.25.0" },
{ name = "uvloop", marker = "extra == 'performance'", specifier = ">=0.19.0" },
{ name = "websockets", specifier = ">=12.0" },
]
provides-extras = ["dev", "docs", "performance", "monitoring"]
[package.metadata.requires-dev]
dev = [{ name = "ruff", specifier = ">=0.11.13" }]
[[package]]
name = "dparse"
version = "0.6.4"
@@ -768,6 +980,15 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/20/b0/36bd937216ec521246249be3bf9855081de4c5e06a0c9b4219dbeda50373/importlib_metadata-8.7.0-py3-none-any.whl", hash = "sha256:e5dd1551894c77868a30651cef00984d50e1002d06942a7101d34870c5f02afd", size = 27656 },
]
[[package]]
name = "importlib-resources"
version = "6.5.2"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/cf/8c/f834fbf984f691b4f7ff60f50b514cc3de5cc08abfc3295564dd89c5e2e7/importlib_resources-6.5.2.tar.gz", hash = "sha256:185f87adef5bcc288449d98fb4fba07cea78bc036455dd44c5fc4a2fe78fed2c", size = 44693 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/a4/ed/1f1afb2e9e7f38a545d628f864d562a5ae64fe6f7a10e28ffb9b185b4e89/importlib_resources-6.5.2-py3-none-any.whl", hash = "sha256:789cfdc3ed28c78b67a06acb8126751ced69a3d5f79c095a98298cd8a760ccec", size = 37461 },
]
[[package]]
name = "iniconfig"
version = "2.1.0"
@@ -946,170 +1167,6 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/79/45/823ad05504bea55cb0feb7470387f151252127ad5c72f8882e8fe6cf5c0e/mcp-1.9.3-py3-none-any.whl", hash = "sha256:69b0136d1ac9927402ed4cf221d4b8ff875e7132b0b06edd446448766f34f9b9", size = 131063 },
]
[[package]]
name = "mcp-doris-server"
version = "0.4.2"
source = { editable = "." }
dependencies = [
{ name = "aiofiles" },
{ name = "aiohttp" },
{ name = "aiomysql" },
{ name = "aioredis" },
{ name = "asyncio-mqtt" },
{ name = "bcrypt" },
{ name = "click" },
{ name = "cryptography" },
{ name = "fastapi" },
{ name = "httpx" },
{ name = "mcp" },
{ name = "numpy" },
{ name = "orjson" },
{ name = "pandas" },
{ name = "passlib", extra = ["bcrypt"] },
{ name = "prometheus-client" },
{ name = "pydantic" },
{ name = "pydantic-settings" },
{ name = "pyjwt" },
{ name = "pymysql" },
{ name = "pytest" },
{ name = "pytest-asyncio" },
{ name = "pytest-cov" },
{ name = "python-dateutil" },
{ name = "python-dotenv" },
{ name = "python-jose", extra = ["cryptography"] },
{ name = "python-multipart" },
{ name = "pyyaml" },
{ name = "requests" },
{ name = "rich" },
{ name = "sqlparse" },
{ name = "starlette" },
{ name = "structlog" },
{ name = "toml" },
{ name = "tqdm" },
{ name = "typer" },
{ name = "uvicorn", extra = ["standard"] },
{ name = "websockets" },
]
[package.optional-dependencies]
dev = [
{ name = "bandit" },
{ name = "black" },
{ name = "flake8" },
{ name = "isort" },
{ name = "mypy" },
{ name = "myst-parser" },
{ name = "pre-commit" },
{ name = "pytest" },
{ name = "pytest-asyncio" },
{ name = "pytest-cov" },
{ name = "pytest-mock" },
{ name = "pytest-xdist" },
{ name = "ruff" },
{ name = "safety" },
{ name = "sphinx" },
{ name = "sphinx-rtd-theme" },
{ name = "tox" },
]
docs = [
{ name = "myst-parser" },
{ name = "sphinx" },
{ name = "sphinx-autoapi" },
{ name = "sphinx-rtd-theme" },
]
monitoring = [
{ name = "grafana-client" },
{ name = "jaeger-client" },
{ name = "opentelemetry-api" },
{ name = "opentelemetry-sdk" },
{ name = "prometheus-client" },
]
performance = [
{ name = "cchardet" },
{ name = "orjson" },
{ name = "uvloop" },
]
[package.dev-dependencies]
dev = [
{ name = "ruff" },
]
[package.metadata]
requires-dist = [
{ name = "aiofiles", specifier = ">=23.0.0" },
{ name = "aiohttp", specifier = ">=3.9.0" },
{ name = "aiomysql", specifier = ">=0.2.0" },
{ name = "aioredis", specifier = ">=2.0.0" },
{ name = "asyncio-mqtt", specifier = ">=0.16.0" },
{ name = "bandit", marker = "extra == 'dev'", specifier = ">=1.7.0" },
{ name = "bcrypt", specifier = ">=4.1.0" },
{ name = "black", marker = "extra == 'dev'", specifier = ">=23.12.0" },
{ name = "cchardet", marker = "extra == 'performance'", specifier = ">=2.1.0" },
{ name = "click", specifier = ">=8.1.0" },
{ name = "cryptography", specifier = ">=41.0.0" },
{ name = "fastapi", specifier = ">=0.108.0" },
{ name = "flake8", marker = "extra == 'dev'", specifier = ">=7.0.0" },
{ name = "grafana-client", marker = "extra == 'monitoring'", specifier = ">=3.5.0" },
{ name = "httpx", specifier = ">=0.26.0" },
{ name = "isort", marker = "extra == 'dev'", specifier = ">=5.13.0" },
{ name = "jaeger-client", marker = "extra == 'monitoring'", specifier = ">=4.8.0" },
{ name = "mcp", specifier = ">=1.8.0,<2.0.0" },
{ name = "mypy", marker = "extra == 'dev'", specifier = ">=1.8.0" },
{ name = "myst-parser", marker = "extra == 'dev'", specifier = ">=2.0.0" },
{ name = "myst-parser", marker = "extra == 'docs'", specifier = ">=2.0.0" },
{ name = "numpy", specifier = ">=1.24.0" },
{ name = "opentelemetry-api", marker = "extra == 'monitoring'", specifier = ">=1.21.0" },
{ name = "opentelemetry-sdk", marker = "extra == 'monitoring'", specifier = ">=1.21.0" },
{ name = "orjson", specifier = ">=3.9.0" },
{ name = "orjson", marker = "extra == 'performance'", specifier = ">=3.9.0" },
{ name = "pandas", specifier = ">=2.0.0" },
{ name = "passlib", extras = ["bcrypt"], specifier = ">=1.7.0" },
{ name = "pre-commit", marker = "extra == 'dev'", specifier = ">=3.6.0" },
{ name = "prometheus-client", specifier = ">=0.19.0" },
{ name = "prometheus-client", marker = "extra == 'monitoring'", specifier = ">=0.19.0" },
{ name = "pydantic", specifier = ">=2.5.0" },
{ name = "pydantic-settings", specifier = ">=2.1.0" },
{ name = "pyjwt", specifier = ">=2.8.0" },
{ name = "pymysql", specifier = ">=1.1.0" },
{ name = "pytest", specifier = ">=8.4.0" },
{ name = "pytest", marker = "extra == 'dev'", specifier = ">=7.4.0" },
{ name = "pytest-asyncio", specifier = ">=1.0.0" },
{ name = "pytest-asyncio", marker = "extra == 'dev'", specifier = ">=0.23.0" },
{ name = "pytest-cov", specifier = ">=6.1.1" },
{ name = "pytest-cov", marker = "extra == 'dev'", specifier = ">=4.1.0" },
{ name = "pytest-mock", marker = "extra == 'dev'", specifier = ">=3.12.0" },
{ name = "pytest-xdist", marker = "extra == 'dev'", specifier = ">=3.5.0" },
{ name = "python-dateutil", specifier = ">=2.8.0" },
{ name = "python-dotenv", specifier = ">=1.0.0" },
{ name = "python-jose", extras = ["cryptography"], specifier = ">=3.3.0" },
{ name = "python-multipart", specifier = ">=0.0.6" },
{ name = "pyyaml", specifier = ">=6.0.0" },
{ name = "requests", specifier = ">=2.31.0" },
{ name = "rich", specifier = ">=13.7.0" },
{ name = "ruff", marker = "extra == 'dev'", specifier = ">=0.1.0" },
{ name = "safety", marker = "extra == 'dev'", specifier = ">=2.3.0" },
{ name = "sphinx", marker = "extra == 'dev'", specifier = ">=7.2.0" },
{ name = "sphinx", marker = "extra == 'docs'", specifier = ">=7.2.0" },
{ name = "sphinx-autoapi", marker = "extra == 'docs'", specifier = ">=3.0.0" },
{ name = "sphinx-rtd-theme", marker = "extra == 'dev'", specifier = ">=2.0.0" },
{ name = "sphinx-rtd-theme", marker = "extra == 'docs'", specifier = ">=2.0.0" },
{ name = "sqlparse", specifier = ">=0.4.4" },
{ name = "starlette", specifier = ">=0.27.0" },
{ name = "structlog", specifier = ">=23.2.0" },
{ name = "toml", specifier = ">=0.10.0" },
{ name = "tox", marker = "extra == 'dev'", specifier = ">=4.11.0" },
{ name = "tqdm", specifier = ">=4.66.0" },
{ name = "typer", specifier = ">=0.9.0" },
{ name = "uvicorn", extras = ["standard"], specifier = ">=0.25.0" },
{ name = "uvloop", marker = "extra == 'performance'", specifier = ">=0.19.0" },
{ name = "websockets", specifier = ">=12.0" },
]
provides-extras = ["dev", "docs", "performance", "monitoring"]
[package.metadata.requires-dev]
dev = [{ name = "ruff", specifier = ">=0.11.13" }]
[[package]]
name = "mdit-py-plugins"
version = "0.4.2"
@@ -1605,6 +1662,41 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/7b/d7/7831438e6c3ebbfa6e01a927127a6cb42ad3ab844247f3c5b96bea25d73d/psutil-6.1.1-cp37-abi3-win_amd64.whl", hash = "sha256:f35cfccb065fff93529d2afb4a2e89e363fe63ca1e4a5da22b603a85833c2649", size = 254444 },
]
[[package]]
name = "pyarrow"
version = "20.0.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/a2/ee/a7810cb9f3d6e9238e61d312076a9859bf3668fd21c69744de9532383912/pyarrow-20.0.0.tar.gz", hash = "sha256:febc4a913592573c8d5805091a6c2b5064c8bd6e002131f01061797d91c783c1", size = 1125187 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/a1/d6/0c10e0d54f6c13eb464ee9b67a68b8c71bcf2f67760ef5b6fbcddd2ab05f/pyarrow-20.0.0-cp312-cp312-macosx_12_0_arm64.whl", hash = "sha256:75a51a5b0eef32727a247707d4755322cb970be7e935172b6a3a9f9ae98404ba", size = 30815067 },
{ url = "https://files.pythonhosted.org/packages/7e/e2/04e9874abe4094a06fd8b0cbb0f1312d8dd7d707f144c2ec1e5e8f452ffa/pyarrow-20.0.0-cp312-cp312-macosx_12_0_x86_64.whl", hash = "sha256:211d5e84cecc640c7a3ab900f930aaff5cd2702177e0d562d426fb7c4f737781", size = 32297128 },
{ url = "https://files.pythonhosted.org/packages/31/fd/c565e5dcc906a3b471a83273039cb75cb79aad4a2d4a12f76cc5ae90a4b8/pyarrow-20.0.0-cp312-cp312-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4ba3cf4182828be7a896cbd232aa8dd6a31bd1f9e32776cc3796c012855e1199", size = 41334890 },
{ url = "https://files.pythonhosted.org/packages/af/a9/3bdd799e2c9b20c1ea6dc6fa8e83f29480a97711cf806e823f808c2316ac/pyarrow-20.0.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:2c3a01f313ffe27ac4126f4c2e5ea0f36a5fc6ab51f8726cf41fee4b256680bd", size = 42421775 },
{ url = "https://files.pythonhosted.org/packages/10/f7/da98ccd86354c332f593218101ae56568d5dcedb460e342000bd89c49cc1/pyarrow-20.0.0-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:a2791f69ad72addd33510fec7bb14ee06c2a448e06b649e264c094c5b5f7ce28", size = 40687231 },
{ url = "https://files.pythonhosted.org/packages/bb/1b/2168d6050e52ff1e6cefc61d600723870bf569cbf41d13db939c8cf97a16/pyarrow-20.0.0-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:4250e28a22302ce8692d3a0e8ec9d9dde54ec00d237cff4dfa9c1fbf79e472a8", size = 42295639 },
{ url = "https://files.pythonhosted.org/packages/b2/66/2d976c0c7158fd25591c8ca55aee026e6d5745a021915a1835578707feb3/pyarrow-20.0.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:89e030dc58fc760e4010148e6ff164d2f44441490280ef1e97a542375e41058e", size = 42908549 },
{ url = "https://files.pythonhosted.org/packages/31/a9/dfb999c2fc6911201dcbf348247f9cc382a8990f9ab45c12eabfd7243a38/pyarrow-20.0.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:6102b4864d77102dbbb72965618e204e550135a940c2534711d5ffa787df2a5a", size = 44557216 },
{ url = "https://files.pythonhosted.org/packages/a0/8e/9adee63dfa3911be2382fb4d92e4b2e7d82610f9d9f668493bebaa2af50f/pyarrow-20.0.0-cp312-cp312-win_amd64.whl", hash = "sha256:96d6a0a37d9c98be08f5ed6a10831d88d52cac7b13f5287f1e0f625a0de8062b", size = 25660496 },
{ url = "https://files.pythonhosted.org/packages/9b/aa/daa413b81446d20d4dad2944110dcf4cf4f4179ef7f685dd5a6d7570dc8e/pyarrow-20.0.0-cp313-cp313-macosx_12_0_arm64.whl", hash = "sha256:a15532e77b94c61efadde86d10957950392999503b3616b2ffcef7621a002893", size = 30798501 },
{ url = "https://files.pythonhosted.org/packages/ff/75/2303d1caa410925de902d32ac215dc80a7ce7dd8dfe95358c165f2adf107/pyarrow-20.0.0-cp313-cp313-macosx_12_0_x86_64.whl", hash = "sha256:dd43f58037443af715f34f1322c782ec463a3c8a94a85fdb2d987ceb5658e061", size = 32277895 },
{ url = "https://files.pythonhosted.org/packages/92/41/fe18c7c0b38b20811b73d1bdd54b1fccba0dab0e51d2048878042d84afa8/pyarrow-20.0.0-cp313-cp313-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:aa0d288143a8585806e3cc7c39566407aab646fb9ece164609dac1cfff45f6ae", size = 41327322 },
{ url = "https://files.pythonhosted.org/packages/da/ab/7dbf3d11db67c72dbf36ae63dcbc9f30b866c153b3a22ef728523943eee6/pyarrow-20.0.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:b6953f0114f8d6f3d905d98e987d0924dabce59c3cda380bdfaa25a6201563b4", size = 42411441 },
{ url = "https://files.pythonhosted.org/packages/90/c3/0c7da7b6dac863af75b64e2f827e4742161128c350bfe7955b426484e226/pyarrow-20.0.0-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:991f85b48a8a5e839b2128590ce07611fae48a904cae6cab1f089c5955b57eb5", size = 40677027 },
{ url = "https://files.pythonhosted.org/packages/be/27/43a47fa0ff9053ab5203bb3faeec435d43c0d8bfa40179bfd076cdbd4e1c/pyarrow-20.0.0-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:97c8dc984ed09cb07d618d57d8d4b67a5100a30c3818c2fb0b04599f0da2de7b", size = 42281473 },
{ url = "https://files.pythonhosted.org/packages/bc/0b/d56c63b078876da81bbb9ba695a596eabee9b085555ed12bf6eb3b7cab0e/pyarrow-20.0.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:9b71daf534f4745818f96c214dbc1e6124d7daf059167330b610fc69b6f3d3e3", size = 42893897 },
{ url = "https://files.pythonhosted.org/packages/92/ac/7d4bd020ba9145f354012838692d48300c1b8fe5634bfda886abcada67ed/pyarrow-20.0.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:e8b88758f9303fa5a83d6c90e176714b2fd3852e776fc2d7e42a22dd6c2fb368", size = 44543847 },
{ url = "https://files.pythonhosted.org/packages/9d/07/290f4abf9ca702c5df7b47739c1b2c83588641ddfa2cc75e34a301d42e55/pyarrow-20.0.0-cp313-cp313-win_amd64.whl", hash = "sha256:30b3051b7975801c1e1d387e17c588d8ab05ced9b1e14eec57915f79869b5031", size = 25653219 },
{ url = "https://files.pythonhosted.org/packages/95/df/720bb17704b10bd69dde086e1400b8eefb8f58df3f8ac9cff6c425bf57f1/pyarrow-20.0.0-cp313-cp313t-macosx_12_0_arm64.whl", hash = "sha256:ca151afa4f9b7bc45bcc791eb9a89e90a9eb2772767d0b1e5389609c7d03db63", size = 30853957 },
{ url = "https://files.pythonhosted.org/packages/d9/72/0d5f875efc31baef742ba55a00a25213a19ea64d7176e0fe001c5d8b6e9a/pyarrow-20.0.0-cp313-cp313t-macosx_12_0_x86_64.whl", hash = "sha256:4680f01ecd86e0dd63e39eb5cd59ef9ff24a9d166db328679e36c108dc993d4c", size = 32247972 },
{ url = "https://files.pythonhosted.org/packages/d5/bc/e48b4fa544d2eea72f7844180eb77f83f2030b84c8dad860f199f94307ed/pyarrow-20.0.0-cp313-cp313t-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:7f4c8534e2ff059765647aa69b75d6543f9fef59e2cd4c6d18015192565d2b70", size = 41256434 },
{ url = "https://files.pythonhosted.org/packages/c3/01/974043a29874aa2cf4f87fb07fd108828fc7362300265a2a64a94965e35b/pyarrow-20.0.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:3e1f8a47f4b4ae4c69c4d702cfbdfe4d41e18e5c7ef6f1bb1c50918c1e81c57b", size = 42353648 },
{ url = "https://files.pythonhosted.org/packages/68/95/cc0d3634cde9ca69b0e51cbe830d8915ea32dda2157560dda27ff3b3337b/pyarrow-20.0.0-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:a1f60dc14658efaa927f8214734f6a01a806d7690be4b3232ba526836d216122", size = 40619853 },
{ url = "https://files.pythonhosted.org/packages/29/c2/3ad40e07e96a3e74e7ed7cc8285aadfa84eb848a798c98ec0ad009eb6bcc/pyarrow-20.0.0-cp313-cp313t-manylinux_2_28_x86_64.whl", hash = "sha256:204a846dca751428991346976b914d6d2a82ae5b8316a6ed99789ebf976551e6", size = 42241743 },
{ url = "https://files.pythonhosted.org/packages/eb/cb/65fa110b483339add6a9bc7b6373614166b14e20375d4daa73483755f830/pyarrow-20.0.0-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:f3b117b922af5e4c6b9a9115825726cac7d8b1421c37c2b5e24fbacc8930612c", size = 42839441 },
{ url = "https://files.pythonhosted.org/packages/98/7b/f30b1954589243207d7a0fbc9997401044bf9a033eec78f6cb50da3f304a/pyarrow-20.0.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:e724a3fd23ae5b9c010e7be857f4405ed5e679db5c93e66204db1a69f733936a", size = 44503279 },
{ url = "https://files.pythonhosted.org/packages/37/40/ad395740cd641869a13bcf60851296c89624662575621968dcfafabaa7f6/pyarrow-20.0.0-cp313-cp313t-win_amd64.whl", hash = "sha256:82f1ee5133bd8f49d31be1299dc07f585136679666b502540db854968576faf9", size = 25944982 },
]
[[package]]
name = "pyasn1"
version = "0.6.1"