
Conversation History Management

Overview

The system now automatically manages conversation history to prevent exceeding LLM context length limits. This ensures reliable operation for long-running conversations and prevents API failures due to token limit violations.

Key Features

Automatic Context Management

  • Token-based trimming: Uses LangChain's trim_messages utility for intelligent conversation truncation (see the sketch after this list)
  • Configurable limits: Defaults to 85% of max_context_length for conversation history (15% reserved for responses)
  • Smart preservation: Always preserves system messages and maintains conversation validity
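
As a rough sketch of how such token-based trimming can be wired up with LangChain's trim_messages (the character-based counter, example messages, and the 85% budget below are illustrative assumptions, not the project's exact implementation):

from langchain_core.messages import (
    AIMessage, BaseMessage, HumanMessage, SystemMessage, trim_messages
)

def approx_token_counter(messages: list[BaseMessage]) -> int:
    # Rough heuristic: ~4 characters per token across all message contents.
    return sum(len(str(m.content)) for m in messages) // 4

MAX_CONTEXT_LENGTH = 96000
HISTORY_BUDGET = int(MAX_CONTEXT_LENGTH * 0.85)  # 85% for history, 15% reserved for the response

messages = [
    SystemMessage(content="You are a helpful RAG assistant."),
    HumanMessage(content="What does the warranty cover?"),
    AIMessage(content="The warranty covers manufacturing defects for two years."),
    HumanMessage(content="Does it cover accidental damage?"),
]

trimmed = trim_messages(
    messages,
    max_tokens=HISTORY_BUDGET,
    strategy="last",                  # keep the most recent messages
    token_counter=approx_token_counter,
    include_system=True,              # never drop the system message
    start_on="human",                 # trimmed history must start with a human message
    end_on=("human", "tool"),         # ...and end with a human or tool message
)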

Conversation Quality

  • Valid flow: Ensures conversations start with human messages and end with human/tool messages
  • Recent priority: Keeps the most recent messages when trimming is needed
  • Graceful fallback: Falls back to message count-based trimming if token counting fails

Configuration

Default Settings

llm:
  rag:
    max_context_length: 96000    # Maximum context length for conversation history
    # max_output_tokens:         # Optional: Limit LLM output tokens (default: no limit)
    # Conversation history will use 85% = 81,600 tokens
    # Response generation reserves 15% = 14,400 tokens

Custom Configuration

You can override the context length and optionally set output token limits:

from service.graph.message_trimmer import create_conversation_trimmer

# Use custom context length
trimmer = create_conversation_trimmer(max_context_length=128000)

Configuration examples:

# No output limit (default)
llm:
  rag:
    max_context_length: 96000

# With output limit
llm:
  rag:
    max_context_length: 96000
    max_output_tokens: 4000      # Limit LLM response to 4000 tokens

How It Works

1. Token Monitoring

The system continuously monitors conversation length using approximate token counting.
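
For illustration, the check might look like the following (the counting heuristic and function names are assumptions; the real implementation may differ):

from langchain_core.messages import BaseMessage

def estimate_tokens(messages: list[BaseMessage]) -> int:
    # Cheap estimate: ~4 characters per token, plus a small per-message overhead
    # for role markers and formatting.
    return sum(len(str(m.content)) // 4 + 4 for m in messages)

def should_trim(messages: list[BaseMessage], history_budget: int) -> bool:
    # Trim once the estimated size of the history exceeds its token budget.
    return estimate_tokens(messages) > history_budget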

2. Trimming Logic

When the conversation approaches the token limit, the trimmer (sketched after this list):

  • Preserves the system message (contains important instructions)
  • Keeps the most recent conversation turns
  • Removes older messages to stay within limits
  • Maintains conversation validity (proper message sequence)
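
A hand-rolled sketch of these steps, independent of trim_messages and purely illustrative, could look like:

from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage

def trim_to_budget(messages: list[BaseMessage], budget: int) -> list[BaseMessage]:
    def count(ms: list[BaseMessage]) -> int:
        return sum(len(str(m.content)) for m in ms) // 4

    # Preserve the system message (contains important instructions).
    system = [m for m in messages if isinstance(m, SystemMessage)][:1]
    rest = [m for m in messages if not isinstance(m, SystemMessage)]

    # Keep the most recent messages that still fit within the budget.
    kept: list[BaseMessage] = []
    for message in reversed(rest):
        if count(system + kept + [message]) > budget:
            break
        kept.insert(0, message)

    # Maintain a valid sequence: drop leading messages until a human turn starts the history.
    while kept and not isinstance(kept[0], HumanMessage):
        kept.pop(0)

    return system + kept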

3. Fallback Strategy

If token counting fails, the trimmer (see the sketch after this list):

  • Falls back to message count-based trimming
  • Keeps last 20 messages by default
  • Still preserves system messages
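
A count-based fallback along these lines (illustrative only, with an assumed helper name) is straightforward:

from langchain_core.messages import BaseMessage, SystemMessage

def fallback_trim(messages: list[BaseMessage], keep_last: int = 20) -> list[BaseMessage]:
    # Keep the system message plus the last `keep_last` non-system messages.
    system = [m for m in messages if isinstance(m, SystemMessage)][:1]
    rest = [m for m in messages if not isinstance(m, SystemMessage)]
    return system + rest[-keep_last:]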

Implementation Details

Core Components

ConversationTrimmer Class

from typing import List

from langchain_core.messages import BaseMessage

class ConversationTrimmer:
    def __init__(self, max_context_length: int = 96000, preserve_system: bool = True): ...

    def should_trim(self, messages: List[BaseMessage]) -> bool: ...

    def trim_conversation_history(self, messages: List[BaseMessage]) -> List[BaseMessage]: ...

Integration Point

The trimming is automatically applied in the call_model function:

# Create conversation trimmer for managing context length
trimmer = create_conversation_trimmer()

# Trim conversation history to manage context length
if trimmer.should_trim(messages):
    messages = trimmer.trim_conversation_history(messages)
    logger.info("Applied conversation history trimming for context management")

Token Allocation Strategy

Component               Token Allocation        Purpose
Conversation History    85% (81,600 tokens)     Maintains context
Response Generation     15% (14,400 tokens)     LLM output space
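
With the default max_context_length of 96,000 tokens, the split works out as follows (a minimal arithmetic sketch; the exact computation in the code is assumed):

MAX_CONTEXT_LENGTH = 96_000

history_budget = int(MAX_CONTEXT_LENGTH * 0.85)         # 81,600 tokens for conversation history
response_budget = MAX_CONTEXT_LENGTH - history_budget   # 14,400 tokens reserved for LLM output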

Benefits

Reliability

  • No more context overflow: Prevents API failures due to token limits
  • Consistent performance: Maintains response quality regardless of conversation length
  • Graceful degradation: Intelligent trimming preserves conversation flow

User Experience

  • Seamless operation: Trimming happens transparently
  • Context preservation: Important system instructions always maintained
  • Recent focus: Most relevant (recent) conversation content preserved

Scalability

  • Long conversations: Supports indefinitely long conversations
  • Memory efficiency: Prevents unbounded memory growth
  • Performance: Minimal overhead for short conversations

Monitoring

Logging

The system logs when trimming occurs:

INFO: Trimmed conversation history: 15 -> 8 messages
INFO: Applied conversation history trimming for context management

Metrics

  • Original message count vs. trimmed count
  • Token count estimation
  • Fallback usage frequency

Best Practices

For Administrators

  1. Monitor logs: Watch for frequent trimming (may indicate need for higher limits)
  2. Tune limits: Adjust max_context_length based on your LLM provider's limits
  3. Test with long conversations: Verify trimming behavior with realistic scenarios

For Developers

  1. System prompt optimization: Keep system prompts concise to maximize conversation space
  2. Tool response size: Consider tool response sizes in token calculations
  3. Custom trimming: Implement domain-specific trimming logic if needed

Troubleshooting

Common Issues

"Trimming too aggressive"

  • Increase max_context_length in configuration
  • Check if system prompt is too long
  • Verify tool responses aren't excessively large

"Still getting context errors"

  • Check if token counting is accurate for your model
  • Verify trimming is actually being applied (check logs)
  • Consider implementing custom token counting for specific models (see the sketch below)
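
For OpenAI-family models, one way to get more accurate counts is to plug tiktoken in as the token counter; a hedged sketch (the encoding name and per-message overhead are assumptions, not part of the project):

import tiktoken
from langchain_core.messages import BaseMessage

def tiktoken_counter(messages: list[BaseMessage], encoding_name: str = "cl100k_base") -> int:
    # Encode each message's content and add a small allowance for role/formatting tokens.
    encoding = tiktoken.get_encoding(encoding_name)
    return sum(len(encoding.encode(str(m.content))) + 4 for m in messages)

A counter like this can be passed as the token_counter argument to LangChain's trim_messages in place of the approximate heuristic.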

"Important context lost"

  • Review trimming strategy (currently keeps recent messages)
  • Consider implementing conversation summarization for older content
  • Adjust token allocation percentages

Future Enhancements

Planned Features

  1. Conversation summarization: Summarize older parts instead of discarding
  2. Smart context selection: Preserve important messages based on content
  3. Model-specific optimization: Tailored trimming for different LLM providers
  4. Adaptive limits: Dynamic token allocation based on conversation patterns

Configuration Extensions

  1. Per-session limits: Different limits for different conversation types
  2. Priority tagging: Mark important messages for preservation
  3. Custom strategies: Pluggable trimming algorithms