Conversation History Management
Overview
The system now automatically manages conversation history to prevent exceeding LLM context length limits. This ensures reliable operation for long-running conversations and prevents API failures due to token limit violations.
Key Features
Automatic Context Management
- Token-based trimming: Uses LangChain's trim_messages utility for intelligent conversation truncation (see the sketch after this list)
- Configurable limits: Defaults to 85% of max_context_length for conversation history (15% reserved for responses)
- Smart preservation: Always preserves system messages and maintains conversation validity
Conversation Quality
- Valid flow: Ensures conversations start with human messages and end with human/tool messages
- Recent priority: Keeps the most recent messages when trimming is needed
- Graceful fallback: Falls back to message count-based trimming if token counting fails
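As a rough illustration of the token-based trimming described above, the sketch below shows how LangChain's trim_messages can be configured to honor these guarantees. The approx_token_counter helper and the sample messages are illustrative assumptions, not code from this repository:

```python
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage, trim_messages

def approx_token_counter(messages) -> int:
    # Rough estimate: ~4 characters per token across all message contents.
    return sum(len(str(m.content)) for m in messages) // 4

history = [
    SystemMessage("You are a helpful assistant."),
    HumanMessage("First question ..."),
    AIMessage("First answer ..."),
    HumanMessage("Follow-up question ..."),
]

trimmed = trim_messages(
    history,
    max_tokens=81_600,                  # 85% of a 96,000-token context window
    token_counter=approx_token_counter,
    strategy="last",                    # keep the most recent turns
    include_system=True,                # always preserve the system message
    start_on="human",                   # trimmed history starts with a human message
    end_on=("human", "tool"),           # ... and ends with a human or tool message
)
```

The start_on and end_on arguments are what keep the trimmed history valid for providers that reject conversations beginning or ending on an assistant message.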
Configuration
Default Settings
```yaml
llm:
  rag:
    max_context_length: 96000   # Maximum context length in tokens
    # max_output_tokens:        # Optional: Limit LLM output tokens (default: no limit)
    # Conversation history will use 85% = 81,600 tokens
    # Response generation reserves 15% = 14,400 tokens
```
Custom Configuration
You can override the context length and optionally set output token limits:
```python
from service.graph.message_trimmer import create_conversation_trimmer

# Use custom context length
trimmer = create_conversation_trimmer(max_context_length=128000)
```
Configuration examples:
```yaml
# No output limit (default)
llm:
  rag:
    max_context_length: 96000
```

```yaml
# With output limit
llm:
  rag:
    max_context_length: 96000
    max_output_tokens: 4000   # Limit LLM response to 4000 tokens
```
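The snippet below is a hypothetical piece of glue code showing how these settings might be read and handed to the trimmer; the config path, PyYAML dependency, and key lookups are assumptions, not part of the documented API:

```python
import yaml  # PyYAML

from service.graph.message_trimmer import create_conversation_trimmer

# Load the RAG settings and build a trimmer from them (path and keys are illustrative).
with open("config.yaml") as f:
    config = yaml.safe_load(f)

rag_cfg = config["llm"]["rag"]
trimmer = create_conversation_trimmer(
    max_context_length=rag_cfg.get("max_context_length", 96000),
)
```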
How It Works
1. Token Monitoring
The system continuously monitors conversation length using approximate token counting.
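A minimal sketch of what such a check might look like, assuming the common ~4 characters-per-token approximation and the 85% history budget; the function name is illustrative:

```python
def over_budget(messages, max_context_length: int = 96000) -> bool:
    # Approximate the token count (~4 characters per token) and compare it
    # against the 85% share of the context window reserved for history.
    approx_tokens = sum(len(str(m.content)) for m in messages) // 4
    history_budget = int(max_context_length * 0.85)
    return approx_tokens > history_budget
```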
2. Trimming Logic
When the conversation approaches the token limit:
- Preserves the system message (contains important instructions)
- Keeps the most recent conversation turns
- Removes older messages to stay within limits
- Maintains conversation validity (proper message sequence)
3. Fallback Strategy
If token counting fails:
- Falls back to message count-based trimming
- Keeps last 20 messages by default
- Still preserves system messages
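The fallback could look roughly like the sketch below, which keeps the system message plus the last 20 messages; this illustrates the described behavior and is not the shipped implementation:

```python
from langchain_core.messages import BaseMessage, SystemMessage

def fallback_trim(messages: list[BaseMessage], keep_last: int = 20) -> list[BaseMessage]:
    # Message-count fallback: preserve the first system message and the
    # most recent `keep_last` non-system messages.
    system = [m for m in messages if isinstance(m, SystemMessage)][:1]
    rest = [m for m in messages if not isinstance(m, SystemMessage)]
    return system + rest[-keep_last:]
```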
Implementation Details
Core Components
ConversationTrimmer Class
```python
class ConversationTrimmer:
    def __init__(self, max_context_length: int = 96000, preserve_system: bool = True): ...
    def should_trim(self, messages) -> bool: ...
    def trim_conversation_history(self, messages) -> List[BaseMessage]: ...
```
Integration Point
The trimming is automatically applied in the call_model function:
```python
# Create conversation trimmer for managing context length
trimmer = create_conversation_trimmer()

# Trim conversation history to manage context length
if trimmer.should_trim(messages):
    messages = trimmer.trim_conversation_history(messages)
    logger.info("Applied conversation history trimming for context management")
```
Token Allocation Strategy
| Component | Token Allocation | Purpose |
|---|---|---|
| Conversation History | 85% (81,600 tokens) | Maintains context |
| Response Generation | 15% (14,400 tokens) | LLM output space |
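For the default 96,000-token window, the split works out as a quick arithmetic check:

```python
max_context_length = 96_000
history_budget = int(max_context_length * 0.85)          # 81,600 tokens for history
response_budget = max_context_length - history_budget    # 14,400 tokens for responses
assert (history_budget, response_budget) == (81_600, 14_400)
```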
Benefits
Reliability
- No more context overflow: Prevents API failures due to token limits
- Consistent performance: Maintains response quality regardless of conversation length
- Graceful degradation: Intelligent trimming preserves conversation flow
User Experience
- Seamless operation: Trimming happens transparently
- Context preservation: Important system instructions always maintained
- Recent focus: Most relevant (recent) conversation content preserved
Scalability
- Long conversations: Supports indefinitely long conversations
- Memory efficiency: Prevents unbounded memory growth
- Performance: Minimal overhead for short conversations
Monitoring
Logging
The system logs when trimming occurs:
```
INFO: Trimmed conversation history: 15 -> 8 messages
INFO: Applied conversation history trimming for context management
```
Metrics
- Original message count vs. trimmed count
- Token count estimation
- Fallback usage frequency
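A hypothetical way to collect these metrics in-process; the counter names and the record_trim hook are assumptions for illustration:

```python
from collections import Counter

trim_metrics = Counter()

def record_trim(original_count: int, trimmed_count: int, used_fallback: bool) -> None:
    # Track how often trimming happens, how much is dropped, and fallback usage.
    trim_metrics["trim_events"] += 1
    trim_metrics["messages_dropped"] += original_count - trimmed_count
    if used_fallback:
        trim_metrics["fallback_events"] += 1
```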
Best Practices
For Administrators
- Monitor logs: Watch for frequent trimming (may indicate need for higher limits)
- Tune limits: Adjust max_context_length based on your LLM provider's limits
- Test with long conversations: Verify trimming behavior with realistic scenarios
For Developers
- System prompt optimization: Keep system prompts concise to maximize conversation space
- Tool response size: Consider tool response sizes in token calculations
- Custom trimming: Implement domain-specific trimming logic if needed
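As an example of the kind of domain-specific strategy this refers to, the sketch below pins the system message and the user's first question and otherwise keeps only recent turns; it is a hypothetical illustration, not part of ConversationTrimmer:

```python
from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage

def trim_keeping_first_question(messages: list[BaseMessage], keep_last: int = 10) -> list[BaseMessage]:
    # Pin the system message and the user's first question, then append the
    # most recent messages. A real strategy should also re-check that the
    # result starts on a human message and ends on a human/tool message.
    system = next((m for m in messages if isinstance(m, SystemMessage)), None)
    first_human = next((m for m in messages if isinstance(m, HumanMessage)), None)
    pinned = [m for m in (system, first_human) if m is not None]
    pinned_ids = {id(m) for m in pinned}
    rest = [m for m in messages if id(m) not in pinned_ids]
    return pinned + rest[-keep_last:]
```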
Troubleshooting
Common Issues
"Trimming too aggressive"
- Increase max_context_length in the configuration
- Check if the system prompt is too long
- Verify tool responses aren't excessively large
"Still getting context errors"
- Check if token counting is accurate for your model
- Verify trimming is actually being applied (check logs)
- Consider implementing custom token counting for specific models
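For OpenAI-style models, a tokenizer-backed counter such as the one below can replace the approximation; tiktoken and the cl100k_base encoding are assumptions that should be matched to your model:

```python
import tiktoken

_enc = tiktoken.get_encoding("cl100k_base")

def exact_token_counter(messages) -> int:
    # Count tokens with a model-specific tokenizer instead of a rough estimate.
    # A callable like this can be passed as `token_counter` to LangChain's trim_messages.
    return sum(len(_enc.encode(str(m.content))) for m in messages)
```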
"Important context lost"
- Review trimming strategy (currently keeps recent messages)
- Consider implementing conversation summarization for older content
- Adjust token allocation percentages
Future Enhancements
Planned Features
- Conversation summarization: Summarize older parts instead of discarding
- Smart context selection: Preserve important messages based on content
- Model-specific optimization: Tailored trimming for different LLM providers
- Adaptive limits: Dynamic token allocation based on conversation patterns
Configuration Extensions
- Per-session limits: Different limits for different conversation types
- Priority tagging: Mark important messages for preservation
- Custom strategies: Pluggable trimming algorithms