# Conversation History Management

## Overview

The system now automatically manages conversation history to prevent exceeding LLM context length limits. This ensures reliable operation for long-running conversations and prevents API failures due to token limit violations.

## Key Features

### Automatic Context Management

- **Token-based trimming**: Uses LangChain's `trim_messages` utility for intelligent conversation truncation (see the sketch below)
- **Configurable limits**: Defaults to 85% of `max_context_length` for conversation history (15% reserved for responses)
- **Smart preservation**: Always preserves system messages and maintains conversation validity

### Conversation Quality

- **Valid flow**: Ensures conversations start with human messages and end with human/tool messages
- **Recent priority**: Keeps the most recent messages when trimming is needed
- **Graceful fallback**: Falls back to message count-based trimming if token counting fails

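As a rough illustration of how these rules can be expressed with LangChain's `trim_messages`, here is a minimal sketch; the sample messages, the 96,000-token window, and the character-based token counter are illustrative assumptions, not the exact implementation used by the service.

```python
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage, trim_messages

def approx_token_counter(messages) -> int:
    # Rough heuristic: roughly 4 characters per token.
    return sum(len(str(m.content)) // 4 for m in messages)

history = [
    SystemMessage("You are a helpful assistant."),
    HumanMessage("First question..."),
    AIMessage("First answer..."),
    HumanMessage("Follow-up question..."),
]

trimmed = trim_messages(
    history,
    max_tokens=81_600,                  # 85% of a 96,000-token context window
    strategy="last",                    # keep the most recent messages
    token_counter=approx_token_counter,
    include_system=True,                # always preserve the system message
    start_on="human",                   # trimmed history must start with a human message
)
```
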
## Configuration

### Default Settings

```yaml
llm:
  rag:
    max_context_length: 96000  # Maximum context length for conversation history
    # max_output_tokens:       # Optional: Limit LLM output tokens (default: no limit)
    # Conversation history will use 85% = 81,600 tokens
    # Response generation reserves 15% = 14,400 tokens
```

### Custom Configuration

You can override the context length and optionally set output token limits:

```python
from service.graph.message_trimmer import create_conversation_trimmer

# Use custom context length
trimmer = create_conversation_trimmer(max_context_length=128000)
```

Configuration examples:

```yaml
# No output limit (default)
llm:
  rag:
    max_context_length: 96000

# With output limit
llm:
  rag:
    max_context_length: 96000
    max_output_tokens: 4000  # Limit LLM response to 4000 tokens
```

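To connect these YAML settings to the trimmer, something along the following lines could be used; the `config.yaml` path and the loading code are illustrative assumptions, and only `create_conversation_trimmer(max_context_length=...)` comes from the module shown above.

```python
import yaml

from service.graph.message_trimmer import create_conversation_trimmer

# Hypothetical loading step: read the limits from a YAML file like the one above.
with open("config.yaml") as f:
    config = yaml.safe_load(f)

rag_settings = config["llm"]["rag"]
trimmer = create_conversation_trimmer(
    max_context_length=rag_settings["max_context_length"],
)
```
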
## How It Works

### 1. Token Monitoring

The system continuously monitors conversation length using approximate token counting.

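One plausible shape of this check, assuming the approximation works at roughly four characters per token and that 85% of `max_context_length` is budgeted for history (the helper names are illustrative):

```python
def estimate_tokens(messages) -> int:
    # Approximate token count: roughly 4 characters per token.
    return sum(len(str(m.content)) // 4 for m in messages)

def should_trim(messages, max_context_length: int = 96_000) -> bool:
    # Trim once the history exceeds the 85% share reserved for conversation context.
    history_budget = int(max_context_length * 0.85)  # 81,600 tokens for a 96,000 window
    return estimate_tokens(messages) > history_budget
```
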
### 2. Trimming Logic

When the conversation approaches the token limit:

- Preserves the system message (contains important instructions)
- Keeps the most recent conversation turns
- Removes older messages to stay within limits
- Maintains conversation validity (proper message sequence)

### 3. Fallback Strategy

If token counting fails:

- Falls back to message count-based trimming (see the sketch below)
- Keeps the last 20 messages by default
- Still preserves system messages

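A minimal sketch of what this count-based fallback could look like (the function name is illustrative; the real `ConversationTrimmer` may differ in detail):

```python
from langchain_core.messages import BaseMessage, SystemMessage

def fallback_trim(messages: list[BaseMessage], keep_last: int = 20) -> list[BaseMessage]:
    # Keep system messages plus the most recent `keep_last` non-system messages.
    system_messages = [m for m in messages if isinstance(m, SystemMessage)]
    recent = [m for m in messages if not isinstance(m, SystemMessage)][-keep_last:]
    return system_messages + recent
```
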
## Implementation Details

### Core Components

#### ConversationTrimmer Class

```python
class ConversationTrimmer:
    def __init__(self, max_context_length: int = 96000, preserve_system: bool = True): ...

    def should_trim(self, messages) -> bool: ...
    def trim_conversation_history(self, messages) -> List[BaseMessage]: ...
```

#### Integration Point

The trimming is automatically applied in the `call_model` function:

```python
# Create conversation trimmer for managing context length
trimmer = create_conversation_trimmer()

# Trim conversation history to manage context length
if trimmer.should_trim(messages):
    messages = trimmer.trim_conversation_history(messages)
    logger.info("Applied conversation history trimming for context management")
```

### Token Allocation Strategy

| Component | Token Allocation | Purpose |
|-----------|------------------|---------|
| Conversation History | 85% (81,600 tokens) | Maintains context |
| Response Generation | 15% (14,400 tokens) | LLM output space |

## Benefits

### Reliability

- **No more context overflow**: Prevents API failures due to token limits
- **Consistent performance**: Maintains response quality regardless of conversation length
- **Graceful degradation**: Intelligent trimming preserves conversation flow

### User Experience

- **Seamless operation**: Trimming happens transparently
- **Context preservation**: Important system instructions always maintained
- **Recent focus**: Most relevant (recent) conversation content preserved

### Scalability

- **Long conversations**: Supports indefinitely long conversations
- **Memory efficiency**: Prevents unbounded memory growth
- **Performance**: Minimal overhead for short conversations

## Monitoring

### Logging

The system logs when trimming occurs:

```
INFO: Trimmed conversation history: 15 -> 8 messages
INFO: Applied conversation history trimming for context management
```

### Metrics

- Original message count vs. trimmed count
- Token count estimation
- Fallback usage frequency

## Best Practices

### For Administrators

1. **Monitor logs**: Watch for frequent trimming (may indicate the need for higher limits)
2. **Tune limits**: Adjust `max_context_length` based on your LLM provider's limits
3. **Test with long conversations**: Verify trimming behavior with realistic scenarios

### For Developers

1. **System prompt optimization**: Keep system prompts concise to maximize conversation space
2. **Tool response size**: Consider tool response sizes in token calculations
3. **Custom trimming**: Implement domain-specific trimming logic if needed (see the sketch below)

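For example, a domain-specific strategy might drop oversized tool outputs before touching the rest of the history. This is purely an illustrative sketch, not part of the shipped `ConversationTrimmer`:

```python
from langchain_core.messages import BaseMessage, ToolMessage

def drop_large_tool_outputs(messages: list[BaseMessage], max_chars: int = 8_000) -> list[BaseMessage]:
    # Replace oversized tool responses with a short placeholder so they stop
    # dominating the token budget, while keeping the conversation structure intact.
    trimmed: list[BaseMessage] = []
    for m in messages:
        if isinstance(m, ToolMessage) and len(str(m.content)) > max_chars:
            m = ToolMessage(content="[tool output truncated]", tool_call_id=m.tool_call_id)
        trimmed.append(m)
    return trimmed
```
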
## Troubleshooting

### Common Issues

#### "Trimming too aggressive"

- Increase `max_context_length` in the configuration
- Check if the system prompt is too long
- Verify tool responses aren't excessively large

#### "Still getting context errors"

- Check if token counting is accurate for your model
- Verify trimming is actually being applied (check the logs)
- Consider implementing custom token counting for specific models (see the sketch below)

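If the built-in approximation is too far off for your model, a model-specific counter can be plugged in. The sketch below uses `tiktoken` with the `cl100k_base` encoding as an example; the encoding choice and the counter's wiring into the trimmer are assumptions.

```python
import tiktoken

def model_token_counter(messages) -> int:
    # Count tokens with a real tokenizer (here: tiktoken's cl100k_base encoding)
    # instead of the character-based approximation.
    encoding = tiktoken.get_encoding("cl100k_base")
    return sum(len(encoding.encode(str(m.content))) for m in messages)
```
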
#### "Important context lost"

- Review trimming strategy (currently keeps recent messages)
- Consider implementing conversation summarization for older content
- Adjust token allocation percentages

## Future Enhancements

### Planned Features

1. **Conversation summarization**: Summarize older parts instead of discarding
2. **Smart context selection**: Preserve important messages based on content
3. **Model-specific optimization**: Tailored trimming for different LLM providers
4. **Adaptive limits**: Dynamic token allocation based on conversation patterns

### Configuration Extensions

1. **Per-session limits**: Different limits for different conversation types
2. **Priority tagging**: Mark important messages for preservation
3. **Custom strategies**: Pluggable trimming algorithms