# Conversation History Management

## Overview

The system now automatically manages conversation history to prevent exceeding LLM context length limits. This ensures reliable operation for long-running conversations and prevents API failures due to token limit violations.

## Key Features

### Automatic Context Management

- **Token-based trimming**: Uses LangChain's `trim_messages` utility for intelligent conversation truncation (see the sketch below)
- **Configurable limits**: Defaults to 85% of `max_context_length` for conversation history (15% reserved for responses)
- **Smart preservation**: Always preserves system messages and maintains conversation validity

### Conversation Quality

- **Valid flow**: Ensures conversations start with human messages and end with human/tool messages
- **Recent priority**: Keeps the most recent messages when trimming is needed
- **Graceful fallback**: Falls back to message count-based trimming if token counting fails

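As a rough illustration of how these rules can be expressed with LangChain's `trim_messages`, here is a minimal sketch; the sample messages, the 96,000-token window, and the character-based token counter are illustrative assumptions, not the exact implementation used by the service.

```python
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage, trim_messages

def approx_token_counter(messages) -> int:
    # Rough heuristic: roughly 4 characters per token.
    return sum(len(str(m.content)) // 4 for m in messages)

history = [
    SystemMessage("You are a helpful assistant."),
    HumanMessage("First question..."),
    AIMessage("First answer..."),
    HumanMessage("Follow-up question..."),
]

trimmed = trim_messages(
    history,
    max_tokens=81_600,                  # 85% of a 96,000-token context window
    strategy="last",                    # keep the most recent messages
    token_counter=approx_token_counter,
    include_system=True,                # always preserve the system message
    start_on="human",                   # trimmed history must start with a human message
)
```
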
## Configuration

### Default Settings

```yaml
llm:
  rag:
    max_context_length: 96000  # Maximum context length for conversation history
    # max_output_tokens:       # Optional: Limit LLM output tokens (default: no limit)
    # Conversation history will use 85% = 81,600 tokens
    # Response generation reserves 15% = 14,400 tokens
```

### Custom Configuration

You can override the context length and optionally set output token limits:

```python
from service.graph.message_trimmer import create_conversation_trimmer

# Use custom context length
trimmer = create_conversation_trimmer(max_context_length=128000)
```

Configuration examples:

```yaml
# No output limit (default)
llm:
  rag:
    max_context_length: 96000

# With output limit
llm:
  rag:
    max_context_length: 96000
    max_output_tokens: 4000  # Limit LLM response to 4000 tokens
```

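To connect these YAML settings to the trimmer, something along the following lines could be used; the `config.yaml` path and the loading code are illustrative assumptions, and only `create_conversation_trimmer(max_context_length=...)` comes from the module shown above.

```python
import yaml

from service.graph.message_trimmer import create_conversation_trimmer

# Hypothetical loading step: read the limits from a YAML file like the one above.
with open("config.yaml") as f:
    config = yaml.safe_load(f)

rag_settings = config["llm"]["rag"]
trimmer = create_conversation_trimmer(
    max_context_length=rag_settings["max_context_length"],
)
```
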
## How It Works

### 1. Token Monitoring

The system continuously monitors conversation length using approximate token counting.

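One plausible shape of this check, assuming the approximation works at roughly four characters per token and that 85% of `max_context_length` is budgeted for history (the helper names are illustrative):

```python
def estimate_tokens(messages) -> int:
    # Approximate token count: roughly 4 characters per token.
    return sum(len(str(m.content)) // 4 for m in messages)

def should_trim(messages, max_context_length: int = 96_000) -> bool:
    # Trim once the history exceeds the 85% share reserved for conversation context.
    history_budget = int(max_context_length * 0.85)  # 81,600 tokens for a 96,000 window
    return estimate_tokens(messages) > history_budget
```
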
### 2. Trimming Logic

When the conversation approaches the token limit:

- Preserves the system message (contains important instructions)
- Keeps the most recent conversation turns
- Removes older messages to stay within limits
- Maintains conversation validity (proper message sequence)

### 3. Fallback Strategy

If token counting fails:

- Falls back to message count-based trimming (see the sketch below)
- Keeps the last 20 messages by default
- Still preserves system messages

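A minimal sketch of what this count-based fallback could look like (the function name is illustrative; the real `ConversationTrimmer` may differ in detail):

```python
from langchain_core.messages import BaseMessage, SystemMessage

def fallback_trim(messages: list[BaseMessage], keep_last: int = 20) -> list[BaseMessage]:
    # Keep system messages plus the most recent `keep_last` non-system messages.
    system_messages = [m for m in messages if isinstance(m, SystemMessage)]
    recent = [m for m in messages if not isinstance(m, SystemMessage)][-keep_last:]
    return system_messages + recent
```
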
## Implementation Details

### Core Components

#### ConversationTrimmer Class

```python
class ConversationTrimmer:
    def __init__(self, max_context_length: int = 96000, preserve_system: bool = True): ...

    def should_trim(self, messages) -> bool: ...
    def trim_conversation_history(self, messages) -> List[BaseMessage]: ...
```

#### Integration Point

The trimming is automatically applied in the `call_model` function:

```python
# Create conversation trimmer for managing context length
trimmer = create_conversation_trimmer()

# Trim conversation history to manage context length
if trimmer.should_trim(messages):
    messages = trimmer.trim_conversation_history(messages)
    logger.info("Applied conversation history trimming for context management")
```

### Token Allocation Strategy

| Component | Token Allocation | Purpose |
|-----------|------------------|---------|
| Conversation History | 85% (81,600 tokens) | Maintains context |
| Response Generation | 15% (14,400 tokens) | LLM output space |

## Benefits

### Reliability

- **No more context overflow**: Prevents API failures due to token limits
- **Consistent performance**: Maintains response quality regardless of conversation length
- **Graceful degradation**: Intelligent trimming preserves conversation flow

### User Experience

- **Seamless operation**: Trimming happens transparently
- **Context preservation**: Important system instructions always maintained
- **Recent focus**: Most relevant (recent) conversation content preserved

### Scalability

- **Long conversations**: Supports indefinitely long conversations
- **Memory efficiency**: Prevents unbounded memory growth
- **Performance**: Minimal overhead for short conversations

## Monitoring

### Logging

The system logs when trimming occurs:

```
INFO: Trimmed conversation history: 15 -> 8 messages
INFO: Applied conversation history trimming for context management
```

### Metrics

- Original message count vs. trimmed count
- Token count estimation
- Fallback usage frequency

## Best Practices

### For Administrators

1. **Monitor logs**: Watch for frequent trimming (may indicate the need for higher limits)
2. **Tune limits**: Adjust `max_context_length` based on your LLM provider's limits
3. **Test with long conversations**: Verify trimming behavior with realistic scenarios

### For Developers

1. **System prompt optimization**: Keep system prompts concise to maximize conversation space
2. **Tool response size**: Consider tool response sizes in token calculations
3. **Custom trimming**: Implement domain-specific trimming logic if needed (see the sketch below)

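For example, a domain-specific strategy might drop oversized tool outputs before touching the rest of the history. This is purely an illustrative sketch, not part of the shipped `ConversationTrimmer`:

```python
from langchain_core.messages import BaseMessage, ToolMessage

def drop_large_tool_outputs(messages: list[BaseMessage], max_chars: int = 8_000) -> list[BaseMessage]:
    # Replace oversized tool responses with a short placeholder so they stop
    # dominating the token budget, while keeping the conversation structure intact.
    trimmed: list[BaseMessage] = []
    for m in messages:
        if isinstance(m, ToolMessage) and len(str(m.content)) > max_chars:
            m = ToolMessage(content="[tool output truncated]", tool_call_id=m.tool_call_id)
        trimmed.append(m)
    return trimmed
```
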
## Troubleshooting

### Common Issues

#### "Trimming too aggressive"

- Increase `max_context_length` in the configuration
- Check if the system prompt is too long
- Verify tool responses aren't excessively large

#### "Still getting context errors"

- Check if token counting is accurate for your model
- Verify trimming is actually being applied (check the logs)
- Consider implementing custom token counting for specific models (see the sketch below)

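If the built-in approximation is too far off for your model, a model-specific counter can be plugged in. The sketch below uses `tiktoken` with the `cl100k_base` encoding as an example; the encoding choice and the counter's wiring into the trimmer are assumptions.

```python
import tiktoken

def model_token_counter(messages) -> int:
    # Count tokens with a real tokenizer (here: tiktoken's cl100k_base encoding)
    # instead of the character-based approximation.
    encoding = tiktoken.get_encoding("cl100k_base")
    return sum(len(encoding.encode(str(m.content))) for m in messages)
```
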
#### "Important context lost"

- Review trimming strategy (currently keeps recent messages)
- Consider implementing conversation summarization for older content
- Adjust token allocation percentages

## Future Enhancements

### Planned Features

1. **Conversation summarization**: Summarize older parts instead of discarding
2. **Smart context selection**: Preserve important messages based on content
3. **Model-specific optimization**: Tailored trimming for different LLM providers
4. **Adaptive limits**: Dynamic token allocation based on conversation patterns

### Configuration Extensions

1. **Per-session limits**: Different limits for different conversation types
2. **Priority tagging**: Mark important messages for preservation
3. **Custom strategies**: Pluggable trimming algorithms