# Conversation History Management

## Overview

The system automatically manages conversation history to prevent exceeding LLM context length limits. This ensures reliable operation for long-running conversations and prevents API failures due to token limit violations.

## Key Features

### Automatic Context Management

- **Token-based trimming**: Uses LangChain's `trim_messages` utility for intelligent conversation truncation
- **Configurable limits**: Defaults to 85% of `max_context_length` for conversation history (15% reserved for responses)
- **Smart preservation**: Always preserves system messages and maintains conversation validity

### Conversation Quality

- **Valid flow**: Ensures conversations start with human messages and end with human/tool messages
- **Recent priority**: Keeps the most recent messages when trimming is needed
- **Graceful fallback**: Falls back to message count-based trimming if token counting fails

## Configuration

### Default Settings

```yaml
llm:
  rag:
    max_context_length: 96000  # Maximum context length for conversation history
    # max_output_tokens:       # Optional: limit LLM output tokens (default: no limit)
    # Conversation history will use 85% = 81,600 tokens
    # Response generation reserves 15% = 14,400 tokens
```

### Custom Configuration

You can override the context length and optionally set output token limits:

```python
from service.graph.message_trimmer import create_conversation_trimmer

# Use a custom context length
trimmer = create_conversation_trimmer(max_context_length=128000)
```

Configuration examples:

```yaml
# No output limit (default)
llm:
  rag:
    max_context_length: 96000

# With output limit
llm:
  rag:
    max_context_length: 96000
    max_output_tokens: 4000  # Limit LLM responses to 4000 tokens
```

## How It Works

### 1. Token Monitoring

The system continuously monitors conversation length using approximate token counting.

### 2. Trimming Logic

When the conversation approaches the token limit, the trimmer:

- Preserves the system message (it contains important instructions)
- Keeps the most recent conversation turns
- Removes older messages to stay within limits
- Maintains conversation validity (proper message sequence)

A sketch of this step is shown below.
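The following is a minimal sketch of how this step could be implemented with LangChain's `trim_messages`; the helper names (`approx_token_counter`, `trim_for_context`) and the character-based token heuristic are illustrative assumptions, not the project's actual code.

```python
from typing import List

from langchain_core.messages import BaseMessage, trim_messages


def approx_token_counter(messages: List[BaseMessage]) -> int:
    # Rough heuristic: ~4 characters per token, plus a small per-message overhead.
    return sum(len(str(m.content)) // 4 + 4 for m in messages)


def trim_for_context(messages: List[BaseMessage], history_budget: int) -> List[BaseMessage]:
    # Keep the newest messages that fit in the budget, never drop the system
    # message, and keep the trimmed history starting on a human turn and
    # ending on a human or tool message.
    return trim_messages(
        messages,
        max_tokens=history_budget,
        token_counter=approx_token_counter,
        strategy="last",
        include_system=True,
        start_on="human",
        end_on=("human", "tool"),
        allow_partial=False,
    )
```

In the actual integration, the history budget corresponds to the 85% share of `max_context_length` described above.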
### 3. Fallback Strategy

If token counting fails, the trimmer:

- Falls back to message count-based trimming
- Keeps the last 20 messages by default
- Still preserves system messages

## Implementation Details

### Core Components

#### ConversationTrimmer Class

```python
class ConversationTrimmer:
    def __init__(self, max_context_length: int = 96000, preserve_system: bool = True): ...
    def should_trim(self, messages) -> bool: ...
    def trim_conversation_history(self, messages) -> List[BaseMessage]: ...
```

#### Integration Point

Trimming is automatically applied in the `call_model` function:

```python
# Create a conversation trimmer for managing context length
trimmer = create_conversation_trimmer()

# Trim conversation history to stay within the context budget
if trimmer.should_trim(messages):
    messages = trimmer.trim_conversation_history(messages)
    logger.info("Applied conversation history trimming for context management")
```

### Token Allocation Strategy

With the default `max_context_length` of 96,000 tokens:

| Component | Token Allocation | Purpose |
|-----------|------------------|---------|
| Conversation History | 85% (81,600 tokens) | Maintains context |
| Response Generation | 15% (14,400 tokens) | LLM output space |

## Benefits

### Reliability

- **No more context overflow**: Prevents API failures due to token limits
- **Consistent performance**: Maintains response quality regardless of conversation length
- **Graceful degradation**: Intelligent trimming preserves conversation flow

### User Experience

- **Seamless operation**: Trimming happens transparently
- **Context preservation**: Important system instructions are always maintained
- **Recent focus**: The most relevant (recent) conversation content is preserved

### Scalability

- **Long conversations**: Supports arbitrarily long conversations
- **Memory efficiency**: Prevents unbounded memory growth
- **Performance**: Minimal overhead for short conversations

## Monitoring

### Logging

The system logs when trimming occurs:

```
INFO: Trimmed conversation history: 15 -> 8 messages
INFO: Applied conversation history trimming for context management
```

### Metrics

- Original message count vs. trimmed count
- Token count estimation
- Fallback usage frequency

## Best Practices

### For Administrators

1. **Monitor logs**: Watch for frequent trimming (it may indicate the need for higher limits)
2. **Tune limits**: Adjust `max_context_length` based on your LLM provider's limits
3. **Test with long conversations**: Verify trimming behavior with realistic scenarios

### For Developers

1. **System prompt optimization**: Keep system prompts concise to maximize conversation space
2. **Tool response size**: Consider tool response sizes in token calculations
3. **Custom trimming**: Implement domain-specific trimming logic if needed

## Troubleshooting

### Common Issues

#### "Trimming too aggressive"

- Increase `max_context_length` in the configuration
- Check whether the system prompt is too long
- Verify tool responses aren't excessively large

#### "Still getting context errors"

- Check whether token counting is accurate for your model
- Verify trimming is actually being applied (check the logs)
- Consider implementing custom token counting for specific models (a sketch follows this section)

#### "Important context lost"

- Review the trimming strategy (it currently keeps recent messages)
- Consider implementing conversation summarization for older content
- Adjust the token allocation percentages
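For the custom token counting suggestion above, here is a minimal sketch assuming `tiktoken` is installed; the encoding choice (`cl100k_base`), the function name, and the idea of passing the counter into the trimmer are assumptions, not documented extension points.

```python
from typing import List

import tiktoken
from langchain_core.messages import BaseMessage

# Assumption: cl100k_base matches your model's tokenizer; swap in the correct
# encoding for the model you actually use.
_encoding = tiktoken.get_encoding("cl100k_base")


def model_token_counter(messages: List[BaseMessage]) -> int:
    # Count tokens with the model's own tokenizer instead of a rough heuristic.
    return sum(len(_encoding.encode(str(m.content))) for m in messages)
```

A counter like this could replace an approximate counter if its estimates drift too far from what your provider reports.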
## Future Enhancements

### Planned Features

1. **Conversation summarization**: Summarize older parts of the conversation instead of discarding them
2. **Smart context selection**: Preserve important messages based on content
3. **Model-specific optimization**: Tailored trimming for different LLM providers
4. **Adaptive limits**: Dynamic token allocation based on conversation patterns

### Configuration Extensions

1. **Per-session limits**: Different limits for different conversation types
2. **Priority tagging**: Mark important messages for preservation
3. **Custom strategies**: Pluggable trimming algorithms (see the illustrative sketch below)
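If pluggable strategies are added, the interface could look roughly like the sketch below; all names here are hypothetical, and the example strategy simply mirrors the documented count-based fallback (system message plus the last 20 messages).

```python
from typing import List, Protocol

from langchain_core.messages import BaseMessage


class TrimStrategy(Protocol):
    """Hypothetical interface for a pluggable trimming strategy."""

    def trim(self, messages: List[BaseMessage], budget: int) -> List[BaseMessage]: ...


class KeepLastN:
    """Example strategy mirroring the count-based fallback: system message + last N."""

    def __init__(self, n: int = 20) -> None:
        self.n = n

    def trim(self, messages: List[BaseMessage], budget: int) -> List[BaseMessage]:
        # Keep the system message(s) and the most recent N non-system messages.
        system = [m for m in messages if m.type == "system"]
        rest = [m for m in messages if m.type != "system"]
        return system + rest[-self.n:]
```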