
Conversation History Management

Overview

The system now automatically manages conversation history to prevent exceeding LLM context length limits. This ensures reliable operation for long-running conversations and prevents API failures due to token limit violations.

Key Features

Automatic Context Management

  • Token-based trimming: Uses LangChain's trim_messages utility for intelligent conversation truncation (see the sketch after this list)
  • Configurable limits: Defaults to 85% of max_context_length for conversation history (15% reserved for responses)
  • Smart preservation: Always preserves system messages and maintains conversation validity
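
As a rough sketch of how such token-based trimming can be wired up with LangChain's trim_messages (the character-based counter, example messages, and the 85% budget below are illustrative assumptions, not the project's exact implementation):

from langchain_core.messages import (
    AIMessage, BaseMessage, HumanMessage, SystemMessage, trim_messages
)

def approx_token_counter(messages: list[BaseMessage]) -> int:
    # Rough heuristic: ~4 characters per token across all message contents.
    return sum(len(str(m.content)) for m in messages) // 4

MAX_CONTEXT_LENGTH = 96000
HISTORY_BUDGET = int(MAX_CONTEXT_LENGTH * 0.85)  # 85% for history, 15% reserved for the response

messages = [
    SystemMessage(content="You are a helpful RAG assistant."),
    HumanMessage(content="What does the warranty cover?"),
    AIMessage(content="The warranty covers manufacturing defects for two years."),
    HumanMessage(content="Does it cover accidental damage?"),
]

trimmed = trim_messages(
    messages,
    max_tokens=HISTORY_BUDGET,
    strategy="last",                  # keep the most recent messages
    token_counter=approx_token_counter,
    include_system=True,              # never drop the system message
    start_on="human",                 # trimmed history must start with a human message
    end_on=("human", "tool"),         # ...and end with a human or tool message
)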

Conversation Quality

  • Valid flow: Ensures conversations start with human messages and end with human/tool messages
  • Recent priority: Keeps the most recent messages when trimming is needed
  • Graceful fallback: Falls back to message count-based trimming if token counting fails

Configuration

Default Settings

llm:
  rag:
    max_context_length: 96000    # Maximum context length for conversation history
    # max_output_tokens:         # Optional: Limit LLM output tokens (default: no limit)
    # Conversation history will use 85% = 81,600 tokens
    # Response generation reserves 15% = 14,400 tokens

Custom Configuration

You can override the context length and optionally set output token limits:

from service.graph.message_trimmer import create_conversation_trimmer

# Use custom context length
trimmer = create_conversation_trimmer(max_context_length=128000)

Configuration examples:

# No output limit (default)
llm:
  rag:
    max_context_length: 96000

# With output limit
llm:
  rag:
    max_context_length: 96000
    max_output_tokens: 4000      # Limit LLM response to 4000 tokens

How It Works

1. Token Monitoring

The system continuously monitors conversation length using approximate token counting.
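
For illustration, the check might look like the following (the counting heuristic and function names are assumptions; the real implementation may differ):

from langchain_core.messages import BaseMessage

def estimate_tokens(messages: list[BaseMessage]) -> int:
    # Cheap estimate: ~4 characters per token, plus a small per-message overhead
    # for role markers and formatting.
    return sum(len(str(m.content)) // 4 + 4 for m in messages)

def should_trim(messages: list[BaseMessage], history_budget: int) -> bool:
    # Trim once the estimated size of the history exceeds its token budget.
    return estimate_tokens(messages) > history_budget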

2. Trimming Logic

When the conversation approaches the token limit, the trimmer (sketched after this list):

  • Preserves the system message (contains important instructions)
  • Keeps the most recent conversation turns
  • Removes older messages to stay within limits
  • Maintains conversation validity (proper message sequence)
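
A hand-rolled sketch of these steps, independent of trim_messages and purely illustrative, could look like:

from langchain_core.messages import BaseMessage, HumanMessage, SystemMessage

def trim_to_budget(messages: list[BaseMessage], budget: int) -> list[BaseMessage]:
    def count(ms: list[BaseMessage]) -> int:
        return sum(len(str(m.content)) for m in ms) // 4

    # Preserve the system message (contains important instructions).
    system = [m for m in messages if isinstance(m, SystemMessage)][:1]
    rest = [m for m in messages if not isinstance(m, SystemMessage)]

    # Keep the most recent messages that still fit within the budget.
    kept: list[BaseMessage] = []
    for message in reversed(rest):
        if count(system + kept + [message]) > budget:
            break
        kept.insert(0, message)

    # Maintain a valid sequence: drop leading messages until a human turn starts the history.
    while kept and not isinstance(kept[0], HumanMessage):
        kept.pop(0)

    return system + kept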

3. Fallback Strategy

If token counting fails, the trimmer (see the sketch after this list):

  • Falls back to message count-based trimming
  • Keeps last 20 messages by default
  • Still preserves system messages
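
A count-based fallback along these lines (illustrative only, with an assumed helper name) is straightforward:

from langchain_core.messages import BaseMessage, SystemMessage

def fallback_trim(messages: list[BaseMessage], keep_last: int = 20) -> list[BaseMessage]:
    # Keep the system message plus the last `keep_last` non-system messages.
    system = [m for m in messages if isinstance(m, SystemMessage)][:1]
    rest = [m for m in messages if not isinstance(m, SystemMessage)]
    return system + rest[-keep_last:]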

Implementation Details

Core Components

ConversationTrimmer Class

from typing import List

from langchain_core.messages import BaseMessage

class ConversationTrimmer:
    def __init__(self, max_context_length: int = 96000, preserve_system: bool = True): ...

    def should_trim(self, messages: List[BaseMessage]) -> bool: ...

    def trim_conversation_history(self, messages: List[BaseMessage]) -> List[BaseMessage]: ...

Integration Point

The trimming is automatically applied in the call_model function:

# Create conversation trimmer for managing context length
trimmer = create_conversation_trimmer()

# Trim conversation history to manage context length
if trimmer.should_trim(messages):
    messages = trimmer.trim_conversation_history(messages)
    logger.info("Applied conversation history trimming for context management")

Token Allocation Strategy

Component               Token Allocation        Purpose
Conversation History    85% (81,600 tokens)     Maintains context
Response Generation     15% (14,400 tokens)     LLM output space
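
With the default max_context_length of 96,000 tokens, the split works out as follows (a minimal arithmetic sketch; the exact computation in the code is assumed):

MAX_CONTEXT_LENGTH = 96_000

history_budget = int(MAX_CONTEXT_LENGTH * 0.85)         # 81,600 tokens for conversation history
response_budget = MAX_CONTEXT_LENGTH - history_budget   # 14,400 tokens reserved for LLM output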

Benefits

Reliability

  • No more context overflow: Prevents API failures due to token limits
  • Consistent performance: Maintains response quality regardless of conversation length
  • Graceful degradation: Intelligent trimming preserves conversation flow

User Experience

  • Seamless operation: Trimming happens transparently
  • Context preservation: Important system instructions always maintained
  • Recent focus: Most relevant (recent) conversation content preserved

Scalability

  • Long conversations: Supports indefinitely long conversations
  • Memory efficiency: Prevents unbounded memory growth
  • Performance: Minimal overhead for short conversations

Monitoring

Logging

The system logs when trimming occurs:

INFO: Trimmed conversation history: 15 -> 8 messages
INFO: Applied conversation history trimming for context management

Metrics

  • Original message count vs. trimmed count
  • Token count estimation
  • Fallback usage frequency

Best Practices

For Administrators

  1. Monitor logs: Watch for frequent trimming (may indicate need for higher limits)
  2. Tune limits: Adjust max_context_length based on your LLM provider's limits
  3. Test with long conversations: Verify trimming behavior with realistic scenarios

For Developers

  1. System prompt optimization: Keep system prompts concise to maximize conversation space
  2. Tool response size: Consider tool response sizes in token calculations
  3. Custom trimming: Implement domain-specific trimming logic if needed

Troubleshooting

Common Issues

"Trimming too aggressive"

  • Increase max_context_length in configuration
  • Check if system prompt is too long
  • Verify tool responses aren't excessively large

"Still getting context errors"

  • Check if token counting is accurate for your model
  • Verify trimming is actually being applied (check logs)
  • Consider implementing custom token counting for specific models (see the sketch below)
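
For OpenAI-family models, one way to get more accurate counts is to plug tiktoken in as the token counter; a hedged sketch (the encoding name and per-message overhead are assumptions, not part of the project):

import tiktoken
from langchain_core.messages import BaseMessage

def tiktoken_counter(messages: list[BaseMessage], encoding_name: str = "cl100k_base") -> int:
    # Encode each message's content and add a small allowance for role/formatting tokens.
    encoding = tiktoken.get_encoding(encoding_name)
    return sum(len(encoding.encode(str(m.content))) + 4 for m in messages)

A counter like this can be passed as the token_counter argument to LangChain's trim_messages in place of the approximate heuristic.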

"Important context lost"

  • Review trimming strategy (currently keeps recent messages)
  • Consider implementing conversation summarization for older content
  • Adjust token allocation percentages

Future Enhancements

Planned Features

  1. Conversation summarization: Summarize older parts instead of discarding
  2. Smart context selection: Preserve important messages based on content
  3. Model-specific optimization: Tailored trimming for different LLM providers
  4. Adaptive limits: Dynamic token allocation based on conversation patterns

Configuration Extensions

  1. Per-session limits: Different limits for different conversation types
  2. Priority tagging: Mark important messages for preservation
  3. Custom strategies: Pluggable trimming algorithms