# Conversation History Management

## Overview

The system now automatically manages conversation history to prevent exceeding LLM context length limits. This ensures reliable operation for long-running conversations and prevents API failures due to token limit violations.

## Key Features

### Automatic Context Management

- **Token-based trimming**: Uses LangChain's `trim_messages` utility for intelligent conversation truncation (see the sketch after this list)
- **Configurable limits**: By default, 85% of `max_context_length` is reserved for conversation history (15% is reserved for responses)
- **Smart preservation**: Always preserves system messages and maintains conversation validity

### Conversation Quality

- **Valid flow**: Ensures conversations start with human messages and end with human/tool messages
- **Recent priority**: Keeps the most recent messages when trimming is needed
- **Graceful fallback**: Falls back to message count-based trimming if token counting fails

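As a rough illustration of how token-based trimming can be wired up with `trim_messages`, here is a minimal sketch. The character-based counter and the example messages are assumptions for the demo; 81,600 tokens is simply the 85% history budget derived from the default configuration below.

```python
from langchain_core.messages import (
    AIMessage, BaseMessage, HumanMessage, SystemMessage, trim_messages,
)

def approx_token_count(messages: list[BaseMessage]) -> int:
    # Crude heuristic (~4 characters per token); the real counter is implementation-specific
    return sum(len(str(m.content)) for m in messages) // 4

history = [
    SystemMessage("You are a helpful assistant."),
    HumanMessage("First question..."),
    AIMessage("First answer..."),
    HumanMessage("Latest question..."),
]

trimmed = trim_messages(
    history,
    max_tokens=81_600,               # 85% of the 96,000-token default
    strategy="last",                 # keep the most recent messages
    token_counter=approx_token_count,
    include_system=True,             # always keep the system message
    start_on="human",                # trimmed history must start with a human message
    end_on=("human", "tool"),        # ...and end with a human or tool message
)
```
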
## Configuration

### Default Settings

```yaml
llm:
  rag:
    max_context_length: 96000   # Maximum context length for conversation history
    # max_output_tokens:        # Optional: Limit LLM output tokens (default: no limit)
    # Conversation history will use 85% = 81,600 tokens
    # Response generation reserves 15% = 14,400 tokens
```

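For a quick sanity check on those numbers, the split is plain arithmetic on the default value:

```python
max_context_length = 96_000

history_budget = int(max_context_length * 0.85)        # 81,600 tokens for conversation history
response_budget = max_context_length - history_budget  # 14,400 tokens reserved for the response

print(history_budget, response_budget)  # 81600 14400
```
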
### Custom Configuration

You can override the context length and optionally set output token limits:

```python
from service.graph.message_trimmer import create_conversation_trimmer

# Use custom context length
trimmer = create_conversation_trimmer(max_context_length=128000)
```

Configuration examples:

```yaml
# No output limit (default)
llm:
  rag:
    max_context_length: 96000

# With output limit
llm:
  rag:
    max_context_length: 96000
    max_output_tokens: 4000   # Limit LLM response to 4000 tokens
```

## How It Works

### 1. Token Monitoring

The system continuously monitors conversation length using approximate token counting.

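The exact counting method is an implementation detail; conceptually the check compares an estimated token count against the history budget. A minimal sketch, assuming the same ~4-characters-per-token heuristic as above (the helper name is illustrative, not the project's actual code):

```python
from langchain_core.messages import BaseMessage

HISTORY_BUDGET = int(96_000 * 0.85)  # 85% of the default max_context_length

def needs_trimming(messages: list[BaseMessage]) -> bool:
    # Approximate count: ~4 characters per token (assumption)
    estimated = sum(len(str(m.content)) for m in messages) // 4
    return estimated > HISTORY_BUDGET
```
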
### 2. Trimming Logic

When the conversation approaches the token limit:

- Preserves the system message (contains important instructions)
- Keeps the most recent conversation turns
- Removes older messages to stay within limits
- Maintains conversation validity (proper message sequence)

### 3. Fallback Strategy

If token counting fails:

- Falls back to message count-based trimming
- Keeps last 20 messages by default
- Still preserves system messages

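A minimal sketch of what the count-based fallback could look like; only the "system message plus the last 20 messages" behavior comes from the description above, and the function name is illustrative:

```python
from langchain_core.messages import BaseMessage, SystemMessage

def fallback_trim(messages: list[BaseMessage], keep_last: int = 20) -> list[BaseMessage]:
    # Count-based fallback: keep the system message (if any) plus the most recent messages
    system = [m for m in messages if isinstance(m, SystemMessage)][:1]
    rest = [m for m in messages if not isinstance(m, SystemMessage)]
    return system + rest[-keep_last:]
```
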
## Implementation Details

### Core Components

#### ConversationTrimmer Class

```python
from typing import List

from langchain_core.messages import BaseMessage

class ConversationTrimmer:
    def __init__(self, max_context_length: int = 96000, preserve_system: bool = True): ...

    def should_trim(self, messages: List[BaseMessage]) -> bool: ...

    def trim_conversation_history(self, messages: List[BaseMessage]) -> List[BaseMessage]: ...
```

#### Integration Point
The trimming is automatically applied in the `call_model` function:

```python
# Create conversation trimmer for managing context length
trimmer = create_conversation_trimmer()

# Trim conversation history to manage context length
if trimmer.should_trim(messages):
    messages = trimmer.trim_conversation_history(messages)
    logger.info("Applied conversation history trimming for context management")
```

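For orientation, here is a simplified sketch of a model-calling node with the trimming step in place. The state shape, the `llm` argument, and the surrounding structure are assumptions based on common LangGraph patterns, not the project's exact `call_model`:

```python
import logging

from langchain_core.language_models import BaseChatModel
from langchain_core.messages import BaseMessage

from service.graph.message_trimmer import create_conversation_trimmer

logger = logging.getLogger(__name__)

def call_model(state: dict, llm: BaseChatModel) -> dict:
    messages: list[BaseMessage] = state["messages"]

    # Trim before every model call so long conversations stay under the context limit
    trimmer = create_conversation_trimmer()
    if trimmer.should_trim(messages):
        messages = trimmer.trim_conversation_history(messages)
        logger.info("Applied conversation history trimming for context management")

    response = llm.invoke(messages)
    return {"messages": [response]}
```
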
### Token Allocation Strategy

| Component | Token Allocation | Purpose |
|-----------|------------------|---------|
| Conversation History | 85% (81,600 tokens) | Maintains context |
| Response Generation | 15% (14,400 tokens) | LLM output space |

## Benefits

### Reliability

- **No more context overflow**: Prevents API failures due to token limits
- **Consistent performance**: Maintains response quality regardless of conversation length
- **Graceful degradation**: Intelligent trimming preserves conversation flow

### User Experience

- **Seamless operation**: Trimming happens transparently
- **Context preservation**: Important system instructions always maintained
- **Recent focus**: Most relevant (recent) conversation content preserved

### Scalability

- **Long conversations**: Supports indefinitely long conversations
- **Memory efficiency**: Prevents unbounded memory growth
- **Performance**: Minimal overhead for short conversations

## Monitoring

### Logging

The system logs when trimming occurs:

```
INFO: Trimmed conversation history: 15 -> 8 messages
INFO: Applied conversation history trimming for context management
```

### Metrics

- Original message count vs. trimmed count
- Token count estimation
- Fallback usage frequency

## Best Practices

### For Administrators

1. **Monitor logs**: Watch for frequent trimming (may indicate the need for higher limits)
2. **Tune limits**: Adjust `max_context_length` based on your LLM provider's limits
3. **Test with long conversations**: Verify trimming behavior with realistic scenarios

### For Developers

1. **System prompt optimization**: Keep system prompts concise to maximize conversation space
2. **Tool response size**: Consider tool response sizes in token calculations
3. **Custom trimming**: Implement domain-specific trimming logic if needed

## Troubleshooting

### Common Issues

#### "Trimming too aggressive"

- Increase `max_context_length` in the configuration
- Check if the system prompt is too long
- Verify tool responses aren't excessively large

#### "Still getting context errors"
|
||||
- Check if token counting is accurate for your model
|
||||
- Verify trimming is actually being applied (check logs)
|
||||
- Consider implementing custom token counting for specific models
|
||||
|
||||
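If the approximate counter misestimates for your model, a model-specific counter can be swapped in; for OpenAI-family models a `tiktoken`-based count is one option. This is an illustrative sketch (the encoding choice and how the counter is plugged into the trimmer depend on your setup):

```python
import tiktoken
from langchain_core.messages import BaseMessage

# cl100k_base matches many recent OpenAI models; pick the encoding for your model
_ENCODING = tiktoken.get_encoding("cl100k_base")

def tiktoken_count(messages: list[BaseMessage]) -> int:
    return sum(len(_ENCODING.encode(str(m.content))) for m in messages)
```
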
#### "Important context lost"

- Review the trimming strategy (currently keeps recent messages)
- Consider implementing conversation summarization for older content
- Adjust token allocation percentages

## Future Enhancements

### Planned Features

1. **Conversation summarization**: Summarize older parts instead of discarding them
2. **Smart context selection**: Preserve important messages based on content
3. **Model-specific optimization**: Tailored trimming for different LLM providers
4. **Adaptive limits**: Dynamic token allocation based on conversation patterns

### Configuration Extensions

1. **Per-session limits**: Different limits for different conversation types
2. **Priority tagging**: Mark important messages for preservation
3. **Custom strategies**: Pluggable trimming algorithms