Context Quality Degradation: Hallucinations Rise from 3% to 28% with Token Count

Context Window Performance Testing Results

A developer measured how output quality in AI agents degrades as context grows from 10K to 200K tokens, and found that hallucination rates rise and recall of early information drops as context size increases.

Key Findings from Testing

The testing measured several critical metrics:

Hallucination rates by context size:
- 10K tokens: ~3%
- 50K tokens: ~11%
- 200K tokens: ~28%
- 1M tokens: unclear, but the trend shows increasing degradation

Recall accuracy: No tested model (including GPT-4, Claude, and local models) achieved 90% recall on information from the first 10 turns once context exceeded 50K tokens.

Token efficiency: At 200K tokens, the share of context actually relevant to the current query drops below 12% in most agent tasks. That leaves at least 176K tokens, and by the post's estimate approximately 188K, as noise that the model must reason around.
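
The arithmetic behind the noise estimate can be checked with a short sketch. The function name and the 6% comparison value are illustrative assumptions; only the 200K context size and the 12% ceiling come from the reported results. A 12% relevant share leaves 176K tokens of noise, while a figure near 188K corresponds to a relevant share of roughly 6%.

```python
def noise_tokens(context_tokens: int, relevant_fraction: float) -> int:
    """Return how many context tokens are NOT relevant to the current query."""
    if not 0.0 <= relevant_fraction <= 1.0:
        raise ValueError("relevant_fraction must be between 0 and 1")
    relevant = round(context_tokens * relevant_fraction)
    return context_tokens - relevant


# 12% is the reported ceiling; 6% is an assumed value shown for comparison.
for fraction in (0.12, 0.06):
    noise = noise_tokens(200_000, fraction)
    print(f"relevant share {fraction:.0%}: {noise:,} noise tokens")
```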

Problem Analysis

The issue appears to be attention starvation rather than forgetting. Early context competes with recent context, with recent context usually winning due to higher positional relevance. This causes constraints set early in sessions (like "use PostgreSQL, no ORMs") to become progressively diluted as more context accumulates.

By turn 89 with 200K tokens, the model's attention is so spread across the context that early constraints effectively disappear.
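
A toy calculation gives a feel for the scale of this dilution. It is a crude proxy, not a description of how any real model allocates attention: it assumes attention is spread uniformly over all tokens, and it assumes the early constraint occupies 8 tokens. Both assumptions are made up for illustration. Any recency bias would make the early constraint's share smaller still.

```python
def uniform_attention_share(constraint_tokens: int, context_tokens: int) -> float:
    """Fraction of attention an early constraint would receive if attention
    were spread evenly over every token in the context (a toy assumption)."""
    if context_tokens <= 0 or not 0 <= constraint_tokens <= context_tokens:
        raise ValueError("need 0 <= constraint_tokens <= context_tokens, context_tokens > 0")
    return constraint_tokens / context_tokens


CONSTRAINT_TOKENS = 8  # assumed length of a constraint like "use PostgreSQL, no ORMs"

for size in (10_000, 50_000, 200_000):
    share = uniform_attention_share(CONSTRAINT_TOKENS, size)
    print(f"{size:>7,} tokens: constraint share = {share:.4%}")
```

Under these assumptions the constraint's share at 200K tokens is one twentieth of its share at 10K tokens, simply because the denominator grew twentyfold.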

Current Solutions and Limitations

Many developers add vector databases to retrieve "relevant" memories, which helps only partially. This approach retrieves content that is semantically similar to the query, which is not necessarily what the agent needs for correct reasoning. For example, "use PostgreSQL" is not semantically similar to "write me a login endpoint", even though it needs to be in context for the endpoint to be written correctly.
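
The retrieval mismatch can be sketched with a minimal similarity ranking. This is a stand-in, not how a vector database works internally: it uses bag-of-words cosine similarity from the standard library instead of learned embeddings, and the second memory ("dark theme") is a hypothetical entry invented for the example. Real embeddings would produce different scores, but the failure mode is the same: ranking by similarity to the query favors text that looks like the query over constraints that govern it.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, memories: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k memories ranked by similarity to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        memories,
        key=lambda memory: cosine_similarity(query_vec, bag_of_words(memory)),
        reverse=True,
    )
    return ranked[:top_k]


memories = [
    "use PostgreSQL, no ORMs",                  # early constraint from the session
    "the login page should have a dark theme",  # hypothetical, loosely related memory
]
query = "write me a login endpoint"

for memory in memories:
    score = cosine_similarity(bag_of_words(query), bag_of_words(memory))
    print(f"{score:.3f}  {memory}")

print("retrieved:", retrieve(query, memories))
```

In this sketch the constraint shares no words with the query, so it scores zero and is never retrieved, while the loosely related memory wins on the shared word "login".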

The developer is seeking feedback on whether these findings match production experiences and what approaches have actually worked for others.

📖 Read the full source: r/LocalLLaMA