Dual-model architecture reduces token consumption by half for long conversations

✍️ OpenClawRadar | 📅 Published: March 9, 2026 | 🔗 Source

Context compression system for AI agents

A developer on r/ClaudeAI shared a solution to the problem of AI agents losing context after conversation compaction. The system uses a dual-model architecture where a cheap small model (called the "subconscious") continuously compresses conversation history in the background.
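As a rough illustration of the background compression idea, the sketch below folds older turns into a running summary whenever the buffer of recent turns grows too long. Everything here is an assumption for illustration: the `Subconscious` class, the `keep_recent` parameter, and the `fake_small_model` stub (which stands in for a call to a cheap model) are hypothetical names, not the developer's code.

```python
from typing import Callable, List


class Subconscious:
    """Hypothetical helper: keeps the newest turns verbatim, summarizes the rest."""

    def __init__(self, summarize: Callable[[str, List[str]], str], keep_recent: int = 4):
        self.summarize = summarize      # stand-in for the cheap small model
        self.keep_recent = keep_recent  # how many raw turns stay uncompressed
        self.summary = ""
        self.recent: List[str] = []

    def observe(self, turn: str) -> None:
        """Record a new turn and compress whatever falls out of the recent window."""
        self.recent.append(turn)
        overflow = len(self.recent) - self.keep_recent
        if overflow > 0:
            old, self.recent = self.recent[:overflow], self.recent[overflow:]
            self.summary = self.summarize(self.summary, old)


def fake_small_model(summary: str, turns: List[str]) -> str:
    """Placeholder summarizer: appends truncated turns instead of calling a model."""
    return (summary + " " + " | ".join(t[:20] for t in turns)).strip()
```

In a real system `observe` would presumably run off the main request path (for example in a worker thread or a queue consumer) so that compression never delays the main model's reply.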

Architecture details

The system has four layers:

  • Narrative summary (~1K tokens)
  • Compressed factoids
  • Semantically retrieved verbatim quotes
  • Raw recent turns

The main model (the "conscious") receives a curated context of about 35K tokens that carries the information that would otherwise require about 120K tokens of raw history. It reads one coherent timeline and is unaware that the memory system exists.
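One way the four layers might be assembled into a single budgeted context is sketched below. This is a minimal sketch under stated assumptions, not the developer's implementation: the `MemoryLayers` type, the whitespace-based `count_tokens`, and the fill priority (narrative first, then the newest raw turns, then factoids, then quotes) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


@dataclass
class MemoryLayers:
    narrative: str                                          # running summary
    factoids: List[str] = field(default_factory=list)       # compressed facts
    quotes: List[str] = field(default_factory=list)         # assumed pre-ranked by relevance
    recent_turns: List[str] = field(default_factory=list)   # raw turns, oldest first


def take_within(items: Iterable[str], remaining: int) -> Tuple[List[str], int]:
    """Greedily keep items in order until the next one no longer fits."""
    kept = []
    for item in items:
        cost = count_tokens(item)
        if cost > remaining:
            break
        kept.append(item)
        remaining -= cost
    return kept, remaining


def build_context(layers: MemoryLayers, budget: int = 35_000) -> str:
    """Assemble one timeline: narrative, factoids, quotes, then recent turns."""
    # The narrative is always included, even if it alone exceeds the budget.
    remaining = budget - count_tokens(layers.narrative)
    # Assumed priority: newest raw turns are reserved before older material.
    newest_first, remaining = take_within(reversed(layers.recent_turns), remaining)
    factoids, remaining = take_within(layers.factoids, remaining)
    quotes, remaining = take_within(layers.quotes, remaining)
    sections = [layers.narrative, *factoids, *quotes, *reversed(newest_first)]
    return "\n\n".join(s for s in sections if s)
```

The returned string is plain text with no layer markers, which matches the idea that the main model only ever sees one continuous history.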

Performance results

The developer simulated 260 turns across different conversation types. For sustained project work (starting with heavy research and gradually shifting to quick exchanges as the model learns the domain), the system cuts token consumption roughly in half.
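To see how a comparison like this can be computed, the toy cost model below sums the input tokens sent to the main model when every turn resends the history, truncated at a cap. It is only a sketch: the function name, the constant turn size, and the caps used in the example are hypothetical, and the model ignores the small model's own token use, so its ratio is not expected to match the developer's figure.

```python
from typing import Iterable


def cumulative_input_tokens(turn_sizes: Iterable[int], cap: int) -> int:
    """Total input tokens if each turn resends the history, truncated to `cap`."""
    total = 0
    history = 0
    for size in turn_sizes:
        history += size               # history grows by this turn's tokens
        total += min(history, cap)    # the model never sees more than `cap`
    return total


# Hypothetical workload: 260 turns of 1,500 tokens each.
turns = [1_500] * 260
raw = cumulative_input_tokens(turns, cap=120_000)
curated = cumulative_input_tokens(turns, cap=35_000)
print(f"curated / raw = {curated / raw:.2f}")
```

Changing the turn-size profile (for example, large early turns that shrink over time, as in the sustained project scenario) shifts the ratio considerably, which is why the result depends on conversation type.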

Development tools

The developer used Claude Code to build the simulation and Claude.ai during the consulting and research stage. The developer is looking for others who have tried using a smaller model to manage context for a larger one, or who have found other workarounds for the compaction problem.

📖 Read the full source: r/ClaudeAI
