Dual-model architecture cuts token consumption roughly in half for long conversations

✍️ OpenClawRadar | 📅 Published: March 9, 2026 | 🔗 Source

Context compression system for AI agents

A developer on r/ClaudeAI shared a solution to the problem of AI agents losing context after conversation compaction. The system uses a dual-model architecture where a cheap small model (called the "subconscious") continuously compresses conversation history in the background.

Architecture details

The system has four layers:

Narrative summary (~1K tokens)
Compressed factoids
Semantically retrieved verbatim quotes
Raw recent turns

The main model (the "conscious") receives a curated context of about 35K tokens that carries the information that would normally take about 120K tokens of raw history. It reads one coherent timeline and has no indication that the memory system exists.
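The layering described above can be illustrated with a minimal sketch. This is not the developer's code: the `MemoryLayers` type, the `build_context` and `take_within` helpers, the 4-characters-per-token estimate, and the way the budget is split between layers are all assumptions made for illustration. Only the four layers, the ~1K summary, and the ~35K total come from the post.

```python
from dataclasses import dataclass


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token.
    return max(1, len(text) // 4)


@dataclass
class MemoryLayers:
    # Hypothetical container for the four layers named in the post.
    narrative_summary: str
    factoids: list[str]
    retrieved_quotes: list[str]
    recent_turns: list[str]  # oldest first


def take_within(items, budget):
    """Take items in order until the next one would exceed the token budget."""
    taken, used = [], 0
    for item in items:
        cost = estimate_tokens(item)
        if used + cost > budget:
            break
        taken.append(item)
        used += cost
    return taken, used


def build_context(layers: MemoryLayers, budget: int = 35_000,
                  summary_budget: int = 1_000) -> str:
    # Layer 1: narrative summary, truncated to roughly summary_budget tokens.
    summary = layers.narrative_summary[: summary_budget * 4]
    remaining = budget - estimate_tokens(summary)

    # Layer 4: raw recent turns, newest first, capped at half of what is left
    # (the 50% split is an arbitrary illustrative choice).
    newest_first, used = take_within(reversed(layers.recent_turns), remaining // 2)
    recent = list(reversed(newest_first))  # restore chronological order
    remaining -= used

    # Layer 2: compressed factoids, again capped at half of what is left.
    factoids, used = take_within(layers.factoids, remaining // 2)
    remaining -= used

    # Layer 3: retrieved verbatim quotes fill whatever budget is left.
    quotes, _ = take_within(layers.retrieved_quotes, remaining)

    # Emit a single timeline: summary, factoids, quotes, then recent turns.
    return "\n\n".join([summary, *factoids, *quotes, *recent])
```

In this sketch the main model would only ever see the string returned by `build_context`; the small model would be responsible for keeping the summary, factoids, and retrieved quotes up to date in the background.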

Performance results

The developer simulated 260 turns across different conversation types. For sustained project work (starting with heavy research and gradually shifting to quick exchanges as the model learns the domain), the system cuts token consumption roughly in half.
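To see why capping the main model's context can lower total consumption, here is a small illustrative calculation. It does not reproduce the developer's simulation: the `cumulative_input_tokens` function and the turn sizes in the usage note are hypothetical, and the sketch ignores the tokens spent by the small compressing model.

```python
def cumulative_input_tokens(turn_sizes, context_cap=None):
    """Sum the prompt tokens sent to the main model over a whole conversation.

    turn_sizes: tokens added to the history by each turn (hypothetical values).
    context_cap: maximum prompt size per turn; None means the full raw
    history is resent every turn.
    """
    total = 0
    history = 0
    for size in turn_sizes:
        history += size
        # Without a cap the prompt grows with the history; with a cap it
        # stops growing once the curated context is full.
        prompt = history if context_cap is None else min(history, context_cap)
        total += prompt
    return total
```

For example, with ten turns of 1,000 tokens each, `cumulative_input_tokens([1000] * 10)` gives 55,000 tokens, while `cumulative_input_tokens([1000] * 10, context_cap=3000)` gives 27,000. The real saving depends on the conversation mix and on the cost of the background model, which this toy calculation leaves out.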

Development tools

The developer used Claude Code to build the simulation and Claude.ai during the consulting and research stage. They are looking for others who have tried using a smaller model to manage context for a larger one, or who have found other workarounds for the compaction problem.

📖 Read the full source: r/ClaudeAI
