Dual-model architecture cuts token consumption roughly in half for long conversations

Context compression system for AI agents
A developer on r/ClaudeAI shared a solution to the problem of AI agents losing context after conversation compaction. The system uses a dual-model architecture where a cheap small model (called the "subconscious") continuously compresses conversation history in the background.
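The background compression idea can be illustrated with a minimal sketch. This is not the developer's code: `RollingMemory`, `keep_recent`, and `toy_summarizer` are hypothetical names, the summarizer is a plain function standing in for a call to the small model, and the folding happens synchronously here, whereas the described system runs it in the background.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for the small "subconscious" model: any callable that
# takes the previous summary plus the turns being retired and returns a new summary.
Summarizer = Callable[[str, List[str]], str]


@dataclass
class RollingMemory:
    summarize: Summarizer
    keep_recent: int = 4  # assumed number of raw turns kept verbatim
    summary: str = ""
    recent: List[str] = field(default_factory=list)

    def add_turn(self, turn: str) -> None:
        """Record a turn, then fold any overflow turns into the summary."""
        self.recent.append(turn)
        overflow = len(self.recent) - self.keep_recent
        if overflow > 0:
            old, self.recent = self.recent[:overflow], self.recent[overflow:]
            self.summary = self.summarize(self.summary, old)


def toy_summarizer(previous: str, old_turns: List[str]) -> str:
    # Placeholder for a real small-model call: keeps the first 20 characters
    # of each retired turn and appends them to the previous summary.
    parts = ([previous] if previous else []) + [t[:20] for t in old_turns]
    return " | ".join(parts)
```

In a real deployment the `summarize` callable would wrap a request to the cheaper model and would likely run off the main request path, so the main model never waits on compression.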
Architecture details
The system has four layers:
- Narrative summary (~1K tokens)
- Compressed factoids
- Semantically retrieved verbatim quotes
- Raw recent turns
The main model (the "conscious") receives a curated context of about 35K tokens that carries the information that would normally take about 120K tokens of raw history. It reads one coherent timeline and is not aware that the memory system exists.
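One way to picture how the four layers might be merged into a single budgeted prompt is sketched below. Everything here is an assumption for illustration: the function names, the word-count tokenizer, and the priority order (summary first, then recent turns, then factoids, then quotes) are not taken from the original post, and the quotes are assumed to have been retrieved already by some semantic search step.

```python
from typing import List, Sequence


def count_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def take_within(items: Sequence[str], budget: int) -> List[str]:
    """Take items in order until the next one would exceed the budget."""
    taken: List[str] = []
    used = 0
    for item in items:
        cost = count_tokens(item)
        if used + cost > budget:
            break
        taken.append(item)
        used += cost
    return taken


def build_context(
    summary: str,
    factoids: Sequence[str],
    quotes: Sequence[str],
    recent_turns: Sequence[str],
    budget: int = 35_000,
) -> str:
    """Join the four layers into one prompt, most compressed layer first."""
    remaining = budget - count_tokens(summary)
    # Assumed priority: newest raw turns are kept first, walking backwards.
    recent = take_within(list(reversed(recent_turns)), remaining)[::-1]
    remaining -= sum(count_tokens(t) for t in recent)
    facts = take_within(factoids, remaining)
    remaining -= sum(count_tokens(f) for f in facts)
    cited = take_within(quotes, remaining)
    # The main model sees one continuous text with no layer markers.
    return "\n".join([summary, *facts, *cited, *recent])
```

Because the result is a single string with no layer markers, the main model sees one timeline, which matches the description that it does not know the memory system exists.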
Performance results
The developer simulated 260 turns across different conversation types. For sustained project work (starting with heavy research and gradually shifting to quick exchanges as the model learns the domain), the system cuts token consumption roughly in half.
Development tools
The simulation was built with Claude Code, and Claude.ai was used during the consulting and research stage. The developer is looking for others who have tried routing a smaller model to manage context for a larger one, or who have found other workarounds for the compaction problem.
📖 Read the full source: r/ClaudeAI