KV Cache Reuse for Long Conversations on Apple Silicon Delivers 200x Speedup

What This Is
A developer shared experimental results from implementing session-based KV (key-value) cache reuse for local LLM inference on Apple Silicon using the MLX framework. The goal was to make long conversations (100K+ tokens) practical by eliminating the need to reprocess the entire context on each turn.
Key Findings and Benchmarks
The core approach involved keeping the KV cache in memory across conversation turns and only processing new tokens. This simple idea yielded dramatic performance improvements:
- 200x TTFT Improvement at 100K Context: Time to first token (TTFT) fell from 126 seconds without the cache to 0.5 seconds with it, because the cached turn processes 99.9% fewer tokens.
- Real-World Session Numbers: Testing with a Qwen3.5-397B model on an M3 Ultra 512GB Mac Studio during a 266-message OpenClaw agent session showed:
  - Cache hit rate: 93.8%
  - TTFT for cache hits (<500 new tokens): 1.0-1.3 seconds
  - TTFT for a full cache miss (124K tokens): 528 seconds (8.8 minutes)
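The idea behind these numbers can be sketched in a few lines. The example below is a toy model, not SoloHeaven's code: `SessionCache` and `plan_prefill` are hypothetical names, and a plain list of token IDs stands in for the real key/value tensors. It shows only the bookkeeping: find the prefix the new prompt shares with what is already cached, then process just the remainder.

```python
from dataclasses import dataclass, field


@dataclass
class SessionCache:
    """Toy per-session cache (hypothetical; token IDs stand in for K/V tensors)."""
    tokens: list[int] = field(default_factory=list)

    def plan_prefill(self, prompt: list[int]) -> tuple[int, list[int]]:
        """Return (reused_count, tokens_still_to_process) for a new prompt."""
        reused = 0
        for cached, new in zip(self.tokens, prompt):
            if cached != new:
                break  # prompt diverges here; the rest of the cache is stale
            reused += 1
        # After prefill the cache covers exactly the new prompt.
        self.tokens = list(prompt)
        return reused, prompt[reused:]


cache = SessionCache()

# Turn 1: nothing cached yet, so the whole 100K-token context is processed.
turn1 = list(range(100_000))
cold_reused, cold_todo = cache.plan_prefill(turn1)

# Turn 2: same history plus three new tokens, so only those three are processed.
turn2 = turn1 + [7, 8, 9]
warm_reused, warm_todo = cache.plan_prefill(turn2)

print(cold_reused, len(cold_todo))  # 0 100000
print(warm_reused, len(warm_todo))  # 100000 3
```

A cache miss in this sketch corresponds to a prompt that diverges early from the cached history (for example, after the conversation is edited), which forces most of the context to be reprocessed.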
What Didn't Work
The developer tested several optimization attempts that failed or degraded performance:
- Trimming Thinking Tokens: Removing the model's internal reasoning tokens from the cache to save space backfired. Responses became 31% longer and quality dropped, because the model refers back to its past reasoning across turns.
- Rotating KV Cache (8192 tokens): While this provided the best tokens-per-second (TPS) rate, it caused the model to lose earlier context, with recall dropping significantly (to 4 out of 8 items).
- KV 8-bit Quantization: This resulted in a 16.5% drop in TPS, as the computational overhead exceeded the memory bandwidth savings.
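The recall loss from the rotating cache follows from how a fixed-size window behaves. The sketch below assumes a plain sliding window, which may differ from the rotating cache actually used in the experiment (some implementations pin a few initial tokens); `rotating_window` is a hypothetical helper and integer positions stand in for cached entries.

```python
from collections import deque


def rotating_window(token_positions, max_size=8192):
    """Keep only the most recent max_size positions, evicting the oldest first."""
    window = deque(maxlen=max_size)
    for pos in token_positions:
        window.append(pos)
    return window


# After 20,000 tokens, an 8192-entry window has dropped everything before 11,808.
window = rotating_window(range(20_000))
oldest_kept = window[0]
fact_still_cached = 100 in window  # something stated at position 100

print(len(window), oldest_kept, fact_still_cached)  # 8192 11808 False
```

Anything the model was told before the window's oldest entry is no longer visible to attention, which is consistent with the drop to 4 of 8 recalled items.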
Implementation and Hardware
The implementation is part of an open-source personal project called SoloHeaven, available under an MIT license on GitHub: https://github.com/joongom/mlx-soloheaven. The README contains full benchmark tables.
Testing was conducted on a Mac Studio M3 Ultra with 512GB RAM and 4TB storage, using the following models converted for MLX:
- Qwen3.5-122B-A10B-bf16
- Qwen3.5-397B-A17B-MLX-8bit
📖 Read the full source: r/LocalLLaMA