KV Cache Reuse for Long Conversations on Apple Silicon Delivers 200x Speedup

✍️ OpenClawRadar · 📅 Published: March 15, 2026

What This Is

A developer shared experimental results from implementing session-based KV (key-value) cache reuse for local LLM inference on Apple Silicon using the MLX framework. The goal was to make long conversations (100K+ tokens) practical by eliminating the need to reprocess the entire context on each turn.

Key Findings and Benchmarks

The core approach involved keeping the KV cache in memory across conversation turns and only processing new tokens. This simple idea yielded dramatic performance improvements:

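  • 200x TTFT Improvement at 100K Context: Time to first token (TTFT) fell from 126 seconds without the cache to 0.5 seconds with it. By these numbers the ratio is roughly 250x (126 / 0.5 = 252), so the 200x headline is a round, conservative figure. Because only the new tokens are processed each turn, this corresponds to a 99.9% reduction in tokens processed.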
  • Real-World Session Numbers: Testing with a Qwen3.5-397B model on an M3 Ultra 512GB Mac Studio during a 266-message OpenClaw agent session showed:
    • Cache hit rate: 93.8%
    • TTFT for cache hits (<500 new tokens): 1.0-1.3 seconds
    • TTFT for a full cache miss (124K tokens): 528 seconds (8.8 minutes)
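
The core idea above (keep the cache, process only new tokens) can be sketched as a prefix match between what the cache already covers and the next prompt. This is a minimal illustration, not the SoloHeaven implementation: `SessionCache`, `plan_turn`, and `common_prefix_len` are hypothetical names, token ids are stand-in integers, and the actual KV tensors and MLX calls are left out. It also ignores that a real cache would cover the tokens generated in the previous turn.

```python
from dataclasses import dataclass, field


def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Number of leading token ids shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class SessionCache:
    """Hypothetical per-session record of which tokens the KV cache covers."""
    cached_tokens: list[int] = field(default_factory=list)

    def plan_turn(self, prompt_tokens: list[int]) -> tuple[int, list[int]]:
        """Return (reused_count, tokens_to_process) for the next turn."""
        reused = common_prefix_len(self.cached_tokens, prompt_tokens)
        # Always leave at least one token to process so there is
        # something to run a forward pass on.
        reused = min(reused, max(len(prompt_tokens) - 1, 0))
        # A real implementation would truncate the KV tensors to `reused`
        # positions here, then prefill only the remaining tokens.
        to_process = prompt_tokens[reused:]
        self.cached_tokens = list(prompt_tokens)
        return reused, to_process


session = SessionCache()
turn1 = list(range(1000))                 # stand-in token ids for turn 1
turn2 = turn1 + list(range(5000, 5010))   # same history plus 10 new tokens

reused1, new1 = session.plan_turn(turn1)  # cold start: nothing to reuse
reused2, new2 = session.plan_turn(turn2)  # warm: only the 10 new tokens
```

A cache miss in this picture is any turn where the shared prefix is short, for example because the client rewrote earlier messages, which forces most of the context to be prefilled again.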

What Didn't Work

The developer also tried several optimizations that either failed or degraded performance:

  • Trimming Thinking Tokens: Attempting to remove the model's internal reasoning tokens from the cache to save space caused pathological behavior. Responses became 31% longer and quality dropped, as the model references its past reasoning across turns.
  • Rotating KV Cache (8192 tokens): This gave the best tokens-per-second (TPS) rate, but the model lost earlier context, and recall dropped to 4 out of 8 items.
  • KV 8-bit Quantization: This resulted in a 16.5% drop in TPS, as the computational overhead exceeded the memory bandwidth savings.
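
The context loss reported for the rotating cache follows from how a fixed-size window behaves. The toy sketch below assumes the simplest possible policy, dropping the oldest entry once the window is full. `RotatingWindow` is a hypothetical stand-in that tracks token positions rather than real KV tensors, and actual implementations may differ (for instance by pinning some initial tokens).

```python
from collections import deque


class RotatingWindow:
    """Toy stand-in for a rotating KV cache: keeps only the newest entries."""

    def __init__(self, max_size: int) -> None:
        self.entries: deque[int] = deque(maxlen=max_size)

    def append(self, position: int) -> None:
        # When the deque is full, the oldest entry is dropped automatically.
        self.entries.append(position)

    def can_attend_to(self, position: int) -> bool:
        """True if the token at `position` is still held in the window."""
        return position in self.entries


window = RotatingWindow(max_size=8192)
for pos in range(20_000):
    window.append(pos)

early_fact_visible = window.can_attend_to(100)       # evicted long ago
recent_fact_visible = window.can_attend_to(19_000)   # still inside the window
```

Anything stated before the window's oldest position is simply unavailable to attention, which is consistent with the partial recall the developer observed.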

Implementation and Hardware

The implementation is part of an open-source personal project called SoloHeaven, available under an MIT license on GitHub: https://github.com/joongom/mlx-soloheaven. The README contains full benchmark tables.

Testing was conducted on a Mac Studio M3 Ultra with 512GB RAM and 4TB storage, using the following models converted for MLX:

  • Qwen3.5-122B-A10B-bf16
  • Qwen3.5-397B-A17B-MLX-8bit

📖 Read the full source: r/LocalLLaMA
