oMLX introduces SSD KV caching for Apple Silicon, reducing OpenClaw response times from 30-90 seconds to 5 seconds

✍️ OpenClawRadar📅 Published: March 7, 2026🔗 Source
oMLX introduces SSD KV caching for Apple Silicon, reducing OpenClaw response times from 30-90 seconds to 5 seconds
Ad

What oMLX solves

Running OpenClaw locally typically means sending the same massive system prompt (20-30k tokens covering tools, skills, workspace context) on every request. While Ollama and LM Studio cache KV state, they invalidate the entire cache and recompute from scratch when context shifts mid-session, resulting in 30-90 second response times.

oMLX fixes this by persisting KV cache blocks to SSD in safetensors format. When a previously seen prefix returns, it's restored from disk instead of recomputed - working across requests and server restarts. Since OpenClaw's system prompt is mostly static (only timestamps and runtime metadata shift), SSD caching means only changed parts get recomputed.

Performance benchmarks

Tested with Qwen3.5-122B-A10B-4bit on M3 Ultra 512GB:

  • Single request benchmarks:
    • 1k context: 768 tok/s prompt processing, 56.6 tok/s generation, 65.5 GB peak memory
    • 8k context: 940 tok/s prompt processing, 51.4 tok/s generation, 69.3 GB peak memory
    • 32k context: 764 tok/s prompt processing, 42.4 tok/s generation, 73.4 GB peak memory
  • Continuous batching (pp1024/tg128):
    • 1x batch: 56.6 tok/s, 1.00x speedup
    • 2x batch: 92.1 tok/s, 1.63x speedup
    • 4x batch: 135.1 tok/s, 2.39x speedup
    • 8x batch: 190.2 tok/s, 3.36x speedup
Ad

Setup with OpenClaw

  • Download the DMG from releases and drag to Applications
  • Point it at your model directory (reuses LM Studio models, no re-download needed)
  • Add oMLX as a custom provider in openclaw.json
  • The web dashboard generates the exact config - no terminal needed

Additional features

  • Multi-model serving: LLM + embedding + reranker simultaneously
  • Tool calling for all major formats (JSON, Qwen, Gemma, GLM) + MCP
  • Tool result trimming - truncates oversized tool outputs
  • OpenAI + Anthropic /v1/messages drop-in compatibility
  • Native macOS menu bar app (not Electron)
  • Apache 2.0 license, 100% open source

📖 Read the full source: r/openclaw

Ad

👀 See Also