vllm-mlx fork adds tool calling and prompt cache for local AI coding agents

A developer has published a modified version of vllm-mlx that fixes several issues for running AI coding agents like OpenClaw locally on Mac. The fork adds working tool calling and prompt caching to the OpenAI-compatible server for Apple Silicon.
Key fixes and features
The developer made 37 commits on top of upstream vllm-mlx to address specific problems:
- Tool calling: Added
--tool-call-parser hermesflag — Qwen3-Coder-Next tool calls work out of the box - MiniMax-M2.5: Added streaming and non-streaming tool call parsing with 4/4 accuracy on function calling benchmarks (weather, search, code execution, multi-tool)
- Prompt cache: Added persistent KV cache across requests in SimpleEngine — same system prompt and conversation history only prefills new tokens
- Reasoning separation: Built heuristic parser for MiniMax outputs that had reasoning inline with no tags — reduced leak rate from 60% to 0%
Performance improvements
With 33K token context, time to first token (TTFT) improved from 28 seconds to 0.3 seconds on cache hit. Benchmarks on Mac Studio M3 Ultra 256GB:
- Qwen3-Coder-Next 4bit: 42GB RAM, 70 tok/s decode, 1270 tok/s prefill
- Qwen3-Coder-Next 6bit: 60GB RAM, 65 tok/s decode, 1090-1440 tok/s prefill
- Qwen3-Coder-Next 8bit: 75GB RAM, ~45 tok/s decode, ~900 tok/s prefill
- MiniMax-M2.5 4bit: 120GB RAM, 33-38 tok/s decode, 430-500 tok/s prefill
The developer recommends Qwen3-Coder-Next 6bit as the sweet spot for interactive coding, noting quality is noticeably better than 4bit (which had occasional garbled output).
Setup instructions
pip install git+https://github.com/raullenchai/vllm-mlx.git
python -c "from mlx_lm import load; load('lmstudio-community/Qwen3-Coder-Next-MLX-6bit')"
python -m vllm_mlx.server \
--model lmstudio-community/Qwen3-Coder-Next-MLX-6bit \
--tool-call-parser hermes \
--prefill-step-size 8192 \
--kv-bits 8 \
--port 8000
Then point OpenClaw or any OpenAI SDK client at http://localhost:8000/v1.
Hardware requirements
- Qwen3-Coder-Next 4bit: 42GB — fits on M2 Pro 64GB or better
- Qwen3-Coder-Next 6bit: 60GB — needs M2/M3/M4 Max 96GB+ or Ultra
- MiniMax-M2.5: 120GB — Ultra 192GB+ only
What didn't work
- Speculative decoding with Qwen3-0.6B as draft model — mlx-lm has a known bug with Qwen3 (skips tokens, issue #846)
- DeepSeek-R1-Distill-70B for OpenClaw — great at reasoning but tool calling is unreliable
The repository has 1500+ tests and is licensed under Apache 2.0.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Yavio: Open-Source Product Analytics SDK for MCP Apps
Yavio is an open-source product analytics SDK for MCP and MCP Apps that automatically captures tool calls, errors, and resource reads with one function call. The MIT-licensed project provides a dashboard with per-tool breakdowns, funnels, retention, and error tracking.

harshal-mcp-proxy Now on npm: Single Daemon Replaces 12 MCP Server Configs
harshal-mcp-proxy is now available as a 54 kB npm package. Install globally, run as a daemon, and replace 12 separate MCP server configs with 6 tools, saving ~2.7 GB RAM and ~50K tokens per session.

Ollama's Technical Issues and Community Controversy
Ollama, a popular local LLM tool, faces criticism for downplaying its reliance on llama.cpp, license compliance issues, and technical problems with its custom backend including performance regressions and reintroduced bugs.

Cloudflare Dynamic Worker Loader: Sandboxing AI Agents with Isolates
Cloudflare's Dynamic Worker Loader API, now in open beta, allows Workers to instantiate new Workers with runtime-specified code in isolated sandboxes using V8 isolates, offering 100x faster startup than containers and no global concurrency limits.