vllm-mlx Fork Fixes Tool Calling, Adds Prompt Cache

A developer has published a modified version of vllm-mlx that fixes several issues for running AI coding agents like OpenClaw locally on Mac. The fork adds working tool calling and prompt caching to the OpenAI-compatible server for Apple Silicon.

Key fixes and features

The developer made 37 commits on top of upstream vllm-mlx to address specific problems:

Tool calling: Added --tool-call-parser hermes flag — Qwen3-Coder-Next tool calls work out of the box
MiniMax-M2.5: Added streaming and non-streaming tool call parsing with 4/4 accuracy on function calling benchmarks (weather, search, code execution, multi-tool)
Prompt cache: Added persistent KV cache across requests in SimpleEngine — same system prompt and conversation history only prefills new tokens
Reasoning separation: Built heuristic parser for MiniMax outputs that had reasoning inline with no tags — reduced leak rate from 60% to 0%

Performance improvements

With 33K token context, time to first token (TTFT) improved from 28 seconds to 0.3 seconds on cache hit. Benchmarks on Mac Studio M3 Ultra 256GB:

Qwen3-Coder-Next 4bit: 42GB RAM, 70 tok/s decode, 1270 tok/s prefill
Qwen3-Coder-Next 6bit: 60GB RAM, 65 tok/s decode, 1090-1440 tok/s prefill
Qwen3-Coder-Next 8bit: 75GB RAM, ~45 tok/s decode, ~900 tok/s prefill
MiniMax-M2.5 4bit: 120GB RAM, 33-38 tok/s decode, 430-500 tok/s prefill

The developer recommends Qwen3-Coder-Next 6bit as the sweet spot for interactive coding, noting quality is noticeably better than 4bit (which had occasional garbled output).

Setup instructions

pip install git+https://github.com/raullenchai/vllm-mlx.git
python -c "from mlx_lm import load; load('lmstudio-community/Qwen3-Coder-Next-MLX-6bit')"
python -m vllm_mlx.server \
  --model lmstudio-community/Qwen3-Coder-Next-MLX-6bit \
  --tool-call-parser hermes \
  --prefill-step-size 8192 \
  --kv-bits 8 \
  --port 8000

Then point OpenClaw or any OpenAI SDK client at http://localhost:8000/v1.

Hardware requirements

Qwen3-Coder-Next 4bit: 42GB — fits on M2 Pro 64GB or better
Qwen3-Coder-Next 6bit: 60GB — needs M2/M3/M4 Max 96GB+ or Ultra
MiniMax-M2.5: 120GB — Ultra 192GB+ only

What didn't work

Speculative decoding with Qwen3-0.6B as draft model — mlx-lm has a known bug with Qwen3 (skips tokens, issue #846)
DeepSeek-R1-Distill-70B for OpenClaw — great at reasoning but tool calling is unreliable

The repository has 1500+ tests and is licensed under Apache 2.0.

📖 Read the full source: r/LocalLLaMA