Benchmark: MLX vs Ollama Running Qwen3-Coder-Next 8-Bit on M5 Max MacBook Pro

A benchmark was conducted comparing two local inference backends—MLX (Apple's native ML framework) and Ollama (llama.cpp-based)—running the same Qwen3-Coder-Next model in 8-bit quantization on Apple Silicon. The goal was to measure raw throughput (tokens per second), time to first token (TTFT), and overall coding capability across real-world programming tasks.
Methodology
The setup used:
- MLX backend: mlx-lm v0.29.1 serving mlx-community/Qwen3-Coder-Next-8bit via its built-in OpenAI-compatible HTTP server on port 8080.
- Ollama backend: Ollama serving qwen3-coder-next:Q8_0 via its OpenAI-compatible API on port 11434.
Both backends were accessed through the same Python benchmark harness using the OpenAI client library with streaming enabled. Each test was run 3 iterations per prompt, with results averaged and excluding the first iteration's TTFT for the initial cold-start prompt (model load).
Test Suite
Six prompts covered a spectrum of coding tasks:
- Short Completion: Write a palindrome check function (150 max tokens)
- Medium Generation: Implement an LRU cache class with type hints (500 max tokens)
- Long Reasoning: Explain async/await vs threading with examples (1000 max tokens)
- Debug Task: Find and fix bugs in merge sort + binary search (800 max tokens)
- Complex Coding: Thread-safe bounded blocking queue with context manager (1000 max tokens)
- Code Review: Review 3 functions for performance/correctness/style (1000 max tokens)
Results
Throughput (Tokens per Second) on M5 Max with 128GB RAM:
- Short Completion: Ollama 32.51 tok/s, MLX 69.62 tok/s (MLX +114%)
- Medium Generation: Ollama 35.97 tok/s, MLX 78.28 tok/s (MLX +118%)
- Long Reasoning: Ollama 40.45 tok/s, MLX 78.29 tok/s (MLX +94%)
- Debug Task: Ollama 37.06 tok/s, MLX 74.89 tok/s (MLX +102%)
- Complex Coding: Ollama 35.84 tok/s, MLX 76.99 tok/s (MLX +115%)
- Code Review: Ollama 39.00 tok/s, MLX 74.98 tok/s (MLX +92%)
Overall average: MLX achieved approximately 72 tokens per second, roughly double Ollama's throughput. Metrics measured included tokens/sec (output tokens generated per second, higher is better), TTFT (time from request sent to first token received, lower is better), total time (wall-clock time for full response, lower is better), and memory usage measured via psutil.
📖 Read the full source: r/LocalLLaMA
👀 See Also

Claude Code Best Practice GitHub repository reaches 5,000 stars
A GitHub repository called 'claude-code-best-practice' has reached 5,000 stars. The repository was created with Claude to document best practices, tips, and workflows from both the creator and the community.

Claude Code Plugin /verify: Automated Browser Testing from Your Plan
/verify is an open-source Claude Code plugin that reads your plan, spins up a real browser via Playwright MCP, checks each requirement, and gives you a pass/fail report with screenshots.

OpenClaw Optimizer v1.18.0 released with OpenClaw v2026.3.7 alignment
OpenClaw Optimizer skill v1.18.0 is now aligned with OpenClaw v2026.3.7, adding support for new AI providers including Google Gemini 3.1 Flash-Lite and OpenAI gpt-5.4, plus new CLI commands like /session idle and /usage cost.

Claude Code v2.1.59 adds auto-memory, copy command, and shell improvements
Claude Code v2.1.59 introduces automatic context saving to auto-memory with /memory management, adds a /copy command for interactive code block selection, and improves prefix suggestions for compound bash commands.