Apple Silicon Benchmark: Qwen3-VL Performance on M3, M4, and M5 Max for Vision LLM Classification

Benchmark Setup and Hardware
A vision LLM classification pipeline was tested on technical drawings (PDFs at various megapixel resolutions) using LM Studio with the MLX backend and streaming enabled, with the same 53-file test dataset and the same prompt on every machine. The task is classification: the model analyzes an image and returns a short structured JSON response (~300-400 tokens), so inference is heavily prefill-dominated with little token generation.
Hardware tested:
- M3 Max: 40 GPU cores, 48 GB RAM, 400 GB/s memory bandwidth
- M4 Max Studio: 40 GPU cores, 64 GB RAM, 546 GB/s memory bandwidth
- M5 Max: 40 GPU cores, 64 GB RAM, 614 GB/s memory bandwidth
Models Tested
- Qwen3-VL 8B: 8B parameters, 4-bit MLX quantization, ~5.8 GB on disk
- Qwen3.5 9B: 9B parameters (dense, hybrid attention), 4-bit MLX quantization, ~6.2 GB on disk
- Qwen3-VL 32B: 32B parameters, 4-bit MLX quantization, ~18 GB on disk
8B Model Results
Total time per image for Qwen3-VL 8B (4-bit):
- 4 MP: M3 Max 48GB: 16.5s, M4 Studio 64GB: 15.8s, M5 Max 64GB: 9.0s (M5 is 83% faster than M3)
- 5 MP: M3 Max: 20.3s, M4 Studio: 19.8s, M5 Max: 11.5s (77% faster)
- 6 MP: M3 Max: 24.1s, M4 Studio: 24.4s, M5 Max: 14.0s (72% faster)
- 7.5 MP: M4 Studio: 32.7s, M5 Max: 20.3s
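The M3 Max and M4 Studio are effectively tied on the 8B model: total time per image differs by less than 5% at every resolution measured on both, and the M4 is marginally slower at 6 MP, even though it has about 37% more memory bandwidth. The M5 Max is roughly 72-83% faster than either at 4-6 MP.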
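The percentages quoted for the 8B model follow from the ratio of total times. A minimal Python sketch of that arithmetic, using only the 4-6 MP figures reported here (the function name `percent_faster` and the dictionary layout are illustrative, not from the original post):

```python
# Total seconds per image for Qwen3-VL 8B (4-bit), copied from the list above.
# Keys are megapixels; the dict layout is an illustrative choice.
TIMES_8B = {
    4: {"M3 Max": 16.5, "M4 Studio": 15.8, "M5 Max": 9.0},
    5: {"M3 Max": 20.3, "M4 Studio": 19.8, "M5 Max": 11.5},
    6: {"M3 Max": 24.1, "M4 Studio": 24.4, "M5 Max": 14.0},
}


def percent_faster(slow_s: float, fast_s: float) -> float:
    """How much faster `fast_s` is than `slow_s`, as a percent gain in throughput."""
    return (slow_s / fast_s - 1.0) * 100.0


for mp, row in TIMES_8B.items():
    m5_vs_m3 = percent_faster(row["M3 Max"], row["M5 Max"])
    m4_vs_m3 = percent_faster(row["M3 Max"], row["M4 Studio"])
    print(f"{mp} MP: M5 vs M3 {m5_vs_m3:.0f}%, M4 vs M3 {m4_vs_m3:+.1f}%")
```

Run as-is, the M5-versus-M3 column comes out at 83%, 77%, and 72%, and the M4-versus-M3 column goes slightly negative at 6 MP.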
Why M3 and M4 Have Similar Speed
Prefill (prompt processing) is limited by GPU compute, not memory bandwidth. The M3 Max and M4 Max both have 40 GPU cores, and their prefill speed is nearly identical. For vision models, prefill dominates: TTFT (time to first token) is 70-85% of total inference time because the vision encoder does heavy compute work per image.
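One way to measure the TTFT share yourself is to timestamp the first streamed chunk. The sketch below is a hypothetical helper, not the benchmark's actual harness: it accepts any iterable of text chunks (for example, the content deltas of a streaming chat-completion response) and an injectable clock, so it can be checked without a model running.

```python
import time
from typing import Callable, Iterable, NamedTuple


class StreamTiming(NamedTuple):
    ttft_s: float   # seconds from request start to first non-empty chunk
    total_s: float  # seconds from request start to end of stream
    chunks: int     # number of non-empty chunks received


def time_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a stream of text chunks and record TTFT and total time."""
    start = clock()
    first = None
    count = 0
    for piece in chunks:
        if not piece:  # skip empty keep-alive or role-only chunks
            continue
        if first is None:
            first = clock()  # first real token arrived: prefill is done
        count += 1
    end = clock()
    ttft = (first - start) if first is not None else float("nan")
    return StreamTiming(ttft, end - start, count)


# Example with a fake clock: start at 0 s, first chunk at 8 s, end at 10 s.
ticks = iter([0.0, 8.0, 10.0])
timing = time_stream(['{"class":', ' "bracket"}', ""], clock=lambda: next(ticks))
print(f"TTFT share: {timing.ttft_s / timing.total_s:.0%}")
```

With a real streaming response, the same function can be fed a generator that yields each chunk's text as it arrives; the TTFT share is then `ttft_s / total_s`.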
The M4 does show its bandwidth advantage in token generation: 76-80 T/s versus the M3's 60-64 T/s (about 25% faster), in line with, though smaller than, its 37% bandwidth advantage (546 vs 400 GB/s). However, for classification tasks with short outputs (~300-400 tokens), generation is only ~15% of total time, so the 25% generation speedup translates to just a 3-5% end-to-end improvement.
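The step from a 25% generation speedup to a 3-5% end-to-end gain is Amdahl's-law arithmetic: only the generation share of the run gets faster. A small sketch, assuming the generation shares and the 1.25x speedup quoted in this section (the function name is illustrative):

```python
def time_saved_fraction(gen_share: float, gen_speedup: float) -> float:
    """Fraction of total time saved when only token generation speeds up.

    gen_share:   fraction of total time spent generating tokens (0..1)
    gen_speedup: how many times faster generation becomes (1.25 = 25% faster)
    """
    new_total = (1.0 - gen_share) + gen_share / gen_speedup
    return 1.0 - new_total


# ~15% of the run is generation, and generation is 25% faster.
print(f"{time_saved_fraction(0.15, 1.25):.1%}")
# A 25% generation share (still inside the 70-85% TTFT range), same speedup.
print(f"{time_saved_fraction(0.25, 1.25):.1%}")
```

These come out at about 3% and 5% of total time, which brackets the 3-5% end-to-end figure.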
32B Model Results
Total time per image for Qwen3-VL 32B (4-bit):
- 2 MP: M3 Max 48GB: 47.6s, M4 Studio 64GB: 35.3s, M5 Max 64GB: 21.2s
- 4 MP: M3 Max: 63.2s, M4 Studio: 50.0s, M5 Max: 27.4s
- 5 MP: M3 Max: 72.9s, M4 Studio: 59.2s, M5 Max: 30.7s
- 6 MP: M3 Max: 85.3s, M4 Studio: 78.0s, M5 Max: 35.6s
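The 32B timings are listed without relative figures. As a rough aid, the sketch below divides the M3 Max time by each chip's time at every resolution; the table values are copied from the list above, while the helper name `speedup_vs` is an illustrative assumption.

```python
# Total seconds per image for Qwen3-VL 32B (4-bit), from the list above.
# Keys are megapixels.
TIMES_32B = {
    2: {"M3 Max": 47.6, "M4 Studio": 35.3, "M5 Max": 21.2},
    4: {"M3 Max": 63.2, "M4 Studio": 50.0, "M5 Max": 27.4},
    5: {"M3 Max": 72.9, "M4 Studio": 59.2, "M5 Max": 30.7},
    6: {"M3 Max": 85.3, "M4 Studio": 78.0, "M5 Max": 35.6},
}


def speedup_vs(baseline: str, row: dict[str, float]) -> dict[str, float]:
    """Return time(baseline) / time(chip) for every chip in one row."""
    base = row[baseline]
    return {chip: base / secs for chip, secs in row.items()}


for mp, row in TIMES_32B.items():
    ratios = speedup_vs("M3 Max", row)
    print(f"{mp} MP: " + ", ".join(f"{chip} {r:.2f}x" for chip, r in ratios.items()))
```

On these numbers the M4 Studio lands between about 1.09x and 1.35x the M3 Max, and the M5 Max between about 2.25x and 2.40x.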
For longer generation tasks like summarization, description, or code generation, the M4's bandwidth advantage would matter more than in this classification workload.
📖 Read the full source: r/LocalLLaMA