Qwen3-VL Benchmark: M5 Max 75-83% Faster vs M3/M4

Benchmark Setup and Hardware

A vision LLM classification pipeline was tested on technical drawings (PDFs at various megapixel resolutions) using LM Studio with the MLX backend and streaming enabled. Every machine ran the same 53-file test dataset with the same prompt. The task is classification: the model analyzes an image and returns a short structured JSON response (~300-400 tokens), so inference is heavily prefill-dominated with minimal token generation.

Hardware tested:

M3 Max: 40 GPU cores, 48 GB RAM, 400 GB/s memory bandwidth
M4 Max Studio: 40 GPU cores, 64 GB RAM, 546 GB/s memory bandwidth
M5 Max: 40 GPU cores, 64 GB RAM, 614 GB/s memory bandwidth

Models Tested

Qwen3-VL 8B: 8B parameters, 4-bit MLX quantization, ~5.8 GB on disk
Qwen3.5 9B: 9B parameters (dense, hybrid attention), 4-bit MLX quantization, ~6.2 GB on disk
Qwen3-VL 32B: 32B parameters, 4-bit MLX quantization, ~18 GB on disk

8B Model Results

Total time per image for Qwen3-VL 8B (4-bit):

4 MP: M3 Max 48GB: 16.5s, M4 Studio 64GB: 15.8s, M5 Max 64GB: 9.0s (M5 is 83% faster than M3)
5 MP: M3 Max: 20.3s, M4 Studio: 19.8s, M5 Max: 11.5s (77% faster)
6 MP: M3 Max: 24.1s, M4 Studio: 24.4s, M5 Max: 14.0s (72% faster)
7.5 MP: M4 Studio: 32.7s, M5 Max: 20.3s

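The M3 Max and M4 Studio are basically identical on the 8B model: total inference time differs by less than 5% (and the M4 is slightly slower at 6 MP) despite the M4 having 37% more memory bandwidth. The M5 Max is roughly 72-83% faster than the M3 Max and roughly 61-76% faster than the M4 Studio.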
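The "% faster" figures can be reproduced from the rows above by dividing the slower time by the faster one. The sketch below assumes that "X% faster" means the time ratio minus one. The helper names `percent_faster` and `speedups` are illustrative and do not come from the original post.

```python
# Total seconds per image for Qwen3-VL 8B (4-bit), copied from the rows above.
# None marks a configuration with no reported result.
TIMES_8B = {
    "4 MP": {"M3 Max": 16.5, "M4 Studio": 15.8, "M5 Max": 9.0},
    "5 MP": {"M3 Max": 20.3, "M4 Studio": 19.8, "M5 Max": 11.5},
    "6 MP": {"M3 Max": 24.1, "M4 Studio": 24.4, "M5 Max": 14.0},
    "7.5 MP": {"M3 Max": None, "M4 Studio": 32.7, "M5 Max": 20.3},
}


def percent_faster(slow_s, fast_s):
    """Assumed definition of 'X% faster': (slow / fast - 1) * 100."""
    return (slow_s / fast_s - 1.0) * 100.0


def speedups(times, baseline, target="M5 Max"):
    """Rounded '% faster' of target over baseline, skipping missing rows."""
    out = {}
    for resolution, row in times.items():
        if row[baseline] is not None and row[target] is not None:
            out[resolution] = round(percent_faster(row[baseline], row[target]))
    return out


if __name__ == "__main__":
    for baseline in ("M3 Max", "M4 Studio"):
        print(baseline, speedups(TIMES_8B, baseline))
```

Against the M3 Max this gives 83, 77 and 72 for the 4, 5 and 6 MP rows. Against the M4 Studio it gives 76, 72, 74 and 61 for 4, 5, 6 and 7.5 MP.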

Why M3 and M4 Have Similar Speed

Prefill (prompt processing) is compute-bound: it scales with GPU compute rather than memory bandwidth. The M3 Max and M4 Studio both have 40 GPU cores, so their prefill speed is nearly identical. For vision models, prefill dominates: TTFT (time to first token) is 70-85% of total inference time because the vision encoder does heavy compute work per image.
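One way to measure the TTFT share of a streamed response is to timestamp the first chunk and the end of the stream. The sketch below is not the original benchmark harness. It assumes the response arrives as any Python iterable of text chunks (for example, whatever a streaming client yields) and treats each chunk as roughly one token. `StreamTiming` and `time_stream` are hypothetical names.

```python
import time
from typing import Callable, Iterable, NamedTuple


class StreamTiming(NamedTuple):
    ttft_s: float   # seconds until the first chunk arrived (prefill)
    total_s: float  # seconds until the stream ended
    chunks: int     # number of streamed chunks (assumed ~ tokens)

    @property
    def prefill_share(self) -> float:
        """Fraction of total time spent before the first chunk."""
        return self.ttft_s / self.total_s if self.total_s > 0 else 0.0

    @property
    def gen_rate(self) -> float:
        """Chunks per second after the first chunk (approximate T/s)."""
        gen_s = self.total_s - self.ttft_s
        if gen_s <= 0 or self.chunks < 2:
            return 0.0
        return (self.chunks - 1) / gen_s


def time_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a stream and record TTFT, total time and chunk count."""
    start = clock()
    first = None
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now  # first chunk marks the end of prefill
        count += 1
    end = clock()
    ttft = (first - start) if first is not None else (end - start)
    return StreamTiming(ttft, end - start, count)
```

The `clock` parameter exists only so the timing logic can be checked with a fake clock. In real use, the default `time.perf_counter` is enough.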

The M4 does show its bandwidth advantage in token generation: 76-80 T/s vs the M3's 60-64 T/s (about 25% faster), roughly in line with, though smaller than, the 37% bandwidth gap (546 vs 400 GB/s). However, for classification tasks with short outputs (~300-400 tokens), generation is only ~15% of total time, so the 25% generation speed advantage translates to just a 3-5% end-to-end improvement.
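The arithmetic behind that 3-5% figure is an Amdahl's-law style estimate: only the generation share of the run gets faster. The sketch below assumes prefill time is unchanged between the two machines. The function name `end_to_end_gain` and the 15%, 25% and 80% generation shares are illustrative inputs, not measurements.

```python
def end_to_end_gain(gen_fraction, gen_speedup):
    """Fraction of total time saved when only generation gets faster.

    gen_fraction: share of baseline total time spent generating (0..1)
    gen_speedup:  generation speed ratio, e.g. 1.25 for 25% faster
    """
    if not 0.0 <= gen_fraction <= 1.0:
        raise ValueError("gen_fraction must be between 0 and 1")
    if gen_speedup <= 0:
        raise ValueError("gen_speedup must be positive")
    # Prefill share is assumed unchanged; the generation share shrinks.
    new_total = (1.0 - gen_fraction) + gen_fraction / gen_speedup
    return 1.0 - new_total


if __name__ == "__main__":
    # Illustrative shares: short classification output vs. a long generation.
    for share in (0.15, 0.25, 0.80):
        saved = end_to_end_gain(share, 1.25)
        print(f"generation share {share:.0%}: {saved:.1%} of total time saved")
```

With a 15% generation share the saving is 3% of total time, and with 25% it is 5%. With a hypothetical 80% share, as in a long-generation task, the same 25% speedup saves 16%.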

32B Model Results

Total time per image for Qwen3-VL 32B (4-bit):

2 MP: M3 Max 48GB: 47.6s, M4 Studio 64GB: 35.3s, M5 Max 64GB: 21.2s
4 MP: M3 Max: 63.2s, M4 Studio: 50.0s, M5 Max: 27.4s
5 MP: M3 Max: 72.9s, M4 Studio: 59.2s, M5 Max: 30.7s
6 MP: M3 Max: 85.3s, M4 Studio: 78.0s, M5 Max: 35.6s

For longer generation tasks like summarization, description, or code generation, the M4's bandwidth advantage would matter more than in this classification workload.

📖 Read the full source: r/LocalLLaMA