Qwen3.5-397B Benchmark: 12.99 tok/s on M5 Max with 4-bit

Performance Results

A user benchmarked the flash-moe implementation on an M5 Max MacBook Pro with 128GB unified memory, running the mlx-community/Qwen3.5-397B-A17B-4bit model. The original benchmark by Dan Woods on an M3 Max with 48GB RAM achieved 4.36 tokens per second. On the M5 Max, the baseline configuration (4-bit quantization, no cache-io-split) reached 12.48 tok/s. With the best-performing setting, --cache-io-split 4, throughput rose to 12.99 tok/s, roughly three times the original benchmark's figure.
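The speedup figures follow directly from the numbers quoted above. A minimal Python sketch of the arithmetic, where the constant and function names are illustrative and only the three tok/s values come from the benchmark:

```python
# Throughput figures quoted in the text (tokens per second).
M3_MAX_TOK_S = 4.36        # original benchmark, M3 Max with 48GB RAM
M5_BASELINE_TOK_S = 12.48  # M5 Max, 4-bit, no cache-io-split
M5_SPLIT4_TOK_S = 12.99    # M5 Max, 4-bit, --cache-io-split 4


def speedup(new_tok_s: float, old_tok_s: float) -> float:
    """Return how many times faster new_tok_s is than old_tok_s."""
    return new_tok_s / old_tok_s


baseline_speedup = speedup(M5_BASELINE_TOK_S, M3_MAX_TOK_S)
split4_speedup = speedup(M5_SPLIT4_TOK_S, M3_MAX_TOK_S)
# Relative gain of split 4 over the M5 Max baseline, in percent.
split4_gain_pct = (M5_SPLIT4_TOK_S / M5_BASELINE_TOK_S - 1) * 100

print(f"M5 Max baseline vs M3 Max: {baseline_speedup:.2f}x")
print(f"M5 Max split 4 vs M3 Max:  {split4_speedup:.2f}x")
print(f"Split 4 vs M5 baseline:    {split4_gain_pct:+.1f}%")
```

Most of the gain over the M3 Max result is already present in the baseline run; the split setting itself adds only a few percent on top.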

Cache-IO-Split Analysis

The user performed a full sweep of cache-io-split values using the Anemll fork of flash-moe, which adds Metal 4 NAX support for M5+ chips. The results show that splits 2 and 3 reduce throughput by about 20% relative to no split, while split 4 gives the highest throughput:

cache-io-split 1 (none): 12.48 tok/s, 28.4ms expert I/O per token
cache-io-split 2: 9.94 tok/s, 28.2ms expert I/O per token
cache-io-split 3: 9.99 tok/s, 36.1ms expert I/O per token
cache-io-split 4: 12.99 tok/s, 25.9ms expert I/O per token
cache-io-split 5: 12.64 tok/s, 27.5ms expert I/O per token
cache-io-split 8: 12.90 tok/s, 26.4ms expert I/O per token

The analysis suggests that split 4 aligns with the M5 Max SSD controller's internal parallelism, while higher values add scheduling overhead. The recommendation is to use --cache-io-split 4 or no split at all, avoiding splits 2 and 3.
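To make the sweep easier to compare, the results can be expressed relative to the no-split baseline. The following Python sketch only re-tabulates the figures listed above; the variable names are illustrative, and it does not run or depend on flash-moe itself:

```python
# (tok/s, expert I/O ms per token) for each --cache-io-split value,
# copied from the sweep results listed above.
SWEEP = {
    1: (12.48, 28.4),  # 1 means no split (baseline)
    2: (9.94, 28.2),
    3: (9.99, 36.1),
    4: (12.99, 25.9),
    5: (12.64, 27.5),
    8: (12.90, 26.4),
}

baseline_tok_s = SWEEP[1][0]

# Split value with the highest throughput.
best_split = max(SWEEP, key=lambda s: SWEEP[s][0])

# Percentage change in throughput relative to the baseline.
pct_vs_baseline = {
    split: (tok_s / baseline_tok_s - 1) * 100
    for split, (tok_s, _io_ms) in SWEEP.items()
}

# Splits that are strictly slower than running with no split.
slower_than_baseline = sorted(s for s, pct in pct_vs_baseline.items() if pct < 0)

for split, (tok_s, io_ms) in sorted(SWEEP.items()):
    print(f"split {split}: {tok_s:5.2f} tok/s, {io_ms:4.1f} ms I/O, "
          f"{pct_vs_baseline[split]:+6.1f}% vs baseline")
print(f"best split: {best_split}, slower than baseline: {slower_than_baseline}")
```

One detail this makes visible: split 2 reports slightly lower expert I/O time than the baseline (28.2ms versus 28.4ms) yet is about 20% slower overall, so the per-token I/O figure alone does not account for the drop.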

Quantization Comparison

Testing 2-bit against 4-bit quantization showed that 2-bit offers no speed advantage on the M5 Max: the SSD is fast enough that the smaller files do not help, and dequantization overhead cancels any remaining gain. Quality also suffers significantly with 2-bit:

4-bit: 12.99 tok/s, 3.64 perplexity on WikiText-2
2-bit: ~12.65 tok/s, 5.71 perplexity on WikiText-2 (57% higher, where lower is better)

The conclusion is to use 4-bit quantization for better quality without sacrificing speed.
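The trade-off can be checked with a few lines of Python. This is a sketch over the two reported data points only; the names are illustrative, and the 2-bit throughput is the approximate value given above:

```python
# Reported results: (tok/s, WikiText-2 perplexity). Lower perplexity is better.
FOUR_BIT = (12.99, 3.64)
TWO_BIT = (12.65, 5.71)  # throughput reported as approximate (~12.65)

# Relative perplexity increase when going from 4-bit to 2-bit, in percent.
ppl_increase_pct = (TWO_BIT[1] / FOUR_BIT[1] - 1) * 100

# Relative throughput change when going from 4-bit to 2-bit, in percent.
speed_change_pct = (TWO_BIT[0] / FOUR_BIT[0] - 1) * 100

print(f"perplexity: {ppl_increase_pct:+.0f}%")
print(f"throughput: {speed_change_pct:+.1f}%")
```

On these numbers, 2-bit is slightly slower as well as markedly worse in perplexity, which is what makes 4-bit the clear choice here.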

Technical Details

The benchmark used the Anemll fork available at https://github.com/Anemll/flash-moe. Sustained performance remained stable at 11.23 tok/s over 1000 tokens with no degradation. The user noted that background processes using Metal/GPU, such as LM Studio, can significantly reduce throughput and should be closed before benchmarking.

📖 Read the full source: r/LocalLLaMA