Flash-MOE Benchmark on M5 Max: 12.99 tok/s with Qwen3.5-397B

Performance Results
A user benchmarked the flash-moe implementation on an M5 Max MacBook Pro with 128GB unified memory, running the mlx-community/Qwen3.5-397B-A17B-4bit model. The original benchmark by Dan Woods on an M3 Max with 48GB RAM achieved 4.36 tokens per second. On the M5 Max, the baseline configuration with 4-bit quantization and no cache-io-split reached 12.48 tok/s. With the best-performing --cache-io-split 4 setting, throughput rose to 12.99 tok/s, about 4% above the baseline and roughly three times the original benchmark's throughput (12.99 / 4.36 ≈ 2.98x).
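The ratios behind these figures are easy to check by hand. The sketch below is only an illustration of that arithmetic: the constants are the numbers quoted above, and the helper names `speedup` and `ms_per_token` are hypothetical, not part of flash-moe.

```python
# Throughput figures quoted in the benchmark, in tokens per second.
M3_MAX_TOK_S = 4.36      # original run on M3 Max, 48GB
M5_BASELINE_TOK_S = 12.48  # M5 Max, 4-bit, no cache-io-split
M5_BEST_TOK_S = 12.99      # M5 Max, 4-bit, --cache-io-split 4


def speedup(new_tok_s: float, old_tok_s: float) -> float:
    """Return how many times as fast the new run is compared with the old one."""
    return new_tok_s / old_tok_s


def ms_per_token(tok_s: float) -> float:
    """Convert tokens per second into average milliseconds per token."""
    return 1000.0 / tok_s


if __name__ == "__main__":
    print(f"M5 baseline vs M3 Max: {speedup(M5_BASELINE_TOK_S, M3_MAX_TOK_S):.2f}x")
    print(f"M5 best vs M3 Max:     {speedup(M5_BEST_TOK_S, M3_MAX_TOK_S):.2f}x")
    print(f"M5 best vs M5 baseline: {speedup(M5_BEST_TOK_S, M5_BASELINE_TOK_S):.3f}x")
    print(f"M5 best latency: {ms_per_token(M5_BEST_TOK_S):.1f} ms/token")
```

Expressing throughput as milliseconds per token makes it easier to compare against the per-token expert I/O times reported in the sweep.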
Cache-IO-Split Analysis
The user swept cache-io-split values using the Anemll fork of flash-moe, which adds Metal 4 NAX support for M5+ chips. Splits 2 and 3 reduce throughput well below the unsplit baseline, while split 4 gives the highest throughput:
- cache-io-split 1 (none): 12.48 tok/s, 28.4ms expert I/O per token
- cache-io-split 2: 9.94 tok/s, 28.2ms expert I/O per token
- cache-io-split 3: 9.99 tok/s, 36.1ms expert I/O per token
- cache-io-split 4: 12.99 tok/s, 25.9ms expert I/O per token
- cache-io-split 5: 12.64 tok/s, 27.5ms expert I/O per token
- cache-io-split 8: 12.90 tok/s, 26.4ms expert I/O per token
The analysis suggests that split 4 aligns with the M5 Max SSD controller's internal parallelism, while higher values add scheduling overhead. Splits 2 and 3 cost roughly 20% of baseline throughput, so the recommendation is to use --cache-io-split 4 or no split at all.
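To make the sweep easier to compare, the results can be expressed relative to the unsplit baseline. This is a minimal sketch over the numbers listed above. The `SWEEP` table and the helper names are illustrative assumptions, and nothing here calls flash-moe itself.

```python
from typing import Dict, Tuple

# split value -> (tokens per second, expert I/O in ms per token),
# copied from the sweep results listed above.
SWEEP: Dict[int, Tuple[float, float]] = {
    1: (12.48, 28.4),  # no split (baseline)
    2: (9.94, 28.2),
    3: (9.99, 36.1),
    4: (12.99, 25.9),
    5: (12.64, 27.5),
    8: (12.90, 26.4),
}


def percent_vs_baseline(sweep: Dict[int, Tuple[float, float]],
                        baseline_split: int = 1) -> Dict[int, float]:
    """Percentage change in tok/s for each split relative to the baseline split."""
    base_tok_s = sweep[baseline_split][0]
    return {
        split: (tok_s / base_tok_s - 1.0) * 100.0
        for split, (tok_s, _io_ms) in sweep.items()
    }


def best_split(sweep: Dict[int, Tuple[float, float]]) -> int:
    """Return the split value with the highest tokens per second."""
    return max(sweep, key=lambda split: sweep[split][0])


if __name__ == "__main__":
    for split, delta in sorted(percent_vs_baseline(SWEEP).items()):
        tok_s, io_ms = SWEEP[split]
        print(f"split {split}: {tok_s:5.2f} tok/s ({delta:+6.2f}%), {io_ms} ms I/O")
    print("best split:", best_split(SWEEP))
```

Note that split 2 reports slightly lower expert I/O time than the baseline yet is about 20% slower, so the reported I/O time alone does not account for the slowdown.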
Quantization Comparison
Testing 2-bit versus 4-bit quantization showed that 2-bit offers no speed advantage on the M5 Max: the SSD is fast enough that smaller expert files do not help, and dequantization overhead cancels any remaining gain. Quality suffers significantly with 2-bit:
- 4-bit: 12.99 tok/s, 3.64 perplexity on WikiText-2
- 2-bit: ~12.65 tok/s, 5.71 perplexity on WikiText-2 (57% higher, i.e. worse)
The conclusion is to use 4-bit quantization for better quality without sacrificing speed.
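The trade-off can be stated as two percentage changes, one for speed and one for perplexity. The snippet below is a sketch using only the figures quoted above. `pct_change` is a hypothetical helper name, and the 2-bit speed is the approximate value reported.

```python
# (tokens per second, WikiText-2 perplexity) as quoted above.
FOUR_BIT = (12.99, 3.64)
TWO_BIT = (12.65, 5.71)  # tok/s is approximate in the source


def pct_change(new: float, old: float) -> float:
    """Percentage change from `old` to `new` (positive means `new` is larger)."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    speed_delta = pct_change(TWO_BIT[0], FOUR_BIT[0])
    ppl_delta = pct_change(TWO_BIT[1], FOUR_BIT[1])
    # Lower perplexity is better, so a positive change here is a quality loss.
    print(f"2-bit vs 4-bit speed:      {speed_delta:+.1f}%")
    print(f"2-bit vs 4-bit perplexity: {ppl_delta:+.1f}%")
```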
Technical Details
The benchmark used the Anemll fork available at https://github.com/Anemll/flash-moe. Over a sustained 1000-token generation, throughput held at 11.23 tok/s with no degradation during the run. The user noted that background processes using Metal/GPU, such as LM Studio, can significantly reduce throughput and should be closed before benchmarking.
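One way to check for degradation over a long generation is to compute throughput in fixed-size windows and compare the first window with the last. The sketch below assumes you have recorded a timestamp for each emitted token. The function names, the 100-token window, and the 5% tolerance are all illustrative assumptions, not how the original benchmark measured stability.

```python
from typing import List, Sequence


def windowed_tok_s(timestamps: Sequence[float], window: int = 100) -> List[float]:
    """Tokens per second for each consecutive block of `window` tokens.

    `timestamps[i]` is the time in seconds at which token i was emitted,
    so N generated tokens need N + 1 timestamps (including the start time).
    """
    rates = []
    for start in range(0, len(timestamps) - window, window):
        elapsed = timestamps[start + window] - timestamps[start]
        rates.append(window / elapsed)
    return rates


def degraded(rates: Sequence[float], tolerance: float = 0.05) -> bool:
    """True if the last window is more than `tolerance` slower than the first."""
    if len(rates) < 2:
        return False
    return rates[-1] < rates[0] * (1.0 - tolerance)


if __name__ == "__main__":
    # Synthetic example: 1000 tokens emitted at a constant 11.23 tok/s.
    stamps = [i / 11.23 for i in range(1001)]
    rates = windowed_tok_s(stamps)
    print([round(r, 2) for r in rates])
    print("degraded:", degraded(rates))
```

Comparing windows rather than a single average distinguishes a run that is uniformly slower from one that slows down over time, for example from thermal throttling.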
📖 Read the full source: r/LocalLLaMA