Autoresearch Pushes Qwen3.5-397B to 20.34 tok/s on M5 Max via SSD Streaming

Hardware and Model Configuration
The experiment ran on a MacBook Pro M5 Max with 128GB of unified memory and a 40-core GPU. The model was Qwen3.5-397B-A17B with Q3-GGUF experts (Unsloth IQ3_XXS/IQ4_XS mixed precision), a Q8_0 embedding, and a Q6_K LM head. It occupies 209GB on disk, about 1.6x the 128GB of unified memory, so the weights cannot all stay resident and must be streamed from SSD during inference.
Performance Results
Decode speed reached 20.34 tok/s, with prefill at 5.52 tok/s. This is roughly a 1.9x improvement over the M5 Max starting point of 10.61 tok/s and a 4.67x improvement over Dan Woods' original baseline of 4.36 tok/s on M3 Max hardware.
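The speedup ratios can be recomputed from the quoted throughput figures. The short sketch below does only that arithmetic; the constant and function names are illustrative and do not come from the project.

```python
# Throughput figures quoted in the article, in tokens per second.
FINAL_TOK_S = 20.34        # final decode speed on M5 Max
M5_START_TOK_S = 10.61     # M5 Max starting point
M3_BASELINE_TOK_S = 4.36   # original baseline on M3 Max


def speedup(new: float, old: float) -> float:
    """Return how many times faster `new` is than `old`."""
    return new / old


m5_speedup = speedup(FINAL_TOK_S, M5_START_TOK_S)
m3_speedup = speedup(FINAL_TOK_S, M3_BASELINE_TOK_S)
print(f"vs M5 Max start:    {m5_speedup:.2f}x")
print(f"vs M3 Max baseline: {m3_speedup:.2f}x")
```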
Methodology
The researcher used the autoresearch loop methodology from Dan Woods' flash-moe project, running it with Claude Code (Anthropic) to systematically execute and evaluate 36 experiments. Each experiment's results were logged before the next one started, and a perplexity threshold automatically gated quality to catch regressions. The researcher directed the research and made the scientific decisions; Claude Code implemented and benchmarked the changes under that direction.
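A loop of this kind might look roughly like the sketch below. It is not the flash-moe or Claude Code implementation: the `Result` record, the `run_loop` function, the keep rule (faster than the current best and within the perplexity limit), and every number in the toy experiments are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Result:
    name: str
    tok_per_s: float
    perplexity: float
    kept: bool


def run_loop(
    experiments: List[Tuple[str, Callable[[], Tuple[float, float]]]],
    baseline_tok_s: float,
    max_perplexity: float,
) -> List[Result]:
    """Run experiments in order; keep one only if it is faster than the
    current best AND its perplexity stays under the gate."""
    log: List[Result] = []
    best = baseline_tok_s
    for name, run in experiments:
        tok_s, ppl = run()                       # benchmark returns (speed, perplexity)
        kept = ppl <= max_perplexity and tok_s > best
        if kept:
            best = tok_s                         # becomes the new baseline to beat
        log.append(Result(name, tok_s, ppl, kept))  # always log before moving on
    return log


# Toy experiments with made-up numbers, purely to exercise the loop.
toy = [
    ("faster, quality ok", lambda: (11.0, 5.6)),
    ("faster, quality broken", lambda: (15.0, 900.0)),
    ("slower, quality ok", lambda: (10.0, 5.6)),
]
log = run_loop(toy, baseline_tok_s=10.61, max_perplexity=6.0)
for r in log:
    print(r.name, "kept" if r.kept else "discarded")
```

Logging discarded runs alongside kept ones is what makes a discard rate measurable afterwards.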
Technical Foundation
The work builds on Dan Woods' original flash-moe paper and on Anemll's fork, a pure C/Metal inference engine for running Qwen3.5-397B via SSD streaming on Apple Silicon. The Anemll fork added the Q3-GGUF expert support that these results depend on, and the researcher added further Metal-level optimizations on top.
Effective Optimizations
- 16 IO threads + cache-io-split=4: Instead of being read as one sequential chunk, each expert weight file is split into 4 parallel page-aligned reads that hit different SSD channels simultaneously. +1.5 tok/s
- Temporal expert prediction: Expert routing showed a 27% correlation across consecutive tokens, which allows SSD reads for predicted experts to overlap with GPU compute. +4.3 tok/s
- Q3-GGUF experts (Unsloth IQ3_XXS/IQ4_XS): Q3 was the sweet spot for payload size, with better perplexity than 4-bit (5.58 vs 5.62) while being 23% smaller. +2.3 tok/s
- CMD2 pre-encode: Eliminates a 30μs per-layer submission gap. +0.44 tok/s
- Fused Q/K/V projection kernel: Read input vector once instead of three times (Metal GPU optimization). +0.76 tok/s
- CMD2 pre-encode extended to all full-attention layers: +0.47 tok/s
Note: Gains are not perfectly additive since some optimizations interact with each other.
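The split-read idea from the first item can be sketched in a few lines. This is a Python illustration, not the engine's C code: `split_ranges`, `pread_full`, `read_split`, and the 16 KiB page size are assumptions, and whether the concurrent reads actually land on different SSD channels depends on the hardware and filesystem. It relies on `os.pread`, which is available on Unix-like systems.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

PAGE_SIZE = 16384  # assumption: 16 KiB pages, as on Apple Silicon


def split_ranges(size: int, parts: int, page: int = PAGE_SIZE) -> List[Tuple[int, int]]:
    """Split [0, size) into at most `parts` (offset, length) ranges
    whose offsets are all multiples of the page size."""
    pages = -(-size // page)                   # ceil(size / page)
    step = max(1, -(-pages // parts)) * page   # bytes per range, whole pages
    return [(off, min(step, size - off)) for off in range(0, size, step)]


def pread_full(fd: int, length: int, offset: int) -> bytes:
    """Read exactly `length` bytes at `offset`, retrying on short reads."""
    buf = bytearray()
    while len(buf) < length:
        chunk = os.pread(fd, length - len(buf), offset + len(buf))
        if not chunk:
            break  # unexpected end of file
        buf += chunk
    return bytes(buf)


def read_split(path: str, parts: int, pool: ThreadPoolExecutor) -> bytes:
    """Read a whole file as `parts` concurrent positional reads and reassemble it."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        ranges = split_ranges(size, parts)
        chunks = pool.map(lambda r: pread_full(fd, r[1], r[0]), ranges)
        return b"".join(chunks)  # pool.map preserves range order
    finally:
        os.close(fd)


# Demo on a throwaway file standing in for one expert weight file.
payload = os.urandom(100_000)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
try:
    with ThreadPoolExecutor(max_workers=16) as pool:   # 16 IO threads
        data = read_split(tmp.name, parts=4, pool=pool)  # split factor of 4
finally:
    os.unlink(tmp.name)
print(len(data), split_ranges(100_000, 4))
```

Positional reads (`pread`) let all threads share one file descriptor without racing on a shared file offset, which is why the sketch avoids `seek` followed by `read`.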
Failed Approaches
Of the 36 experiments, 78% were discarded. Failed approaches included:
- 1-bit QJL quantization: perplexity 5647, catastrophic.
- Ternary 2-bit with 84% weight sparsity: model collapsed.
- K=3 expert routing: quality collapse.
- Cross-layer prediction: 0% hit rate.
- NAX offloading: tile padding overhead cancelled the gains.
- 2-bit MLX experts: faster in isolation, but worse perplexity and no speed advantage once temporal prediction was applied to Q3.
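Figures such as the 0% hit rate for cross-layer prediction, or the 27% cross-token correlation behind temporal prediction, describe how often a predicted expert set matches the experts actually routed. One plausible way to measure this is sketched below. The source does not say exactly how its figures were computed, so the function name, the "reuse the previous token's experts" predictor, and the toy trace are all assumptions.

```python
from typing import Sequence


def temporal_hit_rate(routes: Sequence[Sequence[int]]) -> float:
    """routes[t] holds the expert ids chosen for token t in one layer.

    Predict token t's experts as those of token t-1, then return the
    fraction of actually routed experts that had been predicted."""
    hits = 0
    total = 0
    for prev, cur in zip(routes, routes[1:]):
        predicted = set(prev)
        hits += sum(1 for expert in cur if expert in predicted)
        total += len(cur)
    return hits / total if total else 0.0


# Toy routing trace for a single layer (made-up expert ids).
trace = [
    [1, 2, 3, 4],
    [1, 5, 6, 7],      # 1 of 4 experts repeats
    [5, 8, 9, 10],     # 1 of 4 experts repeats
    [11, 12, 13, 14],  # nothing repeats
]
rate = temporal_hit_rate(trace)
print(f"hit rate: {rate:.3f}")
```

A cross-layer variant would pair each layer's experts with the previous layer's for the same token instead of pairing consecutive tokens.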
Limitations and Future Work
The research is limited to a single hardware platform, so results may not generalize. Q3 quantization at this scale degrades noticeably on long-form generation: quality is acceptable for short tasks, but longer responses show artifacts. Quality was evaluated via perplexity only, not standardized benchmarks like MMLU or GPQA. This is a speed research project, not a production quality claim.
One surprising finding: Apple's Neural Engine (ANE) was completely idle during inference, drawing 0W despite offering 38 TOPS of compute. The problem is that MoE inference decides dynamically, per token, which experts to activate, while the ANE only works with static pre-compiled graphs. Batch prefill may be one place where the ANE could still be used.
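The dynamic part can be seen in a minimal top-k routing sketch: the set of experts to run is only known after the router has scored the current token. The function name, the logits, and k=2 are illustrative assumptions, not the model's actual router or expert count.

```python
from typing import List


def top_k_experts(router_logits: List[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring experts for one token."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    return sorted(order[:k])


# Made-up router scores for two different tokens over six experts.
token_a = [0.1, 2.0, -1.0, 0.7, 1.5, 0.0]
token_b = [1.9, -0.3, 0.8, 2.2, 0.1, 0.4]

experts_a = top_k_experts(token_a, k=2)
experts_b = top_k_experts(token_b, k=2)
print(experts_a)  # [1, 4]
print(experts_b)  # [0, 3]
```

Because the selected indices change from token to token, the weights that must be loaded and multiplied are data-dependent, which is hard to express as a single graph compiled ahead of time.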
📖 Read the full source: r/LocalLLaMA