Bonsai 1.7B Ternary Model Hits 442 T/s on M4 Max with Autonomously Tuned Metal Kernels

Bonsai 1.7B — a ternary model from PrismML — has been optimized for Apple Silicon using autonomously tuned Metal kernels. The work was performed by ata, an autonomous engineering agent from Agents2Agents, which ran an agentic evolution search for 6 hours to produce custom GPU kernels.
Benchmark Results
Measured against the upstream llama.cpp at the same Bonsai/Q2_0 commit on an M4 Max (same model file, same llama-bench -p 512 -n 128 -r 10 -fa 1 -ngl 99 config):
- Decode (tg128): 311.66 → 442.42 t/s (+42.0%)
- Prefill (pp512): 4250.32 → 4622.63 t/s (+8.8%)
For context, the Bonsai 8B whitepaper reports MLX-upstream Q2_0 decode at 235 t/s on Apple Silicon. This build achieves 442 t/s on the 1.7B variant via custom Metal kernels (different framework, smaller model — directionally indicative of headroom in the stack).
What's Included
The build is a drop-in optimized inference package for M-series Macs (arm64 only). Inside the 358 MB tar.xz:
chat.sh— interactive REPLcomplete.sh— non-interactive completionbench.sh— reproduce the benchmarksserver.sh— OpenAI-compatible HTTP API on :8080Bonsai-1.7B-Q2_0.gguf— the model file (442 MB)
Quick Start
tar -xJf bonsai-1.7b-ternary-M4Max.tar.xz
cd bonsai-1.7b-ternary-M4Max
./chat.shTechnical Details
Every Metal kernel was authored and tuned by ata without human intervention. The work focused on custom GPU kernels at the matvec / FFN / KV-cache layer, shape-specialized for the Bonsai 1.7B Q2_0 decode path. Numerical output matches the reference build (verified top-1 token match). Tested on M4 Max; proportional gains expected on M1+.
Caveats
- Apple Silicon only (arm64) — no Intel Mac or CPU-only builds.
- Numbers from M4 Max; M1/M2/M3 will be lower due to less memory bandwidth.
- Model is Q2_0 quantized — small accuracy delta vs F16.
📖 Read the full source: HN AI Agents
👀 See Also

From Prompting to Specification Engineering: The Planner-Worker Architecture Shift
AI development is shifting from simple chat-based prompting to a planner-worker architecture where humans act as specification engineers. This requires defining strict acceptance criteria, constraint architecture, and decomposition patterns for autonomous AI agents.

ETH Zurich Study Questions Value of AGENTS.md Files for AI Coding Agents
New research from ETH Zurich finds LLM-generated AGENTS.md files reduce AI agent task success by 3% and increase inference costs by over 20%, while human-written files offer only marginal 4% gains with similar cost increases.

Quumble Convergence Protocol v5: Cross-Architecture LLM Experiment Results
The Quumble Convergence Protocol v5 tests whether independent LLM instances converge on descriptions of imaginary creatures when given nonsense words. Results show both Claude (Opus 4.6 & Sonnet 4.6) and GPT-5.3 independently produced a small, round, soft, lavender-tinted, bioluminescent creature that hums from the word 'quumble'.

Claude Code Opus 4.6 Now Defaults to 1M Token Context Window
Claude Code's Opus 4.6 model now comes with a 1 million token context window by default, maintaining the same pricing as previous versions. This change appears to be live without an official announcement.