Bonsai 1.7B Hits 442 t/s on M4 Max with Auto-Tuned Metal Kernels

Bonsai 1.7B — a ternary model from PrismML — has been optimized for Apple Silicon using autonomously tuned Metal kernels. The work was performed by ata, an autonomous engineering agent from Agents2Agents, which ran an agentic evolution search for 6 hours to produce custom GPU kernels.

Benchmark Results

Measured against the upstream llama.cpp at the same Bonsai/Q2_0 commit on an M4 Max (same model file, same llama-bench -p 512 -n 128 -r 10 -fa 1 -ngl 99 config):

Decode (tg128): 311.66 → 442.42 t/s (+42.0%)
Prefill (pp512): 4250.32 → 4622.63 t/s (+8.8%)

For context, the Bonsai 8B whitepaper reports MLX-upstream Q2_0 decode at 235 t/s on Apple Silicon. This build achieves 442 t/s on the 1.7B variant via custom Metal kernels (different framework, smaller model — directionally indicative of headroom in the stack).

What's Included

The build is a drop-in optimized inference package for M-series Macs (arm64 only). Inside the 358 MB tar.xz:

chat.sh — interactive REPL
complete.sh — non-interactive completion
bench.sh — reproduce the benchmarks
server.sh — OpenAI-compatible HTTP API on :8080
Bonsai-1.7B-Q2_0.gguf — the model file (442 MB)

Quick Start

tar -xJf bonsai-1.7b-ternary-M4Max.tar.xz
cd bonsai-1.7b-ternary-M4Max
./chat.sh

Technical Details

Every Metal kernel was authored and tuned by ata without human intervention. The work focused on custom GPU kernels at the matvec / FFN / KV-cache layer, shape-specialized for the Bonsai 1.7B Q2_0 decode path. Numerical output matches the reference build (verified top-1 token match). Tested on M4 Max; proportional gains expected on M1+.