PrismML's Bonsai 1-bit Qwen models tested: 107 t/s generation on 8GB VRAM

✍️ OpenClawRadar
📅 Published: April 5, 2026

Bonsai models: 1-bit Qwen quantization from PrismML

PrismML has released Bonsai, a set of 1-bit quantized versions of Qwen3 models (8B, 4B, and 1.7B parameters). These models use extreme quantization to dramatically reduce memory requirements while maintaining usable performance for certain tasks.

Performance benchmarks from testing

Testing on an RTX 4060 with 8GB VRAM showed:

107 tokens/second generation speed
More than 1,114 tokens/second prompt processing
Significantly lower RAM usage compared to Q4 quantized models

For comparison, Qwen 3.5 4B at Q4 reached 56 t/s with the same prompts on the same hardware, roughly half the Bonsai generation rate.
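To put the reported rates in context, the sketch below computes the generation speedup and a rough request time for a summarization job. The throughput figures come from the article; the 4,000-token prompt and 300-token summary are hypothetical sizes chosen only for illustration, and the model of "prompt time plus generation time" ignores load time and other overhead.

```python
# Figures reported in the article (RTX 4060, 8GB VRAM).
BONSAI_GEN_TPS = 107.0      # Bonsai generation speed
BONSAI_PROMPT_TPS = 1114.0  # Bonsai prompt processing (reported as a lower bound)
Q4_GEN_TPS = 56.0           # Qwen 3.5 4B Q4 speed


def request_seconds(prompt_tokens, output_tokens, prompt_tps, gen_tps):
    """Estimate wall-clock time as prompt phase plus generation phase."""
    return prompt_tokens / prompt_tps + output_tokens / gen_tps


speedup = BONSAI_GEN_TPS / Q4_GEN_TPS

# Hypothetical request: 4,000-token document, 300-token summary.
bonsai_time = request_seconds(4000, 300, BONSAI_PROMPT_TPS, BONSAI_GEN_TPS)

print(f"generation speedup: {speedup:.2f}x")
print(f"estimated request time: {bonsai_time:.1f} s")
```

Under these assumptions the speedup is about 1.91x and the request takes roughly 6.4 seconds. No prompt-processing rate was reported for the Q4 model, so the same estimate cannot be made for it.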

Practical implications

The reduced memory footprint makes it practical to run an 8B-parameter model on an 8GB VRAM system. With the smaller models, the memory saved can instead go toward longer context windows.
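A back-of-the-envelope calculation shows where the savings come from. This is a minimal sketch that counts raw weight bits only; it assumes exactly 8 billion parameters and ignores per-group scales, the KV cache, and activations, so real model files will be somewhat larger than these numbers.

```python
def weight_gib(params_billions, bits_per_weight):
    """Raw weight storage in GiB, ignoring scales, KV cache and activations."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Nominal bit widths; actual quantization formats add metadata on top.
for name, bits in [("FP16", 16), ("4-bit", 4), ("1-bit", 1)]:
    print(f"8B at {name}: {weight_gib(8, bits):.2f} GiB")
```

By this estimate an 8B model needs about 14.90 GiB at FP16, 3.73 GiB at 4 bits, and 0.93 GiB at 1 bit, which is why most of an 8GB card is left free for context.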

Quality assessment

Initial testing focused on text summarization, where the model performed well. The tester did not evaluate coding or tool use.

Technical limitations

The current implementation has CPU inference issues. When tested on a GPU-less mini PC:

The llama.cpp fork compiles successfully
The model loads but hangs during prompt processing
Analysis suggests the fork has no CPU implementation for the 1-bit format: it likely dequantizes the weights to FP32 and attempts regular inference, which would be extremely slow on CPU

Technical potential

1-bit models could reduce not just bandwidth and memory requirements, but also compute requirements. Matrix multiplication on 1-bit matrices could use XOR and bit-count operations, which are much cheaper than floating-point multiply-adds. Even with scaling to FP16 after the XOR step, significant compute savings should be possible, which could benefit CPU-only inference and edge computing.
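The XOR idea can be illustrated with a small sketch. It assumes both operands are reduced to +1/-1 values packed one bit each, which is the classic binary-network setting; the article does not say that Bonsai binarizes activations, and the helper names and the scale value here are hypothetical. With that encoding, matching bits contribute +1 and differing bits contribute -1, so the dot product equals the length minus twice the number of differing bits.

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an int, one bit each (bit set means -1)."""
    bits = 0
    for i, v in enumerate(values):
        if v < 0:
            bits |= 1 << i
    return bits


def xor_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.

    dot = n - 2 * popcount(a XOR b)
    """
    mismatches = bin(a_bits ^ b_bits).count("1")
    return n - 2 * mismatches


a = [1, -1, -1, 1, 1, -1, 1, 1]
b = [1, 1, -1, -1, 1, -1, -1, 1]

reference = sum(x * y for x, y in zip(a, b))          # ordinary multiply-add
fast = xor_dot(pack_signs(a), pack_signs(b), len(a))  # XOR plus bit count

scale = 0.05            # hypothetical per-group scale applied afterwards
scaled = fast * scale   # the single floating-point step

print(reference, fast, scaled)
```

Both paths give the same integer result, and only the final scaling touches floating point. On real hardware the XOR and bit count would run over 64 or more weights per instruction rather than through Python integers.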

Setup details

The tester's setup:

The 8B Bonsai model
PrismML's llama.cpp fork
Windows with CUDA

📖 Read the full source: r/LocalLLaMA
