PrismML's Bonsai 1-bit Qwen models tested: 107 t/s generation on 8GB VRAM

Bonsai models: 1-bit Qwen quantization from PrismML
PrismML has released Bonsai, a set of 1-bit quantized versions of Qwen3 models (8B, 4B, and 1.7B parameters). These models use extreme quantization to dramatically reduce memory requirements while maintaining usable performance for certain tasks.
Performance benchmarks from testing
Testing on an RTX 4060 with 8GB VRAM showed:
- 107 tokens/second generation speed
- More than 1,114 tokens/second prompt processing
- Significantly lower RAM usage compared to Q4 quantized models
For comparison, Qwen 3.5 4B at Q4 reached 56 t/s with the same prompts on the same hardware, roughly half the Bonsai figure.
Practical implications
The reduced memory footprint enables running 8B parameter models on 8GB VRAM systems. Smaller models can be used with longer context windows due to the memory savings.
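To make the memory argument concrete, here is a rough back-of-envelope sketch of weight storage at different bit widths. It counts weight bits only and uses the nominal parameter counts; per-group scale factors, the KV cache, activations, and runtime overhead are all ignored, so real files and real VRAM usage will be larger. The function name `weight_gib` is just an illustrative helper.

```python
def weight_gib(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB, ignoring scales and metadata."""
    return params * bits_per_weight / 8 / 2**30

# Nominal parameter counts for the three Bonsai sizes mentioned above.
for name, params in [("8B", 8e9), ("4B", 4e9), ("1.7B", 1.7e9)]:
    row = ", ".join(
        f"{bits}-bit: {weight_gib(params, bits):.2f} GiB" for bits in (16, 4, 1)
    )
    print(f"{name}: {row}")
```

Under these assumptions, the weights of an 8B model take about 0.93 GiB at 1 bit versus about 3.73 GiB at a nominal 4 bits, which leaves more of an 8GB card free for context.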
Quality assessment
Initial testing focused on text summarization, where the model performed well. The tester noted that they did not evaluate coding or tool use.
Technical limitations
The current implementation has CPU inference issues. When tested on a GPU-less mini PC:
- The llama.cpp fork compiles successfully
- The model loads but hangs during prompt processing
- Analysis suggests there is no dedicated CPU implementation: the fork likely dequantizes the weights to FP32 and runs regular inference, which would be extremely slow on CPU
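The suspected fallback path can be illustrated with a minimal sketch. The layout used here (one sign bit per weight, least-significant bit first, one scale per group) is a hypothetical format chosen for illustration. It is not Bonsai's actual storage format or the fork's code.

```python
def dequantize_1bit(packed: bytes, scale: float, n_weights: int) -> list:
    """Expand packed sign bits into floats: bit 1 -> +scale, bit 0 -> -scale.

    Hypothetical layout: weight i lives in bit (i % 8) of byte (i // 8).
    """
    out = []
    for i in range(n_weights):
        bit = (packed[i // 8] >> (i % 8)) & 1
        out.append(scale if bit else -scale)
    return out

weights = dequantize_1bit(bytes([0b00000101]), scale=0.5, n_weights=4)
print(weights)  # [0.5, -0.5, 0.5, -0.5]
```

If inference really works this way, every weight grows from 1 bit back to 32 bits before the matrix multiply. The CPU then does the same floating-point work as for an unquantized model, plus the unpacking cost.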
Technical potential
1-bit models could reduce not just bandwidth and memory requirements, but also compute requirements. Matrix multiplication on 1-bit matrices could use bitwise operations such as XOR, with a population count to accumulate the result, which are much cheaper than floating-point multiply-adds. Even with scaling to FP16 after the bitwise step, significant compute savings should be possible, potentially benefiting CPU-only inference and edge computing scenarios.
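The bitwise idea can be sketched as follows. This assumes both operands are reduced to +1/-1 signs, which may not match how Bonsai treats activations. The helper names `pack_signs` and `binary_dot` are illustrative only. For two sign vectors of length n, the dot product equals n minus twice the number of positions where the bits differ.

```python
def pack_signs(values):
    """Pack +1/-1 values into an int; bit i is set when values[i] is +1."""
    word = 0
    for i, v in enumerate(values):
        if v > 0:
            word |= 1 << i
    return word

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two +/-1 vectors of length n from their packed bits."""
    mismatches = bin(a_bits ^ b_bits).count("1")  # XOR, then popcount
    return n - 2 * mismatches

a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(x * y for x, y in zip(a, b))
assert binary_dot(pack_signs(a), pack_signs(b), len(a)) == reference
```

On real hardware one XOR plus one popcount instruction would cover 64 weights at a time. A single floating-point scale can then be applied to the integer result, which is where the potential compute savings come from.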
Setup details
The tester's setup:
- The 8B Bonsai model
- PrismML's llama.cpp fork
- Windows with CUDA
Source: r/LocalLLaMA