NexQuant: Rust-native 3-bit KV-cache engine for edge deployment

NexQuant is a Rust-native engine for running high-context models on consumer hardware that would normally struggle with memory constraints. It's positioned as a production-hardened successor to Tom Turney's TurboQuant+ research.

Key technical details
- 3-5x Memory Reduction: 14B models now fit in 4GB of VRAM or unified memory
- MSE-Only Stability: Replaces the noisy QJL paths with a stable MSE-only trajectory (27/27 logic tests passed)
- Integrated Sparse-V: Sparsity is integrated into the real-time decode loop rather than just being a benchmark feature
- Zero-Alloc Prefill: Written in 100% safe Rust, aiming for speed without the segfaults that affected the C++ prototypes
- Hardware Support: Native runtime dispatch for Metal, CUDA, and Vulkan, with CPU-AVX2/NEON backend support for older laptops and Raspberry Pi
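
To put the memory-reduction figure in context, here is a rough back-of-the-envelope sketch of KV-cache sizing at different bit widths. The `KvShape` struct, the `kv_cache_bytes` function, and the example dimensions are illustrative assumptions, not NexQuant's API or any specific model's configuration. The sketch counts only the KV cache (not model weights) and ignores per-block scale metadata that real quantizers store.

```rust
/// Hypothetical model shape. Field names and any values plugged in are
/// illustrative assumptions, not taken from NexQuant.
#[derive(Clone, Copy)]
pub struct KvShape {
    pub layers: u64,
    pub kv_heads: u64,
    pub head_dim: u64,
    pub context_len: u64,
}

/// Bytes needed for the key and value caches at a given bit width per element.
/// Ignores per-block scale/offset metadata, so real footprints are somewhat larger.
pub fn kv_cache_bytes(shape: KvShape, bits_per_elem: u64) -> u64 {
    // Factor of 2: one cache for keys, one for values.
    let elems = 2 * shape.layers * shape.kv_heads * shape.head_dim * shape.context_len;
    // Round up to whole bytes.
    (elems * bits_per_elem + 7) / 8
}
```

For an assumed shape of 40 layers, 8 KV heads, head dimension 128, and a 32,768-token context, this gives 5 GiB at 16 bits and under 1 GiB at 3 bits, a raw ratio of 16/3 (about 5.3x). Scale metadata would pull the effective ratio below that.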

Implementation specifics
The project uses Walsh-Hadamard Transforms and Rust GGUF parsing. It builds on Tom Turney's PolarQuant/TurboQuant+ breakthroughs that proved 3-bit KV-caches were mathematically possible. The development involved Claude (Anthropic) as a high-speed pair programmer.
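
For readers unfamiliar with the Walsh-Hadamard Transform, a minimal in-place sketch in safe Rust follows. Such rotations are commonly applied before low-bit quantization to spread outlier energy across coordinates. The source does not describe how NexQuant applies the transform, so the function names `fwht` and `fwht_orthonormal` are hypothetical and this is the textbook butterfly algorithm, not the project's implementation.

```rust
/// In-place fast Walsh-Hadamard transform (unnormalized, natural ordering).
/// Panics if the slice length is not a power of two.
pub fn fwht(data: &mut [f32]) {
    let n = data.len();
    assert!(n.is_power_of_two(), "length must be a power of two");
    let mut h = 1;
    while h < n {
        // Each block of 2*h elements is split into two halves that are
        // combined with the butterfly (a, b) -> (a + b, a - b).
        for block in data.chunks_mut(2 * h) {
            let (lo, hi) = block.split_at_mut(h);
            for (a, b) in lo.iter_mut().zip(hi.iter_mut()) {
                let (x, y) = (*a, *b);
                *a = x + y;
                *b = x - y;
            }
        }
        h *= 2;
    }
}

/// Orthonormal variant: scaling by 1/sqrt(n) makes the transform its own inverse.
pub fn fwht_orthonormal(data: &mut [f32]) {
    fwht(data);
    let scale = 1.0 / (data.len() as f32).sqrt();
    for v in data.iter_mut() {
        *v *= scale;
    }
}
```

The transform runs in O(n log n) and allocates nothing, which suits a decode loop. Applying `fwht_orthonormal` twice returns the original vector, so the same routine can serve for both the rotation and its inverse.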
The goal is to ensure that as models scale, the ability to run them remains local and decentralized. The team is specifically seeking feedback on Vulkan SPIR-V kernels.
📖 Read the full source: r/LocalLLaMA