Google Research introduces TurboQuant for AI model compression

What TurboQuant does
TurboQuant is a set of quantization algorithms for compressing large language models and vector search engines. It specifically targets bottlenecks in the key-value cache, a high-speed store that keeps frequently used information under simple labels for instant retrieval.
How it works
According to Google Research, TurboQuant substantially reduces model size with zero accuracy loss through two key steps:
- High-quality compression (PolarQuant method): Starts by randomly rotating data vectors to simplify geometry, then applies a standard quantizer to each part of the vector individually. This stage uses most of the compression power to capture the main concept and strength of the original vector.
- Eliminating hidden errors: Spends the small remaining compression budget (just 1 bit) on applying the QJL algorithm to the small error left over from the first stage. QJL acts as a mathematical error-checker that eliminates bias, leading to more accurate attention scores.
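The first step can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the Gram-Schmidt rotation, the uniform quantizer, the clipping range of three standard deviations, and all function names are assumptions chosen for readability. It shows a vector being rotated, each coordinate being quantized on its own, and the residual that would be handed to the 1-bit second stage.

```python
import math
import random


def random_rotation(dim, rng):
    """Build a random orthogonal matrix by Gram-Schmidt on Gaussian rows."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for u in rows:  # remove components along rows already accepted
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows


def rotate(matrix, vec):
    """Multiply the rotation matrix by the vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def quantize(value, bits, limit):
    """Uniform scalar quantizer on [-limit, limit]; returns an integer code."""
    levels = (1 << bits) - 1
    clipped = max(-limit, min(limit, value))
    return round((clipped + limit) / (2 * limit) * levels)


def dequantize(code, bits, limit):
    """Map an integer code back to its reconstruction value."""
    levels = (1 << bits) - 1
    return code / levels * 2 * limit - limit


def stage_one(vec, rotation, bits):
    """Rotate, then quantize every coordinate independently."""
    rotated = rotate(rotation, vec)
    norm = math.sqrt(sum(x * x for x in rotated))
    # Assumed clipping range: three standard deviations of a rotated coordinate.
    limit = 3.0 * norm / math.sqrt(len(vec)) or 1.0
    codes = [quantize(x, bits, limit) for x in rotated]
    approx = [dequantize(c, bits, limit) for c in codes]
    residual = [x - a for x, a in zip(rotated, approx)]
    return codes, approx, residual


rng = random.Random(0)
dim = 16
rotation = random_rotation(dim, rng)
vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
codes, approx, residual = stage_one(vec, rotation, bits=3)

energy = sum(x * x for x in vec)
residual_energy = sum(r * r for r in residual)
print("codes:", codes)
print("share of energy left in residual:", residual_energy / energy)
```

The rotation spreads the vector's energy evenly across coordinates, which is why one fixed range can serve every coordinate in this sketch without per-block constants.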
Key components
QJL (Quantized Johnson-Lindenstrauss): Uses the Johnson-Lindenstrauss Transform to shrink high-dimensional data while preserving distances between data points. It reduces each number in the resulting vector to a single sign bit (+1 or -1) with zero memory overhead. It then uses a special estimator that pairs high-precision queries with low-precision data to accurately calculate attention scores.
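A rough sketch of the sign-bit idea, assuming a Gaussian random projection and the standard unbiased estimator for sign-quantized projections, follows. The function names, the sketch size of 4096, and the choice to keep the key's norm next to its sign bits are assumptions of this example and may not match how QJL handles these details.

```python
import math
import random


def gaussian_matrix(rows, cols, rng):
    """Random projection matrix with independent standard normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(matrix, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def encode_key(matrix, key):
    """Keep one sign bit per projected coordinate, plus the key's norm."""
    signs = [1 if p >= 0 else -1 for p in project(matrix, key)]
    norm = math.sqrt(sum(k * k for k in key))
    return signs, norm


def estimate_dot(matrix, query, signs, key_norm):
    """Asymmetric estimator: full-precision query projection, 1-bit key."""
    projected = project(matrix, query)
    total = sum(p * s for p, s in zip(projected, signs))
    # sqrt(pi / 2) undoes the shrinkage caused by keeping only the sign.
    return math.sqrt(math.pi / 2) * key_norm * total / len(signs)


rng = random.Random(0)
dim, sketch_size = 8, 4096
matrix = gaussian_matrix(sketch_size, dim, rng)
key = [rng.gauss(0.0, 1.0) for _ in range(dim)]
query = [rng.gauss(0.0, 1.0) for _ in range(dim)]

exact = sum(q * k for q, k in zip(query, key))
signs, key_norm = encode_key(matrix, key)
approx = estimate_dot(matrix, query, signs, key_norm)
print("exact dot product:", exact)
print("estimate from sign bits:", approx)
```

Only the key is reduced to sign bits; the query stays in full precision, which is what keeps the estimate unbiased in this sketch. A large sketch size is used here only to make the agreement easy to see.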
PolarQuant: Addresses memory overhead by converting vectors from Cartesian coordinates into polar coordinates. Instead of standard coordinates (X, Y, Z), it describes a vector by a length and a direction, comparable to saying "Go 5 blocks total at a 37-degree angle" rather than "Go 3 blocks East, 4 blocks North."
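The street-directions analogy can be checked with a two-dimensional conversion. This is only an illustration of the coordinate change, not PolarQuant's procedure for high-dimensional vectors; the function names are hypothetical. Note that the 37-degree figure comes out when the angle is measured from north toward east; measured from east, the same direction is about 53 degrees.

```python
import math


def to_polar(east, north):
    """Return (radius, angle in degrees measured from north toward east)."""
    radius = math.hypot(east, north)
    angle = math.degrees(math.atan2(east, north))
    return radius, angle


def to_cartesian(radius, angle):
    """Invert to_polar: recover the (east, north) offsets."""
    rad = math.radians(angle)
    return radius * math.sin(rad), radius * math.cos(rad)


radius, angle = to_polar(3.0, 4.0)
east, north = to_cartesian(radius, angle)
print(radius, round(angle), round(east, 6), round(north, 6))
```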
Technical context
Traditional vector quantization typically introduces a memory overhead of 1-2 extra bits per number, because quantization constants must be stored for every small block of data. TurboQuant is designed to avoid this overhead. In testing, the techniques showed promise for reducing key-value cache bottlenecks without sacrificing AI model performance.
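To see where a figure of 1-2 extra bits per number can come from, consider a block-wise scheme that stores two 16-bit constants (say, a scale and a zero point) per block. The constant sizes and the block sizes of 16 and 32 are assumptions for illustration, not values taken from the source.

```python
def overhead_bits_per_value(block_size, constant_bits=16, constants_per_block=2):
    """Extra bits per number spent on per-block quantization constants."""
    return constant_bits * constants_per_block / block_size


for block_size in (16, 32):
    extra = overhead_bits_per_value(block_size)
    print(f"block of {block_size}: {extra} extra bits per number")
```

Under these assumptions a block of 32 numbers costs 1 extra bit per number and a block of 16 costs 2, so a nominal 3-bit format would effectively use 4 to 5 bits per number.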
TurboQuant will be presented at ICLR 2026, while PolarQuant will be presented at AISTATS 2026.
📖 Read the full source: HN AI Agents