TurboQuant: Google’s 6× KV Cache Compression With Zero Accuracy Loss
TurboQuant, a new technique from Google Research, compresses key-value caches by up to six times with no measurable accuracy loss. It combines two algorithms, PolarQuant and QJL: the first transforms vectors into polar coordinates, the second applies a quantized Johnson-Lindenstrauss transform to the residual. Together they speed up inference and let large language models handle longer contexts.
Background: Memory Bottlenecks in Large Language Models
Large language models store and process massive numbers of high‑dimensional vectors. Each vector can contain hundreds or thousands of floating‑point numbers, and during inference a KV (key‑value) cache holds these vectors for rapid access. As the context length grows, the KV cache expands, quickly exhausting GPU memory and becoming a critical performance bottleneck.
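To make the scaling concrete, here is a back-of-the-envelope estimate of KV cache size (the layer count, head configuration, and data type below are illustrative assumptions, not the dimensions of any particular model):

```python
# Rough KV-cache footprint: 2 tensors (K and V) per layer, each of shape
# (kv_heads, context_len, head_dim). All model dimensions are illustrative.
def kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                   context_len=128_000, bytes_per_value=2):  # 2 bytes = fp16
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value

full = kv_cache_bytes()
print(f"fp16 cache:    {full / 2**30:.1f} GiB")      # ~15.6 GiB
print(f"6x compressed: {full / 6 / 2**30:.1f} GiB")  # ~2.6 GiB
```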
Traditional Vector Quantization and Its Limitations
Vector quantization reduces the size of high‑dimensional vectors but typically introduces extra memory overhead for quantization constants, often adding 1–2 bits per value. This overhead can offset the intended space savings, especially when many small data blocks must store their own constants.
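A quick calculation shows why these constants hurt (the block size and bit widths here are typical choices, not values taken from the paper):

```python
# Per-block affine quantization stores a scale and a zero-point alongside
# each block of values. Amortized over small blocks, these constants add
# a substantial number of extra bits per value on top of the payload.
block_size = 32           # values per quantization block
payload_bits = 4          # bits per quantized value
const_bits = 2 * 16       # one fp16 scale + one fp16 zero-point per block

overhead = const_bits / block_size  # 1.0 extra bit per value
print(f"effective bits/value: {payload_bits + overhead}")  # 5.0, a 25% overhead
```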
TurboQuant Overview
TurboQuant, presented by Google Research at ICLR, combines two novel steps to achieve near‑lossless compression of KV caches and high‑dimensional vectors:
PolarQuant: Randomly rotates vectors, then converts Cartesian coordinates to polar coordinates. This representation separates each vector into a radius (magnitude) and angles (direction), allowing the angles to be stored on a fixed circular grid without additional quantization constants.
QJL (Quantized Johnson-Lindenstrauss): Applies a quantized Johnson-Lindenstrauss transform to the residual error from PolarQuant, using only a single bit per value (±1). An asymmetric estimator pairs full-precision queries with this 1-bit representation, preserving attention-score accuracy.
Both steps are applied without any model fine‑tuning or additional training.
Technical Details of PolarQuant
PolarQuant groups the dimensions of a d‑dimensional vector into pairs, maps each pair to polar coordinates, and recursively processes the radii. The resulting representation consists of a final radius and a set of angles that can be stored on a predictable circular lattice, eliminating the need for per‑block quantization tables.
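A minimal sketch of this pairwise recursion may help (an illustration of the idea, not Google's implementation; the initial random rotation is omitted, and the vector length is assumed to be a power of two):

```python
import numpy as np

def polar_decompose(v):
    """Reduce a vector (length a power of two) to one radius plus angles."""
    angles, x = [], np.asarray(v, dtype=np.float64)
    while x.size > 1:
        pairs = x.reshape(-1, 2)
        angles.append(np.arctan2(pairs[:, 1], pairs[:, 0]))  # direction of each pair
        x = np.hypot(pairs[:, 0], pairs[:, 1])               # radii; recurse on these
    return x[0], angles

def quantize_angles(angles, bits=5):
    """Snap angles to a fixed circular grid of 2**bits points.
    The grid is the same for every vector, so no per-block constants."""
    step = 2 * np.pi / 2**bits
    return [np.round(a / step).astype(int) % 2**bits for a in angles]

def polar_reconstruct(radius, codes, bits=5):
    """Invert the decomposition from the quantized angle codes."""
    step, x = 2 * np.pi / 2**bits, np.array([radius])
    for code in reversed(codes):
        theta = code * step
        x = np.stack([x * np.cos(theta), x * np.sin(theta)], axis=1).ravel()
    return x

rng = np.random.default_rng(0)
v = rng.standard_normal(8)
radius, angles = polar_decompose(v)
v_hat = polar_reconstruct(radius, quantize_angles(angles))
print(np.max(np.abs(v - v_hat)))  # small error from rounding the angles
```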
Technical Details of QJL
QJL uses the Johnson-Lindenstrauss lemma to embed high-dimensional data into a lower-dimensional space while approximately preserving pairwise distances. Each embedded value is then quantized to its sign, a single bit, so no quantization constants need to be stored at all. A specialized estimator corrects the distortion introduced by this extreme quantization, keeping attention-score computation accurate.
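A bare-bones version of such an estimator is sketched below (again illustrative: a plain Gaussian JL matrix and the standard sign-based inner-product estimator, applied directly to a key rather than to the PolarQuant residual as in the full pipeline):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 128, 1024                 # original and projected dimensions
S = rng.standard_normal((m, d))  # Gaussian JL matrix, shared by all keys

def qjl_encode(k):
    """Store 1 bit per projected coordinate, plus the key's norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_estimate(q, sign_bits, k_norm):
    """Asymmetric estimator: the query stays full precision.
    For Gaussian rows s, E[sgn(<s,k>) * <s,q>] = sqrt(2/pi) * <q,k> / ||k||."""
    return np.sqrt(np.pi / 2) * k_norm * np.dot(S @ q, sign_bits) / m

k, q = rng.standard_normal(d), rng.standard_normal(d)
bits, norm = qjl_encode(k)
print(np.dot(q, k), qjl_estimate(q, bits, norm))  # unbiased, if noisy, estimate
```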
Experimental Evaluation
TurboQuant was evaluated on several long‑context benchmarks (LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, L‑Eval) using open‑source LLMs such as Gemma and Mistral. Metrics included inner‑product distortion and recall@k for vector search.
KV cache memory usage was reduced by at least 6× with no measurable loss in model accuracy.
TurboQuant achieved 3‑bit quantization of KV caches without training, while still outperforming baseline methods like KIVI.
On H100 GPUs, 4‑bit TurboQuant delivered up to an 8× speedup over the 32‑bit unquantized baseline.
In high‑dimensional vector search on the GloVe‑200 dataset, TurboQuant consistently outperformed Product Quantization (PQ) and RaBitQ in recall@k, despite not requiring dataset‑specific tuning.
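For readers less familiar with the metric, recall@k is the fraction of the exact top-k results that the compressed index also returns; a minimal definition in code (illustrative, not the paper's evaluation harness):

```python
import numpy as np

def recall_at_k(exact_scores, approx_scores, k=10):
    """Fraction of the true top-k items recovered under approximate scores."""
    exact = set(np.argsort(-exact_scores)[:k])
    approx = set(np.argsort(-approx_scores)[:k])
    return len(exact & approx) / k
```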
Implications
The ability to compress KV caches dramatically extends the effective context window of large language models, enabling longer documents and sustained dialogues without sacrificing latency. Moreover, the data‑agnostic nature of TurboQuant makes it suitable for large‑scale vector search engines, offering near‑optimal distortion rates with minimal memory footprint.
Overall, TurboQuant represents a significant step toward more efficient AI systems, combining extreme compression with zero accuracy loss and minimal runtime overhead.