TurboQuant: How Google’s New Vector Quantization Cuts KV Memory 6× and Boosts Speed

TurboQuant, presented at ICLR 2026, introduces a theoretically grounded vector quantization technique that cuts large‑language‑model key‑value cache memory by at least six times and achieves up to eight‑fold speedups while incurring zero accuracy loss. It does so by combining PolarQuant's polar‑coordinate compression with a 1‑bit QJL error‑correction step, as demonstrated on benchmarks such as LongBench and GloVe.


Cloudflare's co‑founder and CEO praised Google's recent "DeepSeek moment," noting that AI inference still has enormous optimization headroom in speed, memory usage, power consumption, and multi‑tenant utilization. Against that backdrop, the paper TurboQuant proposes a set of theoretically backed quantization algorithms that compress large‑language‑model (LLM) key‑value (KV) caches by at least six times, deliver up to eight‑fold acceleration, and incur zero accuracy loss, redefining AI efficiency.

Why Vector Quantization Matters

Vectors are the fundamental representation for AI models, with high‑dimensional vectors capturing complex information such as image features or word meanings. Storing these vectors in KV caches creates a memory bottleneck because the cache acts as a fast “digital notepad” for frequently accessed data. Traditional vector quantization reduces vector size but often adds its own memory overhead by storing per‑block quantization constants, partially offsetting the gains.
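To see where that overhead comes from, here is a minimal NumPy sketch (illustrative, not from the paper) of conventional block‑wise scalar quantization; the per‑block scale values are exactly the quantization constants that must be stored alongside the codes:

```python
import numpy as np

def blockwise_quantize(x, block_size=64, bits=4):
    """Quantize a vector in blocks, storing one scale per block.

    The integer codes cost `bits` per element, but each block also
    needs a full-precision scale -- the hidden memory overhead.
    """
    blocks = x.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)  # one float per block
    levels = 2 ** (bits - 1) - 1
    codes = np.round(blocks / scales * levels).astype(np.int8)
    return codes, scales

def blockwise_dequantize(codes, scales, bits=4):
    levels = 2 ** (bits - 1) - 1
    return (codes.astype(np.float32) / levels * scales).reshape(-1)

x = np.random.randn(1024).astype(np.float32)
codes, scales = blockwise_quantize(x)
x_hat = blockwise_dequantize(codes, scales)
print(np.mean((x - x_hat) ** 2))  # reconstruction error of the 4-bit codes
```

At 4 bits per element plus one 32‑bit scale per 64‑element block, the constants alone add 0.5 bits per element; that is the kind of overhead TurboQuant sets out to eliminate.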

TurboQuant’s Core Innovations

TurboQuant tackles the overhead challenge with two complementary techniques (a toy sketch of the combined pipeline follows the two steps below):

High‑Quality Compression (PolarQuant): The method first applies a random rotation to the data vectors, simplifying their geometry. Each rotated sub‑vector is then quantized with a standard high‑quality quantizer, capturing the bulk of the original vector's information with most of the available bit budget.

Eliminating Hidden Error (QJL): A tiny residual budget (1 bit) is used to apply a Quantized Johnson‑Lindenstrauss (QJL) transform to the remaining error after the first stage. QJL acts as a mathematical error checker, removing bias and producing more accurate attention scores.
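Here is that toy sketch of the two‑stage idea, under loose assumptions (a simple uniform quantizer stands in for stage one, and raw residual signs stand in for the full QJL transform):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

# A random rotation, built here via QR decomposition of a Gaussian matrix
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

def two_stage_encode(x, bits=3):
    """Hypothetical encoder: coarse rotated codes plus 1-bit residual signs."""
    xr = Q @ x                                  # rotate to simplify the geometry
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(xr).max()
    coarse = np.round(xr / scale * levels)      # stage 1: high-quality coarse codes
    residual = xr - coarse * scale / levels     # error the coarse stage leaves behind
    signs = np.sign(residual)                   # stage 2: one extra bit per dimension
    return coarse, scale, signs, np.abs(residual).mean()

def two_stage_decode(coarse, scale, signs, res_mag, bits=3):
    levels = 2 ** (bits - 1) - 1
    xr_hat = coarse * scale / levels + signs * res_mag  # add the sign-coded residual back
    return Q.T @ xr_hat                                 # undo the rotation

x = rng.standard_normal(d)
x_hat = two_stage_decode(*two_stage_encode(x))
print(np.mean((x - x_hat) ** 2))  # noticeably lower than the coarse stage alone
```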

QJL: Zero‑Overhead 1‑Bit Trick

QJL leverages the Johnson‑Lindenstrauss transform to shrink high‑dimensional data while preserving pairwise distances. Each projected coordinate is reduced to a single sign (+1 or –1), creating a compact shorthand with essentially zero memory overhead. A dedicated estimator pairs full‑precision queries with this 1‑bit key representation, allowing the model to compute attention scores accurately.
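The sketch below reflects my reading of the QJL estimator (the projection size m and the test vectors are arbitrary choices, not the paper's settings): each key is stored as one sign bit per projection plus its norm, and the √(π/2) factor corrects the bias that taking signs introduces:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 128, 4096                   # m projections; more gives a tighter estimate

S = rng.standard_normal((m, d))    # shared random Gaussian projection

def qjl_encode(k):
    """Keep only 1 bit per projection plus the key's norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_dot(q, sign_bits, k_norm):
    """Unbiased estimate of <q, k> from sign bits and a full-precision query."""
    return k_norm * np.sqrt(np.pi / 2) / m * np.dot(S @ q, sign_bits)

q, k = rng.standard_normal(d), rng.standard_normal(d)
bits, k_norm = qjl_encode(k)
print(qjl_dot(q, bits, k_norm), np.dot(q, k))  # estimate vs. exact dot product
```

Note that the query stays in full precision, which is exactly the high‑precision/low‑precision pairing the estimator is designed for.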

PolarQuant: A New Angle on Compression

PolarQuant converts Cartesian coordinates to polar coordinates, representing each vector as a radius (data magnitude) and an angle (direction). Because angle distributions are highly concentrated, the model can map data onto a predictable circular grid, eliminating the need for costly normalization steps required by traditional methods. The process groups coordinate pairs, maps them to polar space, and recursively compresses until a single radius and a set of descriptive angles remain.
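A rough illustration of that recursion, assuming a hypothetical uniform angle grid rather than the paper's actual codebook (this toy requires a power‑of‑two vector length):

```python
import numpy as np

def polar_encode(x, angle_bits=4):
    """Recursively fold a length-2^k vector into one radius + quantized angles."""
    grid = 2 ** angle_bits
    angles = []
    while x.size > 1:
        a, b = x[0::2], x[1::2]
        r = np.hypot(a, b)                       # radius of each coordinate pair
        theta = np.arctan2(b, a)                 # angle in [-pi, pi)
        code = np.round((theta + np.pi) / (2 * np.pi) * grid) % grid
        angles.append(code.astype(np.uint8))     # angle_bits per pair
        x = r                                    # recurse on the radii
    return x[0], angles                          # single radius + angle codes

def polar_decode(radius, angles, angle_bits=4):
    grid = 2 ** angle_bits
    x = np.array([radius])
    for code in reversed(angles):
        theta = code / grid * 2 * np.pi - np.pi
        a, b = x * np.cos(theta), x * np.sin(theta)
        out = np.empty(2 * x.size)
        out[0::2], out[1::2] = a, b
        x = out                                  # unfold one level per pass
    return x

v = np.random.randn(64)
r, codes = polar_encode(v)
print(np.linalg.norm(v - polar_decode(r, codes)))  # angle-quantization error
```

The end state matches the description above: one full‑precision radius plus a set of compact angle codes, with no per‑block normalization constants.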

Experiments and Results

TurboQuant was evaluated on long‑context benchmarks (LongBench, Needle‑In‑A‑Haystack, ZeroSCROLLS, RULER, L‑Eval) using open‑source LLMs (Gemma and Mistral). The results show that TurboQuant achieves near‑optimal dot‑product distortion and strong recall while minimizing KV memory usage. In the Needle‑In‑A‑Haystack task, TurboQuant cut KV memory by at least six times with perfect downstream performance, and PolarQuant showed near‑zero loss.

TurboQuant can quantize KV caches down to 3 bits without any training or fine‑tuning, preserving model accuracy and delivering up to 8× runtime speedup on H100 GPUs compared to a 32‑bit unquantized baseline. The implementation adds negligible runtime overhead.

For vector search, TurboQuant was compared against state‑of‑the‑art baselines (PQ, RaBitQ) on the GloVe dataset (d = 200) using 1@k recall. Even though the baselines rely on large codebooks and dataset‑specific tuning, TurboQuant consistently achieved higher recall, demonstrating robust efficiency for high‑dimensional retrieval tasks.
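For clarity on the metric, 1@k recall is the fraction of queries whose true nearest neighbor survives in the top k candidates ranked by approximate (quantized) scores. A minimal sketch, using noisy dot products as a stand‑in for quantized scoring:

```python
import numpy as np

def recall_1_at_k(queries, base, approx_scores, k=10):
    """Fraction of queries whose true top-1 neighbor appears in the
    top-k of the approximate (quantized) ranking."""
    exact = queries @ base.T                         # true dot products
    true_nn = exact.argmax(axis=1)
    topk = np.argsort(-approx_scores, axis=1)[:, :k]
    return np.mean([t in row for t, row in zip(true_nn, topk)])

rng = np.random.default_rng(2)
base = rng.standard_normal((1000, 200))              # GloVe-like d=200 vectors
queries = rng.standard_normal((100, 200))
noisy = queries @ base.T + 0.5 * rng.standard_normal((100, 1000))
print(recall_1_at_k(queries, base, noisy, k=10))
```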

[Figure: TurboQuant KV cache compression performance]
[Figure: TurboQuant algorithm illustration]
[Figure: KV cache compression benchmark]
[Figure: Attention logits speedup]
[Figure: 1@k recall on the GloVe dataset]
Paper: TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (https://arxiv.org/abs/2504.19874)
Tags: AI inference, benchmarking, memory compression, vector quantization, TurboQuant
Written by PaperAgent
Daily updates, analyzing cutting-edge AI research papers