Google’s TurboQuant Shrinks KV‑Cache Memory and Speeds Up LLM Attention by up to 8×
Google’s TurboQuant reduces KV-cache memory by up to 4.6×, speeds up 3-bit attention computation by as much as 8× on an H100, and delivers near-zero accuracy loss across long-context benchmarks, with open-source implementations for Metal, vLLM, and llama.cpp.
What does TurboQuant do?
Google released TurboQuant, a new quantization technique that turns the KV cache, the most memory-intensive and often slowest part of LLM inference, into an almost free resource.
The KV‑Cache bottleneck
During inference, the KV cache consumes the majority of GPU memory, and it grows linearly with context length, so long contexts quickly limit how large a model can run locally.
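A back-of-the-envelope calculation makes the scale concrete; the model shape below is an illustrative GQA configuration, not any specific checkpoint:

# Rough fp16 KV-cache size for a hypothetical model (32 layers, 8 KV heads, head dim 128)
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2
ctx = 128_000  # tokens of context

# K and V each store one (kv_heads x head_dim) vector per layer per token
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
print(bytes_per_token)                   # 131072 bytes = 128 KiB per token
print(bytes_per_token * ctx / 2**30)     # ~15.6 GiB for a 128K-token context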
Limitations of previous quantization
Earlier quantization schemes either sacrifice precision or add memory overhead for the quantization constants they have to store (for example, a 4-bit format with one fp16 scale per 32-value block effectively spends 4.5 bits per value), so the net benefit is limited.
TurboQuant’s breakthrough
TurboQuant achieves 3-bit KV-cache quantization with near-zero quality loss while also making attention computation faster.
How TurboQuant works
Step 1 – PolarQuant: vectors are first multiplied by a random rotation, which concentrates their coordinate distribution into a predictable shape. The rotated data are then expressed in polar coordinates, which removes the need for per-block scaling factors and with it the usual quantization metadata overhead.
Step 2 – QJL (1-bit residual cleaning): after PolarQuant, a 1-bit Johnson-Lindenstrauss sketch captures the remaining quantization error so that inner products (the attention scores) are preserved.
Combined, PolarQuant + QJL incur zero extra memory cost and approach the information‑theoretic lower bound; the paper shows TurboQuant is within a 2.7× constant factor of the theoretical optimum.
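The toy NumPy sketch below illustrates the structure of these two steps: a random rotation followed by coarse polar-coordinate quantization, plus a 1-bit Johnson-Lindenstrauss sketch of the residual used to correct inner products. The bit widths, the radius coding, and the estimator are illustrative choices, and unlike the real method this toy keeps a couple of extra scalars per vector; it is not the actual TurboQuant kernel.

import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 256                                     # head dim and residual-sketch size (illustrative)
R = np.linalg.qr(rng.standard_normal((d, d)))[0]    # random rotation (orthogonal matrix)
S = rng.standard_normal((m, d))                     # Gaussian projection for the 1-bit residual sketch

def quantize_key(k, theta_bits=3, r_bits=3):
    x = R @ k                                       # step 1: rotate
    pairs = x.reshape(-1, 2)                        # treat consecutive coordinate pairs as 2-D points
    r = np.linalg.norm(pairs, axis=1)
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])    # polar coordinates per pair
    theta_q = np.round((theta + np.pi) / (2 * np.pi) * (2**theta_bits - 1))
    r_scale = r.max() / (2**r_bits - 1)             # toy radius coding with one global scale
    r_q = np.round(r / r_scale)
    theta_hat = theta_q / (2**theta_bits - 1) * 2 * np.pi - np.pi
    r_hat = r_q * r_scale
    x_hat = np.stack([r_hat * np.cos(theta_hat), r_hat * np.sin(theta_hat)], axis=1).ravel()
    e = x - x_hat                                   # step 2: 1-bit JL sketch of the quantization residual
    return x_hat, np.sign(S @ e), np.linalg.norm(e)

def approx_score(q, key_code):
    """Approximate <q, k> from the coarse polar code plus the 1-bit residual sketch."""
    x_hat, sign_bits, e_norm = key_code
    qr = R @ q                                      # queries are rotated but not quantized
    correction = e_norm * np.sqrt(np.pi / 2) / m * (S @ qr) @ sign_bits
    return qr @ x_hat + correction

k, q = rng.standard_normal(d), rng.standard_normal(d)
print("exact:", q @ k, "approx:", approx_score(q, quantize_key(k)))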
Benchmark results
Google evaluated TurboQuant on Gemma and Mistral using LongBench, Needle-in-a-Haystack, ZeroSCROLLS, RULER, and L-Eval. TurboQuant matched the full-precision baseline on all tasks while shrinking KV memory by at least 6×.
On an H100 GPU, 4‑bit TurboQuant attention runs 8× faster than the original 32‑bit key implementation.
Community implementations
The mlx-vlm project provides a Metal-kernel implementation of TurboQuant, including _mse_score_kernel, _pack_lowbit_kernel, _unpack_lowbit_kernel, _qjl_score_kernel, _prod_score_kernel, _polar_prod_score_kernel, and _polar_turbo_score_repeat_kernel, plus the multi-head kernels _prod_score_multi_kernel and _mse_weighted_rot_multi_kernel. Tests on Qwen 3.5-35B-A3B show a 4.9× KV-cache reduction at 2.5 bits and 3.8× at 3.5 bits with no accuracy loss.
The MLX kernels currently reach roughly 70–85% of full-precision decode speed (for example, 54 tok/s vs 62.5 tok/s on an 8K prompt) and continue to improve.
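For intuition about what the pack/unpack kernels have to do, here is a minimal NumPy sketch of dense low-bit packing; it is a plain reference, not the mlx-vlm Metal API:

import numpy as np

def pack_lowbit(codes, bits):
    """Pack integer codes in [0, 2**bits) into a contiguous uint8 buffer."""
    flat = codes.astype(np.uint8).ravel()
    # spread each code into `bits` binary digits, LSB first, then pack 8 digits per byte
    digits = ((flat[:, None] >> np.arange(bits)) & 1).astype(np.uint8).ravel()
    return np.packbits(digits, bitorder="little")

def unpack_lowbit(packed, bits, count):
    """Inverse of pack_lowbit for `count` codes."""
    digits = np.unpackbits(packed, bitorder="little")[: count * bits].reshape(count, bits)
    return (digits << np.arange(bits, dtype=np.uint8)).sum(axis=1).astype(np.uint8)

codes = np.random.default_rng(0).integers(0, 8, size=1024)   # 3-bit codes
packed = pack_lowbit(codes, bits=3)
assert packed.nbytes == 1024 * 3 // 8                        # 384 bytes instead of 1024
assert np.array_equal(unpack_lowbit(packed, 3, 1024), codes)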
llama.cpp integration
Developer TheTom created turboquant_plus, porting TurboQuant to llama.cpp with Metal kernels, enabling end‑to‑end inference on Apple Silicon.
f16 cache: 1.0× compression, PPL 6.121
q8_0 cache: 2.0× compression, 2694 tok/s, PPL 5.414
q4_0 cache: 4.0× compression, PPL 6.142
turbo3 cache: 4.6× compression, 2747 tok/s, PPL 5.460
In short, the TurboQuant 3-bit KV cache gives 4.6× compression, runs slightly faster than q8_0, and raises perplexity by only about 0.8% relative to q8_0.
# Run inference with TurboQuant 3‑bit KV Cache
./build/bin/llama-server \
-m models/your-model.gguf \
--cache-type-k turbo3 --cache-type-v turbo3 \
-ngl 99 -c 262144 -fa on \
--host 0.0.0.0 --port 8080

Only the two parameters --cache-type-k turbo3 and --cache-type-v turbo3 are required; no other changes are needed. The approach works on Qwen 3.5-35B-A3B MoE models across context lengths from 2K to 32K, maintaining ~99% of q8_0 speed.
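Once the server is running, any OpenAI-compatible client can call it; a minimal Python check (the model name is just a placeholder, since llama-server serves the GGUF it was started with):

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "local",   # placeholder; the loaded GGUF is used
        "messages": [{"role": "user", "content": "Hello from a TurboQuant-compressed KV cache!"}],
        "max_tokens": 128,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])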
vLLM integration
Developer Mitko Vasilev added TurboQuant to vLLM. On an HP ZGX device with a GB10 GPU, the implementation handled 4,083,072 KV‑cache tokens, demonstrating the feasibility of massive context windows.
What does this mean?
The author considers TurboQuant the most important LLM inference advance of 2026, comparable in impact to what FlashAttention did for attention’s speed and memory cost.
Developers can run larger models and longer contexts on the same GPU, dramatically lowering cost.
No retraining or fine‑tuning is required; deployment is as simple as a pip install.
On a 16 GB Mac mini, the range of models that fit in memory roughly doubles, and the MLX kernels keep local inference speed close to full precision.
Cloud providers can reduce inference costs, potentially lowering API prices and enabling more applications.
The AI ecosystem’s hardware ceiling is raised, allowing continued scaling of parameters, MoE, and context length without prohibitive memory constraints.
Google has published the paper (arXiv 2504.19874) and open-sourced the code, so the community can freely reproduce the results.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.