Old Zhang's AI Learning
Apr 28, 2026 · Artificial Intelligence

vLLM 0.20 Arrives with DeepSeek V4 Support – What’s New?

The vLLM 0.20.0 release is a major upgrade to the inference engine: DeepSeek V4 support, CUDA 13 by default, PyTorch 2.11 and Transformers v5 compatibility, FlashAttention 4 MLA prefill, a TurboQuant 2‑bit KV cache, an online quantization front‑end, IR enhancements, Model Runner V2 features, and a slew of new models. The post also walks through detailed installation and upgrade guidance.

CUDA 13 · DeepSeek V4 · FlashAttention
10 min read
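For readers planning the upgrade the post describes, here is a minimal sketch of pulling the new release and serving a DeepSeek checkpoint through vLLM’s offline Python API. The version pin, the `deepseek-ai/DeepSeek-V4` model ID, and the FP8 KV-cache setting are assumptions inferred from the summary above, not verified details of the 0.20.0 release.

```python
# Hypothetical upgrade sketch based on the post's summary; the version pin and
# model ID below are assumptions, not confirmed release artifacts.
#   pip install -U "vllm>=0.20.0"    # per the post, this build targets CUDA 13 / PyTorch 2.11

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4",   # placeholder repo ID for a DeepSeek V4 checkpoint
    kv_cache_dtype="fp8",              # existing vLLM option; the post's 2-bit TurboQuant
                                       # cache would be enabled through its own setting
    tensor_parallel_size=8,            # adjust to your GPU count
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain MLA prefill in one paragraph."], params)
print(outputs[0].outputs[0].text)
```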
Architect's Must-Have
Apr 19, 2026 · Artificial Intelligence

TurboQuant: Google’s 6× KV Compression & 8× Speedup Break the AI Memory Wall

With LLM context windows soaring to millions of tokens, the KV‑cache memory wall threatens scalable inference; Google’s TurboQuant tackles this by compressing KV data up to six‑fold without precision loss and accelerating attention up to eight‑fold, using PolarQuant and 1‑bit QJL techniques, reshaping hardware costs and edge AI possibilities.

AI inference · KV compression · TurboQuant
25 min read
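To put the memory wall in numbers, the snippet below estimates KV-cache size for a long context and what a six-fold compression would leave. The model shape (an 80-layer configuration with grouped-query attention) is an illustrative assumption, not a figure from the article.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value):
    """One sequence's KV cache: a K and a V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Illustrative 70B-class shape with grouped-query attention (assumption, not from the article).
layers, kv_heads, head_dim = 80, 8, 128

for tokens in (128_000, 1_000_000):
    fp16 = kv_cache_bytes(layers, kv_heads, head_dim, tokens, 2)  # 16-bit baseline
    compressed = fp16 / 6                                          # ~6x, per the article's claim
    print(f"{tokens:>9} tokens: {fp16 / 2**30:6.1f} GiB fp16 -> {compressed / 2**30:6.1f} GiB at 6x")
```

At a million tokens this toy configuration already needs hundreds of gigabytes of cache per sequence, which is why a six-fold reduction changes what fits on a single accelerator.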
Machine Heart
Apr 1, 2026 · Artificial Intelligence

TurboQuant’s Alleged Misconduct: Google’s Reply Sparks Bigger Controversy

The TurboQuant paper on LLM quantization has ignited a heated debate over alleged academic misconduct: the authors’ OpenReview rebuttal has drawn criticism for downplaying prior work and misrepresenting benchmarks, and has prompted broader concerns about research integrity in AI.

AI research integrity · LLM quantization · RaBitQ
9 min read
AI Code to Success
Mar 27, 2026 · Artificial Intelligence

How Google’s TurboQuant Cuts LLM Memory by 6× and Speeds Up Inference 8×

Google Research’s TurboQuant algorithm compresses large‑language‑model KV caches from 32‑bit down to 3‑bit values, achieving a six‑fold reduction in memory usage and an eight‑fold inference speedup on H100 GPUs with no loss of accuracy. It also improves vector‑search performance without requiring large codebooks.

AI Efficiency · Inference Acceleration · LLM compression
10 min read
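For intuition about what dropping to 3-bit values means, here is a toy round trip through plain uniform 3-bit quantization. This is a generic illustration of low-bit storage and its reconstruction error, not the TurboQuant algorithm the article covers.

```python
import numpy as np

def quantize_uniform(x, bits=3):
    """Map floats to integer codes in [0, 2**bits - 1] over the tensor's value range."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2**bits - 1)
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize_uniform(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
kv_block = rng.standard_normal((4, 128)).astype(np.float32)  # stand-in for a KV-cache block

codes, lo, scale = quantize_uniform(kv_block, bits=3)
recon = dequantize_uniform(codes, lo, scale)

# 3 bits per value instead of 32 is a >10x raw storage cut (ignoring metadata),
# paid for by the reconstruction error printed below.
print("mean abs error:", float(np.abs(kv_block - recon).mean()))
```

The naive version above loses noticeable precision, which is exactly the gap that transform-based schemes like the one described here aim to close.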
SuanNi
Mar 26, 2026 · Artificial Intelligence

TurboQuant: Google’s 6× KV Cache Compression With Zero Accuracy Loss

TurboQuant, a new technique from Google Research, dramatically compresses key‑value caches by up to six times without precision loss. It uses the PolarQuant and QJL algorithms to transform vectors into polar coordinates and apply quantized Johnson‑Lindenstrauss transforms, boosting inference speed and enabling longer context handling for large language models.

AI compression · KV cache · Performance
13 min read
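As a rough picture of the two ingredients named above, the sketch below quantizes 2-D vectors in polar form (norm plus angle) and estimates angles between high-dimensional vectors from 1-bit random projections, SimHash-style. These are simplified stand-ins for the general ideas; the actual PolarQuant and QJL estimators in the paper differ in their details.

```python
import numpy as np

rng = np.random.default_rng(1)

# --- Toy polar-coordinate quantization of a 2-D vector ---
def to_polar_codes(v, angle_bits=4, norm_bits=4, max_norm=4.0):
    r = np.linalg.norm(v)
    theta = np.arctan2(v[1], v[0]) % (2 * np.pi)
    r_code = round(min(r, max_norm) / max_norm * (2**norm_bits - 1))
    t_code = round(theta / (2 * np.pi) * (2**angle_bits - 1))
    return r_code, t_code

def from_polar_codes(r_code, t_code, angle_bits=4, norm_bits=4, max_norm=4.0):
    r = r_code / (2**norm_bits - 1) * max_norm
    theta = t_code / (2**angle_bits - 1) * 2 * np.pi
    return np.array([r * np.cos(theta), r * np.sin(theta)])

v = rng.standard_normal(2)
print("polar round-trip error:", np.linalg.norm(v - from_polar_codes(*to_polar_codes(v))))

# --- 1-bit random-projection sketch: recover the angle between k and q ---
d, m = 128, 512                        # original dimension, number of sign bits
S = rng.standard_normal((m, d))        # shared random projection
k, q = rng.standard_normal(d), rng.standard_normal(d)

k_bits, q_bits = np.sign(S @ k), np.sign(S @ q)
match_rate = np.mean(k_bits == q_bits)        # for random projections, P[match] = 1 - angle/pi
est_angle = np.pi * (1 - match_rate)
true_angle = np.arccos(k @ q / (np.linalg.norm(k) * np.linalg.norm(q)))
print(f"true angle {true_angle:.3f} rad vs 1-bit estimate {est_angle:.3f} rad")
```

Storing only sign bits plus a norm is what makes such sketches so compact; the challenge the article describes is doing this without the accuracy penalty a naive version would incur.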
PaperAgent
Mar 26, 2026 · Artificial Intelligence

TurboQuant: How Google’s New Vector Quantization Cuts KV Memory 6× and Boosts Speed

TurboQuant, presented at ICLR 2026, introduces a theoretically grounded vector quantization technique that reduces large‑language‑model key‑value cache memory by at least six times, achieves up to eight‑fold speedups, and maintains zero accuracy loss. It combines PolarQuant’s polar‑coordinate compression with a 1‑bit QJL error‑correction step, as demonstrated on benchmarks such as LongBench and GloVe.

AI inference · TurboQuant · benchmarking
10 min read
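Because the summary points to vector-search benchmarks such as GloVe, here is a minimal recall@k harness of the kind typically used to check that a quantized index returns the same neighbors as exact search. The random data and the simple per-dimension 8-bit quantizer are placeholders, not the paper's benchmark setup or method.

```python
import numpy as np

rng = np.random.default_rng(2)
base = rng.standard_normal((10_000, 100)).astype(np.float32)   # stand-in for GloVe embeddings
queries = rng.standard_normal((100, 100)).astype(np.float32)

def topk(db, q, k=10):
    """Indices of the k database vectors with the largest inner product."""
    scores = db @ q
    return np.argpartition(-scores, k)[:k]

# Placeholder quantizer: per-dimension 8-bit uniform codes, then reconstruct.
lo, hi = base.min(axis=0), base.max(axis=0)
codes = np.round((base - lo) / (hi - lo + 1e-9) * 255).astype(np.uint8)
recon = codes.astype(np.float32) / 255 * (hi - lo + 1e-9) + lo

recalls = []
for q in queries:
    exact = set(topk(base, q))
    approx = set(topk(recon, q))
    recalls.append(len(exact & approx) / len(exact))
print("mean recall@10 of quantized vs exact search:", float(np.mean(recalls)))
```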