OSCAR Beats TurboQuant: 2‑Bit KV‑Cache for Fast, Stable Long‑Context Inference
OSCAR is an attention‑aware rotation scheme that compresses the KV cache to true 2‑bit precision. It cuts KV memory by up to 8× and raises decode throughput by up to 7× while keeping inference quality within a few points of BF16 across multiple models and long‑context benchmarks, and it outperforms TurboQuant.
Motivation : Long‑context models are bottlenecked by KV‑cache memory and bandwidth because every generated token must read an ever‑growing history of keys and values. Reducing the KV representation from 16 bits to 2 bits could shrink memory roughly eightfold, but naïve quantization often collapses attention quality.
Why 2‑bit KV is hard : INT2 provides only four quantization levels. Outlier channels with large magnitudes dominate the scale, forcing most values into a few levels and causing attention scores to drift. Simple Hadamard rotation flattens outliers but does not consider which directions the model actually attends to.
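The outlier effect can be seen in a small, self-contained sketch. It assumes plain asymmetric min/max quantization to four levels, which is a generic INT2 scheme chosen for illustration and not necessarily the quantizer OSCAR uses; the input values are made up.

```python
def quantize_int2(values):
    """Asymmetric uniform quantization to 4 levels (codes 0..3).

    Generic min/max scheme for illustration only (an assumption).
    """
    lo, hi = min(values), max(values)
    scale = ((hi - lo) / 3) or 1.0  # 3 steps between 4 levels; guard constant input
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    dequantized = [lo + c * scale for c in codes]
    return codes, dequantized


# Seven well-behaved channels plus one large outlier channel (hypothetical data).
small = [0.1, -0.2, 0.15, -0.05, 0.3, -0.25, 0.2]
with_outlier = small + [8.0]

codes_clean, _ = quantize_int2(small)
codes_outlier, deq_outlier = quantize_int2(with_outlier)

# Worst-case reconstruction error on the seven small channels.
max_err_small = max(abs(a - b) for a, b in zip(small, deq_outlier[:7]))

print("levels used without outlier:", sorted(set(codes_clean)))
print("levels used with outlier:   ", sorted(set(codes_outlier)))
```

Without the outlier, the seven channels spread across all four levels. With it, the scale is set by the outlier and every small channel lands on the same code, so their relative differences are lost.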
OSCAR’s core idea : Instead of reconstructing the original K/V vectors, OSCAR rotates them so that the quantization error is pushed onto directions that the attention mechanism is less sensitive to. For keys, the rotation target is derived from the query covariance (QᵀQ); for values, it uses the score‑weighted value covariance (VᵀSᵀSV). The final rotation matrix is R = U · Hadamard · bit‑reversal, where U aligns with attention‑relevant directions, Hadamard spreads outliers, and bit‑reversal balances INT2 groups.
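A minimal NumPy sketch of how such a composed rotation could be built for keys is given here. It assumes that U is taken from the eigenvectors of an empirical query covariance, that vectors are rows multiplied on the right by R, and that the head dimension is a power of two; the data is synthetic and the function names are hypothetical. It only demonstrates that the product U · Hadamard · bit‑reversal is orthogonal, so applying the same R to queries and keys leaves attention scores unchanged before any quantization.

```python
import numpy as np


def hadamard(d):
    """Normalized Sylvester Hadamard matrix; d must be a power of two."""
    assert d >= 2 and (d & (d - 1)) == 0
    H = np.ones((1, 1))
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)


def bit_reversal_matrix(d):
    """Permutation matrix that reorders columns by bit-reversed index."""
    bits = d.bit_length() - 1
    perm = [int(format(i, f"0{bits}b")[::-1], 2) for i in range(d)]
    return np.eye(d)[:, perm]


def attention_aware_basis(Q):
    """Eigenvectors of the query covariance, strongest direction first.

    Using eigenvectors as U is an assumption made for this sketch.
    """
    cov = Q.T @ Q / Q.shape[0]
    _, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    return eigvecs[:, ::-1]


rng = np.random.default_rng(0)
d = 8
# Synthetic queries with deliberately uneven energy per channel.
Q = rng.standard_normal((256, d)) * np.linspace(3.0, 0.1, d)
K = rng.standard_normal((16, d))

U = attention_aware_basis(Q)
R = U @ hadamard(d) @ bit_reversal_matrix(d)

scores_before = Q @ K.T
scores_after = (Q @ R) @ (K @ R).T  # same scores, since R @ R.T = I
```

Because each factor is orthogonal, the rotation itself is lossless; any change in attention scores comes only from the quantization applied afterwards.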
Offline calibration : A small calibration set is used to estimate the attention‑aware covariances for each layer and head. Fixed rotation matrices and clipping thresholds are then generated per layer/head.
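The calibration step can be pictured with the following sketch. The summary does not say how the covariances are accumulated or how clipping thresholds are chosen, so both the streaming average and the percentile rule below are assumptions, and all names are hypothetical. A real pipeline would keep one such estimate per layer and head.

```python
import numpy as np


def accumulate_query_covariance(query_batches):
    """Streaming estimate of the second moment Q^T Q / n over batches of shape (n_i, d)."""
    total, count = None, 0
    for Q in query_batches:
        Q = np.asarray(Q, dtype=np.float64)
        total = Q.T @ Q if total is None else total + Q.T @ Q
        count += Q.shape[0]
    return total / count


def percentile_clip_threshold(x, pct=99.0):
    """Clip threshold taken as a percentile of |x| (an assumed heuristic)."""
    return float(np.percentile(np.abs(x), pct))


rng = np.random.default_rng(0)
# Four synthetic calibration batches for a single (layer, head) pair.
batches = [rng.standard_normal((128, 8)) for _ in range(4)]

cov = accumulate_query_covariance(batches)
all_queries = np.concatenate(batches)
clip = percentile_clip_threshold(all_queries)
```

Accumulating batch by batch gives the same result as computing the covariance over the whole calibration set at once, without holding every activation in memory.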
Serving pipeline (SGLang integration) : OSCAR maintains three KV segments. A sink (64 tokens) and a recent window (256 tokens) are stored in BF16 to protect short‑term context, while the remaining history, by far the longest segment, is stored as rotated INT2. New tokens are first written to the recent window; as decoding progresses, the oldest recent tokens are rotated, clipped, quantized to INT2, and packed (four 2‑bit values per byte) into the history cache by a fused Triton kernel. During decode, an INT2 kernel unpacks, rescales, and accumulates over the history, a BF16 kernel handles the sink and recent tokens, and the two partial results are merged with an online softmax.
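Two steps of this pipeline are easy to illustrate in plain Python: packing four 2‑bit codes into one byte, and merging partial attention results from two segments with an online softmax. This is a sketch under stated assumptions, not the Triton kernels themselves: the bit order within a byte (lowest bits first) is assumed, values are scalars for brevity, and all function names are hypothetical.

```python
import math


def pack_int2(codes):
    """Pack 2-bit codes (0..3) four per byte, lowest bits first (assumed layout)."""
    assert len(codes) % 4 == 0 and all(0 <= c <= 3 for c in codes)
    out = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = codes[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)


def unpack_int2(packed):
    """Inverse of pack_int2."""
    codes = []
    for byte in packed:
        codes.extend((byte >> shift) & 0b11 for shift in (0, 2, 4, 6))
    return codes


def partial_attention(scores, values):
    """Per-segment state: (max score, softmax denominator, unnormalized output)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return m, sum(weights), sum(w * v for w, v in zip(weights, values))


def merge(p1, p2):
    """Combine two segment states by rescaling both to a shared maximum."""
    (m1, l1, o1), (m2, l2, o2) = p1, p2
    m = max(m1, m2)
    a1, a2 = math.exp(m1 - m), math.exp(m2 - m)
    return m, a1 * l1 + a2 * l2, a1 * o1 + a2 * o2


codes = [0, 1, 2, 3, 3, 2, 1, 0]
packed = pack_int2(codes)          # 8 codes fit in 2 bytes
restored = unpack_int2(packed)

# One query against a "history" segment and a "recent" segment (made-up numbers).
hist_scores, hist_values = [0.5, 1.0, -0.3], [1.0, 2.0, 3.0]
recent_scores, recent_values = [2.0, 0.1], [4.0, 5.0]

_, denom, out = merge(partial_attention(hist_scores, hist_values),
                      partial_attention(recent_scores, recent_values))
merged_output = out / denom

# Reference: ordinary softmax attention over the concatenated segments.
all_scores = hist_scores + recent_scores
all_values = hist_values + recent_values
exp_scores = [math.exp(s) for s in all_scores]
reference_output = sum(e * v for e, v in zip(exp_scores, all_values)) / sum(exp_scores)
```

The merge reproduces the result of a single softmax over all tokens, which is why the INT2 and BF16 segments can be processed by separate kernels and combined afterwards.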
Compatibility : The design works with paged KV, radix‑prefix cache, and SGLang’s fused kernel pipeline, enabling direct use in long‑context workloads without extra engineering.
Evaluation : OSCAR was tested on Qwen3‑4B‑Thinking, Qwen3‑8B, Qwen3‑32B, and GLM‑4.7‑FP8 across GPQA, HumanEval, LiveCodeBench v6, AIME25, and MATH500, with generation lengths up to 32 K and 5 runs per setting. At an effective 2.28 bits per KV element, OSCAR scored within 3.78 points of BF16 on Qwen3‑4B‑Thinking (a 40.1‑point gain over TurboQuant) and within 1.42 points on Qwen3‑8B, while the larger models matched BF16. Naïve INT2 and QuaRot‑INT2 often collapsed, and TurboQuant’s 3‑bit KV fell noticeably behind on reasoning tasks.
Long‑context robustness : In 128 K RULER‑NIAH tests, OSCAR kept stable retrieval performance on Qwen3‑8B and GLM‑4.7‑FP8, demonstrating that attention‑aware rotation mitigates error accumulation over very long histories.
System gains : Compared with BF16 history storage, OSCAR reduces KV memory by ~8×, achieves up to 3× decode speedup in batch‑size‑1 full‑prefix‑cache scenarios, and up to 7× job‑level throughput when batch size grows under a fixed memory budget. Higher prefix‑cache hit rates further expand the throughput frontier.
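The memory arithmetic behind the ~8× figure can be sketched as follows. The helper names are hypothetical, and the calculation assumes that only the sink (64 tokens) and recent window (256 tokens) stay at 16 bits while every other token uses the given history bit‑width. Quantization metadata such as scales is ignored, which is a simplification.

```python
def kv_bits_per_element(context_len, history_bits, sink=64, recent=256, full_bits=16):
    """Average bits per cached element when sink/recent tokens stay at full precision."""
    protected = min(context_len, sink + recent)
    history = context_len - protected
    return (protected * full_bits + history * history_bits) / context_len


def compression_ratio(avg_bits, full_bits=16):
    """How many times smaller the cache is than a full-precision one."""
    return full_bits / avg_bits


# History segment alone: 16-bit down to 2-bit.
history_only_ratio = compression_ratio(2)

# Whole cache at a 32 K context, including the BF16 sink and recent window.
avg_bits_32k = kv_bits_per_element(32 * 1024, history_bits=2)
whole_cache_ratio_32k = compression_ratio(avg_bits_32k)

print(f"history-only ratio: {history_only_ratio:.2f}x")
print(f"whole-cache ratio at 32 K: {whole_cache_ratio_32k:.2f}x")
```

Under a fixed memory budget, the number of sequences that fit in the cache grows by roughly the same ratio, which is consistent with the throughput gains reported at larger batch sizes.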
Conclusion : OSCAR shows that true 2‑bit KV caching can be both memory‑efficient and inference‑stable when the rotation is attention‑aware and backed by a real serving stack. It outperforms TurboQuant on KV quantization and provides a practical path for deploying long‑context LLM agents.
Machine Learning Algorithms & Natural Language Processing
