vLLM 0.22 Release: Production-Ready DeepSeek V4 and Extreme KV Cache Compression
The vLLM 0.22 stable release introduces production‑grade DeepSeek V4 support, fused kernels with speedups of up to 10‑20× on individual operations, Batch Invariance with a 28.9% end‑to‑end latency reduction, a Rust front‑end, multi‑level KV cache offload that can at least double effective context length, and hardware coverage spanning NVIDIA, AMD, Intel XPU, CPU, and RISC‑V. Together these make it a significant upgrade for inference infrastructure teams.
vLLM 0.22 Overview
The vLLM 0.22 stable release adds a set of updates and optimizations targeting large‑model inference, especially for DeepSeek V4, KV cache handling, and broader hardware support.
DeepSeek V4: From "Can Run" to Production‑Ready
DeepSeek V4 (1.6 T total parameters, 49 B activated MoE parameters, 1 M‑token context) moves from experimental to production‑grade support. The model code is reorganized into an independent package, vllm/models/deepseek_v4/, eliminating generic model‑class overhead.
Kernel Fusion Advances include six fused kernels:
NVFP4 Fused MoE : FP4‑based expert mixing on Blackwell.
MegaMoE kernel : Input preprocessing shifted to GPU, reducing host‑device transfers.
Sparse MLA + Compressor : Optimized sparse path for CSA/HCA attention.
Q‑norm / Indexer fusion : Quantization and indexing combined in one step.
Fused Q‑norm + KV RoPE + K insert : Static warp‑ID dispatch, zero cross‑warp communication, delivering a 10‑20× speedup.
Inverse RoPE + fp8 fusion : Eliminates back‑to‑back HBM reads, achieving a 2‑3× speedup.
CUDA Graph Full Support : Both Full and Piecewise modes are now supported, virtually removing kernel launch overhead on the decode path.
MTP speculative decoding lands for the first time on V4, further boosting generation speed.
KV Cache Compression
DeepSeek V4 introduces two compression stages: c4a (~4×) and c128a (~128×). In bf16, a 1‑million‑token context requires only 9.62 GiB of KV cache, an 8.7× reduction compared to V3.2’s 83.9 GiB. Adding an FP4 indexer and fp8 attention cache can halve the size again.
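The figures above can be checked with a few lines of arithmetic. This is only a sketch that reuses the two sizes quoted in the text; the variable names are illustrative, and the "halved" value simply applies the stated factor of two rather than a measured result.

```python
# KV cache sizes quoted in the text for a 1M-token context in bf16 (GiB).
V32_KV_GIB = 83.9   # DeepSeek V3.2
V4_KV_GIB = 9.62    # DeepSeek V4 with c4a / c128a compression

# Reduction factor of V4 relative to V3.2.
reduction = V32_KV_GIB / V4_KV_GIB

# The text says an FP4 indexer plus fp8 attention cache can halve the size again.
# This applies that stated factor directly; it is not a measurement.
halved_gib = V4_KV_GIB / 2

print(f"reduction: {reduction:.1f}x, after halving: {halved_gib:.2f} GiB")
```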
Batch Invariance: 28.9% Latency Improvement
Batch Invariance guarantees identical outputs for the same prompt across different batch configurations, crucial for reproducibility. Previously it incurred a performance penalty, but v0.22 delivers:
Cutlass FP8 path: 28.9% end‑to‑end latency reduction.
Cutlass FP8 padding preprocessing: 13.5% TTFT improvement.
SM80 compile‑mode support for A100 users.
NVFP4 Cutlass Linear: FP4 quantization now supports Batch Invariance.
TRITON_MLA decode full CUDA Graph capture.
Batch Invariance can be enabled by setting an environment variable before starting the server:
export VLLM_BATCH_INVARIANT=1
vllm serve meta-llama/Llama-3.1-8B-Instruct
Supported models include DeepSeek V3/R1, the full Qwen3 series, Qwen2.5, Llama 3, and others.
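One way to sanity-check the guarantee is to compare the output for a prompt served alone against its output when it shares a batch with other prompts. The sketch below is a generic harness, not part of vLLM: `check_batch_invariance` and the two stand-in generators are hypothetical names, and in practice `generate` would wrap whatever client call you use against the running server with greedy (temperature 0) sampling.

```python
from typing import Callable, List, Sequence


def check_batch_invariance(
    generate: Callable[[Sequence[str]], List[str]],
    prompt: str,
    fillers: Sequence[str],
) -> bool:
    """Return True if `prompt` yields the same output alone and in larger batches.

    `generate` is a hypothetical callable: it takes a batch of prompts and
    returns one completion per prompt, in the same order.
    """
    baseline = generate([prompt])[0]
    for n in range(1, len(fillers) + 1):
        batch = [prompt, *fillers[:n]]
        if generate(batch)[0] != baseline:
            return False
    return True


# Stand-in generators for illustration only (no model involved).
def invariant_generate(batch: Sequence[str]) -> List[str]:
    # Output depends only on the prompt itself.
    return [p.upper() for p in batch]


def batch_sensitive_generate(batch: Sequence[str]) -> List[str]:
    # Output leaks the batch size, mimicking batch-dependent numerics.
    return [f"{p}:{len(batch)}" for p in batch]
```

Running the same check with and without VLLM_BATCH_INVARIANT=1 set on the server is a simple way to see whether the flag changes behavior for a given model and workload.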
Rust Front‑End: Ending the Python Bottleneck
The experimental Rust front‑end is now merged into the main repository, removing the GIL and async‑dispatch overhead that limited high‑concurrency scenarios.
Code integration : Rust implementation lives in the main repo.
DP Supervisor : Rust‑based supervisor for data‑parallel request distribution.
Build integration : Integrated via setuptools‑rust into the Python build process, transparent to users.
Multi‑Level KV Cache Offload
KV cache management is a core bottleneck for long‑context inference. v0.22 introduces a hierarchical offload framework: GPU HBM → CPU DRAM → file system / disk.
Key capabilities:
Unified offload/load API supporting arbitrary level combinations.
Python file‑system second‑level storage persisting KV blocks to disk.
DeepSeek V4‑specific KV layout adaptation.
MooncakeStoreConnector for direct disk writes.
Per‑request KV block lifecycle tracking.
Loading KV cache from CPU reduces TTFT by 2‑22× (depending on prompt length) and can increase throughput by up to 9×. An 8×H100 machine (640 GB HBM) can at least double its effective context length by offloading to CPU memory and NVMe; the added latency is acceptable for prefill‑heavy batch workloads.
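The tiering idea can be illustrated with a small in-memory model. This is a minimal sketch assuming a least-recently-used demotion policy. `TieredKVStore` and its methods are hypothetical and are not vLLM's offload API, and plain dictionaries stand in for HBM, DRAM, and disk.

```python
from collections import OrderedDict


class TieredKVStore:
    """Toy model of GPU -> CPU -> disk KV block tiering with LRU demotion."""

    def __init__(self, capacities):
        # One ordered map per tier, fastest first. Capacity is counted in
        # blocks; None means the tier is treated as unbounded.
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]

    def _insert(self, block_id, data, level):
        tier = self.tiers[level]
        tier[block_id] = data
        tier.move_to_end(block_id)  # mark as most recently used
        cap = self.capacities[level]
        while cap is not None and len(tier) > cap:
            victim_id, victim = tier.popitem(last=False)  # least recently used
            if level + 1 < len(self.tiers):
                self._insert(victim_id, victim, level + 1)  # demote one tier
            # A victim from the last tier is simply dropped.

    def put(self, block_id, data):
        # Remove any stale copy, then place the block in the fastest tier.
        for tier in self.tiers:
            tier.pop(block_id, None)
        self._insert(block_id, data, 0)

    def get(self, block_id):
        # Return (data, level_found) and promote the block to the fastest tier.
        for level, tier in enumerate(self.tiers):
            if block_id in tier:
                data = tier.pop(block_id)
                self._insert(block_id, data, 0)
                return data, level
        return None, None

    def level_of(self, block_id):
        for level, tier in enumerate(self.tiers):
            if block_id in tier:
                return level
        return None
```

With capacities such as `[2, 2, None]`, inserting five blocks leaves the two newest in tier 0, the next two in tier 1, and the oldest in tier 2; reading the oldest block promotes it back to tier 0 and pushes colder blocks down.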
Hardware Ecosystem: Vendor‑Agnostic Support
NVIDIA Blackwell (SM12x) :
FlashInfer b12x MoE + FP4 GEMM.
Per‑tensor FP8 CUTLASS (SM121).
head_dim=512 support for large‑head models.
GDN Prefill Kernel (SM100/SM120).
AMD ROCm :
DSV4 full functionality with precision fixes and Tilelang MHC.
Flash Sparse MLA Triton kernels.
Gluon Paged MQA (gfx950/MI355X).
RMSNorm + Quant fusion (gfx950).
XGMI high‑speed interconnect.
CPU / RISC‑V :
RISC‑V Vector Extension attention kernels (VLEN=256).
AMX CPU fused GDN.
MXFP4 W4A16 MoE for CPU inference.
Experimental Triton + MRv2 CPU support.
Intel XPU :
INT4 GPTQ, MXFP8 MoE, FP8 block‑scaled quant.
Various sparse attention kernels.
MoE TopK routing with MXFP4 fallback.
Overall, vLLM evolves from an NVIDIA‑centric framework to a full‑hardware inference infrastructure.
Model Runner V2: Gradual Adoption
Oracle mechanism : Automatically selects MRv2 for suitable models (e.g., Qwen3 Dense).
Automatic fallback : Detects KV Connector and reverts to MRv1 without risk.
Sleep Mode : Releases GPU memory when idle, reloads on demand—useful for multi‑model sharing.
Shared KV Cache layer : Reuses KV memory across models.
Other Notable Changes
Quantization ecosystem : MXFP4/NVFP4 fully rolled out; quantization_config refactored to QuantKey with activation‑cover mode for layer‑specific strategies.
Disaggregated inference : NIXL improvements; GDN now supports prefill/decode (PD) disaggregation; fixes for multi‑node TP>8.
LoRA : One‑Shot Triton kernel accelerates MoE LoRA, supporting 2D and 3D adapters.
API extensions : thinking_token_budget and reasoning_effort map to enable_thinking, aligning with OpenAI semantics.
Breaking changes : get_tokenizer removed; MLA prefill parameters deprecated—upgrade checks required.
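For the upgrade check, a simple scan of your own code for the removed name is a reasonable first pass. The sketch below is an illustration only: `find_removed_api_uses` is a hypothetical helper, and the list of names contains just the one removal mentioned above, so extend it from the release notes as needed.

```python
import re
from pathlib import Path

# Names reported as removed in this release (from the note above).
REMOVED_NAMES = ("get_tokenizer",)


def find_removed_api_uses(root, names=REMOVED_NAMES):
    """Return (path, line_number, line) for each reference to a removed name.

    Matches whole identifiers only, so `get_tokenizer_group` is not flagged
    when scanning for `get_tokenizer`.
    """
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for file_path, lineno, line in find_removed_api_uses("."):
        print(f"{file_path}:{lineno}: {line}")
```

A text scan like this will also flag unrelated functions that happen to share the name, so treat the hits as a review list rather than a definitive report.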
Upgrade Recommendations
DeepSeek V4 users : Upgrade strongly recommended – first production‑ready version.
Need Batch Invariance : Upgrade strongly recommended – the 28.9% latency reduction largely removes the reproducibility‑versus‑speed trade‑off.
Blackwell users : Upgrade recommended – SM12x‑specific optimizations now widely available.
AMD ROCm users : Upgrade recommended – substantial ROCm parity improvements.
Long‑context inference : Evaluate upgrade – multi‑level KV offload dramatically extends effective context.
Stable production : Upgrade cautiously – watch for breaking changes.
Conclusion
The keyword for vLLM 0.22 is maturation. DeepSeek V4 moves from experimental to production, Batch Invariance becomes fast, KV offload expands to multiple levels, and the Rust front‑end progresses from concept to code integration. Horizontally, support expands from NVIDIA‑only to AMD, Intel, CPU, and RISC‑V; vertically, vLLM evolves from a pure inference engine to a complete inference infrastructure.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
