vLLM 0.22 Release: Production-Ready DeepSeek V4 and Extreme KV Cache Compression

The vLLM 0.22 stable release brings production‑grade DeepSeek V4 support, a set of fused kernels (one of them reported as 10‑20× faster), a 28.9% end‑to‑end latency reduction for Batch Invariance, a Rust front‑end merged into the main repository, multi‑level KV cache offload that can at least double effective context length, and broader hardware coverage across NVIDIA, AMD, Intel XPU, CPU and RISC‑V. It is a significant upgrade for inference infrastructure teams.

Old Zhang's AI Learning

vLLM 0.22 Overview

The vLLM 0.22 stable release adds a set of updates and optimizations targeting large‑model inference, especially for DeepSeek V4, KV cache handling, and broader hardware support.

DeepSeek V4: From "Can Run" to Production‑Ready

DeepSeek V4 (1.6 T total parameters, 49 B activated parameters per token, 1 M‑token context) moves from experimental to production‑grade support. The model code is reorganized into an independent package, vllm/models/deepseek_v4/, eliminating generic model‑class overhead.

Kernel Fusion Advances include six fused kernels:

NVFP4 Fused MoE: FP4‑based expert mixing on Blackwell.

MegaMoE kernel: Input preprocessing shifted to GPU, reducing host‑device transfers.

Sparse MLA + Compressor: Optimized sparse path for CSA/HCA attention.

Q‑norm / Indexer fusion: Query normalization and the indexer computation combined in one step.

Fused Q‑norm + KV RoPE + K insert: Static warp‑ID dispatch, zero cross‑warp communication, delivering a 10‑20× speedup.

Inverse RoPE + fp8 fusion: Eliminates back‑to‑back HBM reads, achieving a 2‑3× speedup.
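To make the benefit of fusion concrete, here is a toy Python model of memory traffic. It is only an illustration of the general idea (one pass over memory instead of several), not the CUDA kernels shipped in vLLM; the function names and the per‑element traffic accounting are assumptions made for this sketch.

```python
def run_unfused(values, ops):
    """Apply each op as a separate pass; every pass reads and writes the full buffer."""
    traffic = 0
    for op in ops:
        values = [op(v) for v in values]
        traffic += 2 * len(values)  # one read + one write per element, per pass
    return values, traffic


def run_fused(values, ops):
    """Apply all ops to an element while it is 'in registers'; one read, one write."""
    out = []
    for v in values:
        for op in ops:
            v = op(v)
        out.append(v)
    return out, 2 * len(values)


# Two stand-in element-wise ops (hypothetical; real kernels do norm, RoPE, casts, etc.)
ops = [lambda v: v * 0.5, lambda v: v + 1.0]
data = [1.0, 2.0, 3.0]

unfused_out, unfused_traffic = run_unfused(data, ops)
fused_out, fused_traffic = run_fused(data, ops)
print(unfused_out == fused_out, unfused_traffic, fused_traffic)
```

Both paths produce the same values, but the fused path touches memory once per element instead of once per operation, which is why fusing back‑to‑back memory‑bound steps pays off.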

CUDA Graph Full Support: Both Full and Piecewise modes are now supported, virtually removing kernel launch overhead on the decode path.

MTP speculative decoding lands for the first time on V4, further boosting generation speed.

KV Cache Compression

DeepSeek V4 introduces two compression stages: c4a (~4×) and c128a (~128×). In bf16, a 1‑million‑token context requires only 9.62 GiB of KV cache, an 8.7× reduction compared to V3.2’s 83.9 GiB. Adding an FP4 indexer and fp8 attention cache can halve the size again.
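The figures above can be checked with a few lines of arithmetic. This is a quick sanity check on the quoted numbers only; the per‑token estimate assumes that "1 million tokens" means exactly 10**6 tokens, which the source does not specify.

```python
# KV cache sizes quoted in the text (bf16, 1M-token context).
V4_KV_GIB = 9.62
V32_KV_GIB = 83.9

# Reduction factor of V4 relative to V3.2.
reduction = V32_KV_GIB / V4_KV_GIB

# "Can halve the size again" with an FP4 indexer and fp8 attention cache.
halved_gib = V4_KV_GIB / 2

# Assumption: 1M tokens == 10**6 tokens (not stated in the source).
bytes_per_token = V4_KV_GIB * 2**30 / 10**6

print(f"reduction vs V3.2: {reduction:.1f}x")
print(f"after halving: {halved_gib:.2f} GiB")
print(f"approx. KV bytes per token (bf16): {bytes_per_token:.0f}")
```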

Figure: DeepSeek V4 vs V3.2 KV Cache Comparison

Batch Invariance: 28.9% Latency Improvement

Batch Invariance guarantees identical outputs for the same prompt across different batch configurations, crucial for reproducibility. Previously it incurred a performance penalty, but v0.22 delivers:

Cutlass FP8 path: 28.9% end‑to‑end latency reduction.

Cutlass FP8 padding preprocessing: 13.5% TTFT improvement.

SM80 compile‑mode support for A100 users.

NVFP4 Cutlass Linear: FP4 quantization now supports Batch Invariance.

TRITON_MLA decode full CUDA Graph capture.

Batch Invariance can be enabled by setting an environment variable before starting the server:

export VLLM_BATCH_INVARIANT=1
vllm serve meta-llama/Llama-3.1-8B-Instruct

Supported models include DeepSeek V3/R1, the full Qwen3 series, Qwen2.5, Llama 3, and others.
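Why outputs can depend on batch configuration at all comes down to floating‑point addition not being associative: if a kernel splits a reduction differently for different batch shapes, the rounding differs. The toy Python sketch below shows the effect and the usual remedy (a fixed reduction order). It is an illustration of the principle, not vLLM's kernels; FIXED_CHUNK and the function names are hypothetical.

```python
def reduce_in_chunks(values, chunk):
    """Sum values by reducing fixed-size chunks first, then adding the partial sums."""
    total = 0.0
    for start in range(0, len(values), chunk):
        partial = 0.0
        for v in values[start:start + chunk]:
            partial += v
        total += partial
    return total


# Values chosen so that rounding depends on the order of additions.
values = [1e16, 1.0, -1e16, 1.0]

# A split that varies with batch shape gives different answers for the same data.
by_one = reduce_in_chunks(values, 1)
by_two = reduce_in_chunks(values, 2)
print(by_one, by_two)

# Batch-invariant approach: always use the same split, whatever the batch looks like.
FIXED_CHUNK = 2  # hypothetical constant for this sketch


def batch_invariant_reduce(values):
    return reduce_in_chunks(values, FIXED_CHUNK)


print(batch_invariant_reduce(values) == batch_invariant_reduce(list(values)))
```

Pinning the reduction order is what historically cost performance, since it rules out shape‑specific tuning; the improvements listed above narrow that gap.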

Rust Front‑End: Ending the Python Bottleneck

The experimental Rust front‑end is now merged into the main repository, removing the GIL and async‑dispatch overhead that limited high‑concurrency scenarios.

Code integration: Rust implementation lives in the main repo.

DP Supervisor: Rust‑based supervisor for data‑parallel request distribution.

Build integration: Integrated via setuptools‑rust into the Python build process, transparent to users.

Multi‑Level KV Cache Offload

KV cache management is a core bottleneck for long‑context inference. v0.22 introduces a hierarchical offload framework: GPU HBM → CPU DRAM → File system / Disk

Key capabilities:

Unified offload/load API supporting arbitrary level combinations.

Python file‑system second‑level storage persisting KV blocks to disk.

DeepSeek V4‑specific KV layout adaptation.

MooncakeStoreConnector for direct disk writes.

Per‑request KV block lifecycle tracking.
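The hierarchy described above can be pictured with a small tiered store. The following is a minimal sketch under stated assumptions, not vLLM's offload API: the class TieredKVStore, its methods, the LRU eviction policy and the pickle‑based file format are all hypothetical and exist only to show how blocks spill from a fast tier to slower ones and are promoted back on access.

```python
import os
import pickle
import tempfile
from collections import OrderedDict


class TieredKVStore:
    """Toy three-level store: 'gpu' dict -> 'cpu' dict -> files on disk (LRU spill)."""

    def __init__(self, gpu_capacity, cpu_capacity, disk_dir):
        self.gpu = OrderedDict()
        self.cpu = OrderedDict()
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity
        self.disk_dir = disk_dir

    def _disk_path(self, block_id):
        return os.path.join(self.disk_dir, f"{block_id}.kv")

    def put(self, block_id, block):
        """Insert into the fastest tier, then spill least-recently-used blocks down."""
        self.gpu[block_id] = block
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            old_id, old_block = self.gpu.popitem(last=False)
            self.cpu[old_id] = old_block
        while len(self.cpu) > self.cpu_capacity:
            old_id, old_block = self.cpu.popitem(last=False)
            with open(self._disk_path(old_id), "wb") as f:
                pickle.dump(old_block, f)

    def get(self, block_id):
        """Return (block, tier_it_was_found_in); promote hits from slower tiers."""
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id], "gpu"
        if block_id in self.cpu:
            block = self.cpu.pop(block_id)
            self.put(block_id, block)
            return block, "cpu"
        path = self._disk_path(block_id)
        if os.path.exists(path):
            with open(path, "rb") as f:
                block = pickle.load(f)
            os.remove(path)
            self.put(block_id, block)
            return block, "disk"
        return None, "miss"


with tempfile.TemporaryDirectory() as tmp:
    store = TieredKVStore(gpu_capacity=1, cpu_capacity=1, disk_dir=tmp)
    for name in ("a", "b", "c"):
        store.put(name, [name])
    found_in = [store.get(name)[1] for name in ("c", "b", "a", "missing")]
    print(found_in)
```

A real implementation moves tensors asynchronously and tracks block lifetimes per request; the sketch only captures the lookup order and the spill/promote behaviour.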

Figure: KV Cache Offload TTFT Performance Comparison

Loading KV cache from CPU reduces TTFT by 2‑22× (depending on prompt length) and can increase throughput by up to 9×. An 8×H100 machine (640 GB of HBM in total) can at least double its effective context length by offloading to CPU memory and NVMe; the added latency is acceptable for prefill‑heavy batch workloads.

Hardware Ecosystem: Vendor‑Agnostic Support

NVIDIA Blackwell (SM12x):

FlashInfer b12x MoE + FP4 GEMM.

Per‑tensor FP8 CUTLASS (SM121).

head_dim=512 support for large‑head models.

GDN Prefill Kernel (SM100/SM120).

AMD ROCm:

DSV4 full functionality with precision fixes and Tilelang MHC.

Flash Sparse MLA Triton kernels.

Gluon Paged MQA (gfx950/MI355X).

RMSNorm + Quant fusion (gfx950).

XGMI high‑speed interconnect.

CPU / RISC‑V:

RISC‑V Vector Extension attention kernels (VLEN=256).

AMX CPU fused GDN.

MXFP4 W4A16 MoE for CPU inference.

Experimental Triton + MRv2 CPU support.

Intel XPU:

INT4 GPTQ, MXFP8 MoE, FP8 block‑scaled quant.

Various sparse attention kernels.

MoE TopK routing with MXFP4 fallback.

Overall, vLLM is evolving from an NVIDIA‑centric framework into inference infrastructure that spans multiple hardware vendors.

Model Runner V2: Gradual Adoption

Oracle mechanism: Automatically selects MRv2 for suitable models (e.g., Qwen3 Dense).

Automatic fallback: Detects KV Connector and reverts to MRv1 without risk.

Sleep Mode: Releases GPU memory when idle and reloads on demand, which is useful for multi‑model sharing.

Shared KV Cache layer: Reuses KV memory across models.

Other Notable Changes

Quantization ecosystem: MXFP4/NVFP4 fully rolled out; quantization_config refactored to QuantKey with activation‑cover mode for layer‑specific strategies.

Disaggregated inference: NIXL improvements, GDN support for prefill/decode (PD) disaggregation, and fixes for multi‑node TP>8.

LoRA: One‑Shot Triton kernel accelerates MoE LoRA, supporting 2D and 3D adapters.

API extensions: thinking_token_budget and reasoning_effort map to enable_thinking, aligning with OpenAI semantics.

Breaking changes: get_tokenizer removed; MLA prefill parameters deprecated. Check your code for both before upgrading.
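One way to run the upgrade check for removed names is a simple source scan. The sketch below is a generic helper, not a vLLM tool: find_removed_api_usage and REMOVED_NAMES are hypothetical names, and the scan is purely textual, so it will also flag unrelated functions that happen to be called get_tokenizer.

```python
import re
import tempfile
from pathlib import Path

# Names reported as removed in this release (taken from the list above).
REMOVED_NAMES = ("get_tokenizer",)


def find_removed_api_usage(root, names=REMOVED_NAMES):
    """Return (file, line_number, line_text) for every textual use of a removed name."""
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


# Small self-contained demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "serving.py"
    sample.write_text("engine = build()\ntok = engine.get_tokenizer()\n", encoding="utf-8")
    demo_hits = find_removed_api_usage(tmp)
    for file_name, lineno, text in demo_hits:
        print(f"{Path(file_name).name}:{lineno}: {text}")
```

Pointing the function at a project root before upgrading gives a list of call sites to review by hand.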

Upgrade Recommendations

DeepSeek V4 users: Upgrade strongly recommended; this is the first production‑ready version.

Need Batch Invariance: Upgrade strongly recommended; the 28.9% latency reduction largely removes the reproducibility‑versus‑speed trade‑off.

Blackwell users: Upgrade recommended; SM12x‑specific optimizations are now widely available.

AMD ROCm users: Upgrade recommended; ROCm parity has improved substantially.

Long‑context inference: Evaluate the upgrade; multi‑level KV offload substantially extends effective context.

Stable production: Upgrade cautiously and watch for breaking changes.

Conclusion

The keyword for vLLM 0.22 is maturation. DeepSeek V4 moves from experimental to production, Batch Invariance becomes fast, KV offload expands to multiple levels, and the Rust front‑end progresses from concept to code integration. Horizontally, support expands from NVIDIA‑only to AMD, Intel, CPU, and RISC‑V; vertically, vLLM evolves from a pure inference engine to a complete inference infrastructure.
