vLLM 0.20 Arrives with DeepSeek V4 Support – What’s New?
The vLLM 0.20.0 release is a major upgrade to the inference engine: first‑class DeepSeek V4 support, CUDA 13.0 and PyTorch 2.11 as defaults, Transformers v5 compatibility, FlashAttention 4 MLA prefill, a TurboQuant 2‑bit KV cache, an online quantization front‑end, vLLM IR enhancements, Model Runner V2 progress, and a batch of new models, plus detailed installation and upgrade guidance.
Overview
vLLM 0.20.0 is a large, aggressive release: 752 commits from 320 contributors, many of them first‑time contributors. The most noteworthy changes are highlighted below.
Key Highlights
DeepSeek V4 first‑class support – Supported out of the box at release (issue #40860), with fixes for token leakage (issue #40806), illegal DSA + MTP access (issue #40772), and silu clamp limits on shared experts.
CUDA 13.0 as default – Both the PyPI CUDA wheel and the vllm/vllm-openai:v0.20.0 Docker image now target CUDA 13.0, following PyTorch 2.11’s upgrade. Users on CUDA 12.9 should install with uv and the flag --torch-backend=cu129.
PyTorch 2.11 and Python 3.14 support – The engine now runs on torch 2.11 (issues #34644, #37947) and officially supports Python 3.14 (issue #34770). This is a breaking change; a clean environment is recommended.
Transformers v5 compatibility – vLLM now works with transformers>=5, fixing issues for visual encoders, PaddleOCR, Mistral YaRN, Jina ColBERT, etc. (issue #30566).
FlashAttention 4 with MLA prefill enabled – FA4 becomes the default MLA prefill backend (issue #38819) and adds support for head‑dim 512 + paged‑KV on SM90+ (issue #38835), yielding visible prefill speedups for DeepSeek‑style models.
TurboQuant 2‑bit KV cache – A new attention backend compresses KV cache to 2‑bit, effectively quadrupling capacity (issue #38479) and integrates with FA3/FA4 prefill (issue #40092). This reduces memory pressure for 32K/128K context lengths.
Online quantization front‑end – End‑to‑end online quantization is now functional (issue #38138) with documentation updates (issue #39736). The experts_int8 path merges into the FP8 online path (issue #38463) and MXFP8 moves to the new front‑end (issue #40152), allowing models to be quantized at load time.
vLLM IR prototype – Added IR skeleton, rms_norm operator (issue #33825), OOT kernel import hooks (issue #38807), and Gemma RMS‑norm migration (issue #39014), together with test and benchmark scaffolding (issue #40167). This paves the way for more decoupled kernel work and easier integration with domestic hardware.
Model Runner V2 progress – Features include Eagle prefill full‑CUDA graph (issue #37588), automatic cudagraph mode selection based on attention backend (issue #32936), fused probability‑reject sampling kernel (issue #38496), multi‑prompt log‑probs (issue #39937), and a fix for precision regression (issue #39833). The MRV2 line is maturing.
New model support – DeepSeek V4, Hunyuan v3 preview, Granite 4.1 Vision, EXAONE‑4.5, Phi‑4‑reasoning‑vision‑15B, jina‑reranker‑v3, Jina Embeddings v5, and Nemotron‑v3 VL Nano/Super, among others, are supported out of the box.
Installation
Recommended installation with uv for stability:
uv pip install vllm==0.20.0
If the system uses CUDA 12.9:
uv pip install vllm==0.20.0 --torch-backend=cu129
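Either way, a quick sanity check confirms which build and CUDA backend were picked up (plain Python introspection, nothing specific to this release):
python -c "import vllm, torch; print(vllm.__version__, torch.__version__, torch.version.cuda)"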
Docker image:
docker pull vllm/vllm-openai:v0.20.0
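To launch the OpenAI‑compatible server straight from that image, a minimal sketch, assuming the image keeps its usual server entrypoint; adjust the GPU flags, cache mount, and model to your setup:
docker run --gpus all -p 8000:8000 --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.20.0 \
  --model deepseek-ai/DeepSeek-V3.2 --tensor-parallel-size 8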
Usage
Run a DeepSeek model (OpenAI‑compatible API):
vllm serve deepseek-ai/DeepSeek-V3.2 \
--tensor-parallel-size 8 \
--max-model-len 32768
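Once the server is up it can be queried with the standard OpenAI Python client; a minimal sketch (the localhost URL and dummy key are the usual local-server defaults, not anything new in 0.20.0):
from openai import OpenAI

# vllm serve exposes an OpenAI-compatible endpoint on port 8000 by default
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.2",
    messages=[{"role": "user", "content": "Summarize what MLA prefill speeds up."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)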
Enable the experimental 2‑bit KV cache:
vllm serve <model> \
--kv-cache-dtype turboquant \
--max-model-len 131072
Online quantization without pre‑converting weights:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--quantization fp8
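The same load‑time quantization also works through the offline Python API; a minimal sketch using the standard LLM and SamplingParams entry points (the tensor‑parallel size here is illustrative, not a recommendation):
from vllm import LLM, SamplingParams

# Weights are quantized to FP8 while loading; no pre-converted checkpoint is needed
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    quantization="fp8",
    tensor_parallel_size=4,
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain KV-cache quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)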
Practical Recommendations
Upgrade if you need to run new models such as DeepSeek V4, Hunyuan v3, or Gemma 4 – v0.20.0 offers the best performance.
For long‑context workloads, try the 2‑bit KV cache to save memory and fit larger models.
Deploy on domestic or other non‑NVIDIA hardware (Huawei Ascend, AMD MI300, Intel XPU) – the new IR and the ROCm/XPU fixes address many previous issues.
When to Hold Off
Stable production environments on v0.19.x may wait for the v0.20.1 patch.
Systems still on CUDA 12.x should either upgrade CUDA or use the --torch-backend=cu129 flag.
Python versions older than 3.12 may need an upgrade or a compatible wheel.
Interesting Details
Ray is no longer a default dependency (removed in v0.18.0); install manually if needed.
Memory profiling for CUDAGraph is enabled by default (issue #38284), giving clearer GPU usage at the cost of a slightly slower start‑up.
DBO micro‑batch optimizations from v0.19.0 are further generalized, and v0.20.0 adds extensive MoE refactoring for higher throughput.
Conclusion
vLLM 0.20.0 is a watershed release that aligns CUDA 13, PyTorch 2.11, and Transformers v5, requiring a fresh environment but delivering DeepSeek V4 support, 2‑bit KV cache, default FlashAttention 4 MLA prefill, and a full online quantization stack – a half‑year of deployment advantage for those who upgrade.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
