vLLM 0.19.0: HuggingFace v5 Support, Multimodal Boosts, and CPU KV Cache Offload

The vLLM 0.19.0 release adds first‑day Gemma 4 support, merges zero‑bubble asynchronous scheduling with speculative decoding, matures Model Runner V2, introduces full‑CUDA‑graph acceleration for ViT, generalizes DBO, brings CPU KV cache offload, and expands hardware and Transformers compatibility, offering substantial performance and flexibility gains for production LLM inference.

Overview of vLLM 0.19.0 updates

Gemma 4 first‑day support – Google’s top open‑source model runs on vLLM immediately after release.

Zero‑bubble async scheduling + speculative decoding – The two major optimizations now coexist without contention, giving high throughput and low latency together.

Model Runner V2 matured – Moves from experimental to production‑grade, adding many capabilities.

ViT full CUDA graph – Multimodal visual encoder now benefits from graph capture, eliminating per‑batch kernel launch overhead.

General CPU KV cache offload – When GPU memory is insufficient, KV cache spills to CPU with a pluggable cache policy.

DBO generalization – Dual‑Batch Overlap works for all model architectures, improving throughput.

NVIDIA B300/GB300 support – First‑day adaptation for the new SM 10.3 hardware.

Transformers v5 compatibility – Broad HuggingFace integration reduces compatibility issues.

Zero‑bubble async scheduling × speculative decoding: finally combined

In vLLM v0.18 the two optimizations conflicted because speculative decoding’s rejection sampling required a GPU‑to‑CPU sync before the next input could be prepared, creating a stall. v0.19.0 moves input preparation onto the GPU, allowing the rejection‑sampling result to be consumed directly on the GPU and eliminating the sync point. The result is simultaneous high‑throughput async scheduling and low‑latency speculative decoding.
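
As a rough sketch of enabling the combination (the model name and draft settings are placeholders; flag spellings should be checked against vllm serve --help):

# Launch with async scheduling and an n-gram draft method enabled together
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --async-scheduling \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 4}'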

Model Runner V2: from experimental to production‑grade

v0.18 labeled MRV2 as experimental and omitted many features (LoRA, linear attention, and draft methods other than Eagle). v0.19.0 adds the following capabilities:

Pipeline Parallelism CUDA graph – Captures segmented CUDA graphs for pipeline parallelism, preventing speed loss on multi‑GPU deployments.

Speculative decoding rejection sampler – Supports greedy decoding and logprobs output.

Multimodal + speculative decoding – Enables speculative decoding for multimodal models.

Streaming inputs – Reduces first‑token latency.

EPLB – Expert Parallelism Load Balancer, essential for MoE models.

FP32 draft logits + FP64 Gumbel noise – Improves numerical stability during speculative decoding.

Production use is enabled by setting the environment variable:

export VLLM_USE_V2_MODEL_RUNNER=1
# then run vLLM as usual

With MRV2, the combined engine (async scheduling, speculative decoding, CUDA graph) raises the performance ceiling beyond v0.18.

ViT full CUDA graph capture

Previously, each image/video request launched a fresh set of CUDA kernels for the visual encoder, causing noticeable overhead in small‑batch scenarios. v0.19.0 records the ViT computation graph once and replays it for subsequent inferences, cutting launch overhead and noticeably lowering latency for multimodal models such as Gemma 4 and Qwen‑VL.
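
A minimal sketch, assuming the capture is governed by vLLM's standard compilation config (both the config key and whether it covers the visual encoder path are assumptions):

# Request full CUDA-graph capture via the compilation config (key and value assumed)
vllm serve google/gemma-4-31b-it --compilation-config '{"cudagraph_mode": "FULL"}'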

CPU KV cache offload: spilling when GPU memory runs out

Long‑context requests can consume several gigabytes of KV cache, exhausting GPU memory. v0.19.0 introduces a generic CPU KV cache offload mechanism with a pluggable CachePolicy that decides which blocks to evict to CPU, supporting block‑level granularity and mixed‑model (SSM + Transformer) workloads. This effectively creates a “virtual memory” for KV cache.

Pluggable cache policy – Customizable eviction strategy.

Block‑level eviction – Fine‑grained control over which blocks are offloaded.

Mixed‑model support – Works for SSM and dense Transformer architectures.
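
A minimal sketch, assuming the offload is exposed through the existing KV-connector interface (the connector name and extra-config keys below are assumptions):

# Spill KV cache blocks to CPU memory when GPU blocks run out (sizes are illustrative)
vllm serve Qwen/Qwen3-8B \
  --kv-transfer-config '{
    "kv_connector": "OffloadingConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {"num_cpu_blocks": 10000}
  }'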

DBO generalization: micro‑batch overlap for all models

Dual‑Batch Overlap (DBO) previously applied only to specific architectures. v0.19.0 makes DBO universal, allowing any model to benefit from overlapping pre‑fill and decode micro‑batches, thereby increasing throughput.
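
A minimal sketch, assuming DBO keeps an explicit opt-in flag (the flag name is an assumption; check vllm serve --help):

# Opt in to dual-batch overlap for any served model
vllm serve Qwen/Qwen3-8B --enable-dbo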

Hardware support updates

NVIDIA B300/GB300 (SM 10.3) – All‑Reduce fusion enabled by default and CUTLASS FP8 GEMM optimizations for Blackwell.

AMD ROCm – Updated to ROCm 7.2.1, PyTorch 2.10, Triton 3.6; DeepEP backend added for all‑to‑all communication.

Intel XPU – MLA model support with W4A8 quantization.

CPU – tcmalloc enabled by default, yielding a 48.9 % throughput boost for pure‑CPU deployments.

API and other notable updates

New endpoint – /v1/chat/completions/batch provides a dedicated batch inference API (see the request sketch after this list).

Thinking‑tokens hard limit – Allows setting a maximum “thinking” length for models like Qwen3‑Coder.

Short‑hand flag – -sc now aliases --speculative-config.

Quantization updates – Online MXFP8 quantization for MoE and dense models; QeRL adds online quantization + reload for RLHF scenarios.

Transformers v5 compatibility – Broad HuggingFace v5 support eliminates many compatibility errors.
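
A request to the new batch endpoint might look like the following; the body schema is an assumption modeled on the standard /v1/chat/completions format:

# Send several chat requests in one call to the batch endpoint (payload shape assumed)
curl http://localhost:8000/v1/chat/completions/batch \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31b-it",
    "requests": [
      {"messages": [{"role": "user", "content": "Hello"}]},
      {"messages": [{"role": "user", "content": "Summarize vLLM 0.19.0 in one sentence."}]}
    ]
  }'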

Blog 1: Hidden‑state extraction – unlocking speculative‑decoding training pipelines

The blog explains how vLLM reuses its existing Eagle‑3 hidden‑state pipeline and KV Connector API to create a “fake” draft model that merely stores hidden states in its KV cache and exports them, enabling zero‑intrusion hidden‑state extraction.

Key steps:

Leverage the existing hidden‑state pipeline used by Eagle‑3.

Use the KV Connector API (supports disk, shared memory, Nixl, etc.) for data transfer.

Treat hidden states and KV cache as the same paged memory structure.

The resulting system allows full‑feature use of prefix cache, block pre‑fill, and automatic batching while extracting hidden states without modifying vLLM core code.

Example command:

vllm serve Qwen/Qwen3-8B --speculative_config '{
  "method": "extract_hidden_states",
  "num_speculative_tokens": 1,
  "draft_model_config": {
    "hf_config": {
      "eagle_aux_hidden_state_layer_ids": [3, 18, 33, 36]
    }
  }
}' --kv_transfer_config '{
  "kv_connector": "ExampleHiddenStatesConnector",
  "kv_role": "kv_producer",
  "kv_connector_extra_config": {
    "shared_storage_path": "/tmp/hidden_states"
  }
}'

Output files (safetensors) contain token_ids and hidden_states arrays for each request.

Supported flags include --tensor-parallel-size and --data-parallel-size. Currently only disk‑based ExampleHiddenStatesConnector is available; GPU‑direct transfer will be added later.

This feature is integrated into the Speculators library (PR #353) and will enable online training of draft models, closing the loop between inference and data generation.

Blog 2: Gemma 4 on vLLM – Day 0 four‑platform support

vLLM supports Gemma 4 on launch across four hardware platforms:

NVIDIA GPUs (A100, H100, B200)

Google TPUs (Trillium, Ironwood)

AMD GPUs (ROCm platform)

Intel XPU (first‑day inclusion)

TPU support is a notable addition, removing the previous gap for teams using Google Cloud.

Performance comparisons on Arena.ai show Gemma 4 surpassing similarly sized peers in parameter efficiency.

Key capabilities of Gemma 4 in vLLM:

Multimodal – native image and video handling; edge models also support audio input.

Tool calling – built‑in function calling and structured JSON output with a dedicated parser.

Long context – 128 K tokens for edge models, 256 K for large models.

Advanced reasoning – strong performance on complex math and logic tasks.

140+ languages – native support.

Apache 2.0 license – commercial‑ready.

Quick start using the pre‑built Docker image:

# simplest way
docker run --gpus all vllm/vllm-openai:gemma4

Or manual launch (requires transformers>=5.5.0):

pip install vllm==0.19.0
vllm serve google/gemma-4-31b-it \
  --tensor-parallel-size 2 \
  --trust-remote-code

Further deployment details are in the official recipes.
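
Once the server is up, it can be queried through the standard OpenAI-compatible chat endpoint:

# Basic sanity check against the running server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-31b-it",
    "messages": [{"role": "user", "content": "Explain CPU KV cache offload in one paragraph."}]
  }'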

Summary and recommendations

Combining the v0.19.0 release notes with the two blogs reveals a clear trajectory: the engine matures (MRV2, zero‑bubble scheduling), acceleration directions expand (hidden‑state extraction, speculative decoding customization), model ecosystem speeds up (Gemma 4 rapid multi‑platform support), and hardware coverage broadens (B300/GB300, ROCm, TPU, XPU).

If you use speculative decoding, upgrade to v0.19.0 – the zero‑bubble combination gives free throughput gains.

For multimodal workloads, enable ViT CUDA graphs and MRV2 multimodal speculative decoding to realize the latency improvements.

If GPU memory is a bottleneck, try the CPU KV cache offload for long‑context scenarios.

Adopt Model Runner V2 for production inference; while LoRA is still unsupported, pure inference is ready.

[Figure: vLLM update illustration]
[Figure: Hidden state extraction system design]
[Figure: Gemma 4 performance comparison]