vLLM 0.24.0 Release: New Features for Faster, Memory‑Efficient Large‑Model Deployment

The vLLM 0.24.0 update adds MiniMax‑M3, DeepSeek‑V4, DiffusionGemma support, a Streaming Parser Engine, and a new device_ids parameter, delivering faster inference, lower memory use, and broader hardware compatibility for large‑model deployments.

Old Zhang's AI Learning

Jul 2, 2026

vLLM is a fast, memory‑efficient inference engine known for its PagedAttention technique, and version 0.24.0 has just been released. The author reviews the most notable changes and their impact on large‑model deployment.

MiniMax‑M3 First‑Class Support

The update adds full MiniMax‑M3 compatibility, including BF16/FP8 indexer, MXFP4 quantization, FP8 sparse GQA, and ROCm tuning (e.g., gfx950 MxFP8 MoE tuning, MI300X BF16 weights with FP8 per‑channel quantization). It also fixes a performance regression in MiniMax‑M2.

DeepSeek‑V4 Maturation

FlashInfer sparse index cache improves TTFT by 2‑4%.

Prefill chunk‑planning optimization raises end‑to‑end throughput by 4%.

Cluster‑cooperative topK kernel added for low‑latency scenarios.

KV cache allocation changed to per‑block contiguous allocation.

OOM bug fixed.

Support extended to SM120, Intel XPU, and AMD ROCm.

These modest‑looking percentages translate into noticeable capacity gains for 1‑trillion‑parameter models and more stable multi‑request, MoE, and long‑context workloads.
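To see why a single‑digit gain can matter at this scale, here is a back‑of‑the‑envelope sketch. Only the 4% throughput figure comes from the list above; the traffic target and per‑replica throughput are made‑up illustration values, not measurements from the release.

```python
import math


def replicas_needed(target_tokens_per_s: float, per_replica_tokens_per_s: float) -> int:
    """Smallest number of replicas whose combined throughput meets the target."""
    return math.ceil(target_tokens_per_s / per_replica_tokens_per_s)


# Hypothetical inputs, for illustration only.
TARGET = 200_000.0   # aggregate tokens/s the service must sustain
BASELINE = 2_500.0   # tokens/s per replica before the upgrade
GAIN = 0.04          # the 4% end-to-end throughput gain quoted above

before = replicas_needed(TARGET, BASELINE)
after = replicas_needed(TARGET, BASELINE * (1 + GAIN))
print(f"replicas before: {before}, after: {after}, saved: {before - after}")
```

With these assumed inputs the same traffic fits on 77 replicas instead of 80. When each replica of a very large model spans many accelerators, a few replicas saved is a meaningful amount of hardware.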

DiffusionGemma Joins the Main Line

DiffusionGemma is now officially supported, with a CPU execution path and structured‑output protection for the diffusion decoder, showing vLLM’s expanding support for non‑autoregressive generation paradigms.

Model Runner V2 Expands Capabilities

In Model Runner V2 (MRv2), version 0.24.0 enables quantized‑model support by default, turns on GraniteMoE by default, migrates the Qwen and DeepSeek‑V2 MoE models, adds DFlash speculative decoding, and improves FP32 Gumbel sampling. The author interprets this as vLLM moving into a “core path replacement” phase, where new optimizations will land first in MRv2.

Streaming Parser Engine for the Agent Era

The new Streaming Parser Engine consolidates parsing logic for tool calls, reasoning, structured fragments, and incremental streaming across diverse model formats (Qwen3, MiniMax‑M2, GLM, Nemotron). This uniform engine is crucial for reliable agent deployments, where parsing errors can cause failures.

Device Selection Change

vLLM no longer sets CUDA_VISIBLE_DEVICES internally; instead it introduces a device_ids parameter for explicit GPU selection. Users must adjust scripts that relied on the previous internal handling, especially in multi‑service GPU sharing, Kubernetes GPU allocation, Ray or multi‑process workers, and ROCm environments.
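For multi‑service GPU sharing, one way to prepare for explicit selection is to compute a disjoint list of GPU indices per service up front. The sketch below does only that planning step and does not call vLLM. The helper name `assign_device_ids` and the service names are hypothetical, and the source does not state how device_ids is spelled (CLI flag or constructor argument), so that should be checked against the 0.24.0 documentation.

```python
def assign_device_ids(num_gpus: int, demand: dict[str, int]) -> dict[str, list[int]]:
    """Give each service a disjoint, contiguous slice of GPU indices.

    Raises ValueError if a service asks for fewer than one GPU or if the
    host does not have enough GPUs for the total demand.
    """
    plan: dict[str, list[int]] = {}
    next_free = 0
    for service, count in demand.items():
        if count <= 0:
            raise ValueError(f"{service}: GPU count must be positive, got {count}")
        if next_free + count > num_gpus:
            raise ValueError(f"{service}: not enough GPUs left on this host")
        plan[service] = list(range(next_free, next_free + count))
        next_free += count
    return plan


# Hypothetical services sharing one 8-GPU host.
plan = assign_device_ids(8, {"chat": 4, "embed": 1, "rerank": 2})
for service, ids in plan.items():
    print(service, ids)
```

Each list would then be handed to the corresponding vLLM instance. If an outer scheduler still sets CUDA_VISIBLE_DEVICES, CUDA renumbers the visible devices from 0 inside the process, so the indices should be relative to what that process can see.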

Upgrade and Usage Guidance

To install the latest release into a fresh virtual environment with uv:

uv venv --python 3.12 --seed
source .venv/bin/activate
uv pip install vllm --torch-backend=auto

To pin the exact version:

uv pip install "vllm==0.24.0" --torch-backend=auto
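A deployment script can confirm that the environment really contains the pinned version before starting a server. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `check_pin` is hypothetical, and ignoring a local version suffix (the part after `+`) is an assumption about how a given build may be labeled.

```python
from importlib import metadata


def check_pin(package: str, expected: str, lookup=metadata.version):
    """Return (matches, installed_version); installed_version is None if absent.

    A local version suffix such as "+something" is ignored when comparing.
    """
    try:
        installed = lookup(package)
    except metadata.PackageNotFoundError:
        return False, None
    return installed.split("+", 1)[0] == expected, installed


ok, found = check_pin("vllm", "0.24.0")
print(f"vllm pinned correctly: {ok} (installed: {found})")
```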

For production, the author recommends creating a fresh environment, fixing the version, and running regression tests with real request sets covering multimodal, tool‑call, reasoning, quantized, LoRA, and long‑context scenarios before replacing existing deployments.
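The regression step can be sketched as a per‑scenario comparison between outputs captured from the existing deployment and from the 0.24.0 candidate. Everything below is illustrative: the function name, the case IDs, and the sample outputs are made up, and exact string equality is only a reasonable check when decoding is deterministic. For tool calls, comparing the parsed structure is usually more robust.

```python
from collections import defaultdict


def summarize_regressions(cases, baseline, candidate):
    """Count output mismatches per scenario.

    cases:     list of (case_id, scenario) pairs
    baseline:  dict case_id -> output from the current deployment
    candidate: dict case_id -> output from the new deployment
    Returns {scenario: (mismatches, total)}.
    """
    stats = defaultdict(lambda: [0, 0])
    for case_id, scenario in cases:
        stats[scenario][1] += 1
        if baseline.get(case_id) != candidate.get(case_id):
            stats[scenario][0] += 1
    return {scenario: tuple(counts) for scenario, counts in stats.items()}


# Made-up request set; a real one would come from recorded production traffic.
cases = [("t1", "tool-call"), ("t2", "tool-call"), ("r1", "reasoning"), ("l1", "long-context")]
baseline = {"t1": '{"name": "search"}', "t2": '{"name": "lookup"}', "r1": "42", "l1": "summary A"}
candidate = {"t1": '{"name": "search"}', "t2": '{"name": "lookup"', "r1": "42", "l1": "summary A"}

report = summarize_regressions(cases, baseline, candidate)
for scenario, (bad, total) in report.items():
    print(f"{scenario}: {bad}/{total} mismatches")
```

Grouping by scenario makes it obvious when a change affects only one path, for example a truncated tool call, while the other request types are unchanged.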

Conclusion

The author rates vLLM 0.24.0 as “upgrade‑worthy but requires careful testing.” It is especially valuable for users of MiniMax‑M3, DeepSeek‑V4, DiffusionGemma, Qwen multimodal models, or Rust front‑end services, and for anyone who needs stable streaming parsing. Users running ordinary models on a single GPU can wait for community feedback.
