vLLM 0.23.0 Brings Faster Local LLM Deployment and Wider Hardware Support

Version 0.23.0 of the open-source vLLM inference engine hardens DeepSeek-V4 support, extends Model Runner V2 to Llama and Mistral alongside Qwen3 and newly added models, moves the Rust front-end toward production readiness, and adds multi-level KV-cache offloading. It also brings hardware optimizations across NVIDIA, AMD, Intel, TPU and RISC-V plus API enhancements, with individual optimizations yielding gains of up to 20 % while simplifying deployment.

Old Zhang's AI Learning

vLLM is one of the most popular open‑source large‑model inference engines, supporting models from DeepSeek to Llama and Qwen. Version 0.23.0 focuses on making more models run faster on more hardware.

DeepSeek‑V4 Maturity

The previous release, v0.22.0, introduced DeepSeek-V4 support. v0.23.0 hardens and optimizes it: sparse MLA metadata is decoupled from V3.2, and the release adds the TRTLLM gen-attention kernel, EPLB support for Mega-MoE, selective prefix caching for the sliding-window KV cache, and the DSA MTP index-share feature. As a result, DeepSeek-V4 runs faster and more stably. Its torch.compile dependency has also been removed, which improves startup speed and compatibility, and a new XPU attention decode path enables it on Intel GPUs.

Model Runner V2 Coverage

Model Runner V2 is now enabled by default for Llama and Mistral dense models. Together with the existing Qwen3 support, it covers virtually all mainstream open-source models. New additions include the FlashInfer sampler, interruptible CUDA graphs, pipeline-parallel bubble elimination, mixed-model kernel block-size support, and Gemma 4 MTP. Production users of Llama or Mistral get these performance gains automatically, without manual configuration.

Rust Front‑End Matures

The experimental Rust front‑end gains many production‑grade features: a generate endpoint, dynamic LoRA endpoint, /version and /server_info endpoints, server routing hooks, request‑ID header, and new tool parsers for InternLM2, hy_v3, Phi‑4‑mini, and Gemma4. The author notes the progress exceeds expectations and the next release may no longer be "experimental".

Gemma 4 Full Support

Google's Gemma 4 receives comprehensive support, including encoder‑free Gemma 4 Unified, Gemma 4 MTP (multi‑token prediction), accuracy and startup fixes, automatic exclusion of vision embedder during quantization, and native implementation of ViT linear layers.

Multi‑Level KV‑Cache Offloading

The KV-cache offloading framework adds an object store as a secondary storage tier, enables HMA by default, and supports per-request offloading policies. This is especially useful for extremely long contexts, where KV-cache blocks can be offloaded to CPU memory or object storage.
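The idea of tiered offloading with a per-request policy can be illustrated with a toy model. This is only a conceptual sketch and not vLLM's implementation or API: the class TieredCache, the tier names, the capacities and the offload flag are all hypothetical, and the object-store tier is treated as unbounded for simplicity.

```python
from collections import OrderedDict


class TieredCache:
    """Toy three-tier block cache: gpu -> cpu -> object_store (hypothetical)."""

    def __init__(self, gpu_capacity, cpu_capacity):
        self.order = ["gpu", "cpu", "object_store"]
        self.tiers = {name: OrderedDict() for name in self.order}
        # The object store has no entry here, so it is treated as unbounded.
        self.capacity = {"gpu": gpu_capacity, "cpu": cpu_capacity}

    def put(self, key, block, offload=True):
        """Insert a block into the fastest tier.

        offload is the per-request policy: if False, the block is dropped
        on eviction instead of being demoted to a slower tier.
        """
        self._insert(0, key, block, offload)

    def _insert(self, level, key, block, offload):
        name = self.order[level]
        tier = self.tiers[name]
        tier[key] = (block, offload)
        tier.move_to_end(key)  # mark as most recently used
        limit = self.capacity.get(name)
        while limit is not None and len(tier) > limit:
            # Evict the least recently used block from this tier.
            old_key, (old_block, old_offload) = tier.popitem(last=False)
            if old_offload:
                self._insert(level + 1, old_key, old_block, old_offload)
            # Otherwise the block is dropped and would have to be recomputed.

    def get(self, key):
        """Return (block, tier_name) and promote the block to the gpu tier."""
        for name in self.order:
            if key in self.tiers[name]:
                block, offload = self.tiers[name].pop(key)
                self._insert(0, key, block, offload)
                return block, name
        return None, None


cache = TieredCache(gpu_capacity=1, cpu_capacity=1)
cache.put("a", "A")
cache.put("b", "B")  # "a" is demoted to the cpu tier
cache.put("c", "C")  # "b" moves to cpu, "a" moves to the object store
first_hit = cache.get("a")   # served from the object store, then promoted
second_hit = cache.get("a")  # now served from the gpu tier
```

The design choice worth noting is that eviction demotes a block rather than discarding it, so a later hit costs a transfer instead of a recomputation; a request that opts out of offloading trades that saving for lower storage use.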

Performance Improvements

CUTLASS FP8 scaled‑mm padding bypass : +20%

MoE‑permute buffer pre‑allocation : +9‑14%

Triton MoE backend enabled by default on Hopper

Selective_state_update tuning for H200/RTX PRO

Gemma RMS all‑reduce fusion

DGX B300 NUMA auto‑binding

The 20 % figure comes from the CUTLASS FP8 padding bypass. Like the other items, it is an engineering optimization that eliminates redundant computation rather than an algorithmic breakthrough.
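How much a gain like this matters end to end depends on how much of the total runtime the optimized kernel accounts for. The sketch below applies Amdahl's law under the assumption that the figure describes the kernel in isolation, which the release summary does not state; the runtime fractions are hypothetical values chosen for illustration.

```python
def end_to_end_speedup(kernel_fraction, kernel_speedup):
    """Amdahl's law: overall speedup when only part of the runtime gets faster.

    kernel_fraction: share of total runtime spent in the optimized kernel (0..1)
    kernel_speedup:  speedup of that kernel alone (1.2 means 20 % faster)
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


# Hypothetical runtime shares; real values depend on model and workload.
for fraction in (0.1, 0.3, 1.0):
    overall = end_to_end_speedup(fraction, 1.2)
    print(f"kernel share {fraction:.0%}: overall speedup {overall:.3f}x")
```

Only when the kernel dominates the runtime does the overall gain approach the kernel-level figure, so measuring on your own workload is the reliable way to judge the benefit.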

Hardware Support

NVIDIA : full optimization for Hopper (H100/H200) and new DGX B300 NUMA binding

AMD ROCm : upgraded to 7.2.3, AITER v0.1.13.post1, native W4A16 kernel for RDNA3 (gfx1100)

Intel XPU : vllm‑xpu‑kernel v0.1.7 with FP8 MoE and DeepSeek‑V4 decode path

CPU : AMD Zen acceleration (zentorch) and CPU Triton sampling

TPU : tpu‑inference upgraded to v0.21.0

RISC‑V : WNA16 helpers

ARM64 : CI image support

PowerPC : SHM communicator

Few open-source inference engines cover this range of architectures.

New Models Added

Step‑3.7‑Flash: Flash version of the "Step" series

Cosmos3 Reasoner: NVIDIA reasoning model

Gemma 4 Unified: Google encoder‑free multimodal model

JetBrains Mellum v2: code generation model

Granite Speech Plus: IBM speech model

Cohere Mini Code: Cohere's small code model

Additional fixes cover Qwen3-VL, GLM-5.1, GLM-4.1V, MiniCPM-V-4.6 and Kimi-K2.5, so most mainstream Chinese models are covered.

API Updates

Anthropic Messages API : supports structured output and the effort parameter

OpenAI Responses API : adds the system_fingerprint field and streaming tool calling with the required tool choice

Unified Parser : merges reasoning and tool-call parsing into Parser.parse(), simplifying downstream development
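As a rough illustration of the Responses API item, the sketch below builds a streaming request that forces a tool call. The field layout follows the public OpenAI Responses API format, and "required" is read here as the tool_choice value of that name. The base URL, the model name my-model and the get_weather tool are hypothetical placeholders, and the request is only constructed, not sent. Whether a given vLLM deployment accepts every field should be checked against its documentation.

```python
import json
import urllib.request


def build_responses_request(base_url, model, prompt):
    """Build (but do not send) a streaming Responses API request."""
    payload = {
        "model": model,             # hypothetical model name
        "input": prompt,
        "stream": True,             # ask for server-sent events
        "tool_choice": "required",  # the model must call a tool
        "tools": [
            {
                "type": "function",
                "name": "get_weather",  # hypothetical tool
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            }
        ],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/responses",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local server address; adjust to your deployment.
req = build_responses_request("http://localhost:8000", "my-model", "Weather in Paris?")
body = json.loads(req.data)
print(req.full_url)
print(json.dumps(body, indent=2))
```

Passing req to urllib.request.urlopen would send it; the response would then arrive as a stream of events rather than a single JSON document.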

Installation

pip install vllm==0.23.0

For hardware‑specific builds (e.g., ROCm), refer to the official documentation for the appropriate install command.
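After installing, it can be useful to confirm which version is actually active in the environment. The following is a minimal sketch using only the Python standard library; the helper names are illustrative, and ignoring a local build suffix after "+" is an assumption about how hardware-specific builds may be labeled.

```python
from importlib import metadata


def installed_version(package="vllm"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_release(installed, expected="0.23.0"):
    """Compare versions, ignoring any local build suffix after '+'."""
    if installed is None:
        return False
    return installed.split("+", 1)[0] == expected


if __name__ == "__main__":
    found = installed_version()
    if matches_release(found):
        print(f"vllm {found} is installed")
    else:
        print(f"expected vllm 0.23.0, found {found}")
```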

Upgrade Guidance

Upgrading is not urgent if your current version is stable and you do not need the new features; waiting a week or two for community feedback is reasonable. Note that this release does not yet support MiniMax M3; running it requires following the vLLM recipe.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
