vLLM 0.23.0 Brings Faster Local LLM Deployment and Wider Hardware Support
Version 0.23.0 of the open‑source vLLM inference engine hardens DeepSeek‑V4 support, extends Model Runner V2 to Llama and Mistral alongside Qwen3 and new models, matures the still‑experimental Rust front‑end, and adds multi‑level KV‑cache offloading. It also ships hardware optimizations across NVIDIA, AMD, Intel, TPU and RISC‑V, plus API enhancements. Individual optimizations are reported at up to 20 % faster, and deployment is simpler.
vLLM is one of the most popular open‑source large‑model inference engines, supporting models from DeepSeek to Llama and Qwen. Version 0.23.0 focuses on making more models run faster on more hardware.
DeepSeek‑V4 Maturity
The previous release, v0.22.0, introduced DeepSeek‑V4 support. v0.23.0 adds extensive hardening and optimization: sparse MLA metadata is decoupled from V3.2, and the release adds the TRTLLM gen‑attention kernel, EPLB support for Mega‑MoE, selective prefix caching for the sliding‑window KV cache, and the DSA MTP index‑share feature. As a result, DeepSeek‑V4 runs faster and more stably. Its torch.compile dependency has also been removed, which improves startup speed and compatibility, and a new XPU attention decode path enables Intel GPUs.
Model Runner V2 Coverage
Model Runner V2 now enables Llama and Mistral dense models by default, and together with existing Qwen3 support, it covers virtually all mainstream open‑source models. New additions include the FlashInfer sampler, interruptible CUDA graphs, pipeline‑parallel bubble elimination, mixed‑model kernel block‑size support, and Gemma 4 MTP. Production users of Llama or Mistral automatically benefit from these performance gains without manual configuration.
Rust Front‑End Matures
The experimental Rust front‑end gains many production‑grade features: a generate endpoint, a dynamic LoRA endpoint, /version and /server_info endpoints, server routing hooks, a request‑ID header, and new tool parsers for InternLM2, hy_v3, Phi‑4‑mini, and Gemma4. The author notes that progress has exceeded expectations and that the next release may drop the "experimental" label.
Gemma 4 Full Support
Google's Gemma 4 receives comprehensive support, including encoder‑free Gemma 4 Unified, Gemma 4 MTP (multi‑token prediction), accuracy and startup fixes, automatic exclusion of vision embedder during quantization, and native implementation of ViT linear layers.
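Multi‑token prediction heads are commonly used as a draft source for speculative decoding: the head proposes several tokens, and the main model keeps only the prefix it agrees with. The following is a minimal sketch of that greedy verification step, not vLLM's or Gemma 4's actual code; the function name accept_draft and the token lists are hypothetical.

```python
from typing import List, Sequence


def accept_draft(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Greedy verification of drafted tokens (illustrative only).

    draft:  k tokens proposed by a draft head (for example an MTP head).
    target: k + 1 tokens the main model would pick greedily at the same
            positions, plus one for the position after the last draft token.
    Returns the tokens that are actually emitted in this step.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must contain exactly one more token than draft")

    emitted: List[int] = []
    for drafted, verified in zip(draft, target):
        if drafted != verified:
            # First disagreement: emit the main model's token and stop.
            emitted.append(verified)
            return emitted
        emitted.append(drafted)

    # Every drafted token matched, so the extra target token comes for free.
    emitted.append(target[-1])
    return emitted
```

With three drafted tokens, one step emits between one and four tokens, which is where the speedup comes from when the draft head is usually right.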
Multi‑Level KV‑Cache Offloading
The KV‑cache offloading framework adds an object‑store as a secondary storage layer, enables HMA by default, and allows per‑request offloading policies, which is especially useful for extremely long contexts where memory can be offloaded to CPU or object storage.
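To make the tiering idea concrete, here is a toy sketch of a three‑level store with least‑recently‑used eviction and a per‑block offload flag. It is an illustration under stated assumptions, not vLLM's offloading framework: the class TieredKVStore, its methods, and the use of plain dictionaries in place of GPU memory, CPU memory, and an object store are all hypothetical.

```python
from collections import OrderedDict
from typing import Dict, Optional


class TieredKVStore:
    """Toy GPU -> CPU -> object-store hierarchy (plain dicts stand in for each tier)."""

    def __init__(self, gpu_capacity: int, cpu_capacity: int) -> None:
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity
        self.gpu: "OrderedDict[str, bytes]" = OrderedDict()
        self.cpu: "OrderedDict[str, bytes]" = OrderedDict()
        self.object_store: Dict[str, bytes] = {}  # stand-in for a remote bucket
        self.policy: Dict[str, bool] = {}  # block id -> may this block be offloaded?

    def put(self, block_id: str, data: bytes, offload: bool = True) -> None:
        """Place a block on the fastest tier; `offload` mimics a per-request policy."""
        self.policy[block_id] = offload
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            old_id, old_data = self.gpu.popitem(last=False)  # least recently used
            if self.policy.get(old_id, True):
                self._put_cpu(old_id, old_data)
            else:
                del self.policy[old_id]  # policy forbids offloading: drop the block

    def _put_cpu(self, block_id: str, data: bytes) -> None:
        self.cpu[block_id] = data
        self.cpu.move_to_end(block_id)
        while len(self.cpu) > self.cpu_capacity:
            old_id, old_data = self.cpu.popitem(last=False)
            self.object_store[old_id] = old_data  # spill to the slowest tier

    def get(self, block_id: str) -> Optional[bytes]:
        """Return a block, promoting it back to the GPU tier if it was offloaded."""
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        data = self.cpu.pop(block_id, None)
        if data is None:
            data = self.object_store.pop(block_id, None)
        if data is None:
            return None
        self.put(block_id, data, offload=self.policy.get(block_id, True))
        return data

    def tier_of(self, block_id: str) -> Optional[str]:
        if block_id in self.gpu:
            return "gpu"
        if block_id in self.cpu:
            return "cpu"
        if block_id in self.object_store:
            return "object_store"
        return None
```

A real system would move tensors asynchronously and account for bytes rather than block counts; the sketch only shows how a long context can overflow one tier and land in the next instead of being recomputed.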
Performance Improvements
CUTLASS FP8 scaled‑mm padding bypass : +20%
MoE‑permute buffer pre‑allocation : +9‑14%
Triton MoE backend enabled by default on Hopper
selective_state_update tuning for H200/RTX PRO
Gemma RMS all‑reduce fusion
DGX B300 NUMA auto‑binding
The 20 % boost comes from engineering optimizations that eliminate redundant computation rather than algorithmic breakthroughs.
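The list does not say what each percentage was measured against. Assuming the figures describe the optimized kernel itself, Amdahl's law gives a rough idea of the end‑to‑end effect. The sketch below is a back‑of‑the‑envelope helper, and the kernel's share of total runtime is a hypothetical input that has to be measured for a given workload.

```python
def end_to_end_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: overall speedup when only part of the runtime gets faster.

    kernel_fraction: share of total runtime spent in the optimized kernel (0..1).
    kernel_speedup:  how much faster that kernel became (1.2 means +20 %).
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


if __name__ == "__main__":
    # Hypothetical shares of runtime spent in the FP8 scaled-mm kernel.
    for fraction in (0.25, 0.5, 1.0):
        print(f"{fraction:.2f} -> {end_to_end_speedup(fraction, 1.2):.3f}x")
```

Under this assumption, a workload sees the full 20 % only if it spends nearly all of its time in the optimized kernel, which is why "up to" is the right qualifier.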
Hardware Support
NVIDIA : full optimization for Hopper (H100/H200) and new DGX B300 NUMA binding
AMD ROCm : upgraded to 7.2.3, AITER v0.1.13.post1, native W4A16 kernel for RDNA3 (gfx1100)
Intel XPU : vllm‑xpu‑kernel v0.1.7 with FP8 MoE and DeepSeek‑V4 decode path
CPU : AMD Zen acceleration (zentorch) and CPU Triton sampling
TPU : tpu‑inference upgraded to v0.21.0
RISC‑V : WNA16 helpers
ARM64 : CI image support
PowerPC : SHM communicator
Few open‑source inference engines cover this many architectures in a single release.
New Models Added
Step‑3.7‑Flash: Flash version of the "Step" series
Cosmos3 Reasoner: NVIDIA reasoning model
Gemma 4 Unified: Google encoder‑free multimodal model
JetBrains Mellum v2: code generation model
Granite Speech Plus: IBM speech model
Cohere Mini Code: Cohere's small code model
Additional fixes cover Qwen3‑VL, GLM‑5.1, GLM‑4.1V, MiniCPM‑V‑4.6, and Kimi‑K2.5, so most mainstream Chinese‑developed models are covered.
API Updates
Anthropic Messages API : supports structured output and the effort parameter
OpenAI Responses API : adds system_fingerprint field and streaming tool calling with required
Unified Parser : merges reasoning and tool‑call parsing into Parser.parse(), simplifying downstream development
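The benefit of a single parse entry point is easiest to see in miniature. The sketch below is a toy parser that extracts reasoning, tool calls, and plain content in one call. It is not vLLM's Parser class: the names ToyParser and ParsedOutput, and the <think> and <tool_call> tag formats, are assumptions chosen for illustration.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ParsedOutput:
    reasoning: str = ""
    content: str = ""
    tool_calls: List[Dict[str, Any]] = field(default_factory=list)


class ToyParser:
    """Hypothetical unified parser; tag names are assumptions, not vLLM formats."""

    _THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
    _TOOL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

    def parse(self, text: str) -> ParsedOutput:
        # Reasoning: everything inside <think> blocks, joined in order.
        reasoning = "\n".join(m.strip() for m in self._THINK.findall(text))
        # Tool calls: each <tool_call> block is expected to hold one JSON object.
        tool_calls = [json.loads(m) for m in self._TOOL.findall(text)]
        # Content: whatever remains once both kinds of block are removed.
        content = self._TOOL.sub("", self._THINK.sub("", text)).strip()
        return ParsedOutput(reasoning=reasoning, content=content, tool_calls=tool_calls)


if __name__ == "__main__":
    raw = (
        "<think>The user wants the weather.</think>Checking now."
        '<tool_call>{"name": "get_weather", "arguments": {"city": "Paris"}}</tool_call>'
    )
    print(ToyParser().parse(raw))
```

Downstream code then handles one result object instead of chaining a reasoning parser and a tool‑call parser and reconciling their leftovers.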
Installation
pip install vllm==0.23.0
For hardware‑specific builds (e.g., ROCm), refer to the official documentation for the appropriate install command.
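After installing, it can be useful to confirm which version actually ended up in the environment. The following is a small sketch using only the standard library; the helper names are hypothetical, and the version comparison is deliberately simplified (it ignores pre‑release tags and local suffixes such as a build label after "+").

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Turn '0.23.0' or '0.23.1+cu129' into a tuple of leading integers."""
    core = version.split("+", 1)[0]
    parts = []
    for piece in core.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(installed: str, required: str = "0.23.0") -> bool:
    """Compare release tuples, padding the shorter one with zeros."""
    a, b = parse_release(installed), parse_release(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_vllm_version() -> Optional[str]:
    """Return the installed vllm version, or None if the package is missing."""
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_vllm_version()
    if found is None:
        print("vllm is not installed in this environment")
    else:
        print(f"vllm {found}; at least 0.23.0: {is_at_least(found)}")
```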
Upgrade Guidance
If your current version is stable and you do not need the new features, there is no need to upgrade right away; waiting a week or two for community feedback is reasonable. Note that MiniMax M3 is not yet supported in this release; running it requires following the vLLM recipe.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
