Why vLLM Now Leads Open‑Source LLM Inference Benchmarks

vLLM tops the Artificial Analysis ranking by delivering the highest throughput for DeepSeek V3.2, Qwen 3.5 397B, and MiniMax‑M2.5 on identical NVIDIA Blackwell Ultra hardware, thanks to extensive kernel‑fusion optimizations that remain in the main branch.

Old Zhang's AI Learning

DigitalOcean’s recent inference benchmark, published on Artificial Analysis, evaluated three open‑source models (DeepSeek V3.2, Qwen 3.5 397B, and MiniMax‑M2.5) on the same NVIDIA Blackwell Ultra GPUs. vLLM, the inference engine behind the tests, ranked first for all three models.

Benchmark data

DeepSeek V3.2 reached a peak single‑user output speed of 230 tokens per second (TPS), more than four times the average of other providers.

Qwen 3.5 397B recorded the fastest time‑to‑first‑token (TTFT), under 1 second for a 10,000‑token prompt, ranking first among twelve vendors.

MiniMax‑M2.5 also took first place in the same test set.

The engine powering these results is vLLM, whose optimizations are all merged into the main branch rather than hidden in private forks, meaning users can reproduce the numbers on their own deployments.
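For readers who want to check the two metrics above on their own deployment, here is a minimal sketch of how TTFT and single-user output speed can be derived from token arrival timestamps. It assumes one common definition (output speed measured over the decode phase, after the first token); the article does not state the exact methodology Artificial Analysis uses, and the names `stream_metrics` and `StreamMetrics` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamMetrics:
    ttft_s: float      # time to first token, in seconds
    output_tps: float  # output tokens per second after the first token


def stream_metrics(request_sent_s: float, token_times_s: List[float]) -> StreamMetrics:
    """Compute TTFT and decode-phase output speed from token arrival times.

    Assumption: timestamps come from a monotonic clock on the client side,
    one entry per streamed output token, in arrival order.
    """
    if len(token_times_s) < 2:
        raise ValueError("need at least two tokens to measure output speed")
    ttft = token_times_s[0] - request_sent_s
    decode_window = token_times_s[-1] - token_times_s[0]
    if decode_window <= 0:
        raise ValueError("token timestamps must be strictly increasing overall")
    # The first token belongs to the prefill phase, so it is excluded here.
    tps = (len(token_times_s) - 1) / decode_window
    return StreamMetrics(ttft_s=ttft, output_tps=tps)


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    print(stream_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5]))
```

Separating the first token from the rest keeps prompt-processing time out of the output-speed figure, which is why a long prompt can show a high TTFT and a high TPS at the same time.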

How vLLM achieves the performance

vLLM addresses model‑specific bottlenecks with targeted solutions:

DeepSeek V3.2 – low‑batch kernel fusion: At small batch sizes the model is bound by GPU kernel launch overhead, because each Transformer layer launches over 30 independent kernels (normalization, rotary embedding, quantization, KV‑cache write, etc.). vLLM fuses the attention‑path operations (Q/KV normalization, rotary embedding, indexer layer‑norm + rotary, FP8 quantization, and KV‑cache write) into two fused kernels. This cuts the kernel count from ~33 to ~10 and yields a reported 1.28× speedup (85.8 → 109.3 tok/s on a 4× GB200 instance). Additional results on an 8× B300 node show 125 tok/s without multi‑token prediction (MTP), 234 tok/s with MTP = 1, and 262 tok/s with prefill/decode disaggregation (TP = 4 + 4, MTP = 3).
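To see why cutting the kernel count matters mostly at small batch sizes, consider a toy cost model in which each kernel pays a fixed launch overhead on top of the useful compute time. This is a sketch under stated assumptions, not a description of vLLM's kernels: the function names and the overhead and compute values are hypothetical, and the model ignores the memory-traffic savings that fusion also brings, so it is not expected to reproduce the 1.28× figure.

```python
def layer_time_us(num_kernels: int, launch_overhead_us: float, compute_us: float) -> float:
    """Per-layer time if launches are serialized and not hidden behind compute."""
    return num_kernels * launch_overhead_us + compute_us


def fusion_speedup(
    kernels_before: int,
    kernels_after: int,
    launch_overhead_us: float,
    compute_us: float,
) -> float:
    """Speedup from launching fewer kernels for the same amount of compute."""
    before = layer_time_us(kernels_before, launch_overhead_us, compute_us)
    after = layer_time_us(kernels_after, launch_overhead_us, compute_us)
    return before / after


if __name__ == "__main__":
    # Hypothetical values: 4 us per launch; compute grows with batch size.
    for label, compute_us in [("small batch", 50.0), ("large batch", 2000.0)]:
        speedup = fusion_speedup(33, 10, launch_overhead_us=4.0, compute_us=compute_us)
        print(f"{label}: {speedup:.2f}x")
```

In this model the gain shrinks as compute per layer grows, which matches the article's point that launch overhead is a low-batch bottleneck.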

MiniMax‑M2.5 – EAGLE3 + directed kernel fusion: The vLLM team trained a custom EAGLE3 draft model for speculative decoding using the open‑source TorchSpec and vLLM training pipeline. Because MiniMax‑M2.7 is architecturally consistent with M2.5, the same draft model can be reused for it. Together, the draft model and directed kernel fusion raise throughput.
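The benefit of a draft model can be estimated with the standard speculative-decoding expectation: if the draft proposes k tokens per step and each is accepted independently with probability α, the target model emits (1 − α^(k+1)) / (1 − α) tokens per verification pass on average. The sketch below assumes a simple chain of draft tokens and ignores the cost of running the draft model; the function name is hypothetical, and the article gives no acceptance rates for the MiniMax draft.

```python
def expected_tokens_per_step(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per target-model pass with a chain draft.

    Assumption: each draft token is accepted independently with the same
    probability, and the target model always contributes one token itself.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    if acceptance_rate == 1.0:
        # Every draft token is accepted, plus the target model's own token.
        return float(num_draft_tokens + 1)
    return (1.0 - acceptance_rate ** (num_draft_tokens + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    # Hypothetical acceptance rates, for illustration only.
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(expected_tokens_per_step(alpha, num_draft_tokens=3), 3))
```

The same reasoning applies to the MTP settings mentioned for DeepSeek V3.2: more speculative tokens only pay off when the acceptance rate is high enough.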

Qwen 3.5 397B – attention + normalization path fusion: vLLM applied a focused fusion on the linear‑attention path and combined it with attention and normalization optimizations, securing the benchmark’s top spot.

Additional enhancements include a new router GEMM kernel for DeepSeek V3 MoE routing (+6% on batch = 1, PR #34302) and a sparse‑attention TopK kernel that automatically selects the best algorithm per sequence length, reducing single‑token decode latency by 17% on 128K context (PR #37421).
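The idea of picking a top-k algorithm by sequence length can be illustrated with a small CPU-side sketch. This is not the GPU kernel from the PR: the article does not say which algorithms vLLM chooses between, so the two strategies, the function names, and the `SHORT_SEQ_THRESHOLD` crossover below are all hypothetical and only show the dispatch pattern.

```python
import heapq
from typing import List, Sequence

# Hypothetical crossover point; a real kernel would tune this per GPU.
SHORT_SEQ_THRESHOLD = 4096


def topk_by_sort(scores: Sequence[float], k: int) -> List[int]:
    """Full sort: simple, and fine when the sequence is short."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


def topk_by_heap(scores: Sequence[float], k: int) -> List[int]:
    """Heap-based selection: avoids sorting the whole sequence when k << n."""
    return heapq.nsmallest(k, range(len(scores)), key=lambda i: (-scores[i], i))


def select_topk(scores: Sequence[float], k: int) -> List[int]:
    """Return indices of the k highest scores, choosing a strategy by length."""
    if k <= 0:
        return []
    k = min(k, len(scores))
    if len(scores) <= SHORT_SEQ_THRESHOLD:
        return topk_by_sort(scores, k)
    return topk_by_heap(scores, k)


if __name__ == "__main__":
    print(select_topk([0.1, 0.9, 0.5, 0.9, 0.3], 3))
```

Both strategies break ties by lower index, so callers get the same result regardless of which path is taken; only the cost changes with sequence length.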

Why this matters

The industry often assumes that production‑grade inference performance requires proprietary stacks. The Artificial Analysis ranking challenges this belief: a community‑driven open‑source engine, running on the same hardware as commercial solutions, ranked ahead of them for all three models tested. Moreover, all performance‑critical code resides in vLLM’s public repository, allowing anyone to inspect the PRs that deliver the speed gains.

Overall impact

Over the past year vLLM has progressed from “performance comparable to TGI” to “dominates benchmarks across DeepSeek, Qwen, MiniMax, and Omni multimodal models.” By keeping optimizations in the main codebase, the project avoids the temptation of private patches, fostering stronger community engagement and continuous performance improvements. Enterprises still using closed‑source inference services are encouraged to reassess their options, and self‑hosted deployments now have a compelling open‑source alternative.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
