Why vLLM Now Leads Open‑Source LLM Inference Benchmarks
vLLM tops the Artificial Analysis ranking by delivering the highest throughput for DeepSeek V3.2, Qwen 3.5 397B, and MiniMax‑M2.5 on identical NVIDIA Blackwell Ultra hardware, thanks to extensive kernel‑fusion optimizations that remain in the main branch.
DigitalOcean’s recent inference benchmark, published in Artificial Analysis, evaluated three cutting‑edge open‑source models—DeepSeek V3.2, Qwen 3.5 397B, and MiniMax‑M2.5—using the same NVIDIA Blackwell Ultra GPUs. vLLM, the inference engine behind the tests, achieved the top rank for all three models.
Benchmark data
DeepSeek V3.2 reached a peak single‑user output speed of 230 tokens per second (TPS), more than four times the average of other providers.
Qwen 3.5 397B recorded the fastest time‑to‑first‑token (TTFT), under 1 second for a 10,000‑token prompt, ranking first among twelve vendors.
MiniMax‑M2.5 also took first place in the same test set.
The optimizations behind these results are all merged into vLLM's main branch rather than hidden in private forks, so users can reproduce the numbers on their own deployments.
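The two metrics quoted above can be made concrete with a small sketch. It assumes one common convention: TTFT is the delay between sending the request and receiving the first output token, and output speed counts the tokens received after the first one, divided by the time spent generating them. The function names and the timestamp format are illustrative and are not taken from the benchmark's tooling.

```python
from typing import Sequence


def ttft_seconds(request_sent: float, token_times: Sequence[float]) -> float:
    """Time to first token: arrival of the first output token minus send time."""
    if not token_times:
        raise ValueError("no output tokens were received")
    return token_times[0] - request_sent


def output_tps(token_times: Sequence[float]) -> float:
    """Output speed in tokens per second, measured after the first token.

    ASSUMPTION: the first token is excluded so that prefill time (already
    captured by TTFT) does not dilute the decode-speed figure.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure output speed")
    generation_time = token_times[-1] - token_times[0]
    if generation_time <= 0:
        raise ValueError("token timestamps must be increasing")
    return (len(token_times) - 1) / generation_time
```

Given a list of per-token arrival timestamps from a streaming response, calling `ttft_seconds(t_sent, times)` and `output_tps(times)` gives the two figures a single-user benchmark of this kind reports.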
How vLLM achieves the performance
vLLM addresses model‑specific bottlenecks with targeted solutions:
DeepSeek V3.2 – low‑batch kernel fusion: At small batch sizes the model is limited by GPU kernel launch overhead, because each Transformer layer launches more than 30 independent kernels (normalization, rotary embedding, quantization, KV‑cache write, etc.). vLLM fuses the attention‑path operations (Q/KV normalization, rotary embedding, indexer layer‑norm + rotary, FP8 quantization, and KV‑cache write) into two fused kernels, cutting the per‑layer kernel count from ~33 to ~10 and delivering a reported 1.28× speedup (85.8 → 109.3 tok/s on a 4× GB200 instance). Additional results on an 8× B300 node show 125 tok/s without multi‑token prediction (MTP), 234 tok/s with MTP = 1, and 262 tok/s with prefill/decode disaggregation (TP = 4 + 4, MTP = 3).
MiniMax‑M2.5 – EAGLE3 + directed kernel fusion: The vLLM team trained a custom EAGLE3 draft model for speculative decoding using the open‑source TorchSpec and vLLM training pipeline. Because MiniMax‑M2.7 shares the same architecture, the same draft model can be reused for it. The draft model and directed kernel fusion together raise throughput.
Qwen 3.5 397B – attention + normalization path fusion: vLLM applied a focused fusion to the linear‑attention path and combined it with attention and normalization optimizations, which secured the benchmark's top spot.
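The DeepSeek V3.2 figures above lend themselves to a back-of-envelope check. The sketch below uses only the quoted throughputs and kernel counts, plus one assumption that is not in the article: a layer count of 61. It attributes the entire saving to removed launches, which overstates launch overhead because fusion also removes intermediate memory traffic. Treat the per-launch figure as a rough upper bound, not a measurement.

```python
# Figures quoted in the article (batch size 1).
BASELINE_TPS = 85.8    # tok/s before fusion, 4x GB200
FUSED_TPS = 109.3      # tok/s after fusion, 4x GB200
KERNELS_BEFORE = 33    # approximate kernels per Transformer layer
KERNELS_AFTER = 10
NO_MTP_TPS = 125.0     # tok/s on 8x B300 without MTP
MTP1_TPS = 234.0       # tok/s on 8x B300 with MTP = 1

# ASSUMPTION (not from the article): number of Transformer layers.
NUM_LAYERS = 61


def per_token_latency_ms(tokens_per_second: float) -> float:
    """Average decode latency per token at a given single-user throughput."""
    return 1000.0 / tokens_per_second


def fusion_summary() -> dict:
    """Derive speedup and an average saving per removed kernel launch."""
    speedup = FUSED_TPS / BASELINE_TPS
    saved_ms = per_token_latency_ms(BASELINE_TPS) - per_token_latency_ms(FUSED_TPS)
    launches_removed = (KERNELS_BEFORE - KERNELS_AFTER) * NUM_LAYERS
    return {
        "speedup": speedup,
        "saved_ms_per_token": saved_ms,
        "launches_removed_per_token": launches_removed,
        "saved_us_per_removed_launch": saved_ms * 1000.0 / launches_removed,
    }


def mtp1_extra_tokens_per_step() -> float:
    """Extra tokens per decode step implied by the MTP = 1 gain.

    ASSUMPTION: a decode step with one draft token costs the same as a
    step without it, so the throughput ratio equals tokens per step.
    """
    return MTP1_TPS / NO_MTP_TPS - 1.0
```

With the rounded throughputs, the ratio comes out at about 1.27, close to the reported 1.28×. The MTP = 1 ratio implies that, under the stated assumption, most drafted tokens are accepted.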
Additional enhancements include a new router GEMM kernel for DeepSeek V3 MoE routing (+6% at batch = 1, PR #34302) and a sparse‑attention TopK kernel that automatically selects the best algorithm for each sequence length, reducing single‑token decode latency by 17% at 128K context (PR #37421).
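The idea of picking a TopK algorithm by sequence length can be illustrated in plain Python. This is a conceptual sketch only, not the kernel from the PR: the two strategies (full sort versus a bounded heap), the function names, and the crossover threshold are all hypothetical stand-ins for whatever the GPU implementation chooses between.

```python
import heapq
from typing import Callable, List, Sequence

# ASSUMPTION: hypothetical crossover length; a real kernel would tune this.
SORT_THRESHOLD = 1024

TopKFn = Callable[[Sequence[float], int], List[int]]


def topk_full_sort(scores: Sequence[float], k: int) -> List[int]:
    """Sort every index by score; simple and adequate for short sequences."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def topk_heap(scores: Sequence[float], k: int) -> List[int]:
    """Track only k candidates; avoids sorting the full long sequence."""
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])


def select_topk(seq_len: int) -> TopKFn:
    """Choose a strategy from the sequence length alone."""
    return topk_full_sort if seq_len <= SORT_THRESHOLD else topk_heap


def sparse_topk(scores: Sequence[float], k: int) -> List[int]:
    """Return indices of the k highest scores, highest first."""
    return select_topk(len(scores))(scores, k)
```

Both strategies return the same indices, so the dispatch changes only the cost, not the result. That property is what lets a selection like this sit behind a single entry point.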
Why this matters
The industry often assumes that production‑grade inference performance requires a proprietary stack. The Artificial Analysis ranking counters that assumption: a community‑driven open‑source engine, running on the same hardware as commercial solutions, outperformed all of them in this benchmark. Moreover, all performance‑critical code resides in vLLM’s public repository, so anyone can inspect the PRs that deliver the speed gains.
Overall impact
Over the past year vLLM has progressed from “performance comparable to TGI” to “dominates benchmarks across DeepSeek, Qwen, MiniMax, and Omni multimodal models.” By keeping optimizations in the main codebase, the project avoids the temptation of private patches, fostering stronger community engagement and continuous performance improvements. Enterprises still using closed‑source inference services are encouraged to reassess their options, and self‑hosted deployments now have a compelling open‑source alternative.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
