vLLM 0.17.0 Release: Full Qwen 3.5 Support and Anthropic API Compatibility
The vLLM 0.17.0 release brings FlashAttention 4 integration, a mature Model Runner V2, complete Qwen 3.5 series support, a one‑click performance‑mode flag, Anthropic API compatibility, advanced weight‑offloading, broader hardware support beyond NVIDIA, ASR model integration, and detailed upgrade and installation guidance.
vLLM 0.17.0 Release Highlights
FlashAttention 4 Integration
vLLM now supports the FlashAttention 4 backend, providing significant speed improvements for long‑sequence and large‑model inference, especially on H100/H200 GPUs.
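If the new backend plugs into vLLM's existing backend-selection mechanism, opting in could look like the sketch below. Note that the VLLM_ATTENTION_BACKEND value shown for FlashAttention 4 is an assumption, not a confirmed identifier; check the 0.17.0 release notes for the actual name.
# Minimal sketch: selecting an attention backend via vLLM's
# VLLM_ATTENTION_BACKEND environment variable. The value "FLASH_ATTN" is an
# assumed identifier for the FA4-capable backend, not a confirmed one.
import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"  # assumption: FA4 backend name

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=8192)
outputs = llm.generate(
    ["Explain FlashAttention in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)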
Model Runner V2 Maturity
Model Runner V2 reaches a major milestone with the following capabilities:
Pipeline Parallel
Decode Context Parallel
Eagle3 speculative decoding with CUDA Graph support
Pooling model support
Segmented & mixed CUDA Graph capture
DP+EP speculative decoding (data‑parallel + expert‑parallel)
New ModelState architecture
The accompanying design document details the internal architecture.
Full Qwen 3.5 Series Support
All Qwen 3.5 variants (0.8B, 2B, 4B, 9B) are supported, leveraging:
Gated Delta Networks (GDN) architecture
FP8 quantization
MTP speculative decoding
Reasoning parser
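As a rough sketch, serving one of these checkpoints with FP8 quantization through the Python API might look like the following. The repository ID is a placeholder (the exact Qwen 3.5 model names are not confirmed here), and whether the release ships pre-quantized FP8 checkpoints or relies on on-the-fly quantization is an assumption.
# Sketch: loading a (hypothetical) Qwen 3.5 checkpoint with FP8 quantization.
# The repository name below is a placeholder, not a confirmed model ID.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-4B-Instruct",  # placeholder model ID
    quantization="fp8",                # on-the-fly FP8 weight quantization
    max_model_len=16384,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)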
One‑Click Performance Mode
Performance tuning is simplified to a single flag:
vllm serve your-model --performance-mode throughput
Available modes:
balanced – default, balanced for most scenarios
interactivity – low first‑token latency for chat
throughput – maximizes batch throughput
Anthropic API Compatibility
vLLM now implements Anthropic API features, including:
thinking blocks
count_tokens endpoint
tool_choice=none option
Improved streaming and image handling
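As a rough sketch of what this enables, the official anthropic Python SDK can be pointed at a vLLM server. The base URL, the route prefix under which vLLM mounts the Anthropic-style API, and the dummy API key are assumptions here and may differ in your deployment.
# Sketch: using the anthropic SDK against a vLLM server, assuming the
# Anthropic-compatible routes are served at the base URL below.
import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:8000",  # assumption: vLLM serves /v1/messages here
    api_key="EMPTY",                   # assumption: no real key required
)

message = client.messages.create(
    model="Qwen/Qwen3-0.6B",
    max_tokens=256,
    messages=[{"role": "user", "content": "Give me one fact about GPUs."}],
)
print(message.content[0].text)

# The count_tokens endpoint mentioned above maps to the SDK's token-counting call.
count = client.messages.count_tokens(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Give me one fact about GPUs."}],
)
print(count.input_tokens)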
Weight Offloading V2 with Prefetching
The new weight offloader adds a prefetch mechanism that loads the next layer’s weights on the CPU while the GPU processes the current layer, hiding weight‑load latency. It also supports selective CPU offloading and removes the requirement for double‑pinned memory.
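The prefetch idea is conceptually simple: copy layer N+1's weights host-to-device on a side CUDA stream while layer N computes on the default stream. The sketch below illustrates that overlap pattern in plain PyTorch; it is an illustration of the technique, not vLLM's actual offloader code.
# Illustrative sketch of prefetch-style weight offloading: overlap the
# host-to-device copy of the next layer's weights with the current layer's
# compute. This is NOT vLLM's implementation, just the underlying pattern.
import torch

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()

# CPU-resident (pinned) weights for a toy stack of linear layers.
cpu_weights = [torch.randn(4096, 4096, pin_memory=True) for _ in range(8)]

x = torch.randn(16, 4096, device=device)
next_w = cpu_weights[0].to(device, non_blocking=True)

for i in range(len(cpu_weights)):
    w = next_w
    # Kick off the next layer's copy on a separate stream while we compute.
    if i + 1 < len(cpu_weights):
        with torch.cuda.stream(copy_stream):
            next_w = cpu_weights[i + 1].to(device, non_blocking=True)
    x = x @ w.T  # current layer's compute on the default stream
    # Ensure the prefetched weights are ready before the next iteration uses them.
    torch.cuda.current_stream().wait_stream(copy_stream)

print(x.shape)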
Elastic Expert Parallel Phase 2
Mixture‑of‑Experts (MoE) models now support dynamic GPU scaling, automatically adjusting the number of GPUs based on load to reduce cost during low‑utilization periods.
Direct Loading of Quantized LoRA Adapters
Quantized LoRA adapters (e.g., QLoRA) can be loaded directly, streamlining the workflow from LoRA fine‑tuning to deployment.
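In the Python API, attaching an adapter at request time already looks like the sketch below; per this release, a quantized (e.g., QLoRA-style) adapter directory can now be passed in directly. The base-model name and adapter path are placeholders.
# Minimal sketch: serving a base model and attaching a LoRA adapter per request.
# Per 0.17.0, the adapter directory may itself be quantized (e.g., QLoRA output).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)

outputs = llm.generate(
    ["Translate to French: good morning"],
    SamplingParams(max_tokens=32),
    lora_request=LoRARequest("my-adapter", 1, "/path/to/qlora-adapter"),  # placeholder path
)
print(outputs[0].outputs[0].text)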
Speculative Decoding Enhancements
Eagle3 now supports CUDA Graph for faster speculative decoding (a configuration sketch follows this list).
Nemotron‑H adds MTP and Mamba speculative decoding.
Sparse MLA + MTP receive full CUDA Graph support.
DP+EP combines data‑parallel and expert‑parallel speculative decoding.
Eagle3 introduces disaggregated serving.
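For context, Eagle-style speculative decoding is typically configured through the Python API roughly as shown below. The config keys and the draft-model repository are assumptions and may differ across versions; consult the vLLM docs for the exact Eagle3 settings in 0.17.0.
# Rough sketch of enabling Eagle-style speculative decoding via the Python API.
# Config keys and the draft-model repository below are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",                              # assumed method name
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",  # assumed draft model
        "num_speculative_tokens": 4,
    },
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)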
Kernel‑Level Optimizations
FlashInfer Sparse MLA backend.
Triton top‑k / top‑p sampler kernels.
TRTLLM DSV3 Router GEMM kernel (+6% batch‑1 speed).
FA3 swizzle optimization.
256‑bit LDG/STG activation kernel.
Helion kernel framework for auto‑tuning.
Combined, these kernel changes yield up to a 0.5% end‑to‑end latency reduction, a 2.9% throughput increase for pipeline parallel, and a 13.9% throughput boost for MaxSim pooling workloads.
Hardware Support Beyond NVIDIA
NVIDIA
SM100 (Blackwell) FP8 MLA prefill support.
SM100 MXFP8 block‑wise scaling.
SM120 FP8 GEMM optimizations.
FlashInfer DeepGEMM with swapAB on SM90.
AMD ROCm
AITER fused RoPE+KVCache.
MXFP4 MoE weight shuffling on gfx950.
bitsandbytes quantization support.
Composable Kernel (CK) MoE quantization backend.
Intel XPU
CUDA Graph support.
NIXL GPUDirect RDMA.
CPU
ARM BF16 cross‑compilation.
s390x FP16 support.
Builds for both AVX2 and AVX512.
ASR Model Support
vLLM now supports speech‑recognition models, extending inference beyond pure LLMs:
FunASR
FireRedASR2
Qwen3‑ASR (real‑time streaming)
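Assuming these models are exposed through vLLM's existing OpenAI-compatible transcription endpoint, a request could look like the sketch below. The model ID is a placeholder, and how the real-time streaming mode is surfaced is an assumption.
# Sketch: transcribing audio against a vLLM server, assuming ASR models are
# served through the OpenAI-compatible /v1/audio/transcriptions route.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio:
    result = client.audio.transcriptions.create(
        model="Qwen/Qwen3-ASR",  # placeholder model ID
        file=audio,
    )
print(result.text)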
Upgrade Considerations
PyTorch 2.10 upgrade – breaking change; verify compatibility with existing environments.
CUDA 12.9+ known issue – CUBLAS_STATUS_INVALID_VALUE can be resolved by:
Clearing LD_LIBRARY_PATH (e.g., unset LD_LIBRARY_PATH).
Installing via uv pip install vllm --torch-backend=auto.
Specifying the CUDA index URL:
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129
KV cache load policy – default changed from recompute to fail; adjust manually if automatic recomputation is required.
Installation
Python:
uv pip install vllm
Docker:
docker pull vllm/vllm-openai:v0.17.0
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=$HF_TOKEN" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:v0.17.0 \
--model Qwen/Qwen3-0.6B
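Once the server is up (via either install path), a quick smoke test against the OpenAI-compatible endpoint might look like this; the port and model name match the Docker example above.
# Quick smoke test against the OpenAI-compatible endpoint started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)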
vLLM vs. SGLang
vLLM – larger community (50k+ GitHub stars), broader hardware compatibility, richer enterprise features (pipeline parallel, disaggregated serving); suited for production deployments.
SGLang – delivers peak performance for specific models (e.g., DeepSeek) and offers a more modern API; suited for extreme‑performance scenarios.
Conclusion
vLLM 0.17.0 introduces FlashAttention 4, a mature Model Runner V2, full Qwen 3.5 support, one‑click performance modes, Anthropic API compatibility, weight‑offloading V2 with prefetch, elastic expert parallelism, direct quantized LoRA loading, extensive speculative decoding and kernel optimizations, and expanded hardware and ASR model support. These updates constitute substantial engineering advances for large‑model inference deployments.