vLLM 0.17.0 Release: Full Qwen 3.5 Support and Anthropic API Compatibility

The vLLM 0.17.0 release brings FlashAttention 4 integration, a mature Model Runner V2, complete Qwen 3.5 series support, a one‑click performance‑mode flag, Anthropic API compatibility, advanced weight‑offloading, broader hardware support beyond NVIDIA, ASR model integration, and detailed upgrade and installation guidance.

vLLM 0.17.0 Release Highlights

FlashAttention 4 Integration

vLLM now supports the FlashAttention 4 backend, providing significant speed improvements for long‑sequence and large‑model inference, especially on H100/H200 GPUs.
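
As a rough sketch of how the backend might be selected: VLLM_ATTENTION_BACKEND is an existing vLLM environment variable, but the exact identifier for the FlashAttention 4 backend is an assumption here.

import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"  # assumed identifier; check the release notes for the FA4 value

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # any long-sequence model
out = llm.generate(
    ["Summarize the benefits of FlashAttention 4 in one sentence."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)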

Model Runner V2 Maturity

Model Runner V2 reaches a major milestone with the following capabilities:

Pipeline Parallel

Decode Context Parallel

Eagle3 speculative decoding with CUDA Graph support

Pooling model support

Segmented & mixed CUDA Graph capture

DP+EP speculative decoding (data‑parallel + expert‑parallel)

New ModelState architecture

The accompanying design document details the internal architecture.
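
As an illustrative sketch of one Model Runner V2 capability, here is offline inference with pipeline parallelism; pipeline_parallel_size and tensor_parallel_size are existing vLLM engine arguments, while the model name is only a placeholder.

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",  # placeholder large model
    tensor_parallel_size=4,             # shard each pipeline stage across 4 GPUs
    pipeline_parallel_size=2,           # split the model into 2 pipeline stages
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)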

Full Qwen 3.5 Series Support

All Qwen 3.5 variants (0.8B, 2B, 4B, 9B) are supported, leveraging:

Gated Delta Networks (GDN) architecture

FP8 quantization

MTP speculative decoding

Reasoning parser
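
A hedged sketch of loading a Qwen 3.5 checkpoint with FP8 quantization: quantization="fp8" is an existing vLLM option, but the repository id "Qwen/Qwen3.5-9B" is a guess at the actual model name.

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3.5-9B", quantization="fp8")  # model id is assumed
print(llm.generate(
    ["Explain Gated Delta Networks in two sentences."],
    SamplingParams(max_tokens=128),
)[0].outputs[0].text)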

One‑Click Performance Mode

Performance tuning is simplified to a single flag:

vllm serve your-model --performance-mode throughput

Available modes:

balanced – the default, suited to most scenarios

interactivity – low first-token latency for chat

throughput – maximizes batch throughput

Anthropic API Compatibility

vLLM now implements Anthropic API features, including:

thinking blocks

count_tokens endpoint

tool_choice=none option

Improved streaming and image handling
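
A minimal sketch of exercising the Anthropic-compatible endpoint with the official anthropic Python SDK; the base_url, the served model name, and the API key are assumptions about how the server is configured.

import anthropic

client = anthropic.Anthropic(base_url="http://localhost:8000", api_key="EMPTY")

msg = client.messages.create(
    model="your-model",  # the name passed to `vllm serve`
    max_tokens=256,
    messages=[{"role": "user", "content": "List the prime numbers below 20."}],
)
print(msg.content[0].text)

# The release also adds a count_tokens endpoint, which the SDK exposes as:
count = client.messages.count_tokens(
    model="your-model",
    messages=[{"role": "user", "content": "List the prime numbers below 20."}],
)
print(count.input_tokens)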

Weight Offloading V2 with Prefetching

The new weight offloader adds a prefetch mechanism that loads the next layer’s weights on the CPU while the GPU processes the current layer, hiding weight‑load latency. It also supports selective CPU offloading and removes the requirement for double‑pinned memory.
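
To make the prefetch idea concrete, here is an illustrative-only PyTorch sketch (not vLLM's actual offloader) that copies the next layer's weights to the GPU on a side stream while the current layer computes on the default stream.

import torch

# Small stand-in model: a stack of linear layers kept in pinned CPU memory.
layers_cpu = [torch.nn.Linear(4096, 4096).half() for _ in range(8)]
for layer in layers_cpu:
    for p in layer.parameters():
        p.data = p.data.pin_memory()  # pinned memory enables async host-to-device copies

copy_stream = torch.cuda.Stream()
x = torch.randn(16, 4096, dtype=torch.float16, device="cuda")

def prefetch(layer):
    # Enqueue the weight copy on the side stream so it overlaps with compute.
    with torch.cuda.stream(copy_stream):
        return layer.to("cuda", non_blocking=True)

next_layer = prefetch(layers_cpu[0])
for i in range(len(layers_cpu)):
    torch.cuda.current_stream().wait_stream(copy_stream)  # layer i's weights are now resident
    current = next_layer
    if i + 1 < len(layers_cpu):
        next_layer = prefetch(layers_cpu[i + 1])  # start loading layer i+1
    x = current(x)  # compute layer i while layer i+1's copy is in flight
# (Eviction of already-used layers is omitted for brevity.)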

Elastic Expert Parallel Phase 2

Mixture‑of‑Experts (MoE) models now support dynamic GPU scaling, automatically adjusting the number of GPUs based on load to reduce cost during low‑utilization periods.

Direct Loading of Quantized LoRA Adapters

Quantized LoRA adapters (e.g., QLoRA) can be loaded directly, streamlining the workflow from LoRA fine‑tuning to deployment.
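
A hedged sketch of attaching an adapter at inference time: enable_lora and LoRARequest are existing vLLM APIs, the adapter path is hypothetical, and loading a QLoRA-produced adapter without merging follows the release notes.

from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)
out = llm.generate(
    ["Translate to French: good morning"],
    SamplingParams(max_tokens=32),
    lora_request=LoRARequest("my-qlora-adapter", 1, "/path/to/qlora-adapter"),  # hypothetical adapter path
)
print(out[0].outputs[0].text)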

Speculative Decoding Enhancements

Eagle3 now supports CUDA Graph for faster speculative decoding.

Nemotron‑H adds MTP and Mamba speculative decoding.

Sparse MLA + MTP receive full CUDA Graph support.

DP+EP combines data‑parallel and expert‑parallel speculative decoding.

Eagle3 introduces disaggregated serving.
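
As a sketch of how one of these paths might be enabled, the snippet below turns on EAGLE-style drafting through vLLM's speculative_config; the "eagle3" method string, the draft-model id, and the token count are assumptions.

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle3",                              # assumed method name
        "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",  # placeholder draft model
        "num_speculative_tokens": 4,
    },
)
print(llm.generate(["Why is the sky blue?"], SamplingParams(max_tokens=64))[0].outputs[0].text)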

Kernel‑Level Optimizations

FlashInfer Sparse MLA backend.

Triton top‑k / top‑p sampler kernels.

TRTLLM DSV3 Router GEMM kernel (+6% batch‑1 speed).

FA3 swizzle optimization.

256‑bit LDG/STG activation kernel.

Helion kernel framework for auto‑tuning.

Combined, these yield up to a 0.5% end‑to‑end latency reduction, a 2.9% throughput increase for pipeline parallel, and a 13.9% throughput boost for MaxSim pooling.

Hardware Support Beyond NVIDIA

NVIDIA

SM100 (Blackwell) FP8 MLA prefill support.

SM100 MXFP8 block‑wise scaling.

SM120 FP8 GEMM optimizations.

FlashInfer DeepGEMM with swapAB on SM90.

AMD ROCm

AITER fused RoPE+KVCache.

MXFP4 MoE weight shuffling on gfx950.

bitsandbytes quantization support.

Composable Kernel (CK) MoE quantization backend.

Intel XPU

CUDA Graph support.

NIXL GPUDirect RDMA.

CPU

ARM BF16 cross‑compilation.

s390x FP16 support.

Builds for both AVX2 and AVX512.

ASR Model Support

vLLM now supports speech‑recognition models, extending inference beyond pure LLMs:

FunASR

FireRedASR2

Qwen3‑ASR (real‑time streaming)
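
Assuming these models are exposed through the OpenAI-compatible transcription endpoint (both the endpoint routing and the model id below are assumptions), a client call could look like this:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
with open("sample.wav", "rb") as audio:
    result = client.audio.transcriptions.create(model="Qwen/Qwen3-ASR", file=audio)  # model id assumed
print(result.text)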

Upgrade Considerations

PyTorch 2.10 upgrade – breaking change; verify compatibility with existing environments.

CUDA 12.9+ known issue – CUBLAS_STATUS_INVALID_VALUE can be resolved by:

Clearing LD_LIBRARY_PATH (e.g., unset LD_LIBRARY_PATH).

Installing via uv pip install vllm --torch-backend=auto.

Specifying the CUDA index URL:

pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129

KV cache load policy – default changed from recompute to fail; adjust manually if automatic recomputation is required.

Installation

Python:

uv pip install vllm

Docker:

docker pull vllm/vllm-openai:v0.17.0
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=$HF_TOKEN" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:v0.17.0 \
  --model Qwen/Qwen3-0.6B
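
A quick sanity check against the container above, using the OpenAI-compatible API that the server exposes on port 8000 (only the prompt is arbitrary):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen3-0.6B",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(resp.choices[0].message.content)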

vLLM vs. SGLang

vLLM – larger community (50k+ GitHub stars), broader hardware compatibility, richer enterprise features (pipeline parallel, disaggregated serving); suited for production deployments.

SGLang – delivers peak performance for specific models (e.g., DeepSeek) and offers a more modern API; suited for extreme‑performance scenarios.

Conclusion

vLLM 0.17.0 introduces FlashAttention 4, a mature Model Runner V2, full Qwen 3.5 support, one‑click performance modes, Anthropic API compatibility, weight‑offloading V2 with prefetch, elastic expert parallelism, direct quantized LoRA loading, extensive speculative decoding and kernel optimizations, and expanded hardware and ASR model support. These updates constitute substantial engineering advances for large‑model inference deployments.

