Deploying Qwen3.5 with vLLM: Full-Precision and Quantized Versions, Concurrency Benchmarks, and Scripts
The article walks through upgrading vLLM to 0.17.0, configuring Docker containers for 4090 GPUs, and comparing FP8 and 4‑bit quantization of the Qwen3.5 35B and 27B models, presenting detailed performance numbers and script parameters that reveal the trade‑offs in memory usage and throughput.
The author upgrades vLLM to version 0.17.0, noting that hardware and CUDA versions must be compatible, and uses Docker to avoid system quirks. The image is pulled with docker pull vllm/vllm-openai:v0.17.0. The 35B model's weights take 37 GB on disk and the 27B's 30 GB, and the initial runs hit OOM errors.
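Before launching anything, it is worth confirming that the image can actually see the GPUs and a compatible driver. This check is not in the original article; the vllm/vllm-openai image starts the API server by default, so the entrypoint is overridden here to run nvidia-smi instead:

docker pull vllm/vllm-openai:v0.17.0
# Override the default entrypoint (the OpenAI-compatible server) to verify
# that the container runtime exposes the GPUs and a working driver/CUDA stack.
docker run --rm --gpus all --entrypoint nvidia-smi vllm/vllm-openai:v0.17.0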
A deployment script for the 35B model is provided (the 27B script only needs the model path and name changed). Key command‑line flags are explained:

--tensor-parallel-size 4 – uses four 4090 GPUs.
--max-model-len 262144 – enables very long context at the cost of some concurrency.
--kv-cache-dtype fp8 – reduces KV‑cache memory to support longer context.
--gpu-memory-utilization 0.9 – leaves headroom for CUDA graphs, NCCL buffers, allocator fragmentation, and batching spikes.
--max-num-seqs 4 – caps simultaneous sequences to avoid a memory explosion when long contexts and high concurrency combine.
--max-num-batched-tokens 8192 – controls the total tokens per scheduling step; larger values increase throughput but also memory volatility.
--language-model-only – disables multimodal features, keeping inference text‑only.
--enable-prefix-caching – improves KV management and throughput.
--default-chat-template-kwargs '{"enable_thinking": false}' – disables the "thinking" mode that would otherwise add latency.
#!/usr/bin/env bash
set -euo pipefail
MODEL_DIR="/data/models/Qwen3.5-35B-A3B-FP8"
CONTAINER_NAME="qwen35-35b-a3b-fp8"
PORT=8000
docker rm -f ${CONTAINER_NAME} 2>/dev/null || true
docker run -d \
--name ${CONTAINER_NAME} \
--gpus '"device=0,1,2,3"' \
--ipc=host \
--shm-size=16g \
-p ${PORT}:8000 \
-v ${MODEL_DIR}:/model:ro \
-e NCCL_P2P_DISABLE=0 \
-e NCCL_IB_DISABLE=1 \
-e VLLM_USE_V1=1 \
vllm/vllm-openai:v0.17.0 \
--model /model \
--served-model-name qwen3.5-35b-a3b-fp8 \
--tensor-parallel-size 4 \
--max-model-len 262144 \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.9 \
--max-num-seqs 4 \
--max-num-batched-tokens 8192 \
--language-model-only \
--enable-prefix-caching \
--default-chat-template-kwargs '{"enable_thinking": false}' \
--host 0.0.0.0 \
--port 8000

Running the FP8 version shows very poor concurrency: the 27B model has almost no throughput, while the 35B‑A3B model can serve requests, but with low RPS and first‑token latency around 10 seconds.
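For reference (not part of the original write-up), first-token latency and concurrency behaviour can be probed from the client side against the OpenAI-compatible endpoint; with streaming enabled, curl's time_starttransfer is a rough proxy for time to first token. The prompts and the request count are illustrative.

# Single streaming request: time_starttransfer roughly equals first-token latency.
curl -s -o /dev/null \
  -w 'first token: %{time_starttransfer}s  total: %{time_total}s\n' \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen3.5-35b-a3b-fp8", "stream": true,
       "messages": [{"role": "user", "content": "Summarize vLLM in one sentence."}]}' \
  http://127.0.0.1:8000/v1/chat/completions

# Crude concurrency probe: four requests in parallel, matching --max-num-seqs 4.
seq 4 | xargs -P 4 -I{} curl -s -o /dev/null \
  -w 'request {}: first token %{time_starttransfer}s\n' \
  -H 'Content-Type: application/json' \
  -d '{"model": "qwen3.5-35b-a3b-fp8", "stream": true,
       "messages": [{"role": "user", "content": "Hello"}]}' \
  http://127.0.0.1:8000/v1/chat/completions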
Switching to 4‑bit quantization dramatically reduces memory usage, allowing both the 27B and 35B models to run on just two 4090 GPUs. The same script is reused with the model path adjusted and --max-num-seqs increased; a sketch of the adjusted launch follows below.
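A minimal sketch of that two-GPU launch is shown below. The checkpoint directory, served-model name, and the raised --max-num-seqs value are assumptions, since the article states only that the path changes and --max-num-seqs is increased; vLLM can usually pick up AWQ/GPTQ quantization from the checkpoint's own config, so no extra quantization flag is shown.

# Hypothetical 4-bit, two-GPU variant of the FP8 script above.
# Stop the FP8 container first, or map a different host port.
docker run -d \
--name qwen35-35b-a3b-int4 \
--gpus '"device=0,1"' \
--ipc=host \
--shm-size=16g \
-p 8000:8000 \
-v /data/models/Qwen3.5-35B-A3B-Int4:/model:ro \
-e NCCL_P2P_DISABLE=0 \
-e NCCL_IB_DISABLE=1 \
-e VLLM_USE_V1=1 \
vllm/vllm-openai:v0.17.0 \
--model /model \
--served-model-name qwen3.5-35b-a3b-int4 \
--tensor-parallel-size 2 \
--max-model-len 262144 \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.9 \
--max-num-seqs 8 \
--max-num-batched-tokens 8192 \
--language-model-only \
--enable-prefix-caching \
--default-chat-template-kwargs '{"enable_thinking": false}' \
--host 0.0.0.0 \
--port 8000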
Integration with OpenWebUI (with thinking disabled) still leaves the 27B model noticeably slower. Log measurements show roughly 70 tokens/s for 27B and 100 tokens/s for 35B, confirming the performance gap.
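Those tokens/s figures are read from the serving logs; one way to watch them live is to follow the container's output and filter for the engine's periodic throughput lines (the exact wording of those lines varies between vLLM versions, so treat the grep pattern as an assumption):

# Follow the running container and keep only throughput-related log lines.
docker logs -f qwen35-35b-a3b-fp8 2>&1 | grep -i throughput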
Qualitative observations note that code‑generation quality is weak for both models; the 27B model remains the weaker of the two even after 4‑bit quantization, while the 4‑bit 35B model outperforms its FP8 variant.
The author concludes that, for his requirements, the stable Qwen3‑32B remains the preferred choice. Additionally, Qwen3.5 changes the <think> tag from dynamic generation to a static preset, forcing downstream systems to adapt, and the model does not support a soft toggle for thinking, limiting its appeal for enterprise applications.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
