Deploying Qwen3.5 with vLLM: Full-Precision and Quantized Versions, Concurrency Benchmarks, and Scripts

The article walks through upgrading vLLM to 0.17.0, configuring Docker containers for 4090 GPUs, and comparing FP8 against 4-bit quantized builds of the Qwen3.5 35B and 27B models, presenting detailed performance numbers and script parameters that reveal the trade-offs in memory usage and throughput.

Old Zhang's AI Learning

The author upgrades vLLM to version 0.17.0, noting that the hardware and CUDA versions must be compatible; Docker is used to sidestep host-system quirks. The image is pulled with docker pull vllm/vllm-openai:v0.17.0. The 35B model's weights come to 37 GB and the 27B model's to 30 GB, and initial runs hit OOM errors.
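Before pulling the image, it is worth confirming what the host actually exposes. A minimal pre-flight sketch (the `nvidia-smi` query fields are standard; the pull command is the one the article uses):

```shell
# Confirm driver version and per-card VRAM before committing to an image;
# vLLM images are built against specific CUDA versions.
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv

# Pull the exact image version used in this article.
docker pull vllm/vllm-openai:v0.17.0
```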

A deployment script for the 35B model is provided (the 27B script only needs the model path and served name changed). Key command-line flags:

- --tensor-parallel-size 4 – shards the model across four 4090 GPUs.
- --max-model-len 262144 – enables a very long context window at the cost of some concurrency.
- --kv-cache-dtype fp8 – stores the KV cache in FP8, reducing its memory footprint to support longer contexts.
- --gpu-memory-utilization 0.9 – leaves headroom for CUDA graphs, NCCL buffers, allocator fragmentation, and batching spikes.
- --max-num-seqs 4 – caps simultaneous sequences so long contexts and high concurrency cannot combine into a memory explosion.
- --max-num-batched-tokens 8192 – caps total tokens per scheduling step; larger values increase throughput but also memory volatility.
- --language-model-only – disables multimodal features, keeping inference text-only.
- --enable-prefix-caching – reuses cached KV prefixes across requests, improving KV management and throughput.
- --default-chat-template-kwargs '{"enable_thinking": false}' – disables the "thinking" mode that would otherwise add latency.
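As a sanity check on the 0.9 utilization figure, quick shell arithmetic shows what vLLM would claim per card and what headroom remains on a 24 GB 4090 (the 24 GB value is the card's spec, not from the article):

```shell
# Back-of-envelope: memory vLLM claims per 24 GiB 4090 at 0.9 utilization,
# and the headroom left for CUDA graphs / NCCL buffers / batching spikes.
TOTAL_MB=24576          # RTX 4090 VRAM in MiB (assumed card spec)
UTIL=90                 # --gpu-memory-utilization 0.9, as a percentage
CLAIMED_MB=$(( TOTAL_MB * UTIL / 100 ))
HEADROOM_MB=$(( TOTAL_MB - CLAIMED_MB ))
echo "claimed: ${CLAIMED_MB} MiB, headroom: ${HEADROOM_MB} MiB"
```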

#!/usr/bin/env bash
set -euo pipefail

MODEL_DIR="/data/models/Qwen3.5-35B-A3B-FP8"
CONTAINER_NAME="qwen35-35b-a3b-fp8"
PORT=8000

docker rm -f ${CONTAINER_NAME} 2>/dev/null || true

docker run -d \
  --name ${CONTAINER_NAME} \
  --gpus '"device=0,1,2,3"' \
  --ipc=host \
  --shm-size=16g \
  -p ${PORT}:8000 \
  -v ${MODEL_DIR}:/model:ro \
  -e NCCL_P2P_DISABLE=0 \
  -e NCCL_IB_DISABLE=1 \
  -e VLLM_USE_V1=1 \
  vllm/vllm-openai:v0.17.0 \
  --model /model \
  --served-model-name qwen3.5-35b-a3b-fp8 \
  --tensor-parallel-size 4 \
  --max-model-len 262144 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 8192 \
  --language-model-only \
  --enable-prefix-caching \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --host 0.0.0.0 \
  --port 8000
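Once the container is up (watch docker logs for the server to finish loading), the OpenAI-compatible endpoint can be smoke-tested. The model name below must match --served-model-name, and localhost:8000 assumes the port mapping above:

```shell
# List served models; should report qwen3.5-35b-a3b-fp8.
curl -s http://localhost:8000/v1/models

# Minimal chat completion against the OpenAI-compatible API.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.5-35b-a3b-fp8",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```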

Running the FP8 builds exposes very poor concurrency: the 27B model delivers almost no throughput, while the 35B-A3B model can serve requests, but with low RPS and first-token latency of roughly 10 seconds.
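The article does not include its benchmark harness, but first-token latency under concurrency can be probed with a sketch like this, using curl's streaming mode and its time_starttransfer timer as a proxy for time-to-first-token (the prompt and N=4 are arbitrary choices, not the author's setup):

```shell
# Fire N concurrent streaming requests; first byte of the SSE stream
# approximates time-to-first-token.
N=4
for i in $(seq 1 "$N"); do
  curl -s -o /dev/null \
    -w "req $i: first byte %{time_starttransfer}s, total %{time_total}s\n" \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen3.5-35b-a3b-fp8", "stream": true,
         "messages": [{"role": "user", "content": "Explain FP8 quantization briefly."}],
         "max_tokens": 128}' \
    http://localhost:8000/v1/chat/completions &
done
wait
```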

Switching to 4-bit quantization dramatically reduces memory usage, allowing both the 27B and 35B models to run on just two 4090 GPUs. The same script is reused with the model path adjusted and --max-num-seqs increased.
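The article does not show the modified script, so the following is only a sketch of the lines that would differ from the FP8 script above; the AWQ checkpoint path and the value 16 are illustrative assumptions (the text says only that the model path changes and --max-num-seqs is raised):

```shell
# Hypothetical 4-bit variant: only the changed lines are shown.
MODEL_DIR="/data/models/Qwen3.5-35B-A3B-AWQ"   # assumed 4-bit checkpoint path
CONTAINER_NAME="qwen35-35b-a3b-4bit"

# Two GPUs instead of four:
#   --gpus '"device=0,1"'
#   --tensor-parallel-size 2
# Smaller weights leave room for a higher concurrency cap:
#   --max-num-seqs 16
```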

Integration with OpenWebUI (with thinking disabled) still leaves the 27B model noticeably slower. Log measurements show roughly 70 tokens/s for 27B and 100 tokens/s for 35B, confirming the performance gap.
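The ~70 and ~100 tokens/s figures come from the author's logs; a rough client-side estimate can be reproduced by timing one completion and dividing usage.completion_tokens by wall time. A sketch, not the author's method (requires jq and bc):

```shell
# Time a single non-streamed completion and compute tokens per second.
START=$(date +%s.%N)
RESP=$(curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3.5-35b-a3b-fp8",
       "messages": [{"role": "user", "content": "Summarize FP8 quantization in 200 words."}],
       "max_tokens": 512}')
END=$(date +%s.%N)
TOKENS=$(echo "$RESP" | jq '.usage.completion_tokens')
echo "scale=1; $TOKENS / ($END - $START)" | bc   # generation tokens/s
```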

Qualitative observations note that code-generation quality is weak for both models; the 27B model remains inferior even after 4-bit quantization, while the 4-bit 35B model actually outperforms its FP8 variant.

The author concludes that, for his requirements, the stable Qwen3-32B remains the preferred choice. Qwen3.5 also changes the <think> tag from dynamic generation to a static preset, forcing downstream systems to adapt, and it offers no soft toggle for thinking, which limits its appeal for enterprise applications.

Tags: Docker, vLLM, FP8, LLM deployment, GPU memory optimization, 4-bit quantization, Qwen3.5
Written by Old Zhang's AI Learning, an AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.