Qwen3.6-35B Quantized Model on vLLM: Local Deployment and Performance Benchmark

The article details how to deploy the 4‑bit quantized Qwen3.6-35B model with vLLM 0.17 (and the 0.19.1 patch) in a Docker container, compares its memory usage and token‑generation speed against Qwen3.5‑35B, and shares practical scripts along with an observed generation speed of roughly 150 tokens per second.


The author explains that the open‑source Qwen3.6-35B‑A3B model originally weighs over 70 GB, far more than a single RTX 4090 can hold, and points to a previous guide covering 4‑bit quantized, distilled, and accelerated variants that shrink the files to around 20 GB.

For serving the model, the author prefers vLLM because it balances latency and concurrency, allowing other internal services to use the same endpoint. A prior deployment of the quantized Qwen3.5‑35B model with vLLM 0.17 worked flawlessly.

Performance testing was conducted with the model's "thinking" mode disabled; a single concurrent request reached 148 tokens/s. All reported numbers come from this disabled‑thinking configuration.
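
For anyone reproducing the test, a minimal single‑request throughput check can be run against the OpenAI‑compatible endpoint. This is only a sketch: it assumes the port and served model name from the startup script below, jq and bc on the client, and a placeholder prompt and max_tokens value (the author's actual test inputs are not given):

# Minimal sketch of a single-request tokens/s check against the vLLM
# OpenAI-compatible endpoint. Port and model name come from the startup
# script below; the prompt and max_tokens are placeholder choices.
START=$(date +%s.%N)
RESP=$(curl -s http://localhost:3004/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.6-35b-int4",
    "messages": [{"role": "user", "content": "Write a short essay about autumn."}],
    "max_tokens": 512,
    "chat_template_kwargs": {"enable_thinking": false}
  }')
END=$(date +%s.%N)
TOKENS=$(echo "$RESP" | jq '.usage.completion_tokens')
echo "completion tokens: $TOKENS"
echo "tokens/s: $(echo "$TOKENS / ($END - $START)" | bc -l)"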

Although the official recommendation is to start with vLLM 0.19, the author found that version 0.17 can also launch Qwen3.6‑35B successfully. The full startup script is provided below:

#!/usr/bin/env bash
set -euo pipefail

# Path to the 4-bit AWQ weights and the host port to expose.
MODEL_DIR="/data/models/Qwen3.6-35B-A3B-AWQ-4bit"
CONTAINER_NAME="qwen36-35b-a3b-int4"
PORT=3004

# Remove any stale container from a previous run.
docker rm -f "${CONTAINER_NAME}" 2>/dev/null || true

docker run -d \
  --name "${CONTAINER_NAME}" \
  --gpus '"device=1,2"' \
  --ipc=host \
  --shm-size=16g \
  -p ${PORT}:8000 \
  -v "${MODEL_DIR}":/model \
  -e NCCL_P2P_DISABLE=0 \
  -e NCCL_IB_DISABLE=1 \
  --restart unless-stopped \
  vllm/vllm-openai:v0.17.0 \
  --model /model \
  --served-model-name qwen3.6-35b-int4 \
  --tensor-parallel-size 2 \
  --max-model-len 102400 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 24 \
  --max-num-batched-tokens 8192 \
  --language-model-only \
  --enable-prefix-caching \
  --default-chat-template-kwargs '{"enable_thinking":false}' \
  --host 0.0.0.0 \
  --port 8000
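
Once the container is up, a quick sanity check confirms the server is serving the model (assuming the host port from the script and jq installed):

# List the models the server exposes; it should print the served model name.
curl -s http://localhost:3004/v1/models | jq '.data[].id'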

The memory consumption for a 100K‑token context is illustrated in the following image:
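
For readers reproducing the measurement, one way to capture the same per‑GPU snapshot while a long‑context request is running is nvidia-smi; a sketch, with the GPU indices matching the --gpus device list in the script above:

# Per-GPU memory usage for the two GPUs used by the container.
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -i 1,2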

Performance benchmarks show the Qwen3.6‑35B quantized model is marginally slower than Qwen3.5‑35B, but the difference is negligible, as depicted below:

Using the vLLM 0.19.1 patch to launch the Qwen3.6‑35B‑A3B model resulted in a similarly slight performance drop, as shown in the next figure:
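
Trying the patch release only requires swapping the image tag in the startup script; a sketch, where run.sh is a hypothetical filename for the script above:

# Point the launch script at the 0.19.1 patch image instead of v0.17.0.
sed -i 's|vllm/vllm-openai:v0.17.0|vllm/vllm-openai:v0.19.1|' run.sh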

Because tool‑call tests could not be run on the internal network, the author instead did a simple comparison of the two models' programming ability, illustrated below:

The generated code differs in length: Qwen3.5‑35B produced about 477 lines, while Qwen3.6‑35B produced roughly 256. The older model relies heavily on CSS animations, with many redundant or inconsistent sections, whereas the newer one renders the fireworks on a Canvas with gravity simulation, though a positioning bug often results in a black screen.

Overall, generation speed hovers around 150 tokens per second, which the author finds comfortable. The core of the article is the performance testing and hands‑on experience, and the author recommends further experimentation for a more thorough evaluation.

Tags: Docker, Quantization, vLLM, Performance Benchmark, LLM deployment, Qwen3.6
Written by Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
