Testing NVIDIA‑Accelerated Qwen3.6‑35B on Dual RTX 4090: Real‑World Performance
This article evaluates the Red Hat‑produced NVFP4‑quantized Qwen3.6‑35B model deployed with vLLM inside Docker on a dual‑RTX 4090 server, presenting accuracy gains, memory usage, initialization times, GPU compatibility notes, and practical deployment recommendations.
NVFP4 Quantized Version by Red Hat
The NVFP4 variant of Qwen3.6‑35B‑A3B was quantized to 4‑bit floating point (W4A4) using the llm‑compressor library, which optimizes quantization for vLLM inference and supports methods such as GPTQ, AWQ, SmoothQuant, FP8, and NVFP4.
llm‑compressor is a quantization toolkit under the vLLM project, specifically tuned for inference acceleration.
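For orientation, a quantization of this kind can be produced with llm‑compressor's one‑shot flow. The sketch below is illustrative rather than Red Hat's actual recipe; the model id, calibration dataset, and ignore list are assumptions:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize all Linear layers to NVFP4 (4-bit weights and activations),
# keeping the output head in higher precision (ignore list is an assumption).
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model="Qwen/Qwen3.6-35B-A3B",      # assumed Hugging Face model id
    dataset="open_platypus",           # small calibration set, chosen for illustration
    recipe=recipe,
    output_dir="Qwen3.6-35B-A3B-NVFP4",
    max_seq_length=2048,
    num_calibration_samples=512,
)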
Red Hat evaluated the model on the GSM8K Platinum benchmark and reported the following results:
Original BF16 version accuracy: 95.62%
NVFP4 quantized version accuracy: 96.28%
Recovery rate: 100.69% (quantized accuracy ÷ baseline accuracy)
The quantized model slightly outperforms the original on this benchmark, indicating that NVFP4 quantization introduces negligible accuracy loss.
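A score like this can be spot‑checked with the lm-evaluation-harness Python API using vLLM as the backend. A sketch, with the caveat that the exact GSM8K Platinum task id varies by harness version and is an assumption here:

import lm_eval

# Run the assumed "gsm8k_platinum" task against the local quantized checkpoint.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=/data/llm-models/Qwen3.6-35B-A3B-NVFP4,tensor_parallel_size=2",
    tasks=["gsm8k_platinum"],  # assumption: verify with `lm_eval --tasks list`
)
print(results["results"])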
Deployment: vLLM + Docker
The model was launched on a server equipped with two RTX 4090 GPUs using Docker and vLLM version 0.19.1. The Docker run command is:
docker run -d --name qwen36-35b-a3b-int4 \
--gpus all \
-v /data/llm-models/Qwen3.6-35B-A3B-NVFP4:/model \
-p 8000:8000 \
vllm/vllm-openai:v0.19.1 \
--model /model \
--served-model-name qwen3.6-35-int4 \
--tensor-parallel-size 2 \
--max-model-len 102400 \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--language-model-only \
--max-num-batched-tokens 8192 \
--max-num-seqs 24 \
--default-chat-template-kwargs '{"enable_thinking": false}'

Key parameters explained:
--tensor-parallel-size 2: enables tensor parallelism across the two GPUs.
--kv-cache-dtype fp8: stores the KV cache in FP8 to save VRAM.
--language-model-only: skips the visual encoder, allocating more memory to the KV cache.
--enable-prefix-caching: activates prefix caching so shared prompt prefixes are reused.
--default-chat-template-kwargs '{"enable_thinking": false}': disables thinking mode by default.
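Once the container is up, the endpoint can be smoke‑tested through the OpenAI‑compatible API. A minimal sketch, assuming the server is reachable on localhost:8000 with the served model name from the command above:

from openai import OpenAI

# vLLM's OpenAI-compatible server does not require a real API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="qwen3.6-35-int4",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)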
Deployment Metrics
vLLM version: 0.19.1
Model loading time: 24 seconds
Per‑GPU memory consumption: 10.61 GiB
torch.compile compilation time: 39.49 seconds
Total initialization time: 136.49 seconds
GPU KV cache capacity: 494,656 tokens
Maximum concurrency (102K context): 17.18×
CUDA Graph memory overhead: 0.81 GiB
NVFP4 on Non‑Blackwell GPUs
The runtime prints a warning when a GPU lacks native FP4 support. On Ada‑based GPUs such as the RTX 4090, vLLM falls back to the Marlin kernel and performs weight‑only FP4 decompression; this eliminates the activation‑level speedup and leaves memory savings as the primary benefit. The breakdown by architecture (see the capability check sketched after this list):
Blackwell (B100/B200): native FP4 support → full W4A4 acceleration.
Hopper (H100/H200): no native support → weight‑only + Marlin decompression.
Ada (L40S/4090): no native support → weight‑only + Marlin decompression.
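Whether a card has native FP4 paths can be approximated from its CUDA compute capability. A minimal probe, assuming "major version >= 10" as a rough proxy for Blackwell‑class FP4 tensor cores (this is not vLLM's actual dispatch logic):

import torch

# Ada reports sm_89, Hopper sm_90, Blackwell sm_100 and above.
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    native_fp4 = major >= 10  # assumption: rough proxy for native FP4 support
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} "
          f"(sm_{major}{minor}), native FP4: {native_fp4}")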
Additional Details
Mamba cache support is experimental; vLLM’s prefix caching for Gated DeltaNet layers is still under development.
Custom AllReduce is disabled because the GPUs lack P2P connectivity, causing a fallback to NCCL communication with a slight efficiency loss.
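The missing P2P path can be confirmed directly from PyTorch; a minimal check, assuming two visible GPUs at indices 0 and 1:

import torch

# Consumer RTX 4090s lack NVLink and usually have PCIe P2P disabled,
# so both checks typically print False, matching the NCCL fallback above.
if torch.cuda.device_count() >= 2:
    print("P2P 0->1:", torch.cuda.can_device_access_peer(0, 1))
    print("P2P 1->0:", torch.cuda.can_device_access_peer(1, 0))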
Deployment Recommendations
Hardware: at least two RTX 4090s (24 GB each) to handle the 100K context length; Blackwell GPUs unlock the full NVFP4 acceleration.
Inference framework: vLLM 0.19.0 or newer (0.19.1 recommended); SGLang and KTransformers are also supported.
Sampling parameters (a client‑side sketch follows this list):
Thinking mode: temperature=1.0, top_p=0.95, top_k=20, presence_penalty=1.5
Precise programming tasks: temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0
Non‑thinking mode: temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5
Agent scenarios: enable preserve_thinking to retain the reasoning chain across multiple turns, reducing token waste.
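Applied client‑side, the presets above might look like the following sketch; top_k is not part of the standard OpenAI schema, so it is passed through extra_body, which vLLM's server accepts (the preset names are just labels):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Sampling presets from the recommendations above; names are illustrative.
PRESETS = {
    "thinking": dict(temperature=1.0, top_p=0.95, top_k=20, presence_penalty=1.5),
    "programming": dict(temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0),
    "non_thinking": dict(temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5),
}

p = PRESETS["programming"]
resp = client.chat.completions.create(
    model="qwen3.6-35-int4",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    temperature=p["temperature"],
    top_p=p["top_p"],
    presence_penalty=p["presence_penalty"],
    extra_body={"top_k": p["top_k"]},  # vLLM-specific field via extra_body
)
print(resp.choices[0].message.content)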
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.