Testing NVIDIA‑Accelerated Qwen3.6‑35B on Dual RTX 4090: Real‑World Performance

This article evaluates the Red Hat‑produced NVFP4‑quantized Qwen3.6‑35B model deployed with vLLM inside Docker on a dual‑RTX 4090 server, presenting accuracy gains, memory usage, initialization times, GPU compatibility notes, and practical deployment recommendations.


NVFP4 Quantized Version by Red Hat

The NVFP4 variant of Qwen3.6‑35B‑A3B was quantized to 4‑bit floating point (W4A4) using the llm‑compressor library, which optimizes quantization for vLLM inference and supports methods such as GPTQ, AWQ, SmoothQuant, FP8, and NVFP4.

llm‑compressor is a quantization toolkit under the vLLM project, specifically tuned for inference acceleration.
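To make the workflow concrete, here is a minimal sketch of how a weight quantization run with llm-compressor typically looks. This is not Red Hat's actual script; the model id, ignore list, and the availability of the "NVFP4" scheme name in your installed llm-compressor version are assumptions to verify against the library's documentation.

```python
# Hypothetical sketch of an NVFP4 quantization run with llm-compressor.
# Assumes a recent llm-compressor release that exposes the "NVFP4" scheme;
# the model id and output path are illustrative, not Red Hat's actual setup.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",    # quantize the Linear layers
    scheme="NVFP4",      # 4-bit floating point, W4A4
    ignore=["lm_head"],  # keep the output head in higher precision
)

oneshot(
    model="Qwen/Qwen3.6-35B-A3B",           # hypothetical model id
    recipe=recipe,
    output_dir="Qwen3.6-35B-A3B-NVFP4",     # vLLM-loadable output
)
```

The resulting directory can then be mounted into the vLLM container, as shown in the deployment section below.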

Red Hat evaluated the model on the GSM8K Platinum benchmark and reported the following results:

Original BF16 version accuracy: 95.62%

NVFP4 quantized version accuracy: 96.28%

Recovery rate: 100.69%

The quantized model slightly outperforms the original, indicating that NVFP4 quantization introduces negligible accuracy loss.
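The recovery rate reported above is simply the quantized score divided by the baseline, which is why a quantized model that edges past the original yields a figure above 100%:

```python
# Recovery rate = quantized accuracy / baseline accuracy, as a percentage.
bf16_acc = 95.62   # GSM8K Platinum, original BF16 model (%)
nvfp4_acc = 96.28  # GSM8K Platinum, NVFP4 quantized model (%)

recovery = nvfp4_acc / bf16_acc * 100
print(f"{recovery:.2f}%")  # 100.69%
```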

Deployment: vLLM + Docker

The model was launched on a server equipped with two RTX 4090 GPUs using Docker and vLLM version 0.19.1. The Docker run command is:

docker run -d --name qwen36-35b-a3b-int4 \
  --gpus all \
  -v /data/llm-models/Qwen3.6-35B-A3B-NVFP4:/model \
  -p 8000:8000 \
  vllm/vllm-openai:v0.19.1 \
  --model /model \
  --served-model-name qwen3.6-35-int4 \
  --tensor-parallel-size 2 \
  --max-model-len 102400 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --language-model-only \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 24 \
  --default-chat-template-kwargs '{"enable_thinking": false}'

Key parameters explained:

--tensor-parallel-size 2: enables tensor parallelism across the two GPUs.

--kv-cache-dtype fp8: stores the KV cache in FP8 to save VRAM.

--language-model-only: skips the visual encoder, freeing memory for the KV cache.

--enable-prefix-caching: activates prefix caching so shared prompt prefixes are reused.

--default-chat-template-kwargs '{"enable_thinking": false}': disables the thinking mode by default.
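Note that --default-chat-template-kwargs only sets a default; vLLM's OpenAI-compatible server accepts a chat_template_kwargs field in the request body, so individual calls can re-enable thinking. A sketch of such a request payload, using the server URL and model name from the docker command above (the prompt text is illustrative):

```python
# Build a chat-completions request that overrides the server-wide
# enable_thinking=false default for this one call.
import json

payload = {
    "model": "qwen3.6-35-int4",  # --served-model-name from the docker command
    "messages": [{"role": "user", "content": "Explain NVFP4 in one sentence."}],
    "temperature": 0.7,
    "top_p": 0.8,
    # vLLM extension: forwarded into the chat template for this request only.
    "chat_template_kwargs": {"enable_thinking": True},
}

body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions with
# Content-Type: application/json (e.g. via curl or the requests library).
```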

Deployment Metrics

vLLM version: 0.19.1

Model loading time: 24 seconds

Per‑GPU memory consumption: 10.61 GiB

torch.compile compilation time: 39.49 seconds

Total initialization time: 136.49 seconds

GPU KV cache capacity: 494,656 tokens

Maximum concurrency (102K context): 17.18×

CUDA Graph memory overhead: 0.81 GiB

NVFP4 on Non‑Blackwell GPUs

The runtime prints a warning when the GPU lacks native FP4 support. On Ada‑based GPUs (e.g., RTX 4090), vLLM falls back to the Marlin kernel, performing weight‑only FP4 decompression, which eliminates the activation‑level speedup and leaves memory savings as the primary benefit.

Blackwell (B100/B200): native FP4 support → full W4A4 acceleration.

Hopper (H100/H200): no native support → weight‑only + Marlin decompression.

Ada (L40S/4090): no native support → weight‑only + Marlin decompression.
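Even without activation-level speedup, the 4-bit weights are what let a 35B-parameter model fit on two 24 GB cards at all. A rough back-of-the-envelope, counting weights only (the reported 10.61 GiB per GPU additionally includes quantization scales and runtime buffers):

```python
# Weight-memory comparison for a ~35B-parameter model, weights only.
params = 35e9
GiB = 2**30

bf16_total = params * 2 / GiB     # 2 bytes per parameter
nvfp4_total = params * 0.5 / GiB  # 4 bits per parameter

print(f"BF16 weights:  {bf16_total:.1f} GiB total")  # ~65.2 GiB: exceeds 2x24 GB
print(f"NVFP4 weights: {nvfp4_total:.1f} GiB total, "
      f"{nvfp4_total / 2:.1f} GiB per GPU")          # ~16.3 GiB, ~8.1 GiB per GPU
```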

Additional Details

Mamba cache support is experimental; vLLM’s prefix caching for Gated DeltaNet layers is still under development.

Custom AllReduce is disabled because the GPUs lack P2P connectivity, causing a fallback to NCCL communication with a slight efficiency loss.

Deployment Recommendations

Hardware: at least two RTX 4090s (24 GB each) to handle the 100K context length; Blackwell GPUs additionally deliver full NVFP4 acceleration.

Inference framework: vLLM 0.19.0 or newer (0.19.1 recommended); SGLang and KTransformers are also supported.

Sampling parameters:

Thinking mode: temperature=1.0, top_p=0.95, top_k=20, presence_penalty=1.5

Precise programming tasks: temperature=0.6, top_p=0.95, top_k=20, presence_penalty=0.0

Non-thinking mode: temperature=0.7, top_p=0.8, top_k=20, presence_penalty=1.5

Agent scenarios: enable preserve_thinking to retain the reasoning chain across multiple turns, reducing token waste.
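The recommended settings above can be kept as a small lookup table in client code. A minimal sketch; the preset names are my own labels, not official Qwen terminology:

```python
# Sampling presets matching the recommendations above.
# Preset names ("thinking", "coding", "non_thinking") are illustrative labels.
SAMPLING_PRESETS = {
    "thinking":     {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5},
    "coding":       {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8,  "top_k": 20, "presence_penalty": 1.5},
}

def sampling_for(mode: str) -> dict:
    """Return a copy of the preset so callers can tweak it without mutating the table."""
    return dict(SAMPLING_PRESETS[mode])

print(sampling_for("coding")["temperature"])  # 0.6
```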

Tags: Docker, Quantization, vLLM, RTX 4090, NVFP4, Qwen3.6-35B
Written by

Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
