How NVFP4 Quantization Supercharges LLM Inference on NVIDIA DGX
This article explains the NVFP4 4‑bit floating‑point quantization technique, shows how to deploy Qwen3‑30B‑A3B models with TensorRT‑LLM and vLLM, compares performance between the NVFP4 and AWQ quantizations, and provides practical profiling commands for NVIDIA DGX systems.
NVFP4 Quantization Overview
NVFP4 is NVIDIA’s 4‑bit floating‑point format designed for large‑scale LLM inference. It compresses model weights about 3–3.5× compared with FP16 while incurring roughly 1% accuracy loss, offering a practical trade‑off between model size and precision.
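For a rough sense of the footprint difference, the back‑of‑the‑envelope calculation below compares FP16 weights with NVFP4 weights for a 30 B‑parameter model. The 12.5 % scale overhead assumes one 8‑bit scale factor per 16‑element block, which is how the format is commonly described; actual checkpoint sizes also depend on which layers stay in higher precision.
# Illustrative estimate only, not a measured checkpoint size
$ python3 -c "
params = 30e9
fp16  = params * 2 / 2**30            # 2 bytes per weight
nvfp4 = params * 0.5 * 1.125 / 2**30  # 0.5 byte per weight + assumed per-16-block scale overhead
print(f'FP16 : {fp16:5.1f} GiB')
print(f'NVFP4: {nvfp4:5.1f} GiB  (~{fp16 / nvfp4:.1f}x smaller)')
"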
TensorRT‑LLM + Qwen3‑30B‑A3B‑NVFP4
The official NVFP4‑quantized Qwen3‑30B‑A3B model (a mixture‑of‑experts model with 30 B total parameters and roughly 3 B activated per token) can be served with TensorRT‑LLM. Model repository: https://huggingface.co/nvidia/Qwen3-30B-A3B-NVFP4
$ sudo mkdir -p /data/llm-model/
$ sudo chown -R $USER:$USER /data/llm-model
$ curl -LsSf https://hf.co/cli/install.sh | bash
$ export HF_ENDPOINT=https://hf-mirror.com
$ hf download nvidia/Qwen3-30B-A3B-NVFP4 \
--cache-dir /home/dgx/.cache/huggingface/hub/ \
--local-dir /data/llm-model/nvidia/Qwen3-30B-A3B-NVFP4
$ sha256sum model-00001-of-00004.safetensors
15bb083c92a763a643972134681a65e0953122df749fb0d236ca905e78e709bd model-00001-of-00004.safetensors
$ docker run -d \
--name trtllm-serve \
-v "/data/llm-model/nvidia/Qwen3-30B-A3B-NVFP4:/workspace/model" \
--gpus=all \
--ipc=host \
--network host \
--restart unless-stopped \
nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
trtllm-serve /workspace/model \
--backend pytorch \
--max_batch_size 10 \
--host 0.0.0.0 \
--port 8000
Note that --backend pytorch serves the model through TensorRT‑LLM's PyTorch runtime rather than a compiled TensorRT engine, so TensorRT's CUDA‑Graph and kernel‑fusion optimizations are not applied and the full acceleration potential of TensorRT‑LLM is not realized.
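Once the container reports the server is ready, a quick request against the OpenAI‑compatible endpoint confirms the deployment. This is only a smoke test; the value of the model field should match an id returned by /v1/models (typically the model path when no explicit name is configured).
# List the served model id, then send a short chat completion
$ curl -s http://127.0.0.1:8000/v1/models
$ curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/workspace/model", "messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 64}'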
vLLM + Qwen3‑30B‑A3B‑NVFP4
NVFP4 support is available in recent vLLM releases; the NVIDIA NGC vLLM 25.10 container used here reports version 0.10.2. The following Docker Compose configuration runs the model on a Blackwell GPU.
services:
  vllm:
    image: nvcr.io/nvidia/vllm:25.10-py3
    container_name: vllmv10-Qwen3-30B-A3B-NVFP4
    ports:
      - "8000:8000"
    volumes:
      - "/data/llm-model/nvidia/Qwen3-30B-A3B-NVFP4:/model"
    ipc: host
    command: [
      "python", "-m", "vllm.entrypoints.openai.api_server",
      "--model", "/model",
      "--served-model-name", "Qwen3-30B",
      "--trust-remote-code",
      "--dtype", "auto",
      "--kv-cache-memory", "34608345600"
    ]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
$ docker compose -p v10 -f vllm-v10_Qwen3-30B-A3B-NVFP4.yaml up -d
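Before benchmarking, it is worth confirming that the container has finished loading the model and that the OpenAI‑compatible endpoint answers; the port and served model name come from the Compose file above.
# Follow startup logs, then list the served model
$ docker logs -f vllmv10-Qwen3-30B-A3B-NVFP4
$ curl -s http://127.0.0.1:8000/v1/models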
$ docker exec -it vllmv10-Qwen3-30B-A3B-NVFP4 vllm --version
0.10.2+9dd9ca32.nv25.10.cu130
vLLM + Qwen3‑30B‑A3B‑AWQ
When an older vLLM version is required, the AWQ‑quantized model can be used as a stable alternative.
$ hf download QuixiAI/Qwen3-30B-A3B-AWQ \
--cache-dir /home/dgx/.cache/huggingface/hub/ \
--local-dir /data/llm-model/QuixiAI/Qwen3-30B-A3B-AWQ
$ cat vllm_Qwen3-30B-A3B-AWQ.yaml
services:
  vllm:
    image: nvcr.io/nvidia/vllm:25.10-py3
    container_name: Qwen3-30B-A3B-AWQ
    ports:
      - "8000:8000"
    volumes:
      - "/data/llm-model/QuixiAI/Qwen3-30B-A3B-AWQ:/model"
    ipc: host
    command: [
      "python", "-m", "vllm.entrypoints.openai.api_server",
      "--model", "/model",
      "--served-model-name", "Qwen3-30B",
      "--trust-remote-code",
      "--dtype", "auto",
      "--kv-cache-memory", "34608345600"
    ]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
$ docker compose -p awq -f vllm_Qwen3-30B-A3B-AWQ.yaml up -d
Performance Benchmark Comparison
Benchmarks were executed with the open‑source evalscope tool (v0.7.1) using 15 concurrent streams, 100 total requests, and the OpenQA dataset.
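evalscope is available from PyPI; the exact perf invocation used for each run is the one shown in the profiling section below.
# Install the benchmark tool (pinned to the version used here; some releases
# package the load-testing dependencies as an extra, e.g. evalscope[perf])
$ pip install "evalscope==0.7.1"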
TensorRT‑LLM + Qwen3‑30B‑A3B‑NVFP4
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 509.298 |
| Number of concurrency | 15 |
| Total requests | 100 |
| Succeed requests | 100 |
| Failed requests | 0 |
| Output token throughput (tok/s) | 248.171 |
| Total token throughput (tok/s) | 253.844 |
| Request throughput (req/s) | 0.1963 |
| Average latency (s) | 72.6734 |
| Avg time to first token (s) | 23.1486 |
| Avg time per output token (s) | 0.0392 |
+-----------------------------------+-----------+
vLLM + Qwen3‑30B‑A3B‑NVFP4
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 440.193 |
| Number of concurrency | 15 |
| Total requests | 100 |
| Succeed requests | 100 |
| Failed requests | 0 |
| Output token throughput (tok/s) | 279.916 |
| Total token throughput (tok/s) | 286.479 |
| Request throughput (req/s) | 0.2272 |
| Average latency (s) | 62.839 |
| Avg time to first token (s) | 0.1092 |
| Avg time per output token (s) | 0.0509 |
+-----------------------------------+-----------+
vLLM + Qwen3‑30B‑A3B‑AWQ
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 417.250 |
| Number of concurrency | 15 |
| Total requests | 100 |
| Succeed requests | 100 |
| Failed requests | 0 |
| Output token throughput (tok/s) | 284.481 |
| Total token throughput (tok/s) | 291.405 |
| Request throughput (req/s) | 0.2397 |
| Average latency (s) | 59.3749 |
| Avg time to first token (s) | 0.1258 |
| Avg time per output token (s) | 0.0499 |
+-----------------------------------+-----------+
Analysis of Results
Measured against the AWQ configuration, the NVFP4 stack delivers comparable token throughput but noticeably higher latency, because the FlashInfer kernels required for fused MoE operations are missing from the current TensorRT‑LLM build for Blackwell GPUs. Consequently, the theoretical advantages of FP4 are not yet fully realized. Both TensorRT‑LLM and vLLM are still maturing their support for the new SM 12.1 (GB10) architecture, and batch‑size and KV‑cache tuning guidelines have not been published, so practitioners must rely on empirical testing.
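In the absence of published tuning guidance, a simple sweep over concurrency (and, on the TensorRT‑LLM side, --max_batch_size) is a practical starting point. The sketch below reuses the evalscope invocation from the profiling section against whichever deployment is currently listening on port 8000.
# Illustrative concurrency sweep; each level repeats the 100-request OpenQA run
$ for c in 4 8 15 30; do
    evalscope perf \
      --url "http://127.0.0.1:8000/v1/chat/completions" \
      --parallel "$c" \
      --model Qwen3-30B \
      --number 100 \
      --api openai \
      --dataset openqa \
      --stream
  done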
Profiling Tools (Nsight Systems & Compute)
Performance investigation on a DGX can be performed with Nsight Systems (nsys) and Nsight Compute. Two capture modes are common:
Mode 1 – In‑container capture: launch the inference container with nsys launch and start profiling from inside the container.
Mode 2 – Host‑side capture: run nsys profile on the host; this captures GPU metrics but not detailed CUDA traces.
# Mode 1 – container launch with nsys
$ docker run -d \
--name trtllm-serve-nsys \
-v "/data/llm-model/nvidia/Qwen3-30B-A3B-NVFP4:/workspace/model" \
--gpus=all \
--ipc=host \
--network host \
--restart unless-stopped \
--privileged \
nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
nsys launch --trace=cuda,osrt,nvtx,cublas \
--session-new=my-session --show-output=true \
trtllm-serve /workspace/model \
--backend pytorch --max_batch_size 10 --host 0.0.0.0 --port 8000
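# The capture commands below are issued from inside the running container,
# e.g. after: docker exec -it trtllm-serve-nsys bash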
$ nsys start --backtrace=none --sample=system-wide \
--gpu-metrics-device=all --gpu-metrics-frequency=100000 \
--session=my-session --output=trtllm02.nsys-rep
# Run the same evalscope workload
$ evalscope perf \
--url "http://127.0.0.1:8000/v1/chat/completions" \
--parallel 15 \
--model Qwen3-30B \
--number 100 \
--api openai \
--dataset openqa \
--stream
$ nsys stop --session=my-session
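After the capture stops, the report can be summarized from the command line with nsys stats or opened in the Nsight Systems GUI; if it was recorded inside the container, copy it out first with docker cp.
# Text summary of the capture (open the .nsys-rep in nsys-ui for the timeline)
$ nsys stats trtllm02.nsys-rep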
# Mode 2 – host‑side capture
$ sudo nsys profile --sample=system-wide --gpu-metrics-device=all \
--gpu-metrics-frequency=100000 --duration=10 -o trtllm01
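The section heading also mentions Nsight Compute. Attaching ncu to a long‑lived server is impractical, so kernel‑level profiling is usually done against a short, bounded run; the script name below is a placeholder and the flags are standard ncu options.
# Profile a handful of kernel launches from a short inference run (placeholder script)
$ ncu --set full --launch-skip 100 --launch-count 20 \
      --target-processes all \
      -o nvfp4-kernels \
      python3 run_short_inference.py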