How NVFP4 Quantization Supercharges LLM Inference on NVIDIA DGX

This article explains NVIDIA's NVFP4 4‑bit floating‑point quantization format, shows how to deploy Qwen3‑30B‑A3B models with TensorRT‑LLM and vLLM, compares performance between the NVFP4 and AWQ quantizations, and provides practical profiling commands for NVIDIA DGX systems.


NVFP4 Quantization Overview

NVFP4 is NVIDIA’s 4‑bit floating‑point format designed for large‑scale LLM inference. It compresses model weights about 3–3.5× compared with FP16 while incurring roughly 1% accuracy loss, offering a practical trade‑off between model size and precision.
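The compression arithmetic behind that claim can be sketched directly. NVFP4 stores each element as a 4‑bit E2M1 float and attaches an FP8 scale to every block of 16 elements, so the effective cost is roughly 4 + 8/16 = 4.5 bits per value. The snippet below is an illustrative sketch of this block‑scaled scheme, not NVIDIA's kernel code:

```python
# Illustrative sketch of NVFP4-style block quantization (not NVIDIA's kernels).
# Each value is snapped to a 4-bit E2M1 float; each block of 16 shares a scale.

# Magnitudes representable by the 4-bit E2M1 format (a sign bit doubles them):
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Scale the block so its max magnitude maps to 6.0, snap to nearest E2M1."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0
    q = []
    for x in block:
        mag = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        q.append(mag if x >= 0 else -mag)
    return scale, q

def dequantize(scale, q):
    return [scale * v for v in q]

bits_fp16 = 16
bits_nvfp4 = 4 + 8 / 16          # element bits + amortized FP8 block scale
print(f"compression vs FP16: {bits_fp16 / bits_nvfp4:.2f}x")  # → 3.56x
```

The ~3.56× figure counts only weights and block scales; tensor‑level scales and unquantized layers pull the real‑world ratio toward the 3–3.5× range quoted above.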

NVFP4 quantization illustration

TensorRT‑LLM + Qwen3‑30B‑A3B‑NVFP4

The official NVFP4‑quantized Qwen3‑30B‑A3B model (a mixture‑of‑experts model with 30 B total parameters, ~3 B active per token) can be served with TensorRT‑LLM. Model repository: https://huggingface.co/nvidia/Qwen3-30B-A3B-NVFP4

$ sudo mkdir /data/llm-model/
$ sudo chown -R $USER:$USER /data/llm-model
$ curl -LsSf https://hf.co/cli/install.sh | bash
$ export HF_ENDPOINT=https://hf-mirror.com
$ hf download nvidia/Qwen3-30B-A3B-NVFP4 \
    --cache-dir /home/dgx/.cache/huggingface/hub/ \
    --local-dir /data/llm-model/nvidia/Qwen3-30B-A3B-NVFP4
$ sha256sum model-00001-of-00004.safetensors
15bb083c92a763a643972134681a65e0953122df749fb0d236ca905e78e709bd  model-00001-of-00004.safetensors
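Checksumming each multi‑GB shard by hand gets tedious; a small helper can verify a shard against a known digest while reading in chunks, so the file never has to fit in memory. The path and digest are the ones from the download step above:

```python
# Verify a downloaded safetensors shard against a known SHA-256 digest,
# reading in 1 MiB chunks so multi-GB files never load fully into memory.
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

expected = "15bb083c92a763a643972134681a65e0953122df749fb0d236ca905e78e709bd"
# path = "/data/llm-model/nvidia/Qwen3-30B-A3B-NVFP4/model-00001-of-00004.safetensors"
# assert sha256_of(path) == expected, "checksum mismatch - re-download the shard"
```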
$ docker run -d \
  --name trtllm-serve \
  -v "/data/llm-model/nvidia/Qwen3-30B-A3B-NVFP4:/workspace/model" \
  --gpus=all \
  --ipc=host \
  --network host \
  --restart unless-stopped \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  trtllm-serve /workspace/model \
  --backend pytorch \
  --max_batch_size 10 \
  --host 0.0.0.0 \
  --port 8000

Using --backend pytorch selects the PyTorch runtime instead of a compiled TensorRT engine, which forgoes TensorRT's CUDA‑Graph and kernel‑fusion optimizations, so the full acceleration potential of TensorRT‑LLM is not realized.
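Once the container is up, the server exposes an OpenAI‑compatible endpoint on port 8000. A minimal stdlib smoke test is sketched below; the model name is illustrative (query GET /v1/models on the running server for the exact served name), and the send is left commented out so the snippet runs without a live server:

```python
# Build and (optionally) send an OpenAI-compatible chat completion request to
# the trtllm-serve endpoint started above. Standard library only.
import json
import urllib.request

def build_chat_request(model, prompt, max_tokens=64):
    """Payload for POST /v1/chat/completions on an OpenAI-compatible server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def send(url, payload):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Model name below is an assumption; check /v1/models for the served name.
payload = build_chat_request("Qwen3-30B-A3B-NVFP4", "Say hello in one word.")
# reply = send("http://127.0.0.1:8000/v1/chat/completions", payload)
# print(reply["choices"][0]["message"]["content"])
```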

vLLM + Qwen3‑30B‑A3B‑NVFP4

Upstream vLLM added NVFP4 support in recent releases; NVIDIA's vLLM 25.10 container, which ships a vLLM 0.10.2 build, also includes it. The following Docker‑Compose configuration runs the model on a Blackwell GPU.

services:
  vllm:
    image: nvcr.io/nvidia/vllm:25.10-py3
    container_name: vllmv10-Qwen3-30B-A3B-NVFP4
    ports:
      - "8000:8000"
    volumes:
      - "/data/llm-model/nvidia/Qwen3-30B-A3B-NVFP4:/model"
    ipc: host
    command: [
      "python", "-m", "vllm.entrypoints.openai.api_server",
      "--model", "/model",
      "--served-model-name", "Qwen3-30B",
      "--trust-remote-code",
      "--dtype", "auto",
      "--kv-cache-memory", "34608345600"
    ]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
$ docker compose -p v10 -f vllm-v10_Qwen3-30B-A3B-NVFP4.yaml up -d
$ docker exec -it vllmv10-Qwen3-30B-A3B-NVFP4 vllm --version
0.10.2+9dd9ca32.nv25.10.cu130
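The --kv-cache-memory value in the compose file is a byte budget (≈32.2 GiB). Its rough token capacity can be estimated from the model's attention geometry; the layer, head, and dimension numbers below are illustrative assumptions for a Qwen3‑30B‑A3B‑class model, so read the real values from the model's config.json before relying on this:

```python
# Rough estimate of how many tokens a fixed KV-cache byte budget can hold.
# Layer/head/dim values are illustrative assumptions, not read from config.json.
kv_cache_bytes = 34_608_345_600   # --kv-cache-memory value from the compose file

layers = 48        # assumed transformer layer count
kv_heads = 4       # assumed GQA key/value head count
head_dim = 128     # assumed per-head dimension
dtype_bytes = 2    # FP16 key/value entries

# K and V each store (kv_heads * head_dim) values per layer per token.
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
capacity_tokens = kv_cache_bytes // bytes_per_token
print(f"{bytes_per_token} bytes/token -> ~{capacity_tokens:,} tokens of KV cache")
```

Under these assumptions the budget holds on the order of 350 K tokens of FP16 KV cache, shared across all concurrent requests.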

vLLM + Qwen3‑30B‑A3B‑AWQ

When an older vLLM version is required, the AWQ‑quantized model can be used as a stable alternative.

$ hf download QuixiAI/Qwen3-30B-A3B-AWQ \
    --cache-dir /home/dgx/.cache/huggingface/hub/ \
    --local-dir /data/llm-model/QuixiAI/Qwen3-30B-A3B-AWQ
$ cat vllm_Qwen3-30B-A3B-AWQ.yaml
services:
  vllm:
    image: nvcr.io/nvidia/vllm:25.10-py3
    container_name: Qwen3-30B-A3B-AWQ
    ports:
      - "8000:8000"
    volumes:
      - "/data/llm-model/QuixiAI/Qwen3-30B-A3B-AWQ:/model"
    ipc: host
    command: [
      "python", "-m", "vllm.entrypoints.openai.api_server",
      "--model", "/model",
      "--served-model-name", "Qwen3-30B",
      "--trust-remote-code",
      "--dtype", "auto",
      "--kv-cache-memory", "34608345600"
    ]
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
$ docker compose -p awq -f vllm_Qwen3-30B-A3B-AWQ.yaml up -d

Performance Benchmark Comparison

Benchmarks were executed with the open‑source evalscope tool (v0.7.1) using 15 concurrent streams, 100 total requests, and the OpenQA dataset.

TensorRT‑LLM + Qwen3‑30B‑A3B‑NVFP4
+-----------------------------------+-----------+
| Key                               | Value     |
+===================================+===========+
| Time taken for tests (s)          | 509.298   |
| Number of concurrency             | 15        |
| Total requests                    | 100       |
| Succeed requests                  | 100       |
| Failed requests                   | 0         |
| Output token throughput (tok/s)   | 248.171   |
| Total token throughput (tok/s)    | 253.844   |
| Request throughput (req/s)        | 0.1963    |
| Average latency (s)               | 72.6734   |
| Avg time to first token (s)       | 23.1486   |
| Avg time per output token (s)     | 0.0392    |
+-----------------------------------+-----------+

vLLM + Qwen3‑30B‑A3B‑NVFP4
+-----------------------------------+-----------+
| Key                               | Value     |
+===================================+===========+
| Time taken for tests (s)          | 440.193   |
| Number of concurrency             | 15        |
| Total requests                    | 100       |
| Succeed requests                  | 100       |
| Failed requests                   | 0         |
| Output token throughput (tok/s)   | 279.916   |
| Total token throughput (tok/s)    | 286.479   |
| Request throughput (req/s)        | 0.2272    |
| Average latency (s)               | 62.839    |
| Avg time to first token (s)       | 0.1092    |
| Avg time per output token (s)     | 0.0509    |
+-----------------------------------+-----------+

vLLM + Qwen3‑30B‑A3B‑AWQ
+-----------------------------------+-----------+
| Key                               | Value     |
+===================================+===========+
| Time taken for tests (s)          | 417.250   |
| Number of concurrency             | 15        |
| Total requests                    | 100       |
| Succeed requests                  | 100       |
| Failed requests                   | 0         |
| Output token throughput (tok/s)   | 284.481   |
| Total token throughput (tok/s)    | 291.405   |
| Request throughput (req/s)        | 0.2397    |
| Average latency (s)               | 59.3749   |
| Avg time to first token (s)       | 0.1258    |
| Avg time per output token (s)     | 0.0499    |
+-----------------------------------+-----------+
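The derived metrics in the tables can be cross‑checked from the raw counts they report. A quick consistency pass over the vLLM + NVFP4 run (small rounding differences are expected, since evalscope prints rounded values):

```python
# Cross-check derived benchmark metrics against the raw numbers reported in
# the vLLM + Qwen3-30B-A3B-NVFP4 table above.
total_seconds = 440.193
total_requests = 100
output_tok_s = 279.916

req_per_s = total_requests / total_seconds          # table reports 0.2272
total_output_tokens = output_tok_s * total_seconds  # tokens generated overall

print(f"request throughput: {req_per_s:.4f} req/s")
print(f"~{total_output_tokens:,.0f} output tokens generated")
```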
Benchmark summary chart

Analysis of Results

Among the measured configurations, vLLM + AWQ achieves the highest output throughput (284.5 tok/s) and the lowest average latency (59.4 s), narrowly ahead of vLLM + NVFP4 (279.9 tok/s, 62.8 s). The TensorRT‑LLM + NVFP4 stack trails both and shows a dramatically higher average time to first token (23.1 s versus roughly 0.11–0.13 s for the vLLM runs) because the FlashInfer kernels required for fused MoE operations are missing in the current TensorRT‑LLM build for Blackwell GPUs; consequently, the theoretical advantages of FP4 are not fully realized. Both TensorRT‑LLM and vLLM are still maturing support for the new SM 12.1 (GB10) architecture, and batch‑size and KV‑cache tuning guidelines have not been published, so practitioners must rely on empirical testing.

Profiling Tools (Nsight Systems & Compute)

Performance investigation on a DGX can be performed with Nsight Systems (nsys) and Nsight Compute. Two capture modes are common:

Mode 1 – In‑container capture: launch the inference container under nsys launch and control profiling from inside the container with nsys start/stop.

Mode 2 – Host‑side capture: run nsys profile on the host; this captures GPU metrics but not detailed CUDA traces.

# Mode 1 – container launch with nsys
$ docker run -d \
  --name trtllm-serve-nsys \
  -v "/data/llm-model/nvidia/Qwen3-30B-A3B-NVFP4:/workspace/model" \
  --gpus=all \
  --ipc=host \
  --network host \
  --restart unless-stopped \
  --privileged \
  nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev \
  nsys launch --trace=cuda,osrt,nvtx,cublas \
    --session-new=my-session --show-output=true \
    trtllm-serve /workspace/model \
    --backend pytorch --max_batch_size 10 --host 0.0.0.0 --port 8000

$ nsys start --backtrace=none --sample=system-wide \
    --gpu-metrics-device=all --gpu-metrics-frequency=100000 \
    --session=my-session --output=trtllm02.nsys-rep

# Run the same evalscope workload
$ evalscope perf \
    --url "http://127.0.0.1:8000/v1/chat/completions" \
    --parallel 15 \
    --model Qwen3-30B \
    --number 100 \
    --api openai \
    --dataset openqa \
    --stream

$ nsys stop --session=my-session

# Mode 2 – host‑side capture
$ sudo nsys profile --sample=system-wide --gpu-metrics-device=all \
    --gpu-metrics-frequency=100000 --duration=10 -o trtllm01
Nsight Systems timeline view

Tags: LLM, Quantization, vLLM, Performance Benchmark, Inference, TensorRT-LLM, NVFP4, NVIDIA DGX
Written by AI Cyberspace
AI, big data, cloud computing, and networking.