Why NVLink Supercharges Llama 3 70B Inference: A Deep Performance Breakdown

An in‑depth analysis shows that NVLink 3.0 reduces all‑reduce communication latency for Llama 3 70B inference from over 1.8 seconds to under 100 ms — a roughly 19× reduction compared with PCIe 4.0 — highlighting the critical role of high‑bandwidth interconnects in large‑model deployments.

Baobao Algorithm Notes

When running a 70‑billion‑parameter Llama 3 model with 4‑way tensor parallelism (TP4) on RTX 4090 GPUs using FP8 tensor cores, the author discovered that communication overhead dominates inference latency, particularly in the all‑reduce steps required to synchronize activations between GPUs.

The model has 80 layers, and tensor parallelism requires two all‑reduce operations per layer (one after attention, one after the MLP), so each forward pass triggers 160 all‑reduces. With a hidden dimension of 8 192, a single token's activation occupies 8 192 × 2 = 16 384 bytes in FP16, so the total data transferred per token is substantial.
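The per-token arithmetic can be checked directly — a minimal sketch using the layer count, hidden size, and FP16 width quoted above:

```python
HIDDEN_DIM = 8192          # Llama 3 70B hidden dimension
FP16_BYTES = 2             # bytes per FP16 element
LAYERS = 80                # transformer layers
ALLREDUCES_PER_LAYER = 2   # one after attention, one after the MLP

bytes_per_token = HIDDEN_DIM * FP16_BYTES             # 16 384 bytes
allreduces_per_token = LAYERS * ALLREDUCES_PER_LAYER  # 160 calls per forward pass
print(bytes_per_token, allreduces_per_token)          # → 16384 160
```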

Using nccl-tests, the author measured the latency of a single all‑reduce as a function of token count. For an input of 4 096 tokens, NVLink 3.0 incurred 603 µs per all‑reduce, while PCIe 4.0 required 11 369 µs. Multiplying by 80 layers and two reductions per layer yields total communication costs of 96.48 ms (NVLink) versus 1 819.04 ms (PCIe).
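Scaling the measured per-call latencies to a full forward pass reproduces the totals above:

```python
NVLINK_US = 603     # measured per all-reduce with nccl-tests, 4 096-token input
PCIE_US = 11369     # same measurement over PCIe 4.0
CALLS = 80 * 2      # 80 layers x 2 all-reduces per layer

nvlink_total_ms = NVLINK_US * CALLS / 1000   # 96.48 ms
pcie_total_ms = PCIE_US * CALLS / 1000       # 1 819.04 ms
```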

Because the compute portion of the workload is roughly 800 ms, the PCIe‑based communication cost dwarfs the useful work, leaving the GPU idle for most of the inference time.
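Putting the compute and communication figures side by side shows how little of the wall-clock time is useful work over PCIe — a rough estimate, taking the article's ~800 ms compute figure at face value:

```python
compute_ms = 800.0        # approximate compute time per forward pass
comm_pcie_ms = 1819.04    # total all-reduce time over PCIe 4.0
comm_nvlink_ms = 96.48    # total all-reduce time over NVLink 3.0

# Fraction of wall-clock time spent waiting on communication
pcie_comm_share = comm_pcie_ms / (compute_ms + comm_pcie_ms)        # ~0.69
nvlink_comm_share = comm_nvlink_ms / (compute_ms + comm_nvlink_ms)  # ~0.11
```

By this estimate, roughly 69% of PCIe wall-clock time goes to all-reduce, versus about 11% with NVLink.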

To validate these figures, the author ran vLLM 0.6.6 on a 4 × A100 setup, disabling NVLink with NCCL_P2P_DISABLE=1. With NVLink enabled, prefill latency was about 878.57 ms; disabling NVLink increased latency to 2 740.17 ms, confirming a ~1.9‑second communication penalty that matches the nccl‑test results.
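The measured vLLM penalty can be compared against the nccl-tests prediction with simple arithmetic:

```python
prefill_nvlink_ms = 878.57   # vLLM 0.6.6, 4 x A100, NVLink enabled
prefill_pcie_ms = 2740.17    # NCCL_P2P_DISABLE=1 forces traffic over PCIe

measured_penalty_ms = prefill_pcie_ms - prefill_nvlink_ms   # 1 861.60 ms
predicted_penalty_ms = 1819.04 - 96.48                      # 1 722.56 ms from nccl-tests
```

The measured ~1.86 s penalty lands within about 8% of the nccl-tests prediction, which is reasonable agreement given that the end-to-end run includes scheduling and kernel-launch overheads the microbenchmark does not capture.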

Further calculations show that for a 4 096‑token input, the total all‑reduce data volume is 384 MB (N × 2 × (D − D/N), with N = 4 GPUs and D = 64 MB per all‑reduce, i.e. 4 096 tokens × 16 384 bytes). NVLink delivers an effective bus bandwidth of roughly 637 GB/s (384 MB / 603 µs), whereas PCIe provides only about 33.8 GB/s (384 MB / 11 369 µs), in line with the specifications of NVLink 3.0 and PCIe 4.0.
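A quick sanity check of the ring all-reduce volume and the implied bus bandwidths (small deviations from the quoted figures come from rounding the measured latencies):

```python
N = 4          # GPUs in the tensor-parallel group
D_MB = 64.0    # MB reduced per all-reduce (4 096 tokens x 16 384 bytes)

# Ring all-reduce: total traffic over the interconnect is N * 2 * (D - D/N),
# which simplifies to 2 * D * (N - 1).
total_mb = N * 2 * (D_MB - D_MB / N)       # 384 MB

nvlink_gb_s = total_mb * 1e-3 / 603e-6     # ~637 GB/s
pcie_gb_s = total_mb * 1e-3 / 11369e-6     # ~33.8 GB/s
```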

The study concludes that fast interconnects like NVLink are essential for extracting the full potential of modern GPUs in large‑model inference; without them, communication becomes the bottleneck, rendering much of the GPU’s compute capability unusable.

Figure: NVLink 3.0 vs PCIe 4.0 latency comparison
Figure: NVLink 3.0 vs PCIe 4.0 bandwidth comparison
Figure: NVLink 3.0 vs P2P vs PCIe 4.0
Tags: GPU inference, NVLink, PCIe, Llama 3, All-reduce
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.
