Decoding GPU Server Topologies: From PCIe to NVLink for Large‑Model Training

This article provides a detailed technical overview of modern multi‑GPU server architectures—including PCIe switches, NVLink, NVSwitch, and HBM—explaining their hardware topologies, bandwidth characteristics, monitoring methods, and network choices to help engineers design efficient AI training clusters.

Architects' Tech Alliance

Apr 15, 2024

1. Terminology and Foundations

Large‑model training typically runs on servers with eight GPUs per node (e.g., 8×A100, 8×A800, 8×H100). A typical 8‑GPU A100 node contains two CPUs, multiple PCIe Gen4 switches, six NVSwitch chips, eight GPUs, and dedicated NICs.

PCIe Switch Chip: Connects CPUs, memory, storage (NVMe), GPUs, and NICs via the PCIe bus, enabling inter‑device communication.

NVLink (definition from Wikipedia): a wire‑based, multi‑lane, short‑range serial communication link developed by NVIDIA. Unlike PCIe, a device can have multiple NVLinks, and devices communicate over a mesh instead of through a central hub. Key properties:

A short‑distance link that guarantees packet delivery and offers higher performance than PCIe.

Supports multiple lanes; total bandwidth = lanes × per‑lane bandwidth.

Within a node, GPUs are connected in a full‑mesh topology.

Proprietary NVIDIA technology.

NVLink Evolution: Generations 1 through 4 differ mainly in lane count and per‑lane bandwidth. For example, each A100 GPU connects to each of the six NVSwitch chips with 2 lanes, giving 12 lanes and 600 GB/s of bidirectional bandwidth (50 GB/s per lane). The A800 has four of those lanes removed, leaving 8 lanes and 400 GB/s bidirectional.
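The lane arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and the 50 GB/s bidirectional per‑lane figure is an assumption derived from the article's own numbers (600 GB/s ÷ 12 lanes).

```python
def nvlink_bandwidth_gbs(lanes: int, per_lane_gbs: float) -> float:
    """Total bandwidth = lanes x per-lane bandwidth.

    The result has the same directionality (one-way or bidirectional)
    as the per-lane figure passed in.
    """
    return lanes * per_lane_gbs


# Assumption: 50 GB/s bidirectional per lane, derived from 600 GB/s / 12 lanes.
PER_LANE_BIDIR_GBS = 50

# A100: 2 lanes to each of 6 NVSwitch chips.
a100 = nvlink_bandwidth_gbs(lanes=2 * 6, per_lane_gbs=PER_LANE_BIDIR_GBS)
# A800: the same layout with four lanes removed.
a800 = nvlink_bandwidth_gbs(lanes=2 * 6 - 4, per_lane_gbs=PER_LANE_BIDIR_GBS)

print(a100, a800)  # 600 400
```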

NVLink Monitoring: Real‑time NVLink bandwidth can be collected with NVIDIA DCGM (Data Center GPU Manager).
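One possible way to collect these metrics is to call the `dcgmi dmon` command from a script. The sketch below is not from the original article: it assumes DCGM is installed with a running host engine, uses the DCGM profiling field IDs 1011 and 1012 (NVLink TX and RX bytes per second), and assumes that data rows in the output look like `GPU <id> <tx> <rx>`. The helper names are hypothetical, and the row format should be checked against the DCGM version in use.

```python
import subprocess

# DCGM profiling field IDs for NVLink traffic, in bytes per second.
NVLINK_TX_BYTES = 1011  # DCGM_FI_PROF_NVLINK_TX_BYTES
NVLINK_RX_BYTES = 1012  # DCGM_FI_PROF_NVLINK_RX_BYTES


def dmon_command(interval_ms: int = 1000, samples: int = 5) -> list[str]:
    """Build a `dcgmi dmon` command line that samples NVLink TX/RX."""
    return [
        "dcgmi", "dmon",
        "-e", f"{NVLINK_TX_BYTES},{NVLINK_RX_BYTES}",
        "-d", str(interval_ms),
        "-c", str(samples),
    ]


def parse_dmon(text: str) -> list[tuple[int, float, float]]:
    """Parse dmon rows into (gpu_id, tx_GB_per_s, rx_GB_per_s).

    Assumption: data rows look like 'GPU 0 <tx> <rx>'. Header lines and
    rows with non-numeric values (e.g. 'N/A') are skipped.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4 or parts[0] != "GPU":
            continue
        try:
            gpu, tx, rx = int(parts[1]), float(parts[2]), float(parts[3])
        except ValueError:
            continue
        rows.append((gpu, tx / 1e9, rx / 1e9))
    return rows


def sample_nvlink() -> list[tuple[int, float, float]]:
    """Run dcgmi and return parsed samples (needs a GPU host with DCGM)."""
    out = subprocess.run(dmon_command(), capture_output=True, text=True, check=True)
    return parse_dmon(out.stdout)
```

Calling `sample_nvlink()` on a GPU host would return one tuple per GPU per sample, which can be compared against the theoretical NVLink figures above.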

2. NVSwitch

NVSwitch is an NVIDIA switch chip packaged on the GPU module; it is not a standalone switch outside the host. In a typical 8‑GPU A100 machine, six NVSwitch chips interconnect the GPUs in a full‑mesh fabric.

The full‑mesh fabric provides 12 NVLink lanes per GPU (for A100), resulting in 600 GB/s bidirectional bandwidth (300 GB/s one‑way).

3. High‑Bandwidth Memory (HBM)

HBM stacks multiple DRAM dies vertically and packages them next to the GPU die, so the GPU reaches its memory over a wide, short on‑package interface without crossing the PCIe bus. Current HBM generations (HBM1/2/2e/3/3e) offer per‑pin transfer rates up to 8 GT/s. Major suppliers include SK Hynix and Samsung.

AMD MI300X uses 192 GB of HBM3 (about 5.3 TB/s of memory bandwidth).

HBM3e raises the per‑pin transfer rate to 8 GT/s.

4. Bandwidth Units and Bottleneck Analysis

AI training performance is tightly linked to data‑transfer speeds across PCIe, NVLink, HBM, and network links. Network bandwidth is usually expressed in bits per second (b/s) and quoted per direction (TX or RX). Other links are expressed in bytes per second (B/s) or transfers per second (T/s) and are typically reported as total bidirectional bandwidth. Convert every figure to the same unit and direction before comparing links.
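The unit conversions described above are simple but easy to get wrong, so a small helper can make them explicit. The following is a minimal sketch with hypothetical function names. The PCIe Gen4 x16 example uses 16 GT/s per lane and 16 lanes, which are assumptions not stated in this article, and it ignores encoding overhead.

```python
def gbps_to_gbs(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8


def gts_to_gbs(gts: float, bus_width_bits: int) -> float:
    """Convert giga-transfers per second to GB/s for a bus of the given width.

    Each transfer moves `bus_width_bits` bits; encoding overhead is ignored.
    """
    return gts * bus_width_bits / 8


def one_way(bidirectional: float) -> float:
    """Per-direction share of a symmetric bidirectional figure."""
    return bidirectional / 2


print(gbps_to_gbs(100))    # 12.5  (a 100 Gbps NIC, per direction)
print(one_way(600))        # 300.0 (A100 NVLink, per direction)
print(gts_to_gbs(16, 16))  # 32.0  (PCIe Gen4 x16, per direction, raw)
```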

5. Typical 8×A100/8×A800 Host Topology

Hardware layout (2‑2‑4‑6‑8‑8):

2 CPUs (NUMA) with attached memory.

2 storage NICs.

4 PCIe Gen4 switch chips.

6 NVSwitch chips.

8 GPUs.

8 GPU‑dedicated NICs.

GPU‑to‑GPU communication uses NVLink (full‑mesh), GPU‑to‑NIC uses PCIe Gen4 switches (64 GB/s bidirectional), and inter‑node GPU communication relies on the NICs and the external network.

Network Choices

RoCEv2 (commonly used in public‑cloud 8‑GPU servers; the more cost‑effective option).

InfiniBand (reportedly ≈20% higher performance at roughly double the price).

6. Bandwidth Bottleneck Illustration

Key link bandwidths:

GPU‑to‑GPU (NVLink): 600 GB/s bidirectional (300 GB/s one‑way).

GPU‑to‑NIC (PCIe Gen4 switch): 64 GB/s bidirectional (32 GB/s one‑way).

Inter‑node GPU (NIC‑based): typically 100 Gbps (12.5 GB/s) per direction. A 200 Gbps NIC (25 GB/s) approaches the PCIe Gen4 one‑way limit of 32 GB/s, and a 400 Gbps NIC (50 GB/s) exceeds it.

Using 400 Gbps NICs therefore provides little benefit unless the PCIe bus is upgraded to Gen5.
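The bottleneck argument can be expressed as taking the minimum over the hops on a path. The sketch below is a simplified model under stated assumptions: it ignores protocol overhead, the function names are hypothetical, and the PCIe Gen5 figure of 64 GB/s one‑way is an assumption not given in the article.

```python
# One-way bandwidth in GB/s, taken from the link figures listed above.
PCIE_GEN4_ONE_WAY = 32.0
# Assumption (not from the article): PCIe Gen5 x16 doubles Gen4.
PCIE_GEN5_ONE_WAY = 64.0


def nic_one_way_gbs(nic_gbps: float) -> float:
    """NIC line rate in GB/s per direction."""
    return nic_gbps / 8


def cross_node_bottleneck(nic_gbps: float,
                          pcie_one_way: float = PCIE_GEN4_ONE_WAY) -> float:
    """A GPU-to-remote-GPU transfer crosses PCIe, then the NIC.

    The slower of the two hops bounds the achievable one-way bandwidth.
    """
    return min(pcie_one_way, nic_one_way_gbs(nic_gbps))


for nic in (100, 200, 400):
    print(nic, cross_node_bottleneck(nic))
# 100 12.5
# 200 25.0
# 400 32.0

print(cross_node_bottleneck(400, PCIE_GEN5_ONE_WAY))  # 50.0
```

With PCIe Gen4 the 400 Gbps NIC is capped at 32 GB/s, while under the assumed Gen5 figure the NIC itself becomes the limit at 50 GB/s.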

7. Typical 8×H100/8×H800 Host

H100 is available in PCIe Gen5 and SXM5 form factors; 8‑GPU training hosts use SXM5. Each GPU has 18 NVLink lanes at 25 GB/s per direction (50 GB/s bidirectional) per lane, for a total of 900 GB/s bidirectional bandwidth. The number of NVSwitch chips drops from six to four.

8. L40S GPU Server Overview

L40S (2023) is a cost‑effective GPU targeting the A100 market but lacks FP64 support and NVLink. It uses GDDR6 memory, reducing reliance on HBM supply.

Recommended Architecture (2‑2‑4) for a 4‑GPU L40S node:

2 CPUs (NUMA).

2 dual‑port CX7 NICs (2×200 Gbps each).

4 L40S GPUs.

1 dual‑port storage NIC.

Each GPU receives roughly 200 Gbps network bandwidth. The non‑recommended 2‑2‑8 layout would require additional PCIe Gen5 switches, increasing cost and reducing per‑GPU bandwidth.
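The per‑GPU figure follows from dividing the total NIC bandwidth by the number of GPUs. The sketch below assumes each CX7 port runs at 200 Gbps and that the 2‑2‑8 layout keeps the same two NICs. Both are assumptions made for illustration, and the function name is hypothetical.

```python
def per_gpu_network_gbps(nics: int, ports_per_nic: int,
                         port_gbps: float, gpus: int) -> float:
    """Evenly divide aggregate NIC bandwidth across the GPUs in a node."""
    return nics * ports_per_nic * port_gbps / gpus


# Recommended 2-2-4 layout: 2 dual-port NICs shared by 4 GPUs.
print(per_gpu_network_gbps(nics=2, ports_per_nic=2, port_gbps=200, gpus=4))  # 200.0

# 2-2-8 layout, assuming the same two NICs are shared by 8 GPUs.
print(per_gpu_network_gbps(nics=2, ports_per_nic=2, port_gbps=200, gpus=8))  # 100.0
```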

Performance Implications

Because the L40S has no NVLink, GPU‑to‑GPU traffic within a node is limited to 200 Gbps (≈25 GB/s) one‑way when it goes through the NICs and the external network.

A100 NVLink provides 300 GB/s one‑way, about 12× the L40S figure.

L40S is unsuitable for data‑intensive large‑model training unless a 200 Gbps+ network is provisioned.

9. Testing Recommendations

Even a 4‑GPU L40S test setup must be paired with a 200 Gbps switch to realize the advertised performance.
