Unlocking GPU Computing: PCIe, NVLink, NVSwitch, and HBM Explained
This article breaks down the core components of high‑performance GPU servers: PCIe switch chips, the evolution of NVLink from version 1.0 to 4.0, NVSwitch architecture, HBM memory generations, and the nuances of bandwidth units. Together these provide a technical foundation for understanding large‑scale model training.
GPU Server Topology
In large‑scale model training, a high‑performance GPU server is typically a single chassis housing eight GPUs such as the A100, A800, H100, or H800, and newer models such as the L40S may follow. Inside the chassis, the GPUs are interconnected in a full‑mesh topology.
PCIe Switch Chip
PCIe (Peripheral Component Interconnect Express) is the primary bus through which CPUs, NVMe storage, GPUs, and network adapters are interconnected, either directly or through PCIe switch chips. Gen5 doubles the per‑lane transfer rate of Gen4 (32 GT/s versus 16 GT/s), which makes PCIe a pivotal component in modern high‑performance computing clusters.
NVLink Overview
Definition
NVLink is NVIDIA’s proprietary high‑speed interconnect and communication protocol introduced in March 2014. It uses a point‑to‑point serial topology that can connect a CPU to a GPU or link multiple GPUs directly, offering multiple links per device and a mesh‑style network rather than a central hub.
Evolution (NVLink 1.0 – 4.0)
NVLink 1.0: 4 links, up to 160 GB/s bidirectional bandwidth.
NVLink 2.0: 6 links, up to 300 GB/s bidirectional bandwidth.
NVLink 3.0: 12 links, up to 600 GB/s bidirectional bandwidth.
NVLink 4.0: 18 links, up to 900 GB/s bidirectional bandwidth.
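The generation figures above also imply a per‑link bandwidth, which is often the more useful number when reasoning about topology. The following is a minimal sketch that divides each generation's total by its link count; the dictionary and function names are illustrative, and the inputs are simply the values listed above.

```python
# Per-link bandwidth implied by the NVLink generation figures in this article.
# Each entry maps a generation to (link count, total bidirectional GB/s).
NVLINK_GENERATIONS = {
    "1.0": (4, 160),
    "2.0": (6, 300),
    "3.0": (12, 600),
    "4.0": (18, 900),
}


def per_link_bandwidth(links, total_gb_s):
    """Return the bidirectional bandwidth of a single link in GB/s."""
    return total_gb_s / links


per_link = {
    gen: per_link_bandwidth(links, total)
    for gen, (links, total) in NVLINK_GENERATIONS.items()
}

for gen, bw in per_link.items():
    print(f"NVLink {gen}: {bw:.0f} GB/s per link (bidirectional)")
```

On these numbers, the per‑link rate rises from 40 GB/s in NVLink 1.0 to 50 GB/s in 2.0 and then stays flat, so the later gains in total bandwidth come from adding links per GPU.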
NVSwitch
NVSwitch is NVIDIA’s dedicated switch chip for intra‑node communication among multiple GPUs. In an 8‑GPU A100 configuration the NVSwitch sits beneath the large heat sinks, providing low‑latency, high‑throughput full‑mesh connectivity.
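One way to see why a switch chip helps is to compare it with a hypothetical switchless layout in which each GPU wires its links directly to its seven peers. The sketch below assumes an A100‑class GPU with 12 links at 50 GB/s each (600 GB/s divided by 12), an even split of links across peers, and an idealized non‑blocking switch fabric; the function names are illustrative and this is not a description of NVIDIA's actual wiring.

```python
def direct_mesh_pair_bandwidth(gpus, links_per_gpu, gb_s_per_link):
    """Pairwise bandwidth if links are wired GPU-to-GPU and split evenly.

    Links that cannot be divided evenly across peers are left unused
    in this simplified model.
    """
    links_per_peer = links_per_gpu // (gpus - 1)
    return links_per_peer * gb_s_per_link


def switched_pair_bandwidth(links_per_gpu, gb_s_per_link):
    """Pairwise bandwidth if every link feeds an idealized non-blocking switch."""
    return links_per_gpu * gb_s_per_link


# Assumed A100-class figures: 8 GPUs, 12 links per GPU, 50 GB/s per link.
direct = direct_mesh_pair_bandwidth(gpus=8, links_per_gpu=12, gb_s_per_link=50)
switched = switched_pair_bandwidth(links_per_gpu=12, gb_s_per_link=50)

print(f"direct mesh, any GPU pair: {direct} GB/s")
print(f"through switch, any GPU pair: {switched} GB/s")
```

Under these assumptions a pair of GPUs gets 50 GB/s in the direct mesh but can use the full 600 GB/s through the switch, as long as other GPUs are not competing for the same links.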
NVLink Switch
The term “NVLink switch” originally referred to on‑board switching logic within a GPU module. In 2022 NVIDIA released an independent NVLink switch product, distinct from NVSwitch, to enable high‑performance GPU communication across separate hosts.
HBM (High‑Bandwidth Memory)
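In the traditional arrangement, memory sits on the motherboard as DDR chips and a processor reaches it across PCIe, which caps bandwidth at roughly 64 GB/s (Gen4 x16) or 128 GB/s (Gen5 x16), counting both directions. HBM instead stacks multiple DRAM dies vertically and places the stack in the same package as the GPU, so memory traffic no longer crosses PCIe. This raises memory bandwidth by more than an order of magnitude, as demonstrated in NVIDIA’s H100 architecture.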
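The PCIe figures quoted above can be reproduced from the per‑lane transfer rate. The sketch below assumes a x16 slot, the 128b/130b line encoding used by PCIe Gen3 and later, and no further protocol overhead, so it gives an upper bound rather than achievable throughput; the function name is illustrative.

```python
def pcie_bandwidth_gb_s(gt_per_s, lanes=16, encoding=128 / 130):
    """Return (per_direction, bidirectional) PCIe bandwidth in GB/s.

    gt_per_s is the per-lane rate in gigatransfers per second; each
    transfer carries one bit per lane before encoding overhead.
    """
    per_direction = gt_per_s * lanes * encoding / 8  # bits -> bytes
    return per_direction, 2 * per_direction


gen4 = pcie_bandwidth_gb_s(16)  # PCIe Gen4: 16 GT/s per lane
gen5 = pcie_bandwidth_gb_s(32)  # PCIe Gen5: 32 GT/s per lane

print(f"Gen4 x16: {gen4[0]:.1f} GB/s per direction, {gen4[1]:.1f} GB/s bidirectional")
print(f"Gen5 x16: {gen5[0]:.1f} GB/s per direction, {gen5[1]:.1f} GB/s bidirectional")
```

The results come out slightly below 64 GB/s and 128 GB/s because of the encoding overhead; the rounded figures are the ones usually quoted.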
HBM Evolution
Bandwidth Unit Analysis
When evaluating GPU‑centric systems, several bandwidth metrics must be considered: PCIe, memory, NVLink, HBM, and network links. Network speeds are usually quoted in bits per second (bit/s) for a single direction (TX or RX), while PCIe, memory, NVLink, and HBM figures are usually quoted in bytes per second (Byte/s) or transfers per second (T/s) and generally give the combined bidirectional total. Converting these units to a common basis before comparing them is essential for understanding the data‑transfer limits that affect large‑scale GPU training performance.
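A small worked conversion makes the difference between the two conventions concrete. The sketch below assumes a network link is rated per direction in Gbit/s and that interconnect figures are bidirectional totals in GB/s, as described above; the function names and the 400 Gbit/s example link are illustrative.

```python
def network_to_bidirectional_gb_s(gbit_s_per_direction):
    """Convert a per-direction network rate (Gbit/s) to a bidirectional total (GB/s)."""
    per_direction_gb_s = gbit_s_per_direction / 8  # 8 bits per byte
    return 2 * per_direction_gb_s


def bidirectional_gb_s_to_network(gb_s_bidirectional):
    """Convert a bidirectional total (GB/s) to a per-direction rate in Gbit/s."""
    return gb_s_bidirectional / 2 * 8


# A hypothetical 400 Gbit/s network link, expressed the way NVLink is quoted.
nic_total = network_to_bidirectional_gb_s(400)

# NVLink 4.0 at 900 GB/s, expressed the way a network link is quoted.
nvlink_per_direction = bidirectional_gb_s_to_network(900)

print(f"400 Gbit/s link = {nic_total:.0f} GB/s bidirectional")
print(f"900 GB/s NVLink = {nvlink_per_direction:.0f} Gbit/s per direction")
```

Placed on the same basis, the 400 Gbit/s link corresponds to 100 GB/s bidirectional, and 900 GB/s of NVLink corresponds to 3600 Gbit/s per direction, a ninefold gap that the raw "400" and "900" labels hide.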
Source: https://community.fs.com/cn/article/unveiling-the-foundations-of-gpu-computing1.html
Architects' Tech Alliance
