Fundamentals of GPU Computing: PCIe, NVLink, NVSwitch, and HBM
This article provides a comprehensive overview of the core components and terminology of large‑scale GPU computing, covering GPU server architecture, PCIe interconnects, NVLink generations, NVSwitch, high‑bandwidth memory (HBM), and bandwidth unit considerations for AI and HPC workloads.
PCIe Switch Chip
In high‑performance GPU servers, key components such as CPUs, memory modules, NVMe storage, GPUs, and network adapters communicate over the PCIe (Peripheral Component Interconnect Express) bus. When the CPU cannot supply enough lanes for every device, dedicated PCIe switch chips fan out the connectivity. The current Gen5 generation runs at 32 GT/s per lane, enabling highly efficient inter‑device communication.
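As a rough sketch, the usable bandwidth of a PCIe link can be estimated from the per‑lane transfer rate and the 128b/130b line encoding used since Gen3. The figures below are illustrative estimates, not vendor specifications:

```python
# Approximate usable PCIe bandwidth per direction (illustrative sketch).
# Per-lane raw rates in GT/s; Gen3 and later use 128b/130b line encoding.
PCIE_GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}

def pcie_gbytes_per_s(gen: int, lanes: int = 16) -> float:
    """Usable GB/s per direction for a PCIe link of the given generation."""
    return PCIE_GT_PER_S[gen] * lanes * (128 / 130) / 8  # bits -> bytes

for gen in (3, 4, 5):
    print(f"Gen{gen} x16: ~{pcie_gbytes_per_s(gen):.0f} GB/s per direction")
```

A Gen5 x16 slot therefore tops out around 63 GB/s each way (about 128 GB/s bidirectional), which matches the ceilings discussed later in this article.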
NVLink Overview
NVLink Definition
NVLink is NVIDIA's proprietary high‑speed bus and communication protocol. It uses point‑to‑point serial links to connect CPUs and GPUs, or multiple GPUs, directly: each device carries several NVLink links, forming a mesh rather than routing through a central hub. First announced in March 2014, it is built on NVIDIA's NVHS (NVIDIA High‑Speed) signaling technology.
The technology enables full interconnectivity among GPUs within the same node and has evolved through several generations to increase bidirectional bandwidth for high‑performance computing applications.
NVLink Evolution: From 1.0 to 4.0
NVLink's evolution is illustrated in the diagram below.
NVLink 1.0
Connection method: 4 NVLink links per GPU.
Total bandwidth: up to 160 GB/s bidirectional.
Use case: accelerate data transfer between GPUs to improve collaborative compute performance.
NVLink 2.0
Connection method: 6 NVLink links per GPU.
Total bandwidth: increased to 300 GB/s bidirectional.
Performance boost: higher data‑transfer rates and improved GPU‑to‑GPU communication efficiency.
NVLink 3.0
Connection method: 12 NVLink links per GPU.
Total bandwidth: 600 GB/s bidirectional.
New features: additional protocols and techniques to further raise bandwidth and efficiency.
NVLink 4.0
Connection method: 18 NVLink links per GPU.
Total bandwidth: up to 900 GB/s bidirectional.
Performance improvement: meets the bandwidth demands of modern AI and high‑performance compute workloads.
The key differences among NVLink 1.0‑4.0 are the growing number of links per GPU and the resulting bidirectional bandwidth, steadily improving GPU‑to‑GPU data transfer for increasingly demanding applications.
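The generational figures above follow a simple model: total bidirectional bandwidth equals the number of links per GPU times the per‑link rate. The per‑link rates below (40 GB/s for NVLink 1.0, 50 GB/s thereafter) are commonly cited values, assumed here for illustration:

```python
# Total bidirectional NVLink bandwidth = links per GPU x per-link rate.
# Per-link bidirectional rates (GB/s) are commonly cited figures, assumed here.
NVLINK_GENS = {
    "1.0": (4, 40),   # 4 links  x 40 GB/s = 160 GB/s
    "2.0": (6, 50),   # 6 links  x 50 GB/s = 300 GB/s
    "3.0": (12, 50),  # 12 links x 50 GB/s = 600 GB/s
    "4.0": (18, 50),  # 18 links x 50 GB/s = 900 GB/s
}

for gen, (links, per_link) in NVLINK_GENS.items():
    print(f"NVLink {gen}: {links} x {per_link} = {links * per_link} GB/s bidirectional")
```

The products reproduce exactly the 160/300/600/900 GB/s totals listed above.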
NVSwitch
NVSwitch is an NVIDIA‑designed switch chip that provides high‑speed, low‑latency communication among multiple GPUs within a single host, targeting high‑performance computing and AI workloads.
The diagram below shows a typical 8‑GPU A100 system, where six NVSwitch chips (hidden beneath six large heat sinks) tightly couple the eight GPUs for efficient data exchange.
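For intuition, the fabric math of such a system can be sketched as follows. The figures (eight GPUs, six NVSwitch chips, twelve NVLink 3.0 links per GPU at 50 GB/s each) are assumptions drawn from the commonly described DGX A100 topology:

```python
# Sketch of an 8-GPU / 6-NVSwitch fabric (DGX A100-style, assumed figures).
gpus = 8
switches = 6
links_per_gpu = 12     # NVLink 3.0 links per A100 GPU
per_link_gb_s = 50     # bidirectional GB/s per NVLink 3.0 link

links_per_gpu_per_switch = links_per_gpu // switches  # each GPU wires 2 links to every switch
per_gpu_fabric_bw = links_per_gpu * per_link_gb_s     # full NVLink rate through the fabric

print(f"Each GPU reaches every switch over {links_per_gpu_per_switch} links")
print(f"Per-GPU fabric bandwidth: {per_gpu_fabric_bw} GB/s bidirectional")
```

Because every GPU connects to every switch, any GPU pair can exchange data at the full 600 GB/s NVLink 3.0 rate without hop‑by‑hop forwarding through other GPUs.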
NVLink Switch
NVLink Switch is a standalone switching device introduced by NVIDIA in 2022 to enable high‑performance GPU communication across multiple hosts, distinct from the NVSwitch that is integrated inside a single host. Early references to “NVLink Switch” actually described the on‑board switch chip; the 2022 product is a separate, rack‑mountable unit.
HBM (High‑Bandwidth Memory)
Traditional GPU designs route memory traffic to the host over PCIe, which caps transfer rates (64 GB/s bidirectional for a PCIe Gen4 x16 link, 128 GB/s for Gen5). To overcome this, GPU vendors stack multiple DRAM dies into High‑Bandwidth Memory (HBM) packaged alongside the GPU die, allowing GPUs (e.g., the NVIDIA H100) to access memory directly without traversing PCIe, dramatically increasing data‑transfer rates.
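To see why this matters, compare the PCIe ceilings above with a typical HBM3 figure. The H100 number below (about 3.35 TB/s) is a commonly cited value and is an assumption for illustration:

```python
# Compare PCIe ceilings (from the text) with an HBM3 figure (assumed).
pcie_gen4_x16 = 64    # GB/s bidirectional, per the article
pcie_gen5_x16 = 128   # GB/s bidirectional, per the article
hbm3_h100 = 3350      # GB/s, commonly cited H100 SXM figure (assumption)

print(f"HBM3 vs PCIe Gen5 x16: ~{hbm3_h100 / pcie_gen5_x16:.0f}x")
print(f"HBM3 vs PCIe Gen4 x16: ~{hbm3_h100 / pcie_gen4_x16:.0f}x")
```

Even against Gen5, the on‑package HBM path is more than an order of magnitude wider, which is why memory‑bound AI kernels are sized against HBM bandwidth rather than PCIe.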
HBM Evolution: From HBM1 to HBM3e
Bandwidth Unit Analysis
In large‑scale GPU training, system performance depends on several bandwidth channels: PCIe, memory, NVLink, HBM, and network. Network bandwidth is usually expressed in bits per second (bit/s) and quoted per direction (TX/RX), while PCIe, memory, NVLink, and HBM bandwidths are quoted in bytes per second (Byte/s) or transfers per second (T/s), typically as total bidirectional capacity.
Accurately identifying and converting these units is essential for a comprehensive understanding of data‑transfer capabilities that affect GPU training performance.
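A minimal sketch of the conversions involved, using a 400 Gbit/s NIC and a DDR5‑4800 DIMM as hypothetical examples:

```python
# Convert between the units used for different bandwidth channels.
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Network rates are quoted in bit/s per direction; 8 bits = 1 byte."""
    return gbit_per_s / 8

def gtransfers_to_gbyte(gt_per_s: float, bytes_per_transfer: float) -> float:
    """Memory-style rates are transfers/s times bus width in bytes."""
    return gt_per_s * bytes_per_transfer

print(gbit_to_gbyte(400))          # 400 Gbit/s NIC per direction, in GB/s
print(gtransfers_to_gbyte(4.8, 8))  # DDR5-4800 over a 64-bit (8-byte) bus
```

A 400 Gbit/s NIC thus moves 50 GB/s per direction, and a DDR5‑4800 DIMM delivers 4.8 GT/s × 8 B = 38.4 GB/s, making the two directly comparable once normalized to bytes per second.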
Source: https://community.fs.com/cn/article/unveiling-the-foundations-of-gpu-computing1.html
Architects' Tech Alliance
Sharing project experience and insights into cutting‑edge architecture, with a focus on cloud computing, microservices, big data, hyper‑convergence, storage, data protection, artificial intelligence, and industry practices and solutions.