Why NVLink Beats PCIe for AI: Deep Dive into GPU Interconnect Technologies
This article examines the architectural differences between Scale‑Out and Scale‑Up networking, compares PCIe, NVLink, UALink, InfiniBand and RoCE, and explains why high‑bandwidth, low‑latency GPU interconnects like NVLink are essential for modern AI and HPC workloads.
Overview
In AI data centers, two networking paradigms coexist: Scale‑Out (inter‑node) and Scale‑Up (intra‑node). Scale‑Out relies on Ethernet or InfiniBand to connect thousands of servers, while Scale‑Up uses technologies such as PCIe, NVLink, CXL, and UALink to link GPUs and accelerators within a single server.
Scale‑Up Technologies
PCIe is a universal bus that connects CPUs, GPUs, NICs and storage. Its bandwidth depends on the generation (Gen) and lane count (xN). Even though PCIe 6.0 delivers roughly 128 GB/s per direction on an x16 link (about 256 GB/s bidirectional) and PCIe 7.0 is expected to double that, latency and contention on the shared bus become bottlenecks for large‑scale GPU training.
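As a rough rule of thumb, usable x16 bandwidth can be estimated from the per‑lane signaling rate and encoding overhead. The sketch below is illustrative only: the generation/rate figures are standard, but the efficiency factors are simplifying assumptions that ignore protocol overheads such as headers and flow control.

```c
#include <stdio.h>

/* Rough per-direction bandwidth estimate for a PCIe x16 link.
 * Assumptions (illustrative, not exact): Gen4/5 use 128b/130b encoding,
 * Gen6/7 use PAM4 + FLIT mode; real-world efficiency is lower than shown. */
typedef struct {
    const char *gen;
    double gts_per_lane;   /* transfer rate in GT/s per lane */
    double encoding_eff;   /* fraction of raw bits carrying data */
} pcie_gen_t;

int main(void) {
    const pcie_gen_t gens[] = {
        {"PCIe 4.0", 16.0,  128.0 / 130.0},
        {"PCIe 5.0", 32.0,  128.0 / 130.0},
        {"PCIe 6.0", 64.0,  0.98},   /* FLIT-mode efficiency, approximate */
        {"PCIe 7.0", 128.0, 0.98},
    };
    const int lanes = 16;

    for (size_t i = 0; i < sizeof(gens) / sizeof(gens[0]); i++) {
        /* GT/s * lanes / 8 bits-per-byte * encoding efficiency -> GB/s per direction */
        double gbps = gens[i].gts_per_lane * lanes / 8.0 * gens[i].encoding_eff;
        printf("%s x%d: ~%.0f GB/s per direction (~%.0f GB/s bidirectional)\n",
               gens[i].gen, lanes, gbps, 2.0 * gbps);
    }
    return 0;
}
```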
NVLink is NVIDIA’s proprietary high‑speed GPU‑to‑GPU link. Since its introduction in 2014, aggregate per‑GPU NVLink bandwidth has grown from 160 GB/s (NVLink 1.0) through 900 GB/s (NVLink 4.0 on H100) to 1.8 TB/s (NVLink 5.0), and NVSwitch extends the fabric so that hundreds of GPUs in a single domain can communicate at full NVLink speed. NVLink provides a unified memory address space, cache‑coherent communication, and low‑latency direct loads and stores between GPU memories.
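To see what this means in practice, the following sketch uses the standard CUDA runtime API to enable peer‑to‑peer access between two GPUs and copy a buffer directly between their memories; when the GPUs are connected by NVLink, the copy travels over that link instead of staging through host memory. The device IDs and buffer size are arbitrary, and error handling is minimal.

```c
#include <cuda_runtime.h>
#include <stdio.h>

#define CHECK(call) do { cudaError_t e = (call); \
    if (e != cudaSuccess) { printf("CUDA error: %s\n", cudaGetErrorString(e)); return 1; } } while (0)

int main(void) {
    const int dev0 = 0, dev1 = 1;          /* arbitrary device IDs */
    const size_t bytes = 256UL << 20;      /* 256 MiB test buffer */
    int can_access = 0;

    /* Check whether GPU 0 can address GPU 1's memory directly (NVLink or PCIe P2P). */
    CHECK(cudaDeviceCanAccessPeer(&can_access, dev0, dev1));
    if (!can_access) { printf("P2P not supported between devices %d and %d\n", dev0, dev1); return 1; }

    void *buf0, *buf1;
    CHECK(cudaSetDevice(dev0));
    CHECK(cudaDeviceEnablePeerAccess(dev1, 0));   /* map peer memory into dev0's address space */
    CHECK(cudaMalloc(&buf0, bytes));

    CHECK(cudaSetDevice(dev1));
    CHECK(cudaDeviceEnablePeerAccess(dev0, 0));
    CHECK(cudaMalloc(&buf1, bytes));

    /* Direct GPU-to-GPU copy; with NVLink present, data never touches host memory. */
    CHECK(cudaMemcpyPeer(buf1, dev1, buf0, dev0, bytes));
    CHECK(cudaDeviceSynchronize());

    printf("Copied %zu MiB from GPU %d to GPU %d peer-to-peer\n", bytes >> 20, dev0, dev1);
    CHECK(cudaFree(buf1));
    CHECK(cudaSetDevice(dev0));
    CHECK(cudaFree(buf0));
    return 0;
}
```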
CXL, which runs over the PCIe physical layer, and UALink (Ultra Accelerator Link) are emerging open standards for coherent, memory‑semantic sharing between CPUs and accelerators; UALink targets 200 GT/s per lane and up to 1,024 accelerators in a single fabric.
Scale‑Out Technologies
InfiniBand offers RDMA‑based, low‑latency, high‑bandwidth inter‑node networking; current NDR and XDR generations deliver 400–800 Gb/s per port with microsecond‑scale latencies across racks. It is widely used in HPC and large‑scale AI training clusters despite its higher cost.
RoCE (RDMA over Converged Ethernet) provides similar RDMA capabilities over standard Ethernet hardware, reducing cost while maintaining low latency for multi‑node GPU communication.
GPUDirect (including GPUDirect‑RDMA) allows third‑party PCIe devices such as InfiniBand adapters to access GPU memory directly, bypassing the CPU and improving data‑transfer efficiency.
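A minimal sketch of how this looks from application code, assuming a CUDA‑capable GPU, an RDMA‑capable NIC, and the GPUDirect RDMA (nvidia-peermem) kernel module: the key step is passing a device pointer obtained from cudaMalloc straight to ibv_reg_mr, so the NIC can DMA to and from GPU memory without a host bounce buffer. Device selection and all connection setup (queue pairs, address exchange, work requests) are omitted.

```c
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void) {
    /* Open the first RDMA device found (assumes one InfiniBand/RoCE NIC is present). */
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) { printf("no RDMA devices found\n"); return 1; }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Allocate a buffer in GPU memory. */
    void *gpu_buf = NULL;
    size_t bytes = 64UL << 20;            /* 64 MiB */
    cudaMalloc(&gpu_buf, bytes);

    /* With GPUDirect RDMA, the GPU pointer can be registered directly with the NIC;
     * subsequent RDMA reads/writes then move data NIC<->GPU, bypassing host memory. */
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, bytes,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { printf("ibv_reg_mr on GPU memory failed (is GPUDirect RDMA enabled?)\n"); return 1; }
    printf("registered %zu MiB of GPU memory: lkey=0x%x rkey=0x%x\n",
           bytes >> 20, mr->lkey, mr->rkey);

    /* ... create queue pairs, exchange rkeys, and post RDMA work requests here ... */

    ibv_dereg_mr(mr);
    cudaFree(gpu_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```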
Performance Comparison
A PCIe 5.0 x16 link delivers roughly 128 GB/s of bidirectional bandwidth, whereas NVLink 4.0 provides 900 GB/s of aggregate per‑GPU bandwidth, roughly seven times more. InfiniBand and RoCE fill the gap for inter‑node traffic, while NVLink dominates bandwidth‑critical intra‑node workloads.
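To make the ratio concrete, a back‑of‑the‑envelope calculation (assuming ideal, fully saturated links and a hypothetical 80 GB payload, roughly a 40B‑parameter model in FP16) compares how long a single bulk transfer would take over each fabric:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical payload: ~40B parameters in FP16, about 80 GB of tensor data. */
    const double payload_gb = 80.0;

    /* Idealized peak bandwidths (GB/s); real transfers achieve only a fraction of these. */
    const struct { const char *link; double gbps; } links[] = {
        {"PCIe 5.0 x16 (bidirectional)",  128.0},
        {"NVLink 4.0 (per-GPU aggregate)", 900.0},
        {"NVLink 5.0 (per-GPU aggregate)", 1800.0},
    };

    for (int i = 0; i < 3; i++)
        printf("%-32s %6.3f s to move %.0f GB\n",
               links[i].link, payload_gb / links[i].gbps, payload_gb);
    return 0;
}
```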
Emerging Challengers
UALink: an open industry‑standard accelerator interconnect backed by AMD, AWS, and others, targeting 200 GT/s per lane and supporting up to 1,024 accelerators.
UB‑Mesh (Huawei): a unified mesh fabric that promises 10 Tbps per node and sub‑microsecond latency, aiming to replace multiple proprietary protocols with a single open standard.
SUE (Broadcom): a Scale‑Up Ethernet specification that leverages standard 200/400/800 GbE PHYs, offering AXI‑compatible interfaces, FEC, and low‑latency flow control while remaining compatible with the existing Ethernet ecosystem.
Conclusion
For AI and HPC workloads that require massive intra‑node data movement, NVLink (or equivalent cache‑coherent, high‑speed fabrics such as UALink or UB‑Mesh) provides the necessary bandwidth and latency advantages over traditional PCIe. For inter‑node communication, InfiniBand and RoCE remain the preferred choices, with GPUDirect RDMA bridging the gap between the two domains.