Why 8‑GPU Servers Are Essential for LLM Training and Which Interconnect Wins

Modern large-language-model workloads demand massive parallelism, and 8-GPU servers have become the norm. This article explains the division of labor between CPU and GPU in such a server and compares the GPU-to-GPU interconnect options (PCIe direct, PCIe switch, NVLink, and NVSwitch), detailing their architectures, bandwidths, topologies, and trade-offs for AI training.


8‑GPU Server Overview

As model complexity grows, a single GPU cannot handle training tasks, making multi‑GPU servers—especially 8‑GPU configurations—the new norm in the LLM era.

In an 8‑GPU server, the CPU handles system management, task scheduling, and logical operations, while GPUs perform large‑scale parallel computation.
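To make that division of labor concrete, the sketch below shows a minimal data-parallel training loop in PyTorch. It is illustrative only: the model and sizes are made up, and it assumes one process per GPU launched with torchrun. The CPU-side processes handle setup and scheduling, the GPUs do the arithmetic, and the gradient all-reduce traffic is exactly what the interconnects discussed below carry.

```python
# Minimal 8-GPU data-parallel sketch (illustrative; launch with:
#   torchrun --nproc_per_node=8 train.py)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # CPU side: process-group setup, scheduling, bookkeeping.
    dist.init_process_group(backend="nccl")  # NCCL rides NVLink/PCIe underneath
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # GPU side: the large-scale parallel computation (toy model here).
    model = DDP(torch.nn.Linear(4096, 4096).cuda(local_rank),
                device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10):
        x = torch.randn(32, 4096, device=local_rank)
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()  # gradient all-reduce crosses the GPU interconnect
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```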

GPU Interconnect Technology Classification

During deep-learning training, the speed of data transfer between GPUs can become the bottleneck. From a network perspective, interconnects divide into intra-node GPU-to-GPU links and inter-node, server-to-server links. This article focuses on the former, which falls into four categories:

PCIe Direct

PCIe Switch

NVLink

NVSwitch Full Mesh

PCIe Direct

PCIe direct means each GPU is connected directly to the CPU without a PCIe switch, but this approach suffers from limited total PCIe lanes.

Example: Gooxi AMD Milan 4U 8‑GPU AI server:

Two third‑generation AMD CPUs, each providing 128 PCIe lanes (total 256 lanes; Intel third‑gen CPUs provide only 64 lanes each).

The CPUs are linked by three xGMI connections, consuming 32×3 lanes, leaving 160 lanes for NICs, GPUs, etc.

Each GPU uses 16 lanes; eight GPUs consume 128 lanes, leaving 32 lanes.

The remaining 32 lanes can be allocated to other NICs or RAID cards.
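The lane budget above is simple arithmetic; here is a sketch for checking it, with the numbers taken directly from the list:

```python
# PCIe lane budget for the dual-socket AMD Milan example above.
lanes_per_cpu, cpus = 128, 2
xgmi_links, lanes_per_xgmi = 3, 32      # inter-CPU links also consume lanes

total = lanes_per_cpu * cpus                        # 256
usable = total - xgmi_links * lanes_per_xgmi        # 256 - 96 = 160

gpus, lanes_per_gpu = 8, 16
remaining = usable - gpus * lanes_per_gpu           # 160 - 128 = 32

print(f"after xGMI: {usable} lanes; after GPUs: {remaining} for NICs/RAID")
```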

PCIe Switch Interconnect

To alleviate lane scarcity, server designs add PCIe switch chips that fan a limited number of CPU lanes out to many more device lanes. In a PCIe switch interconnect, multiple GPUs reach the CPU through the switch, at the cost of some added latency.

A PCIe switch occupies only 16 CPU lanes yet can fan out to five x16 downstream slots, enabling a variety of topologies (balanced, common, cascade).

Although PCIe 5.x x16 offers up to 126 GB/s of bidirectional bandwidth, that still falls short of GPU-to-GPU demands, and the switch-based topology can introduce multi-hop latency, which makes it better suited to inference or cost-sensitive cloud scenarios.
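Where that 126 GB/s figure comes from: usable x16 bandwidth is transfer rate × lanes × encoding efficiency per direction, doubled for full duplex. A quick sketch:

```python
# Effective PCIe x16 bandwidth by generation.
# Gens 1-2 use 8b/10b encoding; gens 3+ use 128b/130b.
RATES = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0, 5: 32.0}  # GT/s per lane

def x16_per_direction(gen: int, lanes: int = 16) -> float:
    eff = 0.8 if gen <= 2 else 128 / 130
    return RATES[gen] * lanes * eff / 8   # GB/s, one direction

for gen in RATES:
    one_way = x16_per_direction(gen)
    print(f"PCIe {gen}.0 x16: {one_way:.1f} GB/s/dir, "
          f"{2 * one_way:.0f} GB/s bidirectional")
# PCIe 5.0 x16 -> 63.0 GB/s per direction, ~126 GB/s bidirectional
```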

NVLink Interconnect

NVIDIA announced NVLink in 2014 to replace PCIe on the GPU-to-GPU path with high-bandwidth, low-latency connections.

NVLink creates point‑to‑point links; each link is a full‑duplex channel. A single NVLink can connect two GPUs, and each GPU may have multiple NVLink ports.

With fourth-generation NVLink on the Hopper architecture, each GPU exposes 18 NVLink ports at 50 GB/s bidirectional each, for an aggregate of 900 GB/s per GPU, roughly seven times the bandwidth of PCIe 5.x x16.
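The arithmetic behind that claim, reusing the PCIe figure computed earlier:

```python
# Hopper NVLink 4 aggregate vs. PCIe 5.0 x16 (both bidirectional, GB/s).
links, per_link = 18, 50
nvlink_total = links * per_link    # 900 GB/s per GPU
pcie5_x16 = 126                    # from the PCIe sketch above
print(f"{nvlink_total} vs {pcie5_x16} GB/s -> {nvlink_total / pcie5_x16:.1f}x")
# ~7.1x, the "seven times PCIe" figure
```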

NVLink 1.0 and the DGX‑1 System

In 2016, NVLink 1.0 shipped alongside the P100 GPU and the DGX-1 8-GPU server.

Each P100 provides four NVLink ports, each delivering 40 GB/s bidirectional (total 160 GB/s per GPU), five times the bandwidth of PCIe 3.x x16.

The eight GPUs are split into two groups of four, forming a Cube‑Mesh topology.
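The original figure of this wiring is not reproduced here, but one commonly depicted cube-mesh layout (an assumption for illustration) fully meshes each quad and gives every GPU one cross link to its counterpart in the other quad, which exactly consumes the P100's four NVLink ports:

```python
# A plausible DGX-1 cube-mesh wiring (illustrative, not from the original figure).
from itertools import combinations

links = set()
for quad in ([0, 1, 2, 3], [4, 5, 6, 7]):
    links |= set(combinations(quad, 2))      # full mesh inside each quad
links |= {(g, g + 4) for g in range(4)}      # one cross link per GPU

degree = {g: sum(g in link for link in links) for g in range(8)}
print(degree)  # every GPU uses exactly 4 NVLink ports, matching the P100
```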

Because the required PCIe lanes exceed what two Intel Xeon CPUs can supply, each group of GPUs connects through a PCIe switch to its CPU, yielding the classic DGX-1 topology.

The DGX‑1 integrates all NVLink connections on the motherboard, providing a clean wiring scheme.

NVLink 2.0 and the DGX‑1 System

Released in 2017 with the V100 GPU, NVLink 2.0 raises per-link bandwidth from 40 GB/s to 50 GB/s (400 Gb/s) and increases the link count to six per GPU, nearly doubling per-GPU bandwidth to 300 GB/s.

Applying NVLink 2.0 to DGX‑1 retains the Cube‑Mesh topology, with the additional links further increasing inter‑GPU bandwidth.

NVSwitch Full Mesh

NVSwitch 1.0 and the DGX‑2 System

NVLink’s Cube‑Mesh cannot achieve full GPU‑to‑GPU connectivity. NVIDIA’s 2018 DGX‑2 introduced NVSwitch 1.0, a high‑speed ASIC switch that acts as a hub for multiple NVLink connections, enabling full mesh.

The DGX-2 features 16 V100 GPUs with 32 GB of HBM2 each and dual 2.7 GHz 24-core Xeon CPUs. Each NVSwitch provides 18 ports at 50 GB/s bidirectional apiece, for 900 GB/s of aggregate switch bandwidth. On each baseboard, eight ports face the local GPUs, eight link to the other baseboard, and two go unused, so six NVSwitch chips achieve a full mesh among the board's eight GPUs.
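A port-budget check of that arrangement (the 8-local/8-cross/2-spare split is the commonly described layout, assumed here):

```python
# NVSwitch 1.0 port budget on one DGX-2 baseboard (8 GPUs, 6 switches).
ports_per_switch, port_bw = 18, 50      # 50 GB/s bidirectional per port
gpus, links_per_gpu, switches = 8, 6, 6

# Each GPU spreads its 6 NVLinks one per switch, so every switch sees all 8 GPUs.
gpu_facing = gpus * links_per_gpu // switches         # 8 ports per switch
cross_board = gpu_facing                              # mirrored to the other board
spare = ports_per_switch - gpu_facing - cross_board   # 2 ports left over

print(f"per switch: {gpu_facing} GPU + {cross_board} cross-board, {spare} spare")
print(f"per-GPU NVLink bandwidth: {links_per_gpu * port_bw} GB/s")  # 300 GB/s
```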

In the full DGX-2, the two GPU baseboards (16 GPUs in total) are bridged through these NVSwitch-to-NVSwitch links, extending the full mesh across all 16 GPUs.

NVSwitch delivers 50 GB/s per link and 300 GB/s per GPU, eliminating PCIe and Cube‑Mesh limitations.

NVSwitch’s full‑mesh also resolves inconsistencies in remote access latency seen in NVLink Cube‑Mesh.

Note: NVSwitch remains a proprietary NVIDIA solution; other vendors lack this capability.

NVLink 3.0, NVSwitch 2.0 and DGX A100

In 2020, NVIDIA released NVLink 3.0, NVSwitch 2.0, and the A100 GPU.

Each A100 has 12 links, each 50 GB/s bidirectional, giving a per‑GPU bandwidth of 600 GB/s.

NVSwitch 2.0 expands to 36 ports, each 50 GB/s.

The DGX A100 comprises eight A100 GPUs, two CPUs, four PCIe Gen4 switches, six NVSwitch chips, eight NVMe drives, two 200 G NICs (PCIe-attached, for storage and SSH), and eight 200 G GPU-dedicated NICs (behind the PCIe switches, for GPU data transfer).

NVLink 4.0, NVSwitch 3.0 and DGX H100

In 2022, NVIDIA introduced NVLink 4.0, NVSwitch 3.0, and the H100 GPU.

Each H100 provides 18 links, each 50 GB/s bidirectional, for a total of 900 GB/s per GPU.

NVSwitch 3.0 increases ports to 64, each 50 GB/s.

The DGX H100 system includes eight H100 GPUs and four NVSwitch chips, forming a full‑mesh.

NVSwitch 3.0 is offered as a standalone switch with multiple 800 G OSFP optical modules, enabling cross‑node GPU connections.

NVSwitch vs. PCIe Switch

The original article closed with figures comparing the two approaches: their topologies, the performance iterations of NVLink, and the performance iterations of PCIe. The data points quoted throughout this article are collected below.
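A compact recap of those iterations, using only figures already quoted above (the PCIe 4.x entry is derived from the encoding math in the earlier sketch):

```python
# NVLink and PCIe iterations, from figures quoted in the sections above.
nvlink = [
    # (generation, flagship GPU, links/GPU, GB/s per link, GB/s per GPU)
    ("NVLink 1.0", "P100",  4, 40, 160),
    ("NVLink 2.0", "V100",  6, 50, 300),
    ("NVLink 3.0", "A100", 12, 50, 600),
    ("NVLink 4.0", "H100", 18, 50, 900),
]
pcie = [
    # (generation, x16 bidirectional GB/s)
    ("PCIe 3.x x16", 32),
    ("PCIe 4.x x16", 63),
    ("PCIe 5.x x16", 126),
]

for gen, gpu, n, per_link, total in nvlink:
    assert n * per_link == total
    print(f"{gen} ({gpu}): {n} links x {per_link} GB/s = {total} GB/s per GPU")
for gen, bw in pcie:
    print(f"{gen}: {bw} GB/s bidirectional")
```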

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.