Inside GPU Cloud Servers: Architecture, Interconnects, and Performance Secrets
This article provides a comprehensive technical overview of GPU cloud server design, covering data‑processing pipelines, hardware topology, NUMA considerations, PCIe and proprietary interconnects, multi‑GPU communication strategies, virtualization approaches (BCC and BBC), DPU acceleration, and future trends for scaling up and out.
GPU Data Processing Pipeline
The article begins by outlining the six steps of GPU data handling: (1) data is read from the network or storage into system memory, (2) the CPU pre-processes it and writes it back to memory, (3) a Host-to-Device (H2D) transfer copies it into GPU memory, (4) the GPU reads from its own memory for computation (which may involve intra-GPU and inter-GPU communication), (5) intermediate results move GPU-to-GPU within a node or across nodes, and (6) a Device-to-Host (D2H) transfer copies results back to system memory.
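As a concrete illustration of steps 2, 3, 4, and 6, here is a minimal sketch using PyTorch on a CUDA-capable host; the normalization and matmul are stand-ins for real pre-processing and kernels, and nothing in it is prescribed by the article. Step 5 (GPU-to-GPU transfer) is covered by the multi-GPU sections below.

```python
# Minimal H2D -> compute -> D2H sketch with PyTorch (assumes a CUDA GPU is present).
import torch

def process_batch(batch_cpu: torch.Tensor) -> torch.Tensor:
    # Step 2: CPU pre-processing (normalization as a stand-in example).
    batch_cpu = (batch_cpu - batch_cpu.mean()) / (batch_cpu.std() + 1e-6)

    # Step 3: Host-to-Device (H2D) copy; pinned memory lets the copy overlap with compute.
    batch_gpu = batch_cpu.pin_memory().to("cuda", non_blocking=True)

    # Step 4: GPU reads its own memory and computes (a matmul as a stand-in kernel).
    result_gpu = batch_gpu @ batch_gpu.T

    # Step 6: Device-to-Host (D2H) copy of the result back to system memory.
    return result_gpu.cpu()

if __name__ == "__main__":
    out = process_batch(torch.randn(1024, 1024))
    print(out.shape)
```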
GPU Cloud Service Design Layers
The design is divided into four hierarchical layers:
Bottom layer – hardware fundamentals: hardware selection, topology (NUMA, PCIe), GPU interconnect technologies, and virtualization choices (bare‑metal vs. VM).
Multi‑GPU communication layer: shared memory, PCIe P2P, and proprietary buses (NVLink, HCCS, Infinity Fabric).
Collective communication libraries: CCL/NCCL detect available paths and select optimal algorithms (a minimal usage sketch follows this list).
AI frameworks: rely on the underlying communication performance.
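To make the top two layers concrete, below is a minimal sketch of an AI framework handing a collective to NCCL through torch.distributed; the script and the torchrun launch command are illustrative assumptions, not something the article prescribes.

```python
# Minimal NCCL all-reduce via torch.distributed.
# Launch with: torchrun --nproc_per_node=<num_gpus> this_file.py
import os
import torch
import torch.distributed as dist

def main():
    # NCCL picks the transport (shared memory, P2P, NVLink, or RDMA) from the detected topology.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; all-reduce sums them across all GPUs.
    x = torch.ones(1024, device="cuda") * (dist.get_rank() + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    print(f"rank {dist.get_rank()}: sum element = {x[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```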
Fundamental GPU Cloud Server Technologies
Hardware selection includes CPU, memory (e.g., DDR5 up to 1200 GB/s), GPU (compute capability, memory size, interconnect), network (north‑south and east‑west fabrics, 100‑200 Gbps for A100/A800), and storage (high‑performance SSD or distributed file systems).
Topology covers NUMA domains, PCIe layout, and the virtual topology presented to VMs; a quick way to inspect the physical layout is sketched after this list.
GPU interconnect varies by deployment: single‑node PCIe or proprietary buses for multi‑GPU, and multi‑node RDMA (Infiniband or RoCE) for scale‑out.
Virtualization can be traditional QEMU/KVM (GPU passthrough with VFIO) or DPU‑based designs that offload networking and storage, providing near‑bare‑metal performance.
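One practical way to inspect the physical topology described above is NVIDIA's connectivity matrix; the small wrapper below is an illustrative sketch that assumes a Linux host with the NVIDIA driver installed and simply shells out to nvidia-smi.

```python
# Print the GPU/GPU and GPU/NIC connectivity matrix using nvidia-smi.
import subprocess

def show_topology() -> str:
    # "nvidia-smi topo -m" reports how each device pair is connected:
    # NV# = NVLink, PIX = same PCIe switch, PHB = same root complex, SYS = across NUMA nodes.
    return subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        capture_output=True, text=True, check=True,
    ).stdout

if __name__ == "__main__":
    print(show_topology())
```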
NUMA Considerations
Exposing the NUMA topology and binding each GPU's process and memory to the same NUMA node as the GPU reduces latency and improves bandwidth, especially on AMD Milan platforms, where crossing NUMA domains roughly halves PCIe bandwidth.
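A minimal sketch of how a launcher might discover a GPU's NUMA node on Linux via sysfs so it can pin the process accordingly; the PCI address below is a placeholder, not taken from the article.

```python
# Read a GPU's NUMA node from sysfs so the process can be pinned to the matching CPUs and memory
# (e.g. with "numactl --cpunodebind=<node> --membind=<node>"). The PCI address is illustrative.
from pathlib import Path

def gpu_numa_node(pci_addr: str = "0000:3b:00.0") -> int:
    # -1 means the platform did not expose a NUMA affinity for this device.
    return int(Path(f"/sys/bus/pci/devices/{pci_addr}/numa_node").read_text().strip())

if __name__ == "__main__":
    node = gpu_numa_node()
    print(f"GPU sits on NUMA node {node}; bind CPU and memory to the same node.")
```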
PCIe and Proprietary Buses
PCIe 6.0 offers up to 64 GT/s (256 GB/s bidirectional), but GPU demands often exceed this, prompting the use of NVLink (up to 900 GB/s in H100) and Huawei HCCS (56 GB/s per link). The article explains scale‑up (adding GPUs within a node) versus scale‑out (adding nodes) strategies.
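The headline figures can be sanity-checked with back-of-envelope arithmetic; the snippet below is an illustrative calculation (encoding overhead ignored), not a benchmark.

```python
# Back-of-envelope PCIe 6.0 x16 bandwidth: 64 GT/s per lane, 1 bit per lane per transfer,
# 16 lanes, both directions.
GT_PER_LANE = 64            # giga-transfers per second per lane
BYTES_PER_TRANSFER = 1 / 8  # 64 GT/s is roughly 8 GB/s per lane per direction
LANES = 16

per_direction = GT_PER_LANE * BYTES_PER_TRANSFER * LANES  # ~128 GB/s
bidirectional = 2 * per_direction                          # ~256 GB/s

print(f"PCIe 6.0 x16: ~{per_direction:.0f} GB/s per direction, ~{bidirectional:.0f} GB/s bidirectional")
# For comparison, an H100 exposes 18 NVLink links at 50 GB/s each: ~900 GB/s aggregate.
```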
Multi‑GPU Communication
Three single‑node approaches are described:
Shared memory (10‑20 GB/s bidirectional bandwidth).
PCIe P2P (40‑50 GB/s).
Proprietary buses (NVLink, HCCS, Infinity Fabric) delivering 400‑900 GB/s.
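For the P2P path, a quick way to see whether two GPUs in a node can address each other's memory directly (over PCIe P2P or NVLink) is PyTorch's peer-access query; the sketch below assumes at least two visible GPUs and is not taken from the article.

```python
# Check whether GPU pairs can address each other's memory directly (PCIe P2P or NVLink).
import torch

def p2p_matrix() -> None:
    n = torch.cuda.device_count()
    for src in range(n):
        for dst in range(n):
            if src != dst:
                ok = torch.cuda.can_device_access_peer(src, dst)
                status = "direct peer access" if ok else "staged through host memory"
                print(f"GPU{src} -> GPU{dst}: {status}")

if __name__ == "__main__":
    p2p_matrix()
```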
For multi‑node scenarios, RDMA‑based GDR (GPU Direct RDMA) bypasses system memory, achieving 3‑4× higher bandwidth than traditional paths.
Optimizing GDR in Virtual Machines
Enable ATS (Address Translation Services) on RDMA NICs so the NIC can cache IOMMU address translations.
Configure ACS (Access Control Services) on PCIe switches, specifically its Direct Translated P2P capability used together with ATS, so GPU‑to‑NIC traffic is routed directly through the switch instead of detouring through the root complex.
When NCCL fails to detect GDR in VMs, two work‑arounds are offered: adjusting the VM PCIe topology to match physical layout, or providing a fabricated topology file via environment variables to trick NCCL into using GDR.
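A hedged sketch of the environment-variable workaround: NCCL_TOPO_FILE and NCCL_DEBUG are real NCCL variables, but the file path is illustrative and the topology XML itself must be hand-written to mirror the physical GPU/NIC/PCIe-switch layout; the NCCL_NET_GDR_LEVEL line is an optional extra rather than one of the two workarounds named above.

```python
# Workaround sketch: point NCCL at a hand-written topology file inside the VM and
# enable verbose logging to confirm GDR is actually selected.
import os

os.environ["NCCL_TOPO_FILE"] = "/etc/nccl/virtual_topo.xml"  # fabricated topology presented to NCCL (path illustrative)
os.environ["NCCL_DEBUG"] = "INFO"                            # transport choices appear in the logs
os.environ["NCCL_NET_GDR_LEVEL"] = "PHB"                     # allow GDR for GPU/NIC pairs up to a shared root complex

# ...launch the training job as usual; NCCL reads these variables at init time.
```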
BCC (Virtualized) GPU Cloud Servers
BCC instances are based on QEMU/KVM with VFIO passthrough and support 1‑8 GPUs, NUMA awareness, P2P, shared NVSwitch, and RDMA. The main limitation is that the virtual PCIe topology seen by the guest can differ from the physical layout, which can degrade communication performance.
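One way to verify from the host that a GPU has actually been detached for VFIO passthrough is to check which kernel driver owns its PCIe function; the sketch below assumes a Linux host and uses a placeholder PCI address.

```python
# Check which kernel driver owns a GPU's PCIe function; "vfio-pci" means it is
# detached from the host and available for passthrough. The PCI address is illustrative.
import os

def bound_driver(pci_addr: str = "0000:3b:00.0") -> str:
    link = f"/sys/bus/pci/devices/{pci_addr}/driver"
    return os.path.basename(os.path.realpath(link)) if os.path.exists(link) else "none"

if __name__ == "__main__":
    drv = bound_driver()
    state = "ready for passthrough" if drv == "vfio-pci" else "owned by the host"
    print(f"driver: {drv} ({state})")
```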
BBC (Bare‑Metal) GPU Cloud Servers
BBC instances eliminate the virtualization layer to offer zero‑overhead compute and use a DPU to offload networking and storage, delivering full hardware performance to the tenant.
DPU Capabilities
Network offload (e.g., VXLAN encapsulation, RDMA) frees host CPU cycles and reduces latency.
Storage acceleration via NVMe‑oF and SNAP.
Representative BBC Models
A800: 8 × A800 SXM4 80 GB, 400 GB/s NVLink, 8 × CX6 100 Gbps RoCEv2 NICs.
L20: PCIe P2P enabled, paired with 400 Gbps RoCE NICs.
H20: 96 GB (or 141 GB) Hopper‑architecture GPUs with NVLink/NVSwitch, suitable for large‑model training and inference.
Future Directions
The article predicts a convergence of scale‑up and scale‑out techniques, with high‑density nodes (e.g., NVL72) using NVLink for inter‑node links, and a growing distinction between inference‑optimized and training‑optimized GPUs. ASIC inference chips are expected to improve cost‑efficiency.