Inside GPU Cloud Servers: Architecture, Interconnects, and Performance Secrets
This article provides a comprehensive technical overview of GPU cloud server design, covering data‑processing pipelines, hardware topology, NUMA considerations, PCIe and proprietary interconnects, multi‑GPU communication strategies, virtualization approaches (BCC and BBC), DPU acceleration, and future trends for scaling up and out.
GPU Data Processing Pipeline
The article begins by outlining the six steps of GPU data handling: (1) data is read from the network or storage into system memory, (2) the CPU pre-processes it and writes it back to memory, (3) a Host-to-Device (H2D) transfer copies it into GPU memory, (4) the GPU reads from its own memory for computation (which may involve intra-GPU and inter-GPU communication), (5) intermediate results move GPU-to-GPU within a node or across nodes, and (6) a Device-to-Host (D2H) transfer copies results back to system memory.
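As a concrete illustration of steps 2, 3, 4, and 6, here is a minimal sketch using PyTorch on a CUDA-capable host; the normalization and matmul are stand-ins for real pre-processing and kernels, and nothing in it is prescribed by the article. Step 5 (GPU-to-GPU transfer) is covered by the multi-GPU sections below.

```python
# Minimal H2D -> compute -> D2H sketch with PyTorch (assumes a CUDA GPU is present).
import torch

def process_batch(batch_cpu: torch.Tensor) -> torch.Tensor:
    # Step 2: CPU pre-processing (normalization as a stand-in example).
    batch_cpu = (batch_cpu - batch_cpu.mean()) / (batch_cpu.std() + 1e-6)

    # Step 3: Host-to-Device (H2D) copy; pinned memory lets the copy overlap with compute.
    batch_gpu = batch_cpu.pin_memory().to("cuda", non_blocking=True)

    # Step 4: GPU reads its own memory and computes (a matmul as a stand-in kernel).
    result_gpu = batch_gpu @ batch_gpu.T

    # Step 6: Device-to-Host (D2H) copy of the result back to system memory.
    return result_gpu.cpu()

if __name__ == "__main__":
    out = process_batch(torch.randn(1024, 1024))
    print(out.shape)
```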
GPU Cloud Service Design Layers
The design is divided into four hierarchical layers:
Bottom layer – hardware fundamentals: hardware selection, topology (NUMA, PCIe), GPU interconnect technologies, and virtualization choices (bare‑metal vs. VM).
Multi‑GPU communication layer: shared memory, PCIe P2P, and proprietary buses (NVLink, HCCS, Infinity Fabric).
Collective communication libraries: CCL/NCCL detect available paths and select optimal algorithms (a minimal usage sketch follows this list).
AI frameworks: rely on the underlying communication performance.
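To make the top two layers concrete, below is a minimal sketch of an AI framework handing a collective to NCCL through torch.distributed; the script and the torchrun launch command are illustrative assumptions, not something the article prescribes.

```python
# Minimal NCCL all-reduce via torch.distributed.
# Launch with: torchrun --nproc_per_node=<num_gpus> this_file.py
import os
import torch
import torch.distributed as dist

def main():
    # NCCL picks the transport (shared memory, P2P, NVLink, or RDMA) from the detected topology.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; all-reduce sums them across all GPUs.
    x = torch.ones(1024, device="cuda") * (dist.get_rank() + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    print(f"rank {dist.get_rank()}: sum element = {x[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```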
Fundamental GPU Cloud Server Technologies
Hardware selection includes CPU, memory (e.g., DDR5 up to 1200 GB/s), GPU (compute capability, memory size, interconnect), network (north‑south and east‑west fabrics, 100‑200 Gbps for A100/A800), and storage (high‑performance SSD or distributed file systems).
Topology covers NUMA domains, PCIe layout, and the virtual topology presented to VMs; a quick way to inspect the physical layout is sketched after this list.
GPU interconnect varies by deployment: single‑node PCIe or proprietary buses for multi‑GPU, and multi‑node RDMA (Infiniband or RoCE) for scale‑out.
Virtualization can be traditional QEMU/KVM (GPU passthrough with VFIO) or DPU‑based designs that offload networking and storage, providing near‑bare‑metal performance.
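One practical way to inspect the physical topology described above is NVIDIA's connectivity matrix; the small wrapper below is an illustrative sketch that assumes a Linux host with the NVIDIA driver installed and simply shells out to nvidia-smi.

```python
# Print the GPU/GPU and GPU/NIC connectivity matrix using nvidia-smi.
import subprocess

def show_topology() -> str:
    # "nvidia-smi topo -m" reports how each device pair is connected:
    # NV# = NVLink, PIX = same PCIe switch, PHB = same root complex, SYS = across NUMA nodes.
    return subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        capture_output=True, text=True, check=True,
    ).stdout

if __name__ == "__main__":
    print(show_topology())
```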
NUMA Considerations
Exposing the NUMA topology and binding each GPU's process and memory to the same NUMA node as the GPU reduces latency and improves bandwidth, especially on AMD Milan platforms, where crossing NUMA domains roughly halves PCIe bandwidth.
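A minimal sketch of how a launcher might discover a GPU's NUMA node on Linux via sysfs so it can pin the process accordingly; the PCI address below is a placeholder, not taken from the article.

```python
# Read a GPU's NUMA node from sysfs so the process can be pinned to the matching CPUs and memory
# (e.g. with "numactl --cpunodebind=<node> --membind=<node>"). The PCI address is illustrative.
from pathlib import Path

def gpu_numa_node(pci_addr: str = "0000:3b:00.0") -> int:
    # -1 means the platform did not expose a NUMA affinity for this device.
    return int(Path(f"/sys/bus/pci/devices/{pci_addr}/numa_node").read_text().strip())

if __name__ == "__main__":
    node = gpu_numa_node()
    print(f"GPU sits on NUMA node {node}; bind CPU and memory to the same node.")
```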
PCIe and Proprietary Buses
PCIe 6.0 offers up to 64 GT/s (256 GB/s bidirectional), but GPU demands often exceed this, prompting the use of NVLink (up to 900 GB/s in H100) and Huawei HCCS (56 GB/s per link). The article explains scale‑up (adding GPUs within a node) versus scale‑out (adding nodes) strategies.
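The headline figures can be sanity-checked with back-of-envelope arithmetic; the snippet below is an illustrative calculation (encoding overhead ignored), not a benchmark.

```python
# Back-of-envelope PCIe 6.0 x16 bandwidth: 64 GT/s per lane, 1 bit per lane per transfer,
# 16 lanes, both directions.
GT_PER_LANE = 64            # giga-transfers per second per lane
BYTES_PER_TRANSFER = 1 / 8  # 64 GT/s is roughly 8 GB/s per lane per direction
LANES = 16

per_direction = GT_PER_LANE * BYTES_PER_TRANSFER * LANES  # ~128 GB/s
bidirectional = 2 * per_direction                          # ~256 GB/s

print(f"PCIe 6.0 x16: ~{per_direction:.0f} GB/s per direction, ~{bidirectional:.0f} GB/s bidirectional")
# For comparison, an H100 exposes 18 NVLink links at 50 GB/s each: ~900 GB/s aggregate.
```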
Multi‑GPU Communication
Three single‑node approaches are described:
Shared memory (10‑20 GB/s bidirectional bandwidth).
PCIe P2P (40‑50 GB/s).
Proprietary buses (NVLink, HCCS, Infinity Fabric) delivering 400‑900 GB/s.
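For the P2P path, a quick way to see whether two GPUs in a node can address each other's memory directly (over PCIe P2P or NVLink) is PyTorch's peer-access query; the sketch below assumes at least two visible GPUs and is not taken from the article.

```python
# Check whether GPU pairs can address each other's memory directly (PCIe P2P or NVLink).
import torch

def p2p_matrix() -> None:
    n = torch.cuda.device_count()
    for src in range(n):
        for dst in range(n):
            if src != dst:
                ok = torch.cuda.can_device_access_peer(src, dst)
                status = "direct peer access" if ok else "staged through host memory"
                print(f"GPU{src} -> GPU{dst}: {status}")

if __name__ == "__main__":
    p2p_matrix()
```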
For multi‑node scenarios, RDMA‑based GDR (GPU Direct RDMA) bypasses system memory, achieving 3‑4× higher bandwidth than traditional paths.
Optimizing GDR in Virtual Machines
Enable ATS (Address Translation Services) on RDMA NICs so the NIC can cache IOMMU address translations.
Configure ACS (Access Control Services) on PCIe switches, specifically its Direct Translated P2P capability used together with ATS, so GPU‑to‑NIC traffic is routed directly through the switch instead of detouring through the root complex.
When NCCL fails to detect GDR in VMs, two work‑arounds are offered: adjusting the VM PCIe topology to match physical layout, or providing a fabricated topology file via environment variables to trick NCCL into using GDR.
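A hedged sketch of the environment-variable workaround: NCCL_TOPO_FILE and NCCL_DEBUG are real NCCL variables, but the file path is illustrative and the topology XML itself must be hand-written to mirror the physical GPU/NIC/PCIe-switch layout; the NCCL_NET_GDR_LEVEL line is an optional extra rather than one of the two workarounds named above.

```python
# Workaround sketch: point NCCL at a hand-written topology file inside the VM and
# enable verbose logging to confirm GDR is actually selected.
import os

os.environ["NCCL_TOPO_FILE"] = "/etc/nccl/virtual_topo.xml"  # fabricated topology presented to NCCL (path illustrative)
os.environ["NCCL_DEBUG"] = "INFO"                            # transport choices appear in the logs
os.environ["NCCL_NET_GDR_LEVEL"] = "PHB"                     # allow GDR for GPU/NIC pairs up to a shared root complex

# ...launch the training job as usual; NCCL reads these variables at init time.
```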
BCC (Virtualized) GPU Cloud Servers
BCC instances are based on QEMU/KVM with VFIO passthrough and support 1‑8 GPUs, NUMA awareness, P2P, shared NVSwitch, and RDMA. The main limitation is that the virtual PCIe topology seen by the guest can differ from the physical layout, which can degrade communication performance.
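One way to verify from the host that a GPU has actually been detached for VFIO passthrough is to check which kernel driver owns its PCIe function; the sketch below assumes a Linux host and uses a placeholder PCI address.

```python
# Check which kernel driver owns a GPU's PCIe function; "vfio-pci" means it is
# detached from the host and available for passthrough. The PCI address is illustrative.
import os

def bound_driver(pci_addr: str = "0000:3b:00.0") -> str:
    link = f"/sys/bus/pci/devices/{pci_addr}/driver"
    return os.path.basename(os.path.realpath(link)) if os.path.exists(link) else "none"

if __name__ == "__main__":
    drv = bound_driver()
    state = "ready for passthrough" if drv == "vfio-pci" else "owned by the host"
    print(f"driver: {drv} ({state})")
```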
BBC (Bare‑Metal) GPU Cloud Servers
BBC instances eliminate the virtualization layer to offer zero‑overhead compute and use a DPU to offload networking and storage, delivering full hardware performance to the tenant.
DPU Capabilities
Network offload (e.g., VXLAN encapsulation, RDMA) frees host CPU cycles and reduces latency.
Storage acceleration via NVMe‑oF and SNAP.
Representative BBC Models
A800: 8 × A800 SXM4 80 GB, 400 GB/s NVLink, 8 × CX6 100 Gbps RoCEv2 NICs.
L20: PCIe P2P enabled, paired with 400 Gbps RoCE NICs.
H20: 96 GB (or 141 GB) Hopper‑architecture GPUs with NVLink/NVSwitch, suitable for large‑model training and inference.
Future Directions
The article predicts a convergence of scale‑up and scale‑out techniques, with high‑density nodes (e.g., NVL72) using NVLink for inter‑node links, and a growing distinction between inference‑optimized and training‑optimized GPUs. ASIC inference chips are expected to improve cost‑efficiency.