
How Baidu Cloud Optimizes GPU Servers for AI Workloads

This article explains the design and implementation of GPU cloud servers, covering data processing pipelines, hardware selection, topology, interconnect technologies, virtualization, multi‑GPU communication methods, and Baidu's practical solutions for both virtualized and bare‑metal instances to boost AI inference and training performance.


When deploying DeepSeek large models on the cloud, users can choose multi-node or single-node 8-GPU bare-metal instances for the full-capacity versions, or single- or dual-GPU virtual machines for the distilled versions; this choice directly affects service throughput and training time.

1. GPU Data Processing Flow

The data processing pipeline consists of six steps; the note in parentheses after each step names the resource that typically bounds it:

Read data from network or storage into memory (network/storage transfer performance).

CPU pre‑processes data in memory and writes back (memory bandwidth and CPU performance).

Copy data from memory to GPU memory (Host‑to‑Device transfer).

GPU computes using data in its memory (GPU memory bandwidth and compute performance, possibly involving multi‑GPU collective communication).

Intra‑node multi‑GPU data transfer or inter‑node network transfer (intra‑node or inter‑node bandwidth).

Copy results from GPU memory back to host memory (Device‑to‑Host transfer).

Designing GPU cloud servers requires considering each link in this chain and balancing performance with cost.
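
To make the chain concrete, the sketch below expresses the six steps with PyTorch. The framework choice, the file-based input, and the normalization step are illustrative assumptions, not something the article prescribes.

```python
# Minimal sketch of the six-step pipeline (illustrative; assumes a 2-D float
# batch saved with torch.save and a single CUDA device).
import torch

def run_step(batch_path: str) -> torch.Tensor:
    # 1. Read data from storage or the network into host memory (I/O bound).
    batch = torch.load(batch_path).float()
    # 2. CPU pre-processing in host memory (CPU and memory-bandwidth bound).
    batch = (batch - batch.mean()) / (batch.std() + 1e-6)
    # 3. Host-to-Device copy; pinned memory enables an asynchronous DMA transfer.
    batch = batch.pin_memory().to("cuda", non_blocking=True)
    # 4. GPU compute (GPU memory bandwidth and FLOPs bound).
    out = batch @ batch.T
    # 5. In multi-GPU jobs, an intra-node or inter-node exchange
    #    (e.g., an all-reduce over NVLink, PCIe, or RDMA) happens here.
    # 6. Device-to-Host copy of the result.
    return out.cpu()
```

Pinning the host buffer in step 3 is what allows the Host-to-Device copy to run asynchronously and overlap with GPU compute, which is usually the cheapest optimization in this chain.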

2. Layered Design of GPU Cloud Servers

The design is divided into four layers:

Bottom layer: hardware components such as CPU, memory, GPU, network, storage, and their selection.

Next layer: multi-GPU communication methods (shared memory, PCIe P2P, proprietary buses such as NVLink and HCCS, and inter-node RDMA).

Third layer: collective communication libraries (CCL) that select routes and algorithms based on the underlying interconnect.

Top layer: AI frameworks that rely on efficient collective communication.

2.1 GPU Cloud Server Fundamentals

Key aspects include hardware selection, topology (NUMA and PCIe), GPU interconnect technologies, and virtualization (VM or bare‑metal).

2.1.1 Hardware Selection

CPU: choose based on the workload type (inference or training), core count, and frequency.

Memory: sufficient capacity and bandwidth (e.g., modern DDR5 platforms can reach roughly 1,200 GB/s of aggregate bandwidth).

GPU: select based on compute performance, memory size, and interconnect.

Network: high‑bandwidth north‑south and east‑west networks (e.g., 100 Gbps or 200 Gbps for A100/A800).

Storage: large‑capacity SSDs or distributed file systems for training data.

2.1.2 Single‑Node Hardware Topology

Includes NUMA topology (CPU, memory, GPU placement) and PCIe topology (GPU‑to‑CPU, GPU‑to‑RDMA NIC).
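
On Linux hosts, the usual tool for inspecting this layout is `nvidia-smi topo -m`. The sketch below reads the same affinity information straight from sysfs; filtering on PCI vendor ID 0x10de (NVIDIA) is an illustrative assumption, and the helper is not Baidu tooling.

```python
# Print the NUMA node and local CPU list for every NVIDIA PCI function.
# Note: this matches GPUs as well as other NVIDIA functions (audio, NVSwitch).
import glob
import os

def gpu_numa_topology() -> None:
    for dev in sorted(glob.glob("/sys/bus/pci/devices/*")):
        with open(os.path.join(dev, "vendor")) as f:
            if f.read().strip() != "0x10de":   # NVIDIA PCI vendor ID
                continue
        with open(os.path.join(dev, "numa_node")) as f:
            numa = f.read().strip()            # -1 means no NUMA affinity reported
        with open(os.path.join(dev, "local_cpulist")) as f:
            cpus = f.read().strip()
        print(f"{os.path.basename(dev)}  NUMA node {numa}  local CPUs {cpus}")

if __name__ == "__main__":
    gpu_numa_topology()
```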

2.1.3 GPU Interconnect Technologies

Scale‑Up: intra‑node connections via PCIe or proprietary buses (NVLink, HCCS, etc.).

Scale-Out: inter-node connections via RDMA networks (InfiniBand, RoCE).

PCIe 6.0 offers up to 64 GT/s per lane, which works out to roughly 256 GB/s of bidirectional bandwidth for an x16 link.

Fourth-generation NVLink provides up to 900 GB/s of bidirectional bandwidth per GPU.

NVSwitch provides switched all-to-all connectivity among the GPUs in a node at full NVLink bandwidth.

Huawei HCCS offers 56 GB/s per link for Ascend processors.
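
As a sanity check on these figures, the short calculation below converts per-lane and per-link rates into link bandwidth. Encoding and protocol overhead are ignored, so the results are upper bounds.

```python
# Back-of-the-envelope link bandwidth from signaling rates (upper bounds).

def pcie_one_direction_gb_s(gt_per_s: float, lanes: int) -> float:
    """Peak one-direction bandwidth in GB/s: each transfer carries ~1 bit per lane."""
    return gt_per_s * lanes / 8

gen6_x16_one_dir = pcie_one_direction_gb_s(64, 16)   # ~128 GB/s per direction
gen6_x16_bidir = 2 * gen6_x16_one_dir                 # ~256 GB/s bidirectional

# NVLink 4 (H100 class): 18 links per GPU, ~25 GB/s per link per direction.
nvlink4_bidir = 18 * 25 * 2                           # ~900 GB/s bidirectional

print(gen6_x16_one_dir, gen6_x16_bidir, nvlink4_bidir)
```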

2.1.4 Virtualization

Two main approaches:

Traditional VM (QEMU/KVM) with GPU passthrough (low overhead).

DPU‑based virtualization, where the DPU offloads network and storage, allowing near‑bare‑metal performance.

2.2 Multi‑GPU Communication

Three intra‑node methods:

Shared memory (data staged through host memory; lowest bandwidth, ~10-20 GB/s).

PCIe P2P (direct GPU‑to‑GPU, ~40‑50 GB/s).

Proprietary buses (NVLink, HCCS, etc., up to 400‑900 GB/s).
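
Whether the faster paths are actually available between a given pair of GPUs can be queried at runtime. The check below assumes PyTorch; the equivalent call in plain CUDA is cudaDeviceCanAccessPeer.

```python
# Report which GPU pairs can exchange data directly (PCIe P2P or NVLink)
# instead of bouncing through host memory.
import torch

n = torch.cuda.device_count()
for src in range(n):
    for dst in range(n):
        if src == dst:
            continue
        direct = torch.cuda.can_device_access_peer(src, dst)
        print(f"GPU{src} -> GPU{dst}: {'direct P2P' if direct else 'via host memory'}")
```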

For inter-node communication, GPUDirect RDMA (GDR) lets the NIC read and write GPU memory directly, bypassing host memory; this removes copy hops and increases bandwidth 3-4× compared with paths staged through host memory.
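
In practice, applications rarely select these paths by hand: a collective communication library such as NCCL picks NVLink/NVSwitch, PCIe P2P, or GPUDirect RDMA based on the topology it detects (tunable through environment variables such as NCCL_NET_GDR_LEVEL). Below is a hedged sketch of an all-reduce through torch.distributed; the launch command, tensor size, and environment handling are assumptions for illustration.

```python
# All-reduce across GPUs (and nodes) with the NCCL backend; NCCL decides the
# actual transport. Launch with: torchrun --nproc_per_node=8 allreduce_demo.py
import os

import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)

    rank = dist.get_rank()
    x = torch.ones(1 << 20, device="cuda") * rank  # 1M floats per rank
    dist.all_reduce(x, op=dist.ReduceOp.SUM)       # summed across all ranks
    print(f"rank {rank}: x[0] = {x[0].item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```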

3. Baidu Intelligent Cloud Practices

3.1 BCC (Virtualized) GPU Servers

Based on QEMU/KVM with VFIO GPU passthrough, supporting 1-, 2-, 4-, and 8-GPU instances. Enhancements include:

NUMA awareness for GPUs.

Multi‑GPU PCIe P2P support.

Shared NVSwitch for A100/A800.

RDMA NIC passthrough with ATS and ACS optimizations.

Issues such as missing GPU NUMA information inside VMs are solved by mapping each physical NUMA node to its own virtual PCIe bus, ensuring that each GPU, its vCPUs, and its memory reside on the same virtual NUMA node.
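
One practical payoff of exposing correct NUMA information in the guest is that a workload can pin itself to the CPUs local to its GPU, keeping host-device copies on one NUMA node. Below is a minimal sketch using standard sysfs paths; the PCI address in the usage comment is a hypothetical example.

```python
# Bind the current process to the CPUs that are local to a given GPU.
import os

def pin_to_gpu_numa(pci_addr: str) -> None:
    with open(f"/sys/bus/pci/devices/{pci_addr}/local_cpulist") as f:
        cpulist = f.read().strip()                 # e.g. "0-23,96-119"
    cpus = set()
    for part in cpulist.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    os.sched_setaffinity(0, cpus)                  # restrict this process to local CPUs

# pin_to_gpu_numa("0000:3b:00.0")                  # hypothetical GPU PCI address
```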

3.2 BBC (Bare‑Metal) GPU Servers

These servers eliminate virtualization overhead, integrating GPUs, NVSwitch, RDMA, and a DPU for network/storage offload. DPU capabilities include:

Network protocol acceleration (e.g., VxLAN, RDMA).

Storage acceleration via NVMe‑oF and SNAP.

Typical models:

A800: 8 × A800 SXM4 80 GB GPUs, 8 × ConnectX-6 (CX6) 100 Gbps RoCE NICs, and 400 GB/s NVLink.

L20: PCIe P2P enabled; each PCIe switch connects two L20 GPUs and one 400 Gbps RoCE NIC.

H20: 8 × Hopper-architecture H20 GPUs (96 GB or 141 GB of GPU memory), with NVLink/NVSwitch support for large-model inference and training.

4. Future Directions

4.1 Converging Scale‑Up and Scale‑Out

Future clusters will blend high‑density intra‑node interconnect (NVLink) with high‑bandwidth inter‑node RDMA, reducing the distinction between scaling up and out.

4.2 Inference vs. Training Cards

Inference workloads prioritize GPU memory size and interconnect bandwidth, while training benefits from higher compute density; ASIC inference chips may further lower costs.

Tags: cloud computing, AI, GPU, virtualization, RDMA, NVLink
Written by Baidu Intelligent Cloud Tech Hub

We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.