
Network Architecture Selection and Comparison for AI Compute Centers

The article analyzes traditional cloud data‑center networking challenges for AI workloads and compares two‑layer and three‑layer fat‑tree architectures, presenting high‑bandwidth, non‑blocking, and low‑latency designs such as AI‑Pool networks and offering practical deployment scales from hundreds to tens of thousands of GPUs.

IT Architects Alliance

Traditional cloud data‑center networks are generally designed based on an external service traffic model, where traffic mainly flows from the data center to end customers (north‑south), with east‑west traffic inside the cloud as secondary. The underlying physical network architecture that carries VPC networks faces the following challenges for AI compute workloads.

This article is selected from "AI Compute Center Network Architecture Selection and Comparison", which compares traditional networks with dedicated AI compute networks, analyzes two-layer and three-layer fat-tree architectures, and provides best practices for networking.

Blocking (oversubscribed) network: Since not all servers generate external traffic simultaneously, leaf-switch uplink bandwidth is not provisioned 1:1 with downlink bandwidth, in order to control network construction cost. Instead there is a convergence (oversubscription) ratio, typically 3:1, i.e. uplink bandwidth is only one third of downlink bandwidth.
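To make the convergence-ratio arithmetic concrete, here is a minimal sketch; the port counts and speeds are illustrative, not taken from the article:

```python
# Oversubscription (convergence) ratio of a leaf switch:
# total downlink bandwidth divided by total uplink bandwidth.
# Illustrative figures: 48 x 25 Gbps downlinks, 4 x 100 Gbps uplinks.
def oversubscription_ratio(down_ports, down_gbps, up_ports, up_gbps):
    """1.0 means non-blocking; 3.0 means uplinks carry one third of downlink capacity."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

print(oversubscription_ratio(48, 25, 4, 100))  # 3.0 -> the typical 3:1 convergence
```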

Higher latency for internal cloud traffic: Two servers under different leaf switches must traverse a spine switch, resulting in a 3-hop forwarding path (leaf, spine, leaf).

Insufficient bandwidth: Usually a single physical machine has only one NIC attached to the VPC network, with limited bandwidth; commercially available NICs generally do not exceed 200 Gbps.

For AI compute scenarios, the current best practice is to build an independent high‑performance network to carry AI workloads, meeting requirements of high bandwidth, low latency, and lossless transmission.

High-bandwidth design: AI compute servers can be equipped with up to 8 GPUs and reserve 8 PCIe NIC slots. In multi-machine GPU clusters, burst bandwidth between GPUs on different machines can exceed 50 Gbps, so each GPU is typically paired with at least one 100 Gbps network port. Configurations may include 4 dual-port 100 Gbps NICs (4×2×100 Gbps), 8 single-port 100 Gbps NICs (8×1×100 Gbps), or 8 single-port 200/400 Gbps NICs.
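The per-server external bandwidth implied by each NIC configuration can be checked with a short sketch (the helper name is our own, not a vendor API):

```python
# Aggregate external bandwidth per server for the NIC options listed above.
def server_bw_gbps(nics, ports_per_nic, gbps_per_port):
    return nics * ports_per_nic * gbps_per_port

print(server_bw_gbps(4, 2, 100))  # 800  (4 dual-port 100 Gbps NICs)
print(server_bw_gbps(8, 1, 100))  # 800  (8 single-port 100 Gbps NICs)
print(server_bw_gbps(8, 1, 200))  # 1600 (8 single-port 200 Gbps NICs = 1.6 Tbps)
```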

Non-blocking design: The key is to adopt a Fat-Tree network architecture with 1:1 uplink/downlink bandwidth (no convergence). If a switch has 64×100 Gbps downlink ports, it also has 64×100 Gbps uplink ports. Use data-center-class switches that provide full-port non-blocking forwarding.
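A minimal sketch of the 1:1 provisioning check (an illustrative helper, not a real switch API):

```python
# 1:1 (non-blocking) check: uplink capacity must equal downlink capacity.
# Example from the text: 64 x 100 Gbps down and 64 x 100 Gbps up on one switch.
def is_non_blocking(down_ports, up_ports):
    # Uniform port speed assumed, so comparing port counts suffices.
    return down_ports == up_ports

print(is_non_blocking(64, 64))  # True  -> no convergence ratio
print(is_non_blocking(48, 16))  # False -> oversubscribed (3:1)
```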

Low-latency AI-Pool design: Baidu Intelligent Cloud implements a rail-optimized AI-Pool network. Eight access switches form an AI-Pool; within the same AI-Pool, GPU-to-GPU communication between different AI compute nodes requires only one hop.

In the AI-Pool, NICs with the same index on different nodes connect to the same access switch; communication libraries can exploit this rail layout so that GPUs with the same index on different nodes communicate in a single hop.
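The rail wiring can be sketched as follows (switch naming is hypothetical; real deployments assign their own identifiers):

```python
# Rail-optimized wiring: NIC i on every node connects to access switch i,
# so same-index GPUs on different nodes are one switch hop apart.
NICS_PER_NODE = 8  # one NIC (and rail) per GPU

def rail_switch(nic_index):
    """Access switch that NIC `nic_index` of every node in the AI-Pool lands on."""
    assert 0 <= nic_index < NICS_PER_NODE
    return f"leaf-{nic_index}"

# GPU 3 on node A and GPU 3 on node B share a switch -> 1 hop.
print(rail_switch(3) == rail_switch(3))  # True
# GPUs on different rails must cross the aggregation layer instead.
print(rail_switch(3) == rail_switch(5))  # False
```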

For cross‑AI‑Pool communication between two physical machines, traffic must pass through aggregation switches, resulting in three hops.

The scale of GPUs that the network can support depends on switch port density and network architecture. More hierarchy increases supported GPU count but also adds hops and latency, requiring trade‑offs.

Two-layer Fat-Tree architecture: Eight access switches form an AI-Pool. A switch with P ports can connect up to P/2 servers downlink and P/2 aggregation switches uplink, so the network supports up to P×P/2 GPUs.

Three-layer Fat-Tree architecture: Adds aggregation and core switch groups above the access layer. Each group can contain up to P/2 switches, with up to 8 aggregation groups and up to P/2 core groups, supporting P×(P/2)×(P/2) = P³/4 GPUs. With 40-port 200 Gbps HDR InfiniBand switches, the maximum is 16,000 GPUs, a scale record held by Baidu.
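As a quick sanity check of the capacity formulas above (assuming uniform P-port switches throughout the fabric):

```python
# GPU capacity formulas for uniform P-port switches in a 1:1 Fat-Tree.
def two_layer_gpus(p):
    return p * (p // 2)          # P x P/2

def three_layer_gpus(p):
    return p * (p // 2) ** 2     # P x (P/2) x (P/2) = P^3/4

print(two_layer_gpus(40))    # 800   (40-port HDR switches)
print(three_layer_gpus(40))  # 16000
```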

Comparison of two-layer vs. three-layer Fat-Tree:

GPU capacity: With 40-port switches, a two-layer network supports 800 GPUs; a three-layer network supports 16,000 GPUs.

Forwarding path hops: Same-index GPU communication takes 1 hop in a two-layer network versus 3 hops in a three-layer network; different-index communication without Rail Local optimization takes 3 hops versus 5 hops.
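The hop counts above can be captured in a small helper (an illustrative summary, not a routing implementation):

```python
# Switch hops between two GPUs, per the comparison above.
def hops(layers, same_index_rail_local):
    """layers: Fat-Tree depth (2 or 3); same_index_rail_local: whether the
    same-NIC-index (Rail Local) path applies."""
    table = {(2, True): 1, (2, False): 3, (3, True): 3, (3, False): 5}
    return table[(layers, same_index_rail_local)]

print(hops(2, True), hops(3, True))    # 1 3
print(hops(2, False), hops(3, False))  # 3 5
```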

Typical practice:

Different models of InfiniBand/RoCE switches and network architectures support different GPU scales. Recommended specifications:

Regular: InfiniBand two‑layer Fat‑Tree, up to 800 GPUs.

Large: RoCE two‑layer Fat‑Tree with 128‑port 100 G data‑center Ethernet switches, up to 8,192 GPUs.

XLarge: InfiniBand three‑layer Fat‑Tree, up to 16,000 GPUs.

XXLarge: InfiniBand Quantum‑2 or equivalent Ethernet switches, three‑layer Fat‑Tree, up to 100,000 GPUs.
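The tiering above can be expressed as a small lookup table for capacity planning (tier names and limits from the article; the selection helper is our own sketch):

```python
# Recommended tiers from the article: (fabric, Fat-Tree layers, max GPUs).
TIERS = {
    "Regular": ("InfiniBand", 2, 800),
    "Large":   ("RoCE (128-port 100G Ethernet)", 2, 8_192),
    "XLarge":  ("InfiniBand", 3, 16_000),
    "XXLarge": ("InfiniBand Quantum-2 or equivalent Ethernet", 3, 100_000),
}

def pick_tier(gpu_count):
    """Smallest tier whose capacity covers the requested GPU count."""
    for name, (_, _, capacity) in TIERS.items():  # insertion order = ascending capacity
        if gpu_count <= capacity:
            return name
    raise ValueError("exceeds XXLarge capacity")

print(pick_tier(1_000))   # Large
print(pick_tier(50_000))  # XXLarge
```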

Large AI compute physical network practice: Supports up to 8,192 GPUs per cluster, with each AI-Pool supporting 512 GPUs, using a non-blocking, low-latency, high-reliability network design to enable rapid AI application iteration.

XLarge AI compute physical network practice: Baidu Intelligent Cloud designed an InfiniBand network for ultra-large clusters, using 200 Gbps HDR switches; each GPU server has 1.6 Tbps of external bandwidth.

Disclaimer: The material is collected from the internet, copyright belongs to the original authors, and the content reflects personal views only. It is provided for learning and exchange; please verify independently.

Tags: network architecture, low latency, AI compute, cloud data center, Fat Tree, high bandwidth
Written by IT Architects Alliance

Discussion and exchange on system, internet, large‑scale distributed, high‑availability, and high‑performance architectures, as well as big data, machine learning, AI, and architecture adjustments with internet technologies. Includes real‑world large‑scale architecture case studies. Open to architects who have ideas and enjoy sharing.
