Designing High‑Performance Networks for Large‑Scale AI Model Training
This article examines the challenges of building scalable, low‑latency, and cost‑effective network architectures for the massive GPU clusters used to train trillion‑parameter AI models. It surveys the Clos/Fat‑Tree, Spine‑Leaf, Dragonfly, and Torus topologies, compares multi‑rail and single‑rail designs, and highlights real‑world implementations from Tencent and Alibaba.
In AI large‑model training scenarios, training duration is typically inversely proportional to the number of GPUs, so multi‑GPU training can dramatically shorten training time. For models with billions or trillions of parameters, compute clusters must support tens of thousands of GPUs and provide high compute power, low latency, and high throughput.
Designing a large‑scale, high‑reliability, low‑cost, and easy‑to‑maintain network architecture is therefore crucial to meet the demanding compute, latency, and bandwidth requirements of large‑model training.
Clos (Fat‑Tree) Network Architecture – The non‑blocking Fat‑Tree topology, also known as a Clos network, is widely used for large‑model training due to its efficient routing, good scalability, and ease of management.
For small‑ to medium‑scale GPU clusters, a two‑layer Spine‑Leaf architecture is common, while larger clusters adopt a three‑layer Core‑Spine‑Leaf design, which increases hop count and latency as the hierarchy deepens.
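The scaling trade‑off between two and three tiers can be sketched with the textbook non‑blocking fat‑tree bound: each tier above the leaf halves the ports a switch can devote to downlinks. This is the generic formula, not any specific vendor's design.

```python
def fat_tree_hosts(radix: int, tiers: int) -> int:
    """Upper bound on end hosts in a non-blocking fat-tree built from
    fixed-radix switches: radix^tiers / 2^(tiers - 1), because every
    tier above the leaf splits ports evenly between up- and downlinks."""
    return radix ** tiers // (2 ** (tiers - 1))

# 128-port switches: 2-tier spine-leaf vs. 3-tier core-spine-leaf
print(fat_tree_hosts(128, 2))  # 8192
print(fat_tree_hosts(128, 3))  # 524288
```

The third tier buys a 64x jump in maximum scale, at the cost of the extra hop (and latency) the article mentions.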
GPU server connections can be multi‑rail or single‑rail. In the multi‑rail approach, each GPU server’s eight NICs connect to eight separate leaf switches, providing high communication efficiency. In the single‑rail approach, all NICs of a GPU server connect to a single leaf switch, which simplifies cabling but reduces communication efficiency; a leaf switch failure impacts more GPUs than in the multi‑rail case.
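The two wiring schemes can be illustrated as a NIC‑to‑leaf mapping. The switch names and the 16‑leaf round‑robin assignment below are hypothetical; the point is the difference in failure blast radius.

```python
NICS_PER_SERVER = 8  # one NIC per GPU, as in the designs discussed here

def leaf_switch(server_id: int, nic_id: int, multi_rail: bool) -> str:
    """Illustrative NIC-to-leaf mapping (names/counts are assumptions)."""
    if multi_rail:
        # Rail i of every server lands on leaf i: if one leaf fails,
        # every server loses one NIC but keeps its other seven rails.
        return f"leaf-{nic_id}"
    # Single-rail: all eight NICs of a server share one leaf, so a
    # leaf failure takes every GPU on its attached servers offline.
    return f"leaf-{server_id % 16}"  # hypothetical: 16 leaves, round-robin

print(leaf_switch(3, 5, multi_rail=True))   # leaf-5
print(leaf_switch(3, 5, multi_rail=False))  # leaf-3
```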
Tencent Star‑Mesh HPN Network – Uses a non‑blocking Fat‑Tree topology divided into three layers: Cluster‑Pod‑Block. With 128‑port 400 Gbps switches, each Block contains 1,024 GPUs; each Pod supports up to 64 Blocks (65,536 GPUs); a Cluster can host 524,288 GPUs.
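The quoted figures are internally consistent; the pods‑per‑cluster count below is inferred from them by simple division, not stated in the source.

```python
gpus_per_block = 1024
blocks_per_pod = 64
pod_gpus = gpus_per_block * blocks_per_pod   # 65,536 GPUs per Pod
cluster_gpus = 524_288
pods_per_cluster = cluster_gpus // pod_gpus  # 8 Pods (inferred)
print(pod_gpus, pods_per_cluster)
```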
Alibaba Cloud HPN Network – Introduces a dual‑plane two‑layer architecture. Each GPU server has eight GPUs and eight 200 Gbps NICs, each NIC connecting to a different leaf switch. Leaf switches provide extra 8×200 Gbps ports for rapid replacement. The spine layer connects multiple segments, each segment containing 1,024 GPUs. Pods contain 15 segments (15,360 GPUs). The design uses a 15:1 spine‑core convergence ratio, supporting up to 245,760 GPUs.
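Checking the Alibaba figures the same way (the pod count is again inferred by division):

```python
gpus_per_segment = 1024
segments_per_pod = 15
pod_gpus = gpus_per_segment * segments_per_pod  # 15,360 GPUs per Pod
max_gpus = 245_760
pods = max_gpus // pod_gpus                     # 16 Pods (inferred)
server_bw_gbps = 8 * 200                        # 1.6 Tbps of NIC bandwidth per server
print(pod_gpus, pods, server_bw_gbps)
```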
Dragonfly Network – Traditional Clos networks are universal but not optimal for latency and cost. Dragonfly offers a smaller network diameter and lower deployment cost, supporting over 270,000 GPUs—four times more than a comparable Fat‑Tree—while reducing switch count and hop count by about 20%. However, each expansion requires redeploying links, making maintenance more challenging.
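The Dragonfly scale claim follows from the classic sizing rule (Kim et al., 2008): with a fully connected inter‑group graph, the group count is bounded by the global links available. The parameters below are a balanced textbook example, not the configuration behind the article's 270,000 figure.

```python
def dragonfly_max_endpoints(p: int, a: int, h: int) -> int:
    """p terminals per router, a routers per group, h global links per
    router. A fully connected inter-group graph supports at most
    g = a*h + 1 groups, giving N = g * a * p endpoints."""
    return (a * h + 1) * a * p

# Balanced configuration (a = 2p = 2h) on roughly 64-port routers
print(dragonfly_max_endpoints(p=16, a=32, h=16))  # 262656
```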
Group‑wise Dragonfly+ (GW‑DF+) – For scales exceeding 100,000 GPUs, a three‑layer Fat‑Tree with converged second layer (L2) can reduce L3 device count, saving cost and power. GW‑DF+ is a direct‑connect architecture where pods interconnect via a two‑layer Fat‑Tree, and L2 devices within a pod are pairwise directly linked. Using 400 Gbps 51.2 Tbps switches, this design can support over 200,000 GPUs, and with chassis switches, scales beyond one million GPUs.
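The "400 Gbps 51.2 Tbps" phrasing implies the switch radix: a 51.2 Tbps ASIC split into 400 Gbps ports yields 128 ports per switch.

```python
asic_gbps = 51_200   # 51.2 Tbps switching capacity, expressed in Gbps
port_gbps = 400
radix = asic_gbps // port_gbps  # 128 ports of 400 Gbps each
print(radix)
```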
Torus Network – A fully symmetric topology with low latency and small network diameter, suitable for collective communication and reducing construction costs. However, scaling a Torus may require topology re‑adjustments and incurs higher maintenance complexity.
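The small‑diameter claim can be made concrete with the standard properties of a k‑ary n‑dimensional torus; the 16x16x16 example is illustrative, not from the source.

```python
import math

def torus_properties(dims: list[int]) -> tuple[int, int, int]:
    """Node count, per-node degree, and hop diameter of an
    n-dimensional torus; wraparound links halve each dimension's
    worst-case path, so the diameter is sum(k_i // 2)."""
    nodes = math.prod(dims)
    degree = 2 * len(dims)                # two neighbors per dimension
    diameter = sum(d // 2 for d in dims)
    return nodes, degree, diameter

print(torus_properties([16, 16, 16]))  # (4096, 6, 24)
```

With only six links per node and no switches in the core, construction cost stays low, but as the article notes, growing the machine means rewiring the wraparound links.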
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.