Why Do AI Large‑Model Training Clusters Need Specialized Network Topologies?
The article explains how AI large‑model training demands massive GPU resources and how carefully designed network architectures—such as Clos/Fat‑Tree, Spine‑Leaf, multi‑rail versus single‑rail connections, Dragonfly, and Torus—impact performance, scalability, cost, and reliability, guiding the selection of optimal data‑center networks.
In AI large‑model training, the required compute grows with model and dataset size, and for a given workload the training time falls roughly in inverse proportion to the number of GPUs, so multi‑GPU (multi‑card) training is essential to keep training time practical, especially for models with billions or trillions of parameters.
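To make the relationship concrete, here is a minimal back‑of‑the‑envelope sketch in Python (all constants are illustrative assumptions, not figures from the article): training time is roughly the total required FLOPs divided by the cluster's sustained throughput, so doubling the GPU count roughly halves the wall‑clock time, ignoring communication overhead.

# Illustrative estimate of training time versus GPU count.
# flops_per_gpu, utilization, and the 3e24 FLOP budget are placeholder assumptions.
def training_days(total_flops, num_gpus, flops_per_gpu=1e15, utilization=0.4):
    sustained = num_gpus * flops_per_gpu * utilization  # cluster FLOP/s actually delivered
    return total_flops / sustained / 86400              # seconds -> days

for n in (1024, 4096, 16384):
    print(f"{n:>6} GPUs -> ~{training_days(3e24, n):.1f} days")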
Designing a large‑scale, high‑reliability, low‑cost, easy‑to‑operate network architecture is crucial to meet the high compute, low latency, and high throughput requirements of such training workloads.
Clos (Fat‑Tree) architecture
Clos (also called Fat‑Tree) provides a non‑blocking network with efficient routing, good scalability and easy management, making it a common choice for large‑model training clusters.
For small‑ to medium‑scale GPU clusters a two‑layer Spine‑Leaf architecture is typical; for larger scales a three‑layer Fat‑Tree (Core‑Spine‑Leaf) is used, which increases hop count and latency.
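For a sense of how the extra tier buys scale (standard non‑blocking fat‑tree arithmetic, not figures from the article): built from k‑port switches, a two‑tier Spine‑Leaf fabric tops out at roughly k²/2 endpoints, while a three‑tier Fat‑Tree reaches roughly k³/4, at the cost of one additional switch hop on cross‑Pod paths.

# Maximum endpoints of a non-blocking fabric built from k-port switches
# (textbook formulas; real designs trade some of this for oversubscription).
def two_tier_max_hosts(k):
    # Each leaf splits its ports half down (hosts), half up (spines);
    # a k-port spine can reach at most k leaves.
    return k * (k // 2)

def three_tier_max_hosts(k):
    # Classic k-ary fat-tree (Core-Spine-Leaf): k^3 / 4 hosts.
    return k ** 3 // 4

for k in (64, 128):
    print(f"{k}-port switches: 2-tier ~{two_tier_max_hosts(k):,}, 3-tier ~{three_tier_max_hosts(k):,} endpoints")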
Multi‑rail vs single‑rail GPU server connections
Multi‑rail connects each GPU server’s eight NICs to eight different Leaf switches (one switch per rail), so GPUs with the same rail index on different servers are only one leaf hop apart and most collective traffic stays within that first hop, giving high communication efficiency. Single‑rail connects all of a server’s NICs to a single leaf switch, which is less efficient for cross‑server communication but offers simpler cabling and easier fault isolation.
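A minimal sketch of the wiring difference, assuming eight NICs per server and illustrative switch names (not taken from any specific vendor design): in the multi‑rail case NIC i of every server in a group lands on the group's leaf i; in the single‑rail case all eight NICs of a server land on one leaf.

# Illustrative leaf assignment for multi-rail vs. single-rail cabling.
NUM_RAILS = 8  # NICs (and rails) per GPU server

def multi_rail_leaf(group_id, nic_index):
    # Rail-optimized: NIC i of every server in the group -> the group's leaf i.
    return f"group{group_id}-leaf{nic_index}"

def single_rail_leaf(group_id, server_in_group):
    # Single-rail: all NICs of one server -> a single leaf switch.
    return f"group{group_id}-leaf{server_in_group % NUM_RAILS}"

for nic in range(NUM_RAILS):
    print(f"server 0, NIC {nic}: multi-rail {multi_rail_leaf(0, nic)}, single-rail {single_rail_leaf(0, 0)}")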
Typical industry designs
Tencent’s Star‑Mesh uses a non‑blocking Fat‑Tree topology divided into Cluster‑Pod‑Block hierarchy, supporting up to 65 536 GPUs (128‑port 400 Gbps switches, 1024 GPUs per Block, 64 Blocks per Pod).
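The quoted scale follows directly from the block and pod figures; a quick arithmetic check using only the numbers above:

# 1,024 GPUs per Block x 64 Blocks per Pod
print(1024 * 64)  # 65,536 GPUs in a fully populated Pod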
Alibaba’s High‑Performance Networking (HPN) adopts a dual‑plane two‑layer design: each GPU server has eight 200 Gbps NICs connected to two leaf switches, with additional spare ports for rapid replacement and a 15:1 spine‑core convergence ratio, supporting up to 245 760 GPUs.
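For reference, a convergence (oversubscription) ratio such as the 15:1 quoted above is simply the aggregate downstream bandwidth divided by the aggregate upstream bandwidth at a tier; the port counts in this sketch are placeholders chosen to produce 15:1, not actual HPN figures.

# Convergence ratio = total downstream bandwidth / total upstream bandwidth.
def convergence_ratio(down_ports, down_gbps, up_ports, up_gbps):
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# Placeholder port counts and speeds, not Alibaba HPN values:
print(convergence_ratio(down_ports=60, down_gbps=400, up_ports=4, up_gbps=400))  # 15.0 -> "15:1"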
Dragonfly and Group‑wise Dragonfly+
Traditional Clos offers universality but higher latency and cost. Dragonfly reduces network diameter and deployment cost, supporting over 270 000 GPUs—four times more than a three‑layer Fat‑Tree—while lowering switch count and latency, though it requires re‑deployment for scaling.
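For intuition on how Dragonfly reaches this scale with a small diameter, the canonical sizing formula from the original Dragonfly paper (Kim et al., not from this article) bounds the terminal count at N = a·p·(a·h + 1), where each group has a routers and each router has p terminal ports and h global links.

# Canonical Dragonfly sizing: at most a*h + 1 groups, N = a * p * (a*h + 1) terminals.
def dragonfly_max_terminals(a, p, h):
    groups = a * h + 1
    return a * p * groups

# A balanced a = 2p = 2h layout on ~64-port routers (illustrative, not a product spec):
print(f"{dragonfly_max_terminals(a=32, p=16, h=16):,} terminals")  # 262,656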
Group‑wise Dragonfly+ combines a three‑layer Fat‑Tree for intra‑Pod connectivity with direct L2 links between Pods, achieving up to 200 000+ GPUs with better scalability and lower power consumption.
Torus topology
Torus provides a symmetric topology with low latency and small diameter, suitable for collective communication, but scaling may require topology redesign and incurs higher maintenance complexity.
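To see why the diameter stays small (standard k‑ary n‑cube properties, not figures from the article): a torus with k nodes per dimension and n dimensions connects k^n nodes, and the wrap‑around links keep the worst‑case distance at n·⌊k/2⌋ hops.

# k-ary n-cube torus: k nodes per dimension, n dimensions.
def torus_nodes(k, n):
    return k ** n

def torus_diameter(k, n):
    # Wrap-around links cap each dimension at floor(k/2) hops.
    return n * (k // 2)

for k, n in ((16, 3), (24, 3)):
    print(f"{k}^{n} torus: {torus_nodes(k, n):,} nodes, diameter {torus_diameter(k, n)} hops")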
Overall, selecting the appropriate network architecture—whether Clos, Fat‑Tree, Dragonfly, or Torus—depends on the target scale, cost, latency, and manageability requirements of AI large‑model training clusters.