Why Do AI Large‑Model Training Clusters Need Specialized Network Topologies?

The article explains how AI large‑model training demands massive GPU resources and how carefully designed network architectures—such as Clos/Fat‑Tree, Spine‑Leaf, multi‑rail versus single‑rail connections, Dragonfly, and Torus—impact performance, scalability, cost, and reliability, guiding the selection of optimal data‑center networks.

Architects' Tech Alliance

In AI large‑model training, wall‑clock time falls roughly in inverse proportion to the number of GPUs, so multi‑GPU (multi‑card) training is essential to keep training time practical, especially for models with billions or trillions of parameters.
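As a back‑of‑envelope illustration of why GPU count matters (a sketch, not the article's method; the model size, token count, per‑GPU throughput, and utilization below are all illustrative assumptions), the widely used estimate of roughly 6 FLOPs per parameter per token for dense transformers gives:

```python
def training_seconds(params: float, tokens: float, num_gpus: int,
                     peak_flops: float = 1e15, mfu: float = 0.4) -> float:
    """Rough wall-clock training time for a dense transformer.

    Uses the ~6*N*D FLOPs rule of thumb; peak_flops (per-GPU) and mfu
    (model FLOPs utilization) are illustrative assumptions.
    """
    total_flops = 6 * params * tokens
    return total_flops / (num_gpus * peak_flops * mfu)

# Doubling the GPU count roughly halves the time (ignoring comm overhead):
t_1k = training_seconds(70e9, 2e12, num_gpus=1024)
t_2k = training_seconds(70e9, 2e12, num_gpus=2048)
print(t_1k / 86400, t_2k / 86400)  # estimated days at each scale
```

In practice communication overhead erodes this ideal scaling, which is exactly why the network topology matters.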

Network overview

Designing a large‑scale, high‑reliability, low‑cost, easy‑to‑operate network architecture is crucial to meet the high compute, low latency, and high throughput requirements of such training workloads.

Clos (Fat‑Tree) architecture

A Clos network, most often built as a Fat‑Tree, provides a non‑blocking fabric with efficient routing, good scalability, and easy management, making it a common choice for large‑model training clusters.

For small‑ to medium‑scale GPU clusters a two‑layer Spine‑Leaf architecture is typical; larger clusters use a three‑layer Fat‑Tree (Core‑Spine‑Leaf), which scales much further at the cost of extra hops and latency.
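The scale difference can be made concrete. For switches with k ports each, a non‑blocking two‑tier Spine‑Leaf fabric tops out at k²/2 hosts, while a classic three‑tier Fat‑Tree reaches k³/4 (the port counts below are illustrative parameters, not figures from the article):

```python
def spine_leaf_hosts(k: int) -> int:
    # Non-blocking two-tier: each leaf splits its k ports, k/2 down to
    # hosts and k/2 up; k/2 spines with k ports each support k leaves.
    return k * (k // 2)

def fat_tree_hosts(k: int) -> int:
    # Classic three-tier fat-tree built from k-port switches: k^3/4 hosts.
    return k ** 3 // 4

for k in (64, 128):
    print(k, spine_leaf_hosts(k), fat_tree_hosts(k))
```

With 128‑port switches, adding the third tier raises the ceiling from 8,192 to 524,288 hosts, which is why large clusters accept the extra hop.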

Fat‑Tree topology

Multi‑rail vs single‑rail GPU server connections

Multi‑rail connects each GPU server's eight NICs to eight different leaf switches (one per rail), so traffic between same‑rail NICs of different servers crosses only one leaf switch. Single‑rail connects all of a server's NICs to a single leaf switch, trading communication efficiency for simpler cabling and easier fault isolation.

Multi‑rail vs single‑rail

Typical industry designs

Tencent's Star‑Mesh uses a non‑blocking Fat‑Tree topology organized into a Cluster‑Pod‑Block hierarchy, supporting up to 65,536 GPUs (128‑port 400 Gbps switches, 1,024 GPUs per Block, 64 Blocks per Pod).
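The quoted capacity follows directly from the figures above:

```python
gpus_per_block = 1024   # GPUs per Block
blocks_per_pod = 64     # Blocks per Pod

gpus_per_pod = gpus_per_block * blocks_per_pod
print(gpus_per_pod)  # 65536, matching the 65,536-GPU figure
```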

Tencent Star‑Mesh

Alibaba's High‑Performance Networking (HPN) adopts a dual‑plane, two‑layer design: each GPU server has eight 200 Gbps NICs connected to two leaf switches, with additional spare ports for rapid replacement and a 15:1 spine‑core convergence ratio, supporting up to 245,760 GPUs.

Alibaba HPN

Dragonfly and Group‑wise Dragonfly+

A traditional Clos fabric is general‑purpose but pays for it in latency and cost. Dragonfly shrinks the network diameter and deployment cost: it can support over 270,000 GPUs, roughly four times a three‑layer Fat‑Tree built from switches of the same radix, with fewer switches and lower latency, though scaling beyond the initial design requires re‑deployment.
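One way to see where the roughly 4x figure comes from is the balanced Dragonfly sizing rule (p terminals and h global links per router with p = h = k/4, a = 2p routers per group); the radix k = 64 below is an illustrative choice, not taken from the article:

```python
def dragonfly_terminals(k: int) -> int:
    # Balanced Dragonfly: p terminals and h global links per k-port
    # router, a routers per group (all-to-all locally), g = a*h + 1
    # groups connected by one global link per router pair of groups.
    p = h = k // 4
    a = 2 * p
    g = a * h + 1
    return p * a * g

def fat_tree_hosts(k: int) -> int:
    # Three-tier fat-tree from the same k-port switches, for comparison.
    return k ** 3 // 4

k = 64
print(dragonfly_terminals(k), fat_tree_hosts(k))
```

With 64‑port routers this yields 262,656 terminals versus 65,536 for the Fat‑Tree, about the 4x gap the text describes.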

Dragonfly topology

Group‑wise Dragonfly+ combines a three‑layer Fat‑Tree for intra‑Pod connectivity with direct L2 links between Pods, reaching 200,000+ GPUs with better scalability and lower power consumption.

Group‑wise Dragonfly+

Torus topology

Torus provides a symmetric topology with low latency and a small diameter, well suited to collective communication, but scaling it may require a topology redesign and brings higher maintenance complexity.
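The small‑diameter claim is easy to quantify: in a torus, wraparound links halve the worst‑case distance in every dimension, so the farthest node is only floor(d/2) hops away per axis. A minimal sketch (the 8x8x8 shape is an illustrative example):

```python
def torus_diameter(dims: tuple) -> int:
    # Wraparound links halve the worst-case distance in each dimension.
    return sum(d // 2 for d in dims)

def node_count(dims: tuple) -> int:
    n = 1
    for d in dims:
        n *= d
    return n

print(node_count((8, 8, 8)), torus_diameter((8, 8, 8)))  # 512 nodes, 12 hops
```

A 512‑node 3D torus has a diameter of just 12 hops, which is why ring and neighbor‑exchange collectives map onto it so well.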

Torus network

Overall, selecting the appropriate network architecture—whether Clos, Fat‑Tree, Dragonfly, or Torus—depends on the target scale, cost, latency, and manageability requirements of AI large‑model training clusters.

Tags: Network Architecture · AI · High Performance Computing · Data Center · Large Model Training · GPU Clusters
Written by

Architects' Tech Alliance

Sharing project experience and insights into cutting-edge architectures, with a focus on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, and industry practices and solutions.
