Designing High‑Performance Networks for Massive AI Model Training

This article examines why AI large-model training demands massive GPU clusters with low-latency, high-throughput networks, compares Clos/Fat-Tree, Spine-Leaf, Dragonfly, Group-wise Dragonfly+, and Torus topologies, and discusses design choices for scaling to hundreds of thousands of GPUs.

In AI large-model training, training duration is closely tied to the number of GPUs available: distributing the workload across many GPUs can dramatically shorten training time. For models with billions or even trillions of parameters, the compute cluster must support tens of thousands of GPUs, making the internal network architecture critical for low latency and high throughput.
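
As a rough illustration of why cluster size matters, the sketch below estimates wall-clock training time from total training compute using the common "about 6 x parameters x tokens FLOPs" rule of thumb. The per-GPU throughput, utilization, and model/data sizes are assumptions chosen for illustration, not figures from this article.

```python
# Back-of-the-envelope estimate of training time versus GPU count.
# Assumptions (illustrative, not from the article): compute cost ~ 6 * params * tokens FLOPs,
# ideal linear scaling across GPUs, and a fixed sustained per-GPU throughput.

def training_days(params: float, tokens: float, num_gpus: int,
                  peak_flops_per_gpu: float = 989e12,   # assumed BF16 peak, illustrative
                  utilization: float = 0.4) -> float:
    """Return an estimated wall-clock training time in days."""
    total_flops = 6.0 * params * tokens
    cluster_flops_per_s = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / cluster_flops_per_s / 86_400

# Example: a 1-trillion-parameter model trained on 10 trillion tokens.
for gpus in (1_024, 16_384, 65_536):
    print(f"{gpus:>6} GPUs -> ~{training_days(1e12, 1e13, gpus):.0f} days")
```

Under these assumptions the same job drops from years on about a thousand GPUs to weeks on tens of thousands, which is the motivation for the cluster scales discussed below.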

Clos (Fat‑Tree) architectures are widely used because of their efficient routing, scalability, and ease of management. Small‑to‑medium GPU clusters typically adopt a two‑layer Spine‑Leaf design, while larger clusters use a three‑layer Fat‑Tree (Core‑Spine‑Leaf) to expand capacity.
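
To make the scaling difference concrete, here is a minimal sketch of idealized, non-blocking Clos capacity as a function of switch radix. Real deployments add oversubscription, rail grouping, and reserved ports, so these numbers are upper bounds rather than any vendor's actual design.

```python
# Idealized, non-blocking Clos capacity for a given switch radix (a sketch;
# real fabrics apply oversubscription and other constraints).

def spine_leaf_capacity(radix: int) -> int:
    """Two-tier spine-leaf: each leaf splits its ports half down, half up."""
    hosts_per_leaf = radix // 2          # downlinks to GPU NICs
    max_leaves = radix                   # each spine can reach 'radix' leaves
    return hosts_per_leaf * max_leaves   # = radix^2 / 2

def fat_tree_capacity(radix: int) -> int:
    """Three-tier fat-tree (core-spine-leaf): the classic k^3 / 4 end hosts."""
    return radix ** 3 // 4

for k in (64, 128):
    print(f"radix {k}: 2-tier ~{spine_leaf_capacity(k):,} ports, "
          f"3-tier ~{fat_tree_capacity(k):,} ports")
```

With a 128-port radix this works out to roughly 8,192 end ports at two tiers and around 524,000 at three tiers, which is why clusters beyond a few thousand GPUs add the third layer.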

Real-world deployments illustrate these concepts. Tencent's Star-Link HPN network employs a three-level Cluster-Pod-Block hierarchy with 128-port 400 Gbps switches, supporting up to 65,536 GPUs per cluster. Alibaba's HPN network introduces a dual-plane two-layer architecture where each GPU server has eight 200 Gbps NICs, providing redundant uplinks to different leaf devices and enabling clusters of up to 245,760 GPUs.
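
The snippet below is a hypothetical illustration of the dual-plane, multi-rail idea: each of a server's eight NICs attaches to a different leaf switch ("rail"), and each rail is mirrored across two independent planes. The naming and mapping are invented for clarity and are not Alibaba's actual scheme.

```python
# Hypothetical dual-plane, multi-rail uplink layout (illustrative only).

NICS_PER_SERVER = 8
PLANES = 2

def server_uplinks(server: str):
    """Yield (server, nic, plane, leaf) tuples for one server's redundant uplinks."""
    for nic in range(NICS_PER_SERVER):
        for plane in range(PLANES):
            # NIC i lands on rail i; each rail is duplicated across both planes,
            # so losing one leaf or even a whole plane leaves a path on every rail.
            yield server, nic, plane, f"leaf-plane{plane}-rail{nic}"

for server, nic, plane, leaf in server_uplinks("server-0"):
    print(f"{server} nic{nic} -> plane {plane} -> {leaf}")
```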

Alternative topologies such as Dragonfly and Group-wise Dragonfly+ offer a lower network diameter and reduced cost. Dragonfly can support over 270,000 GPUs, roughly four times the capacity of a comparable Fat-Tree, while cutting switch count and hop count by about 20%. However, each expansion of a Dragonfly network requires re-cabling existing links, which can affect maintainability.
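
For context, the sketch below sizes a canonical, balanced Dragonfly (in the sense of Kim et al., 2008) from the router radix. It is the generic textbook formula rather than the specific design behind the figures above.

```python
# Sketch of canonical balanced Dragonfly sizing (not the specific design above).

def dragonfly_max_terminals(radix: int) -> tuple[int, int]:
    """Return (groups, terminals) for a balanced Dragonfly built from routers
    of the given radix.

    Each router uses p terminal ports, a - 1 intra-group ports and h global
    ports (p + a - 1 + h <= radix); a balanced design takes a = 2p = 2h.
    """
    p = h = radix // 4
    a = 2 * p                       # routers per group
    groups = a * h + 1              # maximum number of fully connected groups
    return groups, a * p * groups   # terminals = (routers per group) * p * groups

for k in (48, 64):
    g, n = dragonfly_max_terminals(k)
    print(f"radix {k}: up to {g} groups, ~{n:,} terminals")
```

With radix-64 routers this formula already lands in the quarter-million-endpoint range, consistent with the scale claims above.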

Group-wise Dragonfly+ (GW-DF+) connects pods directly, avoiding intermediate routing and improving system efficiency. With 400 Gbps switches, this architecture can scale to more than 200,000 GPUs, and when chassis switches replace the leaf devices, the scale can exceed 500,000 GPUs.
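
The arithmetic below illustrates the "connect pods directly" idea with a simple full mesh of pods. The pod count and GPUs per pod are assumed values chosen only to show how cluster scale and inter-pod link bundles grow; they are not GW-DF+ parameters from the article.

```python
# Illustrative full-mesh pod-to-pod arithmetic (assumed values, not GW-DF+ specifics).

def full_mesh_pod_links(num_pods: int) -> int:
    """Each pod keeps a direct link bundle to every other pod."""
    return num_pods * (num_pods - 1) // 2

def cluster_gpus(num_pods: int, gpus_per_pod: int) -> int:
    return num_pods * gpus_per_pod

pods, gpus_per_pod = 64, 4_096   # assumed values for illustration
print(f"{pods} pods x {gpus_per_pod:,} GPUs = {cluster_gpus(pods, gpus_per_pod):,} GPUs, "
      f"{full_mesh_pod_links(pods):,} inter-pod link bundles")
```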

Torus networks provide a fully symmetric topology with low latency and small network diameter, making them suitable for collective communication. Nevertheless, scaling a Torus may involve topology re‑adjustments and higher maintenance complexity.
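
As a quick reference for why Torus latency stays low, the sketch below computes node count and worst-case hop distance for a k-ary n-cube; the dimensions chosen are illustrative.

```python
# k-ary n-cube (torus) properties: k^n nodes with a worst-case hop distance of
# n * floor(k / 2), which grows slowly relative to node count and suits
# nearest-neighbour collective communication. Dimensions are illustrative.

def torus_nodes(k: int, n: int) -> int:
    return k ** n

def torus_diameter(k: int, n: int) -> int:
    return n * (k // 2)

for k, n in ((16, 3), (32, 3), (24, 4)):
    print(f"{k}-ary {n}-cube: {torus_nodes(k, n):,} nodes, "
          f"diameter {torus_diameter(k, n)} hops")
```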

Figures in the original article: Cluster-Pod-Block hierarchy, Alibaba dual-plane architecture, Group-wise Dragonfly+ layout, topology comparison, and Torus topology diagrams.
Tags: Network Architecture, AI, Data Center, Large-Scale Training
Written by Architects' Tech Alliance

Sharing project experiences and insights into cutting-edge architectures, with a focus on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
