Why Fat-Tree, Dragonfly, and Torus Topologies Matter for HPC Networks
The article analyzes three major high‑performance‑computing network topologies—Fat‑Tree, Dragonfly, and Torus—detailing their design principles, scalability formulas, routing strategies, advantages, and limitations to help architects choose the most suitable architecture for large‑scale GPU clusters.
High‑performance computing (HPC) environments require not only low static latency but also the ability to scale to massive numbers of nodes. Traditional CLOS architectures prioritize generality at the expense of latency and cost‑effectiveness, prompting research into alternative topologies such as Fat‑Tree, Dragonfly, and Torus.
Fat‑Tree Architecture
The Fat‑Tree topology avoids bandwidth convergence toward the root; instead, bandwidth remains constant from leaves to the root, enabling non‑blocking communication. It is widely used because it offers low latency and supports various throughput options, from non‑blocking to oversubscribed configurations.
With a switch port count of n, a two-level Fat-Tree can connect up to n²/2 GPUs; a 40-port InfiniBand switch therefore supports at most 800 GPUs. A three-level Fat-Tree scales to n·(n/2)·(n/2) = n³/4 GPUs, or 16,000 GPUs with the same switch.
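As a quick sanity check of these formulas, here is a minimal Python sketch; the function names are illustrative, not from any library:

```python
def fat_tree_two_level(n: int) -> int:
    """Max end hosts for a 2-level Fat-Tree built from n-port switches.

    Each leaf switch uses n/2 ports down (hosts) and n/2 up, so the
    fabric supports n * (n/2) = n^2 / 2 hosts without oversubscription.
    """
    return n * n // 2

def fat_tree_three_level(n: int) -> int:
    """Max end hosts for a 3-level Fat-Tree: n * (n/2) * (n/2) = n^3 / 4."""
    return n * (n // 2) * (n // 2)

for ports in (40, 64):
    print(f"{ports}-port: 2-level={fat_tree_two_level(ports)}, "
          f"3-level={fat_tree_three_level(ports)}")
# 40-port: 2-level=800, 3-level=16000
# 64-port: 2-level=2048, 3-level=65536
```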
The topology nonetheless has notable drawbacks:
High cost: the switch-to-server ratio is large, requiring many switches and links, especially at small n.
Poor support for One‑to‑All and All‑to‑All traffic patterns, limiting suitability for MapReduce, Dryad, and similar distributed workloads.
Scalability limited by the number of ports on core switches.
Dragonfly Architecture
Proposed by John Kim et al. in 2008, Dragonfly features a small network diameter and low cost, making it popular in HPC and data‑center networks. Its three‑level hierarchy consists of Switch, Group, and System layers.
Switch layer: each switch connects to p compute nodes, to the other a−1 switches in its group via local links, and to other groups via h global links.
Group layer: contains a switches; the a−1 local links per switch provide all-to-all connectivity within the group.
System layer: comprises g groups, themselves fully connected by global links (so at most g = a·h + 1 groups).
These parameters let the topology be written as dfly(p, a, h, g): p compute nodes per switch, a switches per group, h global links per switch, and g groups. A balanced configuration uses a = 2p = 2h.
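Plugging these parameters into the standard size bounds from Kim et al. (a·p nodes per group, at most a·h + 1 fully connected groups) gives a quick scale estimate; the helper below is an illustrative sketch:

```python
def dragonfly_max_nodes(p: int, a: int, h: int) -> int:
    """Upper bound on compute nodes for dfly(p, a, h, g).

    Each group exposes a*h global links; with one link per group pair,
    at most g = a*h + 1 groups stay fully connected, and each group
    hosts a*p compute nodes.
    """
    g_max = a * h + 1
    return a * p * g_max

# Balanced configuration a = 2p = 2h from 64-port switches:
# each switch uses p + (a - 1) + h = 16 + 31 + 16 = 63 ports.
print(dragonfly_max_nodes(p=16, a=32, h=16))  # 262656 nodes (~270K)
```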
Routing algorithms include the following (a sketch of the adaptive decision rule follows the list):
Minimal Routing: at most 1 global link and 2 local links (local-global-local), for at most 3 hops.
Non-Minimal (Valiant) Routing: detours through a randomly chosen intermediate group, using up to 2 global and 3 local links, for at most 5 hops.
Adaptive Routing (e.g., UGAL, UGAL‑L, UGAL‑G): dynamically chooses between minimal and non‑minimal paths based on congestion.
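At its core, UGAL compares an estimated latency, queue occupancy times hop count, for the minimal path against a randomly chosen Valiant path; the sketch below is a simplified illustration of that decision rule using only local queue state as in UGAL-L (the function name and inputs are assumptions, not an actual router implementation):

```python
def ugal_choose_path(min_queue: int, min_hops: int,
                     valiant_queue: int, valiant_hops: int) -> str:
    """Pick minimal vs. non-minimal path, UGAL-style (simplified sketch).

    Latency is estimated as queue occupancy * hop count; ties go to the
    minimal path, since it consumes fewer network resources.
    """
    if min_queue * min_hops <= valiant_queue * valiant_hops:
        return "minimal"
    return "valiant"

# Light load: the 3-hop minimal path wins even against an idle detour.
print(ugal_choose_path(min_queue=2, min_hops=3,
                       valiant_queue=1, valiant_hops=5))   # minimal
# Adversarial load: a congested minimal path loses to the 5-hop detour.
print(ugal_choose_path(min_queue=40, min_hops=3,
                       valiant_queue=4, valiant_hops=5))   # valiant
```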
Dragonfly delivers strong performance across a wide range of applications: minimal paths need as few as 3 hops, and with 64-port switches the topology scales to roughly 270,000 nodes.
Torus Architecture
Torus networks are fully symmetric, offering small diameter, simple structure, multiple paths, and good scalability, making them suitable for collective communication in distributed machine learning.
Common variants include the 2D-Torus and 3D-Torus, both instances of a k-ary n-cube, where k is the number of nodes along each dimension and n is the number of dimensions, for kⁿ nodes in total.
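Concretely, each of the kⁿ nodes has 2n wrap-around neighbors, two per dimension; a small illustrative helper (the name and coordinate scheme are hypothetical):

```python
def torus_neighbors(coord: tuple[int, ...], k: int) -> list[tuple[int, ...]]:
    """Neighbors of a node in a k-ary n-cube (torus with wrap-around).

    Each of the n dimensions contributes two links: +1 and -1 modulo k.
    """
    neighbors = []
    for dim in range(len(coord)):
        for step in (1, -1):
            nxt = list(coord)
            nxt[dim] = (nxt[dim] + step) % k
            neighbors.append(tuple(nxt))
    return neighbors

# 4-ary 2-cube (4x4 2D-Torus): 16 nodes, each with 4 neighbors.
print(torus_neighbors((0, 3), k=4))  # [(1, 3), (3, 3), (0, 0), (0, 2)]
```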
Example 2D‑Torus deployment:
Horizontal: each server hosts X GPUs, interconnected via NVLink.
Vertical: each server connects to at least two RDMA NICs (NIC 0/NIC 1) through switches.
Typical communication steps for a 2D-Torus-based collective operation (a simulation sketch follows the list):
Intra-node Ring Scatter-Reduce splits gradients across the X GPUs within each server.
Inter‑node Ring All‑Reduce aggregates data across X servers.
Intra‑node All‑Gather replicates the reduced gradients to all GPUs on each server.
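To make the ring phases concrete, here is a minimal in-memory simulation of ring all-reduce (reduce-scatter followed by all-gather). The function name and the list-based "network" are illustrative assumptions; a real deployment would run these phases over NCCL or MPI on the NVLink/RDMA fabric described above:

```python
def ring_all_reduce(grads: list[list[float]]) -> list[list[float]]:
    """Simulate ring all-reduce over N ranks, each holding N scalar chunks.

    Reduce-scatter: in N-1 steps, rank r sends chunk (r - s) % N to rank
    (r + 1) % N, which accumulates it; chunk (r + 1) % N ends up fully
    reduced on rank r.  All-gather: N-1 more steps circulate the reduced
    chunks until every rank holds the complete sums.
    """
    n = len(grads)
    data = [row[:] for row in grads]

    for s in range(n - 1):                                  # reduce-scatter
        sends = [(r, (r - s) % n) for r in range(n)]
        updates = [((r + 1) % n, c, data[r][c]) for r, c in sends]
        for dst, c, val in updates:                         # apply after snapshot
            data[dst][c] += val

    for s in range(n - 1):                                  # all-gather
        sends = [(r, (r + 1 - s) % n) for r in range(n)]
        updates = [((r + 1) % n, c, data[r][c]) for r, c in sends]
        for dst, c, val in updates:
            data[dst][c] = val

    return data

grads = [[1.0, 2.0, 3.0],
         [10.0, 20.0, 30.0],
         [100.0, 200.0, 300.0]]
print(ring_all_reduce(grads))   # every rank: [111.0, 222.0, 333.0]
```

Each rank exchanges data only with its ring neighbors, which is exactly the locality the Torus wiring above provides.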
Advantages: lower latency, better locality, and smaller network diameter compared to CLOS.
Disadvantages: unpredictable performance, complex scaling (may require full re‑wiring), fewer alternative paths than Fat‑Tree, and more challenging fault diagnosis.
Higher‑dimensional Torus (4D/5D/6D) designs are emerging, where multiple “silicon‑atoms” each implement a 3D‑Torus and are interconnected to form larger direct networks.
Overall, each topology presents trade‑offs between cost, scalability, latency, and routing complexity. Architects must weigh these factors against workload communication patterns and hardware constraints when designing HPC clusters.