Why Fat-Tree, Dragonfly, and Torus Topologies Matter in HPC Networks
The article examines the challenges of ultra‑large‑scale HPC networking, compares traditional CLOS with Fat‑Tree, Dragonfly, and Torus topologies, explains their bandwidth and latency characteristics, presents scalability formulas, and evaluates routing algorithms and practical trade‑offs for each design.
Background
High‑performance computing (HPC) workloads demand both low static latency and network fabrics that scale to very large node counts. Traditional CLOS architectures prioritize generality at the expense of latency and cost‑effectiveness, which has prompted research into alternative topologies such as Fat‑Tree, Dragonfly, and Torus.
Fat‑Tree Architecture
Fat‑Tree is named after a real tree, whose branches thicken toward the trunk: bandwidth is not oversubscribed toward the root, and each level keeps enough bandwidth for the network to be non‑blocking. It is widely adopted because it offers low latency and flexible throughput options.
The design follows a 1:1 non‑convergent (non‑oversubscribed) model: the uplink and downlink ports on each switch carry equal aggregate bandwidth, and data‑center‑grade non‑blocking switches are used. Adding hierarchical layers increases the number of GPU nodes the fabric can reach.
For a switch with n ports, a two‑level Fat‑Tree can connect up to n²/2 GPUs. Using a 40‑port InfiniBand switch, this yields a maximum of 800 GPUs. A three‑level Fat‑Tree can support up to n·(n/2)·(n/2) GPUs, reaching 16 000 GPUs with the same switch.
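The capacity figures above can be reproduced with a short calculation. This is a minimal sketch, assuming identical switches, a 1:1 split of ports between uplinks and downlinks on every level except the top, and all top‑level ports used as downlinks. The function name `fat_tree_capacity` and the generalization to an arbitrary number of levels are illustrative assumptions; the source only states the two‑level and three‑level cases.

```python
def fat_tree_capacity(ports: int, levels: int) -> int:
    """Maximum endpoints of a non-blocking Fat-Tree built from identical switches.

    Assumption: every switch below the top level uses half its ports as
    downlinks and half as uplinks; top-level switches use all ports as downlinks.
    """
    if ports < 2 or ports % 2 != 0:
        raise ValueError("ports must be an even number >= 2")
    if levels < 1:
        raise ValueError("levels must be >= 1")
    half = ports // 2
    # The top level contributes `ports`; each level below multiplies by ports/2.
    # levels=2 gives n * n/2 = n^2/2, levels=3 gives n * (n/2) * (n/2).
    return ports * half ** (levels - 1)


if __name__ == "__main__":
    for levels in (2, 3):
        print(f"{levels}-level, 40-port: {fat_tree_capacity(40, levels)} GPUs")
```

With 40‑port switches this gives 800 endpoints for two levels and 16 000 for three, matching the numbers quoted in the text.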
Disadvantages:
A high switch‑to‑server ratio raises hardware cost and cabling complexity.
The topology does not favor One‑to‑All or All‑to‑All traffic patterns, which limits its suitability for frameworks such as MapReduce and Dryad.
Scalability is bounded by the port count of the core switches.
Because Fat‑Tree is in essence a non‑oversubscribed CLOS network, large clusters need additional network layers and more optical fiber, and latency rises as the hop count grows.
Dragonfly Architecture
Dragonfly, introduced by John Kim et al. (2008), is a widely used direct‑connect topology featuring a small network diameter and low cost, making it suitable for both HPC clusters and heterogeneous data‑center workloads.
The topology consists of three hierarchical layers:
Switch layer : one switch connected to p compute nodes.
Group layer : a switches, fully interconnected (all‑to‑all), so each switch has a‑1 local links to the other switches in its group.
System layer : g groups, also fully interconnected, through the h global links on each switch.
Key parameters:
Ports per switch: k = p + (a‑1) + h (where h is the number of global links each switch has to other groups).
Maximum number of groups: g = a·h + 1.
Total compute nodes: N = a·p·(a·h + 1).
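The three formulas above can be collected into one small helper. This is a sketch only: the class name `Dragonfly` and its property names are hypothetical, and it assumes the maximum‑size case in which every group has exactly one global link to every other group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dragonfly:
    """Size parameters of a maximum-size Dragonfly (illustrative names)."""
    p: int  # compute nodes per switch
    a: int  # switches per group
    h: int  # global (inter-group) links per switch

    @property
    def radix(self) -> int:
        # k = p + (a - 1) + h: terminals + local links + global links
        return self.p + (self.a - 1) + self.h

    @property
    def groups(self) -> int:
        # g = a*h + 1: each group's a*h global links reach a*h other groups
        return self.a * self.h + 1

    @property
    def nodes(self) -> int:
        # N = a * p * g
        return self.a * self.p * self.groups


if __name__ == "__main__":
    d = Dragonfly(p=2, a=4, h=2)
    print(d.radix, d.groups, d.nodes)
```

For p = 2, a = 4, h = 2 the helper reports 7 ports per switch, 9 groups, and 72 compute nodes.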
A balanced configuration often uses a = 2p = 2h.
Routing algorithms include:
Minimal Routing : at most 1 global link and 2 local links, yielding ≤ 3 hops.
Non‑Minimal (Valiant) Routing : selects an intermediate group, using up to 2 global and 3 local links (≤ 5 hops).
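Adaptive Routing (e.g., UGAL and its variants UGAL‑L and UGAL‑G): chooses per packet between the minimal path and a non‑minimal path based on congestion estimates, keeping latency low under benign traffic while spreading load under adversarial traffic.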
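The adaptive choice can be sketched as a comparison of estimated delays. This is a simplified illustration of the UGAL idea, not a switch implementation: it approximates the delay of each candidate path as queue occupancy multiplied by hop count and picks the smaller. The function name `ugal_choose` and its parameters are hypothetical, and real implementations typically add a bias threshold and differ in where the queue estimates come from (local queues for UGAL‑L, global information for UGAL‑G).

```python
def ugal_choose(q_min: int, hops_min: int, q_val: int, hops_val: int) -> str:
    """Pick "minimal" or "valiant" by comparing estimated path delays.

    q_min / q_val: queue occupancy observed toward the minimal path and
    toward the randomly chosen non-minimal (Valiant) path.
    hops_min / hops_val: hop counts of the two candidate paths.
    Assumption: delay is approximated as queue occupancy * hop count.
    """
    if q_min * hops_min <= q_val * hops_val:
        return "minimal"
    return "valiant"


if __name__ == "__main__":
    # Lightly loaded minimal path: 2*3 <= 4*5, so stay minimal.
    print(ugal_choose(q_min=2, hops_min=3, q_val=4, hops_val=5))
    # Congested minimal path: 20*3 > 4*5, so detour through another group.
    print(ugal_choose(q_min=20, hops_min=3, q_val=4, hops_val=5))
```

The hop counts 3 and 5 used in the example are the upper bounds listed above for minimal and Valiant routing.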
Dragonfly performs well across diverse applications: under minimal routing any two switches are at most 3 hops apart, and with 64‑port switches the topology scales to on the order of 270 000 nodes.
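As a rough check on that scale, the sketch below sizes the largest balanced Dragonfly (a = 2p = 2h) that fits a given switch radix. The function name `largest_balanced_dragonfly` is hypothetical, and restricting the search to the balanced configuration is an assumption; unbalanced configurations that use every port can be somewhat larger.

```python
def largest_balanced_dragonfly(radix: int) -> dict:
    """Largest a = 2p = 2h Dragonfly whose switches need at most `radix` ports."""
    # With p = h and a = 2h, the ports needed are k = h + (2h - 1) + h = 4h - 1.
    h = (radix + 1) // 4
    if h < 1:
        raise ValueError("radix too small for a balanced Dragonfly")
    p, a = h, 2 * h
    groups = a * h + 1  # g = a*h + 1
    return {
        "p": p,
        "a": a,
        "h": h,
        "ports_used": p + (a - 1) + h,
        "groups": groups,
        "nodes": a * p * groups,  # N = a*p*g
    }


if __name__ == "__main__":
    print(largest_balanced_dragonfly(64))
```

For a 64‑port switch this gives p = h = 16 and a = 32, using 63 ports, with 513 groups and 262 656 compute nodes, which is the same order of magnitude as the figure quoted above.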
Torus Architecture
Torus networks are symmetric direct topologies well suited to collective communication in distributed machine learning. A torus is described as a k‑ary n‑cube, where k is the number of nodes along each dimension and n is the number of dimensions.
Examples:
2D‑Torus : each server links its X GPUs over private interconnects (e.g., NVLink), and at least two RDMA NICs per server provide the vertical connectivity.
3D‑Torus : extends the concept to three dimensions, enabling higher bandwidth and scalability.
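The k‑ary n‑cube structure can be made concrete with a few lines of code. This is a minimal sketch, assuming nodes are addressed by n coordinates in the range 0..k‑1 and that every dimension wraps around; the function names are illustrative and not taken from any library.

```python
from typing import List, Tuple

Coord = Tuple[int, ...]


def torus_neighbors(node: Coord, k: int) -> List[Coord]:
    """Direct neighbors of `node` in a k-ary n-cube (wraparound in every dimension)."""
    neighbors: List[Coord] = []
    for dim in range(len(node)):
        for step in (-1, 1):
            nxt = list(node)
            nxt[dim] = (node[dim] + step) % k  # wrap around the ring
            candidate = tuple(nxt)
            # For k <= 2 both directions can reach the same node; keep it once.
            if candidate != node and candidate not in neighbors:
                neighbors.append(candidate)
    return neighbors


def torus_hops(src: Coord, dst: Coord, k: int) -> int:
    """Minimal hop count: in each dimension take the shorter way around the ring."""
    return sum(min((d - s) % k, (s - d) % k) for s, d in zip(src, dst))


def torus_diameter(k: int, n: int) -> int:
    """Worst-case minimal hop count: floor(k/2) per dimension."""
    return n * (k // 2)


if __name__ == "__main__":
    print(torus_neighbors((0, 0), 4))
    print(torus_hops((0, 0), (3, 2), 4))
    print(torus_diameter(4, 2))
```

In a 4‑ary 2‑cube (a 4×4 2D‑Torus), node (0, 0) has four neighbors, node (3, 2) is 3 hops away thanks to the wraparound link, and no pair of nodes is more than 4 hops apart.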
Advantages:
Lower latency due to short direct links between neighboring nodes.
Improved locality, reducing communication overhead and power consumption.
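No dedicated switch tiers: nodes link directly to their neighbors, so fewer switches are needed than in a CLOS network of the same node count, and the diameter stays small as long as each dimension is short.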
Disadvantages:
Performance can be hard to predict.
Scaling may require reconfiguration of the entire topology.
Fewer alternative paths than Fat‑Tree, affecting load balancing.
Fault diagnosis can be more complex, though adaptive routing mitigates impact.
Higher‑dimensional Torus variants (4D/5D/6D) are emerging, where a “silicon element” implements a 3D‑Torus internally, and multiple elements combine to form larger direct networks.
Architects' Tech Alliance
