Comparison of Fat-Tree, Dragonfly, and Torus Network Topologies for AI and High‑Performance Computing
The article reviews Fat‑Tree, Dragonfly, and Torus network topologies, analyzing their bandwidth, scalability, latency, routing algorithms, and cost trade‑offs for AI‑driven high‑performance computing clusters, and highlights each design's strengths and limitations in large‑scale deployments.
AI‑driven high‑performance computing (HPC) workloads require ultra‑low latency and massive scalability, which traditional Clos architectures struggle to provide because their emphasis on generality comes at the expense of delay and cost‑effectiveness.
Fat‑Tree eliminates bandwidth convergence by using a 1:1 non‑convergent design, giving each switch equal uplink and downlink bandwidth. With an n‑port switch, a two‑level Fat‑Tree can connect up to n²/2 GPU cards (e.g., a 40‑port InfiniBand switch can support up to 800 GPUs), while a three‑level variant can reach n·(n/2)·(n/2) = n³/4 GPUs (up to 16,000 in the same example). Its advantages are non‑blocking forwarding and high throughput, but it suffers from high switch and cabling costs, poor support for One‑to‑All/All‑to‑All traffic patterns (affecting MapReduce, Dryad, etc.), and scalability limited by core‑layer port counts.
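The capacity formulas above can be checked with a short sketch. This is an illustrative helper (the function name and structure are my own, not from any networking library), assuming identical n‑port switches and the 1:1 non‑convergent split described in the article:

```python
def fat_tree_capacity(n_ports: int, levels: int) -> int:
    """Maximum end hosts (e.g., GPUs) in a non-blocking fat-tree
    built from identical n-port switches, per the formulas above."""
    if levels == 2:
        return n_ports ** 2 // 2        # n * (n/2): half the ports face down
    if levels == 3:
        return n_ports ** 3 // 4        # n * (n/2) * (n/2)
    raise ValueError("only 2- and 3-level fat-trees are handled here")

# Reproduces the article's 40-port InfiniBand example:
print(fat_tree_capacity(40, 2))   # 800
print(fat_tree_capacity(40, 3))   # 16000
```

The cubic growth of the three‑level formula is also why core‑layer port counts become the scaling bottleneck: every extra level multiplies capacity by only n/2 while adding a full tier of switches and cabling.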
Dragonfly is a widely adopted low‑diameter direct‑connect topology introduced by John Kim et al. (2008). It consists of three layers: Switch, Group, and System. Key parameters are p (ports to compute nodes), a (switches per group), h (inter‑group links per switch), and g (number of groups, g = a·h + 1). The total number of compute nodes is N = a·p·(a·h + 1). Routing algorithms include Minimal Routing (≤3 hops), Non‑Minimal (VAL/VLB, ≤5 hops), and Adaptive Routing (e.g., UGAL, UGAL‑L, UGAL‑G) that dynamically choose between shortest and longer paths based on congestion, offering better performance under load.
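The Dragonfly parameters compose into the node count mechanically, which a small sketch makes concrete. This is a hypothetical helper of my own (not from any library), assuming the fully connected inter‑group layout implied by g = a·h + 1:

```python
def dragonfly_size(p: int, a: int, h: int) -> dict:
    """Group and node counts for a fully connected Dragonfly:
    p ports to compute nodes per switch, a switches per group,
    h inter-group links per switch."""
    g = a * h + 1              # each group reaches every other group directly
    return {"groups": g, "nodes": a * p * g}   # N = a * p * (a*h + 1)

# Example using the balanced configuration a = 2p = 2h suggested by
# Kim et al. (2008), here with h = 4:
print(dragonfly_size(p=4, a=8, h=4))   # {'groups': 33, 'nodes': 1056}
```

Note how the node count grows with the product a·h: adding global links per switch (h) scales the system far faster than adding terminal ports (p), which is the source of Dragonfly's low‑diameter scalability.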
Torus provides a symmetric, low‑diameter topology with a simple structure and multiple paths, making it suitable for collective communication in distributed machine learning. Variants such as 2D‑Torus, 3D‑Torus, and higher‑dimensional (4D/5D/6D) designs are expressed as k‑ary n‑cubes, where k is the length of each dimension and n is the number of dimensions. Advantages include lower latency, better locality, and reduced network diameter compared with Clos. Drawbacks involve unpredictable performance, complex scaling (requiring full re‑wiring), fewer alternative paths than Fat‑Tree, and more challenging fault diagnosis.
Overall, Fat‑Tree excels in non‑blocking bandwidth but incurs higher cost; Dragonfly offers a balanced trade‑off of low diameter and moderate cost with sophisticated routing; Torus delivers low latency and simplicity but can be harder to scale and manage. Selecting the appropriate topology depends on workload communication patterns, scalability requirements, and budget constraints.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.