
Why Fat-Tree, Dragonfly, and Torus Topologies Dominate High‑Performance Computing Networks

High‑performance computing demands ultra‑low latency and massive scale, which has pushed network designers beyond traditional CLOS designs toward topologies such as Fat‑Tree, Dragonfly, and Torus. Each offers distinct trade‑offs in bandwidth, scalability, routing complexity, and cost‑effectiveness for modern data‑center and HPC environments.

Open Source Linux

High‑performance computing (HPC) workloads require both low static latency and support for ultra‑large scale networks. Traditional CLOS architectures prioritize generality, sacrificing latency and cost‑effectiveness.

Fat‑Tree

Fat‑Tree is a widely used topology that keeps bandwidth constant from leaf to root (no oversubscription), which enables non‑blocking forwarding. It can be provisioned at different throughput levels and scaled by adding network layers: a two‑level Fat‑Tree built from 40‑port InfiniBand switches can connect up to 800 GPUs, while a three‑level one can reach 16,000 GPUs. However, it requires many switches and links, which raises cost and complexity, and it handles one‑to‑all and all‑to‑all communication patterns poorly in applications such as MapReduce.

The ratio of switches to servers is high: a three‑level Fat‑Tree needs roughly 5M/n switches, where M is the number of servers and n is the number of ports per switch.

Limited support for one‑to‑all and all‑to‑all traffic.

Scalability limited by core‑layer switch port count.
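
The capacity figures and the 5M/n estimate above can be checked with a few lines of arithmetic. The following is a minimal sketch that assumes a full‑bisection Fat‑Tree built from identical switches, with half of each edge switch's ports facing hosts; the function names are illustrative and do not come from any library.

```python
def fat_tree_hosts(ports: int, levels: int) -> int:
    """Maximum hosts in a non-blocking fat-tree of identical switches.

    Assumes half the ports of each non-top switch face downward,
    which gives 2 * (ports / 2) ** levels hosts.
    """
    if ports < 2 or ports % 2 or levels < 1:
        raise ValueError("ports must be even and >= 2, levels must be >= 1")
    return 2 * (ports // 2) ** levels


def fat_tree_switches(ports: int, levels: int) -> int:
    """Switches needed for that fat-tree: (2 * levels - 1) * (ports / 2) ** (levels - 1)."""
    if ports < 2 or ports % 2 or levels < 1:
        raise ValueError("ports must be even and >= 2, levels must be >= 1")
    return (2 * levels - 1) * (ports // 2) ** (levels - 1)


if __name__ == "__main__":
    for levels in (2, 3):
        hosts = fat_tree_hosts(40, levels)
        switches = fat_tree_switches(40, levels)
        print(f"{levels}-level, 40-port: {hosts} hosts, {switches} switches")
```

For 40‑port switches and three levels this gives 16,000 hosts and 2,000 switches, which matches 5M/n = 5 × 16,000 / 40.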

Dragonfly

Dragonfly, introduced by John Kim et al. (2008), is a low‑diameter, cost‑effective direct‑connect topology widely adopted in HPC and data‑center networks. Its three‑level hierarchy consists of Switch, Group, and System layers. The topology is described by four parameters: p (ports per switch that connect to compute nodes), a (switches per group), h (inter‑group links per switch), and g (number of groups). The recommended balanced configuration is a = 2p = 2h.
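
To make the parameters concrete, the sketch below sizes a balanced Dragonfly from p alone. It assumes the maximum‑size arrangement, in which every pair of groups is joined by exactly one inter‑group link, so g = a·h + 1; the function name and the dictionary keys are illustrative, not part of any tool.

```python
def balanced_dragonfly(p: int) -> dict:
    """Size a balanced dragonfly (a = 2p = 2h) at its maximum group count."""
    if p < 1:
        raise ValueError("p must be a positive integer")
    h = p                      # inter-group links per switch
    a = 2 * p                  # switches per group
    g = a * h + 1              # assumption: one global link per pair of groups
    radix = p + (a - 1) + h    # node ports + intra-group ports + inter-group ports
    nodes = a * p * g          # compute nodes in the whole system
    return {"p": p, "a": a, "h": h, "g": g, "radix": radix, "nodes": nodes}


if __name__ == "__main__":
    # p = 16 needs 63 ports per switch, so it fits a 64-port switch.
    print(balanced_dragonfly(16))
```

With p = 16 the sketch reports 513 groups and 262,656 compute nodes on switches that use 63 ports each.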

Routing options include Minimal Routing (up to 3 hops), Non‑Minimal (Valiant) Routing (up to 5 hops), and Adaptive Routing (e.g., UGAL, UGAL‑L, UGAL‑G), which chooses between minimal and non‑minimal paths based on network load and generally performs better.
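
The adaptive choice can be illustrated with a simplified sketch of the UGAL idea: estimate the delay of each candidate path as queue depth times hop count and take the cheaper one. The queue depths are assumed inputs here, and the function name is hypothetical. Real implementations differ mainly in where those estimates come from (UGAL‑L uses only the local switch's queues, UGAL‑G assumes global knowledge).

```python
def ugal_choose(q_min: int, hops_min: int, q_val: int, hops_val: int) -> str:
    """Pick 'minimal' or 'valiant' by comparing estimated path delays.

    q_min / q_val: queue occupancy seen on the first hop of each candidate path.
    hops_min / hops_val: hop counts of the minimal and Valiant paths.
    Ties go to the minimal path, which uses fewer links.
    """
    if q_min * hops_min <= q_val * hops_val:
        return "minimal"
    return "valiant"


if __name__ == "__main__":
    # Lightly loaded network: the 3-hop minimal path wins.
    print(ugal_choose(q_min=1, hops_min=3, q_val=1, hops_val=5))
    # Congested minimal path: the 5-hop detour is estimated to be faster.
    print(ugal_choose(q_min=10, hops_min=3, q_val=2, hops_val=5))
```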

Provides good performance for various communication patterns and reduces hop count compared to CLOS.

Scales to more than 256,000 nodes with 64‑port switches while keeping the network diameter at only 3 hops.

Scaling requires rewiring the network, increasing complexity.

Torus

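Torus is a symmetric topology with low diameter, simple structure, multiple paths, and good scalability, making it suitable for collective communication. Variants such as 2D‑Torus (Sony) and 3D‑Torus (IBM) are expressed as k‑ary n‑cubes, where n is the number of dimensions and k is the number of nodes along each dimension. For example, a 3‑ary 3‑cube is a 3 × 3 × 3 arrangement of 27 nodes in which each node links to its two neighbors in every dimension, with wraparound at the edges.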
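
The k‑ary n‑cube notation maps directly onto coordinate arithmetic. The sketch below, with illustrative function names, treats each node as an n‑tuple of coordinates in the range 0 to k − 1 and uses modular arithmetic for the wraparound links.

```python
from itertools import product


def torus_neighbors(node: tuple, k: int) -> list:
    """Neighbors of a node in a k-ary n-cube: +/-1 in each dimension, modulo k."""
    result = set()
    for dim, coord in enumerate(node):
        for step in (-1, 1):
            nxt = list(node)
            nxt[dim] = (coord + step) % k
            result.add(tuple(nxt))
    result.discard(node)  # only relevant for the degenerate case k == 1
    return sorted(result)


def torus_distance(a: tuple, b: tuple, k: int) -> int:
    """Shortest hop count: per dimension, go whichever way around the ring is shorter."""
    return sum(min(abs(x - y), k - abs(x - y)) for x, y in zip(a, b))


def torus_diameter(k: int, n: int) -> int:
    """Largest shortest-path distance in a k-ary n-cube: n * floor(k / 2)."""
    return n * (k // 2)


if __name__ == "__main__":
    k, n = 3, 3
    nodes = list(product(range(k), repeat=n))
    print(len(nodes), "nodes")
    print("neighbors of (0, 0, 0):", torus_neighbors((0, 0, 0), k))
    print("diameter:", torus_diameter(k, n))
```

For the 3‑ary 3‑cube this lists 27 nodes, six neighbors per node, and a diameter of 3 hops.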

Advantages: lower latency, better locality, smaller network diameter than CLOS, reducing switch count and cost.

Disadvantages: unpredictable performance, scaling may require reconfiguration, fewer alternative paths than Fat‑Tree, and fault‑diagnosis can be more complex.

Higher‑dimensional Torus (4D/5D/6D) can be built by connecting multiple silicon‑tiles.
