Dual ToR and Dual‑Plane Designs: Boosting AI Training Performance in Large‑Scale Data Centers
This article explains how non-stacked dual-ToR and dual-plane network architectures, combined with single-chip high-performance switches and multi-rail host networking, dramatically improve reliability, load balancing, and end-to-end training speed for massive AI models such as GPT-3 175B.
Background
Traditional data‑center networks often use a single‑ToR (Top‑of‑Rack) design where each NIC port connects to a ToR via a single cable. This architecture is vulnerable to switch or link failures, which can severely impact large‑scale AI training workloads.
Non‑stacked Dual‑ToR Architecture
In a non-stacked dual-ToR design, each host NIC's two ports connect to two separate ToR switches in a primary-backup configuration. Both ports share the same IP and MAC address, and the NIC's queue-pair (QP) context is duplicated across them, allowing traffic to fail over without interrupting active flows. The host replicates every ARP message to both NIC ports, keeping the ARP tables on the two ToRs synchronized.
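The sketch below is a minimal Python model of this bonding behaviour, under simplified assumptions; the names (DualTorBond, send_garp, port names) are hypothetical illustrations, not the production implementation:

```python
# Hypothetical sketch of the dual-ToR bonding described above.
# All class and helper names here are illustrative, not Alibaba's code.

from dataclasses import dataclass, field

@dataclass
class Port:
    name: str          # e.g. "eth0" -> ToR-A, "eth1" -> ToR-B
    up: bool = True

def send_garp(port: Port, ip: str, mac: str) -> None:
    # Stand-in for emitting a gratuitous ARP on one physical port.
    print(f"GARP on {port.name}: {ip} is-at {mac}")

@dataclass
class DualTorBond:
    ip: str            # single IP shared by both physical ports
    mac: str           # single MAC shared by both physical ports
    primary: Port = field(default_factory=lambda: Port("eth0"))
    backup: Port = field(default_factory=lambda: Port("eth1"))

    def announce_arp(self) -> None:
        # Replicate every ARP announcement to *both* ports so both ToRs
        # hold an identical IP->MAC binding for this host.
        for port in (self.primary, self.backup):
            send_garp(port, self.ip, self.mac)

    def active_port(self) -> Port:
        # Because IP/MAC (and, on the NIC, the duplicated QP context) are
        # shared, flows survive a ToR or link failure by re-steering.
        return self.primary if self.primary.up else self.backup

bond = DualTorBond(ip="10.0.0.7", mac="02:00:00:00:00:07")
bond.announce_arp()
bond.primary.up = False                    # simulate ToR-A failure
assert bond.active_port().name == "eth1"   # traffic continues on ToR-B
```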
Because the two ToRs are independent devices, they cannot complete standard LACP negotiation on their own. A custom LACP module, co-developed with switch vendors, makes both switches present an identical MAC address while keeping distinct port IDs, enabling the host to treat the dual-ToR pair as a single logical link.
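A rough Python illustration of the idea follows, using standard 802.1AX field names (system MAC, key, port ID); the values and the aggregation check are simplified stand-ins for the vendor module's real logic:

```python
# Illustrative model of the LACP actor fields involved. The field names
# come from 802.1AX; the check itself is a simplification.

from dataclasses import dataclass

@dataclass(frozen=True)
class LacpActorInfo:
    system_mac: str   # system identity the switch advertises
    key: int          # aggregation key
    port_id: int      # must stay distinct per physical port

# Two *independent* ToRs configured to advertise the same virtual system
# MAC and key, so the host aggregates both links into one LAG.
tor_a = LacpActorInfo(system_mac="02:aa:bb:cc:dd:01", key=100, port_id=1)
tor_b = LacpActorInfo(system_mac="02:aa:bb:cc:dd:01", key=100, port_id=2)

def forms_single_lag(a: LacpActorInfo, b: LacpActorInfo) -> bool:
    # A host bonds two links only if the partner looks like one system
    # (same MAC and key) reachable on two different ports.
    return (a.system_mac, a.key) == (b.system_mac, b.key) and a.port_id != b.port_id

assert forms_single_lag(tor_a, tor_b)
```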
Building a 1K-Card-Scale Layer-2 Network
The high-performance network (HPN) employs a 51.2 Tbit/s single-chip Ethernet switch. In each Tier-1 segment the switch provides 128 × 200 Gbit/s downlink ports (plus 8 spares) and 60 × 400 Gbit/s uplink ports, giving an oversubscription ratio of roughly 1.07:1. The spare downlink ports connect standby hosts for rapid failover.
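The stated ratio follows directly from the port counts, as this quick check shows:

```python
# Oversubscription for one Tier-1 segment, from the port counts above.
downlink_gbps = 128 * 200   # host-facing capacity: 25,600 Gbit/s
uplink_gbps = 60 * 400      # aggregation-facing capacity: 24,000 Gbit/s
print(f"{downlink_gbps / uplink_gbps:.3f} : 1")  # 1.067 : 1, i.e. ~1.07 : 1
```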
Single‑chip switches are preferred over multi‑chip chassis switches because operational data shows the latter have 3.77 × higher critical failure rates due to their distributed architecture.
Multi‑Rail Host Networking
Each host's eight GPUs are interconnected via a high-bandwidth intra-host network that offers 4–9× the bandwidth of the NIC's 2 × 200 Gbit/s links. In this "multi-rail" topology, first introduced by NVIDIA, the NICs on the same rail connect to the same non-stacked dual-ToR pair, while GPUs on different rails communicate through a combination of intra-host and inter-host forwarding.
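A hypothetical Python sketch of this rail-aligned path selection, assuming eight rails and a simple GPU-index-to-rail mapping (the helper names and path strings are illustrative only):

```python
# Toy rail mapping: GPU i on every host attaches, via its NIC port, to
# rail i, and all rail-i NICs land on the same dual-ToR pair.

NUM_RAILS = 8  # eight GPUs per host, one rail each

def rail_of(gpu_index: int) -> int:
    return gpu_index % NUM_RAILS

def path(src_gpu: int, dst_gpu: int, same_host: bool) -> str:
    if same_host:
        return "intra-host (high-bandwidth internal fabric)"
    if rail_of(src_gpu) == rail_of(dst_gpu):
        # Same rail on two hosts: one hop through that rail's ToR pair.
        return f"rail-{rail_of(src_gpu)} dual-ToR pair"
    # Cross-rail traffic first moves to the destination rail inside the
    # source host, then leaves over that rail's NICs.
    return (f"intra-host hop to a rail-{rail_of(dst_gpu)} GPU, "
            f"then the rail-{rail_of(dst_gpu)} dual-ToR pair")

print(path(0, 8, same_host=False))  # same rail: stays in Tier 1
print(path(0, 1, same_host=False))  # cross rail: intra-host forwarding first
```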
Dual‑Plane Design to Eliminate Hash Polarization
In the dual-plane configuration, the two switches of each dual-ToR pair are assigned to two independent network planes. Traffic entering a ToR uplink stays within its plane and follows a deterministic path through the pod, removing hash-based load imbalance (hash polarization). This reduces downlink queue lengths by 91.8 % and improves cross-segment traffic performance by up to 71.6 %.
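The toy model below contrasts hash-based ECMP spreading with a fixed per-plane path; the hash function, flow counts, and path rule are illustrative only, not the production forwarding logic:

```python
# Toy comparison: single-plane ECMP picks an uplink by hash, which skews
# load (and collapses badly when stages reuse the same hash), whereas in
# a dual-plane design a flow's plane is fixed by the NIC port it leaves
# on, and the path within that plane is deterministic.

UPLINKS = 60

def ecmp_uplink(flow_id: int, seed: int) -> int:
    # Hash-based pick over all uplinks.
    return hash((flow_id, seed)) % UPLINKS

def dual_plane_uplink(flow_id: int) -> int:
    # One deterministic uplink per (plane, flow); no re-hashing per tier.
    plane = flow_id % 2
    return plane * (UPLINKS // 2) + (flow_id // 2) % (UPLINKS // 2)

ecmp_load = [0] * UPLINKS
plane_load = [0] * UPLINKS
for f in range(1000):
    ecmp_load[ecmp_uplink(f, seed=42)] += 1
    plane_load[dual_plane_uplink(f)] += 1

print("ECMP  max/min per-uplink load:", max(ecmp_load), min(ecmp_load))
print("Plane max/min per-uplink load:", max(plane_load), min(plane_load))
```

Running this, the ECMP column shows a wide spread between the busiest and idlest uplinks, while the per-plane column stays within one flow of perfectly even.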
Testing with 512 GPUs running four AllReduce jobs showed a 34.7 % boost in collective‑communication performance.
Performance Evaluation
A proprietary large model was trained on the HPN using over 2,300 GPUs across 288 servers. Migrating the job from a traditional DCN+ network spanning 19 segments to the HPN spanning 3 segments raised end-to-end training throughput by more than 14.9 %. Cross-segment traffic dropped by 37 %, and downlink queue lengths on the aggregation switches fell dramatically.
Key Takeaways
Adopting non‑stacked dual‑ToR, dual‑plane, and single‑chip switch architectures provides higher fault tolerance, better load distribution, and significant performance gains for AI training at scale.