What Meta’s RDMA‑over‑Ethernet Paper Reveals About Scaling AI Training Networks
This article provides a detailed technical analysis of Meta's SIGCOMM paper on RDMA over Ethernet for large‑scale AI training, examining the physical network deployment, congestion‑control mechanisms, topology choices, routing strategies, hardware design, and the practical challenges that remain.
TL;DR
Meta’s paper describes a RoCE‑based AI training network built on a symmetric spine‑leaf topology, enhanced ECMP routing, and a receiver‑driven congestion‑control mechanism that replaces DCQCN.
Key Design Choices
Spine‑leaf symmetric topology using copper DACs between TOR (RTSW) and servers, avoiding expensive 400 G optical modules.
Each server (Grand Teton) hosts eight 400 G RDMA NICs; pods contain 3072 H100 GPUs (16 × 192 cabinets) with a maximum 3‑hop path.
Cluster switches (CTSW) are Arista 7800 frame‑based switches built on Broadcom Jericho‑3 chips with large on‑chip buffers and HBM, supporting up to 576 × 400 G ports via VOQ scaling.
Enhanced ECMP distributes traffic by increasing NCCL_IB_QPS_PER_CONNECTION and by programming switch UDFs to hash on the RoCE destination QP.
Receiver‑driven Clear‑to‑Send (CTS) flow control limits inflight traffic using high‑priority queues on CTSW.
Spine redundancy ratio 1:1.125 tolerates two spine failures; 12.5 % extra links added for fault tolerance.
Observed Benefits
Cost reduction by using DACs instead of optical modules.
Failure domains limited to a few thousand GPUs; fast migration after a GPU or TOR failure.
Large buffers on CTSW absorb collective‑communication bursts, enabling high utilization (~80 %) of the fabric.
Window‑based congestion control in NCCL improves large‑incast performance.
Remaining Challenges
Frame‑based switches increase network cost by ~30 %.
Static latency of 22 µs limits small‑message All‑to‑All performance.
RoCE multi‑path hashing issues remain unresolved; single‑port 400 G NICs raise reliability concerns.
DCQCN is effectively moved from NIC buffers to CTSW, requiring careful CTS configuration.
Solution relies on extensive configuration hacks; not a plug‑and‑play deployment.
Flowlet‑level switching and fault‑convergence speed are still exploratory.
Hardware Overview
Training nodes use the Grand Teton platform with eight 400 G RDMA NICs per server. The network consists of:
RTSW (TOR) switches connected to servers via DAC cables.
CTSW (cluster) switches based on Arista 7800 / Broadcom Jericho‑3, providing deep buffers and HBM.
Each pod: 3072 H100 GPUs, 86 Tbps aggregate bandwidth, exceeding a single Jericho‑3 chip’s 51.2 Tbps limit through multi‑stage VOQ.
Routing Mechanisms
Path Pinning
Meta slices traffic on RTSW to pin packets to specific paths; this works only in fault‑free conditions, otherwise performance can drop up to 30 %.
Enhanced ECMP
Increasing NCCL_IB_QPS_PER_CONNECTION=16 spreads traffic across more QPs but can degrade collective performance. Switches are programmed via UDF to include the RoCE destination QP in the hash, achieving >97.5 % link utilization with 1–2 QPs.
Centralized Traffic Engineering
A collector discovers topology, runs CSPF, and installs policy‑based routing rules matching <source port, destination prefix>. This approach contrasts with decentralized SDN designs.
Future Direction: Flowlet Switching
Meta mentions flowlet‑level switching as exploratory; Broadcom’s DLB solution approximates packet‑spray at the flowlet granularity.
Transport Layer and Congestion Control
Receiver‑Driven CTS
DCQCN and PFC perform poorly at 400 Gbps. Meta replaces them with a Clear‑to‑Send (CTS) mechanism: receivers send CTS messages that grant senders permission to transmit, using high‑priority queues on CTSW.
Experiments show CTS improves large‑incast latency compared to vanilla Perftest.
Discussion
While DCQCN remains the “gold standard” for storage‑oriented RoCE, Meta’s workload demonstrates that custom congestion algorithms combined with deep switch buffers can replace it, albeit with a heavy reliance on configuration hacks. A clean window‑based controller on the NIC would be preferable.
Operational Experience
Co‑tuning of NCCL and Network
CTS‑driven design yields 22 µs idle latency, harming small‑message collectives. Mitigations include reducing NCCL channels, using LL128/LL protocols, or tree‑based algorithms, and tuning NIC/PCIe credits.
Impact of Routing and Topology
Four phases of bandwidth scaling are described, ending with a 1:1.125 convergence ratio to handle link failures, achieving ~80 % fabric utilization.
Observability
Metrics collected include NIC out‑of‑order packets, switch buffer counters, PFC watchdogs, and connectivity pings. Faults such as CTSW software bugs and SRAM buffer limits are monitored.
References
RDMA over Ethernet for Distributed AI Training at Meta Scale: https://dl.acm.org/doi/10.1145/3651890.3672233
A Decentralized SDN Architecture for the WAN: https://research.google/pubs/a-decentralized-sdn-architecture-for-the-wan/
Video for RDMA over Ethernet for Distributed AI Training at Meta Scale: https://www.youtube.com/watch?v=wLW3UzUw5rY
Optimized Network Architectures for Large Language Model Training with Billions of Parameters: https://arxiv.org/pdf/2307.12169v2
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
