What Meta’s RDMA‑over‑Ethernet Paper Reveals About Scaling AI Training Networks

This article provides a detailed technical analysis of Meta's SIGCOMM paper on RDMA over Ethernet for large‑scale AI training, examining the physical network deployment, congestion‑control mechanisms, topology choices, routing strategies, hardware design, and the practical challenges that remain.

Baobao Algorithm Notes

Aug 13, 2024

TL;DR

Meta’s paper describes a RoCE‑based AI training network built on a symmetric spine‑leaf topology, enhanced ECMP routing, and a receiver‑driven congestion‑control mechanism that replaces DCQCN.

Key Design Choices

Spine‑leaf symmetric topology using copper DACs between TOR (RTSW) and servers, avoiding expensive 400 G optical modules.

Each server (Grand Teton) hosts eight 400 G RDMA NICs; pods contain 3072 H100 GPUs (16 × 192 cabinets) with a maximum 3‑hop path.

Cluster switches (CTSW) are Arista 7800 chassis‑based (modular) switches built on Broadcom Jericho‑3 chips with large on‑chip buffers and HBM, supporting up to 576 × 400 G ports via VOQ scaling.

Enhanced ECMP distributes traffic by increasing NCCL_IB_QPS_PER_CONNECTION and by programming switch UDFs to hash on the RoCE destination QP.

Receiver‑driven Clear‑to‑Send (CTS) flow control limits inflight traffic using high‑priority queues on CTSW.

Spine uplinks are provisioned at a 1:1.125 ratio, that is, 12.5 % extra links beyond 1:1, which tolerates two spine failures.

Observed Benefits

Cost reduction by using DACs instead of optical modules.

Failure domains limited to a few thousand GPUs; fast migration after a GPU or TOR failure.

Large buffers on CTSW absorb collective‑communication bursts, enabling high utilization (~80 %) of the fabric.

Window‑based congestion control in NCCL improves large‑incast performance.

Remaining Challenges

Chassis‑based switches increase network cost by ~30 %.

Static latency of 22 µs limits small‑message All‑to‑All performance.

RoCE multi‑path hashing issues remain unresolved; single‑port 400 G NICs raise reliability concerns.

Congestion handling effectively moves from DCQCN on the NIC to buffering on the CTSW, which requires careful CTS configuration.

Solution relies on extensive configuration hacks; not a plug‑and‑play deployment.

Flowlet‑level switching and fault‑convergence speed are still exploratory.

Hardware Overview

Training nodes use the Grand Teton platform with eight 400 G RDMA NICs per server. The network consists of:

RTSW (TOR) switches connected to servers via DAC cables.

CTSW (cluster) switches based on Arista 7800 / Broadcom Jericho‑3, providing deep buffers and HBM.

Each pod: 3072 H100 GPUs, 86 Tbps aggregate bandwidth, exceeding a single Jericho‑3 chip’s 51.2 Tbps limit through multi‑stage VOQ.

Routing Mechanisms

Path Pinning

Meta slices traffic on the RTSW and pins each slice to a specific path. This works only under fault‑free conditions; once a failure occurs, performance can drop by up to 30 %.

Enhanced ECMP

Setting NCCL_IB_QPS_PER_CONNECTION=16 spreads each connection's traffic across more QPs, but it can degrade collective performance. The switches are therefore also programmed, via UDF, to include the RoCE destination QP in the ECMP hash, which achieves >97.5 % link utilization with only 1–2 QPs.
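To see why the hash input matters, here is a minimal Python sketch. It is not the switch's actual hash: `pick_uplink`, the use of CRC32, the addresses, the uplink count, and the QP numbers are all assumptions made for illustration. RoCEv2 runs over UDP destination port 4791; the sketch further assumes that all QPs of one connection share the same UDP source port, so a plain five‑tuple hash sees them as a single flow.

```python
import zlib

ROCE_V2_UDP_PORT = 4791  # well-known RoCEv2 UDP destination port
NUM_UPLINKS = 8          # hypothetical uplink count on one RTSW


def pick_uplink(src_ip, dst_ip, src_port, dst_qp=None, num_uplinks=NUM_UPLINKS):
    """Toy ECMP: hash header fields and return an uplink index.

    CRC32 stands in for the switch ASIC's hash (an assumption).
    When dst_qp is given it is appended to the hash input, mimicking a
    UDF that extracts the destination QP from the RoCE header.
    """
    fields = [src_ip, dst_ip, "udp", str(src_port), str(ROCE_V2_UDP_PORT)]
    if dst_qp is not None:
        fields.append(str(dst_qp))
    key = "|".join(fields).encode()
    return zlib.crc32(key) % num_uplinks


# One GPU-to-GPU connection carried over 16 QPs (illustrative QP numbers).
qps = range(0x100, 0x110)
five_tuple_only = {pick_uplink("10.0.0.1", "10.0.1.1", 49152) for _ in qps}
with_dst_qp = {pick_uplink("10.0.0.1", "10.0.1.1", 49152, dst_qp=q) for q in qps}

print("uplinks used, 5-tuple hash:", len(five_tuple_only))
print("uplinks used, 5-tuple + destination QP:", len(with_dst_qp))
```

With the five‑tuple alone, every QP of the connection lands on the same uplink; adding the destination QP lets the same connection spread over several uplinks without changing anything on the host.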

Centralized Traffic Engineering

A collector discovers topology, runs CSPF, and installs policy‑based routing rules matching <source port, destination prefix>. This approach contrasts with decentralized SDN designs.
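CSPF (constrained shortest path first) can be read as "prune the links that violate a constraint, then run an ordinary shortest‑path search". The sketch below illustrates that idea only; the `cspf` function, the bandwidth constraint, and the four‑link topology are hypothetical and are not taken from Meta's controller.

```python
import heapq


def cspf(links, src, dst, min_bw_gbps):
    """Constrained shortest path first.

    links: dict mapping (u, v) -> (cost, available_bw_gbps), directed.
    Links with less than min_bw_gbps available are pruned, then
    Dijkstra runs on what is left. Returns a node list or None.
    """
    graph = {}
    for (u, v), (cost, bw) in links.items():
        if bw >= min_bw_gbps:  # the "constraint" step
            graph.setdefault(u, []).append((v, cost))

    best = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == dst:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, cost in graph.get(node, []):
            cand = dist + cost
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                prev[nxt] = node
                heapq.heappush(heap, (cand, nxt))

    if dst not in best:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


# Hypothetical two-spine fabric: the uplink towards ctsw1 is nearly full.
links = {
    ("rtsw1", "ctsw1"): (1, 50), ("ctsw1", "rtsw2"): (1, 400),
    ("rtsw1", "ctsw2"): (1, 400), ("ctsw2", "rtsw2"): (1, 400),
}
path = cspf(links, "rtsw1", "rtsw2", min_bw_gbps=200)
print(path)
```

A controller in this style would then translate each computed path into a forwarding rule on the RTSW; how Meta encodes those rules beyond the <source port, destination prefix> match is not described here.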

Future Direction: Flowlet Switching

Meta mentions flowlet‑level switching as exploratory; Broadcom’s DLB solution approximates packet‑spray at the flowlet granularity.
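The core idea of flowlet switching is that a flow may change paths only after an idle gap long enough for earlier packets to drain, so packets are not reordered. The following is a toy sketch of that rule, not Broadcom's DLB: the `FlowletSwitcher` class, the gap threshold, and the round‑robin path choice (a stand‑in for a load‑aware choice) are all assumptions.

```python
class FlowletSwitcher:
    """Toy flowlet switching: re-pick a path only after an idle gap."""

    def __init__(self, num_paths, gap_us):
        self.num_paths = num_paths
        self.gap_us = gap_us      # idle time that ends a flowlet
        self.next_path = 0        # round-robin stand-in for load awareness
        self.state = {}           # flow_id -> (last_seen_us, path)

    def route(self, flow_id, now_us):
        last = self.state.get(flow_id)
        if last is None or now_us - last[0] > self.gap_us:
            # New flowlet: the previous burst has had time to drain,
            # so moving the flow should not reorder packets.
            path = self.next_path
            self.next_path = (self.next_path + 1) % self.num_paths
        else:
            path = last[1]        # same flowlet: stay on the same path
        self.state[flow_id] = (now_us, path)
        return path


switcher = FlowletSwitcher(num_paths=4, gap_us=10)
# Three back-to-back packets, a 48 us pause, then two more packets.
paths = [switcher.route("qp-17", t) for t in (0, 1, 2, 50, 51)]
print(paths)
```

The first burst stays on one path and the second burst moves to another, which is the sense in which flowlet switching approximates packet spraying at a coarser granularity.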

Transport Layer and Congestion Control

Receiver‑Driven CTS

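DCQCN proved hard to tune at 400 Gbps, so Meta runs without it. In its place, a receiver‑driven Clear‑to‑Send (CTS) mechanism admits traffic: a sender transmits only after the receiver sends a CTS message granting permission, and CTS packets travel in high‑priority queues on the CTSW. PFC stays enabled as link‑level flow control.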

Experiments show CTS improves large‑incast latency compared to vanilla Perftest.
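The effect of receiver‑driven admission on incast can be illustrated with a small simulation. This is a simplified model under stated assumptions, not the collective library's implementation: `simulate_cts`, the chunk granularity, the FIFO grant order, and the window of 8 are all hypothetical.

```python
from collections import deque


def simulate_cts(num_senders, chunks_per_sender, window):
    """Receiver-driven admission: at most `window` chunks in flight.

    Each sender queues one request per chunk. The receiver answers with
    a CTS grant only while fewer than `window` granted chunks are still
    outstanding. Returns (grant_order, max_inflight_seen).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    pending = deque(
        (s, c) for c in range(chunks_per_sender) for s in range(num_senders)
    )
    inflight = deque()
    grant_order = []
    max_inflight = 0

    while pending or inflight:
        # Receiver issues CTS grants until the window is full.
        while pending and len(inflight) < window:
            chunk = pending.popleft()
            inflight.append(chunk)
            grant_order.append(chunk)
        max_inflight = max(max_inflight, len(inflight))
        # The oldest granted chunk arrives, freeing one window slot.
        inflight.popleft()

    return grant_order, max_inflight


order, peak = simulate_cts(num_senders=32, chunks_per_sender=4, window=8)
print("chunks delivered:", len(order), "peak in flight:", peak)
```

However many senders target the receiver, the amount of data in flight is bounded by the window the receiver chooses, which is what keeps a large incast from overrunning buffers along the path.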

Discussion

While DCQCN remains the “gold standard” for storage‑oriented RoCE, Meta’s workload demonstrates that custom congestion algorithms combined with deep switch buffers can replace it, albeit with a heavy reliance on configuration hacks. A clean window‑based controller on the NIC would be preferable.

Operational Experience

Co‑tuning of NCCL and Network

The CTS‑driven design yields a static (idle) latency of 22 µs, which hurts small‑message collectives. Mitigations include reducing the number of NCCL channels, using the LL128/LL protocols or tree‑based algorithms, and tuning NIC/PCIe credits.
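A back‑of‑the‑envelope model shows why a fixed 22 µs matters so much for small messages. The sketch assumes one blocking message at a time on a single 400 G link with no pipelining, which is a deliberate simplification; `link_efficiency` is a hypothetical helper, and only the 22 µs and 400 G figures come from the text.

```python
STATIC_LATENCY_S = 22e-6  # static latency quoted in the article
LINK_RATE_BPS = 400e9     # one 400 G NIC


def link_efficiency(message_bytes, latency_s=STATIC_LATENCY_S,
                    rate_bps=LINK_RATE_BPS):
    """Fraction of elapsed time spent actually moving bytes."""
    serialization_s = message_bytes * 8 / rate_bps
    return serialization_s / (latency_s + serialization_s)


for size in (4 * 1024, 1024 * 1024, 64 * 1024 * 1024):
    print(f"{size:>10d} B -> {link_efficiency(size):.1%} of line rate")
```

Under these assumptions a 4 KiB message uses well under 1 % of the link, a 1 MiB message roughly half, and only messages in the tens of megabytes approach line rate, which is why small‑message All‑to‑All is the workload that suffers.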

Impact of Routing and Topology

Four phases of bandwidth scaling are described, ending with a 1:1.125 subscription ratio (12.5 % more uplink than downlink capacity) to absorb link failures, achieving ~80 % fabric utilization.
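The 1:1.125 figure can be checked with simple arithmetic. The port counts below (16 downlinks, 18 uplinks) are an assumption chosen because they reproduce the ratio; the sketch also assumes all links run at the same speed, and `uplink_headroom` is a hypothetical helper.

```python
from fractions import Fraction


def uplink_headroom(downlinks, uplinks):
    """Return (ratio, spare_fraction, tolerable_uplink_failures).

    Assumes every link runs at the same speed, so capacity is
    proportional to link count.
    """
    ratio = Fraction(uplinks, downlinks)     # uplink : downlink capacity
    spare = ratio - 1                        # extra capacity above 1:1
    tolerable = max(uplinks - downlinks, 0)  # failures before dropping below 1:1
    return ratio, spare, tolerable


# Hypothetical RTSW with 16 server-facing and 18 spine-facing links.
ratio, spare, tolerable = uplink_headroom(downlinks=16, uplinks=18)
print(float(ratio), float(spare), tolerable)
```

With these assumed port counts, the switch keeps full downlink bandwidth after losing up to two uplinks, which matches the two‑failure tolerance mentioned earlier.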

Observability

Metrics collected include NIC out‑of‑order packets, switch buffer counters, PFC watchdogs, and connectivity pings. Faults such as CTSW software bugs and SRAM buffer limits are monitored.

References

RDMA over Ethernet for Distributed AI Training at Meta Scale: https://dl.acm.org/doi/10.1145/3651890.3672233

A Decentralized SDN Architecture for the WAN: https://research.google/pubs/a-decentralized-sdn-architecture-for-the-wan/

Video for RDMA over Ethernet for Distributed AI Training at Meta Scale: https://www.youtube.com/watch?v=wLW3UzUw5rY

Optimized Network Architectures for Large Language Model Training with Billions of Parameters: https://arxiv.org/pdf/2307.12169v2
