
Scaling AI Training: Inside Large-Scale RDMA Networks and Modern Congestion Controls

This article explores the hardware and networking foundations for training massive AI models, detailing the challenges of large‑scale RDMA deployment, the evolution of congestion‑control algorithms like DCQCN, TIMELY, HPCC, and AWS's SRD, and how hardware offload and programmable switches enable scalable, low‑latency AI infrastructure.


Preface

This article is the final piece in the AI Infrastructure series on the hardware foundations required to train large AI models: GPUs, GPU servers, and RDMA networking.


Large‑Scale RDMA Networking Challenges

With the rise of AI large‑model training, demand for RDMA networks has surged. Cloud providers have been researching large‑scale RDMA deployments to achieve scale‑out and self‑service models.

The main RDMA solutions are NVIDIA InfiniBand and RoCEv2 (RDMA over Converged Ethernet v2). InfiniBand faces cost and scale limits, while RoCEv2 offers lower cost and an open ecosystem, making it the preferred choice for large-scale deployments.

Deep Integration of Algorithms and Hardware

Innovations in Congestion‑Control Algorithms

Earlier installments in this series identified the deadlock and pause-storm issues caused by PFC, as well as the slow feedback of ECN, in RoCEv2 networks. Optimizing congestion-control algorithms is therefore a key avenue for scaling.

The industry is designing new congestion-control algorithms that build on lossless-network techniques.

Hardware Offload Innovations

SmartNIC/DPU and programmable switches now provide sufficient compute to offload congestion‑control algorithms.

In large-scale RDMA deployments, the number of queue pairs (QPs, the L4 transport contexts identified by the BTH header) can exceed the limited memory of traditional RNICs, so QP context must be kept in host DRAM and cached on the NIC; misses on that cache introduce latency.

Cache management is itself a challenge: misses trigger page-table lookups across PCIe, which can manifest as the "slow receiver" symptom.

The RNIC contains MPT (Memory Protection Table) and MTT (Memory Translation Table) modules that map virtual addresses to physical addresses, and its DMA engine consults these tables on every transfer. When the tables outgrow the RNIC's on-chip memory, entries must be fetched from host memory, increasing latency and jitter.
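To make the miss penalty concrete, here is a minimal sketch of MTT-style virtual-to-physical translation with a small on-chip cache backed by a host-resident table. The field names, the 4 KiB page size, and the eviction policy are illustrative assumptions, not any real RNIC's layout.

```python
# Toy MTT-style translator: a small on-chip cache over a host-resident table.
PAGE_SIZE = 4096

class MttCache:
    def __init__(self, host_table: dict, capacity: int = 1024):
        self.host_table = host_table  # vpage -> ppage, lives in host DRAM
        self.onchip = {}              # hot entries held in RNIC memory
        self.capacity = capacity

    def translate(self, vaddr: int) -> int:
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage in self.onchip:      # fast path: on-chip hit
            return self.onchip[vpage] * PAGE_SIZE + offset
        # Miss: fetch the entry from host memory over PCIe; this round
        # trip is the latency/jitter source described above.
        ppage = self.host_table[vpage]
        if len(self.onchip) >= self.capacity:
            self.onchip.pop(next(iter(self.onchip)))  # naive eviction
        self.onchip[vpage] = ppage
        return ppage * PAGE_SIZE + offset
```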

Newer SmartNIC/DPU designs therefore provide larger on-chip CPUs and DRAM to keep these RDMA data structures on the device.

Microsoft Cloud DCQCN: Data Center Quantized Congestion Notification

In 2015, Microsoft and Mellanox released DCQCN, a widely used ECN-based congestion-control algorithm that combines ideas from QCN and DCTCP.

DCQCN defines three roles: the RP (Reaction Point, the sender RNIC), the CP (Congestion Point, the switch), and the NP (Notification Point, the receiver RNIC), each handling a specific part of the control loop.

CP algorithm: marks packets with ECN when the egress queue exceeds a threshold, using DCTCP-style probabilistic marking.

NP algorithm: generates a CNP (Congestion Notification Packet) back to the sender upon receiving ECN-marked packets.

RP algorithm: on receiving a CNP, reduces the sending rate to new_rate = old_rate * (1 - α/2), where α is an EWMA congestion estimate (sketched below).
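A minimal sketch of the RP state machine, based on the update rules in the DCQCN paper; the parameter defaults (g, Rai) and the single simplified timer are illustrative, not vendor firmware.

```python
# DCQCN reaction-point (RP) update rules, simplified.
class DcqcnRp:
    def __init__(self, line_rate: float, g: float = 1 / 256, rai: float = 0.05):
        self.rc = line_rate   # current sending rate
        self.rt = line_rate   # target rate used during recovery
        self.alpha = 1.0      # EWMA congestion estimate
        self.g = g            # gain for the alpha update
        self.rai = rai        # additive-increase step

    def on_cnp(self) -> None:
        """NP signaled congestion: cut the rate and raise alpha."""
        self.rt = self.rc
        self.rc *= 1 - self.alpha / 2   # new_rate = old_rate * (1 - a/2)
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_update_timer(self) -> None:
        """No CNP in the last period: decay alpha, recover toward target.
        (The paper separates fast recovery from additive increase; this
        sketch folds them into one step.)"""
        self.alpha = (1 - self.g) * self.alpha
        self.rt += self.rai
        self.rc = (self.rc + self.rt) / 2
```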

DCQCN achieves fast convergence and fairness, but large deployments require configuring many parameters (ECN marking thresholds, the α gain g, rate-increase timers and step sizes).

Google Cloud TIMELY: RTT-Based Congestion Control

Published in 2015, TIMELY measures RTT at the sender and adjusts its rate using the RTT gradient, a purely rate-based control.

RTT measurement via hardware timestamps on data packets and their ACKs.

Congestion detection using low/high watermarks (Tlow, Thigh) on the measured RTT.

Rate adjustment driven by those watermarks and by the RTT gradient between them (see the sketch below).

Smoothing to avoid abrupt rate changes.

Advantages: simplicity and dynamic rate adaptation; drawbacks: reliance on high‑precision clocks and complex parameter tuning.
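A minimal sketch of TIMELY's gradient-based rate computation, following the structure of the paper's pseudocode; all constants (the Tlow/Thigh watermarks, EWMA weight, beta, delta, minimum RTT) are illustrative placeholders.

```python
# TIMELY sender-side rate update from RTT samples (microseconds).
class TimelyRate:
    def __init__(self, line_rate: float, t_low: float = 50.0,
                 t_high: float = 500.0, ewma: float = 0.875,
                 beta: float = 0.8, delta: float = 0.01,
                 min_rtt: float = 20.0):
        self.rate = line_rate   # current sending rate
        self.prev_rtt = None    # last RTT sample
        self.rtt_diff = 0.0     # EWMA-smoothed RTT difference
        self.t_low, self.t_high = t_low, t_high
        self.ewma, self.beta, self.delta = ewma, beta, delta
        self.min_rtt = min_rtt

    def on_completion(self, rtt: float) -> float:
        if self.prev_rtt is None:
            self.prev_rtt = rtt
            return self.rate
        new_diff = rtt - self.prev_rtt
        self.prev_rtt = rtt
        # Smooth the per-sample difference, then normalize into a gradient.
        self.rtt_diff = (1 - self.ewma) * self.rtt_diff + self.ewma * new_diff
        gradient = self.rtt_diff / self.min_rtt
        if rtt < self.t_low:                 # below low watermark: grow
            self.rate += self.delta
        elif rtt > self.t_high:              # above high watermark: sharp cut
            self.rate *= 1 - self.beta * (1 - self.t_high / rtt)
        elif gradient <= 0:                  # RTT flat or falling: increase
            self.rate += self.delta
        else:                                # RTT rising: back off with gradient
            self.rate *= 1 - self.beta * gradient
        return self.rate
```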

Alibaba Cloud HPCC: High-Precision Congestion Control

Published in 2019, HPCC addresses the limitations of DCQCN and TIMELY by using In-Network Telemetry (INT) to obtain precise link-load information, enabling rapid single-step rate updates.

Key features: fast traffic increase or decrease, minimal parameters, and precise sender rate calculation based on switch‑provided metadata.

Each switch on the path inserts INT metadata into the packet; the receiver echoes it back in ACKs, allowing the sender to compute a precise new rate (sketched below).
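A minimal sketch of the sender-side window update from INT metadata, following the utilization formula in the HPCC paper; the IntHop fields, the constants (ETA, T, W_AI), and the simplified update (no incStage bookkeeping) are assumptions for illustration.

```python
# HPCC-style window update from per-hop INT metadata.
from dataclasses import dataclass

@dataclass
class IntHop:
    qlen: float       # bytes queued at this hop when the packet passed
    tx_bytes: float   # cumulative bytes transmitted on the egress link
    bandwidth: float  # link capacity in bytes per microsecond

ETA = 0.95     # target utilization
T = 9.0        # base RTT (us) used to normalize queueing delay
W_AI = 1500.0  # additive-increase step, bytes

def utilization(cur: IntHop, prev: IntHop, dt_us: float) -> float:
    """Per-hop load: normalized queue plus measured txRate / bandwidth."""
    tx_rate = (cur.tx_bytes - prev.tx_bytes) / dt_us
    return cur.qlen / (cur.bandwidth * T) + tx_rate / cur.bandwidth

def update_window(window: float, path_now: list, path_prev: list,
                  dt_us: float) -> float:
    """Single-step reaction to the most loaded hop on the path."""
    u = max(utilization(c, p, dt_us) for c, p in zip(path_now, path_prev))
    if u >= ETA:
        return window / (u / ETA) + W_AI  # pull utilization back to target
    return window + W_AI                  # additively probe for bandwidth
```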

HPCC's reference implementation runs the rate-control module on a programmable FPGA in the RNIC.

AWS SRD: Scalable Reliable Datagram Protocol

AWS designed SRD to overcome RoCEv2 PFC limitations, providing low‑latency, scalable HPC networking.

SRD is the transport beneath the Elastic Fabric Adapter (EFA), which offers user-space drivers compatible with libfabric, MPI, and NCCL.

Core ideas:

Controlled ECMP multipath transmission via packet spraying.

Reliable out-of-order delivery with DPU-based reordering (see the sketch after this list).

RTT‑based fast congestion control.

Hardware‑line‑speed forwarding offloaded to Nitro DPU.
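A minimal sketch of per-packet spraying and receiver-side reordering, as referenced above; the path-selection and sequencing scheme here is a simplified illustration of the idea, not AWS's Nitro implementation.

```python
# SRD-style packet spraying with receiver-side reordering.
import itertools

def spray(num_packets: int, num_paths: int):
    """ECMP is steered per packet, not per flow, spreading load across paths."""
    paths = itertools.cycle(range(num_paths))
    return [(seq, next(paths)) for seq in range(num_packets)]

class Reorderer:
    """Packets may arrive out of order from different paths; delivery to
    the application is restored to sequence order here (on the DPU in EFA)."""
    def __init__(self):
        self.next_seq = 0
        self.pending = {}

    def on_packet(self, seq: int, payload: bytes) -> list:
        self.pending[seq] = payload
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered
```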

(Figure: comparison of SRD with TCP and InfiniBand.)

Tags: RDMA, Congestion Control, RoCEv2, HPCC, AWS SRD, DCQCN, TIMELY
Written by AI Cyberspace, covering AI, big data, cloud computing, and networking.
