AI Cyberspace
Feb 24, 2025 · Cloud Computing

Scaling AI Training: Inside Large-Scale RDMA Networks and Modern Congestion Controls

This article explores the hardware and networking foundations for training massive AI models, detailing the challenges of large‑scale RDMA deployment, the evolution of congestion‑control algorithms like DCQCN, TIMELY, HPCC, and AWS's SRD, and how hardware offload and programmable switches enable scalable, low‑latency AI infrastructure.

AWS SRD · Congestion Control · DCQCN
14 min read
AI Cyberspace
Feb 22, 2025 · Cloud Computing

Why RoCEv2 Needs a Lossless Network and How to Achieve It

RDMA was originally built for InfiniBand and later adapted to Ethernet as RoCE. RoCEv2 adds IP/UDP headers to enable L3 routing, but it is highly sensitive to packet loss, so it requires a lossless network and relies on technologies such as PFC, ECN, DCQCN, and multi‑path transmission to maintain high RDMA performance.

Congestion Control · DCQCN · ECN
17 min read