Tagged articles
2 articles
AI Cyberspace
Feb 24, 2025 · Cloud Computing

Scaling AI Training: Inside Large-Scale RDMA Networks and Modern Congestion Controls

This article explores the hardware and networking foundations for training massive AI models: the challenges of deploying RDMA at large scale, the evolution of congestion‑control approaches such as DCQCN, TIMELY, HPCC, and AWS's SRD, and how hardware offload and programmable switches enable scalable, low‑latency AI infrastructure.

AWS SRD · DCQCN · HPCC
14 min read
Alibaba Cloud Infrastructure
Nov 2, 2022 · Operations

Network Performance Anomaly Detection with In‑band Telemetry and High‑Performance Congestion Control (HPCC++) at the 2022 OCP Global Summit

At the 2022 OCP Global Summit in San Jose, Alibaba and Broadcom presented two technical talks covering in‑band telemetry‑based network performance anomaly detection and the HPCC++ congestion‑control algorithm, highlighting deployment challenges, resource trade‑offs, and real‑world data‑center use cases.

Data Center Networking · HPCC · OCP Summit
6 min read