Predictable Network: Alibaba Cloud’s Ethernet Edge for Faster AI Training

This article examines the challenges of scaling AI model training beyond single-chip limits, introduces Alibaba Cloud’s Predictable Network architecture—including high‑performance Ethernet, dual‑uplink, and adaptive routing—and compares its performance, scalability, and reliability against InfiniBand, showing how Ethernet can meet AI workloads with minimal loss.

Alibaba Cloud Big Data AI Platform

Jun 19, 2023

Predictable Network for AI

Artificial general intelligence is drawing ever closer, and worldwide attention and investment are driving rapid change. The birth and evolution of AI have always been tied to advances in computing power, and the current large‑model boom requires models with tens to hundreds of billions of parameters, far beyond the capacity of a single chip. Ultra‑large compute clusters have become the key infrastructure supporting technological progress and application innovation.

Challenge: Scaling Compute and Network

How can we break through the compute limits of a single chip or server node while ensuring that performance scales linearly in massive clusters? Alibaba Cloud’s Infrastructure Business Unit introduced Predictable Network (PN) to meet the high‑efficiency data‑exchange demands of AI, big‑data analytics, and high‑performance computing. Unlike traditional “best‑effort” networks, PN provides QoS‑guaranteed throughput and latency, making performance metrics predictable.

AI Model Training: Bandwidth and Latency Sensitivity

Typical video or chat services require end‑to‑end latency of tens of milliseconds and bandwidth in the tens of Gbps. Distributed AI training demands far higher bandwidth and much lower latency, because each training task is split across many nodes that must exchange and synchronize data on every one of many iterations. Training GPT‑3 175B on 128 GPU servers, for example, involves massive parallelism and continuous data synchronization among all of those servers.
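To get a feel for why bandwidth matters so much, the sketch below estimates how long one full gradient synchronization would take at different link speeds. It is a rough back-of-the-envelope model, not a description of how GPT‑3 was actually trained: it assumes pure data parallelism, 2-byte (fp16) gradients, a ring all-reduce, and no latency or protocol overhead. The function names and link speeds are illustrative.

```python
def ring_allreduce_bytes_per_node(payload_bytes: float, nodes: int) -> float:
    """Bytes each node sends during one ring all-reduce of `payload_bytes`.

    A ring all-reduce has a reduce-scatter phase and an all-gather phase;
    each phase moves (nodes - 1) / nodes of the payload through every node.
    """
    if nodes < 2:
        return 0.0
    return 2 * (nodes - 1) / nodes * payload_bytes


def sync_time_seconds(payload_bytes: float, nodes: int, link_gbps: float) -> float:
    """Bandwidth-only lower bound on the time for one all-reduce."""
    bytes_per_second = link_gbps * 1e9 / 8
    return ring_allreduce_bytes_per_node(payload_bytes, nodes) / bytes_per_second


if __name__ == "__main__":
    params = 175e9               # GPT-3 175B parameter count
    grad_bytes = params * 2      # assumption: 2 bytes per gradient (fp16)
    for gbps in (100, 400, 800): # assumption: illustrative per-node speeds
        t = sync_time_seconds(grad_bytes, nodes=128, link_gbps=gbps)
        print(f"{gbps:>4} Gbps per node: about {t:.1f} s per full gradient sync")
```

Real training jobs combine data, tensor, and pipeline parallelism and overlap communication with computation, so actual numbers differ. The point is only that synchronization time scales inversely with per-node bandwidth and is paid on every iteration.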

Stability Challenges in Ultra‑Large Systems

Training large models can take hundreds to thousands of hours; any hardware fault or network glitch can halt progress and require costly recovery. As cluster size grows, the probability of failures (network jitter, board or GPU faults) rises, making stability a universal challenge for AI‑focused compute clusters.
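The relationship between cluster size and failure probability can be illustrated with a simple independence model. The sketch below assumes every component fails independently with the same small probability per hour; the failure rate and component counts are hypothetical values chosen for illustration, not measurements from any real cluster.

```python
def p_any_failure(components: int, p_fail_per_hour: float, hours: float) -> float:
    """Probability that at least one of `components` fails during `hours`.

    Assumes independent, identically distributed failures, which real
    clusters only approximate (faults are often correlated).
    """
    p_one_survives = (1.0 - p_fail_per_hour) ** hours
    return 1.0 - p_one_survives ** components


if __name__ == "__main__":
    rate = 1e-6      # hypothetical per-component failure probability per hour
    job_hours = 1000 # a job in the "hundreds to thousands of hours" range
    for n in (1_000, 10_000, 100_000):
        p = p_any_failure(n, rate, job_hours)
        print(f"{n:>7} components: P(at least one failure) = {p:.3f}")
```

Even with a very low per-component failure rate, the chance that a long job sees at least one fault approaches certainty as the component count grows, which is why fast detection and recovery matter as much as raw reliability.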

Serverless and Compute as a Service

Cloud computing offers flexible, scalable, low‑cost compute services. Container‑based deployment and network virtualization are essential for multi‑tenant AI workloads, but traditional RDMA virtualization incurs performance penalties. Alibaba Cloud’s PN addresses these issues with a lossless‑like Ethernet design that avoids high virtualization overhead.

InfiniBand vs. Ethernet: The Core Decision

InfiniBand (IB) originated in high‑performance computing and excels in raw bandwidth and low latency, but it is dominated by a single vendor and lacks a broad ecosystem. Ethernet, backed by the Ethernet Alliance, enjoys a vibrant, multi‑vendor ecosystem and rapid evolution. After extensive analysis, Alibaba Cloud chose Ethernet for its openness, scalability, and cost advantages.

Performance Comparison

Benchmarks show that IB has lower latency for small messages (an advantage of about 1 µs) and comparable throughput for large messages. However, when Ethernet is optimized with lossless techniques and high‑precision congestion control, the performance gap narrows to about 5 %, even when 800 Gbps of Ethernet bandwidth per node is compared against 1.6 Tbps of IB bandwidth per node.

Transmission Optimization

IB uses credit‑based flow control to guarantee lossless transmission, while Ethernet employs Priority‑based Flow Control (PFC). Both approaches can cause congestion‑induced deadlocks at scale. Alibaba Cloud’s PN adopts a lossy Ethernet combined with a high‑precision congestion control algorithm (HPCC) that achieves lossless‑like performance without PFC.
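The idea behind telemetry-driven congestion control can be sketched in a few lines. The code below is a simplified window update in the spirit of the publicly described HPCC algorithm: the sender estimates the utilization of the busiest link on the path from switch-reported queue depth and transmit rate, then scales its window toward a target utilization. It is not Alibaba Cloud's implementation. The class, field names, and default parameters are assumptions, and details of the published algorithm (such as its reference window and staged additive increase) are omitted.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LinkTelemetry:
    """Per-hop telemetry a switch might report (hypothetical field names)."""
    queue_bytes: float   # current egress queue depth
    tx_rate_bps: float   # measured output rate
    capacity_bps: float  # link speed


def link_utilization(link: LinkTelemetry, base_rtt_s: float) -> float:
    """Normalized in-flight load: queued bytes over one BDP, plus rate over capacity."""
    bdp_bytes = link.capacity_bps / 8 * base_rtt_s
    return link.queue_bytes / bdp_bytes + link.tx_rate_bps / link.capacity_bps


def next_window(window_bytes: float, path: Sequence[LinkTelemetry],
                base_rtt_s: float, eta: float = 0.95,
                additive_bytes: float = 1000.0) -> float:
    """One simplified window update driven by the most congested hop."""
    u = max(link_utilization(link, base_rtt_s) for link in path)
    if u >= eta:
        # Above target: shrink multiplicatively toward utilization eta.
        return window_bytes / (u / eta) + additive_bytes
    # Below target: probe for more bandwidth additively.
    return window_bytes + additive_bytes
```

Because the sender reacts to measured queue depth rather than to packet loss or pause frames, queues can be kept short, which is what lets a lossy fabric behave almost like a lossless one without relying on PFC.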

Load Balancing

IB leverages adaptive routing to distribute traffic across multiple paths based on real‑time load. Ethernet traditionally uses ECMP hashing, which can lead to uneven load for large AI traffic flows. Optimized Ethernet designs incorporate per‑packet or flowlet‑level balancing to mitigate hash‑induced latency spikes.
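The imbalance that ECMP hashing can produce is easy to reproduce in a toy model. The sketch below hashes a handful of equally sized large flows onto equal-cost paths and compares the result with ideal per-packet spraying. The CRC32 hash, the addresses, and the port numbers are stand-ins chosen for illustration; real switches use their own hash functions and fields.

```python
import zlib
from typing import List, Sequence, Tuple

Flow = Tuple[tuple, float]  # (five_tuple, bytes to send)


def ecmp_path(five_tuple: tuple, n_paths: int) -> int:
    """Pick a path from a hash of the 5-tuple (CRC32 as a stand-in hash)."""
    key = "|".join(map(str, five_tuple)).encode()
    return zlib.crc32(key) % n_paths


def ecmp_loads(flows: Sequence[Flow], n_paths: int) -> List[float]:
    """Per-path load when every packet of a flow follows the hashed path."""
    loads = [0.0] * n_paths
    for five_tuple, size in flows:
        loads[ecmp_path(five_tuple, n_paths)] += size
    return loads


def sprayed_loads(flows: Sequence[Flow], n_paths: int) -> List[float]:
    """Per-path load under ideal per-packet spraying (perfectly even)."""
    total = sum(size for _, size in flows)
    return [total / n_paths] * n_paths


def imbalance(loads: Sequence[float]) -> float:
    """Busiest path relative to the mean; 1.0 means perfectly balanced."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


if __name__ == "__main__":
    # Eight hypothetical elephant flows of equal size over eight paths.
    flows = [((f"10.0.0.{i}", f"10.0.1.{i}", 49152 + i, 4791, "UDP"), 100.0)
             for i in range(8)]
    print("ECMP imbalance:   ", imbalance(ecmp_loads(flows, 8)))
    print("Sprayed imbalance:", imbalance(sprayed_loads(flows, 8)))
```

With only a few large flows, a hash rarely places exactly one flow on each path, so some links carry several flows while others sit idle. AI training traffic tends to look like this: few flows, each very large.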

Model Training Performance

Training speed—measured in tokens processed per second—is the most direct metric. NVIDIA’s public IB results and Alibaba Cloud’s Ethernet tests show that, after optimization, Ethernet’s training speed is within 5 % of IB’s, despite Ethernet’s lower raw bandwidth.

Cluster Scalability

Both IB and Ethernet can scale to clusters with over ten thousand GPUs. However, Ethernet’s higher switch chip forwarding capacity and two‑year doubling trend make it more future‑proof for “ten‑thousand‑card” clusters.
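Why switch chip capacity matters for cluster size can be shown with standard fat-tree arithmetic. In a non-blocking topology built from identical switches with k ports each, two tiers can connect k²/2 endpoints and three tiers k³/4. The sketch below applies those formulas; the chip capacities and port speed are illustrative values, not figures from Alibaba Cloud's deployment.

```python
def switch_radix(chip_gbps: int, port_gbps: int) -> int:
    """Number of ports a switch chip can expose at a given port speed."""
    return chip_gbps // port_gbps


def max_endpoints(radix: int, tiers: int) -> int:
    """Endpoints supported by a non-blocking fat-tree of identical switches."""
    if tiers == 2:
        return radix ** 2 // 2
    if tiers == 3:
        return radix ** 3 // 4
    raise ValueError("only 2-tier and 3-tier fabrics are modeled")


if __name__ == "__main__":
    port_gbps = 400                     # assumption: illustrative port speed
    for chip_gbps in (25_600, 51_200):  # assumption: one capacity doubling
        k = switch_radix(chip_gbps, port_gbps)
        print(f"{chip_gbps / 1000:.1f} Tbps chip, radix {k}: "
              f"2-tier {max_endpoints(k, 2)}, 3-tier {max_endpoints(k, 3)} endpoints")
```

Under these assumptions, doubling chip capacity at a fixed port speed quadruples the size of a two-tier fabric. Fewer tiers also mean fewer hops, lower latency, and fewer places for load imbalance to arise.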

HPN Architecture (High‑Performance Network)

HPN tackles stability, scalability, and load‑balancing challenges with three core innovations:

Dual‑uplink design provides lossless upgrade and fault tolerance.

Dual‑plane forwarding eliminates hash polarization by mapping each uplink to a separate network plane.

Adaptive routing (per‑packet or flowlet) balances traffic across paths while controlling packet reordering.
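Flowlet-level routing, mentioned in the third item above, can be sketched as follows. A flow is allowed to change paths only after it has been idle longer than a configured gap; if that gap exceeds the delay difference between paths, packets sent on the new path cannot overtake earlier ones, which keeps reordering under control. This is a generic illustration, not HPN's implementation: the class name is hypothetical, and round-robin path selection stands in for a real load-aware choice.

```python
from typing import Dict, Hashable, Tuple


class FlowletRouter:
    """Toy flowlet switch: re-pick a path only after an idle gap."""

    def __init__(self, n_paths: int, gap_s: float) -> None:
        self.n_paths = n_paths
        self.gap_s = gap_s
        # flow_id -> (time of last packet, path currently in use)
        self._state: Dict[Hashable, Tuple[float, int]] = {}
        self._next_path = 0

    def route(self, flow_id: Hashable, now_s: float) -> int:
        """Return the path for a packet of `flow_id` arriving at `now_s`."""
        state = self._state.get(flow_id)
        if state is None or now_s - state[0] > self.gap_s:
            # New flow or idle gap exceeded: start a new flowlet.
            # Round-robin stands in for choosing the least-loaded path.
            path = self._next_path
            self._next_path = (self._next_path + 1) % self.n_paths
        else:
            # Same flowlet: stay on the current path to preserve ordering.
            path = state[1]
        self._state[flow_id] = (now_s, path)
        return path


if __name__ == "__main__":
    router = FlowletRouter(n_paths=4, gap_s=0.001)
    for t in (0.0, 0.0005, 0.01):
        print(f"t={t}: flow 'a' uses path {router.route('a', t)}")
```

Per-packet spraying corresponds to the limiting case of a zero gap: balance is best, but the receiver must then tolerate or repair reordering.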

Solar‑RDMA and ReLSA

Solar‑RDMA is Alibaba’s custom high‑performance RDMA protocol that provides multi‑path transmission, fast failover, and fine‑grained congestion control (HPCC). ReLSA (Remote Load/Store/Atomic) uses memory‑mapped I/O to let CPUs directly access remote memory, reducing latency for gradient synchronization by ~50 %.

ACCL and C4

ACCL is a high‑performance collective communication library that is topology‑aware, keeping most traffic within the same access switch to reduce cross‑rack contention. It implements a hybrid AllReduce algorithm (Halving‑Doubling) that outperforms Ring by >20 % in large‑scale training. C4 is a scheduling framework that coordinates multiple collective operations across tasks, cutting collective communication time by up to 49 % and boosting GPU utilization by >67 %.
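One reason Halving‑Doubling can beat Ring at scale shows up in the textbook alpha-beta cost model, where each communication step costs a fixed latency alpha plus beta seconds per byte. Both algorithms move the same volume of data, but Ring needs 2(N-1) steps while Halving‑Doubling needs 2·log2(N). The sketch below compares the two under that model. It describes the plain algorithms rather than ACCL's hybrid variant, it ignores topology and contention, and the latency and bandwidth values are illustrative assumptions, so it is not expected to reproduce the measured figure above.

```python
def ring_allreduce_time(n: int, size_bytes: float,
                        alpha_s: float, beta_s_per_byte: float) -> float:
    """Alpha-beta cost of ring all-reduce: 2(n-1) latency-bound steps."""
    steps = 2 * (n - 1)
    volume = 2 * (n - 1) / n * size_bytes
    return steps * alpha_s + volume * beta_s_per_byte


def halving_doubling_time(n: int, size_bytes: float,
                          alpha_s: float, beta_s_per_byte: float) -> float:
    """Alpha-beta cost of recursive halving (reduce-scatter) followed by
    recursive doubling (all-gather): 2*log2(n) steps, same data volume."""
    if n < 1 or n & (n - 1):
        raise ValueError("this simple model requires n to be a power of two")
    steps = 2 * (n.bit_length() - 1)  # exact log2 for powers of two
    volume = 2 * (n - 1) / n * size_bytes
    return steps * alpha_s + volume * beta_s_per_byte


if __name__ == "__main__":
    alpha = 5e-6           # assumption: 5 microseconds per step
    beta = 8 / 400e9       # assumption: 400 Gbps per node
    size = 64 * 1024 ** 2  # assumption: 64 MiB message
    for n in (8, 128, 1024):
        ring = ring_allreduce_time(n, size, alpha, beta)
        hd = halving_doubling_time(n, size, alpha, beta)
        print(f"n={n:>4}: ring {ring * 1e3:.2f} ms, halving-doubling {hd * 1e3:.2f} ms")
```

The advantage grows with node count and shrinks with message size, because the latency term dominates only when there are many steps relative to the data moved per step.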

Nimitz and NUSA

Nimitz is an RDMA‑enabled container network that supports up to 15 K servers in a single Kubernetes cluster, providing high‑performance, secure, and elastic networking for serverless workloads. NUSA (Unified Network Service Platform) automates RDMA deployment, monitoring, fault detection, and remediation, delivering an out‑of‑the‑box RDMA experience.

Conclusion

While InfiniBand still offers the best raw performance, a well‑optimized Ethernet solution can achieve comparable results with far lower cost, better scalability, and a richer ecosystem. Alibaba Cloud’s Predictable Network, built on Ethernet, provides the stability, linear scalability, and QoS guarantees required for next‑generation AI training and large‑scale compute workloads.
