Scale‑up x10 Drives a New Wave of AI Compute Cluster Network Architecture
At the CCF ChinaNet conference, Alibaba Cloud’s VP of R&D presented a vision of AI compute clusters growing ten‑fold, covering the shift from InfiniBand to high‑throughput Ethernet, the HPN7.0 architecture, the challenges emerging around Scale‑up, and the plans for the high‑throughput Ethernet protocol and the ENode+ super‑node system.
From November 8 to 10, the CCF ChinaNet conference in Zhangjiagang gathered academicians, professors, and industry leaders to discuss the future of networking for AI compute clusters. Alibaba Cloud’s Vice President of R&D, Cai Dezhi, delivered a keynote titled “Scale‑x10 Drives a New Round of AI Compute Cluster Network Architecture”, outlining trends in AI scaling, the need for higher‑bandwidth interconnects, and the roadmap for high‑throughput Ethernet and the ENode+ super‑node system.
The continuous scaling of models and datasets has been driving total compute requirements up by 4‑6× per year, while single‑chip performance and network bandwidth still track Moore’s law, roughly doubling every two years. Closing that gap means clustering ever more GPUs and improving the network that connects them.
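A back‑of‑the‑envelope sketch shows how quickly the gap compounds; the 5× midpoint and the 1,000‑GPU starting point below are illustrative assumptions, not figures from the keynote:

```python
# Back-of-the-envelope: demand grows ~5x/year (midpoint of the 4-6x range
# above), while per-chip performance doubles every two years (~1.41x/year).
# Their ratio is the factor by which the cluster itself must grow.

demand_growth = 5.0          # midpoint of the 4-6x annual range
chip_growth = 2.0 ** 0.5     # 2x every two years ~= 1.41x per year

gpus = 1_000                 # hypothetical starting cluster size
for year in range(1, 4):
    gpus *= demand_growth / chip_growth
    print(f"year {year}: ~{gpus:,.0f} GPUs needed")
# year 1: ~3,536; year 2: ~12,500; year 3: ~44,194 -- an order of
# magnitude in roughly two years, hence the "x10" framing.
```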
In early 2023, AI cluster networking solutions were diverse: Google used a private TPU interconnect, Microsoft adopted NVIDIA’s InfiniBand, while Alibaba Cloud and AWS favored open Ethernet. Alibaba Cloud reinforced its Ethernet strategy with the HPN7.0 architecture, featuring a self‑developed 51.2 Tbps switch; a multi‑track, dual‑uplink, dual‑plane topology; and a custom communication library, protocol, and flow control. The HPN7.0 paper was accepted at SIGCOMM 2024, the first AI‑focused network architecture paper in the conference’s history.
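To give a sense of the scale a 51.2 Tbps switch enables, here is a rough capacity estimate; the 400 GbE port speed and the plain two‑tier Clos layout are assumptions for illustration, not the published HPN7.0 parameters (those are in the SIGCOMM paper):

```python
# Rough capacity of a two-tier Clos fabric built from 51.2 Tbps switches.
# Port speed and topology are assumptions for this sketch; the actual
# HPN7.0 design (multi-track, dual-uplink, dual-plane) differs in detail.

switch_tbps = 51.2
port_gbps = 400
ports = int(switch_tbps * 1000 / port_gbps)   # 128 ports per switch

# Non-blocking two-tier Clos: each leaf splits its ports between hosts
# and spines, so the fabric tops out at ports^2 / 2 host-facing ports.
max_host_ports = ports * ports // 2           # 8,192
print(f"{ports}-port switches -> up to {max_host_ports:,} host ports in two tiers")

# HPN7.0's dual-plane design adds a second, independent fabric plane so a
# NIC's two uplinks never share a single point of failure (per the
# description above; exact figures are in the paper).
```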
Within a year, HPN7.0 became an industry benchmark, steering Ethernet‑based AI cluster development worldwide. Major North American firms are moving toward Ethernet for hundred‑thousand‑GPU clusters, effectively ending the Ethernet vs InfiniBand debate.
Looking ahead, the emergence of x10‑scale clusters will introduce new networking challenges. Scaling a GPU cluster is not merely a matter of adding GPUs: failure rates grow with component count, power and space constraints push clusters across buildings and regions, and that geographic distribution adds latency and bandwidth pressure (see the propagation sketch below), demanding aggressive network planning for both Scale‑up and Scale‑out.
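Propagation delay alone makes the geography point concrete; the distances below are hypothetical examples, but the speed of light in fiber (roughly 200,000 km/s, about two‑thirds of c) is not:

```python
# Propagation-delay floor for clusters spread across sites. No protocol
# tuning can remove distance-induced latency; distances are hypothetical.

FIBER_KM_PER_MS = 200.0   # light in fiber: ~200,000 km/s

for label, km in [("same hall", 0.1), ("across campus", 2.0), ("metro sites", 50.0)]:
    rtt_us = 2 * km / FIBER_KM_PER_MS * 1000
    print(f"{label:>13}: {km:>5.1f} km -> ~{rtt_us:,.0f} us RTT (propagation only)")
# 50 km between sites already adds ~500 us of RTT, which is large next to
# the microsecond-scale latencies Scale-up fabrics target.
```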
Scale‑up refers to ultra‑high‑bandwidth interconnect within a bounded cost and technology envelope, offering several times the bandwidth of Scale‑out and supporting memory‑semantic operations. Contrary to common belief, Scale‑up is not limited to intra‑node interconnect: an NVL‑72 rack, for example, comprises 18 servers linked by nine Scale‑up switches, delivering 7.2 Tbps per GPU (≈10× Scale‑out) along with memory‑semantic capabilities.
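A quick sanity check of those figures (reading 7.2 Tbps as per‑GPU Scale‑up bandwidth, which is the interpretation that makes the ≈10× comparison line up):

```python
# Sanity-checking the NVL-72 figures quoted above. The per-GPU reading of
# the 7.2 Tbps figure is an interpretation consistent with the 10x claim.

gpus = 72                             # 72 GPUs across the rack's 18 servers
scale_up_tbps = 7.2                   # per-GPU Scale-up bandwidth (from the text)
scale_out_tbps = scale_up_tbps / 10   # ~0.72 Tbps, e.g. on the order of 2x400 GbE

print(f"rack aggregate Scale-up: {gpus * scale_up_tbps:,.0f} Tbps")
print(f"per-GPU Scale-up vs Scale-out: {scale_up_tbps:.1f} vs {scale_out_tbps:.2f} Tbps")
```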
Two main technical directions exist for Scale‑up: private protocols (NVIDIA’s NVLink, Google’s TPU interconnect) and Ethernet‑based solutions from cloud providers and hardware vendors (Microsoft, Meta, Tesla, AMD, Intel). With ultra‑high bandwidth, low latency, and in‑network computing, strengthened by the protocol upgrades from the UEC and the high‑throughput Ethernet initiatives, Ethernet has become the preferred choice for new Scale‑up systems.
The integration of Scale‑up and Scale‑out is crucial for optimal cluster performance. As Ethernet becomes the dominant Scale‑up solution, a fused architecture that shares bandwidth across both domains can reduce costs, simplify operations, and improve efficiency, as the toy comparison below suggests.
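A toy provisioning comparison shows the mechanism behind that claim; all peak values here are hypothetical, chosen only to illustrate that a shared fabric can be sized below the sum of two dedicated ones when traffic bursts rarely coincide:

```python
# Toy comparison: separate fabrics must each be provisioned for their own
# peak; a fused Ethernet fabric is sized for the combined peak. All values
# are hypothetical illustrations, not measured figures.

scale_up_peak = 7.2    # Tbps per GPU (from the NVL-72 example)
scale_out_peak = 0.72  # Tbps per GPU
combined_peak = 7.4    # Tbps per GPU, hypothetical: bursts rarely coincide

separate_total = scale_up_peak + scale_out_peak
saving = 1 - combined_peak / separate_total
print(f"separate fabrics: {separate_total:.2f} Tbps/GPU provisioned")
print(f"fused fabric:     {combined_peak:.2f} Tbps/GPU ({saving:.0%} less)")
```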
At the conference, Alibaba Cloud announced the high‑throughput Ethernet protocol roadmap with annual major releases and semi‑annual minor updates, and unveiled the ENode+ super‑node plan to accelerate the ecosystem’s development.