How xPU Scale‑Up Networks Are Redefining AI Training Efficiency
As AI models grow to massive scales and shift from dense to Mixture-of-Experts (MoE) architectures, the demand for ultra-high-performance, low-latency networking in xPU clusters intensifies, driving the evolution of Scale-Up networks; Alibaba Cloud's UPN design tackles the resulting bandwidth, cost, and reliability challenges.
AI Model Scaling and Network Demands
In recent years, the rapid development of artificial intelligence (AI) has driven an exponential increase in the compute and memory requirements of large-model training and inference. To achieve higher compute performance, shorter training times, and more efficient inference, AI clusters expand compute power through high-performance networking, scaling from tens of thousands to hundreds of thousands of accelerator cards (xPUs). Efficient training and inference rely on parallelism strategies that require thousands to tens of thousands of xPUs to exchange data, which in turn depends on high-performance network forwarding.
Model Structure Evolves from Dense to MoE
Driven by the need for greater model capacity at lower computational cost, large models are increasingly adopting Mixture-of-Experts (MoE) architectures in place of traditional dense structures. MoE partitions the model into multiple independent expert networks and uses a gating mechanism to dynamically assign input data to specific experts. This parallel expert processing improves performance while controlling compute cost. From a networking perspective, MoE typically employs Expert Parallelism (EP), which demands ultra-high bandwidth and ultra-low latency; larger EP domains further boost compute efficiency, making large-scale EP network communication a key trend.
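To make the gating mechanism concrete, here is a minimal NumPy sketch of top-k MoE routing; the expert count, top-k value, and softmax gate are generic illustrative choices, not the design of any particular model.

```python
# Minimal sketch of MoE top-k gating (illustrative; the softmax gate
# and top_k=2 are generic assumptions, not a specific production design).
import numpy as np

def moe_layer(x, gate_w, experts, top_k=2):
    """Route each token to its top_k experts and mix their outputs.

    x:       (tokens, d_model) input activations
    gate_w:  (d_model, n_experts) gating weights
    experts: list of callables, each mapping (d_model,) -> (d_model,)
    """
    logits = x @ gate_w                                   # (tokens, n_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)            # softmax gate
    top = np.argsort(-probs, axis=-1)[:, :top_k]          # chosen experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        w = probs[t, top[t]]
        w = w / w.sum()                                   # renormalize top-k
        for e, wi in zip(top[t], w):
            out[t] += wi * experts[e](x[t])               # expert FFN call
    return out, top

# Toy usage with random linear "experts":
rng = np.random.default_rng(0)
d, n_experts = 16, 4
experts = [lambda v, W=rng.standard_normal((d, d)) * 0.1: v @ W
           for _ in range(n_experts)]
x = rng.standard_normal((8, d))
y, routing = moe_layer(x, rng.standard_normal((d, n_experts)), experts)
```

Under Expert Parallelism, the routing table `top` turns into an all-to-all exchange: each token is shipped to the devices hosting its chosen experts and the result shipped back, which is why EP traffic is so sensitive to network bandwidth and latency.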
From Pre‑training to Integrated Training‑Inference
The compute load of AI clusters is evolving from isolated pre-training toward integrated training and inference (train-infer) within the same network. Offline training, reinforcement learning, and online inference coexist, introducing distributed efficiency techniques such as prefill-decode (PD) separation, attention-FFN (AF) separation, and large-EP inference. This coexistence of online and offline traffic, diverse parallelism modes, and workloads of varying compute density complicates the network communication model and raises the requirements for a unified train-infer network architecture.
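As a concrete illustration of one such technique, the toy sketch below shows the essence of PD separation under assumed shapes: a prefill worker produces the KV cache in one dense pass, and that cache is what gets shipped across the network to a decode worker.

```python
# Toy sketch of PD (prefill-decode) disaggregation: a prefill worker
# builds the KV cache for the whole prompt in one compute-dense pass,
# then ships it to a decode worker that generates tokens step by step.
# All shapes and values are illustrative assumptions.
import numpy as np

D = 64  # toy hidden size (assumed)

def prefill(prompt_tokens):
    """Compute-dense phase: process all prompt tokens at once -> KV cache."""
    kv = np.random.rand(len(prompt_tokens), D)  # stand-in for real K/V
    return kv                                   # this is what PD transfers

def decode(kv, steps):
    """Bandwidth-bound phase: one token per step, attending over the cache."""
    tokens = []
    for _ in range(steps):
        query = np.random.rand(D)            # stand-in for the new token
        scores = kv @ query                  # attention over the shipped cache
        tokens.append(int(scores.argmax()))  # toy "sampled" token id
        kv = np.vstack([kv, query[None, :]]) # cache grows each step
    return tokens

kv_cache = prefill(list(range(128)))  # runs on the prefill pool
print(decode(kv_cache, steps=4))      # runs on the decode pool
```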
Scaling xPU Compute via Scale‑Up
To meet the growing compute demands of models, compute‑interconnect technologies have advanced rapidly. High‑bandwidth, low‑latency network interconnects enable cluster‑level super‑node compute scaling, exemplified by NVIDIA’s GPU Scale‑Up evolving from 8‑card air‑cooled systems to 72‑card liquid‑cooled systems, and Huawei’s 384‑NPU super‑node built with UB networking.
xPU Scale‑Up Network Evolution and Challenges
Current Scale‑Up systems largely rely on copper‑cable interconnects, which are cost‑effective and stable but limited in distance, leading to dense rack designs that increase system complexity and reduce reliability. Optical interconnects are the inevitable future for larger‑scale Scale‑Up networks, yet they face two major challenges: higher cost and reliability concerns.
Cost analysis shows that for up to 64–128 xPU, copper interconnects cost roughly half of optical solutions (including switch costs). For scenarios exceeding 128 xPU, a single‑layer optical interconnect becomes more cost‑effective than a hybrid copper‑plus‑optical approach, though optical costs remain relatively high.
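A toy cost model illustrates the crossover; every unit price below (COPPER_LINK, OPTICAL_LINK, SWITCH_PORT) and the 8-lane-per-xPU assumption are placeholders chosen only to reflect the ratios above, not real quotes.

```python
# Toy fabric cost model; every unit price is a placeholder assumption.
COPPER_LINK, OPTICAL_LINK, SWITCH_PORT = 1.0, 3.0, 0.5

def layer_cost(n_links, link_cost):
    # each link consumes one cable/module plus two switch ports
    return n_links * (link_cost + 2 * SWITCH_PORT)

def flat_copper(n_xpu, lanes=8):
    # single copper layer: only feasible at in-rack reach (<= ~128 xPU)
    return layer_cost(n_xpu * lanes, COPPER_LINK)

def flat_optical(n_xpu, lanes=8):
    # single optical layer: reach is no longer the limiting factor
    return layer_cost(n_xpu * lanes, OPTICAL_LINK)

def hybrid(n_xpu, lanes=8):
    # copper in-rack tier plus an optical inter-rack tier (two layers)
    return flat_copper(n_xpu, lanes) + flat_optical(n_xpu, lanes)

for n in (64, 128, 256):
    print(f"{n:4d} xPU: copper={flat_copper(n):6.0f}  "
          f"optical={flat_optical(n):6.0f}  hybrid={hybrid(n):6.0f}")
```

With these placeholder ratios, a flat copper fabric costs half of a flat optical one (matching the 64-128 xPU observation), while beyond copper reach the hybrid design pays for two tiers of links and switch ports and loses to single-layer optical.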
Reliability challenges arise from two sources: link-level bit errors, addressed by forward error correction (FEC) and link-level retry (LLR), and in-flight packet loss due to link or switch failures, which requires end-to-end retransmission mechanisms. As Scale-Up systems grow, the probability of both failure types increases, dramatically reducing the system's mean time between failures (MTBF). Consequently, fault-tolerant system architecture becomes essential, alongside improvements in interconnect reliability.
Statistical data indicate that copper‑cable link failure probability is about one‑sixth that of DSP‑based optical links, highlighting the need to enhance optical link reliability for future large‑scale deployments.
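A serial-reliability estimate shows why MTBF collapses with scale; the per-link MTBF figures below are assumptions, scaled only to preserve the ~6x copper-vs-optical failure ratio cited above, and the 8-links-per-xPU figure is likewise assumed.

```python
# Serial-reliability sketch: if link failures are independent, failure
# rates add, so MTBF_system ~= MTBF_link / n_links. Per-link MTBF values
# are assumptions that preserve the ~6x copper-vs-optical ratio.
COPPER_LINK_MTBF_H  = 6_000_000   # assumed per-link MTBF, hours
OPTICAL_LINK_MTBF_H = 1_000_000   # ~6x worse per link (assumed scale)

def system_mtbf_h(link_mtbf_h, n_links):
    return link_mtbf_h / n_links  # failure rates of serial parts add

for n_xpu in (8, 72, 512):
    links = n_xpu * 8             # assume 8 scale-up links per xPU
    print(f"{n_xpu:4d} xPU ({links:5d} links): "
          f"copper ~{system_mtbf_h(COPPER_LINK_MTBF_H, links):>9,.0f} h, "
          f"optical ~{system_mtbf_h(OPTICAL_LINK_MTBF_H, links):>9,.0f} h")
```

Even with generous per-link numbers, a 512-xPU optical fabric sees some link fail every few hundred hours, which is why fault tolerance must be designed into the architecture rather than bolted on.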
Furthermore, as xPU compute and HBM capacity expand, per-GPU Scale-Up bandwidth keeps growing (e.g., NVIDIA's NVLink bandwidth has tripled over recent generations, reaching 1.8 TB/s bidirectional in the latest Blackwell GPUs). Higher bandwidth increases the proportion of GPU resources consumed by driving network transmission (approximately 15% in the case of DeepEP), prompting the need for more efficient network semantics and in-network computing to reduce this compute overhead.
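A back-of-the-envelope calculation shows what is at stake; the workload numbers below (token count, hidden size, top-k, bandwidth tiers) are illustrative assumptions rather than DeepEP measurements.

```python
# Back-of-the-envelope EP dispatch time; all workload numbers are
# illustrative assumptions, not measurements of any specific model.
tokens = 4096            # tokens dispatched per microbatch (assumed)
hidden = 7168            # hidden dimension (assumed)
top_k = 8                # experts selected per token (assumed)
bytes_per_elem = 2       # bf16 activations

payload = tokens * top_k * hidden * bytes_per_elem   # all-to-all bytes

for bw_GBps in (450, 900):   # unidirectional GB/s per GPU (assumed tiers)
    t_ms = payload / (bw_GBps * 1e9) * 1e3
    print(f"{bw_GBps} GB/s -> ~{t_ms:.2f} ms per dispatch")

# Driving this traffic from SMs (roughly 15% of them in DeepEP's case)
# steals cycles from GEMMs; offloading transfer semantics to the NIC or
# switch (in-network computing) is what reclaims that overhead.
```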
Alibaba Cloud UPN512 Architecture Overview
To address these challenges, Alibaba Cloud proposes the Ultra Performance Network (UPN) Scale-Up system, extending the design principles of its High Performance Network (HPN) Scale-Out architecture. UPN targets a "large-scale, high-performance, high-reliability, low-cost, and extensible" xPU Scale-Up network, removing the dependence on dense, small-form-factor cabinets.
Key Design Pillars of UPN Architecture
Based on High-Radix Ethernet: Leveraging the mature Ethernet ecosystem, a single-layer design supports up to 512 xPUs (with future support for 1K+), offering large scale and extensibility (a sizing sketch follows this list).
Adopting LPO/NPO Optical Interconnects: Linear pluggable optics (LPO) and near-packaged optics (NPO) enable optical scale-up while decoupling from high-density cabinet dependencies, reducing system complexity and operational challenges. Expected benefits include >30% cost reduction and >3x reliability improvement.
Single‑Layer Switch Protocol Design: Simplified protocol design facilitates the definition of network communication semantics and in‑network computing, focusing on high‑performance communication while minimizing compute resource consumption.
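To make the single-layer scaling concrete, the sketch below sizes a flat topology under assumed parameters; the switch radix, plane count, and 800 Gb/s ports are illustrative choices, not UPN's actual specifications. The point is that xPU count is bounded by switch radix rather than hop count, which is how a high-radix Ethernet chip reaches 512 endpoints in one layer.

```python
# Single-layer (one-hop) sizing sketch: each xPU takes one port on each
# of P parallel switch planes, so xPU count is bounded by switch radix
# and per-xPU bandwidth by P x port speed. Radix, plane count, and port
# speed are illustrative assumptions, not UPN's actual parameters.
def single_layer(radix, planes, port_gbps=800):
    max_xpu = radix                    # one downlink per xPU per plane
    bw_gBps = planes * port_gbps / 8   # unidirectional GB/s per xPU
    return max_xpu, bw_gBps

for radix in (512, 1024):
    n, bw = single_layer(radix, planes=8)
    print(f"radix-{radix} switch plane -> {n} xPU, ~{bw:.0f} GB/s per xPU")
```

Keeping the fabric to a single switch layer also keeps every path to one hop, which is what makes the simplified protocol and in-network computing semantics above tractable.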