How Alibaba’s ALink System and UPN512 Architecture Redefine AI Scale‑Up Networking

The article explains Alibaba’s ALink System, detailing its data‑plane ALS‑D and control‑plane ALS‑M, the backplane‑free orthogonal hardware design, copper and optical interconnect layers, and the UPN512 architecture’s optical options, transmission semantics, and in‑network computing techniques that together reshape AI scale‑up networking.

Architects' Tech Alliance

ALink System Overview

ALink System consists of ALS‑D (data plane) and ALS‑M (control plane). The data plane provides high‑speed GPU‑to‑GPU data transfer using UALink, supporting memory‑semantic access and in‑network computing such as All‑Reduce directly on the switch.

The control plane offers a unified software interface for various AI accelerator chips (Alibaba’s PPU, third‑party GPUs, etc.), enabling multi‑tenant isolation through virtual networks and QoS policies.
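To make the control-plane role more concrete, the sketch below shows one way a tenant-isolation request to such an interface could be expressed. The class and field names are hypothetical illustrations, not part of the actual ALS-M API.

```python
# Hypothetical sketch of a tenant-isolation request to a unified control plane.
# Class and field names are illustrative only, not the real ALS-M interface.
from dataclasses import dataclass

@dataclass
class VirtualNetwork:
    tenant_id: str            # tenant owning this isolated scale-up domain
    xpu_ids: list[int]        # accelerators (PPU, third-party GPU, ...) in the domain
    min_bandwidth_gbps: int   # QoS floor the switches are asked to enforce
    priority_class: int       # scheduling priority on contended links

def build_isolation_request(tenant: str, xpus: list[int]) -> VirtualNetwork:
    """Assemble a request the control plane could translate into
    per-tenant virtual-network and QoS configuration on the switches."""
    return VirtualNetwork(tenant_id=tenant, xpu_ids=xpus,
                          min_bandwidth_gbps=400, priority_class=1)

print(build_isolation_request("tenant-a", list(range(8))))
```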

ALink Switch is tightly integrated with Alibaba’s CIPU 2.0, which manages compute, storage, and network resources and works with high‑performance NICs to achieve petabit‑per‑second aggregate bandwidth and sub‑microsecond latency.

Hardware Architecture

The system adopts a “backplane‑free orthogonal” design: two ALink‑Switch nodes are mounted vertically in a cabinet, each hosting eight switch trays. Sixteen GPU boards (64 GPUs total) are placed orthogonally to the switches, eliminating traditional backplane cabling and shortening signal paths.
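A quick back-of-the-envelope check of that layout; only the per-board GPU count is derived from the stated totals rather than quoted directly.

```python
# Back-of-the-envelope check of the cabinet layout described above.
# Only the per-board GPU count is derived; the other numbers are quoted.
switch_nodes = 2        # ALink-Switch nodes mounted vertically in the cabinet
trays_per_node = 8      # switch trays hosted by each node
gpu_boards = 16         # GPU boards mounted orthogonally to the switches
total_gpus = 64         # stated per-cabinet total

switch_trays = switch_nodes * trays_per_node     # 16 switch trays
gpus_per_board = total_gpus // gpu_boards        # 4 GPUs per board (derived)
# In an orthogonal, backplane-free layout each GPU board can mate directly
# with every switch tray, giving this many board-to-tray crossing points:
crossings = gpu_boards * switch_trays            # 256
print(switch_trays, gpus_per_board, crossings)
```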

Layered Interconnect

Data transmission uses two layers: a copper layer for intra‑node communication (224 Gb/s per SerDes lane into the switch chips via CPO connectors) and an optical layer for inter‑node communication, providing TB‑scale bandwidth and PB‑scale shared memory across cabinets.

UPN512 Architecture

UPN512 is Alibaba’s scale‑up architecture described in the “UPN512 Technical Architecture Whitepaper v1.0”. It relies on high‑radix Ethernet, optical interconnect, and a single‑layer switching protocol to simplify networking for up to 1 K xPU devices.
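A rough sizing sketch of why a single switching layer can cover the whole domain; the switch radix and per-xPU port count below are illustrative assumptions, not figures from the whitepaper.

```python
# Rough sizing sketch for a single-layer (one-hop) scale-up fabric.
# The switch radix and per-xPU port count are illustrative assumptions,
# not figures taken from the UPN512 whitepaper.
xpus = 512            # scale-up domain size targeted by UPN512
switch_radix = 512    # assumed ports on a high-radix Ethernet switch
ports_per_xpu = 8     # assumed scale-up ports per accelerator

# With one port per xPU per switch plane, a plane needs at least `xpus`
# ports; if the radix covers that, every path stays a single switch hop
# and the number of parallel planes equals the per-xPU port count.
planes = ports_per_xpu if switch_radix >= xpus else None
print(planes)  # 8 independent single-hop planes
```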

Three optical interconnect options are discussed:

FRO : traditional pluggable module with DSP; high latency, power, and cost – not suitable for scale‑up.

LPO : linear pluggable, DSP‑free, uses advanced SerDes; reduces latency, power, and cost by ~30%.

NPO : near‑chip optical engine placed on PCB, offers higher bandwidth density, lower cost, and easier ecosystem integration; further reduces power and latency compared with LPO.

Bandwidth density comparisons show NPO can achieve 200‑300 Tb/s per rack unit, far exceeding LPO’s limits.
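For intuition, the arithmetic below shows one assumed combination of engine count and lane rate that lands inside the quoted 200-300 Tb/s-per-RU range; the engine and lane counts are not from the whitepaper, only the 224 Gb/s lane rate appears earlier in the text.

```python
# Illustrative arithmetic only: one assumed combination that lands in the
# quoted 200-300 Tb/s-per-RU range. Engine and lane counts are assumptions;
# only the 224 Gb/s lane rate is cited earlier in the text.
engines_per_ru = 64     # assumed near-package optical engines per rack unit
lanes_per_engine = 16   # assumed lanes per optical engine
gbps_per_lane = 224     # 224 Gb/s SerDes rate cited for the copper layer

total_tbps = engines_per_ru * lanes_per_engine * gbps_per_lane / 1000
print(f"{total_tbps:.0f} Tb/s per rack unit")   # ~229 Tb/s
```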

Transmission Semantics

UPN512 defines three transmission semantics for different data sizes, illustrated in the sketch after this list:

Memory semantics (Load/Store) : small, low‑latency accesses driven directly by the xPU’s load/store unit.

Message semantics (Send/Recv) : asynchronous large‑block transfers using DMA engines, suitable for model weights or gradients.

Tensor semantics (Push/Pull) : optimized for 1‑100 KB tensors in large‑model training, providing asynchronous I/O, batch/streaming modes, explicit/implicit confirmation, minimal latency, dynamic compression, and in‑network computing.
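The sketch below contrasts the three semantics as a simplified, hypothetical API; the function names and signatures are illustrative, and only the behavioral distinctions between the three come from the whitepaper.

```python
# Hypothetical, simplified view of the three transmission semantics.
# Function names and signatures are illustrative; only the behavioral
# distinctions between the three are taken from the whitepaper.
import numpy as np

def load(remote_addr: int, nbytes: int) -> bytes:
    """Memory semantics: a small, synchronous access issued directly by
    the xPU's load/store unit (lowest latency, finest granularity)."""
    ...

def send(dst_xpu: int, buffer: np.ndarray) -> None:
    """Message semantics: an asynchronous bulk transfer handed off to a
    DMA engine, e.g. for model weights or gradients."""
    ...

def push(group_id: int, tensor: np.ndarray, compress: bool = True) -> None:
    """Tensor semantics: a 1-100 KB tensor pushed toward the switch, with
    optional compression and in-network reduction along the way."""
    ...
```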

In‑Network Computing

To offload collective operations from xPUs, UPN512 equips Ethernet switches with compute capabilities supporting data types (INT32, Float8/16/32, BFloat16) and operations (Min, Sum, Max). Protocol extensions add a “computational header” to any Ethernet‑based scale‑up protocol, keeping the protocol and compute logic decoupled.
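As a hedged illustration, the snippet below packs one plausible layout for such a computational header; the field names, widths, and byte layout are assumptions, while the operation and data-type sets come from the text above.

```python
# Hedged sketch of one plausible layout for the "computational header"
# carried alongside an Ethernet-based scale-up protocol. Field names,
# widths, and byte layout are assumptions; only the operation and
# data-type sets come from the text above.
import enum
import struct

class Op(enum.IntEnum):
    SUM = 0; MIN = 1; MAX = 2

class DType(enum.IntEnum):
    INT32 = 0; FLOAT8 = 1; FLOAT16 = 2; BFLOAT16 = 3; FLOAT32 = 4

def pack_compute_header(group_id: int, op: Op, dtype: DType, elems: int) -> bytes:
    """Encode the compute request separately from the base transport
    header, keeping the switch's compute logic decoupled from the protocol."""
    return struct.pack("!IBBH", group_id, op, dtype, elems)

hdr = pack_compute_header(group_id=7, op=Op.SUM, dtype=DType.BFLOAT16, elems=1024)
print(hdr.hex())
```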

Two collective communication flows are defined, each illustrated with a short sketch after its steps:

Symmetric collectives (e.g., AllReduce, AllGather)

Form a virtual address group V_G and register a multicast group on the switch.

Each xPU maps V_G to its real memory address.

Push tensors to the switch, which broadcasts or aggregates them; the results are then written back using the pre‑computed real addresses.
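A minimal simulation of this symmetric flow, assuming a switch object that tracks the multicast group for V_G and performs the reduction; the data structures are illustrative rather than the UPN512 wire protocol.

```python
# Minimal simulation of the symmetric flow above (AllReduce-style),
# assuming a switch that tracks the multicast group for V_G and performs
# the reduction. Data structures are illustrative, not the wire protocol.
import numpy as np

class SymmetricSwitch:
    def __init__(self):
        self.groups = {}   # V_G id -> member xPU ids

    def register_multicast(self, vg: int, members: list[int]) -> None:
        self.groups[vg] = members

    def all_reduce(self, vg: int, pushed: dict) -> dict:
        """Aggregate the pushed tensors and fan the result out to every
        member, which then stores it at its pre-computed real address."""
        result = sum(pushed.values())
        return {xpu: result for xpu in self.groups[vg]}

switch = SymmetricSwitch()
switch.register_multicast(vg=1, members=[0, 1, 2, 3])
pushed = {x: np.full(4, float(x)) for x in range(4)}   # each xPU pushes a tensor
out = switch.all_reduce(vg=1, pushed=pushed)
print(out[0])   # [6. 6. 6. 6.] -- the reduced tensor delivered back to xPU 0
```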

Asymmetric collectives (e.g., Dispatch/Combine)

Establish V_G and per‑xPU counters.

Dispatch phase: the sender pushes tensors with destination xPU IDs; the switch forwards to the appropriate receivers, which store data using their counters.

Combine phase: receivers pull required tensors; the switch aggregates and returns the result.
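And a matching minimal simulation of the asymmetric Dispatch/Combine flow, again with illustrative data structures and per-xPU counters standing in for the real receive-side bookkeeping.

```python
# Matching minimal simulation of the asymmetric Dispatch/Combine flow,
# with per-xPU counters standing in for the real receive-side bookkeeping.
import numpy as np
from collections import defaultdict

class AsymmetricSwitch:
    def __init__(self, members: list[int]):
        self.buffers = defaultdict(list)            # receiver xPU -> queued tensors
        self.counters = {x: 0 for x in members}     # per-xPU receive counter

    def dispatch(self, sender: int, items) -> None:
        """Forward each (destination xPU id, tensor) pair; the receiver's
        counter indexes where the data lands in its local memory."""
        for dst, tensor in items:
            self.buffers[dst].append(tensor)
            self.counters[dst] += 1

    def combine(self, receiver: int) -> np.ndarray:
        """On pull, aggregate everything queued for the receiver and return it."""
        return sum(self.buffers.pop(receiver))

sw = AsymmetricSwitch(members=[0, 1])
sw.dispatch(sender=0, items=[(1, np.ones(4)), (1, 2 * np.ones(4))])
print(sw.combine(receiver=1))   # [3. 3. 3. 3.]
```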

These mechanisms aim to let the network perform more work, reducing compute overhead on the xPUs.

Tags: Network Architecture, AI, ALink, Interconnect, UPN512