Inside Huawei’s CloudMatrix384: How a 384‑NPU AI Supernode Achieves Sub‑Microsecond Latency
The article details Huawei’s CloudMatrix384 AI supernode, describing its 384 Ascend 910C NPUs, 192 Kunpeng CPUs, ultra‑high‑bandwidth UB network, three complementary network planes (UB, RDMA, VPC), and the non‑blocking topology that enables sub‑microsecond inter‑node latency across a 16‑rack deployment.
CloudMatrix384 is designed as an AI supernode that integrates 384 of Huawei's Ascend 910C NPUs and 192 Kunpeng CPUs.
Architecture Features
A key characteristic is its ultra‑high‑bandwidth, fully interconnected point‑to‑point network based on the UB protocol, which links every NPU and CPU. The 384 NPUs and 192 CPUs are connected through UB switches, achieving inter‑node latency below 1 µs. Huawei argues that current AI workloads are bandwidth‑intensive rather than latency‑sensitive, though this view may vary with model type and scenario.
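To make the bandwidth‑versus‑latency argument concrete, here is a back‑of‑the‑envelope sketch in Python. It is an illustrative model, not Huawei's methodology: transfer time is treated as a fixed per‑hop latency plus a serialization term, the 392 GB/s figure is the per‑NPU UB bandwidth cited later in this article, and the message sizes are arbitrary assumptions.

def transfer_time_us(size_bytes: float, latency_us: float, bandwidth_gb_s: float) -> float:
    """Estimated transfer time in microseconds: fixed latency plus serialization."""
    serialization_us = size_bytes / (bandwidth_gb_s * 1e9) * 1e6
    return latency_us + serialization_us

latency_us = 1.0        # sub-microsecond inter-node latency, rounded up to 1 us
bandwidth_gb_s = 392.0  # per-NPU unidirectional UB bandwidth quoted below

for size in (64 * 1024, 16 * 1024**2, 256 * 1024**2):  # 64 KiB, 16 MiB, 256 MiB
    total = transfer_time_us(size, latency_us, bandwidth_gb_s)
    print(f"{size / 1024**2:8.2f} MiB -> {total:9.1f} us "
          f"(latency share: {latency_us / total:6.2%})")

For collective messages in the tens of megabytes and beyond, the fixed latency contributes only a few percent or less of the total time, which is the intuition behind the bandwidth‑first design choice.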
Three Network Planes
1) UB Plane
The UB plane forms the primary high‑bandwidth scale‑up fabric inside the supernode, providing a non‑blocking full‑mesh topology that directly interconnects every NPU and CPU. Each Ascend 910C contributes over 392 GB/s of unidirectional bandwidth (via fourteen 400 Gbps Ethernet interfaces). UB's memory‑semantic access enables fine‑grained parallelism schemes such as tensor parallelism (TP) and expert parallelism (EP), as well as rapid point‑to‑point access to memory pooled across CPUs and NPUs, which is essential for caching model weights and KV caches.
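As a rough illustration of why fine‑grained TP depends on this per‑NPU bandwidth, the sketch below estimates the per‑layer ring all‑reduce time over activations. The workload dimensions (batch, sequence length, hidden size) and the TP degree are hypothetical, and the model ignores latency terms and compute/communication overlap.

def allreduce_time_us(tensor_bytes: float, tp_degree: int, bandwidth_gb_s: float) -> float:
    """Ring all-reduce moves ~2*(N-1)/N of the tensor per rank; latency ignored."""
    traffic_bytes = 2 * (tp_degree - 1) / tp_degree * tensor_bytes
    return traffic_bytes / (bandwidth_gb_s * 1e9) * 1e6

# Hypothetical prefill step: batch 16, sequence 2048, hidden size 8192, bf16 activations.
activation_bytes = 16 * 2048 * 8192 * 2
print(f"TP=8 all-reduce per layer: {allreduce_time_us(activation_bytes, 8, 392.0):.0f} us")

Repeated across every layer of a forward pass, this communication term scales with 1/bandwidth, which is why a full‑mesh, high‑bandwidth UB plane matters for fine‑grained parallelism.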
2) RDMA Plane
The RDMA plane enables scale‑out communication between the CloudMatrix384 supernode and external RDMA‑capable systems. It currently runs RDMA over Converged Ethernet (RoCE), ensuring compatibility with standard RDMA stacks. Each NPU provides up to 400 Gbps of unidirectional RDMA bandwidth, dedicated to NPU‑related message transport, while control and storage traffic are isolated on the third VPC plane.
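As a simple illustration of what this scale‑out bandwidth implies, the sketch below estimates the time to move data out of the supernode when a transfer is striped across several NPUs' RDMA links. Only the 400 Gbps per‑NPU figure comes from the article; the 64 GB payload and the 8‑NPU striping are assumptions.

RDMA_GBPS_PER_NPU = 400.0               # per-NPU unidirectional RDMA bandwidth (from the article)
rdma_gb_per_s = RDMA_GBPS_PER_NPU / 8   # 50 GB/s per NPU

def scale_out_time_ms(size_gb: float, npus: int) -> float:
    """Time to push size_gb to another RDMA-capable system, striped across npus NPUs."""
    return size_gb / (rdma_gb_per_s * npus) * 1e3

# Hypothetical example: a 64 GB KV-cache transfer striped across the 8 NPUs of one node.
print(f"{scale_out_time_ms(64, 8):.0f} ms")   # -> 160 ms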
3) VPC Plane
The Virtual Private Cloud (VPC) plane connects the supernode to the broader data‑center network via high‑speed NICs (Huawei’s QingTianHe cards), offering 400 Gbps per node. It operates on standard Ethernet/IP and can optionally switch to a UB‑over‑Ethernet mode. VPC handles management and control operations (deployment, monitoring, scheduling), provides access to persistent storage services (OBS, EVS, SFS), and carries external service traffic from CPU‑resident workloads such as databases and user interfaces.
Implementation Details
Each compute node integrates eight Ascend 910C NPUs, four Kunpeng CPUs, and seven onboard UB switch chips (each 19.2 Tbps, supporting 192 × 112 G SerDes and 48 × 400 G ports). Each NPU receives up to 392 GB/s of UB bandwidth, while each CPU slot receives about 160 GB/s. A single onboard UB switch offers 448 GB/s upstream capacity to the next switch layer, though some capacity may be underutilized.
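These per‑node figures can be cross‑checked with simple arithmetic. The sketch below uses only the numbers quoted in this section; the assumption that each NPU spreads its bandwidth evenly across the seven onboard switches is an inference, not a stated fact.

NPUS_PER_NODE, CPU_SLOTS, L1_SWITCHES = 8, 4, 7
NPU_UB_GBS, CPU_UB_GBS, UPLINK_PER_SWITCH_GBS = 392, 160, 448

npu_demand = NPUS_PER_NODE * NPU_UB_GBS           # 8 * 392 = 3136 GB/s
cpu_demand = CPU_SLOTS * CPU_UB_GBS               # 4 * 160 = 640 GB/s
uplink = L1_SWITCHES * UPLINK_PER_SWITCH_GBS      # 7 * 448 = 3136 GB/s

# Assuming each NPU spreads its 392 GB/s evenly across the seven L1 switches
# (56 GB/s each), one switch carries 8 * 56 = 448 GB/s of NPU traffic,
# matching its 448 GB/s uplink to the next switch layer.
print(npu_demand, cpu_demand, uplink)             # 3136 640 3136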
Supernode Scale and Topology
The supernode spans 16 racks: twelve compute racks hosting 48 Ascend 910C nodes (totaling 384 NPUs) and four communication racks containing second‑level (L2) UB switches (each 19.2 Tbps, supporting 48 × 400 G ports) that interconnect all nodes. The topology maps the seven L1 UB switches inside each node to seven L2 sub‑planes; each L2 sub‑plane comprises 16 L2 UB chips with 48 × 400 G ports. Each L1 switch fans out via 16 links—one to each L2 chip in its sub‑plane—ensuring that aggregate uplink bandwidth matches internal UB capacity, preserving a non‑blocking design throughout the supernode.
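A quick arithmetic check of these topology numbers, using only the values stated above:

NODES, L1_PER_NODE, SUBPLANES = 48, 7, 7
L2_PER_SUBPLANE, L2_PORTS = 16, 48
UPLINKS_PER_L1 = 16                                   # one link to each L2 chip in the sub-plane

l2_switches = SUBPLANES * L2_PER_SUBPLANE             # 7 * 16 = 112 L2 chips in total
uplinks_per_subplane = NODES * UPLINKS_PER_L1         # 48 * 16 = 768 node uplinks per sub-plane
l2_ports_per_subplane = L2_PER_SUBPLANE * L2_PORTS    # 16 * 48 = 768 ports per sub-plane

assert uplinks_per_subplane == l2_ports_per_subplane  # every L2 port terminates one node uplink
print(l2_switches, uplinks_per_subplane)              # 112 768

Every 400 G port on every L2 chip is accounted for by exactly one node uplink, which is consistent with the non‑blocking claim for the full supernode.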