How Huawei’s CloudMatrix 384 Challenges Nvidia’s AI Supercomputers
Huawei’s CloudMatrix 384, built from 384 Ascend 910C chips connected in an all‑to‑all topology, delivers up to 300 PFLOPs of dense BF16 compute—nearly twice that of Nvidia’s GB200 NVL72—while exposing supply‑chain dependence on foreign fabs and toolmakers, much higher power consumption, and a rapid push to scale China’s domestic semiconductor capabilities.
Huawei has introduced CloudMatrix 384, a rack‑scale AI system built on Ascend 910C chips that competes directly with Nvidia’s GB200 NVL72 and, on some metrics, surpasses it.
CloudMatrix 384 consists of 384 Ascend 910C chips connected in an all‑to‑all topology. The system delivers 300 PFLOPs of dense BF16 compute—nearly double that of GB200 NVL72—along with 3.6× the total memory capacity and 2.1× the memory bandwidth, though its power draw is 4.1× higher.
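As a rough sanity check, the headline figure follows from per‑chip throughput. The per‑chip and GB200 NVL72 numbers below are assumptions for illustration (roughly 0.78 PFLOPs dense BF16 per Ascend 910C and roughly 180 PFLOPs for the NVL72); the article itself states only the system totals:

```python
# Back-of-envelope check of the CloudMatrix 384 aggregate compute figure.
CHIPS = 384
PFLOPS_PER_CHIP = 0.78  # assumed dense BF16 throughput per Ascend 910C
NVL72_PFLOPS = 180.0    # assumed dense BF16 total for Nvidia GB200 NVL72

system_pflops = CHIPS * PFLOPS_PER_CHIP
print(f"{system_pflops:.0f} PFLOPs")              # ~300 PFLOPs, the headline number
print(f"{system_pflops / NVL72_PFLOPS:.2f}x")     # ~1.66x, i.e. "nearly double"
```

Under these assumptions the system‑level advantage comes almost entirely from chip count: each 910C is individually slower than a GB200, but 384 of them networked together out‑scale the 72‑GPU NVL72.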
While the Ascend 910C is designed in China, its manufacturing still depends heavily on foreign suppliers: Samsung for HBM, TSMC for logic wafers, and toolmakers in the United States, the Netherlands, and Japan for fab equipment. Huawei secured roughly 13 million HBM stacks from Samsung before the HBM export ban took effect—enough for about 1.6 million Ascend 910C packages.
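The stack total and the package total are consistent if each 910C package carries eight HBM stacks—an assumption for illustration, since the article gives only the aggregate figures:

```python
# How many Ascend 910C packages ~13M HBM stacks can supply,
# assuming 8 HBM stacks per package (assumption, not stated in the article).
TOTAL_STACKS = 13_000_000
STACKS_PER_PACKAGE = 8

packages = TOTAL_STACKS // STACKS_PER_PACKAGE
print(packages)  # 1,625,000 — in line with the ~1.6 million packages cited
```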
Domestic fabs such as SMIC and CXMT are expanding rapidly; SMIC’s monthly capacity is approaching 50,000 wafers, and output could rise further as yields improve, potentially supplying a significant share of the required chips.
The full CloudMatrix system spans 16 racks: 12 compute racks each host 32 chips, and four scale‑up switch racks provide optical inter‑rack connectivity. This architecture scales a single coherent system to hundreds of accelerators—something Nvidia abandoned with its DGX H100 NVL256 “Ranger” platform because of cost, power, and networking complexity.
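The rack arithmetic in the layout above checks out directly:

```python
# CloudMatrix 384 physical layout (figures from the article).
COMPUTE_RACKS = 12
CHIPS_PER_RACK = 32
SWITCH_RACKS = 4

total_chips = COMPUTE_RACKS * CHIPS_PER_RACK
total_racks = COMPUTE_RACKS + SWITCH_RACKS
print(total_chips, total_racks)  # 384 16
```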
Overall, Huawei’s solution demonstrates that system‑level engineering—covering networking, optics, and software—can offset chip‑level performance gaps, but the approach faces challenges from higher power consumption and continued dependence on foreign semiconductor supply chains.
Key Specifications
384 Ascend 910C chips
300 PFLOPs dense BF16 compute
3.6× memory capacity, 2.1× memory bandwidth vs. GB200 NVL72
Power consumption 4.1× that of GB200 NVL72
Supply‑Chain Highlights
HBM sourced mainly from Samsung (13 million stacks)
Wafer production largely on TSMC’s 7 nm process
Domestic fabs (SMIC, CXMT) expanding capacity and equipment base