How Huawei’s New Atlas Supernodes Redefine AI Compute Power
At its 2025 Full‑Connection Conference, Huawei unveiled the Atlas 950 and Atlas 960 SuperPoD supernodes, detailing their massive card counts and their compute, memory, and bandwidth capabilities, and explaining how full‑stack hardware‑software co‑design dramatically accelerates large‑model AI training and inference.
At the Huawei Full‑Connection Conference (September 18‑20, 2025) in Shanghai, the company announced its latest AI supernode products: the Atlas 950 SuperPoD and Atlas 960 SuperPoD. The Atlas 950 supports 8,192 Ascend 950DT cards, while the Atlas 960 supports 15,488 Ascend 960 cards, surpassing competitors in card scale, total compute, memory capacity, and interconnect bandwidth.
Based on these supernodes, Huawei also introduced the Atlas 950 SuperCluster and Atlas 960 SuperCluster, which scale to more than 500,000 and one million cards respectively, offering a “supernode + cluster” solution built on domestically available chip‑manufacturing processes to meet China’s rapidly growing compute demand.
The supernode concept originated earlier this year with the Atlas 900, which integrates 384 Ascend 910C chips for a peak performance of 300 PFLOPS and underpins the CloudMatrix384 cloud service instance. The Atlas 950 represents a more than 20‑fold increase in card count over the Atlas 900 and is slated for market release in Q4 2026.
Compared with Nvidia’s planned NVL144 (expected in H2 2026), the Atlas 950 offers 56.8 times the card count, 6.7 times the total compute, 15 times the memory, and 62 times the interconnect bandwidth. The Atlas 960 further doubles the resources of the Atlas 950, delivering a more than three‑fold improvement in training performance and a four‑fold improvement in inference performance, with a planned release in Q4 2027.
A modern supernode is not only hardware‑rich but also relies on a deeply optimized full‑stack software layer that amplifies hardware capabilities. This stack includes hardware abstraction and driver layers, specially tuned communication libraries, and compute‑ and memory‑optimization techniques, forming a cohesive hardware‑software co‑design.
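To make the layering concrete, here is a minimal Python sketch of how a single gradient-synchronization call might descend through such a stack. Every class and method name here is a hypothetical illustration, not Huawei's actual CANN, HCCL, or driver API.

```python
# Illustrative layering only: all names below are hypothetical stand-ins.

class DriverLayer:
    """Bottom layer: abstracts device registers and DMA engines."""
    def dma_write(self, device_id: int, offset: int, payload: bytes) -> None:
        print(f"[driver] DMA {len(payload)} B -> device {device_id} @ {offset:#x}")

class CommLibrary:
    """Middle layer: a topology-tuned collective library (analogous in
    role to NCCL or HCCL)."""
    def __init__(self, driver: DriverLayer, devices: list[int]):
        self.driver, self.devices = driver, devices

    def all_reduce(self, payload: bytes) -> None:
        # A real library would pick ring/tree algorithms per topology;
        # this sketch simply fans the buffer out to every device.
        for dev in self.devices:
            self.driver.dma_write(dev, 0x1000, payload)

class Framework:
    """Top layer: what model training code actually calls."""
    def __init__(self, comm: CommLibrary):
        self.comm = comm

    def sync_gradients(self, grads: bytes) -> None:
        self.comm.all_reduce(grads)

# One gradient sync traverses all three layers of the stack.
stack = Framework(CommLibrary(DriverLayer(), devices=[0, 1, 2, 3]))
stack.sync_gradients(b"\x00" * 16)
```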
The technical advantages start with the high‑bandwidth, low‑latency interconnect. Traditional architectures based on PCIe or Ethernet deliver 200‑400 Gb/s of bandwidth with latencies in the tens of microseconds; Huawei’s supernodes raise bandwidth 15‑fold and cut single‑hop latency from 2 µs to 200 ns, yielding more than a three‑fold speedup on large models such as DeepSeek and Qwen.
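A back-of-envelope model shows why both figures matter. The sketch below applies the simple cost model transfer_time = latency + size / bandwidth to the numbers quoted above (400 Gb/s and 2 µs for the legacy path; 15 times that bandwidth and 200 ns for the supernode); the message sizes are arbitrary examples, not benchmark data.

```python
# Back-of-envelope, single-hop model using only the figures quoted above.
# transfer_time = latency + message_size / bandwidth

def hop_time(msg_bytes: float, bandwidth_gbps: float, latency_s: float) -> float:
    """Time for one message over one hop; bandwidth in Gb/s."""
    return latency_s + (msg_bytes * 8) / (bandwidth_gbps * 1e9)

for size in (4 * 1024, 1024 * 1024):  # a small control packet vs. a 1 MiB tensor shard
    legacy = hop_time(size, bandwidth_gbps=400, latency_s=2e-6)            # PCIe/Ethernet-class
    supernode = hop_time(size, bandwidth_gbps=400 * 15, latency_s=200e-9)  # 15x bandwidth, 200 ns
    print(f"{size/1024:7.0f} KiB: legacy {legacy*1e6:6.2f} us, "
          f"supernode {supernode*1e6:6.2f} us, speedup {legacy/supernode:4.1f}x")
```

Under this model, small packets gain roughly the 10-fold latency reduction, while megabyte-scale tensor shards approach the full 15-fold bandwidth factor, which is how per-hop improvements can translate into multi-fold end-to-end training speedups.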
The supernode’s global memory management assigns unique addresses across all interconnected devices, enabling direct memory‑semantic communication that bypasses the conventional serialize‑transfer‑deserialize pipeline, thus improving small‑packet transmission and random‑access efficiency during parameter synchronization.
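The contrast can be sketched in a few lines of Python. The GlobalMemory class below is a hypothetical stand-in for a unified address space, not a real Huawei API; it only illustrates how a one-sided put replaces the serialize-transfer-deserialize round trip.

```python
import pickle

# Conventional path: serialize on the sender, transfer, deserialize on
# the receiver -- extra copies plus CPU work on every message.
def message_update(params: dict, update: dict) -> dict:
    wire = pickle.dumps(update)        # serialize
    params.update(pickle.loads(wire))  # transfer + deserialize
    return params

# Memory-semantic path: the sender writes straight into the destination
# chip's memory through a globally unique address. GlobalMemory is a
# hypothetical illustration, not a Huawei API.
class GlobalMemory:
    def __init__(self) -> None:
        self.mem: dict[tuple[int, int], float] = {}

    def put(self, chip_id: int, offset: int, value: float) -> None:
        # One-sided write: no serialization, no receiver-side handler.
        self.mem[(chip_id, offset)] = value

params = message_update({"w0": 0.0}, {"w0": 0.5})  # classic pipeline
gm = GlobalMemory()
gm.put(chip_id=7, offset=0x40, value=0.5)          # direct remote write
```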
In AI‑system terms, a supernode hosts 32 or more AI chips, with inter‑chip bandwidth of at least 400 GB/s and switch latency under 500 ns. Unified memory addressing allows any AI chip to directly access another chip’s memory via memory‑semantic operations.
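Under those stated thresholds, the classification reduces to a simple predicate, sketched below; the figures in the example calls are illustrative rather than measured specifications.

```python
# Toy predicate built only from the thresholds stated above.
def qualifies_as_supernode(num_chips: int,
                           interchip_bw_gbytes_s: float,
                           switch_latency_ns: float,
                           unified_addressing: bool) -> bool:
    return (num_chips >= 32
            and interchip_bw_gbytes_s >= 400
            and switch_latency_ns < 500
            and unified_addressing)

print(qualifies_as_supernode(384, 400, 200, True))  # supernode-class: True
print(qualifies_as_supernode(8, 64, 5000, False))   # ordinary PCIe box: False
```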
For large‑scale “Scale‑Up” networking, supernodes introduce a dedicated communication domain backed by custom switches, breaking single‑node hardware limits and enabling massive compute aggregation. Stability, fine‑grained resource scheduling, performance isolation, and data‑security mechanisms are essential to ensure continuous operation across diverse workloads.
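As a rough illustration of fine-grained scheduling with performance isolation, the sketch below carves a shared chip pool into disjoint per-job communication domains; the policy and names are assumptions for illustration, not Huawei's actual scheduler.

```python
def partition(chips: list[int], jobs: dict[str, int]) -> dict[str, list[int]]:
    """Assign each job a disjoint slice of chips so that workloads cannot
    contend for one another's interconnect bandwidth."""
    domains: dict[str, list[int]] = {}
    cursor = 0
    for job, need in jobs.items():
        if cursor + need > len(chips):
            raise RuntimeError(f"not enough free chips for {job}")
        domains[job] = chips[cursor:cursor + need]
        cursor += need
    return domains

chips = list(range(64))  # one 64-chip pool
print(partition(chips, {"training-A": 32, "inference-B": 16}))
```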