Huawei Ascend 950 NPU Architecture Deep Dive – Full Whitepaper Inside
The article gives a technical analysis of Huawei's Ascend 950 NPU series. It covers the one‑chip dual‑structure for training and inference, SIMD/SIMT dual‑mode compute, 128‑byte memory access granularity, Prefill/Decode (PD) separation, native FP4 support, the Lingqu 2.0 interconnect, and a self‑developed software and hardware stack that is described as compatible with CUDA.
1. One‑Chip Dual‑Structure: Splitting Training and Inference
The Ascend 950 family uses a single die to produce two variants: 950PR, optimized for large‑model prefill and recommendation workloads, and 950DT, targeting training and long‑text decoding. 950PR, mass‑produced from March 2026, carries 128 GB of HiBL 1.0 high‑bandwidth memory with 1.6 TB/s bandwidth, supports the FP8/MXFP8/HiF8 low‑precision formats, and delivers 1 PFLOPS of FP8 compute for fast prefill and KV‑cache generation. 950DT, slated for Q4 2026, moves to 144 GB of memory and 4 TB/s of bandwidth, is quoted at 1.5× the performance of PR, and reaches 2 PFLOPS of FP4 compute. The extra bandwidth is aimed at the memory‑bound, token‑by‑token decode phase.
2. Architectural Revolution: From Da Vinci to "GPU‑like" Design
2.1 SIMD/SIMT Dual‑Mode Co‑existence
The core compute units implement a novel SIMD/SIMT dual programming model. SIMD mode processes vector data in pipelines, ideal for regular tasks such as recommendation systems and computer vision, maximizing throughput. SIMT mode handles fragmented, parallel data, fitting NLP long‑text and large‑model decoding, allowing the chip to adapt seamlessly to both structured and irregular workloads.
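The difference between the two programming styles can be illustrated with a toy sketch in plain Python. This is not Ascend code: `LANES`, `simd_scale`, `gather_kernel`, and `launch` are hypothetical names chosen for illustration, and the loop in `launch` only stands in for a hardware thread scheduler.

```python
LANES = 4  # hypothetical vector width, not an Ascend 950 parameter

def simd_scale(values, factor):
    """SIMD style: one operation is applied to a whole contiguous vector at a time."""
    out = []
    for start in range(0, len(values), LANES):
        vector = values[start:start + LANES]          # contiguous load (last one may be partial)
        out.extend(lane * factor for lane in vector)  # same operation on every lane
    return out

def gather_kernel(tid, table, indices, out):
    """SIMT style: scalar code written for a single thread."""
    idx = indices[tid]             # each thread computes its own address
    if 0 <= idx < len(table):      # each thread may take its own branch
        out[tid] = table[idx]
    else:
        out[tid] = 0.0

def launch(kernel, n_threads, *args):
    """Stand-in for a hardware scheduler; runs the threads one after another."""
    for tid in range(n_threads):
        kernel(tid, *args)

table = [10.0, 20.0, 30.0, 40.0]
indices = [3, -1, 0, 3, 2]
out = [None] * len(indices)
launch(gather_kernel, len(indices), table, indices, out)

print(simd_scale([1, 2, 3, 4, 5], 2))
print(out)
```

The SIMD function suits dense, regular data because every lane does the same work on neighbouring elements. The SIMT kernel suits scattered lookups and data‑dependent branches, which is the kind of access pattern the section associates with decoding.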
2.2 Memory Subsystem Optimized to 128‑Byte Granularity
Memory access granularity is reduced from the previous 512 bytes to 128 bytes, a "microscopic" optimization that cuts wasted bandwidth when handling sparse data, improving efficiency by over 30 % for large‑model decoding and recommendation scenarios.
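Why a smaller granule helps sparse access can be shown with simple arithmetic. The sketch below assumes every request is aligned and is rounded up to a whole number of granules; the request sizes are illustrative, and the function names are hypothetical.

```python
import math

def fetched_bytes(request_bytes, granularity):
    """Bytes actually moved when a request is rounded up to whole granules."""
    return math.ceil(request_bytes / granularity) * granularity

def efficiency(request_bytes, granularity):
    """Fraction of the moved bytes that the request actually needed."""
    return request_bytes / fetched_bytes(request_bytes, granularity)

# Illustrative request sizes in bytes, e.g. small embedding rows or KV-cache fragments
for size in (32, 64, 128, 512):
    print(f"{size:>4} B request: "
          f"{efficiency(size, 512):.1%} useful at 512 B, "
          f"{efficiency(size, 128):.1%} useful at 128 B")
```

Under these assumptions a 64‑byte read uses 12.5 % of the bytes moved at 512‑byte granularity and 50 % at 128‑byte granularity, while requests of 512 bytes or more see no difference. The real gain depends on the access mix, which is why the figure quoted above is scenario‑specific.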
2.3 PD Separation Architecture
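The Prefill/Decode (PD) separation decouples the compute and storage resources used by the two inference phases. Prefill processes the whole prompt at once, so it needs high compute and comparatively little memory bandwidth; decode generates one token at a time and re‑reads the weights and KV cache at every step, so it needs high bandwidth and comparatively little compute. Matching hardware to each phase is claimed to cut inference latency by 50 % and double concurrency, removing the classic "one‑card‑cannot‑serve‑all" limitation.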
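A rough roofline estimate shows why the two phases stress different resources. The sketch below uses the 950PR figures quoted in section 1 (1 PFLOPS FP8, 1.6 TB/s) and assumes about 2 FLOPs per parameter per token with weights read once per forward pass. It ignores KV‑cache traffic, activations, and attention cost, so it is a simplification rather than a performance model; the function names are hypothetical.

```python
PEAK_FLOPS = 1e15        # 1 PFLOPS FP8, the 950PR figure quoted in section 1
MEM_BANDWIDTH = 1.6e12   # 1.6 TB/s, the 950PR figure quoted in section 1
BYTES_PER_PARAM = 1      # FP8 weights

def ridge_point(peak_flops, bandwidth):
    """FLOPs per byte above which a kernel is limited by compute, not memory."""
    return peak_flops / bandwidth

def arithmetic_intensity(tokens_per_pass, bytes_per_param=BYTES_PER_PARAM):
    """Assumes ~2 FLOPs (multiply + add) per parameter per token, weights read once."""
    return 2 * tokens_per_pass / bytes_per_param

def bound(tokens_per_pass):
    """Classify a forward pass as compute-bound or bandwidth-bound."""
    ridge = ridge_point(PEAK_FLOPS, MEM_BANDWIDTH)
    return "compute" if arithmetic_intensity(tokens_per_pass) >= ridge else "bandwidth"

print("ridge point:", ridge_point(PEAK_FLOPS, MEM_BANDWIDTH), "FLOP/byte")
print("prefill, 4096-token prompt:", bound(4096))
print("decode, 1 token per step:", bound(1))
```

With these assumptions a 4096‑token prefill sits far above the ridge point and is compute‑bound, while single‑token decode sits far below it and is bandwidth‑bound. Batching many decode requests raises the intensity, which is one reason real deployments behave less cleanly than this sketch.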
2.4 Full‑Stack Self‑Developed + Ecosystem Compatibility
Every layer of the stack, from the instruction set to the interconnect protocol, is designed in‑house, while compatibility with core CUDA APIs is said to be maintained. According to the article, this lets large models developed overseas be migrated without code rewrites, lowering the barrier to entering the ecosystem while preserving security autonomy.
3. Low‑Precision Breakthrough: Native FP4 Support
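Ascend 950 adds native FP4 (4‑bit ultra‑low precision) support alongside FP8/MXFP8/HiF8. FP4 weights take one‑quarter of the memory of FP16 and half that of FP8, so a single card with 144 GB of memory holds as many FP4 parameters as 576 GB would hold in FP16, roughly 288 billion. A trillion‑parameter model needs about 500 GB for its FP4 weights alone, so it still has to be sharded across several cards, although far fewer than at higher precision. FP4 throughput is rated at 2 PFLOPS, about 3.7× the 0.543 PFLOPS figure the source cites for Nvidia H100 (a cross‑precision comparison, since the H100 has no native FP4), and is claimed to cut high‑concurrency inference latency by 70 %.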
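The capacity arithmetic in this section can be checked with a short sketch. It counts weight storage only and ignores the scaling factors that block formats carry, the KV cache, and activations; `params_that_fit` and `weight_gb` are hypothetical helper names, and 1 GB is taken as 10^9 bytes.

```python
BITS = {"FP16": 16, "FP8": 8, "FP4": 4}
GB = 10**9  # decimal gigabyte, an assumption of this sketch

def params_that_fit(memory_gb, fmt):
    """Number of parameters whose weights fit in memory_gb at the given format."""
    return memory_gb * GB * 8 // BITS[fmt]

def weight_gb(num_params, fmt):
    """Weight storage in GB for num_params parameters at the given format."""
    return num_params * BITS[fmt] / 8 / GB

card_gb = 144  # 950DT memory size quoted in section 1
for fmt in BITS:
    print(f"{fmt}: {params_that_fit(card_gb, fmt) / 1e9:.0f} B parameters fit in {card_gb} GB")

print("FP16-equivalent capacity:", card_gb * BITS["FP16"] // BITS["FP4"], "GB")
print("1T parameters at FP4:", weight_gb(10**12, "FP4"), "GB")
```

Under these assumptions 144 GB holds about 288 billion FP4 parameters, and a trillion‑parameter model needs about 500 GB of FP4 weights, so the weights alone span at least four such cards.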
4. Lingqu 2.0 Interconnect: 8192‑Card Full Mesh
The Lingqu 2.0 interconnect provides 2 TB/s of bandwidth per card and reduces single‑hop latency from 2 µs to 200 ns (a 10× improvement). A full‑optical mesh topology raises rack‑to‑rack bandwidth tenfold and keeps cross‑rack latency to about 7 µs. On this basis the Atlas 950 supernode directly connects 8192 cards with 16.3 PB/s of total interconnect bandwidth, quoted as 62× that of Nvidia's NVL144 system, and targets the training of trillion‑parameter models.
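The headline numbers in this section are roughly consistent with each other, as a short check shows. The sketch assumes the aggregate figure is simply per‑card bandwidth multiplied by card count and that 1 PB/s equals 1000 TB/s; the function names are hypothetical.

```python
CARDS = 8192
PER_CARD_TB_PER_S = 2  # TB/s per card, as quoted for the interconnect

def aggregate_pb_per_s(cards, per_card_tb_per_s):
    """Total bandwidth in PB/s, assuming a plain sum over all cards."""
    return cards * per_card_tb_per_s / 1000

def latency_speedup(old_ns, new_ns):
    """Ratio between the old and new single-hop latency."""
    return old_ns / new_ns

print("aggregate bandwidth:", aggregate_pb_per_s(CARDS, PER_CARD_TB_PER_S), "PB/s")
print("single-hop latency speedup:", latency_speedup(2000, 200), "x")
```

The plain sum gives about 16.4 PB/s, close to the 16.3 PB/s quoted, and 2 µs to 200 ns is the stated 10× latency improvement.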
5. Breaking the Barrier: Autonomous AI Compute Ecosystem
Beyond the chip, Ascend 950 serves as the core of Huawei's end‑to‑end autonomous AI stack. Hardware, memory, interconnect, and software toolchain are all described as 100 % self‑controlled, which reduces supply risk and vendor lock‑in. Cost is claimed to be only one‑quarter of Nvidia H2 while delivering superior performance. The ecosystem spans native large‑model support, domestic servers, and operating systems, forming a complete "chip‑server‑model‑application" chain.
Conclusion
The Ascend 950 design aims to be precise, efficient, autonomous, and open: a one‑chip dual‑structure for scenario‑specific optimization, dual‑mode SIMD/SIMT flexibility, FP4 low‑precision efficiency, and the Lingqu 2.0 interconnect for massive clusters. It positions itself not as a follower but as a standard‑setter for Chinese AI chips, providing a compute foundation for the next wave of trillion‑parameter models.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
