How Huawei’s Ascend AI Chips Evolve: From 910C to 970 – Architecture, Performance, and Market Impact
This article analyzes Huawei’s Ascend AI chip roadmap, detailing the progression from the 910C baseline through the 950 series to the 960/970 generations, highlighting compute scaling, low‑precision formats, memory and interconnect upgrades, cost advantages, and the implications for large‑model AI workloads.
Core Parameter Overview
Huawei’s Ascend series follows a "one‑year‑one‑generation, compute‑doubling" logic, moving from the general‑purpose 910C baseline to specialized 950PR/DT models and finally to the ultra‑scale 960/970 chips, covering the full spectrum of training and inference needs.
1. Compute Evolution
910C: 800 TFLOPS of FP16 compute from a dual‑die package (two 910B‑class dies); supports CloudMatrix‑384 super‑node clusters for trillion‑parameter training.
950 Series: introduces FP8/FP4 low‑precision formats, boosting peak compute to 1 PFLOPS (FP8) while maintaining near‑FP16 accuracy, easing the precision‑versus‑throughput trade‑off in training.
960/970: adheres to the "compute doubles each generation" principle, delivering up to 8 PFLOPS (FP4), a 30% higher vector‑compute share, and a 30% energy‑efficiency gain over the 910C.
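The generational figures above can be sanity‑checked with simple arithmetic. The sketch below assumes (as the article's numbers imply, not as an official spec) that each generation doubles throughput at a given precision, and that halving precision from FP8 to FP4 doubles throughput again:

```python
# Back-of-envelope check of the "compute doubles each generation" claim.
# Assumptions (taken from the figures quoted in this article, not from
# official Huawei datasheets):
#   - 950 baseline: 1 PFLOPS at FP8
#   - each new generation doubles throughput at the same precision
#   - halving precision (FP8 -> FP4) doubles throughput again

BASE_FP8_PFLOPS = 1.0  # Ascend 950 at FP8, per the article

def peak_pflops(generations_after_950: int, fp4: bool = False) -> float:
    """Peak compute after N generational doublings, optionally at FP4."""
    pflops = BASE_FP8_PFLOPS * (2 ** generations_after_950)
    return pflops * 2 if fp4 else pflops

# The 960 is one generation after the 950, the 970 is two.
print(peak_pflops(1))            # 960 at FP8 -> 2.0 PFLOPS
print(peak_pflops(2))            # 970 at FP8 -> 4.0 PFLOPS
print(peak_pflops(2, fp4=True))  # 970 at FP4 -> 8.0 PFLOPS, matching the article
```

Under these assumptions the 970's quoted 8 PFLOPS (FP4) is exactly two doublings plus one precision halving away from the 950's 1 PFLOPS (FP8).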
2. Architectural Innovations
Both the 910C and later models retain SIMD vector cores for high‑efficiency vector processing. Starting with the 950 series, SIMT support is added, enabling more flexible programming models for diverse AI scenarios. Memory‑access granularity shrinks from 512 bytes to 128 bytes, improving the efficiency of scattered (non‑contiguous) memory access by 4×.
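The 4× figure follows directly from the granularity ratio. The sketch below uses a hypothetical 64‑byte element size (our assumption, chosen for illustration) to show why smaller access granularity helps gather/scatter‑style workloads:

```python
# Illustrative only: why shrinking access granularity from 512 B to 128 B
# helps scattered (gather/scatter-style) memory access. If each access
# needs one small element, efficiency = useful bytes / fetched bytes.

USEFUL_BYTES = 64  # hypothetical element size (assumption, not a chip spec)

def access_efficiency(granularity_bytes: int) -> float:
    """Fraction of each fetched line that carries useful data."""
    return USEFUL_BYTES / granularity_bytes

old = access_efficiency(512)  # 0.125 -> 87.5% of each fetch is wasted
new = access_efficiency(128)  # 0.5   -> far better line utilization
print(new / old)              # 4.0, the improvement cited above
```

The improvement ratio (512 / 128 = 4) is independent of the element size, as long as the element fits within the smaller granularity.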
3. Memory & Interconnect
The 950 series adopts Huawei‑designed HBM solutions (HiBL 1.0 and HiZQ 2.0), providing up to 4 TB/s of bandwidth and 144 GB of capacity, eliminating reliance on external memory vendors. The 960/970 chips double memory capacity to 288 GB and raise bandwidth to 14.4 TB/s, fully supporting trillion‑parameter models and MoE architectures.
Interconnect bandwidth scales from 784 GB/s (910C) to 4 TB/s (970), a roughly 5× increase that enables near‑linear scaling of multi‑chip clusters and exceeds the interconnect performance NVIDIA projects for NVL576 in 2027.
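These bandwidth figures can be put in perspective with a rough back‑of‑envelope calculation: for a memory‑bound decoder, the time to stream a model's weights once from HBM is a lower bound on per‑token latency. The figures below come from this article; the 1‑byte‑per‑parameter (FP8) assumption is ours, and for MoE models only the active experts would actually be streamed:

```python
# Back-of-envelope: time to stream the weights of a 1-trillion-parameter
# model once from HBM -- a rough lower bound on per-token decode latency
# for a memory-bandwidth-bound LLM. Bandwidth figures are from this
# article; the FP8 1-byte-per-parameter assumption is ours.

PARAMS = 1e12               # 1T parameters (dense; MoE streams only active experts)
BYTES_PER_PARAM = 1         # FP8 weights (assumption)
BANDWIDTH_950 = 4e12        # 4 TB/s   (950 series, per the article)
BANDWIDTH_970 = 14.4e12     # 14.4 TB/s (960/970, per the article)

def stream_time_ms(bandwidth_bytes_per_s: float) -> float:
    """Milliseconds to read the full weight set once at the given bandwidth."""
    return PARAMS * BYTES_PER_PARAM / bandwidth_bytes_per_s * 1e3

print(stream_time_ms(BANDWIDTH_950))  # 250.0 ms per full weight pass
print(stream_time_ms(BANDWIDTH_970))  # ~69 ms
```

Under these assumptions, the jump from 4 TB/s to 14.4 TB/s cuts the per‑pass floor from 250 ms to roughly 69 ms, which is why the bandwidth increase matters as much as peak compute for trillion‑parameter inference.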
4. Technical Highlights & Industry Significance
End‑to‑end optimization across chip, MindSpore framework, CANN operator library, and ModelArts platform raises vector‑compute share by 30% and reduces task‑scheduling latency by 50%.
Domestic breakthroughs: N+2/N+3 process autonomy, self‑developed HBM, HiBL/HiZQ storage technologies fill Chinese market gaps, and the custom interconnect architecture challenges NVIDIA’s NVLink monopoly.
Cost advantage: 950PR priced around ¥10 k per card (≈ ¥8 k for key customers), roughly 30% cheaper than comparable competitors, with packaging improvements further lowering large‑scale deployment costs.
These advancements position Huawei’s Ascend chips as a competitive, cost‑effective alternative in the AI hardware market, supporting both training and inference workloads across diverse industry scenarios.