How Huawei’s Ascend AI Chips Evolve: From 910C to 970 – Architecture, Performance, and Market Impact

This article analyzes Huawei's Ascend AI chip roadmap: the progression from the 910C baseline through the 950 series to the 960/970 generations, covering compute scaling, low-precision formats, memory and interconnect upgrades, cost advantages, and the implications for large-model AI workloads.


Core Parameter Overview

Huawei’s Ascend series follows a "one‑year‑one‑generation, compute‑doubling" logic, moving from the general‑purpose 910C baseline to specialized 950PR/DT models and finally to the ultra‑scale 960/970 chips, covering the full spectrum of training and inference needs.

1. Compute Evolution

910C: 800 TFLOPS of FP16 compute in a dual-die package (two 910B dies); supports CloudMatrix-384 super-node clusters for trillion-parameter training.

950 Series: Introduces FP8/FP4 low-precision formats, boosting peak compute to 1 PFLOPS (FP8) while maintaining near-FP16 accuracy, easing the precision-versus-throughput trade-off in training.

960/970: Adheres to the "compute doubles each generation" principle, reaching 8 PFLOPS (FP4) on the 970, with a 30% higher vector-compute share and a 30% energy-efficiency gain over the 910C.
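The doubling cadence above reduces to simple arithmetic. A minimal sketch (the 800 TFLOPS FP16 baseline is taken from the figures above; a clean 2× per generation is the article's stated principle, not a measured result):

```python
# Illustrative only: peak compute under a strict "doubles each
# generation" assumption, starting from the 910C FP16 baseline.
BASE_TFLOPS_FP16 = 800  # Ascend 910C peak FP16, per the article

def projected_peak(generations: int, base: float = BASE_TFLOPS_FP16) -> float:
    """Peak compute (TFLOPS) after `generations` doublings."""
    return base * (2 ** generations)

for gen, name in enumerate(["910C", "950", "960", "970"]):
    print(f"{name}: ~{projected_peak(gen):,.0f} TFLOPS (FP16-equivalent)")
```

Actual generation-to-generation gains also come from precision reduction (FP16 → FP8 → FP4), so the headline PFLOPS numbers above mix doubling with format changes rather than reflecting a single fixed-precision curve.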

2. Architectural Innovations

Both the 910C and later models retain SIMD vector cores for high-efficiency vector processing. Starting with the 950 series, SIMT support is added, enabling more flexible programming models for diverse AI scenarios. Memory-access granularity shrinks from 512 bytes to 128 bytes, improving efficiency on scattered (non-contiguous) memory accesses by 4×.
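The 4× figure for scattered accesses falls out of a toy waste model (the 128-byte useful-element size here is a hypothetical choice for illustration, not a Huawei specification):

```python
import math

# Toy model: the memory system always moves `granule` bytes per access,
# even when only `useful` bytes are actually needed (a scattered gather).
def effective_efficiency(useful: int, granule: int) -> float:
    """Fraction of transferred bytes that are actually useful."""
    moved = math.ceil(useful / granule) * granule
    return useful / moved

old = effective_efficiency(128, 512)  # 512 B granularity: 0.25 (75% wasted)
new = effective_efficiency(128, 128)  # 128 B granularity: 1.0 (no waste)
print(f"efficiency gain: {new / old:.0f}x")  # 4x, matching the figure above
```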

3. Memory & Interconnect

The 950 series adopts Huawei-designed HBM solutions (HiBL 1.0 and HiZQ 2.0), providing up to 4 TB/s of bandwidth and 144 GB of capacity, eliminating reliance on external memory vendors. The 960/970 chips double memory capacity to 288 GB and raise bandwidth to 14.4 TB/s, fully supporting trillion-parameter models and MoE architectures.

Interconnect bandwidth scales from 784 GB/s (910C) to 4 TB/s (970), a roughly 5× increase that enables near-linear scaling of multi-chip clusters, surpassing the performance NVIDIA projects for NVL576 in 2027.
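To see why per-card capacity and low-precision formats interact, a back-of-envelope sketch of weight memory for a trillion-parameter model (weights only; optimizer state, activations, and KV cache are deliberately ignored, and the 288 GB per card comes from the figures above):

```python
import math

PARAMS = 1_000_000_000_000   # 1 trillion parameters
CARD_GB = 288                # per-card HBM capacity quoted above
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}

for fmt, nbytes in BYTES_PER_PARAM.items():
    weight_gb = PARAMS * nbytes / 1e9          # decimal GB, weights only
    cards = math.ceil(weight_gb / CARD_GB)     # minimum cards to hold weights
    print(f"{fmt}: {weight_gb:,.0f} GB of weights -> at least {cards} cards")
```

Halving the format from FP16 to FP8 to FP4 roughly halves the card count each time, which is why the low-precision formats and the capacity bump work together for trillion-parameter and MoE deployments.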

4. Technical Highlights & Industry Significance

End‑to‑end optimization across chip, MindSpore framework, CANN operator library, and ModelArts platform raises vector‑compute share by 30% and reduces task‑scheduling latency by 50%.

Domestic breakthroughs: N+2/N+3 process autonomy, self‑developed HBM, HiBL/HiZQ storage technologies fill Chinese market gaps, and the custom interconnect architecture challenges NVIDIA’s NVLink monopoly.

Cost advantage: 950PR priced around ¥10 k per card (≈ ¥8 k for key customers), roughly 30% cheaper than comparable competitors, with packaging improvements further lowering large‑scale deployment costs.

These advancements position Huawei’s Ascend chips as a competitive, cost‑effective alternative in the AI hardware market, supporting both training and inference workloads across diverse industry scenarios.

Tags: Performance · Huawei · AI chips · Ascend · industry insight · hardware analysis
Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
