Huawei Ascend AI Chip Detailed Specs Comparison (2025‑2028 Roadmap)

This article traces the evolution of Huawei's Ascend AI chips from the 910C baseline, through the 950 series' low‑precision FP8/FP4 breakthrough, to the 960/970 generation's 8 PFLOPS performance, highlighting architectural innovations, memory and interconnect upgrades, scenario‑specific models, and a cost advantage over competing solutions.

Architects' Tech Alliance
Architects' Tech Alliance
Architects' Tech Alliance
Huawei Ascend AI Chip Detailed Specs Comparison (2025‑2028 Roadmap)

Ascend series roadmap

Huawei’s Ascend AI chips follow a “one‑year‑one‑generation, compute‑doubling” strategy, moving from the general‑purpose 910C (800 TFLOPS FP16) to scenario‑specific 950 PR/DT models (FP8/FP4, 1 PFLOPS) and finally to the ultra‑large‑scale 960/970 chips (up to 8 PFLOPS FP4, 4 TB/s interconnect).

910C baseline

800 TFLOPS FP16 compute.

Dual‑die package (two 910B compute dies) for CloudMatrix‑384 clusters, supporting trillion‑parameter model training.

SIMD vector pipelines; memory‑access granularity reduced to 128 bytes (from 512 bytes), improving discrete memory access efficiency by 4×.
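The 4× figure follows directly from the granularity change: a scattered access pulls in one whole memory block regardless of how few bytes it needs, so shrinking the block from 512 to 128 bytes cuts wasted traffic fourfold. A minimal sketch (the 32‑byte access size is an illustrative assumption, not from the article):

```python
def bytes_fetched(num_accesses: int, granularity: int) -> int:
    """Each scattered access pulls one whole granularity-sized block
    (assumes each access fits within a single block)."""
    return num_accesses * granularity

accesses = 1_000          # scattered 32-byte reads, e.g. sparse lookups
useful = accesses * 32    # bytes the kernel actually needs

coarse = bytes_fetched(accesses, 512)   # old 512-byte granularity
fine = bytes_fetched(accesses, 128)     # new 128-byte granularity

print(coarse // fine)  # 4 -> the 4x discrete-access efficiency gain
print(useful / fine)   # 0.25 -> bus utilization at 128 B granularity
```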

Interconnect bandwidth 784 GB/s.

950 series – low‑precision acceleration

Native FP8 and FP4 support; peak compute 1 PFLOPS (FP8) while maintaining FP16‑level accuracy.
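The usual way low‑precision formats retain near‑FP16 accuracy is scaled quantization: values are rescaled into the narrow format's range before rounding. The sketch below is a generic illustration of that trick using an integer grid as a stand‑in for FP8/FP4 (it is not Huawei's implementation, and the tensor sizes are made up):

```python
import numpy as np

def quantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric fake-quantization to a signed grid of `bits` bits,
    with a per-tensor scale chosen from the largest magnitude."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 7 positive levels at 4 bits
    scale = np.abs(x).max() / qmax       # per-tensor scale factor
    return np.round(x / scale) * scale   # quantize, then dequantize

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)

for bits in (8, 4):
    err = np.abs(w - quantize(w, bits)).max()
    print(f"{bits}-bit max abs error: {err:.4f}")
```

The error at 8 bits stays small relative to the weights, which is why FP8 inference can track FP16 accuracy; 4‑bit quantization is coarser and in practice needs finer‑grained (per‑channel or per‑block) scales.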

Introduces proprietary HiBL 1.0 (4 TB/s bandwidth, 288 GB capacity) and HiZQ 2.0 memory solutions.

Two variants: 950 PR (prefill‑oriented, 128 GB memory) and 950 DT (decode‑oriented, 144 GB memory).

SIMT programming model; vector compute share increased by 30 % and task‑scheduling latency reduced by 50 % through full‑stack optimization (ASIC → CANN → MindSpore → ModelArts).

Priced at roughly ¥10,000 per card, about 30 % lower than comparable foreign products.

960/970 – massive scaling

Follows the “compute doubles each generation” principle.

960 delivers 8 PFLOPS (FP4) and 4 PFLOPS (FP8); 970 reaches 8 PFLOPS (FP4) with 4 TB/s interconnect.

Memory capacity doubled to 288 GB and bandwidth up to 14.4 TB/s, eliminating memory bottlenecks for trillion‑parameter models.
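A back‑of‑envelope check makes the capacity claim concrete: weight storage alone for a trillion‑parameter model shrinks with precision, which is why the 288 GB capacity plus FP8/FP4 matters together. A sketch (weights only; activations, optimizer state, and KV caches would add more):

```python
import math

def min_chips(params: int, bytes_per_param: float,
              capacity_gb: int = 288) -> int:
    """Minimum chips needed just to hold the model weights,
    given the 288 GB per-chip capacity from the 960/970 spec."""
    weights_gb = params * bytes_per_param / 1e9
    return math.ceil(weights_gb / capacity_gb)

params = 1_000_000_000_000  # one trillion parameters

for name, bpp in (("FP16", 2), ("FP8", 1), ("FP4", 0.5)):
    gb = params * bpp / 1e9
    print(f"{name}: {gb:.0f} GB of weights -> >= {min_chips(params, bpp)} chips")
```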

Energy‑efficiency improvement ≈ 30 % over 910C (N+3 process).

Supports dynamic sparse computation and Mixture‑of‑Experts (MoE) architectures.
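The appeal of MoE for this hardware class is that each token activates only k of E experts, so compute scales with k while capacity scales with E. A minimal top‑k routing sketch (generic MoE, not Ascend‑specific; all shapes and the softmax‑over‑selected‑experts gating are illustrative choices):

```python
import numpy as np

def moe_layer(x, gate_w, expert_ws, k=2):
    """x: (tokens, d); gate_w: (d, E); expert_ws: (E, d, d).
    Each token is routed to its k highest-scoring experts."""
    logits = x @ gate_w                          # router score per expert
    topk = np.argsort(logits, axis=1)[:, -k:]    # k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        s = logits[t, topk[t]]
        w = np.exp(s - s.max())                  # softmax over selected experts
        w /= w.sum()
        for weight, e in zip(w, topk[t]):
            out[t] += weight * (x[t] @ expert_ws[e])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
y = moe_layer(x, rng.standard_normal((8, 16)), rng.standard_normal((16, 8, 8)))
print(y.shape)  # (4, 8)
```

With k = 2 of 16 experts, each token touches only an eighth of the expert weights, which is the dynamic sparsity the hardware must schedule efficiently.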

Interconnect upgrades

Bandwidth increased from 784 GB/s (910C) to 4 TB/s (970), a 5× rise that enables linear scaling of multi‑chip clusters and surpasses NVIDIA NVLink performance projected for 2027.
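The cited multiple is a round number; the exact ratio is easy to check:

```python
# Interconnect jump cited above (910C -> 970), both in GB/s.
old_gbps = 784      # 910C
new_gbps = 4000     # 970 (4 TB/s)
print(f"{new_gbps / old_gbps:.1f}x")  # prints 5.1x
```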

Full‑stack optimization

From the ASIC (Ascend Core), through the CANN operator library and the MindSpore framework, to the ModelArts application platform, vector compute share is boosted by 30 % and task‑scheduling latency cut by 50 %.

Domestic memory and ecosystem impact

Self‑developed HBM, HiBL and HiZQ memory replace foreign solutions, reducing reliance on external suppliers. Compatibility with both CANN and CUDA ecosystems lowers migration cost.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Performance · Architecture · Benchmark · FP8 · AI chip · Huawei · HBM · Ascend
Architects' Tech Alliance
Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
