Huawei Ascend AI Chip Detailed Specs Comparison (2025‑2028 Roadmap)
This article analyzes Huawei's Ascend AI chip evolution from the 910C baseline, through the 950 series' low‑precision FP8/FP4 breakthrough, to the 960/970 generation's 8 PFLOPS performance, highlighting architectural innovations, memory and interconnect upgrades, scenario‑specific models, and a cost advantage over competing solutions.
Ascend series roadmap
Huawei’s Ascend AI chips follow a “one‑year‑one‑generation, compute‑doubling” strategy, moving from the general‑purpose 910C (800 TFLOPS FP16) to scenario‑specific 950 PR/DT models (FP8/FP4, 1 PFLOPS) and finally to the ultra‑large‑scale 960/970 chips (up to 8 PFLOPS FP4, 4 TB/s interconnect).
910C baseline
800 TFLOPS FP16 compute.
Dual‑die package (two 910B dies) powering CloudMatrix‑384 clusters that support trillion‑parameter model training.
SIMD vector pipelines; memory‑access granularity reduced to 128 bytes (from 512 bytes), improving discrete memory‑access efficiency by 4× (see the sketch after this list).
Interconnect bandwidth 784 GB/s.
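A quick back‑of‑envelope check of that 4× figure (a sketch under an assumed access pattern, not vendor data): when a scattered access needs only a small useful payload but the memory system always moves a full granule, wasted traffic scales with granule size, and 512 B / 128 B = 4.

```python
# Toy model of discrete (scattered) memory-access efficiency.
# Assumption (not from the source): each scattered access needs only a
# small useful payload, but the chip always fetches a full granule.

def traffic_per_useful_byte(granule_bytes: int, useful_bytes: int = 16) -> float:
    """Bytes moved across the memory bus per useful byte delivered."""
    return granule_bytes / useful_bytes

old = traffic_per_useful_byte(512)  # 512 B granules (before) -> 32.0
new = traffic_per_useful_byte(128)  # 128 B granules (910C)   -> 8.0
# The ratio is independent of the payload size, as long as it fits a granule.
print(f"efficiency gain: {old / new:.0f}x")  # 4x, matching the claim
```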
950 series – low‑precision acceleration
Native FP8 and FP4 support; peak compute 1 PFLOPS (FP8) while maintaining FP16‑level accuracy (a quantization sketch follows this list).
Introduces self‑developed HiBL 1.0 and HiZQ 2.0 memory, split across two scenario‑specific variants: the 950 PR (prefill‑oriented, 128 GB of HiBL 1.0) and the 950 DT (decode‑oriented, 144 GB of HiZQ 2.0 at 4 TB/s bandwidth); a memory‑sizing sketch follows this list.
SIMT programming model; vector compute share increased by 30 % and task‑scheduling latency reduced by 50 % through full‑stack optimization (ASIC → CANN → MindSpore → ModelArts).
Pricing ≈ ¥10 k per card, about 30 % lower than comparable foreign products.
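To illustrate how low precision can keep near‑FP16 accuracy, here is a minimal numpy sketch of per‑tensor‑scaled FP8 fake quantization. The E4M3 rounding model, the 448 max‑value constant, and the function name are generic FP8 conventions assumed for illustration, not Huawei's implementation.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest normal value in the common E4M3 format

def fake_quant_fp8(x: np.ndarray) -> np.ndarray:
    """Round-trip x through a simulated per-tensor-scaled FP8 (E4M3) grid."""
    scale = np.max(np.abs(x)) / FP8_E4M3_MAX
    y = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    # Keep a 4-bit significand (1 implicit + 3 stored mantissa bits):
    # snap each value to the nearest representable 1.mmm * 2^e point.
    m, e = np.frexp(y)                    # m in [0.5, 1)
    y = np.ldexp(np.round(m * 16) / 16, e)
    return y * scale

x = np.random.randn(4096).astype(np.float32)
err = np.linalg.norm(x - fake_quant_fp8(x)) / np.linalg.norm(x)
print(f"relative error after FP8 round-trip: {err:.2%}")  # a few percent
```

The per‑tensor scale is what preserves accuracy: it maps the tensor's dynamic range onto the narrow FP8 grid before rounding, so the 4‑bit significand only has to absorb relative, not absolute, error.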
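The PR/DT split tracks the two inference phases: prefill is compute‑bound, while decode is dominated by streaming weights and a growing KV cache, which is why the decode variant pairs more capacity with 4 TB/s of bandwidth. A hedged sizing sketch below shows the pressure; the 70B‑class model shape, grouped‑query‑attention heads, FP8 cache, and batch size are illustrative assumptions, not figures from the source.

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per=1):
    """KV cache size in GB: K and V tensors per layer, per token (FP8 = 1 B)."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per / 1e9

# Illustrative 70B-class model with grouped-query attention and an FP8 cache:
gb = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=32768, batch=24)
print(f"KV cache: {gb:.1f} GB")  # ~128.8 GB of a 144 GB budget, before weights
```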
960/970 – massive scaling
Both follow the “compute doubles each generation” principle.
960 delivers 4 PFLOPS (FP4) and 2 PFLOPS (FP8); 970 doubles this to 8 PFLOPS (FP4) and 4 PFLOPS (FP8) with 4 TB/s interconnect.
Memory capacity doubled to 288 GB and bandwidth up to 14.4 TB/s, eliminating memory bottlenecks for trillion‑parameter models (see the roofline check after this list).
Energy‑efficiency improvement ≈ 30 % over 910C (N+3 process).
Supports dynamic sparse computation and Mixture‑of‑Experts (MoE) architectures (a minimal routing sketch follows).
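A hedged roofline check on the memory‑bottleneck claim, using only the headline numbers above; the GEMM traffic model is a standard textbook simplification, not vendor data.

```python
peak_flops = 8e15     # 970: 8 PFLOPS at FP4
mem_bw     = 14.4e12  # 14.4 TB/s memory bandwidth

ridge = peak_flops / mem_bw
print(f"compute-bound above {ridge:.0f} FLOPs per byte")  # ~556

# A square N x N GEMM does ~2*N^3 FLOPs while moving ~3*N^2 values;
# at FP4 (0.5 bytes/value) that is ~(2*N^3) / (1.5*N^2) = 4*N/3 FLOPs/byte.
n_min = 3 * ridge / 4
print(f"matmuls with N >= ~{n_min:.0f} can saturate the compute units")  # ~417
```

In other words, at 14.4 TB/s even moderately sized FP4 matrix multiplies cross the ridge point, which is what "eliminating memory bottlenecks" amounts to in roofline terms.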
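To make "dynamic sparse computation" concrete, here is a minimal top‑k MoE router in numpy: only k of E expert networks run per token, so compute scales with k rather than E. This is a generic sketch of the pattern, not Huawei's implementation; all shapes and names are illustrative.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """x: (tokens, d); gate_w: (d, E); experts: list of per-expert callables."""
    logits = x @ gate_w                           # (tokens, E) routing scores
    topk = np.argsort(logits, axis=-1)[:, -k:]    # k experts chosen per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        g = np.exp(logits[t, topk[t]])
        g /= g.sum()                              # softmax over chosen experts
        for w, e in zip(g, topk[t]):
            out[t] += w * experts[e](x[t])        # only k of E experts run
    return out

d, E = 64, 8
experts = [lambda v, W=np.random.randn(d, d) / d ** 0.5: np.tanh(v @ W)
           for _ in range(E)]
y = moe_forward(np.random.randn(16, d), np.random.randn(d, E), experts)
print(y.shape)  # (16, 64): dense output from sparse (2-of-8) expert compute
```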
Interconnect upgrades
Bandwidth increases from 784 GB/s (910C) to 4 TB/s (970), roughly a 5× rise that enables near‑linear scaling of multi‑chip clusters and surpasses the NVLink performance NVIDIA has projected for 2027 (see the estimate below).
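To see what the jump means in practice, here is a rough ring all‑reduce estimate for synchronizing a trillion‑parameter model's gradients. The payload size, cluster size, and overlap‑free timing are assumptions for illustration; real training pipelines bucket and overlap this traffic with compute.

```python
def ring_allreduce_seconds(payload_bytes, n_chips, link_bw):
    """Classic ring all-reduce moves ~2*(n-1)/n of the payload over each link."""
    return 2 * (n_chips - 1) / n_chips * payload_bytes / link_bw

grads = 1e12 * 2             # 1T parameters in FP16 ~ 2 TB of gradients
for bw in (784e9, 4e12):     # 910C vs. 970 interconnect bandwidth
    t = ring_allreduce_seconds(grads, n_chips=384, link_bw=bw)
    print(f"{bw / 1e12:5.3f} TB/s -> {t:.1f} s per full synchronization")
# -> ~5.1 s at 784 GB/s versus ~1.0 s at 4 TB/s
```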
Full‑stack optimization
From the ASIC (Ascend Core) through the CANN operator library and the MindSpore framework up to the ModelArts application platform, vector compute proportion is boosted by 30 % and task‑scheduling latency is cut by 50 % (a minimal MindSpore sketch follows).
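To make the stack concrete, a minimal MindSpore sketch targeting the Ascend backend: MindSpore compiles the graph and dispatches it through CANN's operators onto the ASIC, so user code stays at the top layer. This assumes a standard MindSpore install on Ascend hardware; the tiny network itself is illustrative.

```python
import numpy as np
import mindspore as ms
from mindspore import nn, ops, Tensor

# Select the Ascend backend: MindSpore lowers the compiled graph
# through the CANN operator library down to the Ascend ASIC.
ms.set_context(mode=ms.GRAPH_MODE, device_target="Ascend")

class TinyNet(nn.Cell):
    def __init__(self):
        super().__init__()
        self.dense = nn.Dense(128, 64)

    def construct(self, x):
        return ops.relu(self.dense(x))

net = TinyNet()
out = net(Tensor(np.random.randn(8, 128), ms.float16))  # FP16 runs on the NPU
print(out.shape)  # (8, 64)
```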
Domestic memory and ecosystem impact
Self‑developed HiBL and HiZQ HBM replaces foreign memory solutions, reducing reliance on external suppliers, while compatibility with both the CANN and CUDA ecosystems lowers migration costs.