Inside Huawei Ascend 910: Architecture, Performance, and Future Roadmap

This article provides a detailed technical analysis of Huawei's Ascend 910 AI processor, covering its Da Vinci architecture, hardware specifications, benchmark results, software ecosystem, application scenarios, and product roadmap, and closes with a short glossary of key terms.

Product Overview

Huawei Ascend 910 is a high‑performance AI processor built on Huawei's self‑designed Da Vinci architecture and fabricated on a 7 nm+ EUV process. It targets data‑center AI training, large‑scale distributed training, HPC‑AI convergence, and cloud AI acceleration.

Key Features

32 Da Vinci cores delivering 256 TFLOPS FP16 (512 TOPS INT8)

Actual power consumption 310 W (design 350 W)

7 nm+ EUV process for high transistor density

Built‑in model protection and privacy‑preserving computation

Deep integration with MindSpore for end‑edge‑cloud unified stack

Technical Specifications

Architecture: Da Vinci (3D‑Cube)

Process: 7 nm+ EUV

Compute precision: FP16 256 TFLOPS / INT8 512 TOPS

Cores: 32 Da Vinci cores

Power: Design 350 W, measured 310 W

Video decode: 128‑channel full‑HD (H.264/H.265)

Interconnect: HCCS 240 Gbps, PCIe, RoCE
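To put the interconnect figure in context, here is a back‑of‑envelope estimate of per‑step gradient synchronization time over a 240 Gbps HCCS link. The model size (ResNet‑50, ~25.6 M FP16 parameters), the 8‑device count, the idealized ring all‑reduce, and the zero‑latency assumption are all mine, not from the article:

```python
# Back-of-envelope gradient all-reduce time over a 240 Gbps link.
# Assumptions (not from the article): ResNet-50 with ~25.6M parameters,
# FP16 gradients (2 bytes each), 8 devices, and an idealized ring
# all-reduce that moves 2*(N-1)/N of the payload per device with no
# per-hop latency.

def allreduce_seconds(num_params: int, bytes_per_param: int,
                      devices: int, link_gbps: float) -> float:
    payload = num_params * bytes_per_param           # bytes per device
    traffic = payload * 2 * (devices - 1) / devices  # ring all-reduce traffic
    link_bytes_per_s = link_gbps * 1e9 / 8           # Gbps -> bytes/s
    return traffic / link_bytes_per_s

t = allreduce_seconds(25_600_000, 2, 8, 240.0)
print(f"~{t * 1000:.1f} ms per synchronization step")
```

Under these assumptions the synchronization cost lands around 3 ms per step, which illustrates why a high‑bandwidth chip‑to‑chip interconnect matters for scaling data‑parallel training.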

Compute Architecture

3D‑Cube Matrix‑Multiply Unit

Per‑cycle 4096 multiply‑add operations

32 Cube engines work in parallel, delivering 256 TFLOPS

Huawei claims up to two orders of magnitude higher matrix‑multiply throughput than conventional CPU/GPU execution units
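The "3D cube" idea is, at heart, blocked matrix multiplication: the engine consumes one 16×16 tile of A and one of B per cycle and accumulates a 16×16 tile of C. The pure‑Python sketch below illustrates that tiling pattern only; the 16×16×16 tile shape is the commonly cited Da Vinci geometry, and this is not the hardware algorithm:

```python
# Blocked (tiled) matrix multiply: each innermost (i0, j0, k0) step
# multiplies a T x T tile of A with a T x T tile of B and accumulates
# into a T x T tile of C, mirroring how a 16x16x16 cube engine
# consumes one tile per cycle.
T = 16

def tiled_matmul(A, B):
    n = len(A)  # assumes square matrices with n divisible by T
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, T):
        for j0 in range(0, n, T):
            for k0 in range(0, n, T):          # one "cube" tile per step
                for i in range(i0, i0 + T):
                    for j in range(j0, j0 + T):
                        acc = 0.0
                        for k in range(k0, k0 + T):
                            acc += A[i][k] * B[k][j]
                        C[i][j] += acc
    return C
```

Each (i0, j0, k0) step performs T³ = 4096 multiply‑adds, which is exactly the per‑cycle figure quoted above.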

Vector Unit

Custom compute instructions for element‑wise and non‑matrix workloads

Scalar Unit

Lightweight CPU‑like core for control flow and basic arithmetic
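The headline throughput is consistent with the cube geometry. Assuming 4096 multiply‑adds per core per cycle (each counted as two FLOPs) and a clock near 1 GHz (the clock rate is my assumption; the article does not state it):

```python
# Peak FP16 throughput implied by the cube geometry.
macs_per_core_cycle = 16 * 16 * 16   # 4096 multiply-adds per cycle
flops_per_mac = 2                    # one multiply plus one add
cores = 32
clock_hz = 1.0e9                     # assumed ~1 GHz clock

peak_flops = macs_per_core_cycle * flops_per_mac * cores * clock_hz
# 262,144 GFLOPS = 256 x 1024 GFLOPS, i.e. the quoted "256 TFLOPS"
# if a tera is counted in powers of 1024.
print(f"{peak_flops / 1e12:.0f} TFLOPS FP16")
```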

Performance

Benchmark Results

ResNet‑50 training: ~1802 images/s, about 1.9× the 965 images/s of a mainstream GPU + TensorFlow setup

Delivers its advertised FP16 performance in practice while staying within the 350 W power budget (310 W measured)

Compute density exceeds NVIDIA Tesla V100 and Google TPU v3

Cluster Performance

One Ascend cluster contains 1024 Ascend 910 chips

Total compute reaches 256 PFLOPS (peta‑FLOPS)

Outperforms NVIDIA DGX‑2 and Google TPU clusters in throughput and energy efficiency
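The cluster total and the ResNet‑50 comparison are both simple arithmetic on the per‑chip figures quoted above:

```python
# Aggregate cluster compute and the ResNet-50 throughput ratio,
# derived directly from the per-chip figures in the article.
chips = 1024
tflops_per_chip = 256
cluster_pflops = chips * tflops_per_chip / 1000
# 262 PFLOPS in decimal units; exactly 256 if a peta is 1024 teras.
print(f"cluster peak: {cluster_pflops:.0f} PFLOPS FP16")

ascend_imgs, baseline_imgs = 1802, 965
print(f"ResNet-50 speedup: {ascend_imgs / baseline_imgs:.2f}x")
```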

Software Ecosystem

Full‑Stack AI Framework

Deep integration with MindSpore; developer code reduction ~20 % and overall efficiency gain ~50 %

Automatic source‑to‑source differentiation and distributed training support
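MindSpore's differentiation works source‑to‑source on computation graphs. As a toy illustration of the underlying idea only (this is not MindSpore code and does not attempt graph transformation), forward‑mode automatic differentiation can be sketched with dual numbers:

```python
# Toy forward-mode automatic differentiation with dual numbers.
# Each value carries its derivative alongside it; arithmetic on the
# pair applies the product and sum rules automatically.

class Dual:
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def grad(f):
    """Return df/dx, evaluated by seeding the dual part with 1."""
    return lambda x: f(Dual(x, 1.0)).deriv

f = lambda x: x * x * x + x * 2   # f(x) = x^3 + 2x, so f'(x) = 3x^2 + 2
print(grad(f)(3.0))               # 29.0
```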

Operator Library & Toolchain

CANN operator library provides high‑performance AI operators (productivity boost ~3×)

TensorEngine offers a unified DSL for automatic operator optimization and generation

ModelArts PaaS platform handles >4,000 daily training jobs

Application Scenarios

Large‑scale model training (trillion‑parameter models, NLP, CV)

Cloud AI services (Huawei Cloud EI platform, 59 AI services, 159 functions)

Industry AI such as medical imaging analysis, financial risk modeling, industrial quality inspection

Scientific computing (molecular dynamics, climate prediction, other HPC workloads)

Product Roadmap

First generation (2018‑2020): Ascend 310 (edge inference, 12 nm, 16 TOPS INT8, 8 W) and Ascend 910 (data‑center training, 7 nm+ EUV, 256 TFLOPS FP16, 310 W)

Second generation (2021‑2023): Ascend 910B (7 nm+ EUV, 376 TFLOPS FP16) and Ascend 310B (multimodal edge inference, MindSpore Lite)

Third generation (2024‑2025): Ascend 910C (384‑chip node, >3 TB/s memory bandwidth, supports trillion‑parameter models) and Ascend 320 (next‑gen edge chip, 5 nm, 50 % better energy efficiency)

Future (2026+): Ascend 920 (3 nm, target >1 PFLOPS FP16, FP8 support, dynamic sparsity, MoE‑friendly)

Technical Advantages Summary

Leading compute density: 256 TFLOPS FP16

Best‑in‑class energy efficiency: 310 W for full performance

Innovative 3D‑Cube architecture delivering ultra‑high matrix‑multiply throughput

Full‑stack software co‑optimization with MindSpore

Comprehensive scenario coverage from cloud to edge

Terminology

Da Vinci architecture: Huawei’s heterogeneous AI compute architecture

3D Cube: Dedicated 3‑dimensional matrix‑multiply engine

MindSpore: Huawei’s full‑stack AI framework

CANN: Huawei AI operator library

Tags: Performance, Hardware, AI Accelerator, Huawei, MindSpore, Da Vinci Architecture, Ascend 910
Written by

Architects' Tech Alliance

Sharing project experience and insights into cutting‑edge architectures, with a focus on cloud computing, microservices, big data, hyper‑convergence, storage, data protection, artificial intelligence, and industry practices and solutions.
