How Huawei Ascend 910 Redefines AI Training Performance

The Huawei Ascend 910 AI processor, built on the Da Vinci architecture with 7nm+ EUV technology, delivers 256 TFLOPS FP16 and 512 TOPS INT8 performance, superior energy efficiency, and a full-stack software ecosystem, making it ideal for large‑scale AI training, HPC, and cloud AI services.


1. Product Overview

1.1 Positioning

Ascend 910 is the core compute engine of Huawei's full‑stack AI solution, targeting data‑center AI training, large‑scale distributed training systems, HPC‑deep learning convergence, and cloud AI acceleration platforms.

1.2 Key Features

Ultra‑high compute density: 32 Da Vinci cores delivering 256 TFLOPS FP16.

Excellent energy efficiency: Measured power 310W (design 350W).

Full‑scenario support: Deep integration with MindSpore, supporting end‑edge‑cloud unified architecture.

Advanced process: 7nm+ EUV manufacturing.

Security and trust: Built‑in model protection and privacy computing.

2. Technical Specifications

2.1 Core Specifications

Architecture: Da Vinci architecture.

Process technology: 7nm+ EUV.

Compute precision: FP16 256 TFLOPS / INT8 512 TOPS.

Core count: 32 Da Vinci cores.

Power consumption: Design 350W, measured 310W.

Video decoding: 128‑channel full‑HD (H.264/H.265) decoder.

Interconnects: HCCS (240 Gbps), PCIe, RoCE.

Package size: To be announced.

2.2 Compute Architecture

Ascend 910 adopts the innovative Da Vinci 3D Cube architecture, comprising:

3D Cube matrix multiplication unit: Performs 4096 multiply‑add operations per cycle; 32 Cube engines work in parallel, providing 256 TFLOPS.

Vector compute unit: Supports rich custom instructions for non‑matrix workloads.

Scalar compute unit: Functions like a lightweight CPU core for control flow and basic arithmetic.

This heterogeneous architecture enables efficient task partitioning, allowing Ascend 910 to autonomously handle the entire AI training workflow with minimal host interaction.
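The Cube-engine figures above can be sanity-checked with simple arithmetic. The sketch below derives the clock frequency implied by the stated 4,096 multiply-adds per cycle, 32 cores, and 256 TFLOPS peak; the resulting ~1 GHz clock is an inference from these numbers, not an official specification.

```python
# Back-of-envelope check of the Cube-engine arithmetic. The implied clock
# derived here is an inference from the quoted specs, not an official figure.
MACS_PER_CUBE_PER_CYCLE = 4096   # multiply-add operations per cycle (from spec)
OPS_PER_MAC = 2                  # one multiply + one add = 2 FLOPs
NUM_CUBE_ENGINES = 32            # Da Vinci cores on Ascend 910 (from spec)
PEAK_FP16_FLOPS = 256e12         # 256 TFLOPS FP16 (from spec)

flops_per_cycle = MACS_PER_CUBE_PER_CYCLE * OPS_PER_MAC * NUM_CUBE_ENGINES
implied_clock_hz = PEAK_FP16_FLOPS / flops_per_cycle

print(f"FLOPs per cycle (all cores): {flops_per_cycle:,}")   # 262,144
print(f"Implied clock: {implied_clock_hz / 1e9:.2f} GHz")    # ~0.98 GHz
```

The numbers are internally consistent: 262,144 FLOPs per cycle at roughly 1 GHz yields the advertised 256 TFLOPS.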

3. Performance

3.1 Benchmarks

ResNet‑50 training: Nearly 2× the throughput of a mainstream single GPU running TensorFlow (965 → 1,802 images/s).

Compute efficiency: Achieves design‑specified compute while consuming less power.

Compute density: Significantly surpasses NVIDIA Tesla V100 and Google TPU v3.
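The "nearly 2×" claim follows directly from the quoted throughput numbers:

```python
# Verifying the ResNet-50 speedup from the throughput figures in the article.
baseline_ips = 965    # images/s, mainstream single GPU + TensorFlow
ascend_ips = 1802     # images/s, Ascend 910

speedup = ascend_ips / baseline_ips
print(f"Speedup: {speedup:.2f}x")   # 1.87x, i.e. "nearly 2x"
```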

3.2 Cluster Performance

Huawei’s Ascend training cluster, built from Ascend 910 chips, includes:

1024 Ascend 910 chips per cluster.

Total compute reaches 256 Peta‑FLOPS.

Outperforms NVIDIA DGX‑2 and Google TPU clusters.
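The 256 Peta-FLOPS figure follows from simple multiplication, assuming the article uses a 1024× step between tera and peta; with decimal SI prefixes the same cluster would be quoted as roughly 262 PFLOPS.

```python
# Aggregate cluster compute from the per-chip peak quoted in the article.
chips = 1024
tflops_per_chip = 256                    # FP16 peak per Ascend 910

total_tflops = chips * tflops_per_chip   # 262,144 TFLOPS
pflops_binary = total_tflops / 1024      # 256 PFLOPS (1024x prefix, as quoted)
pflops_decimal = total_tflops / 1000     # ~262 PFLOPS (decimal SI prefix)

print(f"Total: {pflops_binary:.0f} PFLOPS (binary) / {pflops_decimal:.0f} PFLOPS (SI)")
```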

4. Software Ecosystem

4.1 Full‑scenario AI Framework

Ascend 910 tightly integrates with Huawei’s self‑developed MindSpore framework, offering:

Improved development efficiency: 20% reduction in core code, 50% overall efficiency gain.

Automatic differentiation: Source‑to‑source (code transformation) approach, claimed to outperform traditional graph‑based and operator‑overloading AD.

Distributed training: Automatic multi‑node mixed parallelism without manual model partitioning.

Privacy protection: Gradient/model information sharing instead of raw data.
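The privacy-protection point above can be illustrated with a toy federated-averaging sketch: each participant computes gradients locally and shares only those updates, never the raw data. All function names here are hypothetical illustrations, not the MindSpore API.

```python
# Toy sketch of gradient sharing instead of raw-data sharing (FedAvg-style).
# Hypothetical helper names for illustration only; this is not the MindSpore API.
from typing import List

def average_gradients(worker_grads: List[List[float]]) -> List[float]:
    """Element-wise mean of per-worker gradient vectors."""
    n = len(worker_grads)
    dim = len(worker_grads[0])
    return [sum(g[i] for g in worker_grads) / n for i in range(dim)]

def apply_update(weights: List[float], grad: List[float], lr: float) -> List[float]:
    """One SGD step using the shared, aggregated gradient."""
    return [w - lr * g for w, g in zip(weights, grad)]

# Each worker trained on its own private data; only gradients leave the node.
grads = [[0.2, -0.4], [0.4, -0.2], [0.6, -0.6]]
avg = average_gradients(grads)                   # ~[0.4, -0.4]
new_w = apply_update([1.0, 1.0], avg, lr=0.1)    # ~[0.96, 1.04]
```

The coordinator sees only aggregated updates, which is the core idea behind the gradient/model sharing described above.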

4.2 Operator Library and Toolchain

CANN operator library: High‑performance AI operators, boosting development efficiency threefold.

TensorEngine: Unified DSL interface for automatic operator optimization and generation.

ModelArts: Machine‑learning PaaS platform handling over 4,000 daily training jobs.

5. Application Scenarios

Large‑scale model training: Supports trillion‑parameter models, suitable for NLP, CV, and frontier AI research.

Cloud AI services: Powers Huawei Cloud EI base compute, offering 59 AI services and 159 functions.

Industry intelligence: Medical imaging analysis, financial risk modeling, industrial quality inspection, etc.

Scientific computing: Molecular dynamics, climate prediction, and other HPC workloads.

6. Product Roadmap

1. First‑generation Ascend (2018‑2020)

Ascend 310: Edge inference chip, 12nm, 16 TOPS INT8, 8W power.

Ascend 910: Data‑center training chip, 7nm, 256 TFLOPS FP16, 310W, full‑stack AI ecosystem.
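Comparing the two first-generation chips on energy efficiency is straightforward from the figures quoted above; the per-watt numbers below are derived from those specs, not official efficiency ratings.

```python
# INT8 efficiency derived from the roadmap figures (not official ratings).
ascend310_tops, ascend310_w = 16, 8       # edge inference: 16 TOPS INT8 at 8 W
ascend910_tops, ascend910_w = 512, 310    # training: 512 TOPS INT8 at 310 W measured

eff_310 = ascend310_tops / ascend310_w    # 2.0 TOPS/W
eff_910 = ascend910_tops / ascend910_w    # ~1.65 TOPS/W

print(f"Ascend 310: {eff_310:.2f} TOPS/W, Ascend 910: {eff_910:.2f} TOPS/W")
```

As expected, the low-power edge part leads on TOPS per watt, while the training chip leads overwhelmingly on absolute throughput.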

2. Second‑generation Ascend (2021‑2023)

Ascend 910B: Optimized 7nm+ EUV, 376 TFLOPS FP16, enhanced large‑model training.

Ascend 310B: Edge upgrade supporting multimodal inference with MindSpore Lite.

3. Third‑generation Ascend (2024‑2025)

Ascend 910C: CloudMatrix 384 super‑node cluster, 384 chips per node, >3 TB/s memory bandwidth, enabling trillion‑parameter training.

Ascend 320: Next‑gen edge chip, 5nm, 50% better energy efficiency, supporting end‑edge‑cloud collaborative inference.

4. Future Planning (2026+)

Ascend 920: Anticipated 3nm process, >1 PFLOPS FP16, FP8 precision, dynamic sparsity, supporting MoE large models.

7. Technical Advantages Summary

Leading compute: 256 TFLOPS FP16, roughly 50-100% higher peak throughput than contemporary accelerators (about 2× NVIDIA Tesla V100's 125 TFLOPS FP16).

Superior energy efficiency: 310W power, best‑in‑class efficiency.

Architectural innovation: 3D Cube design for ultra‑high density.

Full‑stack co‑optimization: Deep integration with MindSpore.

Broad scenario coverage: Supports cloud to edge AI deployments.

8. Terminology

8.1 Glossary

Da Vinci architecture: Huawei’s proprietary heterogeneous AI compute architecture.

3D Cube: A three‑dimensional compute unit optimized for matrix operations.

MindSpore: Huawei’s full‑scenario AI framework.

CANN: Huawei AI operator library.

8.2 Test Environment

Test platform: Huawei Atlas 900 AI training cluster.

Comparison system: NVIDIA DGX‑2 with Tesla V100.

Benchmark models: ResNet‑50, Transformer, etc.

Tags: Huawei, MindSpore, AI processor, Ascend 910, Da Vinci architecture, high-performance computing
Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
