Inside Huawei Ascend: How Its Heterogeneous Architecture Powers Modern AI Workloads

This article provides an in‑depth technical analysis of Huawei’s Ascend AI accelerator architecture, detailing its heterogeneous compute units, memory hierarchy, task scheduling, programming model, compiler optimizations, and the capabilities of the Ascend 310 and 910 chips, while also discussing future challenges and market competition.


Introduction

Deep neural networks (DNNs) drive modern AI workloads such as image recognition, speech processing, natural‑language processing, and autonomous driving. Conventional CPUs and GPUs struggle to keep pace with the compute demand of these models, which has driven the rise of dedicated AI accelerators. Huawei’s Ascend architecture is a self‑designed AI accelerator platform that combines heterogeneous compute units, a multi‑level memory hierarchy, and tight software integration to deliver high compute efficiency across data‑center, cloud, edge, and IoT scenarios.

Design Principles and Technical Innovations

Scalar Compute Unit

The scalar unit is a classic RISC‑style integer ALU that handles control‑flow operations and simple arithmetic (add, sub, mul). Although its raw compute throughput is modest, it orchestrates task scheduling and control logic for the higher‑level compute units.

Scalar Compute Unit Diagram

Vector Compute Unit

The vector unit is a SIMD engine similar to those in CPUs and GPUs, and it executes the bulk of the elementwise arithmetic in inference and training (e.g., normalization and activation functions). Because vector operations offer little data reuse, the data path between the densely packed ALUs and local memory can become a bandwidth bottleneck.

Vector Compute Unit Diagram
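
As a rough illustration, the NumPy sketch below (our own example, not the Ascend C vector API) shows the kind of elementwise work the vector unit handles: every element of a tile is read and written once, so throughput is bounded by local‑memory bandwidth rather than by ALU count.

import numpy as np

def normalize_and_activate(x, eps=1e-5):
    # Normalization: one reduction pass plus one elementwise pass.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    y = (x - mean) / np.sqrt(var + eps)
    # Activation (ReLU): purely elementwise. Each element is touched
    # once, so arithmetic intensity is low and memory bandwidth, not
    # the number of ALUs, limits throughput.
    return np.maximum(y, 0.0)

tile = np.random.randn(16, 256).astype(np.float32)
out = normalize_and_activate(tile)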

Cube Compute Unit

To overcome the vector‑unit bandwidth limitation, Ascend introduces a 3‑D cube compute unit dedicated to DNN workloads. Each cube contains 4,096 multipliers and 4,096 accumulators that operate on 16×16×16 matrices. Operand reuse is 16× higher than in the vector unit, reducing energy per operation and dramatically increasing throughput for GEMM‑heavy tasks such as convolutions.

Cube Compute Unit Diagram
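
The NumPy sketch below models this dataflow (the tiling scheme, not Huawei’s hardware pipeline): a GEMM is decomposed into 16×16×16 blocks, each inner tile product standing in for one pass through the cube’s multiplier array, with every loaded operand reused across 16 partial products.

import numpy as np

T = 16  # cube tile edge: one cube op multiplies a 16x16 tile by a 16x16 tile

def cube_style_matmul(A, B):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2 and M % T == 0 and K % T == 0 and N % T == 0
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(0, M, T):
        for j in range(0, N, T):
            acc = np.zeros((T, T), dtype=np.float32)  # accumulator bank
            for k in range(0, K, T):
                # One "cube op": 16x16x16 = 4,096 multiply-accumulates,
                # reusing each loaded operand across 16 partial products.
                acc += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]
            C[i:i+T, j:j+T] = acc
    return C

A = np.random.randn(64, 128).astype(np.float32)
B = np.random.randn(128, 32).astype(np.float32)
assert np.allclose(cube_style_matmul(A, B), A @ B, atol=1e-3)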

Memory Architecture and Data‑Path Optimization

The memory subsystem combines multi‑level on‑chip buffers (L0 and L1) with high‑bandwidth memory (HBM). Keeping frequently reused tiles close to the compute units reduces off‑chip traffic and improves data‑transfer efficiency for large‑scale AI tasks.

Ascend Memory Architecture
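
A back‑of‑the‑envelope sketch (with illustrative numbers of our own choosing) shows why the tiered buffers matter: staging cube‑sized tiles on chip means each operand fetched from HBM is reused many times, cutting off‑chip traffic proportionally.

# Illustrative arithmetic only; dimensions and tile size are our assumptions.
M = N = K = 4096          # GEMM dimensions
t = 16                    # on-chip tile edge (cube-sized tiles)
bytes_fp16 = 2

naive = 2 * M * N * K * bytes_fp16        # operands re-read from HBM per MAC
tiled = 2 * M * N * K // t * bytes_fp16   # each loaded tile reused t times

print(f"naive HBM traffic: {naive / 2**30:.0f} GiB")   # 256 GiB
print(f"tiled HBM traffic: {tiled / 2**30:.0f} GiB")   # 16 GiB, i.e. 16x less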

Memory Transfer Engine (MTE)

The Memory Transfer Engine (MTE) moves data across memory tiers and transforms it in flight: it supports compression/decompression, matrix transformations such as transpose, and the img2col rearrangement that turns convolutions into matrix multiplications, thereby reducing latency and improving bandwidth utilization.
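
For reference, here is a plain NumPy version of the img2col transformation (the MTE performs this rearrangement in hardware; the sketch only shows the data layout it produces):

import numpy as np

def img2col(x, kh, kw, stride=1):
    # x: input feature map of shape (C, H, W).
    C, H, W = x.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    # Each column holds one flattened receptive field, so the whole
    # convolution collapses into a single GEMM against the kernels.
    cols = np.empty((C * kh * kw, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i*stride:i*stride+kh, j*stride:j*stride+kw]
            cols[:, i * out_w + j] = patch.ravel()
    return cols

x = np.random.randn(3, 8, 8).astype(np.float32)           # (C, H, W)
kernels = np.random.randn(4, 3, 3, 3).astype(np.float32)  # (out_C, C, kh, kw)
y = kernels.reshape(4, -1) @ img2col(x, 3, 3)             # conv as one GEMM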

High‑Bandwidth Memory (HBM)

HBM provides far greater bandwidth than conventional DRAM, enabling rapid data access for compute units during training and inference on massive datasets.

Task Scheduling and Communication Queues

Communication queues orchestrate data movement and synchronization between compute units. They store pending data packets, enforce ordering, and allow dynamic priority adjustments, preventing resource conflicts and optimizing overall performance.

Communication Queue Diagram
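
The Python sketch below is a purely conceptual model of such a queue (the real queues are hardware structures managed by the scheduler, and the packet names here are invented for illustration): packets are served in priority order, ties fall back to arrival order, and a pending packet’s priority can be raised on the fly.

import heapq
import itertools

class CommQueue:
    def __init__(self):
        self._heap, self._entries = [], {}
        self._tie = itertools.count()  # keeps FIFO order within a priority

    def push(self, packet_id, payload, priority):
        entry = [priority, next(self._tie), packet_id, payload]
        self._entries[packet_id] = entry
        heapq.heappush(self._heap, entry)

    def boost(self, packet_id, new_priority):
        # Dynamic priority adjustment: mark the old entry stale, re-add.
        old = self._entries.pop(packet_id)
        old[2] = None
        self.push(packet_id, old[3], new_priority)

    def pop(self):
        while self._heap:
            _, _, packet_id, payload = heapq.heappop(self._heap)
            if packet_id is not None:          # skip stale entries
                del self._entries[packet_id]
                return packet_id, payload
        raise IndexError("queue empty")

q = CommQueue()
q.push("vec->cube:tile0", b"...", priority=2)
q.push("mte->vec:tile1", b"...", priority=2)
q.boost("mte->vec:tile1", 0)       # lower number = higher priority
print(q.pop()[0])                  # mte->vec:tile1 is now served first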

Programming Model

Ascend C follows the Single‑Program Multiple‑Data (SPMD) paradigm: input data are split into shards, and each shard is processed by a separate AI Core. All cores execute the same instruction stream and are distinguished only by a unique block_idx identifier.

SPMD Execution Model
AI Core Parallel Processing
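
A host‑side Python sketch (not actual Ascend C code) illustrates the execution model: every core runs the identical kernel body, and block_idx alone determines which shard of the input a core processes.

import numpy as np

NUM_CORES = 8

def kernel(x, y, out, block_idx):
    # Same instruction stream on every core; block_idx picks the shard.
    shard = len(x) // NUM_CORES
    lo, hi = block_idx * shard, (block_idx + 1) * shard
    out[lo:hi] = x[lo:hi] + y[lo:hi]   # elementwise add on this shard only

x = np.arange(1024, dtype=np.float32)
y = np.ones(1024, dtype=np.float32)
out = np.empty_like(x)
for block_idx in range(NUM_CORES):     # hardware runs these in parallel
    kernel(x, y, out, block_idx)
assert np.allclose(out, x + y)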

Software Stack Support and Compiler Optimizations

The Ascend platform integrates with major AI frameworks (TensorFlow, PyTorch, MindSpore). Its compiler performs graph optimizations, operator fusion, and data‑layout transformations, aligning compute and memory bandwidth to achieve higher execution efficiency.
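
As a schematic example of operator fusion (a generic graph rewrite, not the Ascend compiler’s actual pass), the sketch below merges a bias‑add and a ReLU into one operator, eliminating a round‑trip of the intermediate tensor through memory:

import numpy as np

def bias_add(x, b):
    return x + b

def relu(x):
    return np.maximum(x, 0.0)

def fused_bias_relu(x, b):
    # One pass over the data: in a real fused kernel the intermediate
    # (x + b) never leaves on-chip memory.
    return np.maximum(x + b, 0.0)

FUSION_RULES = {("bias_add", "relu"): "fused_bias_relu"}

def fuse(graph):
    # graph: ordered list of op names; merge adjacent fusible pairs.
    fused, i = [], 0
    while i < len(graph):
        pair = tuple(graph[i:i+2])
        if pair in FUSION_RULES:
            fused.append(FUSION_RULES[pair]); i += 2
        else:
            fused.append(graph[i]); i += 1
    return fused

print(fuse(["matmul", "bias_add", "relu"]))  # ['matmul', 'fused_bias_relu']
x, b = np.random.randn(4), np.ones(4)
assert np.allclose(relu(bias_add(x, b)), fused_bias_relu(x, b))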

Ascend Chip Series

Ascend 310

Targeted at edge and IoT devices, Ascend 310 integrates two DaVinci AI cores alongside Arm CPU cores, delivering 16 TOPS (INT8) and 8 TFLOPS (FP16) at roughly 8 W. It is suited to image and speech recognition in smart‑home, surveillance, and automotive edge scenarios.

Ascend 910

The flagship Ascend 910 is designed for data‑center and cloud environments. It houses 32 DaVinci AI cores, offers 256 TFLOPS (FP16), and leverages HBM to support large‑scale model training and high‑throughput inference.

Future Development and Challenges

Despite strong performance, Ascend faces challenges in expanding globally and maturing its software ecosystem. Competition from NVIDIA and the need for broader framework compatibility remain key hurdles. Continued innovation in hardware and software will be required to support multimodal AI and reinforcement‑learning workloads.

Conclusion

Huawei’s Ascend architecture, with its heterogeneous compute units, advanced memory system, and tightly coupled software stack, represents a significant force in AI acceleration. Ongoing advancements are expected to increase its global impact and further drive AI technology adoption.

Tags: heterogeneous computing, AI accelerator, AI hardware, HBM, Huawei Ascend, NPU architecture
Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
