How Huawei’s Ascend Architecture Redefines AI Acceleration
This article examines Huawei's Ascend AI accelerator architecture, detailing its heterogeneous compute units, memory hierarchy, task scheduling, programming model, and chip variants, while also discussing future challenges and the ecosystem needed for widespread AI deployment.
Introduction
With the rapid development of artificial intelligence, deep neural networks (DNNs) have become essential in fields such as image recognition, speech processing, natural language processing, and autonomous driving. General-purpose CPUs, and increasingly even GPUs, struggle to keep pace with growing computational demands, which has driven the emergence of dedicated AI accelerators, especially neural processing units (NPUs). Huawei's Ascend architecture, a self-developed AI acceleration platform, offers high performance, flexible hardware design, and strong scalability across data-center, edge, IoT, and intelligent-device scenarios.
Design Principles and Technical Innovations of Ascend Architecture
The Ascend architecture innovates across compute units, memory hierarchy, task scheduling, and deep integration with the software stack.
Heterogeneous Compute Units and DaVinci Architecture
The Ascend NPU adopts the DaVinci architecture, integrating scalar, vector, and cube compute units so that control logic, elementwise operations, and matrix math each run on hardware suited to them, maximizing DNN throughput and avoiding the bottlenecks of one-size-fits-all designs.
Scalar Compute Unit
The scalar unit resembles a classic RISC integer ALU, handling control‑flow operations and simple arithmetic (add, subtract, multiply). Although its raw compute power is limited, it manages task scheduling and control logic efficiently.
Vector Compute Unit
The vector unit functions like a SIMD engine in CPUs/GPUs and suits high-performance computing and computer-vision workloads. It executes most DNN operations (e.g., normalization, activation functions), though its throughput can be limited by the bandwidth between its densely packed ALUs and local memory.
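To make this concrete, here is a minimal sketch in plain C++ (not Ascend C; the function is illustrative) of the kind of elementwise operation a vector unit executes one SIMD-width chunk at a time:

```cpp
#include <cstddef>

// Illustrative elementwise ReLU: the kind of operation a vector
// (SIMD) unit processes one lane-width chunk per cycle. In practice
// throughput is often bound by how fast operands stream between
// local memory and the ALUs, not by the ALUs themselves.
void relu(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] > 0.0f ? in[i] : 0.0f;  // one lane of SIMD work
}
```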
Cube Compute Unit
To overcome vector-unit limitations, Ascend introduces 2D and 3D (cube) compute units for GEMM acceleration. A cube unit contains 4096 multipliers and 4096 accumulators and reuses each operand 16 times, reducing data-movement energy to 1/16 that of the vector unit and greatly improving data reuse for large-scale matrix operations such as convolutions and GEMM.
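The reuse factor follows from the cube's geometry: assuming a 16 × 16 × 16 arrangement (consistent with the 4096-multiplier figure, since 16^3 = 4096 multiply-accumulate pairs per step), each element read from an input tile participates in 16 products, so 2 × 256 operand reads feed 4096 MACs. The following plain-C++ sketch models one cube step schematically; it is not hardware code:

```cpp
#include <array>

// Schematic model of one cube step: C(16x16) += A(16x16) * B(16x16).
// Each element of A and B is read once from the tile buffers but used
// in 16 products, which is where the 16x data-reuse factor comes from.
constexpr int T = 16;
using Tile = std::array<std::array<float, T>, T>;

void cube_step(const Tile& a, const Tile& b, Tile& c) {
    for (int i = 0; i < T; ++i)
        for (int j = 0; j < T; ++j)
            for (int k = 0; k < T; ++k)        // 16*16*16 = 4096 MACs
                c[i][j] += a[i][k] * b[k][j];  // a[i][k] reused across j
}
```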
Memory Architecture and Data Path Optimization
The memory system combines a multi-level on-chip buffer hierarchy (L0 and L1) with high-bandwidth memory (HBM) to alleviate bandwidth bottlenecks and boost data-transfer efficiency for large AI workloads.
Memory Transfer Engine (MTE)
The MTE handles data movement between memory levels, providing compression/decompression, matrix transformations, and image-to-column (im2col) conversion on the fly to improve bandwidth utilization and reduce latency.
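Im2col is worth a closer look: it unrolls each convolution input patch into a matrix column, turning the convolution into a GEMM the cube unit can consume directly. A minimal single-channel sketch in plain C++ (illustrative; the MTE performs this rearrangement in hardware during transfer):

```cpp
#include <cstddef>
#include <vector>

// Minimal single-channel im2col: each k x k input patch becomes one
// column of the output, so the convolution reduces to a plain matrix
// multiply. A hardware MTE does this rearrangement during DMA.
std::vector<float> im2col(const std::vector<float>& img,
                          int h, int w, int k) {
    int oh = h - k + 1, ow = w - k + 1;  // "valid" convolution extent
    std::vector<float> cols(static_cast<std::size_t>(k) * k * oh * ow);
    for (int y = 0; y < oh; ++y)
        for (int x = 0; x < ow; ++x)
            for (int dy = 0; dy < k; ++dy)
                for (int dx = 0; dx < k; ++dx)
                    cols[((dy * k + dx) * oh + y) * ow + x] =
                        img[(y + dy) * w + (x + dx)];
    return cols;
}
```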
High Bandwidth Memory (HBM)
HBM delivers far greater bandwidth than conventional DRAM, crucial for large‑scale training and inference where memory access often dominates performance.
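A back-of-the-envelope roofline check shows why. Using the article's 256 TFLOPS FP16 figure for the Ascend 910 and an assumed 1 TB/s of HBM bandwidth (a hypothetical number, purely for illustration):

```cpp
#include <cstdio>

// Back-of-the-envelope roofline: how many FLOPs a kernel must perform
// per byte fetched before compute, rather than memory, becomes the
// limit. 256 TFLOPS is the article's Ascend 910 figure; the 1 TB/s
// bandwidth is an assumed value used only for illustration.
int main() {
    double peak_flops = 256e12;              // FP16 peak, FLOP/s
    double mem_bw     = 1e12;                // bytes/s (assumed)
    double ridge      = peak_flops / mem_bw; // FLOPs per byte
    std::printf("compute-bound above %.0f FLOPs/byte\n", ridge);
    return 0;
}
```

With these numbers the ridge point is 256 FLOPs per byte; kernels below that intensity are memory-bound, which is exactly where extra HBM bandwidth pays off.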
Task Scheduling and Communication Queues
Communication queues coordinate data flow and synchronization among compute units, storing pending data packets and ensuring ordered execution. They also support dynamic priority control to adapt to diverse workloads.
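Conceptually this is a producer-consumer pattern: an upstream engine enqueues a descriptor when data is ready, and the downstream unit dequeues and processes descriptors in order. A simplified software analogue in C++ (the real queues are managed in hardware):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Simplified software analogue of a hardware communication queue:
// producers push descriptors for ready data, the consumer pops them
// in FIFO order, and blocking on an empty queue provides the
// synchronization between pipeline stages.
template <typename Desc>
class CommQueue {
    std::queue<Desc> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(Desc d) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(d)); }
        cv_.notify_one();
    }
    Desc pop() {  // blocks until a descriptor is pending
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        Desc d = std::move(q_.front());
        q_.pop();
        return d;
    }
};
```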
Programming Model
Ascend C follows the SPMD (Single-Program, Multiple-Data) paradigm: input data are split into shards that are processed in parallel across multiple AI cores. Every core runs the same instruction stream but identifies its own shard via a unique block_idx.
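The pattern is easy to model in plain C++ (this is not Ascend C source; in a real kernel the block index comes from the runtime, e.g., a GetBlockIdx-style call):

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// SPMD sketch: every simulated "core" runs the same function, and
// block_idx alone determines which shard of the input it owns.
void kernel(std::size_t block_idx, std::size_t block_num,
            const std::vector<float>& in, std::vector<float>& out) {
    std::size_t shard = in.size() / block_num;  // assume evenly divisible
    std::size_t begin = block_idx * shard;
    for (std::size_t i = begin; i < begin + shard; ++i)
        out[i] = in[i] * 2.0f;                  // same code, private data
}

int main() {
    const std::size_t block_num = 8;            // simulated AI cores
    std::vector<float> in(1024, 1.0f), out(1024);
    std::vector<std::thread> cores;
    for (std::size_t b = 0; b < block_num; ++b)
        cores.emplace_back(kernel, b, block_num,
                           std::cref(in), std::ref(out));
    for (auto& t : cores) t.join();
    return 0;
}
```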
Software Stack Support and Compiler Optimizations
The architecture integrates with major AI frameworks such as TensorFlow, PyTorch, and Huawei's MindSpore. Its compiler performs graph optimizations, operator fusion, and data-layout transformations to maximize performance and reduce pressure on memory bandwidth.
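Operator fusion is the easiest of these to picture: instead of materializing an intermediate tensor between, say, a bias-add and an activation, the fused kernel keeps the value in registers. A schematic C++ contrast (illustrative only, not the compiler's actual output):

```cpp
#include <cstddef>

// Unfused: two passes; the intermediate array round-trips through memory.
void bias_then_relu(const float* x, const float* b, float* tmp,
                    float* y, std::size_t n, std::size_t c) {
    for (std::size_t i = 0; i < n; ++i) tmp[i] = x[i] + b[i % c];
    for (std::size_t i = 0; i < n; ++i) y[i] = tmp[i] > 0 ? tmp[i] : 0;
}

// Fused: one pass; the intermediate value never leaves registers.
// This is the kind of rewrite a graph compiler applies automatically.
void bias_relu_fused(const float* x, const float* b, float* y,
                     std::size_t n, std::size_t c) {
    for (std::size_t i = 0; i < n; ++i) {
        float v = x[i] + b[i % c];
        y[i] = v > 0 ? v : 0;
    }
}
```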
Ascend Chip Series
Ascend 310
Targeted at edge and IoT devices, the Ascend 310 integrates 2 DaVinci AI cores and 8 ARM Cortex-A55 cores, delivering 16 TOPS INT8 (8 TFLOPS FP16) with low power consumption, making it suitable for image and speech recognition on smart cameras, home devices, and automotive edge platforms.
Ascend 910
The flagship data-center chip, the Ascend 910, integrates 32 DaVinci AI cores with high-bandwidth memory and delivers 256 TFLOPS of FP16 compute, enabling large-scale model training and high-throughput inference in cloud environments.
Future Development and Challenges
Despite its achievements, Ascend still faces challenges in global market penetration and ecosystem maturity: it competes with NVIDIA's entrenched software ecosystem and needs deeper integration with mainstream AI frameworks. Ongoing innovation across hardware and software will be required to support emerging multimodal AI and reinforcement-learning workloads.
Conclusion
Huawei’s Ascend architecture, with its innovative compute units, memory system, and software co‑design, has become a powerful AI accelerator. Continued advancements are expected to expand its global impact and further drive AI technology adoption.