What Makes Huawei’s Ascend 920 AI Chip a Game-Changer? Deep Technical Breakdown

An in‑depth analysis of Huawei’s third‑generation Ascend 920 AI processor, covering its 6 nm process, 64 Da Vinci cores, the Cube Unit matrix engine, HBM‑PIM memory‑compute integration, high‑speed interconnects, performance benchmarks against Nvidia’s H20, and the challenges and future directions for AI hardware.


Overview

The Ascend 920 is Huawei’s third‑generation AI processor, fabricated on SMIC’s 6 nm (N+3) process. It integrates 64 Da Vinci AI cores, each containing 32 matrix compute units (Cube Units), and supports mixed‑precision BF16/FP16/INT8 operation. Peak performance reaches 900 TFLOPS (BF16) and 1800 TOPS (INT8) with 4000 GB/s of memory bandwidth, roughly a 50% performance improvement over the previous‑generation Ascend 910.
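
As a rough back‑of‑the‑envelope illustration of what those headline figures imply together, the sketch below (plain Python, no vendor APIs) computes the roofline “ridge point”: the arithmetic intensity a kernel needs before it becomes compute‑bound rather than bandwidth‑bound. Only the numbers quoted above are used; the calculation itself is generic.

```python
# Back-of-the-envelope roofline check using the headline numbers quoted above.
peak_bf16_tflops = 900      # peak BF16 throughput, TFLOPS
peak_int8_tops = 1800       # peak INT8 throughput, TOPS
mem_bw_gbs = 4000           # HBM bandwidth, GB/s

# Arithmetic intensity (operations per byte) needed to saturate the compute
# units rather than the memory system: peak_ops / bandwidth.
ridge_bf16 = peak_bf16_tflops * 1e12 / (mem_bw_gbs * 1e9)
ridge_int8 = peak_int8_tops * 1e12 / (mem_bw_gbs * 1e9)

print(f"BF16 ridge point: {ridge_bf16:.0f} FLOPs per byte")  # ~225
print(f"INT8 ridge point: {ridge_int8:.0f} ops per byte")    # ~450
```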

Architecture Details

Cube Unit Evolution

The Cube Unit adopts a 3‑D systolic array of 16×16 processing elements that enables FP16/INT8 mixed‑precision computing. Dynamic sparse computation is built into the hardware, pruning redundant connections at runtime and boosting inference performance by up to 200% while preserving model accuracy. BF16 support further accelerates Transformer training, yielding roughly 30% more BERT‑large training throughput per kilowatt‑hour.
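
To make the tiling concrete, here is a minimal NumPy sketch of how a 16×16 matrix engine consumes a larger GEMM: the problem is split into 16×16 blocks, multiplied in FP16, and accumulated in FP32. This is an illustration of mixed‑precision tile execution under those assumptions, not Huawei’s actual Cube Unit kernel.

```python
import numpy as np

TILE = 16  # width of one matrix-engine tile

def tiled_matmul_fp16(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % TILE == 0 and n % TILE == 0 and k % TILE == 0
    a16, b16 = a.astype(np.float16), b.astype(np.float16)
    out = np.zeros((m, n), dtype=np.float32)              # FP32 accumulator
    for i in range(0, m, TILE):
        for j in range(0, n, TILE):
            acc = np.zeros((TILE, TILE), dtype=np.float32)
            for p in range(0, k, TILE):
                # One 16x16x16 block: the work a single matrix-engine pass covers.
                acc += (a16[i:i+TILE, p:p+TILE].astype(np.float32)
                        @ b16[p:p+TILE, j:j+TILE].astype(np.float32))
            out[i:i+TILE, j:j+TILE] = acc
    return out

a = np.random.rand(64, 64).astype(np.float32)
b = np.random.rand(64, 64).astype(np.float32)
print(np.allclose(tiled_matmul_fp16(a, b), a @ b, atol=1e-1))  # True: results match
```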

HBM‑PIM Memory‑Compute Integration

For the first time in an AI chip, the Ascend 920 embeds HBM‑PIM (high‑bandwidth‑memory processing‑in‑memory) technology, allowing part of the compute logic to reside directly on the HBM3 stacks. This “memory‑side compute” cuts data‑movement power, improving energy efficiency by 5× in image‑segmentation tasks and reducing on‑device memory usage by 37%.
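
The sketch below illustrates why memory‑side compute saves energy: if a partial reduction happens inside the HBM stack, less data ever crosses the memory interface. The per‑byte energy constants and the 4× reduction factor are placeholder assumptions chosen for illustration, not published Ascend 920 figures.

```python
# Illustrative estimate of interface-traffic energy saved by in-stack compute.
# The per-byte energies and the 4x reduction factor are assumed values.
E_ACROSS_INTERFACE_PJ_PER_BYTE = 10.0   # assumed: data crossing the HBM interface
E_IN_STACK_PJ_PER_BYTE = 2.0            # assumed: data handled by PIM logic in-stack

def transfer_energy_mj(bytes_moved: float, pj_per_byte: float) -> float:
    return bytes_moved * pj_per_byte * 1e-9   # picojoules -> millijoules

tensor_bytes = 1 * 2**30                 # a 1 GiB activation tensor
baseline = transfer_energy_mj(tensor_bytes, E_ACROSS_INTERFACE_PJ_PER_BYTE)
# PIM reduces the tensor 4x inside the stack before it crosses the interface.
with_pim = (transfer_energy_mj(tensor_bytes, E_IN_STACK_PJ_PER_BYTE)
            + transfer_energy_mj(tensor_bytes / 4, E_ACROSS_INTERFACE_PJ_PER_BYTE))

print(f"baseline transfer energy : {baseline:.1f} mJ")
print(f"with in-stack reduction  : {with_pim:.1f} mJ")
```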

Multi‑Chip Interconnect (HCCS 2.0)

The processor uses Huawei’s proprietary HCCS 2.0 (high‑speed cache‑coherent bus) to interconnect up to four chips, achieving 480 GB/s of inter‑chip bandwidth, 50% higher than the previous generation. Combined with PCIe 5.0 (128 GB/s bidirectional per card), the architecture scales to large AI clusters, delivering 91% scaling efficiency for ResNet‑50 training on a 192‑card system, versus roughly 82% for Nvidia H20’s NVLink‑based clusters.
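
For readers unfamiliar with how such cluster numbers are derived, the sketch below shows the usual definition of scaling efficiency plus a simple ring all‑reduce transfer‑time model. The 100 MB gradient size is an assumed value for illustration; only the 480 GB/s link figure and the 192‑card, 91% result come from the text.

```python
# Scaling efficiency as commonly defined, plus a ring all-reduce time model.

def scaling_efficiency(per_card_throughput: float,
                       cluster_throughput: float,
                       n_cards: int) -> float:
    """Measured cluster throughput divided by ideal linear scaling."""
    return cluster_throughput / (per_card_throughput * n_cards)

def ring_allreduce_seconds(grad_bytes: float, n_cards: int, link_gbs: float) -> float:
    """A ring all-reduce moves about 2*(N-1)/N of the data per card."""
    return 2 * (n_cards - 1) / n_cards * grad_bytes / (link_gbs * 1e9)

# The article's 192-card, 91% figure expressed with this definition:
eff = scaling_efficiency(per_card_throughput=1.0,
                         cluster_throughput=0.91 * 192,
                         n_cards=192)
print(f"scaling efficiency: {eff:.2%}")      # 91.00%

# Assumed 100 MB of ResNet-50 gradients over a 480 GB/s HCCS 2.0 link:
print(f"{ring_allreduce_seconds(100e6, 192, 480) * 1e3:.2f} ms per all-reduce")
```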

Key Technology Breakthroughs

Dynamic Sparse Computing: A programmable Sparse Engine identifies and skips redundant neural‑network connections, halving inference latency in recommendation models without affecting AUC. MindSpore provides automatic sparsity‑annotation tools, so no model code changes are required.
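
A generic way to picture sparse skipping is magnitude pruning followed by computing only the surviving connections, as in the NumPy sketch below. This is an illustration of the concept, not the Sparse Engine’s actual hardware algorithm or MindSpore’s annotation tooling.

```python
import numpy as np

def prune_by_magnitude(w: np.ndarray, sparsity: float) -> np.ndarray:
    # Zero out the weakest |w| values so that `sparsity` of them are dropped.
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= threshold, w, 0.0)

def sparse_matvec(w_sparse: np.ndarray, x: np.ndarray) -> np.ndarray:
    out = np.zeros(w_sparse.shape[0])
    rows, cols = np.nonzero(w_sparse)        # hardware would skip the zeros here
    for r, c in zip(rows, cols):
        out[r] += w_sparse[r, c] * x[c]
    return out

w = np.random.randn(256, 256)
x = np.random.randn(256)
w50 = prune_by_magnitude(w, sparsity=0.5)    # drop the weakest 50% of weights
print(np.allclose(sparse_matvec(w50, x), w50 @ x))   # True: same result, half the work
```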

Heterogeneous Computing with CANN 6.0: The Compute Architecture for Neural Networks (CANN) 6.0 introduces operator fusion and automatic parallelism. Operator fusion merges several adjacent operators into a single kernel, reducing data movement and improving ResNet‑50 inference speed by 30%. Auto‑parallelism combines data, model, and pipeline parallelism, raising compute utilization for GPT‑3‑175B training to 92%.
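
The toy example below shows the fusion idea on three elementwise layers: three separate passes versus one fused expression. NumPy cannot actually fuse the loops, so the sketch only demonstrates result equivalence; on the NPU the fused kernel would traverse memory once. It is not CANN code.

```python
import numpy as np

def unfused(x, scale, bias):
    y = x * scale              # kernel 1: read x, write y
    y = y + bias               # kernel 2: read y, write y
    return np.maximum(y, 0.0)  # kernel 3 (ReLU): read y, write result

def fused(x, scale, bias):
    # Conceptually a single kernel: one traversal of the tensor on real hardware.
    return np.maximum(x * scale + bias, 0.0)

x = np.random.randn(1024, 1024).astype(np.float32)
print(np.allclose(unfused(x, 1.5, -0.1), fused(x, 1.5, -0.1)))   # True
```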

Software Ecosystem: The Ascend 920 is tightly integrated with MindSpore 3.0, which offers a unified dynamic/static graph programming model. Compatibility layers enable largely automated migration of CUDA code (≈92% conversion success) and direct execution of TensorFlow and PyTorch models.
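
As a hedged sketch of the unified dynamic/static programming model, the snippet below runs the same network eagerly (PyNative mode) and as a compiled graph (Graph mode). The calls follow MindSpore 2.x conventions; the exact API in the “MindSpore 3.0” release referenced here may differ.

```python
import numpy as np
import mindspore as ms
from mindspore import nn

net = nn.Dense(4, 2)                       # one fully connected layer
x = ms.Tensor(np.random.rand(3, 4), ms.float32)

ms.set_context(mode=ms.PYNATIVE_MODE)      # dynamic graph: eager, easy to debug
print(net(x).shape)                        # (3, 2)

ms.set_context(mode=ms.GRAPH_MODE)         # static graph: compiled ahead of execution
print(net(x).shape)                        # (3, 2)
```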

Performance Measurements & Industry Applications

Benchmarking on typical AI training workloads shows the Ascend 920’s energy‑efficiency ratio at 0.39 W/TOPS (INT8), outperforming Nvidia H20’s 0.62 W/TOPS and promising up to 42 % data‑center electricity cost savings.
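
A quick arithmetic check of that comparison: energy for a fixed amount of INT8 work scales with W/TOPS, so the chip‑level saving works out to about 37%; the “up to 42%” figure presumably adds facility‑level effects such as cooling and utilization, which are not modeled here.

```python
# Arithmetic behind the efficiency comparison quoted above.
ascend_w_per_tops = 0.39   # Ascend 920, INT8
h20_w_per_tops = 0.62      # Nvidia H20, INT8

# Energy for a fixed amount of INT8 work scales with W/TOPS, so:
chip_level_saving = 1 - ascend_w_per_tops / h20_w_per_tops
print(f"chip-level energy saving for the same INT8 work: {chip_level_saving:.1%}")  # ~37.1%
```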

Performance comparison chart (figure)

Key application scenarios include:

Intelligent driving: supports eight simultaneous 1080p video streams with sub‑50 ms decision latency.

Biopharma: paired with AlphaFold2, single‑card protein‑structure prediction runs three times faster than GPU‑based solutions, shortening drug‑discovery cycles by roughly 20%.

Financial risk control: real‑time fraud detection processes millions of transactions per second, reducing model update latency from hours to minutes.

Challenges and Future Outlook

Ecosystem maturity: while MindSpore covers mainstream models, niche domains such as quantum‑simulation lack robust toolchains.

Manufacturing yield: SMIC’s 6 nm process yields around 78%, still short of the yields TSMC achieves at 5 nm.

Competitive landscape: Nvidia H20 faces export restrictions, but AMD MI308 and other competitors may fill market gaps.

Future directions for Ascend 920 include deeper memory‑compute integration (e.g., MRAM), silicon‑photonic interconnects to lower multi‑chip communication power, and expanding the open‑source community to build a full AI stack around the Ascend architecture.

Tags: Performance Benchmark, hardware architecture, AI processor, AI chip industry, Huawei Ascend 920
Written by Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
