
How DeepSeek V4 Uses Huawei Ascend 950PR to Outperform Nvidia H20 by 2.9×

The article analyzes DeepSeek V4's migration to Huawei's Ascend 950PR chip and CANN framework, detailing three hardware‑level innovations, the CUDA‑to‑CANN transition, and the resulting 35× inference speed boost, 2.87× performance over Nvidia H20, and dramatic cost reductions for trillion‑parameter models.

Architects' Tech Alliance

Background

In early April, DeepSeek announced its next‑generation flagship model, DeepSeek V4, which runs entirely on Huawei's Ascend 950PR processor and shifts its software stack from CUDA to the CANN framework.

Three Hardware‑Level Innovations (the "Three Knives")

1. Native FP4 Precision

The 950PR chip delivers 1.56 PFLOPS of FP4 compute, 2.87× the inference performance of Nvidia's H20 special‑edition chip. Using a proprietary encoding technique, it compresses FP16 models down to FP4: a 70 billion‑parameter model that requires 140 GB of memory at FP16 shrinks to 35 GB, allowing a single card to load models that previously needed three H20 GPUs.
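The memory figures above follow directly from bytes-per-weight arithmetic. A minimal sketch (my own illustration, not from the article; it ignores KV-cache and activation memory that a real deployment also needs):

```python
# Rough weight-memory footprint of a dense model at different precisions.
# Reproduces the article's arithmetic: 70B params at FP16 vs FP4.

def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes) at a given precision."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

fp16 = weight_memory_gb(70, 16)   # 140.0 GB
fp4 = weight_memory_gb(70, 4)     # 35.0 GB
print(fp16, fp4, fp16 / fp4)      # 140.0 35.0 4.0
```

The 4× compression ratio is simply 16 bits / 4 bits; any quantization overhead (scales, zero points) would add a few percent on top.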

2. HiBL High‑Bandwidth Memory

DeepSeek V4 leverages Huawei's HiBL 1.0 architecture, which offers 1.4 TB/s of memory bandwidth and 112 GB of HBM (16 GB more than the H20). This relieves the traditional memory‑bandwidth bottleneck and speeds up the prefill stage of 256K‑token context inference by 60%.
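As a back-of-envelope illustration of why bandwidth matters for inference (my own arithmetic, not from the article): each generated token must stream the active weights from HBM at least once, so memory bandwidth imposes a hard lower bound on per-token latency during decoding:

```python
# Bandwidth-imposed lower bound on per-token decode latency.
# Illustrative only: real latency also depends on compute, KV-cache
# reads, and how much of the model is active per token (MoE).

def min_decode_latency_ms(active_weight_gb: float, bandwidth_tb_s: float) -> float:
    """Lower-bound ms/token: active weights (GB) / bandwidth (GB/s) * 1000."""
    return active_weight_gb / (bandwidth_tb_s * 1000) * 1000

# 35 GB of FP4 weights streamed over 1.4 TB/s of HBM bandwidth:
print(min_decode_latency_ms(35, 1.4))  # 25.0 ms/token lower bound
```

Halving the bytes per weight (e.g. FP16 to FP4 here) halves this bound twice over, which is why precision and bandwidth improvements compound.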

3. Fine‑Granularity Data Transfer

The chip reduces the minimum memory transaction size from 512 bytes to 128 bytes, boosting small‑packet processing efficiency roughly 4× in fragmented‑data scenarios such as real‑time dialogue and short‑video analysis, and raising multimodal utilization by over 30%.
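The 4× figure falls out of simple utilization arithmetic: a small payload still burns a full minimum-size transaction, so shrinking the transaction quarters the wasted bytes. A sketch (my own illustration, with an assumed 100-byte payload):

```python
# Effective bus utilization for small payloads under different minimum
# memory-transaction sizes. A payload smaller than one transaction still
# transfers the full transaction's worth of bytes.
import math

def utilization(payload_bytes: int, txn_bytes: int) -> float:
    """Fraction of transferred bytes that are useful payload."""
    txns = math.ceil(payload_bytes / txn_bytes)
    return payload_bytes / (txns * txn_bytes)

print(utilization(100, 512))  # ~0.195 (512 B minimum transaction)
print(utilization(100, 128))  # ~0.781, i.e. exactly 4x better here
```

For payloads at or above 512 bytes the two configurations converge, which matches the article's framing: the win is specific to fragmented, small-packet workloads.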

Migration from CUDA to CANN

The transition is not a simple compatibility layer; DeepSeek performed a native rewrite. CANN’s Operator Developer toolchain enables "define‑and‑deploy" via a five‑line YAML configuration, while complex logic can be expressed in Python, cutting operator development time from 3–5 days to hours.
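The article does not reproduce the YAML itself; the following is a purely hypothetical sketch of what a declarative "define‑and‑deploy" operator manifest of this kind could look like (every field name here is invented for illustration, not CANN's actual schema):

```yaml
# Hypothetical operator manifest -- field names invented, not CANN's schema
op_name: fused_rmsnorm
inputs: [x: float16, gamma: float16]
outputs: [y: float16]
impl: python://ops/fused_rmsnorm.py
target: ascend_950pr
```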

The core challenge is the architectural difference between Nvidia's SIMT model and Ascend's 3D‑Cube+Vector+Scalar design. CANN 8.2's torch_npu interface requires only three code changes to complete the migration—importing the NPU backend, swapping device identifiers, and substituting an NPU‑optimized optimizer—with optional automatic redirection of CUDA calls.
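A minimal sketch of that three-change pattern, following the torch_npu adapter's documented usage (this only runs on a machine with Ascend NPUs and the torch_npu package installed; the model and hyperparameters are placeholders, and fused-optimizer class names vary by torch_npu version):

```python
import torch
import torch_npu  # change 1: import the NPU backend (registers the "npu" device)

device = torch.device("npu:0")  # change 2: "cuda:0" -> "npu:0"
model = torch.nn.Linear(4096, 4096).to(device)  # placeholder model

# change 3: swap in an NPU-optimized fused optimizer in place of the
# stock torch.optim one (class name per torch_npu docs; version-dependent)
optimizer = torch_npu.optim.NpuFusedAdamW(model.parameters(), lr=1e-4)
```

The optional CUDA-call redirection the article mentions would remove even the device-string change, at the cost of staying in a compatibility mode rather than a native port.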

DeepSeek‑Specific Optimizations in CANN

CANN provides deep optimizations for DeepSeek V4’s MoE architecture, adding dedicated operators and integrating FlashAttention, which reduces attention memory complexity from O(n²) to O(n) and speeds up execution by 30‑50%.
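The O(n²)-to-O(n) claim refers to the attention score matrix: naive attention materializes an n×n matrix of scores, while FlashAttention-style tiling keeps only O(n) running softmax statistics. At 256K context the difference is dramatic, as this sketch shows (my own arithmetic, single head, FP16):

```python
# Attention intermediate-memory at long context: naive vs tiled
# (FlashAttention-style). Naive attention materializes an n x n score
# matrix; tiled attention keeps only running statistics per query row.

def naive_attn_bytes(n: int, heads: int, bytes_per_el: int = 2) -> int:
    """Full n*n score matrix per head (FP16 by default)."""
    return heads * n * n * bytes_per_el

def tiled_attn_bytes(n: int, heads: int, bytes_per_el: int = 2) -> int:
    """O(n) running softmax statistics (row max + row sum) per head."""
    return heads * n * 2 * bytes_per_el

n = 256 * 1024  # 256K-token context
print(naive_attn_bytes(n, heads=1) / 1e9)  # ~137.4 GB for ONE head
print(tiled_attn_bytes(n, heads=1) / 1e6)  # ~1.05 MB
```

At this context length the naive score matrix for a single head would not even fit in the 950PR's 112 GB of HBM, which is why a fused, tiled attention kernel is a prerequisite for 256K-token inference rather than a mere speedup.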

Benchmark results show DeepSeek V4 scoring 96.1 on MATH‑500 and 93.5 on HumanEval+, thanks to CANN’s automatic operator fusion and scheduling.

Hardware‑Software Co‑Design for Trillion‑Parameter Models

DeepSeek V4 activates 130 billion parameters per token yet can be served at only 1/70 the cost of GPT‑4, thanks to the 950PR's 128‑byte fine‑grained memory architecture and CANN's dynamic‑shape support, which keep 256K‑token context latency acceptable.

In multimodal tasks, the model processes image‑plus‑text inputs 35× faster than CUDA‑based platforms while reducing power consumption by 40%.

Industry Implications

This migration demonstrates that Chinese AI hardware and software ecosystems can achieve native performance without relying on compatibility modes, signaling a shift from single‑product parameter races to full‑stack ecosystem efficiency competitions.

The rapid maturation of the CANN ecosystem—covering drivers, SDKs, and compatibility with major AI frameworks—lowers the entry barrier for developers and accelerates the adoption of domestic AI chips in large‑scale models.

Tags: DeepSeek, Performance Analysis, Model Inference, AI Hardware, Huawei Ascend, CANN Framework
Written by

Architects' Tech Alliance

Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.
