Why DeepSeek V4 Can Run on Huawei Ascend: A Deep Technical Breakdown
The article analyzes why most open‑source large models cannot run on Huawei Ascend NPU, detailing the CUDA‑centric ecosystem, Ascend's CANN stack, three core technical hurdles, and the deep collaboration and tooling that enabled DeepSeek V4’s successful adaptation.
Running a PyTorch‑based large model directly on Huawei Ascend NPU typically triggers a cascade of errors—from low‑level operators to communication libraries and memory management—because the software ecosystem differs fundamentally from NVIDIA’s CUDA.
Fundamental issue: CUDA is an ecosystem – CUDA is not just an API but a full parallel-computing platform (programming language, compiler, runtime, math and communication libraries) refined for over a decade by thousands of engineers. More than 90% of open-source AI projects target CUDA, creating strong developer lock-in; a new hardware platform that cannot reuse CUDA forces developers to rewrite the entire stack.
Ascend software stack – Huawei’s counterpart is CANN (Compute Architecture for Neural Networks), which handles low‑level operator management, memory scheduling, and compilation optimizations for Ascend chips. Frameworks such as MindSpore are built on top of CANN, whereas PyTorch and TensorFlow remain the dominant frameworks in the CUDA world.
Three concrete technical hurdles:
Operator alignment – Numerical precision is critical for large‑model training: the "same" matrix‑multiply operator can produce different numerical results on CUDA and CANN (different accumulation orders and rounding behavior), which leads to divergent convergence curves. DeepSeek spent months rewriting low‑level code and repeatedly testing until results were bit‑wise consistent.
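The precision issue above is easy to reproduce even in plain Python: floating‑point addition is not associative, so two backends that reduce a dot product in different orders (say, sequential vs. tiled accumulation) can return different bits for the "same" operator. A minimal, hardware‑independent illustration:

```python
# Floating-point addition is not associative: a different reduction
# order (sequential vs. tiled/pairwise, as different backends may use)
# changes the result at the bit level.
vals = [1e16, 1.0, -1e16, 1.0]

sequential = 0.0
for v in vals:                      # one accumulator, left to right
    sequential += v

tiled = (vals[0] + vals[1]) + (vals[2] + vals[3])  # pairwise partial sums

print(sequential, tiled)            # 1.0 vs 0.0 in IEEE-754 double precision
```

The small addends are absorbed or survive depending purely on reduction order; at the scale of billions of such reductions per training step, this is what makes convergence curves diverge across backends.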
Distributed communication library replacement – NVIDIA’s NCCL is highly tuned for NVLink, whereas Ascend uses HCCL with different bandwidth and latency characteristics. The collective communication patterns (AllReduce, AllGather, ReduceScatter) require a redesign of topology, pipeline scheduling, and overlap strategies.
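The collective pattern itself is library‑agnostic; what NCCL and HCCL tune differently is how it maps onto the interconnect. Below is a sketch of a ring AllReduce (reduce‑scatter followed by all‑gather) simulated in plain Python, not any real NCCL or HCCL API:

```python
def ring_allreduce(data):
    """Simulate ring AllReduce: data[r][c] is chunk c held by rank r."""
    n = len(data)                                  # world size == chunk count
    state = [list(row) for row in data]
    # Reduce-scatter: after n-1 steps, rank r owns reduced chunk (r+1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, state[r][(r - step) % n])
                 for r in range(n)]
        for r, idx, val in sends:                  # neighbor accumulates
            state[(r + 1) % n][idx] += val
    # All-gather: circulate reduced chunks until every rank has all of them.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, state[r][(r + 1 - step) % n])
                 for r in range(n)]
        for r, idx, val in sends:                  # neighbor overwrites
            state[(r + 1) % n][idx] = val
    return state

print(ring_allreduce([[0, 1, 2], [3, 4, 5], [6, 7, 8]]))
# every rank ends with the per-chunk sums [9, 12, 15]
```

Redesigning this for Ascend means re‑deciding chunk sizes, ring vs. tree topology, and how each step overlaps with compute, because HCCL's bandwidth and latency profile differs from NVLink's.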
Memory hierarchy differences – GPUs expose a hierarchy of on‑chip SRAM → L2 cache → HBM → host memory, while the Ascend NPU's hierarchy is organized differently. Model memory allocation, KV‑cache handling, and on‑chip memory planning for DeepSeek V4's sliding‑window attention and heterogeneous KV pools therefore demand a complete rewrite.
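The bookkeeping side of a sliding‑window KV cache can be sketched independently of the hardware; what changes between GPU and NPU is which level of the hierarchy the buffers live in and how their placement is planned. A minimal ring‑buffer sketch (illustrative only, not DeepSeek's implementation):

```python
from collections import deque

class SlidingWindowKVCache:
    """Keep only the last `window` tokens' keys/values (a ring buffer)."""

    def __init__(self, window: int):
        self.keys = deque(maxlen=window)    # oldest entries evict automatically
        self.values = deque(maxlen=window)

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def visible(self):
        """Return the K/V entries attention may read at this step."""
        return list(self.keys), list(self.values)

cache = SlidingWindowKVCache(window=3)
for t in range(5):                          # process tokens 0..4
    cache.append(f"k{t}", f"v{t}")
print(cache.visible())                      # only tokens 2, 3, 4 remain
```

On real hardware the eviction is an offset computation over pre‑allocated device buffers rather than a Python deque, which is exactly the memory‑planning logic that has to be rewritten per hierarchy.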
Why DeepSeek succeeded while others struggle:
Deep collaboration – Huawei and DeepSeek co‑designed the model, embedding Ascend‑specific hardware considerations from the outset, as reflected in a design specification that ties communication bandwidth (GBps) to compute throughput (TFLOP/s).
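That bandwidth‑to‑compute coupling is essentially a roofline argument: a kernel only saturates the chip when its arithmetic intensity (FLOPs per byte moved) exceeds the hardware's TFLOP/s‑to‑GB/s ratio. A back‑of‑envelope check with illustrative numbers (not Ascend's or DeepSeek's actual figures):

```python
def is_compute_bound(flops, bytes_moved, peak_tflops, bw_gbps):
    """Roofline test: compare a kernel's arithmetic intensity with the
    machine balance (peak FLOPs per byte of memory/interconnect traffic)."""
    machine_balance = (peak_tflops * 1e12) / (bw_gbps * 1e9)   # FLOPs/byte
    return flops / bytes_moved >= machine_balance

# Hypothetical chip: 300 TFLOP/s peak compute, 1600 GB/s bandwidth.
n = 4096
matmul_flops = 2 * n**3                    # n x n GEMM
matmul_bytes = 3 * n * n * 2               # three fp16 matrices
print(is_compute_bound(matmul_flops, matmul_bytes, 300, 1600))   # True

vec_add_flops = n
vec_add_bytes = 3 * n * 4                  # read a, b; write c (fp32)
print(is_compute_bound(vec_add_flops, vec_add_bytes, 300, 1600)) # False
```

Co‑designing the model means choosing layer shapes and parallelism so that the hot kernels land on the compute‑bound side of this inequality for the target chip's actual ratio.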
TileLang framework – DeepSeek introduced TileLang, a hardware‑agnostic operator description language, and used the Z3 SMT solver for automatic verification. TileLang‑Ascend provides Ascend‑specific optimizations while remaining abstracted from CUDA or CANN.
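The property a solver is asked to establish in such a pipeline is that a rewritten, tiled schedule of an operator is semantically equal to the reference loop for all inputs. The sketch below checks that property by brute force on small integer inputs (where arithmetic is exact), standing in for the symbolic proof a solver like Z3 would perform; it is not TileLang code:

```python
def reference_sum(xs):
    """Reference semantics: one flat reduction."""
    total = 0
    for x in xs:
        total += x
    return total

def tiled_sum(xs, tile=4):
    """Rewritten schedule: per-tile partial sums, then combine."""
    total = 0
    for i in range(0, len(xs), tile):
        partial = 0
        for x in xs[i:i + tile]:
            partial += x
        total += partial
    return total

# Brute-force equivalence check over small sizes; an SMT solver would
# prove the same equality symbolically, for all sizes and values.
assert all(reference_sum(list(range(n))) == tiled_sum(list(range(n)))
           for n in range(33))
print("tiled schedule matches reference")
```

Automating this kind of check is what lets one operator description be lowered to different backends (CUDA, CANN) with confidence that the schedules agree.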
CANN compatibility – Huawei’s CANN now offers about 95% CUDA API compatibility, lowering migration effort, though the remaining 5% of high‑value hotspots (sparse attention, communication, low‑precision quantization) still need hand‑tuned operators.
Engineering scale – Public reports indicate DeepSeek spent several months rewriting low‑level code to adapt to Ascend, a resource investment only a few teams can afford.
Current status – Inference on Ascend is production‑ready; training remains CUDA‑dominant, though Huawei has demonstrated 1.35‑trillion‑parameter and 718‑billion‑parameter models trained on Ascend 910. DeepSeek V4‑Pro’s full pre‑training on Ascend has not been fully disclosed, but continued pre‑training and inference are confirmed feasible.
Conclusion and outlook – Most open‑source models cannot migrate because the CUDA ecosystem binds the entire engineering chain, and moving to CANN requires months of low‑level rewrites and close chip‑vendor collaboration. The remaining 5% incompatibility represents the most performance‑critical paths. Future AI code‑generation tools and broader adoption of intermediate languages like TileLang could reduce migration costs, and higher CANN‑CUDA compatibility may eventually shift ecosystem inertia.