Huawei Pangu Ultra: 135B Ascend‑Native Dense LLM Without Nvidia GPUs
Huawei's Pangu Ultra introduces a 135‑billion‑parameter dense language model trained entirely on Ascend NPUs, detailing novel stability architectures, a domain‑aware tokenizer, multi‑stage pre‑training, extensive system optimizations, and benchmark results that surpass leading models such as Llama 405B and DeepSeek‑R1.
Huawei has released the newest member of its Pangu series, Pangu Ultra, a 135 B dense large‑language model (LLM) that runs natively on Ascend NPUs, eliminating the need for Nvidia GPUs. The model demonstrates state‑of‑the‑art performance on a wide range of reasoning and language‑understanding benchmarks.
Model Architecture
The model uses a 94‑layer Transformer with 135 B parameters. Feed‑forward networks employ SwiGLU activation, and the attention layers adopt Grouped‑Query Attention (GQA) to reduce KV‑cache memory. To address training stability at such depth, the authors propose two techniques (both sketched below):
Depth‑scaled Sandwich‑Norm (DSSN) : in addition to the usual pre‑layer normalization, applies a LayerNorm to each sub‑layer's output (sandwich norm), with that norm's gamma initialized proportionally to the inverse square root of the network depth, preventing norm explosion across the many residual connections.
TinyInit : a depth‑ and width‑aware initialization that sets weight standard deviation to a value inversely proportional to the square root of the product of depth and width, yielding faster loss convergence and better downstream performance.
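A minimal PyTorch sketch of both ideas, assuming a standard pre‑norm residual block; the class and function names here are illustrative, not taken from the Pangu Ultra implementation:

```python
import torch
import torch.nn as nn

class SandwichNormBlock(nn.Module):
    """One Transformer sub-layer (attention or FFN) wrapped in sandwich norm:
    normalization before the sub-layer and again on its output, with the
    output norm's gamma initialized to depth**-0.5 (the DSSN scaling)."""
    def __init__(self, sublayer: nn.Module, hidden_size: int, num_layers: int):
        super().__init__()
        self.pre_norm = nn.LayerNorm(hidden_size)
        self.post_norm = nn.LayerNorm(hidden_size)
        self.sublayer = sublayer
        with torch.no_grad():
            self.post_norm.weight.fill_(num_layers ** -0.5)  # depth-scaled gamma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))

def tiny_init_(weight: torch.Tensor, hidden_size: int, num_layers: int) -> None:
    """TinyInit-style initialization: standard deviation inversely proportional
    to the square root of (depth x width), shrinking as the model scales up."""
    std = (hidden_size * num_layers) ** -0.5
    nn.init.normal_(weight, mean=0.0, std=std)
```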
Tokenizer
The authors build a 153 376‑token vocabulary using a domain‑aware strategy. Separate frequency analyses are performed on general Chinese, general English, code, and mathematics corpora; the resulting vocabularies are merged and deduplicated, ensuring balanced coverage across diverse tasks while maintaining compression efficiency.
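A toy sketch of the merge‑and‑deduplicate step, assuming whitespace‑tokenized frequency counts rather than the actual BPE pipeline; the per‑domain budget and function names are illustrative assumptions:

```python
from collections import Counter
from typing import Dict, Iterable, List

def domain_vocab(corpus: Iterable[str], budget: int) -> List[str]:
    """Keep the `budget` most frequent tokens observed in one domain's corpus."""
    freq = Counter(tok for line in corpus for tok in line.split())
    return [tok for tok, _ in freq.most_common(budget)]

def merge_domain_vocabs(domain_corpora: Dict[str, Iterable[str]],
                        budget_per_domain: int) -> List[str]:
    """Build per-domain vocabularies independently (general Chinese, general
    English, code, math), then merge and deduplicate so every domain stays
    represented in the final vocabulary."""
    merged: Dict[str, None] = {}  # insertion-ordered set
    for _domain, corpus in domain_corpora.items():
        for tok in domain_vocab(corpus, budget_per_domain):
            merged.setdefault(tok, None)
    return list(merged)
```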
Training Procedure
Pre‑training proceeds in three curriculum‑style stages on 13.2 T high‑quality tokens (a schedule sketch follows the list):
General stage (12 T tokens): broad data from books, webpages, and multilingual sources.
Reasoning stage (0.8 T tokens): heavy emphasis on mathematics, science, and code (over 60 % of data).
Annealing stage (0.4 T tokens): instruction‑type data (≈20 % of tokens) with long and short chain‑of‑thought examples.
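An illustrative summary of that schedule as a small Python config, using the token budgets quoted above; the field names are assumptions for the sketch, not the paper's configuration format:

```python
# Token budgets in billions of tokens (1 T = 1,000 B).
PRETRAIN_STAGES = [
    {"name": "general",   "tokens_B": 12_000,
     "mix": "books, webpages, multilingual sources"},
    {"name": "reasoning", "tokens_B": 800,
     "mix": "math, science, and code (>60% of the stage)"},
    {"name": "annealing", "tokens_B": 400,
     "mix": "instruction data (~20% of tokens), long/short chain-of-thought"},
]
assert sum(stage["tokens_B"] for stage in PRETRAIN_STAGES) == 13_200  # 13.2 T total
```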
Data quality is assessed with a four‑dimensional scoring system (cleanliness, fluency, educational value, information density) using a fine‑tuned Pangu‑26B proxy model; high‑scoring samples receive higher sampling probability. Ablation with a low‑quality proxy shows that training on noisy data requires 1.6× more steps to reach comparable performance.
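A minimal sketch of quality‑weighted sampling of this kind, where each document's four‑dimension scores set its sampling probability; the mean‑score‑with‑temperature weighting is an assumption for illustration, not the paper's exact scheme:

```python
import random
from typing import Dict, List, Sequence

QUALITY_DIMS = ("cleanliness", "fluency", "educational_value", "information_density")

def sample_documents(docs: Sequence[str],
                     scores: Sequence[Dict[str, float]],
                     k: int, temperature: float = 2.0) -> List[str]:
    """Draw k documents with replacement; a higher mean quality score (each
    dimension assumed in [0, 1]) gives a proportionally higher sampling weight."""
    weights = [(sum(s[d] for d in QUALITY_DIMS) / len(QUALITY_DIMS)) ** temperature
               for s in scores]
    return random.choices(docs, weights=weights, k=k)
```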
Long‑Sequence Extension
Through a two‑phase length‑extension curriculum, the maximum input length is increased to 128 K tokens (≈100 k English words or 170 k Chinese characters). RoPE base frequencies are tuned on a validation set matching the target length before training.
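A minimal sketch of the RoPE piece, assuming the usual inverse‑frequency formulation; the base value shown is only a placeholder, since the paper selects it on a validation set matched to the 128 K target length:

```python
import torch

def rope_inv_freq(head_dim: int, base: float = 1_000_000.0) -> torch.Tensor:
    """Inverse rotary frequencies; a larger `base` slows the rotation of the
    low-frequency dimensions so positions up to 128K remain distinguishable."""
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    return 1.0 / (base ** exponents)

def rope_angles(seq_len: int, head_dim: int, base: float = 1_000_000.0) -> torch.Tensor:
    """Rotation angle for every (position, frequency) pair: shape (seq_len, head_dim // 2)."""
    positions = torch.arange(seq_len, dtype=torch.float32)
    return torch.outer(positions, rope_inv_freq(head_dim, base))
```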
Post‑Training
After base‑model training, two post‑training stages are applied:
Supervised Fine‑Tuning (SFT) to acquire basic instruction‑following ability.
Reinforcement Learning from Human Feedback (RLHF) with a latency‑tolerant framework and a hybrid reward that mixes deterministic signals and model‑based evaluations, targeting mathematics, code generation, and general problem solving (a minimal reward sketch follows).
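A minimal sketch of such a hybrid reward, assuming a deterministic checker (exact answer match, unit tests) and a learned reward model; the linear mix and all function names are illustrative assumptions rather than the paper's implementation:

```python
from typing import Callable

def hybrid_reward(prompt: str, response: str,
                  rule_checker: Callable[[str, str], bool],
                  reward_model: Callable[[str, str], float],
                  alpha: float = 0.5) -> float:
    """Blend a deterministic pass/fail signal (e.g., the math answer matches,
    the generated code passes its tests) with a model-based score in [0, 1]."""
    deterministic = 1.0 if rule_checker(prompt, response) else 0.0
    learned = float(reward_model(prompt, response))
    return alpha * deterministic + (1.0 - alpha) * learned
```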
System Optimizations
Pangu Ultra is trained on a cluster of 8 192 Ascend NPUs. The following optimizations raise Model FLOPs Utilization (MFU) from the baseline 43 % to over 52 %:
Mixed Parallelism : 128‑way data parallelism, 8‑way tensor parallelism, and 8‑way pipeline parallelism (128 × 8 × 8 = 8 192 NPUs), combined with ZeRO and sequence parallelism. A virtual‑pipeline scheduler (6 virtual stages) reduces the pipeline bubble rate from 30.45 % to 6.8 %.
MC2 (Merged Compute and Communication) : fine‑grained splitting of MatMul and tensor‑parallel communication, enabling deep pipelining of compute and communication.
NFA (NPU Fusion Attention) : an attention kernel that compresses the attention mask, avoiding explicit mask construction and using a 2048×2048 lower‑triangular matrix as a reusable mask library.
Other fused kernels : RMSNorm, SwiGLU, RoPE‑fused kernels, gradient‑accumulation fusion, and pipeline send/recv fusion.
Sub‑sequence (Context) Parallelism : improved load‑balanced chunking that assigns two chunks per device, mitigating the imbalance seen in standard Megatron‑LM CP (see the sketch after this list).
Memory optimizations : sharing of attention mask, actual sequence length, RoPE sin/cos, and position embeddings across virtual pipeline stages to cut redundant memory usage.
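A minimal sketch of that load‑balanced assignment under the common causal‑attention balancing scheme (device i takes the i‑th and (2P−1−i)‑th chunks); this mirrors the idea described above and is not the Pangu Ultra training code:

```python
from typing import List

def assign_chunks(seq_len: int, num_devices: int) -> List[List[range]]:
    """Cut the sequence into 2*P equal chunks and give device i chunks i and
    2P-1-i, pairing a cheap early chunk with an expensive late one (causal
    attention cost grows with position), so per-device work stays balanced."""
    chunk = seq_len // (2 * num_devices)
    chunks = [range(i * chunk, (i + 1) * chunk) for i in range(2 * num_devices)]
    return [[chunks[i], chunks[2 * num_devices - 1 - i]] for i in range(num_devices)]

# Example: 8 context-parallel devices over a 128K-token sequence.
print(assign_chunks(131072, 8)[0])  # [range(0, 8192), range(122880, 131072)]
```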
Results and Analysis
Training proceeded without any loss spikes, as shown by the smooth loss curve. Benchmark comparisons (dense models Qwen2.5‑72B, Llama 405B; MoE model DeepSeek‑V3) reveal that Pangu Ultra achieves the best scores on most tasks, with particularly strong gains over dense baselines.
After post‑training, the model surpasses DeepSeek‑R1 on AIME 2024, MATH‑500, GPQA‑Diamond, LiveCodeBench, and also retains strong performance on MMLU‑Pro and ArenaHard.
Stability experiments demonstrate that DSSN eliminates loss spikes compared with standard Pre‑LN, and yields smoother gradient‑norm curves and faster convergence. TinyInit provides a noticeable advantage over conventional small‑init when training a 135 B model on ~100 B tokens.
Conclusion
Pangu Ultra demonstrates that large‑scale dense LLMs can be trained efficiently on fully domestic Ascend hardware, achieving performance competitive with or superior to leading Nvidia‑GPU‑based models. The full technical report, "Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs," is available at
https://github.com/pangu-tech/pangu-ultra/blob/main/pangu-ultra-report.pdf.