Machine Learning Algorithms & Natural Language Processing
Apr 25, 2026 · Artificial Intelligence

Why DeepSeek‑V4 Took Twice as Long: Inside the Training‑Stability Challenges and Engineering Hacks

The DeepSeek‑V4 technical report attributes the model's doubled training time to massive token and parameter scaling and severe training‑stability issues in its MoE layers, and details the engineering solutions that addressed them, including Anticipatory Routing, SwiGLU Clamping, specialist expert training, and a custom sandbox cluster, while also exposing high hallucination rates despite impressive benchmark performance.
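
The summary only names the stability tricks; as a rough illustration of what activation clamping inside a SwiGLU feed‑forward block could look like (the block layout and the clamp threshold below are assumptions, not details taken from the report):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClampedSwiGLU(nn.Module):
    """Illustrative SwiGLU feed-forward block with a hard clamp on the gated
    activation. The clamp bound is a made-up hyperparameter; DeepSeek-V4's
    actual "SwiGLU Clamping" may bound different quantities."""

    def __init__(self, d_model: int, d_ff: int, clamp: float = 50.0):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)
        self.clamp = clamp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(gate) * up, then clamp so rare activation spikes
        # cannot blow up downstream layers (a common stability trick).
        h = F.silu(self.w_gate(x)) * self.w_up(x)
        h = h.clamp(min=-self.clamp, max=self.clamp)
        return self.w_down(h)
```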

DeepSeek V4 · Generative Reward Model · LLM
12 min read
Machine Learning Algorithms & Natural Language Processing
Apr 14, 2026 · Artificial Intelligence

Revisiting On-Policy Distillation (OPD): Typical Failures and a More Stable Fix

On‑Policy Distillation (OPD) is widely used for post‑training large language models, but the sampled‑token variant often becomes unstable due to token‑level reward imbalance, teacher‑student signal mismatch on student‑generated prefixes, and tokenizer mismatches. This article analyses the bias‑variance trade‑off, identifies three root failure modes, and proposes a teacher‑top‑K local‑support‑set objective with top‑p rollout and special‑token masking that yields more stable training and better performance on both math and agentic benchmarks.
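
As a hedged sketch of the kind of objective the summary describes (the top‑K size, the cross‑entropy form over the teacher's local support set, and the masking convention below are assumptions, not the article's exact formulation):

```python
import torch
import torch.nn.functional as F

def topk_opd_loss(student_logits, teacher_logits, special_token_mask, k=20):
    """Distillation loss on student rollouts, restricted to the teacher's
    top-K tokens per position (a local support set) and skipping special
    tokens. K=20 and the forward cross-entropy form are illustrative
    assumptions. Logits: [batch, seq, vocab]; mask: [batch, seq] bool."""
    # Teacher's top-K support set per position, renormalised over those K tokens.
    topk_vals, topk_idx = teacher_logits.topk(k, dim=-1)
    teacher_p = F.softmax(topk_vals, dim=-1)
    # Student log-probs gathered only at the teacher's support tokens.
    student_logp = F.log_softmax(student_logits, dim=-1).gather(-1, topk_idx)
    per_token = -(teacher_p * student_logp).sum(-1)          # [batch, seq]
    keep = (~special_token_mask).float()                     # drop BOS/EOS/tool tokens
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```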

OPD · On-Policy Distillation · large language models
32 min read
PaperAgent
Jan 1, 2026 · Artificial Intelligence

How Manifold-Constrained Hyper-Connections Boost Large-Scale Model Training Efficiency

The article introduces mHC, a Manifold‑Constrained Hyper‑Connections technique that replaces standard residual links with multiple learned pathways, constraining the mixing matrices to be doubly stochastic to keep gradient flow stable. It reports stable training of 27‑billion‑parameter models with only 6.7% extra compute and superior performance across eight downstream benchmarks.
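
The summary only names the doubly stochastic constraint; one way such a constraint on a learned mixing matrix could be enforced is Sinkhorn normalisation, sketched below purely as an assumption about the mechanism, not as mHC's actual implementation:

```python
import torch

def doubly_stochastic(logits: torch.Tensor, iters: int = 10) -> torch.Tensor:
    """Map an n x n matrix of unconstrained logits to an (approximately)
    doubly stochastic matrix via Sinkhorn iterations: alternately normalise
    rows and columns of the exponentiated logits."""
    m = logits.exp()
    for _ in range(iters):
        m = m / m.sum(dim=1, keepdim=True)   # rows sum to 1
        m = m / m.sum(dim=0, keepdim=True)   # columns sum to 1
    return m

# Illustrative use: mix n parallel residual pathways with the constrained matrix.
n, d = 4, 1024
streams = torch.randn(n, d)                  # n hyper-connection pathways
mix = doubly_stochastic(torch.randn(n, n))
mixed_streams = mix @ streams                # blended residual streams
```

Because a doubly stochastic matrix is a convex combination of permutations, its spectral norm is at most one, so repeated mixing cannot blow up the residual signal; that is the intuition behind the stability claim.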

AI Architecture · Efficient Implementation · Manifold-Constrained
6 min read
AI Frontier Lectures
Dec 9, 2025 · Artificial Intelligence

Can Token‑Level Surrogates Stabilize RL for Large Language Models? A Deep Dive

This article analyzes why optimizing sequence‑level rewards for LLMs with token‑level surrogate objectives can improve reinforcement‑learning stability, explains the theoretical conditions required, introduces Routing Replay for MoE models, and presents extensive experiments validating the approach.
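
A rough sense of the token‑level surrogate idea can be given with a PPO‑style per‑token clipped ratio fed by a single sequence‑level advantage; this is an illustrative stand‑in, not the article's exact objective, and Routing Replay for MoE models is not shown:

```python
import torch

def token_level_surrogate(logp_new, logp_old, seq_advantage, eps=0.2):
    """Sequence-level reward, token-level surrogate: the sequence advantage
    is broadcast to every token, and each token gets its own clipped
    importance ratio instead of one high-variance sequence-level ratio.
    Shapes: logp_* are [batch, seq], seq_advantage is [batch]."""
    ratio = (logp_new - logp_old).exp()                    # per-token importance ratio
    adv = seq_advantage.unsqueeze(-1)                      # broadcast to [batch, seq]
    unclipped = ratio * adv
    clipped = ratio.clamp(1 - eps, 1 + eps) * adv
    return -torch.minimum(unclipped, clipped).mean()       # minimise the negative surrogate
```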

Importance Sampling · Mixture of Experts · large language models
12 min read
Baobao Algorithm Notes
Oct 30, 2025 · Artificial Intelligence

Why LLM RL Training Crashes While SFT Stays Stable: Insights & Tricks

The article examines the fundamental similarity between SFT and RL loss functions for large language models, explains why RL training is prone to instability, discusses infrastructure and data quality challenges, and reviews practical tricks and reward‑model considerations for more reliable RL fine‑tuning.
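
The "fundamental similarity" can be stated compactly: both SFT and a REINFORCE‑style RL loss are weighted negative log‑likelihoods over tokens, differing only in the per‑token weight. A minimal sketch under that reading (not code from the article):

```python
import torch

def weighted_nll(logp_tokens: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Both SFT and a REINFORCE-style RL loss reduce to a weighted token NLL.
    logp_tokens: [batch, seq] log-probs of the target/sampled tokens.
    weights:     [batch, seq] per-token weights."""
    return -(weights * logp_tokens).mean()

# SFT: every demonstration token is weighted 1.
# RL:  sampled tokens are weighted by a (possibly negative) advantage,
#      which is part of why the optimisation is far less forgiving of noise.
logp = torch.log(torch.rand(2, 8) * 0.9 + 0.1)        # placeholder per-token log-probs
sft_loss = weighted_nll(logp, torch.ones_like(logp))
rl_loss = weighted_nll(logp, torch.randn_like(logp))  # advantages can flip sign
```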

AI · LLM · SFT
11 min read
Data Party THU
Sep 10, 2025 · Industry Insights

MoE vs MoR: Deep Dive into Expert and Recursive Mixture Architectures for LLMs

This article provides a comprehensive technical comparison between Mixture of Experts (MoE) and the newly proposed Mixture of Recursion (MoR) architectures, covering design principles, parameter efficiency, inference latency, training stability, routing mechanisms, hardware deployment considerations, and suitable application scenarios.
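
For orientation, a bare‑bones top‑k expert routing step is sketched below; it is a generic illustration, not tied to any particular model in the comparison, and MoR's recursive weight sharing with per‑token recursion depth is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal top-k Mixture-of-Experts layer for illustration only: a router
    scores experts per token, the top-k experts are evaluated, and their
    outputs are combined with the renormalised router weights."""

    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                          # x: [tokens, d_model]
        scores = self.router(x)                    # [tokens, n_experts]
        topv, topi = scores.topk(self.k, dim=-1)
        gate = F.softmax(topv, dim=-1)             # renormalise over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):                 # naive loop; real kernels dispatch sparsely
            idx = topi[:, slot]
            expert_out = torch.stack(
                [self.experts[e](x[t]) for t, e in enumerate(idx.tolist())]
            )
            out += gate[:, slot:slot + 1] * expert_out
        return out
```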

Hardware Deployment · Mixture of Experts · Mixture of Recursion
13 min read
AIWalker
Apr 13, 2025 · Artificial Intelligence

Huawei Pangu Ultra: 135B Ascend‑Native Dense LLM Without Nvidia GPUs

Huawei's Pangu Ultra is a 135‑billion‑parameter dense language model trained entirely on Ascend NPUs; the report details novel stability architectures, a domain‑aware tokenizer, multi‑stage pre‑training, extensive system optimizations, and benchmark results that surpass leading models such as Llama 405B and DeepSeek‑R1.

Ascend NPU · Dense Model · Large Language Model
15 min read