Why Randomly Masking Gradients Can Outperform Adam in Large‑Scale Model Training

The article explains how randomly masking a large portion of gradient updates during large-model training, sometimes up to 99% of them, can accelerate convergence and even outperform traditional optimizers such as Adam, supported by recent Google research and empirical observations.


Magma: Randomly Masked Updates in Adaptive Optimizers

Google's paper "On Surprising Effectiveness of Masking Updates in Adaptive Optimizers" introduces the Magma algorithm, which randomly masks the updates for a subset of model parameters at each optimizer step. Empirical results show that Magma can converge faster than Adam and the Muon optimizer on large-scale language-model training.
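
Concretely, the idea can be pictured as a standard adaptive step in which a random fraction of the computed update is zeroed out before being applied. The PyTorch sketch below is a minimal illustration under that assumption; the function name, signature, and the choice to mask only the final update (rather than the moment buffers) are illustrative, not the paper's exact Magma recipe.

```python
import torch

def masked_adam_step(param, grad, m, v, step, lr=1e-3,
                     betas=(0.9, 0.999), eps=1e-8, mask_prob=0.9):
    # One Adam-style update in which each coordinate of the computed
    # update is independently dropped with probability `mask_prob`.
    b1, b2 = betas
    m.mul_(b1).add_(grad, alpha=1 - b1)            # first-moment EMA
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)  # second-moment EMA
    m_hat = m / (1 - b1 ** step)                   # bias correction
    v_hat = v / (1 - b2 ** step)
    update = m_hat / (v_hat.sqrt() + eps)
    # Keep each coordinate with probability 1 - mask_prob.
    keep = (torch.rand_like(update) >= mask_prob).to(update.dtype)
    param.add_(update * keep, alpha=-lr)
```

Whether the surviving coordinates are rescaled, and whether the moment buffers see the masked or the unmasked gradient, are details this sketch deliberately leaves open.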

Key Findings

Partial gradient updates suffice. In LLM training, randomly discarding between 40% and 99% of gradient updates does not reduce convergence speed or final performance; on some tasks it yields better results.

Many gradient dimensions are ineffective or harmful. Adaptive optimizers generate update vectors where a large fraction of dimensions contribute little to loss reduction or even increase loss, indicating substantial waste in parameter‑space utilization.
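
One way to probe this claim is a first-order check: since the applied step is the negative of the update (scaled by the learning rate), any coordinate whose update points against the gradient locally increases the loss. The sketch below is an illustrative heuristic along those lines, not the paper's measurement protocol.

```python
import torch

def update_quality(update: torch.Tensor, grad: torch.Tensor,
                   tiny: float = 1e-12) -> dict:
    # The applied step is -lr * update, so to first order a coordinate
    # reduces the loss only when update_i * grad_i > 0.
    per_coord = update * grad
    return {
        "harmful_frac": (per_coord < 0).float().mean().item(),
        "negligible_frac": (per_coord.abs() < tiny).float().mean().item(),
    }
```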

Full gradient synchronization is unnecessary. In multi‑node training, communicating only a tiny subset of gradients can preserve or improve model performance, reducing communication overhead.
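
A minimal sketch of what such subset synchronization could look like, assuming each worker ships only a small random slice of its gradient as (index, value) pairs; the actual collective operation (for example an all-gather) and any error-feedback machinery are omitted, and the function names and the 1% default are assumptions, not the paper's protocol.

```python
import math
import torch

def pack_subset(grad: torch.Tensor, keep_frac: float = 0.01):
    # Select a random subset of coordinates; the (indices, values)
    # pair is the only payload that would cross the network.
    flat = grad.flatten()
    k = max(1, int(keep_frac * flat.numel()))
    idx = torch.randperm(flat.numel())[:k]
    return idx, flat[idx]

def unpack_subset(idx, vals, shape):
    # Coordinates that were not transmitted are treated as zero.
    flat = torch.zeros(math.prod(shape), dtype=vals.dtype)
    flat[idx] = vals
    return flat.reshape(shape)
```

With keep_frac=0.01 the per-step payload shrinks to a few percent of the dense gradient's size, even after accounting for the overhead of transmitting indices.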

Dynamic mask probability outperforms a fixed 50% mask. The mask probability is adjusted according to the alignment between the momentum buffer and the current gradient: when alignment is poor, the mask probability is increased, leading to better convergence than a static 50% mask rate.
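
A hedged sketch of such a schedule, assuming cosine similarity as the alignment score and a linear mapping from alignment to mask probability; the paper's exact rule is not reproduced here, and p_min, p_max, and the mapping are assumptions.

```python
import torch

def dynamic_mask_prob(momentum: torch.Tensor, grad: torch.Tensor,
                      p_min: float = 0.5, p_max: float = 0.99,
                      eps: float = 1e-12) -> float:
    # Cosine alignment between the momentum buffer and the gradient.
    cos = torch.dot(momentum.flatten(), grad.flatten()) / (
        momentum.norm() * grad.norm() + eps)
    align = (cos.clamp(-1.0, 1.0).item() + 1.0) / 2.0  # [-1, 1] -> [0, 1]
    # Poor alignment -> mask more aggressively (higher probability).
    return p_max - (p_max - p_min) * align
```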

Implications

The results suggest that, for large‑model training, strategically selecting which gradients to apply and communicate is more critical than using every available update. This aligns with the broader observation that focused updates can achieve superior convergence compared with indiscriminate full‑gradient updates.

References

https://s-sahoo.com/mdlm/

https://arxiv.org/pdf/2602.15322v1

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: large language models, Distributed Training, gradient optimization, adaptive optimizers, Magma algorithm, random masking
Written by

AI2ML AI to Machine Learning

Original articles on artificial intelligence, machine learning, and deep optimization. Less is more, life is simple! Shi Chunqi
