Why Randomly Masking Gradients Can Outperform Adam in Large‑Scale Model Training
The article explains how randomly masking a large portion of gradient updates during large‑model training—sometimes up to 99%—can accelerate convergence and even surpass traditional optimizers like Adam, supported by recent Google research and empirical observations.
Magma: Randomly Masked Updates in Adaptive Optimizers
Google’s paper On the Surprising Effectiveness of Masking Updates in Adaptive Optimizers introduces the Magma algorithm, which randomly masks a subset of model parameters during each gradient‑update step. Empirical results show that Magma can converge faster than Adam and the Muon optimizer on large‑scale language‑model training.
Key Findings
Partial gradient updates suffice. In LLM training, randomly discarding between 40% and 99% of gradient updates does not reduce convergence speed or final performance; on some tasks it yields better results.
Many gradient dimensions are ineffective or harmful. Adaptive optimizers generate update vectors where a large fraction of dimensions contribute little to loss reduction or even increase loss, indicating substantial waste in parameter‑space utilization.
Full gradient synchronization is unnecessary. In multi‑node training, communicating only a tiny subset of gradients can preserve or improve model performance, reducing communication overhead.
Dynamic mask probability outperforms a fixed 50% mask. The mask probability is adjusted according to the alignment between the momentum buffer and the current gradient: when alignment is poor, the mask probability is increased, leading to better convergence than a static 50% mask rate.
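The findings above can be sketched as a masked Adam‑style step. This is a minimal illustration, not the paper's exact update rule: the function name, the cosine‑similarity alignment measure, the linear mapping from alignment to mask probability, and the rescaling of surviving coordinates are all assumptions made for the sketch.

```python
import numpy as np

def masked_adam_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                     eps=1e-8, p_min=0.4, p_max=0.99, rng=None):
    """One Adam-style step in which a random subset of coordinates is masked.

    Hypothetical dynamic rule: the mask probability rises when the momentum
    buffer m is poorly aligned (cosine similarity) with the current gradient g.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Standard Adam moment updates with bias correction.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Alignment in [-1, 1] between momentum and gradient.
    denom = np.linalg.norm(m) * np.linalg.norm(g) + eps
    align = float(np.dot(m.ravel(), g.ravel())) / denom

    # Poor alignment -> higher mask probability (more coordinates dropped),
    # interpolating between p_min (40%) and p_max (99%).
    mask_prob = p_min + (p_max - p_min) * (1.0 - align) / 2.0

    # Keep each coordinate with probability (1 - mask_prob); rescale survivors
    # so the expected update magnitude matches the unmasked step.
    keep = (rng.random(p.shape) > mask_prob).astype(p.dtype)
    scale = 1.0 / max(1.0 - mask_prob, eps)

    p = p - lr * scale * keep * m_hat / (np.sqrt(v_hat) + eps)
    return p, m, v
```

In a multi‑node setting, the same `keep` mask would also determine which gradient coordinates are communicated, which is where the claimed reduction in synchronization overhead comes from.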
Implications
The results suggest that, for large‑model training, strategically selecting which gradients to apply and communicate is more critical than using every available update. This aligns with the broader observation that focused updates can achieve superior convergence compared with indiscriminate full‑gradient updates.
References
https://s-sahoo.com/mdlm/
https://arxiv.org/pdf/2602.15322v1
AI2ML AI to Machine Learning
Original articles on artificial intelligence and machine learning, deep optimization. Less is more, life is simple! Shi Chunqi