Zero‑Overhead Magma Beats Adam and Muon by Dropping Half the Gradients – 19% Perplexity Reduction on 1B‑Scale Models

Magma, a new momentum‑aligned gradient‑masking optimizer from Northwestern University and Google, discards half of the parameter updates at zero extra cost. It achieves up to 19% lower perplexity than Adam and 9% lower than Muon on 1‑billion‑parameter models, and the paper backs the method with theoretical guarantees and extensive empirical validation across heterogeneous loss landscapes.


In current deep‑learning practice, dense optimizers such as Adam dominate because they exploit every available gradient component. A recent study by Northwestern University and Google challenges this assumption by introducing a random mask that drops half of the parameter updates without causing training collapse.

SkipUpdate and the Magma Optimizer

The authors first analyze a variant of RMSProp called SkipUpdate, in which each parameter block is independently masked according to a Bernoulli distribution. When a block is masked, its update is skipped, but the momentum estimate remains dense and the surviving updates are rescaled to keep the estimator unbiased.
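A minimal sketch of this SkipUpdate idea, assuming an RMSProp‑style base rule, whole‑tensor blocks, and 1/p rescaling of surviving updates; the function name, block granularity, and hyper‑parameters are illustrative, not the paper's exact recipe.

```python
import torch

def skipupdate_step(params, grads, state, lr=1e-3, beta=0.99, eps=1e-8, p=0.5):
    """One SkipUpdate-style step (illustrative sketch, not the paper's code).

    The RMSProp second-moment statistics stay dense for every block, but each
    block's parameter update is applied only with probability p and rescaled
    by 1/p so the expected update stays unbiased."""
    for name, param in params.items():
        g = grads[name]
        # Dense statistics: the running second moment is updated for every block.
        v = state.setdefault(name, torch.zeros_like(param))
        v.mul_(beta).addcmul_(g, g, value=1 - beta)
        # Bernoulli mask drawn independently per block (whole tensor here).
        if torch.rand(()).item() < p:
            update = g / (v.sqrt() + eps)
            param.add_(update, alpha=-lr / p)  # 1/p rescaling keeps E[update] unbiased
```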

Building on SkipUpdate, they propose Magma (Momentum‑Aligned Gradient‑Masking). Magma computes an alignment score between the first‑order momentum estimate and the current gradient using cosine similarity (chosen for its scale‑invariance in large‑scale language‑model training). The score is exponentially averaged and used to modulate the mask, so that consistently aligned gradient components are retained while noisy, rapidly fluctuating components are more likely to be dropped.

Magma is a plug‑and‑play wrapper that multiplies the mask into the update direction produced by any existing adaptive optimizer, adding no extra memory or compute overhead.
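A sketch of how such a wrapper could look in PyTorch is shown below: a per‑block cosine alignment between the momentum estimate and the current gradient is tracked with an exponential moving average and mapped to a keep probability, and the resulting Bernoulli mask is multiplied into whatever update the base optimizer produced. The class name, the temperature‑scaled sigmoid mapping, and the exact roles of p and τ are assumptions for illustration, not the published algorithm.

```python
import torch
import torch.nn.functional as F

class MagmaMask:
    """Momentum-aligned masking wrapper (illustrative sketch).

    Keeps an exponential moving average of the cosine similarity between the
    first-order momentum estimate and the current gradient for each parameter
    block, turns that alignment score into a keep probability, and masks the
    update direction produced by the base optimizer."""

    def __init__(self, p=0.5, tau=1.0, ema_decay=0.9):
        self.p = p            # base sampling rate
        self.tau = tau        # temperature of the score-to-probability mapping (assumed form)
        self.ema_decay = ema_decay
        self.align = {}       # per-block EMA of alignment scores

    def mask_update(self, name, grad, momentum, base_update):
        # Cosine alignment between momentum and gradient for this block
        # (scale-invariant, as motivated in the paper).
        score = F.cosine_similarity(momentum.flatten(), grad.flatten(), dim=0)
        prev = self.align.get(name, torch.zeros(()))
        ema = self.ema_decay * prev + (1 - self.ema_decay) * score
        self.align[name] = ema.detach()
        # Consistently aligned blocks are kept more often; noisy ones are dropped.
        keep_prob = (2 * self.p * torch.sigmoid(ema / self.tau)).clamp(1e-3, 1.0)
        mask = (torch.rand(()) < keep_prob).float()
        # Rescale surviving updates so the expected update stays unbiased.
        return base_update * mask / keep_prob
```

In a training loop, mask_update would be applied to the per‑tensor update direction computed by Adam, RMSProp, or Muon immediately before the parameters are modified, which is what makes the wrapper drop‑in rather than a new optimizer.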

Theoretical Insight: Implicit Geometric Regularization

From a classic convergence‑analysis perspective, random masking increases stochastic noise and would seem to weaken worst‑case guarantees. However, the authors show that the mask introduces a curvature‑dependent regularization term: the expected loss under SkipUpdate includes a penalty proportional to the local loss curvature along the update direction. This implicit regularization suppresses updates that align with high‑curvature (steep) directions, smoothing the optimization trajectory and biasing it toward flatter regions of the loss landscape.
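One way to see where such a term can come from is a generic second‑order expansion of the loss under an unbiased Bernoulli mask; the derivation below is a schematic reconstruction under these assumptions, not a formula quoted from the paper.

```latex
% Schematic: masked step \Delta = -\eta\,(m/p) \odot u, with m_i \sim \mathrm{Bernoulli}(p),
% u the dense update direction, and H the local Hessian (assumed notation).
\mathbb{E}\!\left[f(\theta + \Delta)\right]
  \approx f(\theta)
  - \eta\, \nabla f(\theta)^{\top} u
  + \frac{\eta^{2}}{2}\, u^{\top} H\, u
  + \frac{\eta^{2}}{2}\left(\frac{1}{p} - 1\right) \sum_{i} H_{ii}\, u_{i}^{2}
```

The last term vanishes for dense updates (p = 1) and grows with the diagonal curvature along the update direction, which is the sense in which random masking can act as a curvature‑dependent penalty that favors flatter regions.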

The paper also proves convergence to stationary points in the general non‑convex setting. Under smoothness assumptions, each Magma step satisfies a descent lemma, and the authors derive a bound on the effective smoothness constant of each parameter block. Combining this with a lower bound on the effective descent efficiency of the stochastic mask yields a convergence rate for constant learning rates.
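For reference, the per‑step inequality such an analysis typically builds on has the standard smoothness‑based form below, where d_t denotes the masked update direction and L_eff the block‑wise effective smoothness constant; the exact constants and the final rate from the paper are not reproduced here.

```latex
% Generic descent lemma for the update \theta_{t+1} = \theta_t - \eta\, d_t
% (standard form under L_eff-smoothness; constants assumed, not from the paper):
f(\theta_{t+1}) \le f(\theta_t)
  - \eta\, \langle \nabla f(\theta_t),\, d_t \rangle
  + \frac{L_{\mathrm{eff}}\, \eta^{2}}{2}\, \|d_t\|^{2}
```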

Empirical Evaluation

Extensive experiments validate Magma’s effectiveness:

On the C4 pre‑training of Llama‑2 models (60M to 1B parameters), Magma consistently improves validation perplexity across all scales, reducing the 1B‑parameter model's perplexity by 19% versus Adam and 9% versus Muon.

RMSProp alone diverges at the 1B scale, but RMSProp+Magma stabilizes training and achieves the lowest perplexity (13.19), outperforming matrix‑based optimizers (Muon, SOAP) and complex enhancers (APOLLO+SGG).

In a controlled quadratic benchmark with homogeneous and heterogeneous Hessian blocks, Magma matches AdamW on homogeneous problems but converges faster and to a lower final loss on heterogeneous problems, confirming the benefit of the curvature‑aware mask.

When applied to CNNs (ResNet‑50 on CIFAR‑10), Magma does not surpass AdamW, highlighting that its advantage is tied to the highly heterogeneous loss geometry of Transformers.

For sparse Mixture‑of‑Experts (MoE) models on OpenWebText, Magma improves both Adam and Muon, despite the added complexity of dynamic token routing.

Additional analyses include:

Mask‑component ablations showing that masking both attention and MLP blocks yields the lowest perplexity (21.65) compared to masking only attention (21.92).

Mask‑granularity studies indicating that block‑level masks provide the best trade‑off between stability and memory efficiency.

Hyper‑parameter sweeps over sampling rate p and temperature τ demonstrating robustness to these settings.

Learning‑rate sensitivity experiments revealing that Adam+Magma remains stable up to a learning rate of 0.05, far beyond the narrow optimal windows of Adam and C‑Adam.

Conclusion

The work shows that dense gradient updates are not a prerequisite for large‑scale language‑model training. By introducing structured stochasticity through momentum‑aligned masking, Magma achieves implicit geometric regularization, reduces training cost, and improves stability across diverse model architectures and data regimes.
