PaperAgent
Feb 22, 2026 · Artificial Intelligence

How Skipping 50% of Gradient Updates Supercharges LLM Training (SkipUpdate & Magma)

A recent Google-Northwestern study finds that randomly discarding half of the parameter updates during training, a strategy the authors call SkipUpdate, consistently outperforms dense optimizers across Llama models. Its extension, Magma, adds momentum-gradient alignment for further gains, offering zero-overhead, geometry-aware regularization for large-scale LLMs.
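For intuition, here is a minimal, hypothetical sketch of what random update masking of this kind could look like as a PyTorch optimizer; the per-coordinate Bernoulli mask, the keep probability, and the `SkipUpdateSGD` name are assumptions made for illustration, not the paper's implementation.

```python
import torch

class SkipUpdateSGD(torch.optim.Optimizer):
    """Toy SGD variant that randomly zeroes out ~50% of each update.

    Illustrative only: the real SkipUpdate strategy may differ in where
    and how the mask is applied (per coordinate, per tensor, schedule, ...).
    """

    def __init__(self, params, lr=1e-3, keep_prob=0.5):
        super().__init__(params, dict(lr=lr, keep_prob=keep_prob))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr, keep_prob = group["lr"], group["keep_prob"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                # Bernoulli mask: each coordinate's update survives with prob keep_prob.
                mask = (torch.rand_like(p.grad) < keep_prob).to(p.grad.dtype)
                p.add_(mask * p.grad, alpha=-lr)
```

A wrapper like this would be used as a drop-in replacement for any other optimizer, e.g. `opt = SkipUpdateSGD(model.parameters(), lr=3e-4)`, which is consistent with the zero-overhead claim since the mask costs only one random tensor per parameter.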

Magma · Optimization · SkipUpdate
9 min read
Machine Learning Algorithms & Natural Language Processing
Feb 21, 2026 · Artificial Intelligence

Zero‑Overhead Magma Beats Adam and Muon by Dropping Half the Gradients – 19% Perplexity Reduction on 1B‑Scale Models

Magma, a new momentum‑aligned gradient‑masking optimizer from Northwestern University and Google, discards half of the parameter updates at zero extra cost, achieving up to 19% lower perplexity than Adam and 9% lower than Muon on 1‑billion‑parameter models while providing theoretical guarantees and extensive empirical validation across heterogeneous loss landscapes.
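As a rough illustration of the momentum-alignment idea, the sketch below keeps only the coordinates where the current gradient agrees in sign with a momentum estimate and drops the rest; the sign-agreement rule, the `beta` value, and the `magma_like_step` name are assumptions for this example, not the published Magma algorithm.

```python
import torch

def magma_like_step(param, grad, momentum_buf, lr=1e-3, beta=0.9):
    """One illustrative update: apply only the coordinates where the gradient
    is aligned with the momentum buffer, masking out the disagreeing half.

    `momentum_buf` is updated in place; the alignment criterion is a guess
    used for exposition, not taken from the paper.
    """
    with torch.no_grad():
        # Exponential moving average of gradients (standard momentum).
        momentum_buf.mul_(beta).add_(grad, alpha=1 - beta)
        # Alignment mask: 1 where gradient and momentum point the same way.
        aligned = (torch.sign(grad) == torch.sign(momentum_buf)).to(grad.dtype)
        # Apply only the aligned portion of the momentum update.
        param.add_(aligned * momentum_buf, alpha=-lr)
    return momentum_buf
```

Masking on alignment rather than at random is what makes such a scheme geometry-aware: updates that fight the accumulated descent direction are the ones discarded, at essentially no extra compute.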

Magma optimizer · adaptive optimization · gradient masking
11 min read