How Gram‑Newton‑Schulz Halves Muon Optimizer’s Compute Cost for Trillion‑Parameter Models

This article explains how the Gram-Newton-Schulz algorithm accelerates the Muon optimizer's expensive Newton-Schulz orthogonalization: it cuts end-to-end orthogonalization time by 40-50%, achieves up to a 2× speed-up in large-scale LLM training, and resolves numerical-stability issues through a restart strategy and custom GPU kernels.

Background and Motivation

In the race to train trillion‑parameter language models, even small differences in training efficiency translate into massive compute costs. The Muon optimizer, adopted by models such as Kimi K2 and GLM‑5, requires fewer optimizer steps than AdamW to reach a target loss, but each step incurs significantly higher computational overhead due to a Newton‑Schulz orthogonalization process that introduces cubic‑time matrix operations.

Why Standard Algorithms Are Slow

Traditional optimizers like AdamW perform element-wise updates whose cost is linear in the number of parameters. In contrast, Muon must repeatedly orthogonalize momentum matrices, a step that dominates per-step runtime because the Newton-Schulz iteration multiplies large rectangular matrices many times.
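
For reference, here is a condensed sketch of the rectangular iteration Muon uses today, following the coefficients in the widely circulated public Muon implementation (production variants may differ in detail):

```python
import torch

def newton_schulz(M: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration, condensed from the public Muon
    # reference code. Each step multiplies large rectangular matrices,
    # which is what dominates Muon's per-step runtime.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + 1e-7)          # scale so singular values are <= 1
    transposed = M.size(0) > M.size(1)
    if transposed:
        X = X.T                        # keep X @ X.T as small as possible
    for _ in range(steps):
        A = X @ X.T                    # the cubic-cost matrix products
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X
```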

Gram‑Newton‑Schulz Algorithm

The Gram Newton-Schulz algorithm was proposed by researchers at Princeton (Tri Dao's group, also known for Mamba and FlashAttention) and NYU. Instead of iterating on the huge rectangular input, the method transfers the iteration to a much smaller symmetric Gram matrix, dramatically cutting the number of floating-point operations. The core iteration can be expressed in scalar form and then lifted back to matrix space, preserving the singular vectors while operating on a symmetric matrix.
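
A minimal sketch of that lifting argument, in our own notation rather than the paper's, with the SVD X = UΣVᵀ:

```latex
\mathrm{orth}(X) = U V^{\top} = X \,(X^{\top} X)^{-1/2},
\qquad
G := X^{\top} X = V \Sigma^{2} V^{\top}.
% A polynomial iteration G_{k+1} = p(G_k) transforms only the
% eigenvalues \sigma_i^2 of G, i.e. it reduces to the scalar
% recurrence g_{k+1} = p(g_k), while the singular vectors V
% (and hence U) are left untouched.
```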

Mathematical Reconstruction

Orthogonalizing a matrix X amounts to computing X(XᵀX)^{-1/2}, and the Newton-Schulz polynomial implicitly computes that inverse square root. By moving the iteration onto the square Gram matrix G = XᵀX, the reformulated method avoids the costly rectangular multiplications that dominate the original and reduces the theoretical floating-point count by about 68% in typical MoE training scenarios.
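
To make the structure concrete, here is a minimal illustrative sketch in PyTorch. It uses the classical coupled Newton-Schulz iteration for the inverse matrix square root rather than the paper's tuned polynomial, so it shows the shape of the computation, not the authors' exact coefficients:

```python
import torch

def gram_newton_schulz(X: torch.Tensor, steps: int = 8) -> torch.Tensor:
    # Illustrative sketch: orth(X) = X (X^T X)^{-1/2}, approximated by
    # the classical coupled Newton-Schulz inverse-square-root iteration
    # on the small n x n Gram matrix. Assumes X has full column rank.
    if X.size(0) < X.size(1):
        return gram_newton_schulz(X.T, steps).T
    n = X.size(1)
    G = X.T @ X                            # n x n symmetric Gram matrix
    s = G.norm() + 1e-7
    Y = G / s                              # normalize for convergence
    Z = torch.eye(n, dtype=X.dtype, device=X.device)
    I3 = 3.0 * torch.eye(n, dtype=X.dtype, device=X.device)
    for _ in range(steps):
        T = 0.5 * (I3 - Z @ Y)
        Y = Y @ T                          # Y -> (G / s)^{1/2}
        Z = T @ Z                          # Z -> (G / s)^{-1/2}
    return (X @ Z) / s.sqrt()              # X G^{-1/2} = X Z / sqrt(s)
```

Note how the rectangular O(m·n²) work now appears only twice, when forming G and in the final multiply; every iteration step touches only the n×n Gram matrix.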

Numerical Stability Fixes

In half‑precision (bfloat16) environments, computing the Gram matrix can produce spurious negative eigenvalues that explode due to the squared terms in Muon’s update rule. The authors introduce a restart strategy: midway through the iteration they reconstruct the Gram matrix from the current approximation, eliminating accumulated negative eigenvalues and resetting the state.
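
Schematically, reusing the gram_newton_schulz sketch above (the real implementation restarts inside a single fused loop, and the split point is the method's one hyper-parameter):

```python
def gram_ns_with_restart(X: torch.Tensor, steps: int = 8,
                         restart_at: int = 4) -> torch.Tensor:
    # Run the first phase, then rebuild the Gram matrix from the partially
    # orthogonalized iterate. Recomputing G = Xk.T @ Xk yields a matrix
    # that is positive semi-definite up to round-off, discarding the
    # spurious negative eigenvalues accumulated in bfloat16.
    Xk = gram_newton_schulz(X, steps=restart_at)
    return gram_newton_schulz(Xk, steps=steps - restart_at)
```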

Algebraic Re‑ordering and Precision Management

To prevent hidden precision loss, the addition of the identity matrix is fused into subsequent computations, keeping the critical accumulations in float32 before casting back to half precision where safe. This eliminates the truncation-induced errors that previously degraded accuracy.
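
As an illustration of the reordering, again against the sketch iteration above rather than the production kernels: instead of materializing 3I − ZY and feeding the truncated result onward, the identity term can be folded into the surrounding multiplies while the accumulation stays in float32:

```python
import torch

def fused_step(Y: torch.Tensor, Z: torch.Tensor):
    # Naive form:  T = 0.5 * (3I - Z @ Y);  Y <- Y @ T;  Z <- T @ Z
    # Fused form: fold the identity into the matmuls so the subtraction
    # never produces a truncated low-precision intermediate.
    Yf, Zf = Y.float(), Z.float()
    ZY = Zf @ Yf
    Y_next = 1.5 * Yf - 0.5 * (Yf @ ZY)   # = Y @ (0.5 * (3I - Z @ Y))
    Z_next = 1.5 * Zf - 0.5 * (ZY @ Zf)   # = (0.5 * (3I - Z @ Y)) @ Z
    return Y_next.to(Y.dtype), Z_next.to(Z.dtype)
```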

Weight Splitting Strategy

The two projection matrices feeding the SwiGLU activation are orthogonalized separately. This split reduces the validation perplexity of Llama-430M by ~0.2 and, because each half has a smaller dimension, shrinks the Gram matrix and yields a larger speed-up for the Gram Newton-Schulz step.
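
A sketch of the split, assuming the common layout in which the gate and up projections are fused row-wise into a single weight matrix (the actual layout is model-dependent):

```python
import torch

def orthogonalize_swiglu(W_fused: torch.Tensor) -> torch.Tensor:
    # Assumed layout: W_fused stacks [W_gate; W_up] along dim 0.
    # Orthogonalizing the halves separately keeps each Gram matrix
    # smaller and, per the article, also improves perplexity.
    W_gate, W_up = W_fused.chunk(2, dim=0)
    return torch.cat([gram_newton_schulz(W_gate),
                      gram_newton_schulz(W_up)], dim=0)
```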

Custom Triangular Scheduler and GPU Kernels

Leveraging CuTeDSL, the team built custom kernels for Hopper and Blackwell GPUs that exploit the symmetry of the Gram matrix. A triangular scheduler assigns work only to the lower‑triangular blocks, mirrors results to the upper triangle, and balances memory traffic, eliminating redundant half‑matrix computations.
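
The kernels themselves are written in CuTeDSL; as a shape-level illustration of the scheduling idea only, this toy PyTorch version computes just the lower-triangular tiles of G = XᵀX and mirrors them:

```python
import torch

def gram_lower_triangular(X: torch.Tensor, block: int = 256) -> torch.Tensor:
    # Toy model of the triangular scheduler: G = X^T X is symmetric, so
    # only lower-triangular tiles need computing; the rest are mirrored.
    # The real CuTeDSL kernels apply the same idea on-chip, roughly
    # halving the matmul work and memory traffic.
    n = X.size(1)
    G = X.new_empty(n, n)
    for i in range(0, n, block):
        bi = slice(i, min(i + block, n))
        for j in range(0, i + 1, block):
            bj = slice(j, min(j + block, n))
            tile = X[:, bi].T @ X[:, bj]   # one lower-triangular tile
            G[bi, bj] = tile
            if i != j:
                G[bj, bi] = tile.T         # mirror into the upper triangle
    return G
```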

Experimental Validation and Engineering Gains

Across Llama‑430M, Qwen‑600M, Gemma‑1B, and a 1‑billion‑parameter MoE model, the Gram Newton‑Schulz variant keeps validation perplexity within 0.01 of the original Muon. In simulated Kimi K2 pipeline stages, it delivers a 2× end‑to‑end orthogonalization speed‑up. Real‑world training shows a 40‑50% reduction in orthogonalization wall‑time, translating to substantial overall training savings.

Conclusion

Gram Newton-Schulz combines a mathematically rigorous reformulation, numerical-stability fixes, and architecture-specific kernel optimizations to break the compute bottleneck of matrix orthogonalization in modern optimizers. The open-source implementation exposes only a single hyper-parameter (the iteration step at which the restart is triggered) and includes an automated tuning script, offering a practical solution for compute-constrained large-model training.

Tags: large language models, Muon optimizer, GPU kernels, Gram Newton-Schulz, numerical stability, orthogonalization