Deep Learning Optimizers Demystified: Momentum, AdaGrad, RMSProp & Adam Explained
This article breaks down the core deep‑learning optimizers—gradient descent, Momentum, AdaGrad, RMSProp and Adam—showing why vanilla gradient descent converges slowly, how each method uses exponential moving averages to accelerate training, and why Adam is generally the preferred choice.
Deep learning has driven breakthroughs in AI, but neural networks often contain millions or billions of trainable parameters, making accelerated training essential. Back‑propagation updates weights via gradient descent, which, in its vanilla form, converges slowly and is rarely optimal for deep models.
Gradient descent computes the loss gradient and updates parameters directly. A two‑dimensional “canyon” example illustrates that without storing historical gradient information the optimizer oscillates along the steep direction, making large learning rates risky and convergence sluggish.
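To make the oscillation concrete, here is a minimal NumPy sketch of vanilla gradient descent on a hypothetical elongated quadratic "canyon" loss (the loss and its gradient are illustrative assumptions, not taken from the article):

```python
import numpy as np

# Hypothetical canyon loss: very steep along w[1], very shallow along w[0].
def grad(w):
    return np.array([0.1 * w[0], 10.0 * w[1]])

w = np.array([5.0, 1.0])
alpha = 0.19  # large enough to make progress on the shallow axis, but it overshoots the steep one

for step in range(100):
    w = w - alpha * grad(w)  # vanilla update: w <- w - alpha * dL/dw
```

With this learning rate the steep coordinate flips sign every step (oscillation) while the shallow coordinate shrinks by only ~2% per step, illustrating why a single global learning rate is hard to tune.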
Momentum addresses this by taking larger steps along the shallow (horizontal) direction and smaller corrections along the steep (vertical) one. It uses two update equations: one maintains an exponential moving average (EMA) of past gradients, and the other applies that average to the weights, scaled by the learning rate α. The EMA captures the trend of past gradients. As Sebastian Ruder notes, the momentum term amplifies updates in directions whose gradients stay consistent and dampens them where gradients fluctuate, speeding convergence and reducing oscillation. A typical setting is β≈0.9.
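A minimal sketch of one common EMA formulation of Momentum, reusing the same hypothetical canyon gradient as above (the loss is an assumption for illustration):

```python
import numpy as np

def grad(w):
    return np.array([0.1 * w[0], 10.0 * w[1]])  # hypothetical canyon gradient

w = np.array([5.0, 1.0])
v = np.zeros_like(w)
alpha, beta = 0.1, 0.9  # beta ~ 0.9 as cited in the text

for step in range(100):
    g = grad(w)
    v = beta * v + (1 - beta) * g  # EMA of gradients: consistent directions accumulate
    w = w - alpha * v              # update scaled by the learning rate
```

Because the oscillating steep-direction gradients largely cancel inside the EMA while the shallow-direction gradients reinforce each other, the effective step is large where it should be and small where it shouldn't.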
AdaGrad adapts the learning rate per weight based on the magnitude of its gradients. By accumulating the squared gradients (∑dw²) and scaling the base learning rate α by the inverse square root of this sum (plus a small ε), AdaGrad automatically reduces the step size for frequently large gradients and increases it for small ones, mitigating exploding or vanishing gradients. However, because the denominator grows monotonically, the effective learning rate decays continuously, slowing later‑stage training.
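A minimal sketch of the AdaGrad update under the same illustrative gradient (assumed, not from the article):

```python
import numpy as np

def grad(w):
    return np.array([0.1 * w[0], 10.0 * w[1]])  # hypothetical loss gradient

w = np.array([5.0, 1.0])
s = np.zeros_like(w)          # per-weight sum of squared gradients
alpha, eps = 0.5, 1e-8

for step in range(100):
    g = grad(w)
    s += g ** 2                               # monotonically growing accumulator
    w = w - alpha * g / (np.sqrt(s) + eps)    # per-weight step shrinks as s grows
```

Note that `s` never decreases, so the effective learning rate α/√s can only decay, which is exactly the late-training slowdown described above.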
RMSProp improves on AdaGrad by replacing the full sum of squared gradients with an EMA of the squares, which prevents the learning rate from decaying indefinitely. The update rule mirrors AdaGrad's, but vₜ is computed as an EMA, so the optimizer adapts quickly to recent gradient behavior. Experiments show faster convergence than AdaGrad, and a β close to 1 (0.9 is a common choice) is recommended. The article also discusses a naïve variant that uses only the sign of the gradient; it is highly sensitive to α and discards magnitude information, limiting its usefulness to severely memory‑constrained settings.
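A minimal RMSProp sketch, differing from the AdaGrad code above only in how the squared-gradient statistic is maintained (same assumed illustrative gradient):

```python
import numpy as np

def grad(w):
    return np.array([0.1 * w[0], 10.0 * w[1]])  # hypothetical loss gradient

w = np.array([5.0, 1.0])
v = np.zeros_like(w)
alpha, beta, eps = 0.05, 0.9, 1e-8   # beta close to 1; 0.9 is a common choice

for step in range(100):
    g = grad(w)
    v = beta * v + (1 - beta) * g ** 2        # EMA of squared gradients, not a full sum
    w = w - alpha * g / (np.sqrt(v) + eps)    # effective step tracks recent gradient scale
```

Because old squared gradients are exponentially forgotten, the denominator can shrink again when gradients become small, so training does not stall the way AdaGrad can.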
Adam combines Momentum and RMSProp by maintaining EMA of both gradients and squared gradients, with bias‑correction in the early iterations. This hybrid approach works well across diverse network architectures. The default hyper‑parameters from the original Adam paper are β₁=0.9, β₂=0.999, and ε=1e‑8, offering strong performance with minimal tuning.
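A minimal Adam sketch combining the two EMAs with bias correction, using the paper's default hyper‑parameters and the same assumed illustrative gradient:

```python
import numpy as np

def grad(w):
    return np.array([0.1 * w[0], 10.0 * w[1]])  # hypothetical loss gradient

w = np.array([5.0, 1.0])
m = np.zeros_like(w)   # EMA of gradients (Momentum part)
v = np.zeros_like(w)   # EMA of squared gradients (RMSProp part)
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8  # defaults from the Adam paper

for t in range(1, 101):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction: EMAs start at zero,
    v_hat = v / (1 - beta2 ** t)   # so early estimates are rescaled upward
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
```

The bias correction matters most in the first iterations, when m and v are still dominated by their zero initialization; afterwards the correction factors approach 1.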
In summary, Momentum and RMSProp each address specific shortcomings of vanilla gradient descent, while Adam's blend of both techniques makes it the most versatile default optimizer for most deep‑learning tasks, with only modest extra memory and compute per parameter.
