Why SGD Fails and How Momentum, AdaGrad, and Adam Fix It
This article explains why vanilla Stochastic Gradient Descent often struggles in deep learning, describes the challenges of valleys and saddle points, and introduces three major SGD variants—Momentum, AdaGrad, and Adam—detailing their motivations, update rules, and advantages.
Scenario Description
When discussing optimization methods in deep learning, Stochastic Gradient Descent (SGD) is the first method that comes to mind, but it is not a universal solution: applied naively, it can yield poor training results.
Problem Description
Why does SGD sometimes fail to provide satisfactory results, and what modifications have researchers proposed to improve it?
Answer and Analysis
(1) Reasons SGD Fails – "Feeling the Ground While Blindfolded"
Imagine walking down a mountain with your eyes open; you can see the slope and head straight for the bottom. If you are blindfolded and must rely on feeling the ground, your perception of the slope is much less accurate, leading to wrong directions or wandering paths. Traditional Gradient Descent (GD) loads the entire dataset each step to compute an exact gradient, which is accurate but costly. SGD discards this accuracy by sampling a few examples per step, making each update fast and memory‑efficient but noisy, causing unstable convergence, oscillations, and sometimes divergence. Figure 1 illustrates the parameter trajectories of GD (smooth descent) versus SGD (erratic path).
Beyond noise, SGD is especially vulnerable to two landscape features: narrow valleys and saddle points. In a valley, the true gradient points along the valley floor; a noisy SGD step can bounce between the steep walls, slowing progress. At a saddle point (a flat plateau), the gradient magnitude is near zero, so the noisy estimate may fail to detect any descent direction, causing the optimizer to stall.
(2) Solutions – Inertia Preservation and Environment Awareness
SGD updates parameters iteratively:
θ_{t+1} = θ_t − η·g_t

where g_t is the estimated (minibatch) gradient at step t and η is the learning rate. All SGD variants keep this basic form but modify how the step direction and size are computed.
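The noisy-update behavior can be sketched in a few lines of NumPy on a toy quadratic loss. The loss matrix, noise level, and learning rate below are illustrative assumptions; Gaussian noise stands in for the sampling noise of a minibatch gradient:

```python
import numpy as np

# Toy quadratic loss L(θ) = 0.5·θᵀAθ; A is an illustrative assumption.
A = np.diag([10.0, 1.0])  # ill-conditioned: one steep, one shallow direction

def grad(theta, rng=None, noise=0.0):
    """Exact gradient A·θ, plus optional Gaussian noise mimicking minibatch sampling."""
    g = A @ theta
    if rng is not None:
        g = g + noise * rng.standard_normal(theta.shape)
    return g

def sgd_step(theta, g, lr):
    """One SGD update: θ_{t+1} = θ_t − η·g_t."""
    return theta - lr * g

rng = np.random.default_rng(0)
theta = np.array([1.0, 1.0])
for _ in range(100):
    theta = sgd_step(theta, grad(theta, rng, noise=0.5), lr=0.05)
```

With noise=0.0 the iterates descend smoothly; with noise the trajectory jitters around the minimum rather than settling on it, which is the erratic path described above.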
Variant 1: Momentum
Momentum adds an inertia term that accumulates past updates, reducing oscillations in valleys and helping escape flat regions. The update becomes:
v_t = γ·v_{t-1} + η·g_t
θ_{t+1} = θ_t − v_t

The inertia term reuses the previous velocity v_{t-1}, analogous to physical momentum: update components along a consistent direction accumulate while oscillating components cancel, which smooths the trajectory and speeds up convergence. The decay factor γ (typically around 0.9) controls how much of the past velocity is retained.
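The two-line update translates directly into code. Here is a minimal NumPy sketch on an ill-conditioned quadratic (a narrow valley); the matrix and hyperparameters are illustrative assumptions:

```python
import numpy as np

# Illustrative quadratic loss 0.5·θᵀAθ with a steep wall and a shallow valley floor.
A = np.diag([10.0, 1.0])

def grad(theta):
    return A @ theta  # exact gradient of the quadratic

def momentum_step(theta, v, g, lr=0.02, gamma=0.9):
    v = gamma * v + lr * g   # v_t = γ·v_{t-1} + η·g_t
    return theta - v, v      # θ_{t+1} = θ_t − v_t

theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    theta, v = momentum_step(theta, v, grad(theta))
```

Because the velocity averages over recent gradients, the side-to-side components across the valley walls largely cancel while progress along the valley floor compounds.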
Variant 2: AdaGrad
AdaGrad adapts the learning rate for each parameter based on the historical sum of squared gradients, which is useful when some parameters receive sparse updates (e.g., rare words in word embeddings). The update rule is:

G_t = G_{t-1} + g_t ⊙ g_t
θ_{t+1} = θ_t − η·g_t / (√G_t + ε)

where G_t accumulates the element-wise squared gradients and ε is a small constant for numerical stability. The denominator grows with the square root of the accumulated squared gradients, so the effective learning rate shrinks over time: frequently updated parameters get smaller steps while rarely updated ones keep larger steps, which helps stabilize training on sparse data.
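The shrinking effective step size is easy to observe in code. In this NumPy sketch (hyperparameters are illustrative assumptions), a constant gradient of 1 produces steps of η/√t at step t:

```python
import numpy as np

def adagrad_step(theta, G, g, lr=0.5, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, then scale the step."""
    G = G + g * g                              # per-parameter history of g²
    theta = theta - lr * g / (np.sqrt(G) + eps)
    return theta, G

theta, G = np.array([0.0]), np.zeros(1)
steps = []
for _ in range(3):
    g = np.array([1.0])
    new_theta, G = adagrad_step(theta, G, g)
    steps.append(float(theta[0] - new_theta[0]))  # size of this update
    theta = new_theta
# steps shrink roughly as 0.5, 0.5/√2, 0.5/√3
```

A parameter whose gradient is usually zero would accumulate a small G and therefore keep a comparatively large effective learning rate, which is exactly the sparse-feature behavior described above.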
Variant 3: Adam
Adam combines the benefits of Momentum (first-moment averaging) and AdaGrad (second-moment averaging). It keeps running averages of the gradients and of the squared gradients, applying bias correction to each:

m_t = β₁·m_{t-1} + (1 − β₁)·g_t
v_t = β₂·v_{t-1} + (1 − β₂)·g_t ⊙ g_t
m̂_t = m_t / (1 − β₁ᵗ),  v̂_t = v_t / (1 − β₂ᵗ)
θ_{t+1} = θ_t − η·m̂_t / (√v̂_t + ε)

With decay coefficients β₁ and β₂ (commonly 0.9 and 0.999), the bias correction compensates for initializing m and v at zero, which would otherwise bias early estimates toward zero. The result is per-parameter adaptive learning rates that work well across a wide range of deep learning tasks.
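A minimal NumPy sketch of the Adam step, using the common default coefficients; the toy quadratic loss and step count are illustrative assumptions:

```python
import numpy as np

def adam_step(theta, m, v, g, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (t starts at 1)."""
    m = b1 * m + (1 - b1) * g          # first moment: momentum-like average
    v = b2 * v + (1 - b2) * g * g      # second moment: AdaGrad-like average
    m_hat = m / (1 - b1 ** t)          # bias correction for zero initialization
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Illustrative quadratic loss 0.5·θᵀAθ, as in the earlier sketches.
A = np.diag([10.0, 1.0])
theta = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 301):
    theta, m, v = adam_step(theta, m, v, A @ theta, t)
```

Note that the step size is roughly bounded by the learning rate regardless of the raw gradient scale, since m̂ and √v̂ grow together; this is one reason Adam is relatively insensitive to gradient magnitudes.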
Further Reading
Other notable SGD variants include Nesterov Accelerated Gradient, AdaDelta, RMSProp, AdaMax, and Nadam, each extending the ideas of inertia or adaptive learning rates.
References
[1] Y. N. Dauphin et al., "Identifying and attacking the saddle point problem in high‑dimensional nonconvex optimization," arXiv, 2014.
[2] N. Qian, "On the momentum term in gradient descent learning algorithms," Neural Networks, 1999.
[3] J. Duchi, E. Hazan, Y. Singer, "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization," JMLR, 2011.
[4] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," ICLR, 2015.
Hulu Beijing