Why SGD Fails and How Momentum, AdaGrad, and Adam Fix It
This article explains why vanilla Stochastic Gradient Descent often struggles in deep learning, describes the challenges of valleys and saddle points, and introduces three major SGD variants—Momentum, AdaGrad, and Adam—detailing their motivations, update rules, and advantages.
Scenario Description
When discussing optimization methods in deep learning, Stochastic Gradient Descent (SGD) is the first method that comes to mind, but it is not a universal solution: applied naively, it can yield poor training results.
Problem Description
Why does SGD sometimes fail to provide satisfactory results, and what modifications have researchers proposed to improve it?
Answer and Analysis
(1) Reasons SGD Fails – "Feeling the Ground While Blindfolded"
Imagine walking down a mountain with your eyes open; you can see the slope and head straight for the bottom. If you are blindfolded and must rely on feeling the ground, your perception of the slope is much less accurate, leading to wrong directions or wandering paths. Traditional Gradient Descent (GD) loads the entire dataset each step to compute an exact gradient, which is accurate but costly. SGD discards this accuracy by sampling a few examples per step, making each update fast and memory‑efficient but noisy, causing unstable convergence, oscillations, and sometimes divergence. Figure 1 illustrates the parameter trajectories of GD (smooth descent) versus SGD (erratic path).
Beyond noise, SGD is especially vulnerable to two landscape features: narrow valleys and saddle points. In a valley, the true gradient points along the valley floor; a noisy SGD step can bounce between the steep walls, slowing progress. At a saddle point (a flat plateau), the gradient magnitude is near zero, so the noisy estimate may fail to detect any descent direction, causing the optimizer to stall.
(2) Solutions – Inertia Preservation and Environment Awareness
SGD updates parameters iteratively:
θ_{t+1} = θ_t − η·g_t

where g_t is the estimated (minibatch) gradient at step t and η is the learning rate. All SGD variants keep this basic form but modify how the step direction and size are computed.
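The noisy-update behavior can be sketched in a few lines of NumPy on a toy quadratic loss. The loss matrix, noise level, and learning rate below are illustrative assumptions; Gaussian noise stands in for the sampling noise of a minibatch gradient:

```python
import numpy as np

# Toy quadratic loss L(θ) = 0.5·θᵀAθ; A is an illustrative assumption.
A = np.diag([10.0, 1.0])  # ill-conditioned: one steep, one shallow direction

def grad(theta, rng=None, noise=0.0):
    """Exact gradient A·θ, plus optional Gaussian noise mimicking minibatch sampling."""
    g = A @ theta
    if rng is not None:
        g = g + noise * rng.standard_normal(theta.shape)
    return g

def sgd_step(theta, g, lr):
    """One SGD update: θ_{t+1} = θ_t − η·g_t."""
    return theta - lr * g

rng = np.random.default_rng(0)
theta = np.array([1.0, 1.0])
for _ in range(100):
    theta = sgd_step(theta, grad(theta, rng, noise=0.5), lr=0.05)
```

With noise=0.0 the iterates descend smoothly; with noise the trajectory jitters around the minimum rather than settling on it, which is the erratic path described above.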
Variant 1: Momentum
Momentum adds an inertia term that accumulates past updates, reducing oscillations in valleys and helping escape flat regions. The update becomes:
v_t = γ·v_{t-1} + η·g_t
θ_{t+1} = θ_t − v_t

The inertia term reuses the previous velocity v_{t-1}, analogous to physical momentum: update components along a consistent direction accumulate while oscillating components cancel, which smooths the trajectory and speeds up convergence. The decay factor γ (typically around 0.9) controls how much of the past velocity is retained.
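The two-line update translates directly into code. Here is a minimal NumPy sketch on an ill-conditioned quadratic (a narrow valley); the matrix and hyperparameters are illustrative assumptions:

```python
import numpy as np

# Illustrative quadratic loss 0.5·θᵀAθ with a steep wall and a shallow valley floor.
A = np.diag([10.0, 1.0])

def grad(theta):
    return A @ theta  # exact gradient of the quadratic

def momentum_step(theta, v, g, lr=0.02, gamma=0.9):
    v = gamma * v + lr * g   # v_t = γ·v_{t-1} + η·g_t
    return theta - v, v      # θ_{t+1} = θ_t − v_t

theta, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    theta, v = momentum_step(theta, v, grad(theta))
```

Because the velocity averages over recent gradients, the side-to-side components across the valley walls largely cancel while progress along the valley floor compounds.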
Variant 2: AdaGrad
AdaGrad adapts the learning rate for each parameter based on the historical sum of squared gradients, which is useful when some parameters receive sparse updates (e.g., rare words in word embeddings). The update rule is:

G_t = G_{t-1} + g_t ⊙ g_t
θ_{t+1} = θ_t − η·g_t / (√G_t + ε)

where G_t accumulates the element-wise squared gradients and ε is a small constant for numerical stability. The denominator grows with the square root of the accumulated squared gradients, so the effective learning rate shrinks over time: frequently updated parameters get smaller steps while rarely updated ones keep larger steps, which helps stabilize training on sparse data.
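The shrinking effective step size is easy to observe in code. In this NumPy sketch (hyperparameters are illustrative assumptions), a constant gradient of 1 produces steps of η/√t at step t:

```python
import numpy as np

def adagrad_step(theta, G, g, lr=0.5, eps=1e-8):
    """One AdaGrad update: accumulate squared gradients, then scale the step."""
    G = G + g * g                              # per-parameter history of g²
    theta = theta - lr * g / (np.sqrt(G) + eps)
    return theta, G

theta, G = np.array([0.0]), np.zeros(1)
steps = []
for _ in range(3):
    g = np.array([1.0])
    new_theta, G = adagrad_step(theta, G, g)
    steps.append(float(theta[0] - new_theta[0]))  # size of this update
    theta = new_theta
# steps shrink roughly as 0.5, 0.5/√2, 0.5/√3
```

A parameter whose gradient is usually zero would accumulate a small G and therefore keep a comparatively large effective learning rate, which is exactly the sparse-feature behavior described above.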
Variant 3: Adam
Adam combines the benefits of Momentum (first-moment averaging) and AdaGrad (second-moment averaging). It keeps running averages of the gradients and of the squared gradients, applying bias correction to each:

m_t = β₁·m_{t-1} + (1 − β₁)·g_t
v_t = β₂·v_{t-1} + (1 − β₂)·g_t ⊙ g_t
m̂_t = m_t / (1 − β₁ᵗ),  v̂_t = v_t / (1 − β₂ᵗ)
θ_{t+1} = θ_t − η·m̂_t / (√v̂_t + ε)

With decay coefficients β₁ and β₂ (commonly 0.9 and 0.999), the bias correction compensates for initializing m and v at zero, which would otherwise bias early estimates toward zero. The result is per-parameter adaptive learning rates that work well across a wide range of deep learning tasks.
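A minimal NumPy sketch of the Adam step, using the common default coefficients; the toy quadratic loss and step count are illustrative assumptions:

```python
import numpy as np

def adam_step(theta, m, v, g, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (t starts at 1)."""
    m = b1 * m + (1 - b1) * g          # first moment: momentum-like average
    v = b2 * v + (1 - b2) * g * g      # second moment: AdaGrad-like average
    m_hat = m / (1 - b1 ** t)          # bias correction for zero initialization
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Illustrative quadratic loss 0.5·θᵀAθ, as in the earlier sketches.
A = np.diag([10.0, 1.0])
theta = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 301):
    theta, m, v = adam_step(theta, m, v, A @ theta, t)
```

Note that the step size is roughly bounded by the learning rate regardless of the raw gradient scale, since m̂ and √v̂ grow together; this is one reason Adam is relatively insensitive to gradient magnitudes.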
Further Reading
Other notable SGD variants include Nesterov Accelerated Gradient, AdaDelta, RMSProp, AdaMax, and Nadam, each extending the ideas of inertia or adaptive learning rates.
References
[1] Y. N. Dauphin et al., "Identifying and attacking the saddle point problem in high‑dimensional nonconvex optimization," arXiv, 2014.
[2] N. Qian, "On the momentum term in gradient descent learning algorithms," Neural Networks, 1999.
[3] J. Duchi, E. Hazan, Y. Singer, "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization," JMLR, 2011.
[4] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," ICLR, 2015.
Hulu Beijing