Why So Many Optimizers? Core Algorithms Behind Neural Network Training
This article explains the fundamental gradient‑descent optimizers used in neural networks—SGD, Momentum, RMSProp, Adam and their variants—illustrates loss‑surface challenges such as local minima, saddle points and ravines, and shows how techniques like mini‑batching, momentum, adaptive learning rates and scheduling address these issues.
Introduction
Optimizers are a key component of neural‑network training: they determine how the model's parameters are updated to minimize loss. Most deep‑learning frameworks, such as PyTorch and Keras, provide built‑in gradient‑descent optimizers like SGD, Momentum, RMSProp, and Adam.
Loss‑Surface Visualization
Consider a simple network with only two weight parameters, w1 and w2, and plot the loss over them: the horizontal plane holds the two weight axes, while the vertical axis shows the loss value for each weight combination. Gradient descent traces a trajectory across this surface as it moves toward lower loss.
Steps:
Randomly initialize the two weights and compute the loss.
At each iteration update the weights using the gradient and a learning‑rate factor, moving to a lower‑loss region.
Continue until reaching the bottom of the loss surface.
The gradient is the slope of the loss surface: for each weight it is the partial derivative of the loss with respect to that weight, i.e., the change in loss (dL) divided by the change in that weight (dW).
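A minimal sketch of this loop in Python, assuming an illustrative two‑weight quadratic loss (the functions loss_fn and grad_fn are stand‑ins for the surface in the figure, not part of the original article):

```python
import numpy as np

def loss_fn(w):
    # Toy convex loss surface over two weights w = (w1, w2); illustrative only.
    return w[0] ** 2 + 0.5 * w[1] ** 2

def grad_fn(w):
    # Analytic gradient dL/dW of the toy loss above.
    return np.array([2.0 * w[0], 1.0 * w[1]])

w = np.random.randn(2)      # step 1: randomly initialize the two weights
lr = 0.1                    # learning-rate factor
for step in range(100):
    g = grad_fn(w)          # slope of the loss surface at the current weights
    w = w - lr * g          # step 2: move toward a lower-loss region
print(w, loss_fn(w))        # step 3: near the bottom of the surface
```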
Real‑World Challenges of Gradient Descent
In practice the loss surface is far from the smooth convex shape shown earlier; it can be highly rugged, involve millions of parameters, and present several difficulties:
Local minima: Gradient descent can become trapped in a local minimum and fail to reach the global optimum.
Saddle points: Points that are minima along one dimension but maxima along another, often surrounded by flat regions where gradients vanish.
Ravines (pathological curvature): Narrow valleys where the curvature is steep in one direction and shallow in the orthogonal direction, causing oscillations.
Improvements Over Plain Gradient Descent
1. Stochastic Gradient Descent (SGD)
Full‑batch gradient descent computes gradients over the entire dataset each step. Mini‑batch SGD randomly selects a subset of samples for each iteration, introducing stochasticity that helps the optimizer escape flat or trapped regions and explore the loss landscape.
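A sketch of mini‑batch SGD on a toy linear‑regression problem (the dataset, batch size, and mean‑squared‑error gradient are illustrative assumptions, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                  # toy dataset: 1000 samples, 3 features
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(3)
lr, batch_size = 0.05, 32
for epoch in range(20):
    idx = rng.permutation(len(X))               # shuffle, then draw random mini-batches
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        # Gradient of the MSE loss computed on this mini-batch only,
        # which is what injects the stochasticity into each update.
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(batch)
        w -= lr * grad
print(w)
```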
2. Momentum
Momentum adds an exponential moving average of past gradients to the update, allowing the optimizer to maintain direction across iterations. This mitigates oscillations in steep directions and helps traverse ravines more smoothly.
Two common momentum‑based algorithms are listed below; a minimal sketch of the plain momentum update follows the list.
Momentum SGD
Nesterov Accelerated Gradient
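A minimal sketch of the plain momentum update rule (the variable names and hyperparameter values here are illustrative, not taken from any particular framework):

```python
import numpy as np

def momentum_step(w, velocity, grad, lr=0.01, beta=0.9):
    # Exponential moving average of past gradients (the "velocity").
    velocity = beta * velocity + grad
    # Weights move along the accumulated direction, damping oscillations
    # across the steep walls of a ravine.
    w = w - lr * velocity
    return w, velocity
```

In the Nesterov variant, the gradient is evaluated at a lookahead point (roughly w - lr * beta * velocity) instead of at the current weights, which lets the optimizer correct its course slightly earlier.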
3. Adaptive Learning‑Rate Methods
Because different parameters can have gradients of vastly different magnitudes, algorithms such as Adagrad, Adadelta and RMSProp adjust the learning rate for each parameter based on the history of its squared gradients.
Adagrad accumulates the sum of squared gradients, while RMSProp uses an exponential moving average, giving more weight to recent gradients. These adjustments slow down updates for parameters with consistently large gradients and speed up updates for parameters with small gradients.
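A sketch of a single per‑parameter RMSProp step, with a comment noting how Adagrad differs (names and default values are illustrative assumptions):

```python
import numpy as np

def rmsprop_step(w, sq_avg, grad, lr=0.001, rho=0.9, eps=1e-8):
    # Exponential moving average of squared gradients, one entry per parameter.
    # Adagrad would instead accumulate without decay: sq_avg = sq_avg + grad ** 2
    sq_avg = rho * sq_avg + (1.0 - rho) * grad ** 2
    # Large historical gradients -> smaller effective step; small ones -> larger step.
    w = w - lr * grad / (np.sqrt(sq_avg) + eps)
    return w, sq_avg
```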
4. Combining Momentum and Adaptive Rates (Adam, LAMB)
Adam and its variants (e.g., LAMB) integrate momentum (first‑moment estimate) and adaptive learning rates (second‑moment estimate) into a single update rule, providing robust performance across many loss‑surface shapes.
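A sketch of one Adam step combining both moment estimates (the hyperparameter defaults follow the commonly cited values; the function itself is illustrative, not the definitive implementation):

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # t counts update steps starting from 1.
    m = beta1 * m + (1 - beta1) * grad          # first moment: momentum term
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment: adaptive term
    m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```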
5. Learning‑Rate Scheduling
Beyond per‑parameter adaptation, the overall learning rate can be changed over the course of training based on epoch number or other criteria. This scheduling is typically handled by a separate component called a scheduler, not by the optimizer itself.
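For example, assuming a PyTorch workflow, a step‑decay scheduler can be attached alongside the optimizer (the model, step size, and decay factor here are arbitrary placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# The scheduler, not the optimizer, changes the global learning rate over epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()                        # halve the learning rate every 10 epochs
    print(epoch, scheduler.get_last_lr())
```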
Conclusion
The article presented the basic techniques behind gradient‑descent optimizers, explained why many variants exist, and described how each improvement—mini‑batching, momentum, adaptive learning rates, and scheduling—addresses specific challenges of the loss surface.