Understanding Stochastic Gradient Descent and Mini‑Batch Optimization

This article explains why traditional gradient descent struggles with massive datasets, introduces stochastic gradient descent and mini‑batch gradient descent as efficient alternatives, and provides practical guidance on batch size selection, data shuffling, and learning‑rate scheduling for deep learning models.

Hulu Beijing

Scene Description

Deep learning has rapidly dominated industry and academia due to explosive data growth. Traditional machine learning algorithms plateau as data scales, while deep learning continues improving, leading to the saying “data is king”.

Problem Description

Classic optimization methods such as gradient descent require the entire training set for each iteration, which becomes impractical for large‑scale problems. Mastering methods that handle massive training data is crucial for machine learning, especially deep learning.

Answer and Analysis

In machine learning, the objective function can be expressed as

J(θ) = E_{(x,y)∼P_data}[ L(f(x, θ), y) ] ≈ (1/M) Σ_{i=1}^{M} L(f(x_i, θ), y_i),

where θ denotes the model parameters, x the input, f(x, θ) the model output, y the target, and L(·,·) the loss on a data point (x, y). P_data represents the data distribution and E the expectation; the right-hand side is the average loss over the M training samples. J(θ) is what we aim to minimize.

Gradient descent updates parameters as

θ_{t+1} = θ_t − α ∇J(θ_t),  with  ∇J(θ) = (1/M) Σ_{i=1}^{M} ∇L(f(x_i, θ), y_i),

where the step size (learning rate) α > 0. Computing this gradient over all M training samples at every step is costly when M is large.
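To make the cost of full-batch updates concrete, here is a minimal NumPy sketch of gradient descent on a least-squares objective; the model f(x, θ) = xᵀθ, the synthetic data, and all constants are illustrative assumptions, not part of the original text. Note how every update touches all M samples.

```python
import numpy as np

# Hypothetical data: M samples with d features (placeholders for illustration).
M, d = 10_000, 20
rng = np.random.default_rng(0)
X = rng.normal(size=(M, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=M)

theta = np.zeros(d)   # parameters of the assumed model f(x, theta) = x . theta
alpha = 0.1           # step size (learning rate)

for step in range(100):
    # Full-batch gradient: averages the per-sample gradients over all M samples.
    grad = X.T @ (X @ theta - y) / M
    theta = theta - alpha * grad
```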

Stochastic gradient descent (SGD) approximates the average loss with the loss on a single randomly drawn sample, so each parameter update uses just one data point: θ_{t+1} = θ_t − α ∇L(f(x_i, θ_t), y_i). Each iteration is therefore far cheaper, which greatly accelerates training on large datasets and suits online scenarios where data arrives as a stream.
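As a rough sketch of the single-sample variant (reusing the same hypothetical least-squares setup), each iteration draws one example at random and updates θ from that sample's gradient alone:

```python
import numpy as np

# Same hypothetical least-squares setup as in the previous sketch.
M, d = 10_000, 20
rng = np.random.default_rng(0)
X = rng.normal(size=(M, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=M)

theta = np.zeros(d)
alpha = 0.01          # smaller step size: single-sample gradients are noisy

for step in range(20_000):
    i = rng.integers(M)                     # draw one sample uniformly at random
    x_i, y_i = X[i], y[i]
    grad_i = x_i * (x_i @ theta - y_i)      # gradient estimate from one sample
    theta = theta - alpha * grad_i          # per-step cost is O(d), independent of M
```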

Mini‑batch gradient descent processes m samples per update (where m ≪ M), which reduces gradient variance relative to single-sample SGD while still exploiting efficient vectorized matrix operations. Typical batch sizes are powers of two such as 64, 128, 256, or 512.
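For instance, under the same hypothetical least-squares model, the gradient of a whole mini-batch is a single matrix product, so the work lands in optimized linear-algebra kernels rather than a Python loop over samples:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 20, 256                        # feature dimension and batch size
theta = np.zeros(d)

X_batch = rng.normal(size=(m, d))     # one mini-batch of inputs (placeholder data)
y_batch = rng.normal(size=m)          # matching targets

# One matrix product evaluates all m per-sample residuals at once; averaging
# them gives a lower-variance gradient estimate than a single sample would.
grad = X_batch.T @ (X_batch @ theta - y_batch) / m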

In practice, three choices matter (a sketch combining them follows this list):

1. Batch size: choose m; powers of two often yield better computational efficiency on modern hardware.
2. Sample selection: randomly shuffle the dataset before each epoch, then take m consecutive samples from the shuffled data for each update, so every sample is visited once per epoch.
3. Learning rate: adopt a decaying learning rate α to speed early convergence and allow finer adjustments later.
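Putting the three choices together, the following is one possible mini-batch SGD loop; the 1/(1 + decay·epoch) schedule, the batch size of 256, and the least-squares model are illustrative assumptions rather than recommendations from the original text.

```python
import numpy as np

# Hypothetical least-squares setup, used only to make the loop concrete.
M, d = 10_000, 20
rng = np.random.default_rng(0)
X = rng.normal(size=(M, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=M)

theta = np.zeros(d)
m = 256                    # choice 1: batch size, a power of two
alpha0, decay = 0.1, 0.1   # choice 3: base learning rate and decay factor

for epoch in range(20):
    alpha = alpha0 / (1.0 + decay * epoch)   # decaying learning-rate schedule
    perm = rng.permutation(M)                # choice 2: reshuffle once per epoch
    X_shuf, y_shuf = X[perm], y[perm]
    for start in range(0, M, m):
        Xb = X_shuf[start:start + m]         # m consecutive samples from the shuffled data
        yb = y_shuf[start:start + m]
        grad = Xb.T @ (Xb @ theta - yb) / len(Xb)
        theta = theta - alpha * grad
```

In deep learning frameworks such as PyTorch or TensorFlow, the shuffling and slicing shown here are typically handled by a data-loader abstraction, and the decay by a learning-rate scheduler.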

In summary, mini‑batch SGD updates model parameters using only a small subset of data each iteration, dramatically speeding up convergence for large‑scale training problems.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Optimization, Mini-Batch, Stochastic Gradient Descent