Understanding Stochastic Gradient Descent and Mini‑Batch Optimization

This article explains why traditional gradient descent struggles with massive datasets, introduces stochastic gradient descent and mini‑batch gradient descent as efficient alternatives, and provides practical guidance on batch size selection, data shuffling, and learning‑rate scheduling for deep learning models.

Hulu Beijing

Scene Description

Deep learning has rapidly dominated industry and academia due to explosive data growth. Traditional machine learning algorithms plateau as data scales, while deep learning continues improving, leading to the saying “data is king”.

Problem Description

Classic optimization methods such as gradient descent require the entire training set for each iteration, which becomes impractical for large‑scale problems. Mastering methods that handle massive training data is crucial for machine learning, especially deep learning.

Answer and Analysis

In machine learning, the objective function can be expressed as

J(θ) = E_{(x,y)∼P_data}[ L(f(x, θ), y) ] ≈ (1/M) Σ_{i=1}^{M} L(f(x_i, θ), y_i),

where θ denotes the model parameters, x the input, f(x, θ) the model output, y the target, and L(·,·) the loss on a data point (x, y). P_data represents the data distribution and E the expectation; the right-hand side is the average loss over the M training samples. J(θ) is what we aim to minimize.

Gradient descent updates parameters as

θ_{t+1} = θ_t − α ∇J(θ_t),  with  ∇J(θ) = (1/M) Σ_{i=1}^{M} ∇L(f(x_i, θ), y_i),

where the step size (learning rate) α > 0. Computing this gradient over all M training samples at every step is costly when M is large.
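To make the cost of full-batch updates concrete, here is a minimal NumPy sketch of gradient descent on a least-squares objective; the model f(x, θ) = xᵀθ, the synthetic data, and all constants are illustrative assumptions, not part of the original text. Note how every update touches all M samples.

```python
import numpy as np

# Hypothetical data: M samples with d features (placeholders for illustration).
M, d = 10_000, 20
rng = np.random.default_rng(0)
X = rng.normal(size=(M, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=M)

theta = np.zeros(d)   # parameters of the assumed model f(x, theta) = x . theta
alpha = 0.1           # step size (learning rate)

for step in range(100):
    # Full-batch gradient: averages the per-sample gradients over all M samples.
    grad = X.T @ (X @ theta - y) / M
    theta = theta - alpha * grad
```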

Stochastic gradient descent (SGD) approximates the average loss with the loss on a single randomly drawn sample, so each parameter update uses just one data point: θ_{t+1} = θ_t − α ∇L(f(x_i, θ_t), y_i). Each iteration is therefore far cheaper, which greatly accelerates training on large datasets and suits online scenarios where data arrives as a stream.
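As a rough sketch of the single-sample variant (reusing the same hypothetical least-squares setup), each iteration draws one example at random and updates θ from that sample's gradient alone:

```python
import numpy as np

# Same hypothetical least-squares setup as in the previous sketch.
M, d = 10_000, 20
rng = np.random.default_rng(0)
X = rng.normal(size=(M, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=M)

theta = np.zeros(d)
alpha = 0.01          # smaller step size: single-sample gradients are noisy

for step in range(20_000):
    i = rng.integers(M)                     # draw one sample uniformly at random
    x_i, y_i = X[i], y[i]
    grad_i = x_i * (x_i @ theta - y_i)      # gradient estimate from one sample
    theta = theta - alpha * grad_i          # per-step cost is O(d), independent of M
```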

Mini‑batch gradient descent processes m samples per update (where m ≪ M), which reduces gradient variance relative to single-sample SGD while still exploiting efficient vectorized matrix operations. Typical batch sizes are powers of two such as 64, 128, 256, or 512.
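For instance, under the same hypothetical least-squares model, the gradient of a whole mini-batch is a single matrix product, so the work lands in optimized linear-algebra kernels rather than a Python loop over samples:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 20, 256                        # feature dimension and batch size
theta = np.zeros(d)

X_batch = rng.normal(size=(m, d))     # one mini-batch of inputs (placeholder data)
y_batch = rng.normal(size=m)          # matching targets

# One matrix product evaluates all m per-sample residuals at once; averaging
# them gives a lower-variance gradient estimate than a single sample would.
grad = X_batch.T @ (X_batch @ theta - y_batch) / m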

In practice, three choices matter (a sketch combining them follows this list):

1. Batch size: choose m; powers of two often yield better computational efficiency on modern hardware.
2. Sample selection: randomly shuffle the dataset before each epoch, then take m consecutive samples from the shuffled data for each update, so every sample is visited once per epoch.
3. Learning rate: adopt a decaying learning rate α to speed early convergence and allow finer adjustments later.
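Putting the three choices together, the following is one possible mini-batch SGD loop; the 1/(1 + decay·epoch) schedule, the batch size of 256, and the least-squares model are illustrative assumptions rather than recommendations from the original text.

```python
import numpy as np

# Hypothetical least-squares setup, used only to make the loop concrete.
M, d = 10_000, 20
rng = np.random.default_rng(0)
X = rng.normal(size=(M, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=M)

theta = np.zeros(d)
m = 256                    # choice 1: batch size, a power of two
alpha0, decay = 0.1, 0.1   # choice 3: base learning rate and decay factor

for epoch in range(20):
    alpha = alpha0 / (1.0 + decay * epoch)   # decaying learning-rate schedule
    perm = rng.permutation(M)                # choice 2: reshuffle once per epoch
    X_shuf, y_shuf = X[perm], y[perm]
    for start in range(0, M, m):
        Xb = X_shuf[start:start + m]         # m consecutive samples from the shuffled data
        yb = y_shuf[start:start + m]
        grad = Xb.T @ (Xb @ theta - yb) / len(Xb)
        theta = theta - alpha * grad
```

In deep learning frameworks such as PyTorch or TensorFlow, the shuffling and slicing shown here are typically handled by a data-loader abstraction, and the decay by a learning-rate scheduler.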

In summary, mini‑batch SGD updates model parameters using only a small subset of data each iteration, dramatically speeding up convergence for large‑scale training problems.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Optimization, Mini-Batch, Stochastic Gradient Descent