Understanding Stochastic Gradient Descent and Mini‑Batch Optimization
This article explains why traditional gradient descent struggles with massive datasets, introduces stochastic gradient descent and mini‑batch gradient descent as efficient alternatives, and provides practical guidance on batch size selection, data shuffling, and learning‑rate scheduling for deep learning models.
Scene Description
Driven by the explosive growth of data, deep learning has rapidly come to dominate both industry and academia. Traditional machine learning algorithms tend to plateau as data scales up, while deep learning models keep improving, which is why practitioners say "data is king".
Problem Description
Classic optimization methods such as gradient descent require the entire training set for each iteration, which becomes impractical for large‑scale problems. Mastering methods that handle massive training data is crucial for machine learning, especially deep learning.
Answer and Analysis
In machine learning, the objective function can be expressed as

J(θ) = E_{(x, y) ~ P_data} [ L(f(x, θ), y) ],

where θ denotes the model parameters, x the input, f(x, θ) the model output, y the target, and L(·,·) the loss on the data point (x, y). P_data represents the data distribution and E the expectation, so J(θ) is the average loss over all data, which we aim to minimize.
Gradient descent updates the parameters as

θ ← θ − α ∇J(θ),

with step size (learning rate) α > 0. On a training set of M samples the gradient is the average

∇J(θ) = (1/M) Σ_{i=1..M} ∇ L(f(x_i, θ), y_i),

so every single update must touch all M samples, which is costly when M is large.
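To make this concrete, here is a minimal sketch of full‑batch gradient descent. The linear model, squared loss, synthetic data, and step size are illustrative assumptions rather than anything prescribed by the original text; the point is that the gradient computation sweeps over all M samples on every iteration.

```python
import numpy as np

# Hypothetical setup: M samples, d features, linear model with squared loss.
rng = np.random.default_rng(0)
M, d = 100_000, 20
X = rng.normal(size=(M, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=M)

theta = np.zeros(d)   # model parameters θ
alpha = 0.1           # step size α

for step in range(100):
    # The bottleneck: this gradient touches ALL M samples on every iteration.
    grad = X.T @ (X @ theta - y) / M
    theta -= alpha * grad
```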
Stochastic gradient descent (SGD) instead approximates the gradient with a single randomly drawn sample (x_i, y_i), updating θ ← θ − α ∇ L(f(x_i, θ), y_i). Each step therefore uses just one data point, making updates far cheaper and the method well suited to online scenarios where samples arrive one at a time.
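A minimal SGD sketch, under the same illustrative assumptions (linear model, squared loss, synthetic data), might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 100_000, 20
X = rng.normal(size=(M, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=M)

theta = np.zeros(d)
alpha = 0.01

for step in range(10_000):
    i = rng.integers(M)                 # draw a single sample at random
    x_i, y_i = X[i], y[i]
    grad = (x_i @ theta - y_i) * x_i    # gradient from one data point only
    theta -= alpha * grad
```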
Mini‑batch gradient descent processes m samples per update (where m ≪ M ), reducing gradient variance and exploiting efficient matrix operations. Typical batch sizes are powers of two such as 64, 128, 256, 512.
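Under the same assumptions, a mini‑batch update replaces the single sample with a batch of m samples, so each step becomes a small matrix operation over the batch:

```python
import numpy as np

rng = np.random.default_rng(0)
M, d, m = 100_000, 20, 128              # batch size m << M
X = rng.normal(size=(M, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=M)

theta = np.zeros(d)
alpha = 0.05

for step in range(2_000):
    idx = rng.choice(M, size=m, replace=False)   # sample a batch of m points
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ theta - yb) / m          # batch-averaged gradient
    theta -= alpha * grad
```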
In practice, three points matter when using mini‑batch gradient descent (a combined sketch follows this list):
1. Choose the batch size m; powers of two often yield better computational efficiency on modern hardware.
2. Randomly shuffle the dataset before each epoch, then take m consecutive samples at a time for each update.
3. Use a decaying learning rate α to speed up early convergence and allow finer adjustments later in training.
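The sketch below puts the three points together: a power‑of‑two batch size, per‑epoch shuffling with consecutive slices, and a decaying learning rate. The 1/(1 + decay·epoch) schedule and all hyperparameter values are illustrative choices, not prescriptions from the original text:

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 100_000, 20
X = rng.normal(size=(M, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=M)

theta = np.zeros(d)
m = 128                                   # batch size: a power of two
alpha0, decay = 0.1, 0.5

for epoch in range(10):
    alpha = alpha0 / (1.0 + decay * epoch)        # decaying learning rate
    perm = rng.permutation(M)                     # reshuffle every epoch
    X_shuf, y_shuf = X[perm], y[perm]
    for start in range(0, M, m):
        Xb = X_shuf[start:start + m]              # m consecutive samples
        yb = y_shuf[start:start + m]
        grad = Xb.T @ (Xb @ theta - yb) / len(yb)
        theta -= alpha * grad
```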
In summary, mini‑batch SGD updates model parameters using only a small subset of data each iteration, dramatically speeding up convergence for large‑scale training problems.
