Why Neural Networks Need Batch Normalization: Principles and Mechanics
The article explains the principle behind Batch Normalization: why it is essential for training deep neural networks, how it standardizes activations, the role of the learnable scale and shift parameters, the computation steps during training and inference, and the debate over where to place it within a model.
Batch Normalization, introduced by Ioffe and Szegedy in the original 2015 paper, is regarded as a transformative technique that enables faster training of deeper neural networks by stabilizing the optimization process.
Background: Input Normalization
When feeding data into deep learning models, it is standard practice to normalize each feature to zero mean and unit variance. For example, one feature may range from 1 to 5 while another ranges from 1,000 to 99,999; each column therefore needs its own mean and variance before the standardization formula x̂ = (x − μ) / σ is applied.
After normalization, the original values are centered around zero with unit variance, so all features share the same scale.
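As a minimal sketch of this per-column standardization (using NumPy; the example values are illustrative, not from the original article):

```python
import numpy as np

# Two features on very different scales, e.g. one in [1, 5] and one in [1000, 99999].
X = np.array([[1.0,  1_000.0],
              [3.0, 50_000.0],
              [5.0, 99_999.0]])

mu = X.mean(axis=0)         # per-column mean
sigma = X.std(axis=0)       # per-column standard deviation
X_norm = (X - mu) / sigma   # zero mean, unit variance per feature

print(X_norm.mean(axis=0))  # ~[0, 0]
print(X_norm.std(axis=0))   # ~[1, 1]
```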
Problem Without Normalization
If two features have vastly different scales, the larger‑scale feature dominates the linear combination learned by the network, and the loss surface becomes a narrow canyon: gradient descent oscillates back and forth across the steep dimension while progressing slowly along the shallow one, so many steps are needed to reach the minimum.
When features are on the same scale, the loss surface resembles a smooth bowl, allowing gradient descent to converge steadily.
Necessity Across Hidden Layers
The same normalization logic that applies to the network input also applies to the activations of every hidden layer, because each layer's input distribution shifts during training as the weights of the preceding layers change. Normalizing each layer's inputs helps gradient descent converge more reliably, which is precisely what a Batch Normalization layer does.
How Batch Normalization Works
A Batch Norm layer is inserted between two hidden layers. It receives the activations from the preceding layer, computes per‑feature mean and variance over the mini‑batch, normalizes the activations, then applies a learned scale (γ) and shift (β) before passing the result to the next layer.
The layer also maintains exponential moving averages (EMA) of the mean and variance, using a momentum scalar (α) distinct from optimizer momentum.
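Concretely, a common form of this update (the exact convention varies by framework and is an assumption here) is running_mean ← α · running_mean + (1 − α) · batch_mean, and likewise for the variance, with α typically close to 1 (for example 0.9 or 0.99).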
Parameters
Each Batch Norm layer has two learnable parameters, β (shift) and γ (scale), and two non‑learnable statistics (running mean and variance). In a network with three hidden layers, there are three separate β and γ pairs, plus three sets of EMA statistics.
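As a quick illustration using PyTorch's built-in layer (the feature count of 64 is arbitrary), both kinds of state can be inspected directly; PyTorch names γ `weight` and β `bias`:

```python
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=64)

# Learnable parameters: gamma (`weight`) and beta (`bias`).
print(bn.weight.shape, bn.bias.shape)               # torch.Size([64]) torch.Size([64])

# Non-learnable EMA statistics, kept for inference.
print(bn.running_mean.shape, bn.running_var.shape)  # torch.Size([64]) torch.Size([64])
```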
Computation Steps
Activation: Receive activations from the previous layer.
Mean & Variance: Compute the per‑feature mean and variance across the mini‑batch.
Normalization: Subtract the mean and divide by the standard deviation (with a small ε added to the variance for numerical stability), yielding zero‑mean, unit‑variance values.
Scale & Shift: Multiply by γ and add β, allowing the layer to learn an optimal scale and offset.
Moving Average: Update EMA of mean and variance using momentum α; these EMA values are stored for inference.
During back‑propagation, gradients are computed for all parameters, including β and γ, which are updated like any other weight.
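Putting the steps together, a minimal NumPy sketch of the training‑time forward pass (the variable names and the EMA convention are assumptions, not from the original article) might look like:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, running_mean, running_var,
                     alpha=0.9, eps=1e-5):
    """One training-mode forward pass of Batch Norm.

    x: (batch_size, num_features) activations from the previous layer.
    gamma, beta: learnable scale and shift, shape (num_features,).
    """
    # Step 2: per-feature mean and variance over the mini-batch.
    mu = x.mean(axis=0)
    var = x.var(axis=0)

    # Step 3: normalize; eps keeps the division numerically stable.
    x_hat = (x - mu) / np.sqrt(var + eps)

    # Step 4: learned scale and shift.
    out = gamma * x_hat + beta

    # Step 5: update the EMA statistics stored for inference.
    running_mean = alpha * running_mean + (1 - alpha) * mu
    running_var = alpha * running_var + (1 - alpha) * var

    return out, running_mean, running_var
```

In practice frameworks fold all of this into a single layer; the sketch simply mirrors the five steps listed above.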
Inference Phase
In training, Batch Norm computes batch statistics on each mini‑batch. At inference time, only a single sample is processed, so the stored EMA mean and variance are used instead of batch statistics.
Using the EMA statistics avoids recomputing statistics over the entire training dataset and makes the inference computation both efficient and deterministic.
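A matching inference‑mode sketch, continuing the hypothetical function above, simply reuses the stored statistics in place of batch statistics:

```python
import numpy as np

def batch_norm_infer(x, gamma, beta, running_mean, running_var, eps=1e-5):
    # Normalize with the EMA statistics accumulated during training,
    # so even a single sample can be processed.
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta
```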
Placement Debate
There are two common opinions on where to place Batch Norm in a network architecture: before the activation function (as in the original paper) or after the activation (as suggested by later works). Different placement can lead to varying performance outcomes.
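In code, the two placements differ only in layer ordering; a PyTorch sketch (the layer widths are arbitrary):

```python
import torch.nn as nn

# Placement 1: Batch Norm before the activation, as in the original paper.
pre_activation = nn.Sequential(
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
)

# Placement 2: Batch Norm after the activation, as suggested by later works.
post_activation = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.BatchNorm1d(64),
)
```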
The article concludes that understanding the underlying mechanics of Batch Normalization—its necessity, computation, parameters, and placement—helps practitioners design more stable and efficient deep neural networks.