How Batch Normalization Accelerates Neural Network Training and Improves Generalization
This article explains the motivation, core principles, and implementation details of Batch Normalization, including how it normalizes each mini‑batch, restores learned feature distributions, and is applied in convolutional neural networks to speed up training and boost model generalization.
Scene Description
Training deep neural networks involves many hand‑tuned hyper‑parameters such as the learning rate, weight decay, and dropout ratio, all of which significantly affect the final model performance. Batch Normalization (BN) reduces this sensitivity by normalizing the distribution of each layer's inputs, thereby accelerating convergence and enhancing generalization.
Problem Description
What are the basic motivation and principle of BN?
How does BN restore the feature distribution learned by the previous layer?
Briefly describe how BN is used in convolutional neural networks.
Background Knowledge
Statistical learning, deep learning
Answer and Analysis
1. BN Basic Motivation and Principle
The essence of neural network training is to learn the data distribution; a mismatch between training and test distributions reduces generalization ability, so inputs are normalized before training.
During training, the parameters of every hidden layer are constantly updated, so the input distribution of each subsequent layer keeps shifting. Each layer must therefore re‑adapt to a different input distribution at every iteration, which makes training harder and increases the risk of over‑fitting.
Batch Normalization addresses this by inserting a normalization step before the input of each layer for every mini‑batch, forcing the data to have zero mean and unit variance. For the k‑th dimension (neuron) of the input $x$, the normalization is

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}] + \epsilon}},$$

where the expectation and variance are estimated over the current mini‑batch and $\epsilon$ is a small constant added for numerical stability.
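To make this step concrete, here is a minimal NumPy sketch of the normalization for a fully connected layer's inputs (the function name `batch_normalize` and the value of `eps` are illustrative assumptions, not from the original text):

```python
import numpy as np

def batch_normalize(x, eps=1e-5):
    """Normalize each dimension of a mini-batch to zero mean and unit variance.

    x: activations of shape (m, num_features), one row per sample.
    """
    mu = x.mean(axis=0)    # per-dimension mini-batch mean E[x^(k)]
    var = x.var(axis=0)    # per-dimension mini-batch variance Var[x^(k)]
    return (x - mu) / np.sqrt(var + eps)
```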
2. Restoring the Learned Feature Distribution
Directly applying the above normalization would alter the distribution learned by the previous layer. For example, normalized data concentrates around zero, in the near‑linear non‑saturated region of a sigmoid activation, which weakens the nonlinearity and discards the features the previous layer has learned. To allow the network to restore the original distribution, BN introduces a learnable scale parameter $\gamma$ and shift parameter $\beta$ for each dimension:

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}.$$
Here $\gamma$ plays the role of the standard deviation (scale) and $\beta$ the mean (shift) of the input distribution; in particular, setting $\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}$ and $\beta^{(k)} = \mathrm{E}[x^{(k)}]$ recovers the original distribution exactly. In a standard network without BN, the mean and variance of a layer's input depend in a complicated way on the learned weights of all preceding layers. After BN, they are controlled by $\gamma$ and $\beta$, two independent learnable parameters of the current layer, which makes optimization easier.
The complete forward propagation of a Batch Normalization layer over a mini‑batch $\mathcal{B} = \{x_1, \dots, x_m\}$ is:

$$\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_\mathcal{B}^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_\mathcal{B}\right)^2,$$

$$\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta.$$
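Putting the normalization and the scale‑and‑shift step together, a training‑time forward pass could be sketched as follows (`bn_forward`, `gamma`, and `beta` are illustrative names; a full implementation would also maintain running statistics of the mean and variance for use at inference time):

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    """Training-time forward pass of a Batch Normalization layer.

    x:     mini-batch of shape (m, num_features)
    gamma: learnable scale, shape (num_features,)
    beta:  learnable shift, shape (num_features,)
    """
    mu = x.mean(axis=0)                     # mini-batch mean, per dimension
    var = x.var(axis=0)                     # mini-batch variance, per dimension
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta             # restore scale and shift
```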
3. Using BN in Convolutional Neural Networks
In CNNs, each layer produces a set of feature maps. Following a strategy analogous to weight sharing, BN treats each feature map as a single processing unit: one pair of parameters $\gamma$ and $\beta$ is learned per map and shared across all spatial positions and all samples in the mini‑batch.
Assume a mini‑batch contains b samples and each sample produces f feature maps (channels) of width w and height h. For a given channel, the mini‑batch then contributes b × w × h activations, and it is over these activations that the channel's mean and variance are computed, while a single pair of learnable parameters γ and β is maintained for that channel. In other words, the effective mini‑batch size in the BN formulas of the previous section becomes m = b × w × h.
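As a sketch of these per‑channel statistics, assuming activations are stored in (b, f, h, w) layout (both the layout choice and the name `bn_forward_conv` are assumptions for illustration):

```python
import numpy as np

def bn_forward_conv(x, gamma, beta, eps=1e-5):
    """Training-time BN for convolutional activations.

    x:     activations of shape (b, f, h, w) -- batch, channels, height, width
    gamma: learnable scale, shape (f,)       -- one per channel
    beta:  learnable shift, shape (f,)       -- one per channel
    """
    # Statistics are computed per channel over the batch and spatial axes,
    # i.e. over b * h * w activations for each of the f channels.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)
```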