Why Neural Networks Need Batch Normalization: Principles and Mechanics

This article explains the principle behind Batch Normalization: why it is essential for training deep neural networks, how it standardizes activations, the role of the learnable scale and shift parameters, the computation steps during training and inference, and where to place it within a model.

Batch Normalization, introduced by Sergey Ioffe and Christian Szegedy in 2015, is a transformative technique that enables faster training of deeper neural networks by stabilizing the training process.

Background: Input Normalization

When feeding data into deep learning models, it is standard practice to normalize each feature to zero mean and unit variance. For example, one feature may range from 1 to 5 while another ranges from 1,000 to 99,999; the mean and variance of each column are computed separately before applying the standardization formula z = (x − μ) / σ (illustrated in the accompanying diagram).
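A minimal NumPy sketch of this per-feature standardization (the toy values are illustrative, not from the article):

```python
import numpy as np

# Toy data: column 0 spans roughly 1-5, column 1 spans 1,000-99,999.
X = np.array([[1.0,  1_000.0],
              [3.0, 50_000.0],
              [5.0, 99_999.0]])

# Standardize each feature (column) independently: z = (x - mean) / std.
mean = X.mean(axis=0)
std = X.std(axis=0)
X_norm = (X - mean) / std

print(X_norm.mean(axis=0))  # ~ [0, 0]
print(X_norm.std(axis=0))   # ~ [1, 1]
```

After this step both columns live on the same scale, regardless of their original ranges.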

After normalization, the original values (blue) become centered around zero (red), ensuring all features share the same scale.

Problem Without Normalization

If two features have vastly different scales, the larger‑scale feature dominates the linear combination learned by the network. The loss surface then resembles a narrow canyon: gradient descent oscillates across the steep dimension while progressing slowly along the shallow one, requiring many steps to reach the minimum.

When features are on the same scale, the loss surface resembles a smooth bowl, allowing gradient descent to converge steadily.

Necessity Across Hidden Layers

The same normalization logic that applies to the network input must be applied to the activations of every hidden layer. Normalizing each layer’s inputs helps gradient descent converge more reliably, which is precisely what a Batch Normalization layer does.

How Batch Normalization Works

A Batch Norm layer is inserted between two hidden layers. It receives the activations from the preceding layer, computes per‑feature mean and variance over the mini‑batch, normalizes the activations, then applies a learned scale (γ) and shift (β) before passing the result to the next layer.

The layer also maintains exponential moving averages (EMA) of the mean and variance, using a momentum scalar (α) distinct from optimizer momentum.

Parameters

Each Batch Norm layer has two learnable parameters, β (shift) and γ (scale), and two non‑learnable statistics (running mean and variance). In a network with three hidden layers, there are three separate β and γ pairs, plus three sets of EMA statistics.
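A small sketch of this bookkeeping, assuming illustrative hidden sizes of 64, 32, and 16 (not specified in the article):

```python
import numpy as np

hidden_sizes = [64, 32, 16]  # three hidden layers (illustrative sizes)

# One Batch Norm layer per hidden layer: gamma/beta are learnable,
# running mean/variance are EMA buffers updated outside back-propagation.
bn_layers = []
for n in hidden_sizes:
    bn_layers.append({
        "gamma": np.ones(n),          # learnable scale
        "beta": np.zeros(n),          # learnable shift
        "running_mean": np.zeros(n),  # EMA statistic (not learned)
        "running_var": np.ones(n),    # EMA statistic (not learned)
    })

learnable = sum(l["gamma"].size + l["beta"].size for l in bn_layers)
print(learnable)  # 2 * (64 + 32 + 16) = 224 learnable parameters
```

Note that γ is conventionally initialized to ones and β to zeros, so the layer starts out as a plain standardization.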

Computation Steps

Activation: Receive activations from the previous layer.

Mean & Variance: Compute the mean and variance of each feature (each activation dimension) across the mini‑batch.

Normalization: Subtract the mean and divide by the standard deviation (with a small ε added to the variance for numerical stability), yielding zero‑mean, unit‑variance values.

Scale & Shift: Multiply by γ and add β, allowing the layer to learn an optimal scale and offset.

Moving Average: Update EMA of mean and variance using momentum α; these EMA values are stored for inference.
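The steps above can be sketched as a single training-time forward pass (a simplified NumPy version; a real implementation would also define the backward pass):

```python
import numpy as np

def batchnorm_train(x, gamma, beta, running_mean, running_var,
                    alpha=0.9, eps=1e-5):
    """One training-time Batch Norm step over a mini-batch x of shape (N, D)."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize (eps avoids div-by-zero)
    out = gamma * x_hat + beta              # learned scale and shift
    # EMA update with momentum alpha: keep alpha of the old statistic,
    # blend in (1 - alpha) of the current batch's statistic.
    running_mean = alpha * running_mean + (1 - alpha) * mu
    running_var = alpha * running_var + (1 - alpha) * var
    return out, running_mean, running_var

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(32, 4))      # mini-batch: 32 samples, 4 features
gamma, beta = np.ones(4), np.zeros(4)
out, rm, rv = batchnorm_train(x, gamma, beta, np.zeros(4), np.ones(4))
# With gamma = 1 and beta = 0, each output feature is ~zero-mean, unit-variance.
```

With γ = 1 and β = 0 this is pure standardization; during training the optimizer moves γ and β to whatever scale and offset minimize the loss.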

During back‑propagation, gradients are computed for all parameters, including β and γ, which are updated like any other weight.

Inference Phase

During training, Batch Norm computes fresh statistics on each mini‑batch. At inference time, the model may receive only a single sample, so reliable batch statistics are unavailable; the stored EMA mean and variance are used instead.

Using EMA avoids the need to keep the entire training dataset in memory and provides a much more efficient inference computation.
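The inference path is then a fixed affine transform using the stored statistics (a minimal sketch, pairing with the training function above):

```python
import numpy as np

def batchnorm_infer(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Inference-time Batch Norm: uses stored EMA statistics instead of
    batch statistics, so it works on a single sample (D,) or a batch (N, D)."""
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta

# A single sample, normalized with (hypothetical) stored statistics:
y = batchnorm_infer(np.array([4.0, 6.0]),
                    gamma=np.ones(2), beta=np.zeros(2),
                    running_mean=np.array([5.0, 5.0]),
                    running_var=np.array([4.0, 4.0]))
print(y)  # ≈ [-0.5, 0.5]
```

Because mean and variance are frozen at inference, the whole layer collapses into one multiply-add per feature, which is why it is so cheap at serving time.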

Placement Debate

There are two common opinions on where to place Batch Norm in a network architecture: before the activation function (as in the original paper) or after the activation (as suggested by later works). Different placement can lead to varying performance outcomes.
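The two placements can be contrasted in a toy block (a hedged sketch using simplified batch-statistics normalization; the weights and sizes are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bn(x, gamma, beta, eps=1e-5):
    # Simplified Batch Norm using only batch statistics, for illustration.
    mu, var = x.mean(axis=0), x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

# Placement A (original paper): Linear -> BatchNorm -> activation
def block_pre_activation(x, W, gamma, beta):
    return relu(bn(x @ W, gamma, beta))

# Placement B (later works): Linear -> activation -> BatchNorm
def block_post_activation(x, W, gamma, beta):
    return bn(relu(x @ W), gamma, beta)

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 3))
W = rng.normal(size=(3, 5))
g, b = np.ones(5), np.zeros(5)
a = block_pre_activation(x, W, g, b)   # outputs are non-negative (ReLU last)
p = block_post_activation(x, W, g, b)  # outputs are re-centered (BN last)
```

The two orderings produce different output distributions (placement A's outputs are clipped at zero, placement B's are re-centered around zero), which is why the choice can affect training behavior.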

The article concludes that understanding the underlying mechanics of Batch Normalization—its necessity, computation, parameters, and placement—helps practitioners design more stable and efficient deep neural networks.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Deep Learning, neural networks, gradient descent, normalization, Batch Normalization, training stability
Written by

Code DAO

We deliver AI algorithm tutorials and the latest news, curated by a team of researchers from Peking University, Shanghai Jiao Tong University, Central South University, and leading AI companies such as Huawei, Kuaishou, and SenseTime. Join us in the AI alchemy—making life better!
