
Why Standardize Data to Mean 0 and Variance 1?

Setting the mean to zero recenters the data around the origin, which helps optimization algorithms converge faster; scaling the variance to one equalizes feature scales so that no single feature dominates. The article illustrates both effects with examples and visualizations of how standardization improves machine‑learning models.


Mean (μ)

The mean of a dataset is the arithmetic average: μ = \frac{1}{n}\sum_{i=1}^{n} x_i, where n is the number of observations and x_i is the i-th observation. The mean represents the centre of mass of the data distribution.

Variance (σ²)

Variance quantifies the spread of the data around the mean: σ² = \frac{1}{n}\sum_{i=1}^{n} (x_i - μ)^2. It is the average of the squared deviations from the mean. The standard deviation σ is the square root of the variance.
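
As a quick check of both formulas, here is a minimal NumPy sketch; the data values are made up for illustration:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])  # toy observations

mu = x.mean()                  # μ = (1/n) Σ x_i
var = ((x - mu) ** 2).mean()   # σ² = (1/n) Σ (x_i - μ)²

# NumPy's np.var uses the same population formula (ddof=0) by default.
assert np.isclose(var, np.var(x))
print(mu, var)  # 25.0 125.0
```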

Why set the mean to zero?

Subtracting the mean from each observation (x_i - μ) translates the entire point cloud to the origin of the coordinate system. Centering the data has two practical benefits, illustrated in the sketch after this list:

Gradient‑based optimizers (e.g., stochastic gradient descent) see a more symmetric loss surface, which often leads to faster and more stable convergence.

Features that are far from the origin do not dominate the initial steps of the optimisation, reducing the risk of numerical overflow.
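
A minimal sketch of centering, with hypothetical values:

```python
import numpy as np

x = np.array([100.0, 102.0, 98.0, 104.0, 96.0])  # hypothetical raw feature

x_centered = x - x.mean()   # translate the point cloud to the origin
print(x_centered)           # [ 0.  2. -2.  4. -4.]
print(x_centered.mean())    # 0.0 (up to floating-point rounding)
```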

Why scale the variance to one?

Dividing each centred feature by its standard deviation ((x_i - μ)/σ) forces the variance of every feature to be 1. This prevents features with larger numeric ranges from overwhelming those with smaller ranges. For example, consider two features:

Age: values in the range 10–70.

Salary: values in the range 10,000–70,000.

If fed directly to a model, the salary feature would appear roughly a thousand times more important simply because of its magnitude. After standardisation both features have comparable scale, allowing the model to learn from each on an equal footing.
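
A sketch of this two-feature example; the age and salary samples are invented, with salary deliberately set to 1000 × age so that the scale difference is the only difference between them:

```python
import numpy as np

age = np.array([12.0, 25.0, 40.0, 55.0, 68.0])   # years
salary = 1000.0 * age                            # currency units

def standardize(x):
    """z = (x - μ) / σ, using the population standard deviation."""
    return (x - x.mean()) / x.std()

# After standardisation the two columns are identical: the factor of
# 1000 carries no information and washes out completely.
print(standardize(age))
print(standardize(salary))
```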

Visual effect of standardisation

Before standardisation a low‑amplitude feature (e.g., a red line) is almost invisible when plotted together with a high‑amplitude feature (e.g., a blue line). After applying the transformation (x - μ)/σ, the amplitudes become comparable and both series are clearly visible.
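
The effect is easy to reproduce; in this matplotlib sketch, two made-up sinusoids stand in for the red and blue series described above:

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 2 * np.pi, 200)
low = 0.05 * np.sin(t)      # low-amplitude series (the "red line")
high = 1000.0 * np.cos(t)   # high-amplitude series (the "blue line")

def z(s):
    return (s - s.mean()) / s.std()   # (x - μ) / σ

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(t, low, "r")
ax1.plot(t, high, "b")
ax1.set_title("Raw: the red line is flattened to invisibility")
ax2.plot(t, z(low), "r")
ax2.plot(t, z(high), "b")
ax2.set_title("Standardised: comparable amplitudes")
plt.tight_layout()
plt.show()
```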

When to apply standardisation

Algorithms that rely on gradient‑based optimisation or on distances between samples benefit most from standardised inputs, including:

Logistic regression

Feed‑forward neural networks

Support‑vector machines (when using linear or RBF kernels)

k‑Nearest Neighbours (distance‑based methods)

For tree‑based models (e.g., decision trees, random forests, gradient‑boosted trees) scaling is usually unnecessary: their splits depend only on the ordering of feature values, which monotonic transformations preserve.
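
In practice the transformation is fitted on the training split only. A minimal scikit-learn sketch of this pattern, assuming scikit-learn is installed and using a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for real features of mixed scales
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Inside a pipeline, StandardScaler learns μ and σ from the training
# split only, so no information leaks from the test set.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```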

Summary

Standardisation transforms each feature x into a zero‑mean, unit‑variance variable z = (x - μ) / σ. Centering removes location bias, while scaling equalises the magnitude of all features. Together they improve optimisation dynamics and prevent any single feature from dominating the learning process.

Tags: machine learning, data preprocessing, feature scaling, z-score, mean normalization
Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
