Why Standardize Data to Mean 0 and Variance 1?
Setting the mean to zero recenters the data around the origin, which helps optimization algorithms converge faster; scaling the variance to one equalizes feature scales so that no single feature dominates. The sections below illustrate both effects and show how standardization improves machine‑learning models.
Mean (μ)
The mean of a dataset is the arithmetic average:

μ = \frac{1}{n}\sum_{i=1}^{n} x_i

where n is the number of observations and x_i is each observation. The mean represents the centre of mass of the data distribution.
Variance (σ²)
Variance quantifies the spread of the data around the mean:

σ² = \frac{1}{n}\sum_{i=1}^{n} (x_i - μ)^2

It is the average of the squared deviations from the mean. The standard deviation σ is the square root of the variance.
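Both quantities can be computed directly; a minimal sketch with NumPy (the sample values are illustrative):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])  # illustrative sample

mu = x.mean()                  # μ = (1/n) · Σ x_i
var = ((x - mu) ** 2).mean()   # σ² = average squared deviation from μ
sigma = np.sqrt(var)           # σ = square root of the variance

print(mu, var, sigma)  # 25.0 125.0 11.180...
```

Note that `np.var` uses the same population formula (divisor n) by default, so `x.var()` gives the identical result.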
Why set the mean to zero?
Subtracting the mean from each observation (x_i - μ) translates the entire point cloud to the origin of the coordinate system. Centering the data has two practical benefits:
Gradient‑based optimizers (e.g., stochastic gradient descent) see a more symmetric loss surface, which often leads to faster and more stable convergence.
Features that are far from the origin do not dominate the initial steps of the optimisation, reducing the risk of numerical overflow.
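Centering itself is a one-line operation; a sketch with NumPy (the feature values are made up for illustration):

```python
import numpy as np

feature = np.array([102.0, 98.0, 110.0, 90.0])  # hypothetical raw feature

# Subtract the mean from every observation: the point cloud now sits at the origin.
centered = feature - feature.mean()

print(centered.mean())  # ~0, up to floating-point error
```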
Why scale the variance to one?
Dividing each centred feature by its standard deviation ((x_i - μ)/σ) forces the variance of every feature to be 1. This prevents features with larger numeric ranges from overwhelming those with smaller ranges. For example, consider two features:
Age: values in the range 10–70.
Salary: values in the range 10,000–70,000.
If fed directly to a model, the salary feature would appear roughly a thousand times more important simply because of its magnitude. After standardisation both features have comparable scale, allowing the model to learn from each on an equal footing.
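A sketch of this effect on two synthetic features in the ranges above (the values are illustrative; here salary is exactly 1000 × age, so the standardized versions coincide):

```python
import numpy as np

age = np.array([10.0, 25.0, 40.0, 55.0, 70.0])
salary = np.array([10_000.0, 25_000.0, 40_000.0, 55_000.0, 70_000.0])

def standardize(x):
    """Return the zero-mean, unit-variance version of x: (x - μ) / σ."""
    return (x - x.mean()) / x.std()

age_z = standardize(age)
salary_z = standardize(salary)

# After standardization both features live on the same scale:
print(age_z)     # identical to salary_z, since salary = 1000 * age
print(salary_z)
```

Standardization is invariant to affine rescaling of a feature, which is exactly why the factor-of-1000 difference in magnitude disappears.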
Visual effect of standardisation
Before standardisation a low‑amplitude feature (e.g., a red line) is almost invisible when plotted together with a high‑amplitude feature (e.g., a blue line). After applying the transformation (x-μ)/σ, the amplitudes become comparable and both series are clearly visible.
When to apply standardisation
Algorithms that rely on gradient‑based optimisation benefit most from standardised inputs, including:
Logistic regression
Feed‑forward neural networks
Support‑vector machines (when using linear or RBF kernels)
k‑Nearest Neighbours (distance‑based methods)
For tree‑based models (e.g., decision trees, random forests, gradient‑boosted trees) scaling is usually unnecessary because they are invariant to monotonic transformations of the features.
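For the distance-based methods above, the effect is concrete: without scaling, the salary coordinate dominates the Euclidean distance. A minimal sketch, where `mu` and `sigma` are hypothetical per-feature statistics standing in for values fitted on a full training set:

```python
import numpy as np

# Two points described by (age, salary).
a = np.array([25.0, 40_000.0])
b = np.array([30.0, 41_000.0])

raw_dist = np.linalg.norm(a - b)  # ≈ 1000: driven almost entirely by salary

# Assumed per-feature mean and standard deviation (illustrative values).
mu = np.array([40.0, 40_000.0])
sigma = np.array([15.0, 15_000.0])

# Standardize both points, then measure the distance again.
scaled_dist = np.linalg.norm((a - mu) / sigma - (b - mu) / sigma)
print(raw_dist, scaled_dist)  # both features now contribute comparably
```

Note that the same μ and σ must be applied to every point; in practice they are estimated once from the training data and reused for new observations.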
Summary
Standardisation transforms each feature x into a zero‑mean, unit‑variance variable z = (x - μ) / σ. Centering removes location bias, while scaling equalises the magnitude of all features. Together they improve optimisation dynamics and prevent any single feature from dominating the learning process.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
