Why Your Validation Set Fails: Outliers Are Skewing Your Data
The article explains how outliers can dramatically distort training and validation results in machine learning, outlines practical detection methods such as business rules, Z‑Score, IQR and Isolation Forest, and demonstrates cleaning techniques with a complete house‑price prediction case study in Python.
01 What Is an Outlier
An outlier is a data point that differs significantly from the majority of samples. It may be a data entry error or a rare but genuine observation, so not every "weird" value should be removed.
Why Outliers Affect Machine‑Learning Models
Many models are highly sensitive to extreme values:
Mean is pulled toward extremes.
Standard deviation inflates.
Linear regression can be "dragged" by a few outliers.
Distance‑based models (KNN, KMeans) have distorted distance structures.
Gradient‑based optimization becomes unstable.
The most dangerous aspect is not their count: even a handful of outliers can bias the model's overall judgment.
02 A Simple Illustrative Example
Consider five children’s heights (cm): four around 120 cm and one at 300 cm.
Mean
The mean becomes 157.2 cm, clearly unrealistic because the extreme value pulls it upward.
Median
Sorting the data gives a median close to the normal range, showing that the median is robust to outliers.
Mean is sensitive to outliers.
Median is more stable.
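A quick sketch makes the contrast concrete. The exact heights are illustrative; any four values near 120 cm that sum to 486 reproduce the stated mean of 157.2 cm:
import numpy as np

heights = np.array([118, 120, 122, 126, 300])  # the 300 cm entry is the outlier
print(heights.mean())      # 157.2 -- pulled far above every normal height
print(np.median(heights))  # 122.0 -- stays in the normal range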
03 How to Identify Outliers
There is no universal algorithm; the choice depends on three factors:
Business meaning.
Data distribution.
Model type.
1) Business‑Rule Based
Use domain knowledge, e.g., age cannot be <0 or >120, height cannot be 500 cm, sales cannot be negative, scores cannot exceed the maximum.
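A minimal pandas sketch of rule-based flagging; the column names and thresholds are hypothetical stand-ins for your own domain constraints:
import pandas as pd

df = pd.DataFrame({"age": [25, -3, 150, 40],
                   "height_cm": [170, 500, 165, 180]})
# Flag rows that violate hard business rules
bad = (df["age"] < 0) | (df["age"] > 120) | (df["height_cm"] > 250)
print(df[bad])       # inspect before deciding what to do
df_valid = df[~bad]  # keep only rows that pass every rule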
2) Statistical‑Distribution Based
Z‑Score: Flag a point if its distance from the mean exceeds a threshold (commonly 3 standard deviations). Works best when data are roughly normal.
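A minimal Z-Score sketch; the cutoff of 3 standard deviations is a common convention rather than a universal rule, and note that the mean and standard deviation used here are themselves inflated by the outliers:
import numpy as np

x = np.random.normal(50, 10, 1000)
x[:3] = [150, -40, 200]           # inject a few extremes
z = (x - x.mean()) / x.std()
print(x[np.abs(z) > 3])           # points more than 3 SDs from the mean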
IQR (Interquartile Range): Compute Q1 and Q3, set bounds at Q1 − 1.5·IQR and Q3 + 1.5·IQR. Points outside the bounds are outliers. This method is robust to non‑normal data.
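The same idea with IQR bounds, sketched on skewed synthetic data:
import numpy as np

x = np.random.exponential(5, 1000)          # right-skewed, non-normal
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = (x < lower) | (x > upper)
print(outliers.sum(), "points flagged")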
3) Model‑Based
When data are high‑dimensional, use algorithms such as Isolation Forest, Local Outlier Factor, One‑Class SVM, or DBSCAN. Isolation Forest is highlighted because it is easy to use via sklearn and works well in practice.
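The case study below walks through Isolation Forest in full; as a taste of another model-based detector, here is a minimal Local Outlier Factor sketch on synthetic 2-D data (the contamination value is an assumption you should tune):
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),    # dense cluster of normal points
               rng.uniform(-8, 8, (10, 2))])  # sparse scattered anomalies
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)                   # 1 = normal, -1 = outlier
print((labels == -1).sum(), "points flagged")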
04 How to Handle Outliers
1) Delete
Suitable when the outlier is a clear entry error, the dataset is large, the outlier proportion is tiny, and removal does not break the distribution.
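A minimal deletion sketch: flag with a Z-Score, then drop the flagged rows (synthetic data, threshold of 3 assumed):
import numpy as np
import pandas as pd

income = np.append(np.random.normal(5000, 800, 500), [95000, 120000])
df = pd.DataFrame({"income": income})
z = (df["income"] - df["income"].mean()) / df["income"].std()
df_clean = df[z.abs() <= 3].reset_index(drop=True)  # delete flagged rows
print(len(df), "->", len(df_clean))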
2) Truncate / Winsorize
Cap values below the 1st percentile to the 1st percentile and above the 99th percentile to the 99th percentile, preserving sample size while reducing extreme influence.
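A minimal winsorizing sketch using pandas clip, with the 1st/99th percentiles as the assumed cut points:
import numpy as np
import pandas as pd

s = pd.Series(np.random.exponential(1000, 2000))
low, high = s.quantile([0.01, 0.99])
s_wins = s.clip(lower=low, upper=high)  # cap the tails, keep every row
print(round(s.max(), 1), "->", round(s_wins.max(), 1))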
3) Replace with Statistics
Replace outliers with median, mean, group median, or a business‑defined value. Median is usually more stable for numeric features.
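A minimal sketch of median replacement, flagging with the IQR rule from above (the sensor-style values are made up):
import pandas as pd

s = pd.Series([105.0, 98.0, 110.0, 102.0, 9999.0])  # 9999 is a hypothetical bad reading
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
s_fixed = s.mask(is_outlier, s.median())            # swap flagged values for the median
print(s_fixed.tolist())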
4) Transform
Apply log transformation to right‑skewed data such as income, amount, views, or sales, compressing large values while retaining differences among smaller ones.
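A minimal log-transform sketch with log1p, i.e. log(1 + x), which stays defined at zero; the income figures are made up:
import numpy as np
import pandas as pd

income = pd.Series([3000, 4500, 5200, 8000, 250000])  # right-skewed, hypothetical
income_log = np.log1p(income)
print(income_log.round(2).tolist())  # the 250000 extreme is compressed, order preserved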
5) Use Outlier‑Resistant Models
Tree‑based models, RobustScaler instead of StandardScaler, or robust loss functions (Huber, MAE) can mitigate outlier impact without explicit removal.
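A minimal sketch contrasting an ordinary least-squares fit with sklearn's HuberRegressor on synthetic data containing a few extreme targets; parameters are library defaults, not recommendations:
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, (200, 1))
y = 3 * X.ravel() + rng.normal(0, 1, 200)
y[:5] += 100                        # a handful of extreme targets

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)  # Huber loss down-weights large residuals
print("OLS slope:", round(ols.coef_[0], 2), "Huber slope:", round(huber.coef_[0], 2))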
05 Complete Case Study: House‑Price Prediction
Features:
area: house area
rooms: number of rooms
age: house age
distance: distance to city center
income_level: surrounding income level
price: target house price
We inject outliers (huge area, extreme price, absurd distance, abnormal age) and compare model performance before and after cleaning.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import Pipeline
# 1. Data generation
np.random.seed(42)
n = 1500
area = np.random.normal(100, 20, n).clip(40, 180)
rooms = np.random.choice([1,2,3,4,5], size=n, p=[0.1,0.25,0.35,0.2,0.1])
age = np.random.normal(12, 6, n).clip(0, 35)
distance = np.random.normal(8, 3, n).clip(0.5, 25)
income_level = np.random.normal(60, 15, n).clip(20, 120)
price = (area*18000 + rooms*80000 - age*15000 - distance*30000 + income_level*12000 +
         np.random.normal(0, 150000, n))
df = pd.DataFrame({"area": area, "rooms": rooms, "age": age,
                   "distance": distance, "income_level": income_level,
                   "price": price})
# 2. Inject outliers
outlier_idx = np.random.choice(df.index, 20, replace=False)
df.loc[outlier_idx[:5], "area"] = np.random.uniform(300, 800, 5)
df.loc[outlier_idx[5:10], "price"] = np.random.uniform(15000000, 40000000, 5)
df.loc[outlier_idx[10:15], "distance"] = np.random.uniform(40, 100, 5)
df.loc[outlier_idx[15:20], "age"] = np.random.uniform(50, 120, 5)
df["price"] = df["price"].clip(200000, None)
# 3. Visualize raw data
plt.figure()
sns.scatterplot(data=df, x="area", y="price", hue="rooms", size="income_level",
                palette="bright", alpha=0.8)
plt.title("Area vs Price (with outliers)")
plt.show()
# 4. Isolation Forest detection
features = ["area", "rooms", "age", "distance", "income_level", "price"]
iso = IsolationForest(n_estimators=200, contamination=0.04, random_state=42)
df["outlier_flag"] = iso.fit_predict(df[features]) # 1 = normal, -1 = outlier
# 5. Split raw vs cleaned data
df_clean = df[df["outlier_flag"] == 1].copy()
X_raw = df[["area", "rooms", "age", "distance", "income_level"]]
y_raw = df["price"]
X_clean = df_clean[["area", "rooms", "age", "distance", "income_level"]]
y_clean = df_clean["price"]
# 6. Modeling function
def train_and_evaluate(X, y, title="dataset"):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = Pipeline([
        ("scaler", RobustScaler()),
        ("rf", RandomForestRegressor(n_estimators=300, max_depth=8, random_state=42))
    ])
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    r2 = r2_score(y_test, pred)
    print(f"{title} -> MAE: {mae:.2f}, R2: {r2:.4f}")
    return model, X_test, y_test, pred
model_raw, X_test_raw, y_test_raw, pred_raw = train_and_evaluate(X_raw, y_raw, "Raw Data")
model_clean, X_test_clean, y_test_clean, pred_clean = train_and_evaluate(X_clean, y_clean, "Cleaned Data")
# 7. Visualize predictions
plt.figure(figsize=(10,6))
plt.scatter(y_test_raw, pred_raw, c="#FF5722", alpha=0.7, s=70, edgecolors="k")
plt.plot([y_test_raw.min(), y_test_raw.max()], [y_test_raw.min(), y_test_raw.max()], 'b--', lw=2)
plt.title("Raw Data: True vs Predicted Price")
plt.xlabel("True Price")
plt.ylabel("Predicted Price")
plt.show()
plt.figure(figsize=(10,6))
plt.scatter(y_test_clean, pred_clean, c="#00BCD4", alpha=0.7, s=70, edgecolors="k")
plt.plot([y_test_clean.min(), y_test_clean.max()], [y_test_clean.min(), y_test_clean.max()], 'r--', lw=2)
plt.title("Cleaned Data: True vs Predicted Price")
plt.xlabel("True Price")
plt.ylabel("Predicted Price")
plt.show()
# 8. Residual distribution comparison
res_raw = y_test_raw - pred_raw
res_clean = y_test_clean - pred_clean
plt.figure(figsize=(12,6))
sns.kdeplot(res_raw, fill=True, color="#E91E63", label="Raw Residuals", alpha=0.5)
sns.kdeplot(res_clean, fill=True, color="#2196F3", label="Cleaned Residuals", alpha=0.5)
plt.title("Residual Distribution Comparison")
plt.legend()
plt.show()
# 9. Feature importance (cleaned model)
rf_model = model_clean.named_steps["rf"]
importances = rf_model.feature_importances_
feature_names = X_clean.columns
imp_df = pd.DataFrame({"feature": feature_names, "importance": importances}).sort_values("importance", ascending=False)
plt.figure(figsize=(10,6))
sns.barplot(data=imp_df, x="importance", y="feature", palette="viridis")
plt.title("Feature Importance after Cleaning")
plt.show()
The visualizations show that outliers inflate the mean, distort correlations (e.g., area‑price, distance‑price), and degrade model performance. After detecting and removing outliers with Isolation Forest, the MAE drops, R² improves, residuals become tighter around zero, and feature importance aligns with domain expectations (area and income level are most important).
Summary
In machine‑learning projects, outlier handling typically follows three steps:
Use business rules and visual tools (box plots, scatter plots, quantiles) to spot obvious anomalies.
Choose a treatment based on the outlier’s nature: delete obvious errors, truncate or transform extreme but valid values, or adopt robust models.
Compare model performance before and after cleaning (MAE, R², residual distribution, feature importance) to confirm that the data cleaning improved stability and generalization.
When data are high‑dimensional, model‑based detectors like Isolation Forest are especially useful for automated outlier identification.