Why Your Validation Set Fails: Outliers Are Skewing Your Data

The article explains how outliers can dramatically distort training and validation results in machine learning, outlines practical detection methods such as business rules, Z‑Score, IQR and Isolation Forest, and demonstrates cleaning techniques with a complete house‑price prediction case study in Python.


01 What Is an Outlier

An outlier is a data point that differs significantly from the majority of samples. It may be a data entry error or a rare but genuine observation, so not every "weird" value should be removed.

Why Outliers Affect Machine‑Learning Models

Many models are highly sensitive to extreme values:

Mean is pulled toward extremes.

Standard deviation inflates.

Linear regression can be "dragged" by a few outliers.

Distance‑based models (KNN, KMeans) have distorted distance structures.

Gradient‑based optimization becomes unstable.

The most dangerous aspect is not how many outliers there are; even a few of them can bias the model's overall judgment.

02 A Simple Illustrative Example

Consider five children’s heights (cm): four around 120 cm and one at 300 cm, say 118, 120, 122, 126, and 300.

Mean

The mean is (118 + 120 + 122 + 126 + 300) / 5 = 157.2 cm, clearly unrealistic: the single extreme value pulls it upward.

Median

Sorting the data and taking the middle value gives a median in the normal range (around 120 cm), showing that the median is robust to outliers.

Mean is sensitive to outliers.

Median is more stable.
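The contrast is easy to verify in a couple of lines; the heights below are illustrative values consistent with the 157.2 cm mean above:

```python
import numpy as np

# Illustrative heights (cm): four normal children and one extreme value
heights = np.array([118, 120, 122, 126, 300])

print("Mean:", heights.mean())        # dragged far above the typical range
print("Median:", np.median(heights))  # unaffected by the single extreme value
```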

03 How to Identify Outliers

There is no universal algorithm; the choice depends on three factors:

Business meaning.

Data distribution.

Model type.

1) Business‑Rule Based

Use domain knowledge, e.g., age cannot be <0 or >120, height cannot be 500 cm, sales cannot be negative, scores cannot exceed the maximum.

2) Statistical‑Distribution Based

Z‑Score: Flag a point if its distance from the mean exceeds a threshold (commonly 3 standard deviations). Works best when data are roughly normal.

IQR (Interquartile Range): Compute Q1 and Q3, then set bounds at Q1 − 1.5·IQR and Q3 + 1.5·IQR. Points outside these bounds are outliers. This method is robust to non‑normal data.
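Both statistical rules fit in a few lines of NumPy; the synthetic data and thresholds below are illustrative:

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > threshold

def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(100, 10, 500), [300.0, -50.0]])  # two injected outliers
print("Z-Score flags:", zscore_outliers(x).sum())
print("IQR flags:", iqr_outliers(x).sum())
```

Note that the outliers themselves inflate the mean and standard deviation, so the Z‑Score rule can miss moderate outliers; the IQR rule, built from quantiles, is less affected by this.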

3) Model‑Based

When data are high‑dimensional, use algorithms such as Isolation Forest, Local Outlier Factor, One‑Class SVM, or DBSCAN. Isolation Forest is highlighted because it is easy to use via sklearn and works well in practice.
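As a sketch of the model‑based route, Local Outlier Factor can flag a small group of points that sits far from the bulk of the data (synthetic data, illustrative parameters):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
normal = rng.normal(0, 1, (300, 4))   # bulk of the data
far = rng.uniform(8, 10, (6, 4))      # six points far from the bulk
X = np.vstack([normal, far])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
labels = lof.fit_predict(X)           # -1 = outlier, 1 = normal
print("Flagged:", (labels == -1).sum())
```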

04 How to Handle Outliers

1) Delete

Suitable when the outlier is a clear entry error, the dataset is large, the outlier proportion is tiny, and removal does not break the distribution.

2) Truncate / Winsorize

Cap values below the 1st percentile to the 1st percentile and above the 99th percentile to the 99th percentile, preserving sample size while reducing extreme influence.
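A minimal winsorizing sketch with pandas, using the 1st/99th percentile caps described above (the lognormal data is illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
s = pd.Series(rng.lognormal(3, 1, 1000))     # right-skewed data with a long upper tail

lo, hi = s.quantile(0.01), s.quantile(0.99)  # 1st and 99th percentiles
s_wins = s.clip(lower=lo, upper=hi)          # cap rather than drop

print("Rows kept:", len(s_wins))
print("Max before / after:", s.max(), s_wins.max())
```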

3) Replace with Statistics

Replace outliers with median, mean, group median, or a business‑defined value. Median is usually more stable for numeric features.
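For example, replacing flagged values with the median of the clean values (the threshold here is a hypothetical business rule, not from the article):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 500, 9])    # 500 looks like an entry error
mask = s > 100                             # hypothetical business-rule threshold
s_fixed = s.mask(mask, s[~mask].median())  # replace flagged values with the clean median
print(s_fixed.tolist())
```

Computing the median over `s[~mask]` matters: including the outlier itself would shift the replacement value.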

4) Transform

Apply log transformation to right‑skewed data such as income, amount, views, or sales, compressing large values while retaining differences among smaller ones.
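A quick sketch of the compression effect with NumPy (the income values are illustrative):

```python
import numpy as np

income = np.array([30_000, 45_000, 60_000, 120_000, 5_000_000])  # one extreme earner

log_income = np.log1p(income)  # log1p(x) = log(1 + x), safe when zeros are present
print("Raw spread:", income.max() / income.min())
print("Log spread:", log_income.max() / log_income.min())
```

The extreme value goes from dominating the scale by two orders of magnitude to differing by less than a factor of two, while the ordering of the smaller values is preserved.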

5) Use Outlier‑Resistant Models

Tree‑based models, RobustScaler instead of StandardScaler, or robust loss functions (Huber, MAE) can mitigate outlier impact without explicit removal.
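A sketch of the robust-loss option using scikit-learn's HuberRegressor on synthetic data (this is an illustration, not the article's case-study code):

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (200, 1))
y = 3 * X.ravel() + rng.normal(0, 1, 200)  # true slope = 3
y[X.ravel() > 9] += 100                    # corrupt the high end with extreme values

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)
print("OLS slope:", ols.coef_[0])      # inflated by the corrupted points
print("Huber slope:", huber.coef_[0])  # stays much closer to the true value
```

The squared loss lets the corrupted points dominate the fit; the Huber loss grows only linearly for large residuals, so their influence is bounded.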

05 Complete Case Study: House‑Price Prediction

Features:

area: house area

rooms: number of rooms

age: house age

distance: distance to city center

income_level: surrounding income level

price: target house price

We inject outliers (huge area, extreme price, absurd distance, abnormal age) and compare model performance before and after cleaning.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import Pipeline

# 1. Data generation
np.random.seed(42)
n = 1500
area = np.random.normal(100, 20, n).clip(40, 180)
rooms = np.random.choice([1,2,3,4,5], size=n, p=[0.1,0.25,0.35,0.2,0.1])
age = np.random.normal(12, 6, n).clip(0, 35)
distance = np.random.normal(8, 3, n).clip(0.5, 25)
income_level = np.random.normal(60, 15, n).clip(20, 120)
price = (area*18000 + rooms*80000 - age*15000 - distance*30000 + income_level*12000 +
         np.random.normal(0, 150000, n))

df = pd.DataFrame({"area": area, "rooms": rooms, "age": age,
                   "distance": distance, "income_level": income_level,
                   "price": price})

# 2. Inject outliers
outlier_idx = np.random.choice(df.index, 20, replace=False)
df.loc[outlier_idx[:5], "area"] = np.random.uniform(300, 800, 5)
df.loc[outlier_idx[5:10], "price"] = np.random.uniform(15000000, 40000000, 5)
df.loc[outlier_idx[10:15], "distance"] = np.random.uniform(40, 100, 5)
df.loc[outlier_idx[15:20], "age"] = np.random.uniform(50, 120, 5)
df["price"] = df["price"].clip(200000, None)

# 3. Visualize raw data
plt.figure()
sns.scatterplot(data=df, x="area", y="price", hue="rooms", size="income_level",
                palette="bright", alpha=0.8)
plt.title("Area vs Price (with outliers)")
plt.show()

# 4. Isolation Forest detection
features = ["area", "rooms", "age", "distance", "income_level", "price"]
iso = IsolationForest(n_estimators=200, contamination=0.04, random_state=42)
df["outlier_flag"] = iso.fit_predict(df[features])  # 1 = normal, -1 = outlier

# 5. Split raw vs cleaned data
df_clean = df[df["outlier_flag"] == 1].copy()
X_raw = df[["area", "rooms", "age", "distance", "income_level"]]
y_raw = df["price"]
X_clean = df_clean[["area", "rooms", "age", "distance", "income_level"]]
y_clean = df_clean["price"]

# 6. Modeling function
def train_and_evaluate(X, y, title="dataset"):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = Pipeline([
        ("scaler", RobustScaler()),
        ("rf", RandomForestRegressor(n_estimators=300, max_depth=8, random_state=42))
    ])
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    r2 = r2_score(y_test, pred)
    print(f"{title} -> MAE: {mae:.2f}, R2: {r2:.4f}")
    return model, X_test, y_test, pred

model_raw, X_test_raw, y_test_raw, pred_raw = train_and_evaluate(X_raw, y_raw, "Raw Data")
model_clean, X_test_clean, y_test_clean, pred_clean = train_and_evaluate(X_clean, y_clean, "Cleaned Data")

# 7. Visualize predictions
plt.figure(figsize=(10,6))
plt.scatter(y_test_raw, pred_raw, c="#FF5722", alpha=0.7, s=70, edgecolors="k")
plt.plot([y_test_raw.min(), y_test_raw.max()], [y_test_raw.min(), y_test_raw.max()], 'b--', lw=2)
plt.title("Raw Data: True vs Predicted Price")
plt.xlabel("True Price")
plt.ylabel("Predicted Price")
plt.show()

plt.figure(figsize=(10,6))
plt.scatter(y_test_clean, pred_clean, c="#00BCD4", alpha=0.7, s=70, edgecolors="k")
plt.plot([y_test_clean.min(), y_test_clean.max()], [y_test_clean.min(), y_test_clean.max()], 'r--', lw=2)
plt.title("Cleaned Data: True vs Predicted Price")
plt.xlabel("True Price")
plt.ylabel("Predicted Price")
plt.show()

# 8. Residual distribution comparison
res_raw = y_test_raw - pred_raw
res_clean = y_test_clean - pred_clean
plt.figure(figsize=(12,6))
sns.kdeplot(res_raw, fill=True, color="#E91E63", label="Raw Residuals", alpha=0.5)
sns.kdeplot(res_clean, fill=True, color="#2196F3", label="Cleaned Residuals", alpha=0.5)
plt.title("Residual Distribution Comparison")
plt.legend()
plt.show()

# 9. Feature importance (cleaned model)
rf_model = model_clean.named_steps["rf"]
importances = rf_model.feature_importances_
feature_names = X_clean.columns
imp_df = pd.DataFrame({"feature": feature_names, "importance": importances}).sort_values("importance", ascending=False)
plt.figure(figsize=(10,6))
sns.barplot(data=imp_df, x="importance", y="feature", palette="viridis")
plt.title("Feature Importance after Cleaning")
plt.show()

The visualizations show that outliers inflate the mean, distort correlations (e.g., area‑price, distance‑price), and degrade model performance. After detecting and removing outliers with Isolation Forest, the MAE drops, R² improves, residuals become tighter around zero, and feature importance aligns with domain expectations (area and income level are most important).

Summary

In machine‑learning projects, outlier handling typically follows three steps:

Use business rules and visual tools (box plots, scatter plots, quantiles) to spot obvious anomalies.

Choose a treatment based on the outlier’s nature: delete obvious errors, truncate or transform extreme but valid values, or adopt robust models.

Compare model performance before and after cleaning (MAE, R², residual distribution, feature importance) to confirm that the data cleaning improved stability and generalization.

When data are high‑dimensional, model‑based detectors like Isolation Forest are especially useful for automated outlier identification.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: machine learning, Python, data cleaning, outlier detection, scikit-learn, Isolation Forest
Written by

IT Services Circle

Delivering cutting-edge internet insights and practical learning resources. We're a passionate and principled IT media platform.
