Mastering Outlier Detection: Techniques, Algorithms, and PyOD Implementation
Outlier detection identifies data points that lie far from the rest of the data. This article covers simple methods such as the 3‑sigma rule, boxplots, and k‑nearest neighbors, surveys the probabilistic, proximity‑based, and ensemble algorithms available in PyOD, and walks through practical code for training, evaluating, and visualizing detectors.
Outlier detection (also called anomaly detection) refers to identifying observations that are far from the majority of data points. It can significantly affect models such as linear and logistic regression, as well as ensemble methods like AdaBoost.
Common Outlier Detection Methods
One simple approach assumes data follow a known distribution (e.g., Gaussian) and visualizes the data with scatter plots or boxplots for small datasets (up to ~10k observations and 100 features). For high‑dimensional data, visualization is less effective.
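As a quick illustration of the visual approach, a boxplot makes a single extreme value stand out immediately. The sketch below uses matplotlib and synthetic data (the injected value 10.0 is an assumption made for the demo):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# 100 standard-normal points plus one injected extreme value
data = np.append(rng.normal(0, 1, 100), 10.0)

fig, ax = plt.subplots()
result = ax.boxplot(data)
# points drawn beyond the whiskers are the candidate outliers ("fliers")
fliers = result["fliers"][0].get_ydata()
print(fliers)  # the injected 10.0 appears among the fliers
```

Matplotlib's `boxplot` returns the flier artists directly, so the flagged points can be read back programmatically rather than only inspected by eye.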
3‑Sigma Rule
Based on the normal distribution, the 3‑sigma rule treats points beyond three standard deviations as outliers.
<code>import numpy as np

def three_sigma(s):
    """Return the 3-sigma bounds; values outside them are flagged as outliers."""
    mu, std = np.mean(s), np.std(s)
    lower, upper = mu - 3 * std, mu + 3 * std
    return lower, upper
</code>
Boxplot (IQR Method)
The boxplot uses the interquartile range (IQR) to define lower and upper bounds; values outside are considered outliers.
<code>def boxplot(s):
    """Return the IQR bounds for a pandas Series; values outside them are outliers."""
    q1, q3 = s.quantile(.25), s.quantile(.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return lower, upper
</code>
K‑Nearest Neighbors (KNN)
KNN computes the average distance from each sample to its k nearest neighbors and flags points whose average distance exceeds a threshold. It makes no assumptions about the data distribution, but it detects only global outliers.
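The idea can be sketched in plain NumPy. The `knn_scores` helper below is a hypothetical illustration, not PyOD's implementation: compute pairwise distances, average each point's k smallest non-self distances, and treat the largest scores as outliers.

```python
import numpy as np

def knn_scores(X, k=5):
    """Average Euclidean distance from each point to its k nearest neighbors."""
    # pairwise distance matrix, shape (n, n)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # sort each row; column 0 is the zero distance to the point itself, so skip it
    d.sort(axis=1)
    return d[:, 1:k + 1].mean(axis=1)

rng = np.random.default_rng(42)
# a tight 2-D cluster plus one planted outlier at index 100
X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])
scores = knn_scores(X, k=5)
print(scores.argmax())  # the planted outlier has the highest score
```

Thresholding `scores` (e.g., keeping the top contamination fraction) then yields binary outlier labels, which is essentially what library detectors automate.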
The PyOD library provides implementations of many outlier detection algorithms, including probabilistic methods (ECOD, ABOD, FastABOD, etc.), linear models (PCA, MCD, OCSVM, etc.), proximity‑based methods (LOF, HBOS, kNN, etc.), ensemble methods (Isolation Forest, Feature Bagging, etc.), and neural‑network approaches (AutoEncoder, VAE, GAN‑based models).
PyOD Quick Start (KNN Example)
<code>from pyod.models.knn import KNN  # kNN detector
from pyod.utils.data import generate_data

# generate synthetic data with 10% outliers
X_train, X_test, y_train, y_test = generate_data(
    n_train=200, n_test=100, contamination=0.1)

# train kNN detector
clf_name = 'KNN'
clf = KNN()
clf.fit(X_train)
# get prediction labels and outlier scores for training data
y_train_pred = clf.labels_ # 0: inlier, 1: outlier
y_train_scores = clf.decision_scores_
# predict on test data
y_test_pred = clf.predict(X_test) # 0 or 1
y_test_scores = clf.decision_function(X_test)
# optionally obtain confidence
y_test_pred, y_test_pred_confidence = clf.predict(X_test, return_confidence=True)
</code>
<code>from pyod.utils.data import evaluate_print
print("\nOn Training Data:")
evaluate_print(clf_name, y_train, y_train_scores)
print("\nOn Test Data:")
evaluate_print(clf_name, y_test, y_test_scores)
</code>
<code>from pyod.utils.example import visualize

# visualize the results
visualize(clf_name, X_train, y_train, X_test, y_test,
y_train_pred, y_test_pred,
show_figure=True, save_figure=False)
</code>
References: original article and the PyOD GitHub repository.
Model Perspective
Insights, knowledge, and enjoyment from a mathematical modeling researcher and educator. Hosted by Haihua Wang, a modeling instructor and author of "Clever Use of Chat for Mathematical Modeling", "Modeling: The Mathematics of Thinking", "Mathematical Modeling Practice: A Hands‑On Guide to Competitions", and co‑author of "Mathematical Modeling: Teaching Design and Cases".