Comprehensive Overview of Common Anomaly Detection Methods with Code Examples
This article compiles and explains a variety of common anomaly detection techniques—including distribution‑based, distance‑based, density‑based, clustering, tree‑based, dimensionality‑reduction, classification, and prediction methods—providing algorithm descriptions, workflow steps, advantages, limitations, and ready‑to‑run Python code snippets for each approach.
The article gathers a broad set of anomaly detection algorithms frequently used in data analysis and machine‑learning tasks. Each method is introduced with its theoretical basis, typical workflow, strengths, and drawbacks, followed by concise Python implementations.
1. Distribution‑Based Methods
3‑Sigma : assumes the data follow a normal distribution; points more than three standard deviations from the mean are outliers. Example implementation:

import numpy as np

def three_sigma(s):
    mu, std = np.mean(s), np.std(s)
    lower, upper = mu - 3*std, mu + 3*std
    return lower, upper

Z‑Score : the standard score of each point; a threshold of 3 mirrors the 3‑sigma rule.
def z_score(s):
    return (s - np.mean(s)) / np.std(s)

Boxplot (IQR) : uses the inter‑quartile range to define lower and upper bounds; points outside them are outliers.
def boxplot(s):
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
    return lower, upper

Grubbs' Test : a hypothesis test for a single outlier in a normally distributed sample. The procedure: sort the data, compute the mean and standard deviation, then compare the test statistic of the most extreme value against a critical value.
# `outliers` is provided by the outlier_utils package
from outliers import smirnov_grubbs as grubbs

print(grubbs.test([8, 9, 10, 1, 9], alpha=0.05))

2. Distance‑Based Methods
K‑Nearest Neighbors (KNN) : average distance to the K nearest points; large distance indicates an outlier.
from pyod.models.knn import KNN

clf = KNN(method='mean', n_neighbors=3)
clf.fit(X_train)
y_train_pred = clf.labels_   # 1 marks predicted outliers

3. Density‑Based Methods
Local Outlier Factor (LOF) : compares local density of a point to that of its neighbors.
from sklearn.neighbors import LocalOutlierFactor as LOF

clf = LOF(n_neighbors=2)
res = clf.fit_predict(X)   # -1 marks predicted outliers

Connectivity‑Based Outlier Factor (COF) : similar to LOF, but estimates local density from the average chaining distance; implemented in pyod.models.cof .
4. Clustering‑Based Methods
DBSCAN : points not belonging to any dense cluster are labeled as noise (outliers).
from sklearn.cluster import DBSCAN

clustering = DBSCAN(eps=3, min_samples=2).fit(X)
labels = clustering.labels_   # -1 marks noise points (outliers)

5. Tree‑Based Methods
Isolation Forest : builds random trees; points isolated with short path lengths are considered anomalies.
from sklearn.ensemble import IsolationForest

iforest = IsolationForest(n_estimators=100, contamination=0.05)
iforest.fit(X)
labels = iforest.predict(X)   # -1 marks predicted anomalies

6. Dimensionality‑Reduction Methods
Principal Component Analysis (PCA) : evaluates reconstruction error or deviation along principal components to flag outliers.
import numpy as np
from sklearn.decomposition import PCA

pca = PCA()
transformed = pca.fit_transform(X)
# anomaly score: squared deviation along each component,
# weighted by the inverse of its eigenvalue (explained variance)
scores = np.sum(transformed**2 / pca.explained_variance_, axis=1)

AutoEncoder : trains a neural network to reconstruct normal data; a high reconstruction error signals an anomaly.
import numpy as np
import tensorflow as tf

input_dim = X_train.shape[1]   # number of features
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(2, activation='relu'),    # bottleneck layer
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(input_dim)
])
model.compile(loss='mse', optimizer='adam')
model.fit(X_train, X_train, epochs=100, batch_size=10)
recon_error = np.mean(np.abs(model.predict(X_test) - X_test), axis=1)

7. Classification‑Based Methods
One‑Class SVM : learns a boundary that encloses the majority of data; points outside are outliers.
from sklearn import svm

clf = svm.OneClassSVM(nu=0.1, kernel='rbf', gamma=0.1)
clf.fit(X)
labels = clf.predict(X)   # -1 marks predicted outliers

8. Prediction‑Based Methods
For time series, fit a model to predict future values, compute the residuals between predictions and observations, and apply a statistical threshold (e.g., K‑sigma) to the residuals to detect anomalies.
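A minimal sketch of this recipe, using a moving‑average forecast and a 3‑sigma residual threshold (the window size, threshold, and toy series are illustrative assumptions):

```python
import numpy as np

def residual_outliers(series, window=5, k=3.0):
    """Flag points whose residual from a one-step-ahead
    moving-average forecast exceeds k standard deviations."""
    series = np.asarray(series, dtype=float)
    # forecast each point as the mean of the previous `window` points
    preds = np.array([series[i - window:i].mean()
                      for i in range(window, len(series))])
    residuals = series[window:] - preds
    mu, std = residuals.mean(), residuals.std()
    flags = np.abs(residuals - mu) > k * std
    return np.where(flags)[0] + window   # indices in the original series

# usage: a flat series with a single spike at index 12
data = [10.0] * 20
data[12] = 30.0
print(residual_outliers(data))   # → [12]
```

Any forecaster can stand in for the moving average (ARIMA, exponential smoothing, a neural network); only the residual‑thresholding step changes hands.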
Conclusion
The survey categorizes anomaly detection techniques into distribution, distance, density, clustering, tree, dimensionality‑reduction, classification, and prediction families, highlighting their algorithms, typical use‑cases, pros and cons, and providing ready‑to‑run Python code for each.