Various Anomaly Detection Techniques with Python Code Examples
This article introduces ten common anomaly detection approaches—including statistical thresholds, boxplots, clustering, isolation forest, LOF, collaborative filtering, robust covariance, NLP, computer‑vision, and time‑series methods—each accompanied by concise Python code snippets illustrating how to identify outliers in different data domains.
Anomaly detection is a crucial task in data analysis, used to identify data points that deviate from normal patterns in fields such as finance, manufacturing, and cybersecurity.
1. Statistical method : Detect outliers by computing the mean and standard deviation and flagging points that lie more than a chosen number of standard deviations from the mean.
import numpy as np
# Sample data
data = [1, 2, 3, 100, 5, 6, 7]
mean = np.mean(data)
std = np.std(data)
# With only 7 points, a 3-sigma rule would not flag 100 (mean ≈ 17.7, std ≈ 33.7), so use k = 2
k = 2
lower = mean - k * std
upper = mean + k * std
outliers = [x for x in data if x < lower or x > upper]
print("Outliers:", outliers)
2. Box‑plot method : Use a box‑plot (or quartiles) to spot extreme values.
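Besides visual inspection, the quartile rule behind the box plot can flag outliers numerically. A minimal sketch using the standard 1.5×IQR (Tukey) fences on the same sample data:

```python
import numpy as np

data = [1, 2, 3, 100, 5, 6, 7]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
# Tukey's fences: points beyond 1.5 * IQR from the quartiles are outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
print("Outliers:", outliers)  # → [100]
```

Because the quartiles are barely affected by a single extreme value, this rule flags 100 even though the 3-sigma rule above would not.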
import matplotlib.pyplot as plt
# Sample data
data = [1, 2, 3, 100, 5, 6, 7]
plt.boxplot(data)
plt.show()
3. Clustering‑based detection : Apply K‑means to group data and treat points belonging to a small cluster as anomalies.
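A density-based variant of the same idea is DBSCAN, which marks points that fall in no dense region as noise (label -1) and does not require choosing the number of clusters up front. A sketch, with `eps` tuned by hand to this toy data (an assumption, not a general default):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 4], [3, 6], [100, 200], [5, 10], [6, 12], [7, 14]])
# eps roughly matches the spacing between neighboring normal points
labels = DBSCAN(eps=5, min_samples=2).fit_predict(X)
# DBSCAN assigns -1 to points belonging to no dense cluster
outliers = X[labels == -1]
print("Outliers:", outliers)  # flags [100, 200]
```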
import numpy as np
from sklearn.cluster import KMeans
X = [[1, 2], [2, 4], [3, 6], [100, 200], [5, 10], [6, 12], [7, 14]]
model = KMeans(n_clusters=2, n_init=10, random_state=0)
model.fit(X)
labels = model.labels_
# The larger cluster is taken as "normal"; points in the smaller cluster are outliers
majority = np.argmax(np.bincount(labels))
outliers = [X[i] for i in range(len(X)) if labels[i] != majority]
print("Outliers:", outliers)
4. Isolation Forest : Build an isolation forest model to separate anomalous points from normal ones.
from sklearn.ensemble import IsolationForest
X = [[1, 2], [2, 4], [3, 6], [100, 200], [5, 10], [6, 12], [7, 14]]
model = IsolationForest()
model.fit(X)
scores = model.decision_function(X)
outliers = [X[i] for i in range(len(X)) if scores[i] < 0]
print("Outliers:", outliers)
5. Local Outlier Factor (LOF) : Fit a LOF model; fit_predict labels outliers as -1.
from sklearn.neighbors import LocalOutlierFactor
X = [[1, 2], [2, 4], [3, 6], [100, 200], [5, 10], [6, 12], [7, 14]]
model = LocalOutlierFactor()
scores = model.fit_predict(X)
outliers = [X[i] for i in range(len(X)) if scores[i] == -1]
print("Outliers:", outliers)
6. Collaborative‑filtering detection : Use a KNN‑based recommender to find ratings that deviate sharply from their predicted values.
import pandas as pd
from surprise import Dataset, Reader, KNNWithMeans
# Assume a DataFrame with columns user_id, item_id, rating
data = pd.read_csv("data.csv")
reader = Reader(rating_scale=(1, 5))
dataset = Dataset.load_from_df(data[["user_id", "item_id", "rating"]], reader)
trainset = dataset.build_full_trainset()
model = KNNWithMeans()
model.fit(trainset)
predictions = model.test(trainset.build_testset())
outliers = [p for p in predictions if abs(p.est - p.r_ui) > 2]
print("Anomalous ratings:", outliers)
7. Robust Covariance / One‑Class SVM : Apply EllipticEnvelope or OneClassSVM to compute anomaly scores.
from sklearn.covariance import EllipticEnvelope
X = [[1, 2], [2, 4], [3, 6], [100, 200], [5, 10], [6, 12], [7, 14]]
model = EllipticEnvelope(contamination=0.1)
model.fit(X)
scores = model.decision_function(X)
outliers = [X[i] for i in range(len(X)) if scores[i] < 0]
print("Outliers:", outliers)
8. Text anomaly detection : Use NLTK to tokenize text and flag low‑frequency words as unusual.
import nltk
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
nltk.download("punkt", quiet=True)  # tokenizer models (one-time download)
text = "This is an example sentence with some abnormal words like zxy and abc."
tokens = word_tokenize(text)
fdist = FreqDist(tokens)
# Words occurring fewer than `threshold` times are treated as unusual
# (meaningful only over a realistically sized corpus)
threshold = 2
outliers = [word for word, freq in fdist.items() if freq < threshold]
print("Rare words:", outliers)
9. Image anomaly detection : Convert an image to grayscale and flag pixels whose intensity deviates strongly from the mean.
import cv2
import numpy as np
image = cv2.imread("image.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
mean, std = np.mean(gray), np.std(gray)
# Flag pixel coordinates whose intensity deviates more than 2 standard deviations from the mean
outliers = np.where(np.abs(gray.astype(np.float64) - mean) > 2 * std)
print("Anomalous pixel coordinates:", outliers)
10. Time‑series anomaly detection : Decompose a series with STL, then treat residuals beyond a threshold as outliers.
import pandas as pd
from statsmodels.tsa.seasonal import STL
data = pd.read_csv("data.csv")
data["date"] = pd.to_datetime(data["date"])
data.set_index("date", inplace=True)
# period must match the series' seasonal length (12 assumes monthly data)
stl = STL(data["value"], period=12, seasonal=13)
result = stl.fit()
residuals = result.resid
# Flag residuals more than 2 standard deviations from zero
threshold = 2 * residuals.std()
outliers = residuals[abs(residuals) > threshold]
print("Outliers:", outliers)
These code snippets illustrate how anomaly detection can be applied across different data types and domains, allowing practitioners to choose the most suitable technique for their specific needs.
Test Development Learning Exchange