Mastering Anomaly vs Novelty Detection with Distribution Fitting in Python
This article explains the fundamental differences between anomaly and novelty detection, outlines how to model univariate outliers using probability distribution fitting with the distfit library, and demonstrates the workflow on synthetic height data and real natural‑gas price data, including model selection, visualization, and prediction.
Anomaly vs. Novelty: Core Concepts
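Both anomalies and novelties are observations that deviate from the normal pattern, but the two tasks make different assumptions about the training data. Anomaly detection assumes that outliers are already present in the training set: the model is fit to the bulk of the data (the inliers), and anything that falls outside the learned range is flagged as anomalous. Novelty detection assumes that no outliers are present during training, so the model learns only what normal looks like and is then used to judge new, unseen observations; domain knowledge is required to define the normal boundaries.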
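The distinction can be sketched in a few lines. This is a minimal illustration that uses only the Python standard library rather than distfit, assumes a normal model, and runs on made-up height values; the helper name tail_probability and the 0.01 threshold are assumptions chosen for the example.

```python
import random
from statistics import NormalDist

ALPHA = 0.01  # assumed significance level for this sketch

def tail_probability(model: NormalDist, value: float) -> float:
    """Two-sided probability of a value at least this far into either tail."""
    cdf = model.cdf(value)
    return 2 * min(cdf, 1 - cdf)

random.seed(0)
clean = [random.gauss(163, 10) for _ in range(1000)]  # made-up inlier heights (cm)

# Novelty detection: fit on clean data, then score unseen observations.
novelty_model = NormalDist.from_samples(clean)
new_points = [130, 160, 200]
novelty_flags = [tail_probability(novelty_model, v) < ALPHA for v in new_points]

# Anomaly detection: the data set itself contains outliers, and the goal
# is to find them among the observations the model was fit on.
contaminated = clean + [230.0, 95.0]
anomaly_model = NormalDist.from_samples(contaminated)
anomalies = [v for v in contaminated if tail_probability(anomaly_model, v) < ALPHA]

print(novelty_flags)
print(sorted(anomalies))
```

In the anomaly case the injected extremes also inflate the fitted standard deviation, which is one reason contaminated training data calls for extra care when choosing thresholds.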
Three Outlier Types
Global outliers (point outliers): isolated points far from the bulk of the data.
Contextual outliers: abnormal observations within a specific context (e.g., seasonal temperature extremes).
Collective outliers: groups of points that together form an abnormal pattern.
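The difference between a global and a contextual outlier can be illustrated with a small, hypothetical temperature example. The readings, the season grouping, and the z-score cutoff of 2.0 are all assumptions made for this sketch; a z-score is used here only because it is the simplest stand-in for a fitted distribution.

```python
from statistics import mean, stdev

THRESHOLD = 2.0  # assumed cutoff: |z| above this counts as an outlier

# Hypothetical daily temperatures (°C) grouped by season.
temps = {
    "winter": [1, 3, 0, 2, -1, 2, 1, 24],  # 24 °C is ordinary in summer, odd in winter
    "summer": [25, 27, 24, 26, 28, 25, 26, 27],
}

def outliers(values, threshold=THRESHOLD):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Global view: pool every reading and ignore the season.
all_values = [v for readings in temps.values() for v in readings]
global_outliers = outliers(all_values)

# Contextual view: judge each reading against its own season only.
contextual_outliers = [
    (season, v) for season, readings in temps.items() for v in outliers(readings)
]

print(global_outliers)      # nothing stands out across the whole year
print(contextual_outliers)  # the warm winter day stands out within its season
```

The 24 °C reading sits comfortably inside the pooled range, so a global check misses it; only the per-season comparison exposes it.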
Univariate vs. Multivariate Modeling
Univariate methods evaluate a single variable at a time, making distribution fitting straightforward. Multivariate methods consider multiple features simultaneously (e.g., PCA) and are better when variables are correlated.
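A short sketch shows why correlation matters. It uses the Mahalanobis distance (a different multivariate technique from the PCA mentioned above, chosen here because it fits in a few lines of standard-library Python) on synthetic, strongly correlated height and weight values. The data, the candidate point, and the variable names are assumptions for illustration.

```python
import random
from statistics import mean, stdev

random.seed(1)
# Synthetic, strongly correlated data: weight depends linearly on height.
heights = [random.gauss(170, 10) for _ in range(500)]
weights = [0.9 * h - 85 + random.gauss(0, 3) for h in heights]

def mahalanobis_2d(point, xs, ys):
    """Mahalanobis distance of a 2-D point from the sample (xs, ys)."""
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    dx, dy = point[0] - mx, point[1] - my
    det = sxx * syy - sxy ** 2
    # d^2 = [dx dy] S^-1 [dx dy]^T, with S^-1 written out for the 2x2 case
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return d2 ** 0.5

# A tall but light person: each value alone is unremarkable.
candidate = (185.0, 55.0)
z_height = (candidate[0] - mean(heights)) / stdev(heights)
z_weight = (candidate[1] - mean(weights)) / stdev(weights)
distance = mahalanobis_2d(candidate, heights, weights)

print(round(z_height, 2), round(z_weight, 2), round(distance, 2))
```

Both univariate z-scores stay below 2 in absolute value, yet the joint distance is large because the combination contradicts the height-weight relationship.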
Distribution‑Fitting Workflow with distfit
Fit multiple probability density functions (PDFs) to the data and rank them using goodness‑of‑fit tests; optionally bootstrap to check over‑fitting.
Visualize histograms, PDFs, CDFs, and QQ‑plots.
Select the best model based on statistical scores and domain relevance.
Use the chosen model to predict whether new samples are outliers.
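The fit-and-rank step can be sketched without distfit to show the idea behind an RSS score: compare each candidate PDF with the empirical histogram density and keep the candidate with the smallest residual sum of squares. The three candidates, their simple parameter estimates, and the bin count below are assumptions for this illustration and do not reproduce distfit's internals.

```python
import math
import random
from statistics import mean, median, stdev

random.seed(42)
data = [random.gauss(163, 10) for _ in range(2000)]  # synthetic heights (cm)

def histogram_density(values, bins=25):
    """Return bin centers and the normalized histogram density."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[idx] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    density = [c / (len(values) * width) for c in counts]
    return centers, density

# Simple parameter estimates for each candidate distribution.
mu, sigma = mean(data), stdev(data)
lo, hi = min(data), max(data)
med = median(data)
b = mean([abs(v - med) for v in data])  # Laplace scale estimate

candidates = {
    "norm": lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi)),
    "laplace": lambda x: math.exp(-abs(x - med) / b) / (2 * b),
    "uniform": lambda x: 1 / (hi - lo) if lo <= x <= hi else 0.0,
}

# Score each candidate by the RSS between its PDF and the histogram density.
centers, density = histogram_density(data)
scores = {
    name: sum((d - pdf(c)) ** 2 for c, d in zip(centers, density))
    for name, pdf in candidates.items()
}
ranking = sorted(scores.items(), key=lambda item: item[1])

for name, rss in ranking:
    print(f"{name:8s} RSS={rss:.6f}")
```

A low RSS only says the curve tracks the histogram; the workflow above still calls for visual checks and domain judgment before a model is chosen.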
Example 1: Synthetic Human Height Data
Generate 10,000 height samples (mean 163 cm, std 10 cm) and fit distributions:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
# Generate 10,000 samples from a normal distribution (mean 163, std 10)
X = np.random.normal(163, 10, 10000)
Install and initialize distfit:
# Install the distfit library (run this in a shell, not in Python):
# pip install distfit
from distfit import distfit
# Initialize for popular distributions with bootstrapping
dfit = distfit(distr='popular', n_boots=100)
results = dfit.fit_transform(X)
# Plot the top‑10 PDFs
dfit.plot_summary(n_top=10)
plt.show()
The bootstrap and RSS scores indicate that the log‑gamma distribution best fits the height data, outperforming candidates such as Beta, Gamma, Normal, t‑distribution, GEV, and Weibull.
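To give a feel for what a bootstrap check adds on top of a single score, here is a simplified parametric bootstrap built on the Kolmogorov-Smirnov statistic. It is a sketch under stated assumptions: it uses only the standard library, tests a normal model, and is not distfit's exact procedure; the function names ks_statistic and bootstrap_pvalue are hypothetical.

```python
import random
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of sample and a model CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def bootstrap_pvalue(data, n_boots=100, seed=0):
    """Share of simulated samples that fit the normal model no better than data."""
    rng = random.Random(seed)
    model = NormalDist.from_samples(data)
    observed = ks_statistic(data, model.cdf)
    exceed = 0
    for _ in range(n_boots):
        # Draw a synthetic sample from the fitted model and refit it,
        # mirroring what was done to the real data.
        fake = [rng.gauss(model.mean, model.stdev) for _ in range(len(data))]
        refit = NormalDist.from_samples(fake)
        if ks_statistic(fake, refit.cdf) >= observed:
            exceed += 1
    return exceed / n_boots

rng = random.Random(7)
heights = [rng.gauss(163, 10) for _ in range(500)]         # roughly normal
skewed = [150 + rng.expovariate(1 / 10) for _ in range(500)]  # clearly not normal

p_heights = bootstrap_pvalue(heights)
p_skewed = bootstrap_pvalue(skewed)
print(p_heights, p_skewed)
```

A value near zero means the fitted model almost never produces samples that look as poorly fitted as the real data, which is a sign to reject that candidate.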
Refine the fit with the chosen distribution:
# Initialize for log‑gamma
dfit = distfit(distr='loggamma', alpha=0.01, bound='both')
results = dfit.fit_transform(X)
print(dfit.model)
# Save the model for later use
dfit.save('./human_height_model.pkl')
Predict on new heights (130 cm, 160 cm, 200 cm):
# New human heights
y = [130, 160, 200]
results = dfit.predict(y, alpha=0.01, multtest='fdr_bh', todf=True)
print(results['df'])
# Visualize predictions
fig, ax = plt.subplots(1, 2, figsize=(20, 8))
dfit.plot(chart='PDF', ax=ax[0])
dfit.plot(chart='CDF', ax=ax[1])
plt.show()
Example 2: Real‑World Natural‑Gas Spot Prices
Load the Thomson Reuters natural‑gas spot‑price dataset (6,555 points spanning 27 years) and inspect the time series:
# Import example dataset
dfit = distfit()
df = dfit.import_example(data='gas_spot_price')
# Plot the price series
dfit.lineplot(df, xlabel='Years', ylabel='Natural gas spot price', grid=True)
plt.show()
Fit all available PDFs and identify the best match:
from distfit import distfit
# Search full distribution space with 100 bootstraps
dfit = distfit(distr='full', n_boots=100)
results = dfit.fit_transform(df['price'].values)
# Visualize top PDFs
fig, ax = plt.subplots(1, 2, figsize=(25, 10))
dfit.plot(chart='PDF', n_top=10, ax=ax[0])
dfit.plot(chart='CDF', n_top=10, ax=ax[1])
plt.show()
The Johnson SB distribution obtains the highest bootstrap score, though visual inspection shows slight mis‑fit in the tails. Summary and QQ‑plots confirm that only Johnson SB passes the bootstrap test.
# Summary and QQ‑plot
fig, ax = plt.subplots(1, 2, figsize=(25, 10))
dfit.plot_summary(ax=ax[0])
dfit.qqplot(df['price'].values, n_top=10, ax=ax[1])
plt.show()
Detect anomalies using the selected distribution:
# Predict anomalies (alpha=0.05)
dfit.predict(df['price'].values, alpha=0.05, multtest=None)
# Visualize outliers on the time series
dfit.lineplot(df['price'], labels=df.index)
plt.show()
The resulting plot highlights global outliers (e.g., spikes in 2003 and 2021) and contextual outliers that fall outside the 95 % confidence interval.
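The idea of flagging values outside a central 95 % interval can be sketched with empirical percentiles instead of a fitted Johnson SB model. Everything here is an assumption for illustration: the price series is made up, the function name label_outliers is hypothetical, and the "up"/"down"/"none" labels are simply a convenient convention for this sketch.

```python
from statistics import quantiles

def label_outliers(values, alpha=0.05):
    """Label each value 'down', 'up', or 'none' using empirical percentile bounds."""
    n_cuts = round(2 / alpha)  # alpha=0.05 -> 40 slices of 2.5 % each
    cuts = quantiles(values, n=n_cuts, method="inclusive")
    lower, upper = cuts[0], cuts[-1]  # 2.5th and 97.5th percentiles
    labels = []
    for v in values:
        if v < lower:
            labels.append("down")
        elif v > upper:
            labels.append("up")
        else:
            labels.append("none")
    return labels

# Made-up price series: a slow upward drift with two injected spikes.
prices = [3.0 + 0.01 * i for i in range(100)]
prices[30] = 18.0  # sudden spike upward
prices[70] = 0.5   # sudden drop

labels = label_outliers(prices, alpha=0.05)
down = [i for i, lab in enumerate(labels) if lab == "down"]
up = [i for i, lab in enumerate(labels) if lab == "up"]
print("down:", down)
print("up:", up)
```

Percentile bounds flag roughly a fraction alpha of the data by construction, so the two injected spikes are joined by the most extreme ordinary values. That is worth keeping in mind when reading any plot of points outside a 95 % interval.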
Conclusion
The article differentiates anomaly and novelty detection, demonstrates how to build univariate outlier models via probability‑distribution fitting with the distfit library, and walks through both synthetic and real datasets to select appropriate PDFs, validate fits with bootstrap and QQ‑plots, and predict outliers. When no single theoretical distribution fits well, distfit also offers non‑parametric options such as percentile‑based fitting.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
