Mastering Anomaly vs Novelty Detection with Distribution Fitting in Python

This article explains the fundamental differences between anomaly and novelty detection, outlines how to model univariate outliers using probability distribution fitting with the distfit library, and demonstrates the workflow on synthetic height data and real natural‑gas price data, including model selection, visualization, and prediction.

Data Party THU

Anomaly vs. Novelty: Core Concepts

Both anomalies and novelties are observations that deviate from the normal pattern, but the modeling assumptions differ. Anomaly detection assumes the data contain known outliers: the model is fit on the normal (inlier) observations, and anything outside the learned range is flagged as anomalous. Novelty detection assumes no outliers are known at training time, so domain knowledge is needed to define what counts as normal.

Three Outlier Types

Global outliers (point outliers): isolated points far from the bulk of the data.

Contextual outliers: abnormal observations within a specific context (e.g., seasonal temperature extremes).

Collective outliers: groups of points that together form an abnormal pattern.
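
The difference between a global and a contextual outlier can be made concrete with a small sketch. The temperatures below are simulated, and the winter/summer split and the z-score comparison are illustrative assumptions, not part of the original article: a 20 °C reading looks ordinary against the whole year but extreme against winter days alone.

```python
import numpy as np

# Hypothetical daily temperatures (°C): 90 winter days and 90 summer days.
rng = np.random.default_rng(0)
winter = rng.normal(2.0, 3.0, 90)
summer = rng.normal(24.0, 3.0, 90)
temps = np.concatenate([winter, summer])


def zscore(value, sample):
    """Standardized distance of `value` from the mean of `sample`."""
    return float((value - sample.mean()) / sample.std())


reading = 20.0  # a single 20 °C observation

z_global = zscore(reading, temps)   # judged against all days
z_winter = zscore(reading, winter)  # judged against winter days only

print(f"global z = {z_global:.2f}, winter z = {z_winter:.2f}")
```

A global-outlier check would let this reading pass, while a check conditioned on the season would flag it.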

Univariate vs. Multivariate Modeling

Univariate methods evaluate a single variable at a time, making distribution fitting straightforward. Multivariate methods consider multiple features simultaneously (e.g., PCA) and are better when variables are correlated.
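
To see why correlated variables call for a multivariate view, the sketch below compares per-feature z-scores with the Mahalanobis distance. The data, the correlation of 0.9, and the test point are made up for illustration, and Mahalanobis distance is used here as one common multivariate measure rather than the method the article has in mind.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two strongly correlated features.
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5000)

# A point that is unremarkable per feature but violates the correlation.
point = np.array([1.5, -1.5])

# Univariate view: z-score of each coordinate on its own.
z = (point - X.mean(axis=0)) / X.std(axis=0)

# Multivariate view: Mahalanobis distance accounts for the covariance.
diff = point - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
mahalanobis = float(np.sqrt(diff @ inv_cov @ diff))

print("per-feature z-scores:", np.round(z, 2))
print("Mahalanobis distance:", round(mahalanobis, 2))
```

Each coordinate sits within two standard deviations of its mean, yet the pair is far from the bulk of the data once the correlation is taken into account.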

Distribution‑Fitting Workflow with distfit

Fit multiple probability density functions (PDFs) to the data and rank them using goodness‑of‑fit tests; optionally bootstrap to check over‑fitting.

Visualize histograms, PDFs, CDFs, and QQ‑plots.

Select the best model based on statistical scores and domain relevance.

Use the chosen model to predict whether new samples are outliers.
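
The fitting and ranking steps above can be sketched with SciPy alone. This is a simplified illustration, not distfit's implementation: the three candidate distributions, the bin count, and the `scores` dictionary are assumptions chosen for the example, and the score is the residual sum of squares (RSS) between a histogram density and each fitted PDF.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
X = rng.normal(163, 10, 2000)  # synthetic heights in cm

# Empirical density estimated from a histogram.
density, edges = np.histogram(X, bins=50, density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Small, hand-picked candidate set (an assumption for this sketch).
candidates = {"norm": stats.norm, "expon": stats.expon, "uniform": stats.uniform}

scores = {}
for name, dist in candidates.items():
    params = dist.fit(X)              # maximum-likelihood parameter estimates
    pdf = dist.pdf(centers, *params)  # fitted density at the bin centers
    scores[name] = float(np.sum((density - pdf) ** 2))  # RSS, lower is better

ranking = sorted(scores, key=scores.get)
best = ranking[0]
print("ranking (best first):", ranking)
```

The statistical ranking is only the first filter; the choice among close candidates should still take domain relevance into account, as the third step notes.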

Example 1: Synthetic Human Height Data

Generate 10,000 height samples (mean 163 cm, std 10 cm) and fit distributions:

# import library
import numpy as np
# Generate 10000 samples from a normal distribution
X = np.random.normal(163, 10, 10000)

Install and initialize distfit:

# Install distfit from the command line (not inside the Python script):
# pip install distfit

import matplotlib.pyplot as plt
from distfit import distfit
# Initialize for popular distributions with 100 bootstraps
dfit = distfit(distr='popular', n_boots=100)
results = dfit.fit_transform(X)
# Plot the score summary for the top 10 distributions
dfit.plot_summary(n_top=10)
plt.show()

The bootstrap scores and the residual sum of squares (RSS) indicate that the log‑gamma distribution fits the height data best, ahead of candidates such as the Beta, Gamma, Normal, t, generalized extreme value (GEV), and Weibull distributions.
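
One way to think about a bootstrap score is as a parametric bootstrap: draw repeated samples from the fitted model and count how often a two-sample Kolmogorov-Smirnov test cannot tell them apart from the data. The sketch below follows that idea with SciPy. It is an assumption about the general technique, not a reproduction of distfit's internal procedure, and `bootstrap_score` is a hypothetical helper name.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(163, 10, 1000)  # synthetic heights in cm


def bootstrap_score(data, dist, n_boots=100, alpha=0.05):
    """Fraction of resamples from the fitted model that a two-sample
    KS test does not reject as different from the data."""
    params = dist.fit(data)
    passed = 0
    for _ in range(n_boots):
        sample = dist.rvs(*params, size=len(data), random_state=rng)
        _, p_value = stats.ks_2samp(data, sample)
        passed += int(p_value > alpha)
    return passed / n_boots


score_norm = bootstrap_score(X, stats.norm)
score_unif = bootstrap_score(X, stats.uniform)
print(f"normal: {score_norm:.2f}, uniform: {score_unif:.2f}")
```

A score near 1 suggests the fitted model generates data that resemble the observations; a score near 0 suggests a poor fit regardless of how the distribution ranks on RSS.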

Refine the fit with the chosen distribution:

# Initialize for log‑gamma
dfit = distfit(distr='loggamma', alpha=0.01, bound='both')
results = dfit.fit_transform(X)
print(dfit.model)
# Save the model for later use
dfit.save('./human_height_model.pkl')

Predict on new heights (130 cm, 160 cm, 200 cm):

# New human heights
y = [130, 160, 200]
results = dfit.predict(y, alpha=0.01, multtest='fdr_bh', todf=True)
print(results['df'])
# Visualize predictions
fig, ax = plt.subplots(1, 2, figsize=(20, 8))
dfit.plot(chart='PDF', ax=ax[0])
dfit.plot(chart='CDF', ax=ax[1])
plt.show()
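
The thresholding behind such a prediction can be sketched with the standard library. Two assumptions are made here and should be read as such: a plain normal model with the simulation parameters stands in for the fitted log-gamma, and each tail is cut at `alpha` when both bounds are used. The labels returned by `classify` are illustrative and not necessarily those distfit reports.

```python
from statistics import NormalDist

# Assumption: a normal model with the parameters used to simulate the heights.
model = NormalDist(mu=163, sigma=10)
alpha = 0.01

# Assumption: each tail is cut at alpha (two-sided bounds).
lower = model.inv_cdf(alpha)
upper = model.inv_cdf(1 - alpha)


def classify(height):
    """Label a height as below, above, or inside the model's bounds."""
    if height < lower:
        return "down"
    if height > upper:
        return "up"
    return "none"


labels = {h: classify(h) for h in [130, 160, 200]}
print(f"bounds: [{lower:.1f}, {upper:.1f}] cm")
print(labels)
```

Under these assumptions 130 cm and 200 cm fall outside the bounds and 160 cm falls inside. This sketch omits the multiple-testing correction requested via `multtest='fdr_bh'`.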

Example 2: Real‑World Natural‑Gas Spot Prices

Load the Thomson Reuters natural‑gas spot‑price dataset (6,555 points spanning 27 years) and inspect the time series:

# Import example dataset
dfit = distfit()
df = dfit.import_example(data='gas_spot_price')
# Plot the price series
dfit.lineplot(df, xlabel='Years', ylabel='Natural gas spot price', grid=True)
plt.show()

Fit all available PDFs and identify the best match:

from distfit import distfit
# Search full distribution space with 100 bootstraps
dfit = distfit(distr='full', n_boots=100)
results = dfit.fit_transform(df['price'].values)
# Visualize top PDFs
fig, ax = plt.subplots(1, 2, figsize=(25, 10))
dfit.plot(chart='PDF', n_top=10, ax=ax[0])
dfit.plot(chart='CDF', n_top=10, ax=ax[1])
plt.show()

The Johnson SB distribution obtains the highest bootstrap score, though visual inspection shows a slight misfit in the tails. The summary plot and QQ‑plot confirm that only Johnson SB passes the bootstrap test.

# Summary and QQ‑plot
fig, ax = plt.subplots(1, 2, figsize=(25, 10))
dfit.plot_summary(ax=ax[0])
dfit.qqplot(df['price'].values, n_top=10, ax=ax[1])
plt.show()
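
For readers unfamiliar with QQ-plots, the sketch below computes the plotted points by hand: sorted observations on one axis, model quantiles at matching probabilities on the other. It uses simulated normal data and a normal model purely for illustration; the plotting positions `(i - 0.5) / n` are one common convention and an assumption here, not necessarily what distfit uses.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
sample = rng.normal(163, 10, 500)  # simulated data for illustration

# Empirical quantiles are simply the sorted observations.
empirical = np.sort(sample)

# Plotting positions avoid probabilities of exactly 0 or 1.
n = len(sample)
probs = (np.arange(1, n + 1) - 0.5) / n

# Theoretical quantiles from a normal model fitted by moments.
model = NormalDist(mu=float(sample.mean()), sigma=float(sample.std()))
theoretical = np.array([model.inv_cdf(p) for p in probs])

# For a good fit the points lie close to the diagonal.
corr = float(np.corrcoef(empirical, theoretical)[0, 1])
max_gap = float(np.max(np.abs(empirical - theoretical)))
print(f"correlation: {corr:.4f}, largest deviation: {max_gap:.2f} cm")
```

Deviations concentrated at the ends of the sorted sample correspond to the tail misfit that a QQ-plot makes visible.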

Detect anomalies using the selected distribution:

# Predict anomalies (alpha=0.05)
dfit.predict(df['price'].values, alpha=0.05, multtest=None)
# Visualize outliers on the time series
dfit.lineplot(df['price'], labels=df.index)
plt.show()

The resulting plot highlights global outliers (e.g., the spikes in 2003 and 2021) as well as contextual outliers that fall outside the 95% confidence interval.

Conclusion

The article differentiates anomaly and novelty detection, demonstrates how to build univariate outlier models via probability‑distribution fitting with the distfit library, and walks through both synthetic and real datasets to select appropriate PDFs, validate fits with bootstrap and QQ‑plots, and predict outliers. When no single theoretical distribution fits well, distfit also offers non‑parametric options such as percentile‑based fitting.
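
The idea behind a percentile-based approach can be sketched with NumPy. The bimodal data below are simulated and `np.percentile` stands in for whatever distfit does internally, so treat this as a generic illustration rather than the library's method.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical skewed, bimodal values that no single textbook PDF fits well.
prices = np.concatenate([
    rng.lognormal(1.0, 0.4, 900),
    rng.lognormal(2.5, 0.2, 100),
])

alpha = 0.05
# Empirical bounds: the central 95% of the observed values.
lower, upper = np.percentile(prices, [100 * alpha / 2, 100 * (1 - alpha / 2)])

is_outlier = (prices < lower) | (prices > upper)
print(f"bounds: [{lower:.2f}, {upper:.2f}]")
print(f"flagged fraction: {is_outlier.mean():.3f}")
```

Because the bounds come from the data themselves, this approach flags roughly a fraction `alpha` of the training observations by construction, so it is most useful for scoring new samples against an existing reference set.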

Figures: Outlier Types; PDF vs CDF Comparison; PDF and CDF Plots; Anomaly Detection on Gas Prices.
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
