Local Outlier Factor (LOF) Algorithm: Theory, Workflow, Pros & Cons, and Python Implementation
This article introduces the classic density‑based anomaly detection method Local Outlier Factor (LOF), explains its underlying concepts such as k‑distance, reachability distance, and local reachability density, outlines the algorithm steps, discusses its advantages and limitations, and provides practical Python examples using PyOD and scikit‑learn.
The Local Outlier Factor (LOF) algorithm is a density‑based anomaly detection technique originally published in SIGMOD 2000 and cited over 3000 times. Unlike earlier statistical or clustering‑based methods, LOF does not assume a specific data distribution and can quantify the degree of outlierness for each point.
Core Assumption: Non‑outlier points have a surrounding density similar to that of their neighbors, while outliers have a markedly different density.
Key Concepts:
k‑distance: The distance from a point to its k‑th nearest neighbor.
k‑distance neighborhood: All points within the k‑distance radius.
Reachability distance: For points p and o, it is the maximum of the k‑distance of o and the actual distance between p and o.
Local Reachability Density (LRD): The inverse of the average reachability distance from a point to its neighbors.
Local Outlier Factor (LOF): The ratio of the average LRD of a point’s neighbors to the point’s own LRD. Values close to 1 mean the point is about as dense as its neighborhood (normal); values well above 1 mean the point sits in a markedly sparser region than its neighbors, i.e. an outlier.
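These definitions can be made concrete on a tiny 1‑D example (the four points and the choice k = 2 are purely illustrative, not from the article):

```python
import numpy as np

# Toy 1-D data set; k = 2 (both are illustrative choices)
points = np.array([0.0, 1.0, 2.0, 10.0])
k = 2

def k_distance(i):
    # Distance from points[i] to its k-th nearest neighbor (excluding itself)
    d = np.sort(np.abs(points - points[i]))[1:]  # drop the self-distance 0
    return d[k - 1]

def reach_dist(p, o):
    # Reachability distance of p w.r.t. o: max(k-distance(o), d(p, o))
    return max(k_distance(o), abs(points[p] - points[o]))

print(k_distance(0))     # 2.0: the 2nd nearest neighbor of 0.0 is 2.0
print(reach_dist(3, 0))  # 10.0: the true distance dominates for the far point
print(reach_dist(1, 0))  # 2.0: nearby points are "pushed out" to o's k-distance
```

The last line shows the purpose of the reachability distance: it smooths away very small distances inside dense regions, which stabilizes the density estimate.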
Algorithm Workflow:
Compute pairwise distances for all points and sort them.
Identify the k‑nearest neighbors of each point, compute its local reachability density, and derive its LOF score.
Interpret the LOF score: values well above 1 indicate strong outlierness, while values close to 1 indicate a normal point.
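The workflow above can be sketched as a naive O(n²) implementation (the synthetic two‑dimensional data and k = 5 are illustrative choices; in practice the optimized library versions discussed below are preferred):

```python
import numpy as np

def lof_scores(X, k=5):
    """Naive O(n^2) LOF following the three workflow steps; illustrative only."""
    n = len(X)
    # Step 1: compute and sort all pairwise distances
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    order = np.argsort(D, axis=1)
    knn = order[:, 1:k + 1]               # k nearest neighbors (index 0 is the point itself)
    k_dist = D[np.arange(n), knn[:, -1]]  # k-distance of every point
    # Step 2: local reachability density = inverse of mean reachability distance
    lrd = np.empty(n)
    for i in range(n):
        reach = np.maximum(k_dist[knn[i]], D[i, knn[i]])
        lrd[i] = 1.0 / reach.mean()
    # Step 3: LOF = mean LRD of the neighbors / own LRD
    return np.array([lrd[knn[i]].mean() / lrd[i] for i in range(n)])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)), [[8.0, 8.0]]])  # one planted outlier
scores = lof_scores(X)
print(scores[-1])         # far larger than 1 for the planted outlier
print(scores[:50].mean())  # close to 1 for the cluster points
```

The double loop over neighbors is what makes the naive version quadratic; this is exactly the cost that FastLOF and the k‑d‑tree/ball‑tree backends in the libraries below attack.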
Advantages: Considers both local and global data structure, works well with clusters of varying density, and is suitable for medium‑to‑high dimensional data.
Limitations: If a point has k or more duplicates, the reachability distances among them are zero and the LRD computation divides by zero (the original paper assumes such duplicate groups do not exist); the algorithm also has O(n²) time complexity, which prompted later optimizations such as FastLOF.
Python Implementation:
Two popular libraries can compute LOF: PyOD and scikit‑learn.
Using PyOD to generate a synthetic dataset and fit a LOF model:
from pyod.utils.data import generate_data
import numpy as np
X_train, y_train, X_test, y_test = generate_data(
n_train=200,
n_test=100,
n_features=5,
contamination=0.1,
random_state=3)
X_train = X_train * np.random.uniform(0, 1, size=X_train.shape)
X_test = X_test * np.random.uniform(0, 1, size=X_test.shape)
Fit the model and evaluate:
from pyod.models.lof import LOF
from pyod.utils.utility import precision_n_scores
from sklearn.metrics import roc_auc_score
clf = LOF()
clf.fit(X_train)
test_scores = clf.decision_function(X_test)
roc = round(roc_auc_score(y_test, test_scores), 4)
prn = round(precision_n_scores(y_test, test_scores), 4)
print(f'LOF ROC:{roc}, precision @ rank n:{prn}')
# Output example: LOF ROC:0.9656, precision @ rank n:0.8
With scikit‑learn, the LocalOutlierFactor class can be used in two modes:
novelty=False (default): call fit_predict on the training data; outlier scores are accessed via the negative_outlier_factor_ attribute (more negative means more anomalous).
novelty=True: fit_predict is unavailable; use decision_function and predict on new data. decision_function returns inverted scores, so lower values indicate outliers, which is why the score is negated below.
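As a minimal sketch of the default novelty=False mode (the synthetic cluster with one planted outlier and the n_neighbors value are illustrative choices, not from the article):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), [[10.0, 10.0]]])  # one planted outlier

clf = LocalOutlierFactor(n_neighbors=20)  # novelty=False is the default
labels = clf.fit_predict(X)               # -1 marks outliers, 1 marks inliers
scores = clf.negative_outlier_factor_     # more negative = more anomalous
print(labels[-1])                         # the planted outlier is labeled -1
```

The article's own example below uses the other mode, novelty=True.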
from sklearn.neighbors import LocalOutlierFactor
clf = LocalOutlierFactor(novelty=True)
clf.fit(X_train)
test_scores = -clf.decision_function(X_test)
roc = round(roc_auc_score(y_test, test_scores), 4)
prn = round(precision_n_scores(y_test, test_scores), 4)
print(f'LOF ROC:{roc}, precision @ rank n:{prn}')
Visualizations of inlier and outlier score distributions can be plotted with matplotlib and seaborn to illustrate the separation achieved by LOF.
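As one possible sketch using matplotlib alone (seaborn's histplot would work similarly; the synthetic inlier/outlier data here is illustrative, not the article's dataset):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script needs no display
import matplotlib.pyplot as plt
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X_in = rng.normal(0, 1, size=(200, 2))    # inlier cluster
X_out = rng.uniform(-6, 6, size=(20, 2))  # scattered outliers
X = np.vstack([X_in, X_out])
y = np.array([0] * 200 + [1] * 20)        # 1 marks the planted outliers

clf = LocalOutlierFactor(novelty=True).fit(X_in)
scores = -clf.decision_function(X)  # flip sign: higher = more anomalous

# Overlayed histograms show how well the two score distributions separate
fig, ax = plt.subplots()
ax.hist(scores[y == 0], bins=30, alpha=0.6, label="inliers")
ax.hist(scores[y == 1], bins=30, alpha=0.6, label="outliers")
ax.set_xlabel("LOF outlier score")
ax.set_ylabel("count")
ax.legend()
fig.savefig("lof_score_distributions.png")
```

A clear gap between the two histograms corresponds to a high ROC score; heavy overlap signals that the chosen n_neighbors may not match the data's cluster scale.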
All complete code examples are available on the author’s GitHub repository: https://github.com/xiaoyusmd/PythonDataScience .