Artificial Intelligence 19 min read

Mastering Anomaly Detection: From Moving Averages to Isolation Forests

This comprehensive guide explores a wide range of anomaly detection techniques—including time‑series methods, statistical models, distance‑based approaches, tree‑based isolation forests, graph algorithms, behavior‑sequence Markov models, and supervised machine‑learning models—detailing their principles, formulas, and practical scenarios for detecting outliers in advertising, fraud, and system monitoring.

Alibaba Cloud Developer

May 22, 2019

Mastering Anomaly Detection: From Moving Averages to Isolation Forests

Background

Outlier detection (also called anomaly detection) is the process of identifying data points that deviate significantly from expected behavior. Detected outliers are used in many applications such as credit‑card fraud prevention, industrial fault detection, and ad‑click anti‑cheating.

An outlier is a data object that is markedly different from other objects. In the illustration, points inside regions N1 and N2 are normal, while points O1, O2, O3 far from N1/N2 are outliers.

A major difficulty in anomaly detection is the lack of ground truth. A common practice is to first use unsupervised methods to mine suspicious samples, then fuse multiple features with supervised models for more comprehensive fraud detection.

1. Time Series

1.1 Moving Average (MA)

1.1.1 Simple Moving Average (SMA)

The simple moving average smooths a time series by averaging historical values within a window of size T. When the absolute difference between the current value and the SMA exceeds a threshold, the point is considered anomalous.

Applicable to: a. smoothing noisy data; b. forecasting future values.

1.1.2 Weighted Moving Average (WMA)

WMA assigns higher weights to more recent observations, making it more responsive to recent changes than SMA, while still retaining some lag.

1.1.3 Exponential Moving Average (EMA)

EMA uses exponentially decreasing weights, ensuring that even very old observations retain a small influence. The smoothing factor α (0 < α < 1) controls the decay rate; larger α makes the average react faster to recent changes.

1.2 Year‑over‑Year and Month‑over‑Month

YoY and MoM ratios are useful for detecting anomalies in data with periodic patterns, such as daily active users (DAU) or ad‑click volume. When the ratio exceeds a predefined threshold, an anomaly is flagged.

1.3 STL + GESD for Time Series Anomaly Detection

STL decomposes a series into seasonal, trend, and residual components. GESD (Generalized Extreme Studentized Deviate) is then applied to the combined trend + residual, using median and MAD for robustness. The anomaly score is computed as (value − median)/MAD.

2. Statistics

2.1 Single Feature Gaussian

If a variable x follows a Gaussian distribution, its probability density function can be estimated from sample data to compute the likelihood of new observations.

2.2 Multiple Independent Gaussian Features

For n independent Gaussian features, each dimension’s mean and variance are computed, and the joint probability of a new n‑dimensional point is the product of individual probabilities.

2.3 Multivariate Gaussian

When features are correlated, the multivariate Gaussian is characterized by a mean vector μ and covariance matrix Σ. The probability of a new point x is evaluated using the Mahalanobis distance.

2.4 Mahalanobis Distance

For a data point x, the Mahalanobis distance d = √((x − μ)ᵀ Σ⁻¹ (x − μ)) measures how far x deviates from the distribution. Large distances indicate outliers.

2.5 Boxplot

Boxplot‑based detection does not assume any distribution. It computes Q1, Q3, and IQR = Q3 − Q1, then defines outlier bounds as Q3 + λ·IQR and Q1 − λ·IQR (commonly λ = 1.5).

3. Distance

3.1 Angle‑Based Outlier Detection

For each point X, the variance of angles formed with all other point pairs is computed. Smaller variance indicates that X lies in a dense region, while larger variance suggests an outlier. The algorithm runs in O(N³) time, suitable for small N.

3.2 K‑Nearest‑Neighbor (KNN) Outlier Detection

For each point X, the sum of distances to its K nearest neighbors (Dist(K,X)) is calculated. Larger sums indicate higher outlierness. Complexity is O(N · log N) for efficient nearest‑neighbor search.

4. Linear Methods (Matrix Decomposition and PCA)

PCA projects data onto principal components. Normal points are well reconstructed using the top‑j components, while outliers exhibit large reconstruction errors because they lie in directions captured by lower‑variance components.

5. Distribution

5.1 Kullback‑Leibler (KL) Divergence

KL divergence measures the distance between two probability distributions P and Q: Dₖₗ(P‖Q) = Σ P(i) log(P(i)/Q(i)). Zero indicates identical distributions; larger values indicate greater dissimilarity.

5.2 Chi‑Square Test

The chi‑square statistic χ² = Σ (O − E)² / E compares observed counts O with expected counts E to assess whether deviations are statistically significant.

6. Tree (Isolation Forest)

Isolation Forest recursively partitions data using random hyperplanes. Points in low‑density regions require fewer splits to become isolated, yielding shorter path lengths and higher anomaly scores.

7. Graph

7.1 Largest Connected Component

In an undirected graph, a connected component groups vertices that are mutually reachable. Identifying maximal connected subgraphs helps detect coordinated fraudulent behavior (e.g., multiple accounts sharing devices).

7.2 Label Propagation Clustering

Label propagation iteratively updates node labels based on neighboring labels, producing subgraphs with high internal connectivity and low inter‑connectivity.

8. Behavior Sequence (Markov Chain)

User actions (e.g., page request, search, ad click, pagination) form a Markov chain. Transition probabilities are estimated from normal behavior; the likelihood of a new sequence indicates its normality.

9. Supervised Models

9.1 Gradient Boosted Decision Trees (GBDT)

Samples mined by unsupervised methods serve as training data. If labeled cheating samples are scarce, techniques like SMOTE or GAN can generate synthetic fraud instances. GBDT is then trained and evaluated on conversion metrics.

9.2 Wide & Deep

Wide & Deep combines wide (high‑dimensional linear features) and deep (neural network) components to balance memorization and generalization. The model can be further enhanced by incorporating unsupervised‑derived fraud samples.

10. Other Issues

10.1 Common Threshold Selection

Unsupervised methods often use quantiles or distribution inflection points to set thresholds, while supervised models rely on validation precision‑recall curves.

10.2 Transforming Non‑Gaussian to Gaussian

Non‑Gaussian features can be transformed using functions such as log(x + c) or xᶜ (0 < c < 1) to approximate a Gaussian distribution before applying statistical tests.

References

Charu C. Aggarwal et al., *Outlier Analysis*, Springer, 2016.

Varun Chandola, Arindam Banerjee et al., “Anomaly Detection: A Survey,” *ACM Computing Surveys*, 2009.

Kalyan Veeramachaneni, Ignacio Arnaldo et al., “AI2: Training a big data machine to defend,” *Proc. HPSC and IDS*, 2016.

Liu, Fei Tony, Kai Ming Ting, Zhi‑Hua Zhou, “Isolation Forest,” *ICDM*, 2008.

Cheng H. T., Koc L., Harmsen J., et al., “Wide & Deep Learning for Recommender Systems,” *ACM Computing Surveys*, 2016.

SMOTE: Synthetic Minority Over‑sampling Technique, *JAIR*, 2002.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

machine learning statistics anomaly detection Time Series graph algorithms outlier detection Isolation Forest

Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Background

1. Time Series

1.1 Moving Average (MA)

1.1.1 Simple Moving Average (SMA)

1.1.2 Weighted Moving Average (WMA)

1.1.3 Exponential Moving Average (EMA)

1.2 Year‑over‑Year and Month‑over‑Month

1.3 STL + GESD for Time Series Anomaly Detection

2. Statistics

2.1 Single Feature Gaussian

2.2 Multiple Independent Gaussian Features

2.3 Multivariate Gaussian

2.4 Mahalanobis Distance

2.5 Boxplot

3. Distance

3.1 Angle‑Based Outlier Detection

3.2 K‑Nearest‑Neighbor (KNN) Outlier Detection

4. Linear Methods (Matrix Decomposition and PCA)

5. Distribution

5.1 Kullback‑Leibler (KL) Divergence

5.2 Chi‑Square Test

6. Tree (Isolation Forest)

7. Graph

7.1 Largest Connected Component

7.2 Label Propagation Clustering

8. Behavior Sequence (Markov Chain)

9. Supervised Models

9.1 Gradient Boosted Decision Trees (GBDT)

9.2 Wide & Deep

10. Other Issues

10.1 Common Threshold Selection

10.2 Transforming Non‑Gaussian to Gaussian

References

Alibaba Cloud Developer

How this landed with the community

Was this worth your time?

0 Comments

1.3 STL + GESD for Time Series Anomaly Detection

9.2 Wide & Deep