Mastering Anomaly Detection: From Moving Averages to Isolation Forests
This comprehensive guide explores a wide range of anomaly detection techniques—including time‑series methods, statistical models, distance‑based approaches, tree‑based isolation forests, graph algorithms, behavior‑sequence Markov models, and supervised machine‑learning models—detailing their principles, formulas, and practical scenarios for detecting outliers in advertising, fraud, and system monitoring.
Background
Outlier detection (also called anomaly detection) is the process of identifying data points that deviate significantly from expected behavior. Detected outliers are used in many applications such as credit‑card fraud prevention, industrial fault detection, and ad‑click anti‑cheating.
An outlier is a data object that is markedly different from other objects. In the illustration, points inside regions N1 and N2 are normal, while points O1, O2, O3 far from N1/N2 are outliers.
A major difficulty in anomaly detection is the lack of ground truth. A common practice is to first use unsupervised methods to mine suspicious samples, then fuse multiple features with supervised models for more comprehensive fraud detection.
1. Time Series
1.1 Moving Average (MA)
1.1.1 Simple Moving Average (SMA)
The simple moving average smooths a time series by averaging historical values within a window of size T. When the absolute difference between the current value and the SMA exceeds a threshold, the point is considered anomalous.
Applicable to: a. smoothing noisy data; b. forecasting future values.
1.1.2 Weighted Moving Average (WMA)
WMA assigns higher weights to more recent observations, making it more responsive to recent changes than SMA, while still retaining some lag.
1.1.3 Exponential Moving Average (EMA)
EMA uses exponentially decreasing weights, ensuring that even very old observations retain a small influence. The smoothing factor α (0 < α < 1) controls the decay rate; larger α makes the average react faster to recent changes.
1.2 Year‑over‑Year and Month‑over‑Month
YoY and MoM ratios are useful for detecting anomalies in data with periodic patterns, such as daily active users (DAU) or ad‑click volume. When the ratio exceeds a predefined threshold, an anomaly is flagged.
1.3 STL + GESD for Time Series Anomaly Detection
STL decomposes a series into seasonal, trend, and residual components. GESD (Generalized Extreme Studentized Deviate) is then applied to the combined trend + residual, using median and MAD for robustness. The anomaly score is computed as (value − median)/MAD.
2. Statistics
2.1 Single Feature Gaussian
If a variable x follows a Gaussian distribution, its probability density function can be estimated from sample data to compute the likelihood of new observations.
2.2 Multiple Independent Gaussian Features
For n independent Gaussian features, each dimension’s mean and variance are computed, and the joint probability of a new n‑dimensional point is the product of individual probabilities.
2.3 Multivariate Gaussian
When features are correlated, the multivariate Gaussian is characterized by a mean vector μ and covariance matrix Σ. The probability of a new point x is evaluated using the Mahalanobis distance.
2.4 Mahalanobis Distance
For a data point x, the Mahalanobis distance d = √((x − μ)ᵀ Σ⁻¹ (x − μ)) measures how far x deviates from the distribution. Large distances indicate outliers.
2.5 Boxplot
Boxplot‑based detection does not assume any distribution. It computes Q1, Q3, and IQR = Q3 − Q1, then defines outlier bounds as Q3 + λ·IQR and Q1 − λ·IQR (commonly λ = 1.5).
3. Distance
3.1 Angle‑Based Outlier Detection
For each point X, the variance of angles formed with all other point pairs is computed. Smaller variance indicates that X lies in a dense region, while larger variance suggests an outlier. The algorithm runs in O(N³) time, suitable for small N.
3.2 K‑Nearest‑Neighbor (KNN) Outlier Detection
For each point X, the sum of distances to its K nearest neighbors (Dist(K,X)) is calculated. Larger sums indicate higher outlierness. Complexity is O(N · log N) for efficient nearest‑neighbor search.
4. Linear Methods (Matrix Decomposition and PCA)
PCA projects data onto principal components. Normal points are well reconstructed using the top‑j components, while outliers exhibit large reconstruction errors because they lie in directions captured by lower‑variance components.
5. Distribution
5.1 Kullback‑Leibler (KL) Divergence
KL divergence measures the distance between two probability distributions P and Q: Dₖₗ(P‖Q) = Σ P(i) log(P(i)/Q(i)). Zero indicates identical distributions; larger values indicate greater dissimilarity.
5.2 Chi‑Square Test
The chi‑square statistic χ² = Σ (O − E)² / E compares observed counts O with expected counts E to assess whether deviations are statistically significant.
6. Tree (Isolation Forest)
Isolation Forest recursively partitions data using random hyperplanes. Points in low‑density regions require fewer splits to become isolated, yielding shorter path lengths and higher anomaly scores.
7. Graph
7.1 Largest Connected Component
In an undirected graph, a connected component groups vertices that are mutually reachable. Identifying maximal connected subgraphs helps detect coordinated fraudulent behavior (e.g., multiple accounts sharing devices).
7.2 Label Propagation Clustering
Label propagation iteratively updates node labels based on neighboring labels, producing subgraphs with high internal connectivity and low inter‑connectivity.
8. Behavior Sequence (Markov Chain)
User actions (e.g., page request, search, ad click, pagination) form a Markov chain. Transition probabilities are estimated from normal behavior; the likelihood of a new sequence indicates its normality.
9. Supervised Models
9.1 Gradient Boosted Decision Trees (GBDT)
Samples mined by unsupervised methods serve as training data. If labeled cheating samples are scarce, techniques like SMOTE or GAN can generate synthetic fraud instances. GBDT is then trained and evaluated on conversion metrics.
9.2 Wide & Deep
Wide & Deep combines wide (high‑dimensional linear features) and deep (neural network) components to balance memorization and generalization. The model can be further enhanced by incorporating unsupervised‑derived fraud samples.
10. Other Issues
10.1 Common Threshold Selection
Unsupervised methods often use quantiles or distribution inflection points to set thresholds, while supervised models rely on validation precision‑recall curves.
10.2 Transforming Non‑Gaussian to Gaussian
Non‑Gaussian features can be transformed using functions such as log(x + c) or xᶜ (0 < c < 1) to approximate a Gaussian distribution before applying statistical tests.
References
Charu C. Aggarwal et al., *Outlier Analysis*, Springer, 2016.
Varun Chandola, Arindam Banerjee et al., “Anomaly Detection: A Survey,” *ACM Computing Surveys*, 2009.
Kalyan Veeramachaneni, Ignacio Arnaldo et al., “AI2: Training a big data machine to defend,” *Proc. HPSC and IDS*, 2016.
Liu, Fei Tony, Kai Ming Ting, Zhi‑Hua Zhou, “Isolation Forest,” *ICDM*, 2008.
Cheng H. T., Koc L., Harmsen J., et al., “Wide & Deep Learning for Recommender Systems,” *ACM Computing Surveys*, 2016.
SMOTE: Synthetic Minority Over‑sampling Technique, *JAIR*, 2002.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
