How LSTM, k‑means, and Probability Density Power Intelligent Anomaly Detection in AIOps

This article explains how WeBank’s intelligent operations team combines LSTM‑based forecasting with Gaussian analysis, k‑means feature clustering, and probability‑density modeling to automatically detect and warn about anomalies in key business metrics, moving beyond traditional threshold‑based monitoring.

Efficient Ops

Intelligent Operations (AIOps) integrates big data and machine learning to process the ever‑growing volume, variety, and velocity of IT data, providing a foundation for modern IT operations management. Based on frontline experience, WeBank’s intelligent operations team authored a series of articles to share their practice.

LSTM and Gaussian Detection

The method consists of two steps: curve prediction and anomaly judgment. After evaluating ARIMA, Holt‑Winters, and LSTM, the team selected LSTM because it captures long‑range dependencies and reaches a very low loss (about 0.0001) on normalized data. In normal operation the predicted curve closely tracks the actual one; significant deviations indicate potential anomalies.
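The article says only that the LSTM is trained on normalized data. A common way to frame a one-step forecaster is min-max scaling followed by sliding (lookback window, next value) pairs. The sketch below shows that framing under those assumptions; the helper names and the `lookback` parameter are hypothetical, and the LSTM model itself is omitted.

```python
def min_max_normalize(series):
    """Scale values to [0, 1]; return the scaled series and (lo, span) to undo it."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a flat series
    return [(v - lo) / span for v in series], (lo, span)


def denormalize(value, params):
    """Map a normalized prediction back to the metric's original scale."""
    lo, span = params
    return value * span + lo


def make_supervised_windows(series, lookback):
    """Turn a series into (input window, next value) pairs for a one-step forecaster."""
    return [(series[i:i + lookback], series[i + lookback])
            for i in range(len(series) - lookback)]


scaled, params = min_max_normalize([2, 4, 6])
print(scaled)  # [0.0, 0.5, 1.0]
print(make_supervised_windows([1, 2, 3, 4], lookback=2))  # [([1, 2], 3), ([2, 3], 4)]
```

Keeping `(lo, span)` lets the forecast be mapped back to the metric's units before residuals are computed.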

Empirical analysis shows that the difference between prediction and reality (the residual) approximately follows a Gaussian distribution, so points that fall in the distribution’s low‑probability tail are treated as anomalies. This approach excels at detecting sudden spikes, as illustrated in the figures, but struggles with small‑amplitude, long‑duration changes: because the LSTM forecasts from recent observations, its prediction line is gradually pulled toward the shifted level, and the residual stays too small to flag.
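A minimal sketch of the anomaly-judgment step, assuming the Gaussian is fitted to residuals from a known-normal period and that a two-sided tail probability below a hypothetical `alpha = 1e-3` counts as anomalous. The function names are illustrative, and the LSTM forecasts are taken as given inputs.

```python
from statistics import NormalDist, mean, stdev


def fit_residual_model(actual, predicted):
    """Fit a Gaussian to residuals (actual - predicted) from a normal period."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return NormalDist(mean(residuals), stdev(residuals))


def is_anomalous(actual, predicted, model, alpha=1e-3):
    """Flag a point whose residual lies in the low-probability tails of the model."""
    z = (actual - predicted - model.mean) / model.stdev
    tail_prob = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided tail probability
    return tail_prob < alpha


history_actual = [10.0, 10.1, 9.9, 10.05, 9.95, 10.0]
history_pred = [10.0] * 6
model = fit_residual_model(history_actual, history_pred)
print(is_anomalous(12.0, 10.0, model))   # True: a sudden spike
print(is_anomalous(10.05, 10.0, model))  # False: ordinary noise
```

The second call also hints at the blind spot described above: if the forecast drifts toward a slowly shifted level, the residual stays in this "ordinary noise" range.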

k‑means Feature Detection

To cover this blind spot of the LSTM approach, a supplementary algorithm based on k‑means clustering is introduced. Four features are extracted from each sliding window: the mean, the slope, the zero‑value rate, and the root mean square (RMS) of the first‑order differences. For high‑frequency metrics, the mean and slope best reveal slow changes; for low‑frequency metrics, the zero‑value rate and the RMS of differences are more indicative.
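The four window features can be computed as follows. This is a sketch that assumes the slope is an ordinary least-squares fit against the sample index; the article does not specify how the slope is estimated, and the function name is hypothetical.

```python
import math


def window_features(window):
    """Return (mean, slope, zero-value rate, RMS of first-order differences)."""
    n = len(window)
    if n < 2:
        raise ValueError("a window needs at least two points")
    x_bar = (n - 1) / 2
    y_bar = sum(window) / n
    # Least-squares slope of the values against their index 0..n-1.
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(window))
    slope = sxy / sxx
    zero_rate = sum(1 for v in window if v == 0) / n
    diffs = [b - a for a, b in zip(window, window[1:])]
    rms_diff = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return (y_bar, slope, zero_rate, rms_diff)


print(window_features([0, 1, 2, 3]))  # (1.5, 1.0, 0.25, 1.0)
```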

During detection, the current window is clustered together with its neighboring windows and historical windows using k = 2. If the current window ends up in a cluster of its own, apart from the reference windows, it is considered anomalous, following the “rarity = anomaly” principle.
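A plain-Python sketch of the k = 2 check on feature tuples such as those from the previous step. Several choices here are assumptions, not details from the article: features are z-score standardized before clustering, k-means starts from the two points farthest apart (deterministic, rather than random seeding), and "separate cluster" is read as "shares its cluster with no reference window". All function names are hypothetical.

```python
import math


def _standardize(points):
    """Z-score each feature column so no single feature dominates the distance."""
    cols = list(zip(*points))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds)) for p in points]


def _dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans2(points, iters=20):
    """Lloyd's k-means with k = 2, initialized from the farthest-apart pair."""
    i, j = max(((i, j) for i in range(len(points)) for j in range(i + 1, len(points))),
               key=lambda ij: _dist2(points[ij[0]], points[ij[1]]))
    centroids = [points[i], points[j]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if _dist2(p, centroids[0]) <= _dist2(p, centroids[1]) else 1
                  for p in points]
        new = []
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            new.append(tuple(sum(c) / len(members) for c in zip(*members))
                       if members else centroids[k])
        if new == centroids:  # converged
            break
        centroids = new
    return labels


def current_window_is_anomalous(current, references):
    """True if the current window shares its cluster with no reference window."""
    labels = kmeans2(_standardize([current] + list(references)))
    return all(lab != labels[0] for lab in labels[1:])


refs = [(10, 0, 0, 1.0), (10.2, 0.1, 0, 1.1), (9.9, -0.1, 0, 0.9), (10.1, 0, 0, 1.0)]
print(current_window_is_anomalous((20, 2, 0, 3.0), refs))       # True
print(current_window_is_anomalous((10.05, 0.05, 0, 1.0), refs))  # False
```

Because k is fixed at 2, normal data is still split into two clusters; the check only fires when the current window is isolated, which matches the "rarity = anomaly" idea.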

Probability Density Detection

Success‑rate curves pose a challenge because a success‑rate value on its own does not reflect the underlying transaction volume. The method therefore models the distribution of the number of successful transactions, given the number of transactions observed and the expected overall success rate (e.g., 95%). It computes the cumulative probability of observing a success count at or below the one actually seen; events whose probability is near zero are flagged as anomalies.
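One way to realize this, assuming transactions succeed independently so the success count is binomial with the baseline rate, is sketched below. The article does not name the distribution; the function names and the `threshold = 1e-4` default are hypothetical.

```python
from math import comb


def success_count_cdf(successes, total, rate):
    """P(X <= successes) for X ~ Binomial(total, rate)."""
    return sum(comb(total, k) * rate ** k * (1 - rate) ** (total - k)
               for k in range(successes + 1))


def success_rate_alert(successes, total, baseline_rate=0.95, threshold=1e-4):
    """Alert when this few successes would be very unlikely at the baseline rate."""
    if total == 0:
        return False  # no transactions, nothing to judge
    return success_count_cdf(successes, total, baseline_rate) < threshold


print(success_rate_alert(0, 1))    # False: one failure out of one is plausible
print(success_rate_alert(15, 30))  # True: 50% over 30 transactions is very unlikely
```

The same observed rate produces very different probabilities depending on `total`, which is exactly the volume information a bare success-rate value hides.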

This approach correctly treats a 0% success rate over a single transaction as plausible, while flagging a 50% success rate over thirty transactions as highly unlikely, thereby improving detection on success‑rate metrics.
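Under the same binomial assumption (independent transactions, baseline rate p = 0.95, which the article does not state explicitly), the two cases work out as:

```latex
P(X = 0 \mid n = 1,\ p = 0.95) = 0.05

P(X \le 15 \mid n = 30,\ p = 0.95)
  = \sum_{k=0}^{15} \binom{30}{k}\, 0.95^{k}\, 0.05^{30-k}
  \approx 2.3 \times 10^{-12}
```

A 5% chance is ordinary, whereas a probability around 10^{-12} falls far below any reasonable alert threshold.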

Overall, the framework follows the principle that rare events indicate anomalies. The probability threshold sets the balance between false positives and missed alerts, so it must be chosen carefully.

The algorithms are summarized in the diagram below.

Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. It focuses on operations transformation and aims to accompany readers throughout their operations careers, growing together along the way.
