Anomaly Detection in Operations: From Statistical Rules to Deep Learning with LSTM and Real‑Time Streaming
This article describes how a large‑scale operations monitoring system at Ctrip evolved from rule‑based alerts to a deep‑learning‑driven anomaly detection pipeline using LSTM models and Flink streaming, achieving a ten‑fold reduction in alert volume while improving fault detection.
The author, a senior data‑analysis manager at Ctrip, introduces the critical role of anomaly detection in operations, highlighting the trade‑off between false positives and false negatives and the high cost of manual labeling.
Traditional statistical methods and rule‑based alerts struggle with subjectivity, scalability, and maintaining thresholds across thousands of metrics, leading to persistent over‑alerting or missed faults.
To address these challenges, the team designed a new algorithmic solution that reduces alert volume without sacrificing detection of order‑impacting failures, emphasizing portability and real‑time performance.
They adopted a deep‑learning approach based on recurrent neural networks, specifically LSTM, which can capture long‑term temporal dependencies in time‑series monitoring data.
The offline training pipeline cleanses historical data, removes holidays, imputes missing values, and extracts multi‑scale sliding‑window features to categorize series into periodic, stationary, and non‑periodic groups before feeding them into TensorFlow‑based LSTM models.
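The windowing and categorization steps above can be sketched as follows. This is a minimal illustration, not Ctrip's actual pipeline: the window and horizon sizes (10 in, 5 out) come from the article, but the function names, the autocorrelation test, and all thresholds are assumptions.

```python
import numpy as np

def sliding_windows(series, window=10, horizon=5):
    """Build (input, target) pairs from a 1-D series with a sliding window.

    Window and horizon sizes mirror the article's setup (10 in, 5 out);
    everything else here is an illustrative assumption.
    """
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i : i + window])
        y.append(series[i + window : i + window + horizon])
    return np.array(X), np.array(y)

def categorize(series, period=1440):
    """Crude classification into periodic / stationary / non-periodic.

    Checks autocorrelation at an assumed daily lag (1440 minutes) and the
    coefficient of variation; the thresholds are placeholders, not
    Ctrip's actual criteria.
    """
    s = np.asarray(series, dtype=float)
    if s.std() < 1e-8:
        return "stationary"
    if len(s) > 2 * period:
        c = s - s.mean()
        acf = float(np.corrcoef(c[:-period], c[period:])[0, 1])
        if acf > 0.5:
            return "periodic"
    cv = s.std() / (abs(s.mean()) + 1e-8)  # coefficient of variation
    return "stationary" if cv < 0.1 else "non-periodic"
```

The (input, target) pairs produced this way can be fed directly to a TensorFlow LSTM, with a separate model per series category.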
During online detection, a sliding window of the most recent ten points is used; the model predicts the next five points and compares predictions with actual values, applying a set of handcrafted rules to decide anomalies, with sensitivity levels tuned for different metric types.
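A decision rule of this shape might look like the sketch below. The residual threshold, the `sensitivity` multiplier, and the majority-vote criterion are illustrative stand-ins for the handcrafted rules the article describes, not the actual rule set.

```python
import numpy as np

def is_anomalous(predicted, actual, sensitivity=3.0):
    """Compare the model's 5-point forecast with observed values.

    `sensitivity` is the tunable knob mentioned in the article: a larger
    multiplier tolerates more deviation, suiting noisier metric types.
    The specific tolerance formula here is an assumption.
    """
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    residuals = np.abs(actual - predicted)
    # Tolerance scales with the spread of the forecast itself.
    tolerance = sensitivity * (predicted.std() + 1e-8)
    # Require a majority of points to deviate, so a single spike
    # does not immediately trigger an alert.
    return int((residuals > tolerance).sum()) >= len(actual) // 2 + 1
```

Requiring several consecutive deviating points is one common way to trade a little detection latency for a large drop in false positives.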
Real‑time deployment is achieved with Apache Flink, which processes Kafka streams, loads refreshed models from HDFS every five minutes, and returns detection results to a Kafka output, reducing end‑to‑end latency by about 40 seconds compared to batch‑oriented Python pipelines.
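The per-metric streaming logic can be illustrated with a toy stand-in for the Flink job: consume points one at a time, keep the last ten values, and run detection once the window fills. The real system reads from Kafka, keeps this state inside Flink operators, and reloads models from HDFS every five minutes; here `model_predict` is a hypothetical callable abstracting the LSTM forecast, and the in-memory loop merely simulates the stream.

```python
from collections import deque

def detect_stream(points, model_predict, window=10, horizon=5):
    """Simulated streaming detection loop (stand-in for the Flink job).

    `points` stands in for a Kafka topic; `model_predict` is any callable
    taking the current window and returning `horizon` forecast points.
    """
    buffer = deque(maxlen=window)  # Flink keyed state in the real system
    results = []
    for value in points:
        buffer.append(value)
        if len(buffer) == window:
            forecast = model_predict(list(buffer))
            # In production this (value, forecast) pair would be compared
            # by the rule engine and the verdict written back to Kafka.
            results.append((value, forecast[:horizon]))
    return results
```

Keeping the model in operator memory and swapping it on a timer is what removes the batch-scheduling overhead that the earlier Python pipeline paid on every run.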
The solution demonstrates a roughly ten‑fold reduction in alert volume while slightly increasing fault detection rates, though limitations remain such as high resource consumption for per‑metric models, difficulty with low‑amplitude metrics, and challenges with non‑periodic random series.
The authors conclude by seeking a more universal model to handle larger metric sets and invite collaboration from the community.
Ctrip Technology
The official Ctrip Technology account, sharing technical content and growing with the community.