Real‑Time Intelligent Anomaly Detection Platform at Ctrip: Integrating Flink and TensorFlow (Prophet)

The article describes Ctrip's Prophet platform, which combines Flink real‑time stream processing with TensorFlow deep‑learning models to provide intelligent, low‑latency anomaly detection, replacing traditional rule‑based alerts and addressing challenges such as holiday traffic and model scalability.

DataFunTalk

The talk by Pan Guoqing (Ctrip Big Data R&D Manager) introduces Prophet, a one‑stop anomaly detection solution that merges real‑time stream computing (Flink) with deep‑learning (TensorFlow) to replace rule‑based monitoring.

Traditional rule‑based alerts rely on statistical thresholds (e.g., YoY and MoM comparisons) and suffer from complex configuration, poor effectiveness, and high maintenance cost. Because Ctrip operates dozens of monitoring platforms, configuring rules across all of them is cumbersome, which prompted the creation of Prophet.
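To make the limitation concrete, here is a minimal sketch of the kind of threshold rule described above. The function names and the 30% thresholds are hypothetical and chosen only for illustration; they are not taken from Ctrip's monitoring platforms.

```python
def pct_change(current: float, baseline: float) -> float:
    """Relative change of current against a non-zero baseline."""
    return (current - baseline) / abs(baseline)


def rule_alert(current, same_period_last_year, same_period_last_month,
               yoy_threshold=0.3, mom_threshold=0.3):
    """Fire when the YoY or MoM drop exceeds its hand-tuned threshold.

    Both thresholds are illustrative assumptions; in practice each metric
    needs its own values, which is the configuration burden noted above.
    """
    yoy = pct_change(current, same_period_last_year)
    mom = pct_change(current, same_period_last_month)
    return yoy <= -yoy_threshold or mom <= -mom_threshold
```

A rule like this fires on any large drop, including an expected one such as a holiday dip, and it has to be tuned separately for every metric.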

Prophet’s key features include:

Support for time‑series data.

Smart alerts that replace all rule‑based alarms.

Deep‑learning algorithms for intelligent anomaly detection.

Real‑time warning powered by a stream‑processing engine.

The system architecture is layered:

Bottom layer (Hadoop): YARN schedules resources, running Flink jobs; HDFS stores TensorFlow‑trained models.

Middle layer (Engine): Kafka buffers real‑time data; Flink performs streaming computation; TensorFlow provides the training engine; a time‑series database stores intermediate results.

Top layer (Service): Clog collects job logs, Muise is the real‑time compute platform, Qconfig supplies configuration, and Hickwall offers simple monitoring.

Flink is chosen for its efficient state management, rich windowing (including sliding windows used by Ctrip), event‑time semantics, and fault‑tolerance guarantees.
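As an illustration of sliding‑window semantics, the sketch below computes in plain Python which windows an event timestamp falls into. It mirrors the usual sliding‑window assignment logic but is not Flink code; the function name and the window and slide lengths used with it are assumptions.

```python
def sliding_windows(timestamp_ms: int, size_ms: int, slide_ms: int):
    """Return the [start, end) windows that contain the timestamp, newest first.

    With a window larger than the slide, every event belongs to
    size_ms / slide_ms overlapping windows.
    """
    # Start of the most recent window that still contains the event.
    last_start = timestamp_ms - (timestamp_ms % slide_ms)
    windows = []
    start = last_start
    # Walk backwards until the window would end at or before the event.
    while start > timestamp_ms - size_ms:
        windows.append((start, start + size_ms))
        start -= slide_ms
    return windows
```

For example, with a 10‑minute window sliding every minute, `sliding_windows(3_725_000, 600_000, 60_000)` returns ten overlapping windows, so each new point is evaluated together with the preceding ten minutes of data.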

Prophet’s operational workflow is transparent to users: they configure alerts on their existing monitoring platform and select “smart alert.” Prophet then handles model training, Kafka ingestion, and Flink inference, and writes anomaly results back to Kafka for downstream consumption.

Key challenges include scarce negative samples (anomalies are rare in production), diverse metric types, and varying periodicity. Several deep‑learning models are explored:

RNN/LSTM: one model per metric, high accuracy but resource‑intensive.

DNN: a single model for all metrics, lower accuracy but broader coverage.

Model training occurs bi‑weekly, with data preprocessing (handling nulls, holiday adjustments) and feature extraction (time‑series and frequency features) before feeding TensorFlow.
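The preprocessing and feature‑extraction steps can be sketched roughly as follows. This is an assumption‑laden illustration rather than Prophet's pipeline: nulls are filled with the mean of the observed points, the time‑series features are a mean, a standard deviation, and the last difference, and the frequency features are the magnitudes of the first few DFT bins. Holiday adjustment is omitted, and all function names are hypothetical.

```python
import cmath
import math
from statistics import mean, pstdev


def fill_nulls(values):
    """Replace None with the mean of the observed points (one simple choice)."""
    observed = [v for v in values if v is not None]
    fallback = float(mean(observed))
    return [fallback if v is None else float(v) for v in values]


def dft_magnitudes(values, k):
    """Magnitudes of DFT bins 1..k, normalized by the series length."""
    n = len(values)
    mags = []
    for f in range(1, k + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * f * t / n)
                    for t, v in enumerate(values))
        mags.append(abs(coeff) / n)
    return mags


def extract_features(values, k=3):
    """Concatenate simple time-domain and frequency-domain features."""
    clean = fill_nulls(values)
    time_feats = [mean(clean), pstdev(clean), clean[-1] - clean[-2]]
    return time_feats + dft_magnitudes(clean, k)
```

The resulting flat feature vector is the kind of input that could be fed to a TensorFlow model. A series with a strong period shows up as a large magnitude in the matching DFT bin, which is one way to capture the varying periodicity of different metrics.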

Trained models are uploaded to HDFS; Flink jobs fetch the latest model from the configuration center, distribute it across TaskManagers, and perform real‑time inference on Kafka streams.
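One way to picture the model refresh inside a long‑running streaming job is a small holder object that reloads only when the published model location changes. The class and the two callables below are hypothetical stand‑ins for the configuration‑center lookup and the HDFS model loader; neither reflects a real Qconfig or Flink API.

```python
class ModelHolder:
    """Keeps the most recently published model loaded inside one worker."""

    def __init__(self, get_latest_path, load_model):
        self._get_latest_path = get_latest_path  # hypothetical config-center lookup
        self._load_model = load_model            # hypothetical loader, e.g. reading from HDFS
        self._path = None
        self._model = None

    def current(self):
        """Return the loaded model, reloading only when the published path changes."""
        latest = self._get_latest_path()
        if latest != self._path:
            # Load first, then record the path, so a failed load is retried next call.
            self._model = self._load_model(latest)
            self._path = latest
        return self._model
```

Calling `current()` before each inference keeps the job on the newest model without a restart, at the cost of one configuration lookup per call; a real job would more likely poll on a timer.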

Real‑time consumption uses Flink event‑time sliding windows (e.g., 10‑minute windows) to predict the next point from recent observations. Missing data are imputed using the mean or standard deviation, and anomalous intervals are replaced with the earlier predictions for those points so that they do not distort subsequent predictions.
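The window‑cleaning step might look like the following sketch. It assumes mean imputation for missing points and a per‑point anomaly flag from earlier detection; the function name and argument layout are illustrative, and the window is assumed to contain at least one normal observation.

```python
from statistics import mean


def build_model_input(observed, predicted, is_anomaly):
    """Clean one window before it is used to predict the next point.

    observed:   actual values, None where a point is missing
    predicted:  the model's earlier prediction for each point
    is_anomaly: True where the point was already flagged as anomalous
    """
    # Mean of the points that are both present and not flagged.
    normal = [v for v, bad in zip(observed, is_anomaly)
              if v is not None and not bad]
    fallback = float(mean(normal))

    cleaned = []
    for value, guess, bad in zip(observed, predicted, is_anomaly):
        if bad:
            cleaned.append(float(guess))   # keep an anomaly out of the next prediction
        elif value is None:
            cleaned.append(fallback)       # mean imputation for a missing point
        else:
            cleaned.append(float(value))
    return cleaned
```

For instance, `build_model_input([10.0, None, 50.0, 12.0], [10.5, 11.0, 11.5, 12.0], [False, False, True, False])` returns `[10.0, 11.0, 11.5, 12.0]`: the gap is filled with the mean of the normal points, and the flagged spike is swapped for its prediction.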

Detection logic combines three criteria:

Abnormal type and sensitivity (high, medium, low).

Deviation between predicted and actual values.

Comparison with historical mean and standard deviation.
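A minimal sketch of how the three criteria could be combined is shown below. The source does not say how they are weighted, so requiring all three to agree is an assumption, as are the sensitivity thresholds, the three‑sigma default, and every name in the code.

```python
# Hypothetical relative-deviation thresholds per sensitivity level.
SENSITIVITY = {"high": 0.1, "medium": 0.2, "low": 0.3}


def is_anomalous(actual, predicted, hist_mean, hist_std,
                 kind="drop", sensitivity="medium", n_sigma=3.0):
    """Flag a point only if all three criteria agree (an assumed combination)."""
    threshold = SENSITIVITY[sensitivity]
    deviation = (actual - predicted) / max(abs(predicted), 1e-9)

    # Criterion 1: the anomaly type decides which direction counts.
    if kind == "drop":
        if deviation >= 0:
            return False
    elif kind == "rise":
        if deviation <= 0:
            return False
    else:
        raise ValueError(f"unknown anomaly kind: {kind}")

    # Criterion 2: deviation between predicted and actual values.
    if abs(deviation) < threshold:
        return False

    # Criterion 3: distance from the historical mean, in standard deviations.
    return abs(actual - hist_mean) > n_sigma * hist_std
```

Under these assumptions, a value of 75 against a prediction of 100 (historical mean 100, standard deviation 5) is flagged at medium sensitivity but not at low sensitivity, which shows how the sensitivity setting trades recall against alert volume.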

Results show Prophet achieves ~90% hit rate versus 74% for rule‑based alerts, reduces alert volume by 5‑10×, and covers 95% of business‑critical metrics with millisecond‑level latency.

Future work includes expanding DNN usage, refining holiday‑alignment algorithms, increasing platform coverage to 70‑80% of monitoring systems, and building a self‑monitoring intelligent alert system for Flink jobs.
