How AI Detects and Diagnoses Anomalies in Ctrip Train Ticket Metrics
This article presents a comprehensive AI‑driven system for automatically detecting anomalies in over 1,000 Ctrip train‑ticket business metrics and pinpointing their root causes, detailing the background, unsupervised algorithms, detection and attribution pipelines, practical results, and future improvements.
Background
Ctrip’s train‑ticket business monitors over 1,000 KPI metrics. Manual rule‑based monitoring cannot keep up with the volume, diversity, and rapid business changes. The objective is to replace manual checks with an AI‑driven pipeline that automatically detects metric anomalies and pinpoints the fine‑grained dimensions that cause them.
Anomaly Detection System
Algorithms
Six unsupervised detectors are used:
LOF (Local Outlier Factor) – excels at local outliers.
KNN (k‑Nearest Neighbors) – scores each point by its distance to its nearest neighbors; the strongest of the six at global outlier detection.
CBLOF – clustering‑based outlier factor.
COF – connectivity‑based outlier factor.
IForest – isolation forest.
PCA – reconstruction error from principal components.
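To make the idea of a raw anomaly score concrete, here is a minimal sketch of a KNN-style detector in plain Python. It is an illustration only: the article does not say which implementation is used (libraries such as PyOD provide all six detectors), and the function name `knn_scores`, the choice `k=3`, and the sample series are assumptions made for this example.

```python
def knn_scores(values, k=3):
    """Score each point by its distance to its k-th nearest neighbour."""
    scores = []
    for i, v in enumerate(values):
        # Distances from point i to every other point, smallest first.
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical metric values with one obvious outlier at index 5.
series = [10.0, 10.5, 9.8, 10.2, 10.1, 25.0, 9.9, 10.3]
scores = knn_scores(series, k=3)
most_anomalous = max(range(len(series)), key=scores.__getitem__)
print(most_anomalous, scores[most_anomalous])
```

A point that sits far from all of its neighbors gets a large score, which is why this family of detectors works well for global outliers.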
Detection Pipeline
Time‑series analysis –
If the series passes a stationarity test, global algorithms are applied directly.
If periodicity is detected, STL decomposition extracts trend, seasonal, and residual components; the residual is fed to the detectors.
Otherwise the series is transformed into a probability distribution; the distribution’s skew determines the threshold method.
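The periodic branch can be sketched as follows. This is not STL (it has no LOESS smoothing and no trend component); it is a simplified stand-in that removes a per-phase median profile so the remaining residual can be passed to a detector. The function name `seasonal_residual`, the 7-point period, and the sample data are assumptions made for this example. For a real decomposition, statsmodels provides an `STL` class in `statsmodels.tsa.seasonal`.

```python
from statistics import median


def seasonal_residual(values, period):
    """Subtract the median of each seasonal phase and return the residuals."""
    profile = [median(values[phase::period]) for phase in range(period)]
    return [v - profile[i % period] for i, v in enumerate(values)]


# Three repetitions of a hypothetical weekly pattern, with a spike at index 10.
week = [100, 102, 101, 103, 105, 140, 150]
series = [float(x) for x in week * 3]
series[10] += 60.0

residual = seasonal_residual(series, period=7)
spike_index = max(range(len(residual)), key=lambda i: abs(residual[i]))
print(spike_index, residual[spike_index])
```

The weekend peaks in the raw series (140 and 150) are larger than most weekday values, yet they vanish in the residual, while the injected spike remains.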
Anomaly score calculation – Two algorithm groups are formed:
Global group: {KNN, IForest, PCA, CBLOF}
Local group: {LOF, KNN, COF, CBLOF}
Each algorithm outputs a raw anomaly score for the series.
Threshold determination –
Low‑skew, near‑symmetric distributions: compute absolute Z‑scores, sort descending, and apply the elbow method to select a cut‑off.
High‑skew distributions: use a box‑plot (Q1, Q3, IQR) to define upper outlier bounds.
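The two threshold rules can be sketched as below. Several details are assumptions, because the article does not state them: the whisker multiplier of 1.5 for the box plot, the "largest drop between neighbours" definition of the elbow, and all function names and sample values.

```python
from statistics import mean, pstdev, quantiles


def boxplot_upper_bound(scores, whisker=1.5):
    """Upper outlier bound Q3 + whisker * IQR (whisker=1.5 is an assumption)."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return q3 + whisker * (q3 - q1)


def abs_zscores(scores):
    """Absolute Z-score of every point; assumes the scores are not all equal."""
    mu, sigma = mean(scores), pstdev(scores)
    return [abs(s - mu) / sigma for s in scores]


def elbow_cutoff(abs_z):
    """Sort descending and cut just before the largest drop between neighbours."""
    ranked = sorted(abs_z, reverse=True)
    drops = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    return ranked[drops.index(max(drops))]


# High-skew case: hypothetical anomaly scores with a long right tail.
skewed = [0.2, 0.4, 0.3, 0.3, 0.2, 14.8, 0.3, 0.2]
bound = boxplot_upper_bound(skewed)
flagged_box = [i for i, s in enumerate(skewed) if s > bound]

# Low-skew case: hypothetical near-symmetric values with outliers on both sides.
symmetric = [-9.0, -0.4, -0.1, 0.0, 0.1, 0.3, 0.2, 9.0, -0.2]
z = abs_zscores(symmetric)
cutoff = elbow_cutoff(z)
flagged_z = [i for i, value in enumerate(z) if value >= cutoff]

print(flagged_box, flagged_z)
```

The box plot only looks at the upper tail, which suits one-sided, skewed score distributions; absolute Z-scores treat both tails alike, which suits symmetric ones.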
The final anomaly decision is made by hard voting across the detectors. The voting threshold depends on metric importance (e.g., P0 metrics receive lower thresholds than P2, so they are flagged more readily).
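One possible reading of this voting step is sketched below: each detector casts a yes/no vote per point, and a point is flagged when the number of votes reaches a minimum that depends on the metric's priority. The `MIN_VOTES` table, the function name `hard_vote`, and the sample flags are hypothetical; the article does not give the actual thresholds.

```python
def hard_vote(flags_by_detector, min_votes):
    """Flag a point when at least `min_votes` detectors flag it."""
    detector_flags = list(flags_by_detector.values())
    n_points = len(detector_flags[0])
    votes = [sum(flags[i] for flags in detector_flags) for i in range(n_points)]
    return [v >= min_votes for v in votes]


# Hypothetical minimum vote counts: more important metrics need fewer votes.
MIN_VOTES = {"P0": 2, "P1": 3, "P2": 4}

# Hypothetical per-detector decisions for four time points.
flags = {
    "KNN":     [False, True, False, True],
    "IForest": [False, True, False, False],
    "PCA":     [False, True, True, False],
    "CBLOF":   [False, True, False, True],
}

p0_decision = hard_vote(flags, MIN_VOTES["P0"])
p2_decision = hard_vote(flags, MIN_VOTES["P2"])
print(p0_decision, p2_decision)
```

With these numbers, the last point (two votes) raises an alert for a P0 metric but not for a P2 metric, which matches the idea that core metrics are monitored more sensitively.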
Root‑Cause Localization System
When an anomaly is flagged, the system searches the exponential space of dimension‑value subsets to find the subset that best explains the deviation.
Data Construction
Metrics are split into the most granular dimension combinations (e.g., city × app channel × order type). Historical values are collected for each combination. Prediction methods depend on series type:
Stationary series – Exponentially Weighted Moving Average (EWMA).
Non‑stationary series – median of the three preceding points.
The real and predicted values form a three‑dimensional data cube (dimension × time × value) that serves as input for root‑cause algorithms.
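The data construction step can be sketched as follows. The smoothing factor `alpha=0.5`, the dimension names, the sample histories, and the record layout are all assumptions made for this example; the article does not specify them.

```python
from statistics import median


def ewma_forecast(history, alpha=0.5):
    """One-step-ahead EWMA forecast (alpha=0.5 is an illustrative choice)."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def median3_forecast(history):
    """Forecast as the median of the three preceding points."""
    return median(history[-3:])


# Hypothetical leaf combinations: (city, app channel, order type)
# mapped to (history, latest real value, is_stationary).
leaves = {
    ("Beijing", "app", "refund"): ([100.0, 104.0, 98.0, 102.0], 180.0, True),
    ("Xiamen", "web", "refund"): ([40.0, 120.0, 90.0, 150.0], 200.0, False),
}

# One record per leaf combination: the real value next to its prediction.
cube = []
for dims, (history, real, is_stationary) in leaves.items():
    forecast = ewma_forecast(history) if is_stationary else median3_forecast(history)
    cube.append({"dims": dims, "real": real, "predicted": forecast})

for row in cube:
    print(row)
```

Each record pairs a real value with its prediction for one leaf combination, which is the shape of input the root-cause algorithms compare.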
Algorithms and Ensemble
Four algorithms are combined via hard voting:
Adtributor – assumes a single dimension is responsible; computes Explanatory Power (EP) and Surprise (S) scores for each dimension.
HotSpot – defines a potential score, handles multiple simultaneous causes, and searches the space with Monte‑Carlo Tree Search (MCTS) plus hierarchical pruning.
Squeeze – extends HotSpot with a generalized ripple‑effect principle, a refined potential score, and clusters fine‑grained attribute groups before search.
PSqueeze – further extends Squeeze with probabilistic clustering based on the generalized ripple effect (GRE) and a Generalized Potential Score (GPS).
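As an illustration of the first of these algorithms, the sketch below computes Explanatory Power and Surprise for the elements of a single dimension. The formulas follow the definitions in the Adtributor paper as understood here (EP is an element's share of the total change; Surprise is a Jensen-Shannon-style divergence between forecast and actual shares). The article itself does not give them, and the city values are made up.

```python
import math


def explanatory_power(actual, forecast, total_actual, total_forecast):
    """Share of the overall change explained by one element.

    Assumes total_actual != total_forecast, i.e. the metric really moved.
    """
    return (actual - forecast) / (total_actual - total_forecast)


def surprise(actual, forecast, total_actual, total_forecast):
    """Divergence between the forecast share p and the actual share q."""
    p = forecast / total_forecast
    q = actual / total_actual

    def term(x, y):
        return 0.0 if x == 0 else 0.5 * x * math.log(2 * x / (x + y))

    return term(p, q) + term(q, p)


# Hypothetical refund counts per city: (forecast, actual).
cities = {"Beijing": (100.0, 180.0), "Shanghai": (100.0, 105.0), "Xiamen": (50.0, 90.0)}
total_f = sum(f for f, _ in cities.values())
total_a = sum(a for _, a in cities.values())

ep = {c: explanatory_power(a, f, total_a, total_f) for c, (f, a) in cities.items()}
s = {c: surprise(a, f, total_a, total_f) for c, (f, a) in cities.items()}

for city in cities:
    print(city, round(ep[city], 3), round(s[city], 5))
```

In this made-up example Beijing and Xiamen together account for almost all of the change, while Shanghai contributes very little. EP values across one dimension sum to 1, which makes them easy to rank and accumulate.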
Practical Results
A case study on extreme weather (Typhoon “Doksuri”) showed sudden spikes in ticket refunds. The anomaly detection pipeline correctly flagged the abnormal days. The root‑cause system identified the affected city dimensions (e.g., Beijing, Xiamen) as the primary drivers of the spikes.
Manual verification of the top‑ranked dimensions matched the algorithm’s output, confirming the system’s accuracy.
Performance Metrics and Outlook
On core metrics the system achieved:
Precision = 67 %
Recall = 83 %
F1‑score = 74 %
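The F1-score is the harmonic mean of precision and recall, so the three reported figures can be checked against each other with a few lines of arithmetic:

```python
precision, recall = 0.67, 0.83

# F1 = 2 * P * R / (P + R)
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))
```

The result rounds to 0.74, consistent with the reported F1-score of 74 %.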
Future work includes:
Improving detection for drifting series (e.g., stationary + drift, periodic + drift).
Adding support for gradual upward/downward trends and long‑term shifts.
Enhancing multi‑dimensional root‑cause accuracy for very large attribute spaces (e.g., >10,000 element combinations).
References
Han S, et al. “ADBench: Anomaly detection benchmark.” NeurIPS 2022.
Bhagwan R, et al. “Adtributor: Revenue debugging in advertising systems.” NSDI 2014.
Sun Y, et al. “HotSpot: Anomaly localization for additive KPIs with multi‑dimensional attributes.” IEEE Access 2018.
Li Z, et al. “Generic and robust localization of multi‑dimensional root causes.” ISSRE 2019.
Li Z, et al. “Generic and robust root cause localization for multi‑dimensional data in online service systems.” JSS 2023.
