Anomaly Detection and Root Cause Analysis System for Ctrip Train Ticket Business Metrics
This article presents an AI‑driven system that automatically detects anomalies in more than 1,000 Ctrip train‑ticket business metrics using six unsupervised algorithms and locates their root causes with a hard‑voting ensemble of four specialized methods. It also reports practical results and outlines future enhancements.
Author : Long Chuanjing, Ctrip algorithm engineer focusing on anomaly detection, root cause analysis, and time‑series forecasting.
Abstract : Ctrip train‑ticket services monitor more than 1,000 business indicators, making manual anomaly inspection costly and rule‑based methods insufficient for diverse and evolving metrics. An AI solution is proposed to fully automate monitoring, detect abnormal indicators, and identify their potential causes.
Background : Rapid growth of the train‑ticket business has produced a large number of metrics, some with complex seasonal patterns and others that are largely stable. Timely detection of abnormal spikes or drops and rapid root‑cause identification are critical for maintaining service quality.
Metric Characteristics : Indicators exhibit two main time‑series types—(1) periodic patterns driven by travel habits and (2) stationary patterns with little trend. External factors such as promotions, coupons, or pandemic‑related policies can cause abrupt fluctuations.
Key Pain Points : (1) Rule‑based detection cannot scale to 1,000+ metrics and lacks adaptability to new services. (2) Seasonal variability, especially during pandemic‑induced travel changes, hampers accurate detection. (3) High‑dimensional metrics require extensive manual analysis to pinpoint the exact dimension causing an anomaly.
Anomaly Detection Sub‑system : Six unsupervised algorithms (LOF, KNN, CBLOF, COF, IForest, and PCA) are selected for their performance on local and global outliers. Each algorithm computes an anomaly score for every data point. The score threshold is derived from the Z‑score combined with the elbow method when the score distribution has low skew, and from a box plot when it is highly skewed. A hard‑voting mechanism then aggregates the six results, with voting weights adjusted by metric importance.
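The thresholding and voting steps can be illustrated with a minimal sketch. It is not the article's implementation: the skewness cutoff, the fixed Z‑score multiplier `k` (used here in place of the elbow‑based selection), the `min_votes` parameter (standing in for the importance‑adjusted weights), and the sample scores are all assumptions made for illustration.

```python
import statistics

def skewness(scores):
    """Population skewness of a list of anomaly scores."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        return 0.0
    return sum(((s - mean) / std) ** 3 for s in scores) / len(scores)

def zscore_threshold(scores, k=3.0):
    """Mean + k * std; k is fixed here (assumption), not chosen by an elbow search."""
    return statistics.fmean(scores) + k * statistics.pstdev(scores)

def boxplot_threshold(scores, whisker=1.5):
    """Upper box-plot fence: Q3 + whisker * IQR."""
    q1, _, q3 = statistics.quantiles(scores, n=4)
    return q3 + whisker * (q3 - q1)

def pick_threshold(scores, skew_cutoff=1.0):
    """Z-score rule for low-skew scores, box-plot rule otherwise (cutoff is an assumption)."""
    if abs(skewness(scores)) < skew_cutoff:
        return zscore_threshold(scores)
    return boxplot_threshold(scores)

def hard_vote(scores_by_detector, min_votes):
    """Flag a point when at least min_votes detectors score it above their own threshold."""
    n_points = len(next(iter(scores_by_detector.values())))
    votes = [0] * n_points
    for scores in scores_by_detector.values():
        threshold = pick_threshold(scores)
        for i, score in enumerate(scores):
            if score > threshold:
                votes[i] += 1
    return [v >= min_votes for v in votes]

# Hypothetical scores from three detectors over 20 time points.
normal = [1.0, 1.1, 0.9, 1.2, 0.8] * 4
spiked = normal[:-1] + [10.0]          # last point is an obvious outlier
scores_by_detector = {
    "detector_a": spiked,
    "detector_b": [2 * s for s in spiked],
    "detector_c": normal,              # this detector sees nothing unusual
}
flags = hard_vote(scores_by_detector, min_votes=2)
print([i for i, flagged in enumerate(flags) if flagged])
```

Lowering `min_votes` for a critical metric trades precision for recall, which is one simple way to approximate importance‑dependent voting.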
Root‑Cause Localization Sub‑system : Four algorithms (Adtributor, HotSpot, Squeeze, and PSqueeze) are combined in a hard‑voting ensemble. The system first builds a multidimensional data cube: it splits each metric along its fine‑grained dimensions, forecasts an expected value for every slice (EWMA for stationary series, the median of recent points for non‑stationary series), and compares actual values with expected values.
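A minimal sketch of the cube‑building step follows. The smoothing factor `alpha`, the median window, the caller‑supplied stationarity flags, and the city pairs with their values are hypothetical; the article does not specify how stationarity is decided or which parameters are used.

```python
import statistics

def ewma_forecast(history, alpha=0.3):
    """Expected value for a stationary series: the final EWMA level."""
    level = history[0]
    for x in history[1:]:
        level = alpha * x + (1 - alpha) * level
    return level

def median_forecast(history, window=5):
    """Expected value for a non-stationary series: median of the last `window` points."""
    return statistics.median(history[-window:])

def build_cube(histories, actuals, stationary):
    """Return one row per leaf (dimension combination), sorted by absolute deviation."""
    rows = []
    for leaf, history in histories.items():
        if stationary[leaf]:
            expected = ewma_forecast(history)
        else:
            expected = median_forecast(history)
        rows.append({
            "leaf": leaf,
            "actual": actuals[leaf],
            "expected": expected,
            "deviation": actuals[leaf] - expected,
        })
    rows.sort(key=lambda r: abs(r["deviation"]), reverse=True)
    return rows

# Hypothetical refund counts keyed by (departure city, arrival city).
histories = {
    ("Xiamen", "Shanghai"): [100, 102, 98, 101, 99, 100],   # stable
    ("Beijing", "Shanghai"): [40, 55, 70, 85, 100, 115],    # trending
}
actuals = {("Xiamen", "Shanghai"): 180, ("Beijing", "Shanghai"): 130}
stationary = {("Xiamen", "Shanghai"): True, ("Beijing", "Shanghai"): False}

cube = build_cube(histories, actuals, stationary)
for row in cube:
    print(row["leaf"], row["actual"], round(row["expected"], 2), round(row["deviation"], 2))
```

The resulting actual and expected values per leaf are the inputs that attribution methods such as Adtributor or HotSpot consume.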
Adtributor evaluates each dimension using EP (explanatory power) and S (surprise) scores; HotSpot employs a potential score with Monte‑Carlo Tree Search; Squeeze adds a generalized ripple‑effect (GRE) principle and clustering; PSqueeze extends Squeeze with probabilistic clustering based on GRE.
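As an illustration of the two Adtributor scores, the sketch below follows the definitions commonly given for that method: explanatory power is an element's share of the total change, and surprise is the Jensen‑Shannon divergence term between the element's expected and actual shares. The function names and the sample numbers are assumptions, not values from the article.

```python
import math

def explanatory_power(actual, forecast, total_actual, total_forecast):
    """Share of the overall change (actual - forecast) explained by one element."""
    return (actual - forecast) / (total_actual - total_forecast)

def surprise(actual, forecast, total_actual, total_forecast):
    """Jensen-Shannon divergence term between expected share p and actual share q."""
    p = forecast / total_forecast
    q = actual / total_actual

    def term(x, y):
        # By convention, 0 * log(0) is treated as 0.
        return 0.0 if x == 0 else 0.5 * x * math.log(2 * x / (x + y))

    return term(p, q) + term(q, p)

# Hypothetical refund counts for one dimension (departure city).
forecast = {"city_a": 100, "city_b": 100, "city_c": 100}
actual = {"city_a": 100, "city_b": 100, "city_c": 160}
total_f = sum(forecast.values())
total_a = sum(actual.values())

ep = {k: explanatory_power(actual[k], forecast[k], total_a, total_f) for k in actual}
s = {k: surprise(actual[k], forecast[k], total_a, total_f) for k in actual}

for k in actual:
    print(k, round(ep[k], 3), round(s[k], 5))
```

In this toy case `city_c` carries all of the explanatory power and also the largest surprise, so it would be the element reported for this dimension.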
Practical Results : A case study on ticket refunds during extreme weather (Typhoon Doksuri) shows the system correctly identifying abnormal spikes and attributing them to specific departure and arrival cities. Manual verification yields 67% precision, 83% recall, and a 74% F1‑score. Thresholds are tuned to favor recall for critical metrics.
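The reported F1‑score is consistent with the reported precision and recall, since F1 is their harmonic mean. A short check, using only the rounded figures quoted above:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded values as reported in the case study.
f1 = f1_score(0.67, 0.83)
print(round(f1, 2))
```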
Summary and Outlook : The system effectively detects anomalies in both stationary and strongly periodic series and locates root causes for low‑dimensional factors. Future work includes improving detection for drifted, mixed, and gradual trends, expanding coverage to non‑core metrics, and enhancing multi‑dimensional root‑cause accuracy.
References : Includes recent papers on anomaly detection benchmarks, Adtributor, HotSpot, and robust root‑cause localization methods.