Intelligent Alert Attribution System for Ctrip Hotel Frontend: Design, Implementation, and Outcomes
This article details the design and deployment of an intelligent alert attribution system for Ctrip Hotel's front‑end, describing the background challenges, the unified data pool, weighted alert rules, three attribution algorithms, achieved improvements in accuracy and troubleshooting speed, and future enhancement plans.
Facing a rapidly growing number of front‑end monitoring alerts, Ctrip Hotel introduced an intelligent alert attribution system that unifies data structures via a data pool, ensures accurate low‑noise alerts through a rule pool, and employs algorithmic models for root‑cause analysis to boost overall troubleshooting efficiency.
Background : Existing monitoring covered over 30 alert types (e.g., slow page load, white screen), each with its own isolated data‑processing and alerting pipeline. This led to fragmented data islands, high maintenance costs, and labor‑intensive root‑cause investigations.
Overall Solution : The system integrates disparate data sources into a unified data pool, extracts a comprehensive alert rule library, and applies weighted rules and machine‑learning models to produce precise attribution reports and actionable alert notifications.
Data Pool : Six dimensions (data type, platform, metric, core monitoring dimension, base fields, business‑specific fields) standardize and centralize data, supporting algorithm inputs, rule tables, and bad‑case analysis for monitoring alerts.
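As a rough illustration of the six dimensions above, a unified record could be sketched as follows. The field names, the `BASE_KEYS` list, and the `normalize` helper are all hypothetical; the article does not publish the actual schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical key names; the real data pool schema is not given in the article.
CORE_KEYS = ("data_type", "platform", "metric", "core_dimension")
BASE_KEYS = ("timestamp", "page_id", "app_version")


@dataclass
class PoolRecord:
    data_type: str          # e.g. "performance", "error"
    platform: str           # e.g. "ios", "android", "web"
    metric: str             # e.g. "page_load_ms"
    core_dimension: str     # the main dimension the alert is monitored on
    base_fields: Dict[str, Any] = field(default_factory=dict)
    business_fields: Dict[str, Any] = field(default_factory=dict)


def normalize(raw: Dict[str, Any]) -> PoolRecord:
    """Map one raw monitoring event onto the unified record shape."""
    missing = [k for k in CORE_KEYS if k not in raw]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    base = {k: raw[k] for k in BASE_KEYS if k in raw}
    # Anything that is neither a core key nor a base key is treated as business-specific.
    business = {k: v for k, v in raw.items()
                if k not in CORE_KEYS and k not in BASE_KEYS}
    return PoolRecord(raw["data_type"], raw["platform"], raw["metric"],
                      raw["core_dimension"], base, business)
```

Keeping business‑specific fields in a separate bag is one way to let each alert type carry its own attributes without changing the shared shape that rules and algorithms read.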
Alert Rules & Weighting : Data is graded by traffic (high, medium, low, watch‑only), and several rule types are applied: single‑day same‑period and period‑over‑period comparisons, multi‑day trend, composite traffic, and quantile. Each rule's weight is computed as weight = traffic‑grade coefficient × change rate, and the weights are aggregated to produce daily alerts with less noise.
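The weighting formula can be sketched in a few lines. The coefficient values, the use of the absolute change rate, the summation used for aggregation, and the alert threshold are all assumptions made for illustration; the article states only that weight = traffic‑grade coefficient × change rate.

```python
from dataclasses import dataclass

# Hypothetical coefficients per traffic grade; the article does not list the real values.
TRAFFIC_COEFFICIENTS = {"high": 1.0, "medium": 0.6, "low": 0.3, "watch_only": 0.0}


@dataclass
class RuleHit:
    rule: str            # e.g. "period_over_period", "multi_day_trend"
    traffic_grade: str   # one of the keys in TRAFFIC_COEFFICIENTS
    change_rate: float   # relative change, e.g. 0.25 for +25%


def rule_weight(hit: RuleHit) -> float:
    """weight = traffic-grade coefficient x change rate (magnitude only, by assumption)."""
    return TRAFFIC_COEFFICIENTS[hit.traffic_grade] * abs(hit.change_rate)


def daily_alert_score(hits) -> float:
    """Aggregate rule weights for one day; a plain sum is assumed here."""
    return sum(rule_weight(h) for h in hits)


def should_alert(hits, threshold: float = 0.2) -> bool:
    """Fire an alert only when the aggregated weight reaches a (hypothetical) threshold."""
    return daily_alert_score(hits) >= threshold
```

With these assumed coefficients, a watch‑only series never fires on its own, while the same change rate on a high‑traffic series does, which is one way the grading can cut noise.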
Model Attribution :
Adtributor algorithm (Microsoft Research, 2014) performs multidimensional time‑series root‑cause analysis using EP (explanatory power) and S (surprise) values.
Pearson Correlation Coefficient measures how closely PV/UV trends track each other, helping to locate deeper causes.
Moving T‑Test (MTT) detects abrupt changes in minute‑level PV/UV and matches them with experiment or business change timestamps.
The results from these models are filtered and combined based on root‑cause categories to generate consolidated attribution outputs.
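The three techniques named above can be illustrated with a simplified, standard‑library sketch. It is not Ctrip's implementation: `adtributor_scores` computes only the per‑element EP and surprise values for a single dimension (the full Adtributor procedure also applies thresholds and ranks dimensions), and `moving_t_test` assumes equal‑sized windows before and after each candidate point. All function names and inputs are hypothetical.

```python
import math
from statistics import mean, variance


def adtributor_scores(forecast, actual):
    """Return {element: (EP, surprise)} for one dimension.

    forecast/actual map element name -> metric value (e.g. PV per platform).
    """
    f_total, a_total = sum(forecast.values()), sum(actual.values())
    if a_total == f_total:
        raise ValueError("no overall change to explain")
    scores = {}
    for key in set(forecast) | set(actual):
        f, a = forecast.get(key, 0.0), actual.get(key, 0.0)
        ep = (a - f) / (a_total - f_total)   # share of the total change explained
        p, q = f / f_total, a / a_total      # expected vs. observed share
        # Jensen-Shannon style surprise; a zero share contributes nothing.
        s = 0.5 * sum(x * math.log(2 * x / (p + q)) for x in (p, q) if x > 0)
        scores[key] = (ep, s)
    return scores


def pearson(xs, ys):
    """Pearson correlation of two equally long series (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / norm if norm else 0.0


def moving_t_test(series, window):
    """Return (index, t) of the strongest mean shift between adjacent windows.

    Requires window >= 2 so that each window has a sample variance.
    """
    best = (None, 0.0)
    for i in range(window, len(series) - window + 1):
        before, after = series[i - window:i], series[i:i + window]
        pooled = math.sqrt((variance(before) + variance(after)) / 2)
        diff = mean(after) - mean(before)
        if pooled == 0:
            t = math.copysign(math.inf, diff) if diff else 0.0
        else:
            t = diff / (pooled * math.sqrt(2 / window))
        if abs(t) > abs(best[1]):
            best = (i, t)
    return best
```

In this sketch, an element with high EP and high surprise is a root‑cause candidate, a correlation near 1 suggests two PV/UV series move together, and the index returned by `moving_t_test` is the minute that could then be compared against experiment or business change timestamps.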
Results :
Alert accuracy improved from ~60% to ~89%.
Average troubleshooting time reduced by ~40%.
Data pool now includes 40+ data types and supports 19 traditional alerts.
Over 70 issues identified, e.g., memory‑related white screens and experiment‑driven image load slowdowns.
Future Plans : Enrich alert rules, integrate more traditional alerts, expand to multi‑metric dashboards, and connect single‑day alerts to bug‑tracking systems.
Ctrip Technology