How Qunar Built a 5-Million-Metric Radar System That Detects Ticket Incidents with 87% Accuracy
This article covers the design, implementation, and results of Qunar's intelligent ticket-monitoring Radar system: the business need, the architecture, the anomaly-detection algorithms, test-set construction, parameter tuning, the 87% detection accuracy it achieved, and plans for large-model integration.
Background and Motivation
In the digital era, Qunar's ticket team faced an explosion of monitoring metrics: hundreds of thousands of business metrics and millions of system metrics, far too many to configure and maintain alerts for by hand. In 2022, existing alarms caught only 38% of incidents, while another 38% went unnoticed, so alarms missed as many incidents as they caught (a 50% alarm-loss rate). An intelligent, data-driven alert system became essential.
Value Analysis
The goal of the Radar system is to cover over 50,000 core ticket metrics and all app‑code error metrics with an accuracy above 75%, while reducing human effort and enabling sustainable operations.
Coverage: Focus on "golden" metrics selected by R&D, product, and testing teams, plus generic metrics such as error logs and middleware health.
Accuracy Quantification: Build a quantitative model to evaluate detection precision, using both fault‑specific and IVR‑alarm test sets.
Cost Reduction: Automate monitoring to lower manual labor and ensure long‑term stability.
Radar System Architecture
The new Radar model consists of five core modules—Data Ingestion, Feature Extraction, Metric Classification, Anomaly Detection, and Alert Trigger—plus an Accuracy Validation module.
Feature Extraction: Compute statistics (max, min, avg, variance, period, etc.) from the past seven days of each metric.
Metric Classification: Use extracted features and business knowledge to categorize metrics into five business types (error, rate_fail, success, rate_success, count) and five waveform types (stable, periodic, low‑volume, discrete, jitter).
Anomaly Detection: Apply multiple algorithms tailored to waveform type:
Continuous waveforms: BoxPlot, KDE, and custom density‑based rules.
Discrete waveforms: Density‑STD/AVG thresholds.
Sharp spikes/drops: Trend‑based steep‑increase/steep‑decrease models.
Each algorithm’s parameters are automatically tuned using the constructed test sets.
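As a rough illustration of the feature-extraction and box-plot steps described above, here is a minimal Python sketch. It is not Qunar's implementation: the function names, the inclusive quartile method, and the conventional 1.5 x IQR fence multiplier are all assumptions made for the example.

```python
from statistics import mean, pvariance, quantiles


def extract_features(history):
    """Summary statistics over a metric's history (e.g. the past seven days).

    Hypothetical helper: the article lists these statistics but not how
    they are computed or stored.
    """
    return {
        "max": max(history),
        "min": min(history),
        "avg": mean(history),
        "variance": pvariance(history),
    }


def boxplot_bounds(history, k=1.5):
    """Tukey fences [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is the textbook default,
    not a value taken from the article."""
    q1, _, q3 = quantiles(history, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_anomalous(value, history, k=1.5):
    """Flag a new data point that falls outside the box-plot fences."""
    low, high = boxplot_bounds(history, k)
    return value < low or value > high


history = [10, 12, 11, 13, 12, 11, 10, 12]
features = extract_features(history)
spike_flagged = is_anomalous(30, history)
normal_flagged = is_anomalous(12, history)
```

A box-plot rule like this suits fairly stable continuous waveforms; periodic or low-volume metrics would likely need a history window aligned to the period, or a different rule entirely.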
Testing and Parameter Tuning
Two test sets were built:
Fault Test Set: Small, clearly abnormal data aiming for 100% alarm accuracy.
IVR Alarm Test Set: Alarms collected from 12 ticket business lines, manually verified, and persisted for long-term evaluation.
Parameters in threshold formulas such as densityStd*5 + densityAvg were adjusted iteratively against these sets. After tuning, accuracy reached up to 90% on the fault set and 100% on the IVR set.
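The tuning loop can be sketched as a small grid search over the multiplier in a threshold of the form densityStd*k + densityAvg. This is a hedged example only: the article does not define densityStd or densityAvg, so the sketch assumes they are the population standard deviation and mean of a metric's history, and the names `threshold`, `accuracy`, and `tune_k`, as well as the labeled cases, are hypothetical.

```python
from statistics import mean, pstdev


def threshold(history, k):
    """Upper alert threshold: std * k + avg.

    Assumption: densityStd / densityAvg are the std and mean of the history.
    """
    return pstdev(history) * k + mean(history)


def accuracy(cases, k):
    """Share of labeled cases classified correctly for a given multiplier k.

    Each case is (history, new_value, is_abnormal).
    """
    correct = sum(
        (value > threshold(history, k)) == is_abnormal
        for history, value, is_abnormal in cases
    )
    return correct / len(cases)


def tune_k(cases, candidates):
    """Pick the candidate multiplier with the highest test-set accuracy."""
    return max(candidates, key=lambda k: accuracy(cases, k))


# Made-up labeled test set for illustration only.
history = [10, 12, 11, 13, 12, 11, 10, 12]
cases = [
    (history, 14, False),  # ordinary jitter
    (history, 15, False),  # ordinary jitter
    (history, 18, True),   # verified anomaly
    (history, 30, True),   # verified anomaly
]
best_k = tune_k(cases, [1, 3, 5, 8])
```

Keeping the verified alarms as a persistent test set, as the article describes, means a search like this can be rerun whenever an algorithm or its parameters change.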
Results and Future Plans
From January to March 2024, the Radar system achieved an average detection accuracy of 87%, discovering about ten online issues per week with no new low-volume failures. About 65% of the newly discovered issues were low-volume problems, and the average detection latency was 15 minutes. Future work includes integrating large language models to improve metric classification, noise reduction, and alarm suppression, which should further raise precision and help handle the growing data volume.
dbaplus Community
