Automated Root Cause Analysis for Flight Ticket Transaction Interception at Qunar: Design, Algorithm, and Performance Optimizations
This article describes how Qunar implemented an automated root‑cause analysis system for flight‑ticket transaction interception, detailing the problem background, system research, a custom algorithm focusing on explanatory power, performance optimizations that reduced analysis time from five minutes to under ten seconds, and the resulting operational improvements.
Author Introduction
Li Lianyong joined Qunar in March 2019. He leads the domestic ticket transaction team and has extensive experience in high-concurrency backend systems.
Background
The ticket transaction interception rate is a direct indicator of user experience. An interception can stem from any of more than 280 error codes combined with dozens of business dimensions, which yields millions of possible error combinations. Manual analysis was slow and error-prone, and it put the team under heavy pressure.
System Research
Automating root-cause analysis is hard. Existing solutions rely on machine-learning dimensionality reduction or on heuristic search. Qunar chose an algorithmic approach built around two key questions: how to embed root-cause information in the analysis data, and how to systematize the experience accumulated from manual analysis.
Data and Experience Modeling
Root-cause data is defined as business attributes plus system error codes. The analysis seeks the dimension with the largest pre-/post-fault variation. Unlike Adtributor, the proposed method uses explanatory power and surprise metrics, ultimately simplifying to a correlation-based approach.
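As a rough illustration of the two metrics named above, the sketch below computes explanatory power and surprise for one dimension from pre-fault and post-fault counts. The formulas are the commonly published definitions (share of the total change, and a Jensen-Shannon-style divergence term per value). The article does not give Qunar's exact formulas, so the function names, the sample error codes, and the formulas themselves are assumptions.

```python
import math


def explanatory_power(pre, post):
    """Share of the total change attributable to each value of one dimension.

    pre / post map a dimension value (e.g. an error code) to its count
    before and after the fault. Hypothetical helper, not Qunar's code.
    """
    values = set(pre) | set(post)
    total_change = sum(post.values()) - sum(pre.values())
    if total_change == 0:
        # No overall change, so nothing to explain.
        return {v: 0.0 for v in values}
    return {v: (post.get(v, 0) - pre.get(v, 0)) / total_change for v in values}


def surprise(pre, post):
    """Per-value divergence between the pre-fault and post-fault distributions."""
    pre_total = sum(pre.values())
    post_total = sum(post.values())
    result = {}
    for v in set(pre) | set(post):
        p = pre.get(v, 0) / pre_total if pre_total else 0.0
        q = post.get(v, 0) / post_total if post_total else 0.0
        s = 0.0
        # Terms with a zero probability contribute nothing (0 * log 0 := 0).
        if p > 0:
            s += 0.5 * p * math.log(2 * p / (p + q))
        if q > 0:
            s += 0.5 * q * math.log(2 * q / (p + q))
        result[v] = s
    return result


# Made-up counts: E202 triples after the fault while E101 stays flat.
pre_counts = {"E101": 50, "E202": 50}
post_counts = {"E101": 50, "E202": 150}
print(explanatory_power(pre_counts, post_counts))
print(surprise(pre_counts, post_counts))
```

In this made-up case the whole increase comes from one value, so that value carries all of the explanatory power. Surprise is non-zero for both values because the shares of the two codes shift even though only one count changed.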
Overall Solution
The architecture passes root-cause information up through each layer, so that interception data always carries the lowest-level cause. A custom error-code design improves interpretability and supports logical dimensionality reduction.
Algorithm Details
The algorithm identifies the fine-grained dimension with the greatest correlated change. It assumes that total volume is stable and that a fault has a single cause. Under these assumptions the problem reduces to finding the dimension with the highest difference ratio, which allows the search space to be pruned quickly.
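One way to picture the search is the sketch below. It is a minimal illustration under stated assumptions, not the article's implementation: the difference ratio is taken to be a value's share of the total increase in failures, the pruning rule is "only descend into the single best slice", and the names `diff_ratios`, `drill_down`, the `fails` field, and the 0.8 threshold are all hypothetical.

```python
from collections import Counter


def diff_ratios(pre_rows, post_rows, dim):
    """Map each value of `dim` to its share of the total increase in failures."""
    pre, post = Counter(), Counter()
    for row in pre_rows:
        pre[row[dim]] += row["fails"]
    for row in post_rows:
        post[row[dim]] += row["fails"]
    total_increase = sum(post.values()) - sum(pre.values())
    if total_increase <= 0:
        # No increase in failures: nothing to attribute.
        return {}
    return {v: (post[v] - pre[v]) / total_increase for v in set(pre) | set(post)}


def drill_down(pre_rows, post_rows, dims, threshold=0.8):
    """Greedily pick the (dimension, value) with the highest difference ratio,
    restrict the data to that slice, and repeat on the remaining dimensions.
    Every other slice is pruned, which relies on the single-cause assumption.
    """
    path = []
    remaining = list(dims)
    while remaining:
        best = None  # (ratio, dimension, value)
        for dim in remaining:
            for value, ratio in diff_ratios(pre_rows, post_rows, dim).items():
                if best is None or ratio > best[0]:
                    best = (ratio, dim, value)
        if best is None or best[0] < threshold:
            break  # no slice explains enough of the change
        _, dim, value = best
        path.append((dim, value))
        pre_rows = [r for r in pre_rows if r[dim] == value]
        post_rows = [r for r in post_rows if r[dim] == value]
        remaining.remove(dim)
    return path
```

Because only one slice survives each round, the number of aggregations grows with the number of dimensions rather than with the number of value combinations.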
Performance Optimization
Alerts must be analyzed in under a minute. To meet that requirement, the team reduced query volume through aggregation, pre-loaded the data, and used Calcite for in-memory SQL processing. These changes cut analysis time from about 5 minutes to under 10 seconds.
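The general pattern (aggregate first, keep the result in memory, then answer analysis queries with SQL) can be sketched as follows. Calcite is a Java library, so this Python sketch uses the standard-library `sqlite3` module with an in-memory database purely as a stand-in. The table layout, column names, and function names are assumptions, not Qunar's schema.

```python
import sqlite3
from collections import Counter


def build_preaggregated_store(raw_events):
    """Collapse raw events into per-minute counts and load them into memory.

    raw_events: iterable of (minute, airline, error_code) tuples.
    Pre-aggregation means later queries scan one row per distinct
    combination instead of one row per event.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE agg (minute TEXT, airline TEXT, error_code TEXT, cnt INTEGER)"
    )
    counts = Counter(raw_events)
    conn.executemany(
        "INSERT INTO agg VALUES (?, ?, ?, ?)",
        [(minute, airline, code, cnt) for (minute, airline, code), cnt in counts.items()],
    )
    return conn


def top_error_codes(conn, start, end, limit=3):
    """Return the most frequent error codes in the half-open window [start, end)."""
    return conn.execute(
        "SELECT error_code, SUM(cnt) AS total FROM agg "
        "WHERE minute >= ? AND minute < ? "
        "GROUP BY error_code ORDER BY total DESC, error_code LIMIT ?",
        (start, end, limit),
    ).fetchall()
```

Loading the aggregated table once and reusing the connection is what stands in for the pre-loading step: each analysis then runs against memory rather than issuing fresh queries to the source system.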
Business Data Abstraction
Business dimensions (e.g., airline, city, product type) and metrics (success count, failure count, pass rate) are abstracted to support both specific and generic analysis scenarios.
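A minimal sketch of such an abstraction, assuming a scenario is nothing more than a named set of dimensions and additive metrics. The `Scenario` class, the `aggregate` helper, and the field names are hypothetical; a ratio such as pass rate is treated here as a derived metric computed after aggregation rather than summed.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Declarative description of one analysis scenario (hypothetical shape)."""
    name: str
    dimensions: tuple  # columns that can be grouped by, e.g. ("airline", "city")
    metrics: tuple     # additive columns, e.g. ("success_count", "failure_count")


def aggregate(scenario, rows, group_by):
    """Sum every metric of the scenario for each combination of `group_by` values."""
    unknown = set(group_by) - set(scenario.dimensions)
    if unknown:
        raise ValueError(f"not a dimension of {scenario.name}: {sorted(unknown)}")
    totals = defaultdict(lambda: dict.fromkeys(scenario.metrics, 0))
    for row in rows:
        key = tuple(row[d] for d in group_by)
        for metric in scenario.metrics:
            totals[key][metric] += row[metric]
    return dict(totals)
```

With this shape, the generic analysis code only ever sees dimension and metric names, so a new business scenario is a new `Scenario` value rather than new code.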
Metric Calculation & Dynamic Display
The Aviator expression engine computes derived metrics, and the AMIS low-code front end renders dynamic visualizations. Together they allow a new scenario to be onboarded in about 2 hours.
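The idea of configuring derived metrics as expressions, rather than coding them, can be sketched as below. Aviator is a JVM expression engine and this Python sketch does not use its API or syntax. It is a small stand-in evaluator, restricted to the four arithmetic operators, and the `evaluate` function and metric names are assumptions.

```python
import ast
import operator

# Only these binary operators are allowed in a metric expression.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression, variables):
    """Evaluate an arithmetic expression over named metric values.

    Anything other than numbers, variable names, and + - * / is rejected,
    so a configured expression cannot call functions or access attributes.
    """
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return variables[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression element: {ast.dump(node)}")

    return walk(ast.parse(expression, mode="eval"))


# A derived metric defined as configuration rather than code.
PASS_RATE = "success_count / (success_count + failure_count)"
print(evaluate(PASS_RATE, {"success_count": 90, "failure_count": 10}))
```

Keeping the formula as a string means a new derived metric can be added by editing configuration, which fits the short onboarding time the text describes.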
Main Features
1. Automatic root-cause analysis triggered by alerts, with results pushed to stakeholders.
2. Result dashboards offering case lookup, manual analysis, and trend views.
Implementation Effects
Root-cause analysis efficiency improved by 80%, and alert-driven analysis reaches more than 95% accuracy within 1 minute. The platform handles over 1,000 personalized analyses per day, saving an estimated 1,500 person-days per year.
Future Directions
The team proposes a three-step data-analysis maturity model to further automate decision-making and enhance analytical intelligence.
References
https://zhuanlan.zhihu.com/p/490229751
https://zhuanlan.zhihu.com/p/345569713
https://zhuanlan.zhihu.com/p/344900818
https://tech.meituan.com/2019/02/28/root-clause-analysis.html
https://blog.csdn.net/weixin_35834894/article/details/95181483
Calcite documentation: https://calcite.apache.org/
Aviator documentation: https://www.yuque.com/boyan-avfmj/aviatorscript
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.