Intelligent Fault Prediction and Application Health Management Practices at Qunar
This article presents the goals, methods, and evolution of Qunar's operations team in reducing application failures, improving reliability through fault definition, rapid repair, automation, and AIOps-driven fault prediction, while sharing lessons from industrial PHM and outlining future challenges.
Author Bio
Miao Hongtao works in the Technical Support Department of Qunar.com’s Website Operations Center.
At GOPS 2019 Shenzhen, Miao presented “Intelligent Fault Prediction and Application Health Management Practice”; this article is an edited transcript of that talk.
OPS Goals
The two main goals of OPS are to reduce the occurrence of application faults and to keep applications in an effective state for as long as possible.
Effectiveness is defined by the business line and often includes degraded service or backup operation; such a state is still considered effective but is dangerously close to a fault.
For example, a car that brakes hard and blows a tire does come to a stop, but the tire failure remains and still has to be dealt with.
OPS must identify, control, and repair such failures, defining fault standards and reliability measurement methods.
Once a fault is defined, rapid repair should restore the application to its standard capability, involving fault location, isolation, and resolution.
The fundamental mission of operations is to improve system and application reliability and availability.
Availability is measured by the well‑known “nines” metric, which we refer to simply as “reliability”.
The first key metric is Mean Time Between Failures (MTBF): the longer a system runs between failures, the more reliable it is.
The second key metric is Mean Time To Repair (MTTR): shorter repair times directly improve availability.
We discuss the various “nines” (e.g., 99.9%, 99.99%) and note that higher levels become increasingly risky.
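The arithmetic behind these metrics is simple enough to sketch. As an illustration (the function names here are ours, not Qunar's), steady-state availability follows directly from MTBF and MTTR:

```python
def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

def max_downtime_per_year(target: float) -> float:
    """Allowed downtime in minutes per year for a given availability target."""
    return (1 - target) * 365 * 24 * 60

# A service failing about once a month (MTBF ~ 720 h) and repaired in
# ~43 minutes (MTTR ~ 0.72 h) lands at roughly "three nines".
a = availability(720, 0.72)
budget = max_downtime_per_year(0.9999)  # the "four nines" downtime budget
```

This also makes the risk of chasing higher nines concrete: a 99.99% target leaves only about 52 minutes of downtime per year, so every extra nine shrinks the repair budget tenfold.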
Faults are divided into occurred faults and potential (un‑occurred) faults. For occurred faults, OPS can only focus on rapid repair.
Fast repair rests on three capabilities: precise fault location, fault isolation, and rapid mobilization of the right people.
Precise location requires mapping complex dependencies across software and hardware, middleware, databases, and networks, while also accounting for human error.
Monitoring and alerting are the foundation; Qunar’s monitoring covers system metrics, business metrics, and log collection/analysis.
Isolation prevents fault spread, using service degradation, rate limiting, circuit breaking, as well as hardware and network isolation.
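Circuit breaking, one of the isolation tools mentioned above, can be sketched in a few lines. This is a generic illustration of the pattern, not Qunar's actual implementation:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after N consecutive failures,
    allow a trial request after a cool-down (half-open), close on success."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: request rejected")
            # cool-down elapsed: half-open, let one trial request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers fail fast instead of piling requests onto a struggling dependency, which is exactly how the fault is kept from spreading.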
Critical business (orders, payments, conversion) is isolated first; Qunar uses an internal communication tool QTalk and a fault‑robot to create groups and coordinate response.
If a fault exceeds the response time limit, escalation to higher management occurs, and a predefined SOP ensures anyone can resolve known issues from documentation.
Automation is essential; fully automated, one‑click tools automatically report status, trigger failover, and perform rapid scaling without human intervention.
For example, when an IDC network fault is detected, traffic is automatically switched to another data center, and after recovery the traffic switches back.
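The switch-and-switch-back behaviour reduces to a small routing decision re-evaluated on every health check. The IDC names and the shape of the health map below are hypothetical, purely for illustration:

```python
def choose_active_idc(health, primary, backup):
    """Route traffic to the primary IDC while it is healthy; fail over
    to the backup, and fail back automatically once the primary recovers."""
    if health.get(primary, False):
        return primary
    if health.get(backup, False):
        return backup
    raise RuntimeError("no healthy IDC available")

# While the primary is down, traffic goes to the backup; the next
# evaluation after the primary's health check recovers routes it back.
active = choose_active_idc({"idc-a": False, "idc-b": True}, "idc-a", "idc-b")
```

Because the decision is stateless and derived from current health, "switching back after recovery" needs no extra logic: it falls out of re-running the same function.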
These steps form the core of handling occurred faults.
Predictive fault handling aims to eliminate faults before they surface, using forecasts of capacity: disk space, SSD lifespan, bandwidth usage, and the like.
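Capacity prediction of this kind often reduces to trend extrapolation. As a minimal sketch (assuming one usage sample per day; the function is illustrative, not Qunar's code), a least-squares line over recent disk-usage percentages estimates the days remaining until the disk is full:

```python
def days_until_full(usage_pct):
    """Fit a least-squares line to daily disk-usage samples (percent)
    and estimate days remaining, from the last sample, until 100%."""
    n = len(usage_pct)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(usage_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, usage_pct))
    slope = sxy / sxx
    if slope <= 0:
        return None  # usage flat or shrinking: no fill-up predicted
    intercept = mean_y - slope * mean_x
    return (100.0 - intercept) / slope - (n - 1)

# Disk growing ~2% per day from 50%: about three weeks of headroom left.
remaining = days_until_full([50, 52, 54, 56, 58])
```

An alert fired when the estimate drops below, say, seven days gives owners time to expand or clean up before the fault ever occurs.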
Although AI/ML (AIOps) is widely discussed, practical, step‑by‑step work remains the reliable path for operations.
Qunar Operations Evolution
Qunar’s ops journey consists of three stages:
1. Manual‑Semi‑Automatic Stage – business submits tickets via email, OPS handles them with scripts and notifies business; high communication cost and low knowledge retention.
2. Operations Automation Stage – unified resources, CMDB (OPSDB), own monitoring platform, automated tools, workflow approvals, and internal QTalk communication.
3. Qunar Portal Platform – centralized management of resources, CI/CD, monitoring, logs, unified authentication, and an appcode that uniquely identifies each application throughout its lifecycle.
After each fault, a review identifies root causes and corrective actions to prevent recurrence, with supervision to ensure implementation within a deadline.
All faults are catalogued for statistical analysis, helping identify high‑risk services, frequent failure points, and personnel performance.
PHM (Prognostics and Health Management) from industry has been applied for decades; its theory—mass sensor data collection, preprocessing, feature extraction, state monitoring, and fault decision—matches the needs of internet operations.
Qunar leverages mature big‑data stream processing and machine‑learning techniques to build its own predictive practice.
Typical fault‑prediction models include state‑based, anomaly‑based, environment‑based, and damage‑index models; combining them can anticipate failures and trigger preventive actions.
Evaluation criteria for prediction are timeliness (detect before fault or within one minute), economic cost (prediction cost < fault loss), and verifiability (ability to validate that a predicted fault was avoided).
Qunar’s Fault‑Prediction Process
Metric Collection: gather machine metrics, alarms, business alarms, and logs.
Data Pre-processing: smooth the series, remove spikes, filter transient warnings, and retain the important features.
Fault Diagnosis: identify indicators that directly signal an imminent fault.
Fault Prediction: analyze the root causes behind diagnosed indicators to predict upcoming failures.
Notification: inform the responsible owners via QTalk or SMS.
Feedback: after resolution, collect user feedback to evaluate prediction accuracy and timeliness.
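The pre-processing step above can be illustrated with a sliding median filter, a common way to drop one-sample spikes while keeping genuine level shifts (a generic sketch, not Qunar's pipeline):

```python
from statistics import median

def despike(series, window=3):
    """Replace each point with the median of its neighbourhood, which
    suppresses isolated spikes but preserves sustained level changes."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(median(series[lo:hi]))
    return out

# A lone spike is removed, but a real step up in the metric survives:
despike([1, 1, 9, 1, 1])  # spike at index 2 is suppressed
despike([1, 1, 5, 5, 5])  # the sustained jump is kept
```

This distinction is exactly what "filter transient warnings, retain important features" asks for: one bad scrape should not page anyone, but a real regime change must reach the diagnosis stage intact.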
Prediction indicators must satisfy completeness, objectivity, authenticity, and effectiveness.
Prediction methods include static thresholds, dynamic thresholds adjusted periodically, composite strategies, and machine‑learning models (trend forecasting and anomaly detection).
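Of these, a periodically recomputed dynamic threshold is the easiest to sketch. Here is the classic mean-plus-k-sigma rule, with the window contents and k = 3 as illustrative assumptions rather than Qunar's actual settings:

```python
from statistics import mean, stdev

def dynamic_threshold(history, k=3.0):
    """Recompute the alerting threshold from a sliding window of recent
    samples: mean plus k standard deviations (the 3-sigma rule)."""
    return mean(history) + k * stdev(history)

def is_anomalous(value, history, k=3.0):
    """Flag a new sample that exceeds the current dynamic threshold."""
    return value > dynamic_threshold(history, k)

# With stable recent history around 10, a reading of 15 is anomalous
# while 11 is within normal variation.
history = [10, 11, 9, 10, 10, 11, 9, 10]
```

Re-deriving the threshold on a schedule is what separates this from a static threshold: the alert line tracks seasonal and growth trends instead of going stale.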
All predictions are stored in a knowledge base; after a fault is resolved, the system matches it against predictions to close the loop.
Feedback mechanisms include top‑down standards, open channels, rapid response, and continuous improvement.
The health dashboard shows predicted problematic applications together with related alarms and operational practices.
A topology graph built from Nginx (NG) logs visualizes service dependencies.
Appcode provides a global, hierarchy‑free identifier for each application, linking monitoring, alarms, and events.
Applications and events are classified into four levels based on business impact and health influence.
Alarm quality suffers from over‑setting, invalid alerts, and outdated rules; linking alarms to appcode clarifies source, owner, and manager, and regular training improves handling.
Standardized fault reporting forms, automated fault‑robot, and defined review principles ensure each incident is archived for later analysis and training.
Future Outlook and Challenges
Four main challenges for applying PHM in the internet industry are rapid business change, lack of theoretical support, insufficient communication, and technology governance issues.
Our thinking is to combine industrial theory with internet‑specific methodology, continuously experiment, refine, and eventually feed back improvements to the industrial sector.
Note: This article is a translation of Miao Hongtao’s talk at GOPS 2019 Shenzhen.