
How Qunar Leverages AI‑Driven Fault Prediction and Health Management to Boost System Reliability

This article summarizes Zhang Yan's presentation at the 2019 Gdevops Global Agile Operations Summit, detailing Qunar's OPS goals, evolution of its automation platform, the adoption of PHM concepts from aerospace to internet services, and practical fault‑prediction workflows, metrics, and challenges for achieving higher availability.

dbaplus Community

Sep 2, 2019


1. OPS Goals & Work

At Qunar, the OPS team is responsible for reducing fault occurrence and enabling rapid fault recovery. The team distinguishes between "application effectiveness" (degrading or backing up a component to keep service alive) and true reliability, emphasizing the need to minimize failures rather than merely tolerate them.

Key objectives include:

Reducing how often failures occur.

Quickly locating, isolating, and resolving failures.

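The classic availability formula, Availability = MTBF / (MTBF + MTTR), quantifies these goals. Fewer failures raise MTBF (mean time between failures), faster recovery lowers MTTR (mean time to repair), and both push uptime toward more "9s".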
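As a worked example of the formula, the sketch below computes availability and the yearly downtime it implies. The MTBF and MTTR values are illustrative assumptions, not figures from the talk, and the function names are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def yearly_downtime_minutes(avail: float) -> float:
    """Downtime implied by an availability level over a 365-day year."""
    return (1.0 - avail) * 365 * 24 * 60


# Illustrative values only: one failure per 999 hours, one hour to recover.
baseline = availability(mtbf_hours=999.0, mttr_hours=1.0)

# Either lever improves the result: recover faster, or fail less often.
faster_recovery = availability(mtbf_hours=999.0, mttr_hours=0.5)
fewer_failures = availability(mtbf_hours=1998.0, mttr_hours=1.0)

print(baseline, yearly_downtime_minutes(baseline))
print(faster_recovery, fewer_failures)
```

With these inputs, halving MTTR and doubling MTBF give almost the same gain, which is why the two objectives above are treated as complementary.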

2. Qunar Ops Evolution

Qunar's ops team grew to about ten engineers managing over 50,000 VMs and 70,000 physical servers. Early processes relied on manual ticketing and email, which proved inefficient as traffic grew.

Key developments:

Built an internal CMDB (OPSDB) and a unified monitoring platform (Watcher).

Created a portal that consolidates CI/CD, monitoring, logging, and foundational services under a single authentication and authorization layer.

Introduced a global application identifier (AppCode) to link all resources and monitoring data to the owning application.

Implemented full‑life‑cycle management to retire unused applications and reclaim resources.

3. PHM (Prognostics & Health Management) Overview & Methodology

PHM originated in NASA’s flight health monitoring and matured through aerospace programs such as the JSF (Joint Strike Fighter). It is now applied to high‑speed rail, bridges, and increasingly to internet services.

Applying PHM to internet services requires:

Alignment of goals (prevent failures, improve reliability).

Robust theoretical foundations (statistics, machine learning, AI).

Adequate big‑data processing and streaming capabilities.

The PHM workflow consists of four stages:

Data acquisition and feature extraction.

Status monitoring.

Fault diagnosis (detecting abnormal vs. normal states).

Predictive analysis using historical data and product models.

Four information sources feed the model:

Fault‑state information (monitoring alerts).

Abnormal‑phenomenon information (sudden metric spikes).

Operating‑environment information (network switches, deployment events).

Damage‑scale information (hardware wear‑out, warranty periods).

4. Qunar’s Practical Implementation

4.1 Fault‑Prediction Process

Data pipeline: collect business metrics, machine metrics, logs, and alerts → clean & filter → identify abnormal indicators → assess whether they may cause a fault → remove external noise → run predictive models → generate health status → notify responsible owners → close the feedback loop.
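The first stages of this pipeline can be sketched as a chain of small filter functions. This is a minimal illustration, not Qunar's implementation: the `Sample` type, the `planned_event` flag, the metric names, and the fixed limit of 90 are all assumptions, and the prediction and notification stages are left out.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sample:
    app_code: str                # owning application (the article's AppCode)
    metric: str                  # hypothetical metric name, e.g. "cpu_pct"
    value: float
    planned_event: bool = False  # hypothetical flag: explained by a known external event


Stage = Callable[[List[Sample]], List[Sample]]


def clean(samples: List[Sample]) -> List[Sample]:
    """Clean and filter: drop NaN or negative readings."""
    return [s for s in samples if not math.isnan(s.value) and s.value >= 0]


def find_abnormal(samples: List[Sample], limit: float = 90.0) -> List[Sample]:
    """Identify abnormal indicators using a placeholder fixed limit."""
    return [s for s in samples if s.value > limit]


def drop_external_noise(samples: List[Sample]) -> List[Sample]:
    """Remove abnormal readings that a known external event explains."""
    return [s for s in samples if not s.planned_event]


def run_pipeline(samples: List[Sample], stages: List[Stage]) -> List[Sample]:
    """Apply each stage in order, feeding its output to the next."""
    for stage in stages:
        samples = stage(samples)
    return samples


def health_status(all_apps: List[str], abnormal: List[Sample]) -> Dict[str, str]:
    """Generate a per-application health status from the surviving samples."""
    at_risk = {s.app_code for s in abnormal}
    return {app: ("at_risk" if app in at_risk else "healthy") for app in all_apps}


samples = [
    Sample("app_a", "cpu_pct", 97.0),
    Sample("app_b", "cpu_pct", 95.0, planned_event=True),
    Sample("app_c", "cpu_pct", 40.0),
    Sample("app_c", "cpu_pct", float("nan")),
]
abnormal = run_pipeline(samples, [clean, find_abnormal, drop_external_noise])
status = health_status(["app_a", "app_b", "app_c"], abnormal)
print(status)
```

Keeping each stage as an independent function makes it easy to insert a predictive model between noise removal and status generation without touching the other stages.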

4.2 Indicator Selection

Indicators must be complete, objective, and effective. Qunar uses basic resource metrics (CPU, memory, disk usage) plus business KPIs (order volume, conversion rate). Logs from systems, middleware, and applications, as well as alert messages, are also ingested. Correlation analysis covers two kinds of relationships: static ones, which the business declares, and dynamic ones, which are derived from real‑time data.
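One common way to derive a dynamic relationship between two indicators is the Pearson correlation coefficient over a recent window. The talk summary does not say which measure Qunar uses, so the sketch below is an assumption, and the sample series are invented for illustration.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equally long metric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


# Invented example: order volume and connection count over four intervals.
order_volume = [1.0, 2.0, 3.0, 4.0]
connections = [2.0, 4.0, 6.0, 8.0]
print(pearson(order_volume, connections))
```

A coefficient near 1 or -1 suggests the two indicators move together and can be grouped when assessing a fault. A value near 0 suggests they should be evaluated separately.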

4.3 Prediction Strategies

Static thresholds: traditional fixed limits based on expert experience.

Dynamic thresholds: adapt to periodic fluctuations (e.g., diurnal traffic patterns).

Historical comparison: detect deviations from seasonal trends (e.g., holiday order spikes).

Predictive models: combine trend forecasting, short‑term anomaly detection, and event‑correlation analysis.
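The first three strategies can be contrasted in a few lines. The sketch assumes a dynamic threshold built from the mean and standard deviation of values seen at the same time slot on earlier days, with a band width of `k` standard deviations. Both that construction and the sample numbers are assumptions, not details from the talk.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def static_alert(value: float, limit: float) -> bool:
    """Static threshold: alert when a fixed, expert-chosen limit is exceeded."""
    return value > limit


def dynamic_bounds(history: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Band of mean +/- k standard deviations over same-slot history."""
    center = mean(history)
    spread = pstdev(history)
    return center - k * spread, center + k * spread


def dynamic_alert(value: float, history: Sequence[float], k: float = 3.0) -> bool:
    """Dynamic threshold: alert when the value leaves the historical band."""
    low, high = dynamic_bounds(history, k)
    return value < low or value > high


def historical_deviation(current: float, reference: float) -> float:
    """Relative change against the same period in an earlier cycle."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return (current - reference) / reference


# Invented readings for one time slot on five previous days.
history = [100.0, 102.0, 98.0, 101.0, 99.0]
print(static_alert(120.0, limit=150.0))    # fixed limit sized for peak load
print(dynamic_alert(120.0, history))       # band sized for this time slot
print(historical_deviation(150.0, 100.0))  # change versus the reference period
```

Here the static limit stays silent because 120 is below a ceiling sized for peak traffic, while the dynamic band flags the same value as unusual for that time slot.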

4.4 Model Types

The models are relatively simple, relying on:

Fault‑state signals from the alert system.

Abnormal‑phenomenon detection (e.g., sudden increase in connection count).

Environment changes (network re‑routing, host failures).

Damage‑scale data (hardware warranty, expected lifespan).
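To show how four such inputs might feed one simple model, here is a rule-based sketch that folds them into a risk level. The field names, weights, and cut-offs are hypothetical and chosen only for illustration. The summary does not describe Qunar's actual scoring.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    active_alerts: int       # fault-state signals from the alert system
    anomalies: int           # abnormal phenomena, e.g. connection-count jumps
    recent_env_changes: int  # environment changes, e.g. re-routing, host failures
    lifespan_used: float     # damage scale: fraction of expected hardware life consumed


def risk_level(s: Signals) -> str:
    """Map the four signal types to low / medium / high (hypothetical weights)."""
    score = 0
    if s.active_alerts > 0:
        score += 3
    if s.anomalies > 0:
        score += 2
    if s.recent_env_changes > 0:
        score += 1
    if s.lifespan_used >= 0.9:
        score += 1
    if score >= 4:
        return "high"
    if score >= 2:
        return "medium"
    return "low"


print(risk_level(Signals(active_alerts=0, anomalies=1, recent_env_changes=0, lifespan_used=0.2)))
```

A transparent rule set like this is easy to audit after an incident, which fits the review and knowledge-base mechanisms described in the next section.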

4.5 Supporting Mechanisms

Post‑incident review and remediation tracking.

Real‑time fault discovery with event correlation.

Knowledge‑base entry for each incident, enabling learning.

Health dashboards (43 panels) showing risk levels.

Health archives tracking application health over time.

Event timelines and topology maps for visualizing dependencies.

5. Outlook & Challenges

Applying PHM to internet services faces several obstacles: business requirements change rapidly, the field lacks established theory, communication is poor, and technical governance is insufficient. Qunar aims to bridge industry practice with academic research, iteratively refine its methods, and eventually feed insights back to the aerospace domain.

6. Q&A Highlights

Key questions addressed include validation of prediction results (matching fault reports to predictions), handling of business‑driven metric spikes, data lifecycle management (compression of older metrics), and the balance between automated analysis and manual review.
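Matching fault reports to predictions can be expressed as precision and recall over two sets. This is a generic evaluation sketch, not the procedure described in the talk, and the application identifiers are invented.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], actual: Set[str]) -> Tuple[float, float]:
    """Precision: share of predictions confirmed by a fault report.
    Recall: share of reported faults that were predicted."""
    hits = predicted & actual
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(actual) if actual else 0.0
    return precision, recall


# Invented identifiers: applications flagged in a period vs. those with fault reports.
predicted = {"app_a", "app_b", "app_c", "app_d"}
actual = {"app_b", "app_c", "app_e"}
print(precision_recall(predicted, actual))
```

Low precision points to noisy predictions that burden owners with false alarms, while low recall points to faults the models never saw coming.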

