
How Qunar Uses AI-Driven Fault Prediction to Boost System Reliability

This article outlines how Qunar's operations (OPS) team reduces failures and extends uptime through precise fault detection, rapid recovery, and AI-powered predictive health management (PHM). It covers the evolution of Qunar's OPS processes, practical implementations, and the open challenges in applying PHM to internet services.

Efficient Ops

1. OPS Goals

Qunar's operations aim to minimize application failures and keep services in an effective state. The business lines define what counts as effective: they still treat degraded or backup modes as effective, although OPS regards those modes as risky.

OPS must identify, control, and repair failures, establishing clear fault definitions and rapid recovery steps: locate, isolate, and resolve.

The core mission is to improve system reliability and availability by extending Mean Time Between Failures (MTBF) and shortening Mean Time To Repair (MTTR).
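The relationship between these two metrics and availability can be sketched with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The function name and the hour figures below are hypothetical and chosen only for illustration; they are not Qunar's numbers.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: one failure every 500 hours, 2 hours to repair.
baseline = availability(mtbf_hours=500, mttr_hours=2)

# Two ways to improve: fail half as often, or repair twice as fast.
longer_mtbf = availability(mtbf_hours=1000, mttr_hours=2)
shorter_mttr = availability(mtbf_hours=500, mttr_hours=1)

print(f"baseline:     {baseline:.4%}")
print(f"longer MTBF:  {longer_mtbf:.4%}")
print(f"shorter MTTR: {shorter_mttr:.4%}")
```

In this example, doubling MTBF and halving MTTR yield the same availability, which is why both levers appear in the mission statement.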

2. Qunar Ops Evolution

Qunar's ops progressed through three stages: first, manual and semi‑automated work driven by ticket and email reviews; second, automation built on a unified CMDB (OPSDB) and a monitoring platform; and third, a portal that integrates resources, CI/CD, monitoring, logs, and appcode for full‑lifecycle management.

Post‑incident reviews identify root causes, enforce corrective actions, and store fault records for statistical analysis and knowledge learning.
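As a rough illustration of the statistical analysis that stored fault records make possible, the sketch below derives MTTR, MTBF, and a root-cause tally from a list of records. The `FaultRecord` fields, the `summarize` function, and the sample incidents are all assumptions made for this example; the article does not describe Qunar's actual record schema.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FaultRecord:
    """Hypothetical shape of one stored fault record."""
    started: datetime
    recovered: datetime
    root_cause: str


def summarize(records, window_start, window_end):
    """Compute MTTR, MTBF, and root-cause counts over an observation window."""
    if not records:
        raise ValueError("need at least one fault record")
    downtime = sum((r.recovered - r.started for r in records), timedelta())
    uptime = (window_end - window_start) - downtime
    count = len(records)
    return {
        "mttr_minutes": downtime.total_seconds() / 60 / count,
        "mtbf_hours": uptime.total_seconds() / 3600 / count,
        "by_root_cause": Counter(r.root_cause for r in records),
    }


# Made-up incidents inside a 30-day window.
start = datetime(2024, 1, 1)
records = [
    FaultRecord(start + timedelta(days=3), start + timedelta(days=3, minutes=30), "config change"),
    FaultRecord(start + timedelta(days=12), start + timedelta(days=12, minutes=60), "disk full"),
    FaultRecord(start + timedelta(days=25), start + timedelta(days=25, minutes=30), "config change"),
]
summary = summarize(records, start, start + timedelta(days=30))
print(summary)
```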

3. Fault Prediction

Fault prediction is a key focus. It draws on PHM concepts from the aerospace and industrial sectors, where extensive sensor data is preprocessed, reduced to features, and fed to diagnostic models that anticipate failures.

Typical models include state‑based, anomaly‑based, environment‑based, and damage‑index approaches, often implemented with machine‑learning algorithms for trend prediction and anomaly detection.
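To make the anomaly-based approach concrete, here is a minimal sketch of one common technique, a rolling z-score detector. It is a simple statistical stand-in, not the model Qunar uses; the function name, window size, threshold, and latency samples are assumptions for illustration.

```python
from statistics import mean, stdev


def zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices whose value deviates from the mean of the preceding
    `window` points by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        # Floor sigma so a perfectly flat history does not divide by zero.
        sigma = max(stdev(history), 1e-9)
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Made-up latency samples in milliseconds with one obvious spike.
latency_ms = [100, 102, 99, 101, 100, 103, 98, 250, 101, 100]
flagged = zscore_anomalies(latency_ms)
print(flagged)
```

Note that once the spike enters the rolling window it inflates the standard deviation, so the points immediately after it are harder to flag. Production detectors usually use robust statistics or exclude flagged points from the history.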

Effective prediction requires three things: timeliness (a warning ideally before the failure, or within one minute of it), economic viability (the cost of prediction must be lower than the loss from the failure), and verifiable evaluation.
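The first two criteria can be expressed as simple checks. The sketch below is one possible reading: it treats "within one minute" as at most one minute after the failure starts, and it models prediction cost as a running cost plus the cost of handling false alarms. Both interpretations, and all names and figures, are assumptions for illustration.

```python
from datetime import datetime, timedelta


def is_timely(alert_at, failure_at, max_delay=timedelta(minutes=1)):
    """True if the alert fired before the failure or within max_delay after it."""
    return alert_at - failure_at <= max_delay


def is_economical(running_cost, false_alarms, cost_per_false_alarm,
                  failures_avoided, loss_per_failure):
    """True if the total cost of predicting is below the loss it avoided."""
    total_cost = running_cost + false_alarms * cost_per_false_alarm
    return total_cost < failures_avoided * loss_per_failure


failure_at = datetime(2024, 1, 1, 12, 0, 0)
early = is_timely(failure_at - timedelta(minutes=5), failure_at)
slightly_late = is_timely(failure_at + timedelta(seconds=30), failure_at)
too_late = is_timely(failure_at + timedelta(seconds=90), failure_at)

# Hypothetical monthly figures in an arbitrary currency unit.
worth_it = is_economical(running_cost=1000.0, false_alarms=20,
                         cost_per_false_alarm=50.0,
                         failures_avoided=2, loss_per_failure=5000.0)
print(early, slightly_late, too_late, worth_it)
```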

4. Qunar Practice

The fault‑prediction workflow has six steps: metric collection (machine metrics, business alerts, logs), preprocessing (smoothing, noise removal), fault diagnosis (identifying leading indicators), prediction, notification via QTalk or SMS, and post‑resolution feedback.
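The middle of this workflow (preprocess, predict, notify) can be sketched as a small pipeline. Exponentially weighted moving average (EWMA) smoothing is used here as one example of preprocessing; the article does not say which smoothing method Qunar applies. The `predict` and `notify` callables are hypothetical placeholders, and `notify` stands in for whatever QTalk or SMS integration is actually in place.

```python
def ewma(values, alpha=0.3):
    """Exponentially weighted moving average: damps noise, keeps the trend."""
    smoothed = []
    for v in values:
        prev = smoothed[-1] if smoothed else v
        smoothed.append(alpha * v + (1 - alpha) * prev)
    return smoothed


def run_pipeline(raw_metrics, predict, notify, alpha=0.3):
    """Smooth the raw series, ask the predictor for a warning, and send it."""
    smoothed = ewma(raw_metrics, alpha)
    warning = predict(smoothed)
    if warning is not None:
        notify(warning)
    return warning


# Hypothetical predictor: warn when the smoothed CPU usage exceeds 70%.
def cpu_predictor(series):
    return "CPU usage trending high" if series[-1] > 70 else None


sent = []  # stands in for a real notification channel
cpu_percent = [40, 42, 41, 80, 85, 90, 95]
result = run_pipeline(cpu_percent, cpu_predictor, sent.append, alpha=0.5)
print(result, sent)
```

Passing the predictor and the notifier in as functions keeps each step replaceable, which matters because the feedback step is expected to change the prediction logic over time.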

Prediction indicators must be complete, objective, truthful, and effective; business‑critical metrics like order volume and payment rates are especially valuable.

Prediction methods combine static thresholds, dynamic adjustments, and trend analysis, with models focusing on trend forecasting and anomaly detection.
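The three methods can be illustrated side by side. In this sketch the static threshold is a fixed limit, the dynamic threshold is the recent mean plus k standard deviations, and trend analysis fits a least-squares line to estimate how many samples remain before the static limit is crossed. These are generic textbook formulations chosen for illustration, not Qunar's implementation, and the disk-usage samples are made up.

```python
from statistics import mean, stdev


def static_breach(value, limit):
    """Static threshold: a fixed limit that never changes."""
    return value > limit


def dynamic_limit(history, k=3.0):
    """Dynamic threshold: adapts to the recent level and spread of the metric."""
    return mean(history) + k * stdev(history)


def steps_until_limit(history, limit):
    """Trend analysis: fit a least-squares line and estimate how many more
    samples pass before it reaches `limit`. Returns None if not rising."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(history)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, history))
             / sum((x - x_bar) ** 2 for x in xs))
    if slope <= 0:
        return None
    intercept = y_bar - slope * x_bar
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - (n - 1))


# Made-up disk usage percentages sampled at a fixed interval.
disk_percent = [50, 52, 54, 56, 58]
remaining = steps_until_limit(disk_percent, limit=90)
adaptive = dynamic_limit(disk_percent)
print(static_breach(disk_percent[-1], 90), adaptive, remaining)
```

Here the latest value is well below the static limit, so a static rule stays silent, while the trend estimate already indicates how much lead time is left.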

Feedback loops close the cycle by evaluating prediction accuracy, improving mechanisms, and updating knowledge bases.
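One simple way to evaluate prediction accuracy in such a feedback loop is to compare the time windows in which a warning fired with the windows in which a fault actually occurred, and compute precision and recall. The window identifiers and the `evaluate` function below are hypothetical; the article does not specify which metrics Qunar tracks.

```python
def evaluate(predicted: set, actual: set) -> dict:
    """Compare predicted fault windows against confirmed fault windows."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return {
        "precision": precision,               # share of warnings that were real
        "recall": recall,                     # share of faults that were warned about
        "false_alarms": sorted(predicted - actual),
        "missed_faults": sorted(actual - predicted),
    }


# Made-up window IDs: warnings fired in w1..w4, faults confirmed in w2, w3, w5.
report = evaluate(predicted={"w1", "w2", "w3", "w4"}, actual={"w2", "w3", "w5"})
print(report)
```

The false alarms and missed faults are the useful inputs for the review: the former suggest thresholds that are too sensitive, the latter point to indicators that are not yet being collected.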

5. Outlook and Challenges

Applying PHM to internet services faces several obstacles: fast‑changing business requirements, a lack of theoretical support, limited knowledge sharing, and governance issues. Qunar aims to blend industrial theory with practical internet ops to create a sustainable methodology that can also benefit the broader industry.

Tags: monitoring, operations, system reliability, AIOps, Fault Prediction, PHM
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany readers throughout their operations careers.
