Optimizing Qunar's Monitoring System for Faster Fault Detection and Root‑Cause Analysis
This article details Qunar's comprehensive overhaul of its monitoring platform—introducing second‑level metrics, redesigning storage with VictoriaMetrics, optimizing client and server data collection, and building a root‑cause analysis tool—to dramatically reduce order‑related fault discovery time from minutes to under one minute.
Qunar's original monitoring system handled billions of metrics and millions of alerts but struggled with timely fault detection, averaging four minutes to detect order-related incidents, with only 20% of faults detected within one minute.
The company embarked on a full‑scale optimization, addressing fault discovery, root‑cause localization, and repair across the monitoring pipeline.
Background: Existing metrics were abundant yet insufficient for timely fault detection, prompting a shift toward MTTR‑driven improvements.
1. Building Second‑Level Monitoring
1.1 Challenges
Challenges included the high storage I/O of Graphite's Whisper backend, the need to redesign the entire collection‑to‑alert chain for second‑level granularity, and maintaining Graphite protocol compatibility.
1.2 Storage Selection and Architecture
After evaluating M3DB and VictoriaMetrics (VM), VM was selected for its compression, performance, and scalability, despite occasional degradation on complex aggregation queries.
To mitigate aggregation bottlenecks, a storage‑compute separation was implemented: VM handles simple metric queries, while CarbonAPI processes complex aggregations, enhanced with a metadata DB for efficient lookups.
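The storage‑compute split above implies a routing layer in front of the two backends. A minimal sketch of that idea in Go is below; the function list and backend names are illustrative assumptions, not Qunar's actual routing rules:

```go
package main

import (
	"fmt"
	"strings"
)

// aggregationFuncs lists Graphite functions that are expensive to evaluate
// inside the TSDB; queries using them are handed to CarbonAPI instead.
// (This set is illustrative, not Qunar's actual list.)
var aggregationFuncs = []string{"sumSeries(", "averageSeries(", "groupByNode(", "highestMax("}

// RouteQuery decides which backend serves a Graphite-style query:
// plain metric fetches go straight to VictoriaMetrics, while complex
// aggregations are routed to CarbonAPI for compute-side processing.
func RouteQuery(query string) string {
	for _, fn := range aggregationFuncs {
		if strings.Contains(query, fn) {
			return "carbonapi"
		}
	}
	return "victoriametrics"
}

func main() {
	fmt.Println(RouteQuery("app.order.count"))
	fmt.Println(RouteQuery("sumSeries(app.*.order.count)"))
}
```

The metadata DB mentioned above would sit beside this router, resolving wildcard metric names before the query is dispatched.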
1.3 Client‑Side Metric Collection Optimization
The original minute‑level SDK was refactored to support second‑level snapshots, employing data sampling (t‑digest for timer metrics) and multi‑snapshot generation to balance memory usage against precision.
Scheduler enhancements introduced a snapshot manager and real‑time configuration push via a config service.
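The snapshot mechanism can be sketched as a counter whose live value is atomically drained into an immutable per‑second snapshot by the scheduler tick, with a bounded ring of snapshots capping memory. This is a simplified assumption of the design (counters only; the real SDK additionally samples timer values with t‑digest):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Snapshot is an immutable view of a counter taken at a one-second boundary.
type Snapshot struct {
	Second int64
	Count  int64
}

// SecondCounter accumulates increments; once per second the snapshot manager
// swaps the live value out into a bounded ring of snapshots, which is what
// keeps memory flat at second-level granularity.
type SecondCounter struct {
	live      int64
	snapshots []Snapshot // ring of recent snapshots
	maxKeep   int
}

func NewSecondCounter(maxKeep int) *SecondCounter {
	return &SecondCounter{maxKeep: maxKeep}
}

// Inc is called on the application's hot path.
func (c *SecondCounter) Inc() { atomic.AddInt64(&c.live, 1) }

// Roll is called by the snapshot manager at each second tick: it atomically
// resets the live counter and records the drained value.
func (c *SecondCounter) Roll(second int64) {
	v := atomic.SwapInt64(&c.live, 0)
	c.snapshots = append(c.snapshots, Snapshot{Second: second, Count: v})
	if len(c.snapshots) > c.maxKeep {
		c.snapshots = c.snapshots[1:]
	}
}

func main() {
	c := NewSecondCounter(3)
	c.Inc()
	c.Inc()
	c.Roll(1)
	fmt.Println(c.snapshots[0].Count)
}
```

The config‑service push mentioned above would adjust parameters such as `maxKeep` and the roll interval at runtime without redeploying the application.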
1.4 Server‑Side Metric Collection Optimization
The master‑worker architecture was refined by removing the message queue, partitioning tasks via Etcd, and moving scheduling logic to workers, enabling high‑concurrency Go‑based processing.
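With the message queue removed, each worker must be able to decide task ownership locally. A minimal sketch of that partitioning idea is below, using a stable hash over a worker list; in the real system membership would come from Etcd watch events rather than a static slice, and the hash scheme here is an assumption:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// AssignTask picks which worker owns a collection task. Because the hash is
// stable, every worker computes the same answer independently, so no central
// scheduler or queue is needed; rebalancing happens when the worker list
// (maintained via Etcd in the real system) changes.
func AssignTask(taskID string, workers []string) string {
	h := fnv.New32a()
	h.Write([]byte(taskID))
	return workers[int(h.Sum32())%len(workers)]
}

func main() {
	workers := []string{"worker-0", "worker-1", "worker-2"}
	for _, task := range []string{"app-a", "app-b", "app-c"} {
		fmt.Printf("%s -> %s\n", task, AssignTask(task, workers))
	}
}
```

Go's lightweight goroutines then let each worker scrape its assigned targets concurrently, which is the high‑concurrency processing the article refers to.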
2. Fault‑Root‑Cause Analysis Platform
The platform constructs a knowledge graph from events, logs, traces, alerts, and application profiles, establishing service call chains, resource dependencies, and anomaly correlations.
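The knowledge graph described above can be pictured as call‑chain edges plus anomaly flags attached to nodes; root‑cause candidates are anomalous nodes reachable downstream of the alerting service. The sketch below is an illustrative simplification (service names and the single edge type are assumptions):

```go
package main

import "fmt"

// Graph is a minimal fault knowledge graph: service call edges plus a set of
// nodes currently flagged as anomalous by alerts, logs, or events.
type Graph struct {
	calls     map[string][]string // caller -> callees (incl. resources)
	anomalous map[string]bool
}

func NewGraph() *Graph {
	return &Graph{calls: map[string][]string{}, anomalous: map[string]bool{}}
}

func (g *Graph) AddCall(from, to string) { g.calls[from] = append(g.calls[from], to) }
func (g *Graph) Flag(node string)        { g.anomalous[node] = true }

// SuspectDownstream walks the call chain from an alerting service and returns
// downstream dependencies that are also anomalous — the candidate root causes.
func (g *Graph) SuspectDownstream(start string) []string {
	var out []string
	seen := map[string]bool{start: true}
	stack := []string{start}
	for len(stack) > 0 {
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		for _, next := range g.calls[n] {
			if seen[next] {
				continue
			}
			seen[next] = true
			if g.anomalous[next] {
				out = append(out, next)
			}
			stack = append(stack, next)
		}
	}
	return out
}

func main() {
	g := NewGraph()
	g.AddCall("order", "payment")
	g.AddCall("payment", "mysql-01")
	g.Flag("mysql-01")
	fmt.Println(g.SuspectDownstream("order")) // [mysql-01]
}
```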
Analysis comprises application‑level checks (runtime, middleware, logs, events) and link‑level tracing, with strategies to filter relevant traces using anomaly flags, T‑value classification, and topology similarity.
Weighting mechanisms (static, dynamic, application, dependency) prioritize suspect services, and strong/weak dependency pruning further narrows root causes.
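The four weighting signals above can be combined into a single suspect score for ranking. The field names and the simple weighted sum below are assumptions for illustration, not Qunar's actual formula:

```go
package main

import (
	"fmt"
	"sort"
)

// Suspect carries the four weighting signals; weak dependencies are pruned by
// driving DepW toward zero before ranking.
type Suspect struct {
	Service string
	Static  float64 // baseline importance of the service
	Dynamic float64 // anomaly strength from current alerts/metrics
	AppW    float64 // application-level check results (runtime, logs, events)
	DepW    float64 // strong/weak dependency weight
}

// Score is an illustrative weighted sum; real coefficients would be tuned.
func (s Suspect) Score() float64 {
	return 0.2*s.Static + 0.4*s.Dynamic + 0.2*s.AppW + 0.2*s.DepW
}

// RankSuspects orders candidate root causes by descending score.
func RankSuspects(cands []Suspect) []Suspect {
	sort.Slice(cands, func(i, j int) bool { return cands[i].Score() > cands[j].Score() })
	return cands
}

func main() {
	ranked := RankSuspects([]Suspect{
		{Service: "payment", Static: 0.8, Dynamic: 0.9, AppW: 0.7, DepW: 1.0},
		{Service: "search", Static: 0.5, Dynamic: 0.1, AppW: 0.2, DepW: 0.3},
	})
	fmt.Println(ranked[0].Service)
}
```

Ranking rather than thresholding matters here: during an incident many services look mildly abnormal, and the goal is to surface the few worth investigating first.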
3. Practical Outcomes
Post‑optimization, average fault discovery time dropped to under one minute, root‑cause identification accuracy reached 70‑80%, and overall MTTR improved by 75%.
A pre‑plan system, still in progress, aims to automate incident response by triggering SOPs based on weighted alerts and detected anomalies.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.