Optimizing Qunar's Monitoring System for Faster Fault Detection and Root‑Cause Analysis
This article details Qunar's comprehensive overhaul of its monitoring platform—introducing second‑level metrics, redesigning storage with VictoriaMetrics, optimizing client and server data collection, and building a root‑cause analysis tool—to dramatically reduce order‑related fault discovery time from minutes to under one minute.
Qunar's original monitoring system handled billions of metrics and millions of alerts but struggled with timely fault detection, averaging four minutes to detect order-related incidents, with only 20% of faults detected within one minute.
The company embarked on a full‑scale optimization, addressing fault discovery, root‑cause localization, and repair across the monitoring pipeline.
Background: Existing metrics were abundant yet insufficient for timely fault detection, prompting a shift toward MTTR‑driven improvements.
1. Building Second‑Level Monitoring
1.1 Challenges
Challenges included the high storage I/O of Graphite's Whisper backend, the need to redesign the entire collection‑to‑alert chain for second‑level granularity, and maintaining Graphite protocol compatibility.
1.2 Storage Selection and Architecture
After evaluating M3DB and VictoriaMetrics (VM), VM was selected for its compression, performance, and scalability, despite occasional degradation on complex aggregation queries.
To mitigate aggregation bottlenecks, a storage‑compute separation was implemented: VM handles simple metric queries, while CarbonAPI processes complex aggregations, enhanced with a metadata DB for efficient lookups.
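The storage‑compute split above implies a routing layer in front of the two backends. A minimal sketch of that idea in Go is below; the function list and backend names are illustrative assumptions, not Qunar's actual routing rules:

```go
package main

import (
	"fmt"
	"strings"
)

// aggregationFuncs lists Graphite functions that are expensive to evaluate
// inside the TSDB; queries using them are handed to CarbonAPI instead.
// (This set is illustrative, not Qunar's actual list.)
var aggregationFuncs = []string{"sumSeries(", "averageSeries(", "groupByNode(", "highestMax("}

// RouteQuery decides which backend serves a Graphite-style query:
// plain metric fetches go straight to VictoriaMetrics, while complex
// aggregations are routed to CarbonAPI for compute-side processing.
func RouteQuery(query string) string {
	for _, fn := range aggregationFuncs {
		if strings.Contains(query, fn) {
			return "carbonapi"
		}
	}
	return "victoriametrics"
}

func main() {
	fmt.Println(RouteQuery("app.order.count"))
	fmt.Println(RouteQuery("sumSeries(app.*.order.count)"))
}
```

The metadata DB mentioned above would sit beside this router, resolving wildcard metric names before the query is dispatched.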
1.3 Client‑Side Metric Collection Optimization
The original minute‑level SDK was refactored to support second‑level snapshots, employing data sampling (t‑digest for timer metrics) and multi‑snapshot generation to balance memory usage against precision.
Scheduler enhancements introduced a snapshot manager and real‑time configuration push via a config service.
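The snapshot mechanism can be sketched as a counter whose live value is atomically drained into an immutable per‑second snapshot by the scheduler tick, with a bounded ring of snapshots capping memory. This is a simplified assumption of the design (counters only; the real SDK additionally samples timer values with t‑digest):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Snapshot is an immutable view of a counter taken at a one-second boundary.
type Snapshot struct {
	Second int64
	Count  int64
}

// SecondCounter accumulates increments; once per second the snapshot manager
// swaps the live value out into a bounded ring of snapshots, which is what
// keeps memory flat at second-level granularity.
type SecondCounter struct {
	live      int64
	snapshots []Snapshot // ring of recent snapshots
	maxKeep   int
}

func NewSecondCounter(maxKeep int) *SecondCounter {
	return &SecondCounter{maxKeep: maxKeep}
}

// Inc is called on the application's hot path.
func (c *SecondCounter) Inc() { atomic.AddInt64(&c.live, 1) }

// Roll is called by the snapshot manager at each second tick: it atomically
// resets the live counter and records the drained value.
func (c *SecondCounter) Roll(second int64) {
	v := atomic.SwapInt64(&c.live, 0)
	c.snapshots = append(c.snapshots, Snapshot{Second: second, Count: v})
	if len(c.snapshots) > c.maxKeep {
		c.snapshots = c.snapshots[1:]
	}
}

func main() {
	c := NewSecondCounter(3)
	c.Inc()
	c.Inc()
	c.Roll(1)
	fmt.Println(c.snapshots[0].Count)
}
```

The config‑service push mentioned above would adjust parameters such as `maxKeep` and the roll interval at runtime without redeploying the application.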
1.4 Server‑Side Metric Collection Optimization
The master‑worker architecture was refined by removing the message queue, partitioning tasks via Etcd, and moving scheduling logic to workers, enabling high‑concurrency Go‑based processing.
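With the message queue removed, each worker must be able to decide task ownership locally. A minimal sketch of that partitioning idea is below, using a stable hash over a worker list; in the real system membership would come from Etcd watch events rather than a static slice, and the hash scheme here is an assumption:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// AssignTask picks which worker owns a collection task. Because the hash is
// stable, every worker computes the same answer independently, so no central
// scheduler or queue is needed; rebalancing happens when the worker list
// (maintained via Etcd in the real system) changes.
func AssignTask(taskID string, workers []string) string {
	h := fnv.New32a()
	h.Write([]byte(taskID))
	return workers[int(h.Sum32())%len(workers)]
}

func main() {
	workers := []string{"worker-0", "worker-1", "worker-2"}
	for _, task := range []string{"app-a", "app-b", "app-c"} {
		fmt.Printf("%s -> %s\n", task, AssignTask(task, workers))
	}
}
```

Go's lightweight goroutines then let each worker scrape its assigned targets concurrently, which is the high‑concurrency processing the article refers to.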
2. Fault‑Root‑Cause Analysis Platform
The platform constructs a knowledge graph from events, logs, traces, alerts, and application profiles, establishing service call chains, resource dependencies, and anomaly correlations.
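The knowledge graph described above can be pictured as call‑chain edges plus anomaly flags attached to nodes; root‑cause candidates are anomalous nodes reachable downstream of the alerting service. The sketch below is an illustrative simplification (service names and the single edge type are assumptions):

```go
package main

import "fmt"

// Graph is a minimal fault knowledge graph: service call edges plus a set of
// nodes currently flagged as anomalous by alerts, logs, or events.
type Graph struct {
	calls     map[string][]string // caller -> callees (incl. resources)
	anomalous map[string]bool
}

func NewGraph() *Graph {
	return &Graph{calls: map[string][]string{}, anomalous: map[string]bool{}}
}

func (g *Graph) AddCall(from, to string) { g.calls[from] = append(g.calls[from], to) }
func (g *Graph) Flag(node string)        { g.anomalous[node] = true }

// SuspectDownstream walks the call chain from an alerting service and returns
// downstream dependencies that are also anomalous — the candidate root causes.
func (g *Graph) SuspectDownstream(start string) []string {
	var out []string
	seen := map[string]bool{start: true}
	stack := []string{start}
	for len(stack) > 0 {
		n := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		for _, next := range g.calls[n] {
			if seen[next] {
				continue
			}
			seen[next] = true
			if g.anomalous[next] {
				out = append(out, next)
			}
			stack = append(stack, next)
		}
	}
	return out
}

func main() {
	g := NewGraph()
	g.AddCall("order", "payment")
	g.AddCall("payment", "mysql-01")
	g.Flag("mysql-01")
	fmt.Println(g.SuspectDownstream("order")) // [mysql-01]
}
```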
Analysis comprises application‑level checks (runtime, middleware, logs, events) and link‑level tracing, with strategies to filter relevant traces using anomaly flags, T‑value classification, and topology similarity.
Weighting mechanisms (static, dynamic, application, dependency) prioritize suspect services, and strong/weak dependency pruning further narrows root causes.
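The four weighting signals above can be combined into a single suspect score for ranking. The field names and the simple weighted sum below are assumptions for illustration, not Qunar's actual formula:

```go
package main

import (
	"fmt"
	"sort"
)

// Suspect carries the four weighting signals; weak dependencies are pruned by
// driving DepW toward zero before ranking.
type Suspect struct {
	Service string
	Static  float64 // baseline importance of the service
	Dynamic float64 // anomaly strength from current alerts/metrics
	AppW    float64 // application-level check results (runtime, logs, events)
	DepW    float64 // strong/weak dependency weight
}

// Score is an illustrative weighted sum; real coefficients would be tuned.
func (s Suspect) Score() float64 {
	return 0.2*s.Static + 0.4*s.Dynamic + 0.2*s.AppW + 0.2*s.DepW
}

// RankSuspects orders candidate root causes by descending score.
func RankSuspects(cands []Suspect) []Suspect {
	sort.Slice(cands, func(i, j int) bool { return cands[i].Score() > cands[j].Score() })
	return cands
}

func main() {
	ranked := RankSuspects([]Suspect{
		{Service: "payment", Static: 0.8, Dynamic: 0.9, AppW: 0.7, DepW: 1.0},
		{Service: "search", Static: 0.5, Dynamic: 0.1, AppW: 0.2, DepW: 0.3},
	})
	fmt.Println(ranked[0].Service)
}
```

Ranking rather than thresholding matters here: during an incident many services look mildly abnormal, and the goal is to surface the few worth investigating first.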
3. Practical Outcomes
Post‑optimization, average fault discovery time dropped to under one minute, root‑cause identification accuracy reached 70‑80%, and overall MTTR improved by 75%.
A pre‑plan system, still in progress, aims to automate incident response by triggering SOPs based on weighted alerts and detected anomalies.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.