
How Qunar Achieved Sub‑Second Monitoring to Slash Fault Detection Time to Under 1 Minute

Qunar upgraded its Watcher monitoring platform from minute-level to second-level precision by redesigning the storage, data collection, and alerting pipelines, adopting VictoriaMetrics, enhancing the client SDKs, and adding fine-grained alarm rules. The upgrade cut fault detection time from four minutes to under one minute while improving reliability and scalability.

dbaplus Community

Background

As hotel booking traffic grew rapidly, order-related failures were being detected several minutes after they occurred, because the existing Watcher monitoring system only provided minute-level metrics. To meet a "1-minute detection, 5-minute localization, 10-minute recovery" goal, the team upgraded monitoring precision to the second level.

Overall Architecture

Watcher consists of five layers:

Data collection – custom qmonitor client/server.

Time‑series storage – originally Graphite/Whisper.

Query API – extended Graphite API.

Dashboard – Grafana‑based.

Alarm service – Icinga‑based notifications.

All layers originally operated on minute‑level data points.

Key Challenges

High storage I/O and space consumption caused by Whisper’s pre‑allocation and write amplification.

Maintaining compatibility with the Graphite protocol while migrating.

Re‑engineering the full data‑flow (collection → storage → query → alert) to support second‑level granularity.
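The storage challenge can be made concrete with a rough estimate. The sketch below assumes Whisper's on-disk layout of 12 bytes per data point (a 4-byte timestamp plus an 8-byte value), a 16-byte file header, and a 12-byte header per archive. The 30-day retention is an illustrative assumption, not a figure from Qunar.

```python
# Rough estimate of Whisper's pre-allocated file size for one metric.
# Assumption: 12 bytes per point (4-byte timestamp + 8-byte double),
# a 16-byte file header and 12 bytes of header per archive.
POINT_SIZE = 12
FILE_HEADER_SIZE = 16
ARCHIVE_HEADER_SIZE = 12


def whisper_file_bytes(step_seconds: int, retention_seconds: int) -> int:
    """Size of a single-archive Whisper file, allocated up front."""
    points = retention_seconds // step_seconds
    return FILE_HEADER_SIZE + ARCHIVE_HEADER_SIZE + points * POINT_SIZE


RETENTION = 30 * 24 * 3600  # hypothetical 30-day retention
minute_file = whisper_file_bytes(60, RETENTION)
second_file = whisper_file_bytes(1, RETENTION)

print(f"minute-level: {minute_file / 1024:.0f} KiB per metric")
print(f"second-level: {second_file / 1024 ** 2:.1f} MiB per metric")
print(f"growth factor: {second_file / minute_file:.1f}x")
```

Because the file is allocated in full when the metric is created, the roughly 60-fold growth applies to every second-level metric whether or not it ever receives data, which is one reason a compressing store looks attractive.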

Storage Refactor

The team evaluated M3DB and VictoriaMetrics (VM) as drop‑in replacements that support the Graphite protocol. Benchmarking on a 32‑core, 64 GB, 3.2 TB SSD machine showed:

Write throughput of 1 M points per minute, average latency ≈100 ms.

Single‑metric query QPS ≈2 000 with stable CPU load.

Complex aggregation queries slowed down; these were routed to the original CarbonAPI.

VM was selected for its high compression ratio, read/write performance (up to 10 M metrics / s), and simple deployment. Deployment uses vmstorage, vminsert, and vmselect components.
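Graphite protocol compatibility means existing senders can keep emitting plaintext lines of the form `<path> <value> <timestamp>`. The following is a minimal sketch of that line format and a TCP sender. The metric name is hypothetical, and the host and port of the Graphite-compatible listener are left to the caller, since the source does not describe Qunar's actual ingestion endpoints.

```python
import socket
import time


def graphite_line(path: str, value: float, timestamp: int) -> str:
    """Format one data point in Graphite's plaintext protocol."""
    if any(ch.isspace() for ch in path):
        raise ValueError("metric path must not contain whitespace")
    return f"{path} {value} {timestamp}\n"


def send_points(host: str, port: int, lines: list[str]) -> None:
    """Send pre-formatted lines to a Graphite-compatible TCP listener."""
    with socket.create_connection((host, port), timeout=2) as conn:
        conn.sendall("".join(lines).encode("ascii"))


# Hypothetical metric name; one point per second for the last three seconds.
now = int(time.time())
batch = [graphite_line("hotel.order.count", 42, now - i) for i in range(3)]
print("".join(batch), end="")
```

`send_points` is defined but not called here, so the sketch runs without a live listener.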

Client Metric‑Collection Optimization

Two redesign options were considered:

Option 1: Adopt a Prometheus-style pull model, eliminating snapshots. This would reduce client memory usage but require extensive SDK changes.

Option 2: Retain snapshot generation but produce separate minute- and second-level snapshots. This requires minimal SDK changes and avoids precision loss.

Option 2 was chosen. A new metadata DB stores metric names and query URLs; the CarbonAPI resolves multi‑label or function‑based metrics to single‑label metrics before forwarding them to VM, improving query performance.
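One possible shape for keeping separate minute- and second-level snapshots is sketched below. The class and method names are hypothetical and do not come from the qmonitor SDK; the sketch only illustrates bucketing the same increments at both granularities so that neither snapshot is derived from the other.

```python
from collections import defaultdict


class DualSnapshotCounter:
    """Counter that buckets increments by second and by minute."""

    def __init__(self) -> None:
        self._per_second = defaultdict(int)
        self._per_minute = defaultdict(int)

    def increment(self, timestamp: int, amount: int = 1) -> None:
        self._per_second[timestamp] += amount
        # Align the timestamp to the start of its minute.
        self._per_minute[timestamp - timestamp % 60] += amount

    def snapshot_seconds(self, before: int) -> dict[int, int]:
        """Return and clear second buckets that closed before `before`."""
        return self._drain(self._per_second, before, width=1)

    def snapshot_minutes(self, before: int) -> dict[int, int]:
        """Return and clear minute buckets that closed before `before`."""
        return self._drain(self._per_minute, before, width=60)

    @staticmethod
    def _drain(buckets, before, width):
        closed = {ts: n for ts, n in buckets.items() if ts + width <= before}
        for ts in closed:
            del buckets[ts]
        return closed


counter = DualSnapshotCounter()
for ts in (120, 120, 121, 179):
    counter.increment(ts)
second_snapshot = counter.snapshot_seconds(before=122)
minute_snapshot = counter.snapshot_minutes(before=180)
print(second_snapshot)  # {120: 2, 121: 1}
print(minute_snapshot)  # {120: 4}
```

The second bucket for timestamp 179 stays in memory until a later snapshot call, which shows the memory cost that the pull-based option would have avoided.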

Server Metric‑Collection Optimization

The original master‑worker architecture used a message queue that introduced up to 12 seconds of scheduling latency, breaking second‑level collection. The solution moved scheduling logic into the workers, making them stateful while allowing the master to reassign tasks on failure. The system was rewritten in Go using goroutines for high concurrency.

Key changes:

Removed the message queue but kept the master‑worker pattern.

Implemented task partitioning via Etcd; the master distributes tens of thousands of tasks across workers.

Workers pull tasks directly, reducing latency to sub‑second levels.
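The source does not say how the master decides which worker owns which task. As one illustration, the sketch below uses rendezvous (highest-random-weight) hashing, a substitute technique chosen here because a worker failure then moves only that worker's tasks. Task and worker names are hypothetical, and no Etcd calls are made.

```python
import hashlib


def _score(task: str, worker: str) -> int:
    """Deterministic pseudo-random weight for a (task, worker) pair."""
    digest = hashlib.sha256(f"{worker}|{task}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owner(task: str, workers: list[str]) -> str:
    """Pick the worker with the highest hash score for this task."""
    if not workers:
        raise ValueError("no live workers")
    return max(workers, key=lambda w: _score(task, w))


def partition(tasks: list[str], workers: list[str]) -> dict[str, list[str]]:
    """Build the full assignment plan a master could publish."""
    plan = {w: [] for w in workers}
    for task in tasks:
        plan[owner(task, workers)].append(task)
    return plan


tasks = [f"collect-{i}" for i in range(10_000)]
workers = ["worker-a", "worker-b", "worker-c"]
before = partition(tasks, workers)

# Simulate worker-b failing: only its tasks need a new owner.
survivors = [w for w in workers if w != "worker-b"]
after = partition(tasks, survivors)
moved = [t for t in tasks if owner(t, workers) != owner(t, survivors)]
print({w: len(ts) for w, ts in before.items()})
print(f"tasks reassigned after failure: {len(moved)}")
```

In a real deployment the plan would presumably be written to Etcd and watched by the workers, but that wiring is outside what the article describes.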

Second‑Level Monitoring Usage

Second‑level monitoring is applied only to core business metrics (order volume, transaction failure rate, smoothness) to avoid alarm storms and unnecessary resource consumption. Dashboards were reorganized to separate minute‑ and second‑level panels. A whitelist mechanism in the SDK enables selected metrics to be collected at second granularity without code changes.
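A whitelist of this kind can be as simple as a list of name patterns consulted when a metric is registered. The sketch below is an assumption about how such a check might look; the pattern syntax, metric names, and function name are hypothetical rather than taken from the SDK.

```python
from fnmatch import fnmatchcase

# Hypothetical whitelist, e.g. loaded from remote configuration.
SECOND_LEVEL_WHITELIST = ["hotel.order.*", "pay.transaction.failure_rate"]


def collection_interval(metric: str, whitelist: list[str]) -> int:
    """Return the collection interval in seconds for a metric."""
    if any(fnmatchcase(metric, pattern) for pattern in whitelist):
        return 1
    return 60


for name in ("hotel.order.count", "hotel.search.latency"):
    print(name, collection_interval(name, SECOND_LEVEL_WHITELIST))
```

Keeping the decision in configuration rather than in application code is what lets a metric be promoted to second-level collection without a redeploy.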

Alarm Configuration

New alarm templates (SL1, SL2) align notification frequency with second‑level sampling:

SL1 – immediate phone and QTalk alerts.

SL2 – immediate QTalk alert followed by a phone call after three minutes.

Fine‑grained rules and holiday/weekend adjustments further reduce false positives.
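The two templates can be modeled as ordered escalation steps. The sketch below encodes the channels and the three-minute delay described above; the data structure and function names are illustrative and are not Watcher's actual configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    delay_seconds: int
    channel: str


# Escalation steps as described in the text; the structure is illustrative.
TEMPLATES = {
    "SL1": (Step(0, "phone"), Step(0, "qtalk")),
    "SL2": (Step(0, "qtalk"), Step(180, "phone")),
}


def channels_due(template: str, elapsed_seconds: int, already_sent: set[str]) -> list[str]:
    """Channels to notify for an unresolved alarm of the given age."""
    return [
        step.channel
        for step in TEMPLATES[template]
        if step.delay_seconds <= elapsed_seconds and step.channel not in already_sent
    ]


print(channels_due("SL1", 0, set()))        # ['phone', 'qtalk']
print(channels_due("SL2", 0, set()))        # ['qtalk']
print(channels_due("SL2", 180, {"qtalk"}))  # ['phone']
```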

Operational Results

After rollout across multiple business lines:

Fault detection time dropped from four minutes to under one minute.

The share of ATP faults discovered within one minute and the alert precision both improved steadily.

Core business indicators (order count, failure rate, transaction smoothness) are now monitored in real time.

Future Plans

The team plans to add algorithmic trend analysis for proactive anomaly detection, including intelligent alerts for sudden drops or continuous degradations, and to extend second‑level coverage to additional critical metrics.

Key Diagrams

Figure: Watcher architecture
Figure: Storage refactor
Figure: Client metric collection
Figure: Server metric collection

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
