
Design and Optimization of a Large‑Scale Monitoring System at Qunar.com

This article describes the architecture, challenges, and performance optimizations of Qunar.com's Watcher monitoring platform, covering massive metric collection, master‑worker redesign, Graphite/Whisper storage enhancements, and future migration to Go‑based cloud‑native solutions.

Qunar Tech Salon

Qunar.com operates a one‑stop monitoring platform called Watcher, originally built on Graphite+Whisper for storage and Grafana for the front‑end, handling billions of metric points per minute and millions of alerts.

The system needed to evolve from basic infrastructure monitoring to comprehensive application‑level observability, requiring robust data collection, visualization, and anomaly detection capabilities.

To address massive metric ingestion, Qunar developed the Qmonitor system, similar to Prometheus' scrape module, and adopted a master‑worker architecture where the master distributes tasks to workers via a message queue.

As metric volume grew, bottlenecks appeared in task dispatching and CPU usage during aggregation. The architecture was refined by removing the message queue, registering workers in etcd, and using etcd‑based rebalance to assign tasks, resulting in lower latency and higher availability.
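The etcd-based rebalance can be sketched in a few lines. This is a hypothetical illustration, not Qmonitor's actual code: real worker membership comes from watching registrations in etcd, which is simulated here with a plain list, and tasks are mapped to workers by deterministic hashing.

```python
import hashlib

def assign_tasks(tasks, workers):
    """Deterministically map scrape tasks onto live workers.

    Sketch of the rebalance described above: when a worker joins or
    leaves (as observed via etcd in the real system), re-running this
    function over the new membership yields the new assignment plan,
    with no message queue in the path.
    """
    workers = sorted(workers)
    plan = {w: [] for w in workers}
    for task in tasks:
        # stable hash so a task lands on the same worker across runs
        idx = int(hashlib.md5(task.encode()).hexdigest(), 16) % len(workers)
        plan[workers[idx]].append(task)
    return plan

plan = assign_tasks(["app-a/metrics", "app-b/metrics", "app-c/metrics"],
                    ["worker-1", "worker-2"])
```

Because the assignment is a pure function of membership and the task set, master failover only requires recomputing the plan from etcd state.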

Workers now cache tasks, scrape URLs concurrently, parse various client formats into a unified SingleMetric object, aggregate them into Metric objects, and push the results to the storage cluster.
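The parse-then-aggregate step might look like the sketch below. The class names SingleMetric and Metric come from the article, but their fields and the `name=value` line format are illustrative assumptions, not Qmonitor's actual wire format.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class SingleMetric:      # one raw sample scraped from a client
    name: str
    value: float

@dataclass
class Metric:            # aggregated result pushed to storage
    name: str
    count: int
    total: float

def parse_lines(text):
    """Parse a hypothetical 'name=value' per-line payload."""
    out = []
    for line in text.strip().splitlines():
        name, _, value = line.partition("=")
        out.append(SingleMetric(name.strip(), float(value)))
    return out

def aggregate(singles):
    """Fold SingleMetrics sharing a name into one Metric (count + sum)."""
    acc = defaultdict(lambda: [0, 0.0])
    for s in singles:
        acc[s.name][0] += 1
        acc[s.name][1] += s.value
    return {n: Metric(n, c, t) for n, (c, t) in acc.items()}

metrics = aggregate(parse_lines("req.count=3\nreq.count=4\nreq.latency=12.5"))
```

Keeping parsing per client format but aggregation over a single unified type is what lets one worker handle heterogeneous clients.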

Storage pressure was mitigated by fully distributing the Graphite components across multiple clusters with Salt automation, isolating business units to reduce single‑node load.

The built‑in carbon‑aggregator suffered from high CPU and memory consumption as well as memory leaks. Optimizations included a global cache that replaced the per‑rule caches, an asynchronous TTL‑based cleanup mechanism, and early exit after the first matching rule.
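The three optimizations combine as in the sketch below, which is an assumption-laden illustration rather than carbon-aggregator's real code: rules are plain regexes here, and the asynchronous cleanup is shown as an explicit `sweep` call.

```python
import re
import time

class RuleMatcher:
    """One global match cache shared by all rules (instead of per-rule
    caches), lazy TTL expiry, and early exit on the first matching rule.
    """
    def __init__(self, rules, ttl=300):
        self.rules = [(re.compile(p), method) for p, method in rules]
        self.cache = {}   # metric name -> (method, inserted_at)
        self.ttl = ttl

    def match(self, name, now=None):
        now = time.monotonic() if now is None else now
        hit = self.cache.get(name)
        if hit and now - hit[1] < self.ttl:
            return hit[0]                 # cached, no regex work
        method = None
        for pattern, m in self.rules:
            if pattern.match(name):       # stop at the first match
                method = m
                break
        self.cache[name] = (method, now)  # negative results cached too
        return method

    def sweep(self, now=None):
        """TTL cleanup; the real system runs this asynchronously."""
        now = time.monotonic() if now is None else now
        self.cache = {k: v for k, v in self.cache.items()
                      if now - v[1] < self.ttl}

matcher = RuleMatcher([(r"metric\.", "last"), (r".*", "sum")])
```

The TTL bounds cache growth (fixing the leak-like behavior of unbounded per-rule caches), while the early exit cuts regex evaluations for metrics matched by an early rule.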

Configuration example (carbon‑aggregator aggregation rule):

# cache all metrics
metric.<metric> (60) = last metric.<metric>

Disk I/O was high because Whisper stores each metric in its own file. The team introduced write‑merge logic using the "last" aggregation method and a periodic flush every 60 seconds, reducing redundant writes.
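The write-merge idea can be sketched as follows, under stated assumptions: `sink` stands in for the Whisper write call, and the class and its fields are hypothetical names, not the team's implementation.

```python
import time

class WriteMerger:
    """Collapse points for the same metric within a flush window using
    the 'last' method, flushing the buffer every `interval` seconds.
    """
    def __init__(self, sink, interval=60, now=None):
        self.sink = sink
        self.interval = interval
        self.buffer = {}          # metric -> latest value in this window
        self.last_flush = time.monotonic() if now is None else now

    def write(self, metric, value, now=None):
        now = time.monotonic() if now is None else now
        self.buffer[metric] = value       # 'last' wins within the window
        if now - self.last_flush >= self.interval:
            self.flush(now)

    def flush(self, now=None):
        for metric, value in self.buffer.items():
            self.sink(metric, value)      # one disk write per metric
        self.buffer.clear()
        self.last_flush = time.monotonic() if now is None else now

written = []
merger = WriteMerger(lambda m, v: written.append((m, v)), interval=60, now=0.0)
merger.write("cpu.load", 1.0, now=0.0)
merger.write("cpu.load", 2.0, now=10.0)   # merged: overwrites 1.0
merger.write("cpu.load", 3.0, now=61.0)   # window elapsed -> flush once
```

Three incoming points become a single Whisper write, which is where the I/O savings come from.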

Additional measures such as token‑bucket rate limiting and asynchronous Whisper writes further lowered disk usage.
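A minimal token bucket illustrates the rate-limiting measure; the parameters and the drop-on-empty policy below are illustrative assumptions, not Watcher's actual settings.

```python
class TokenBucket:
    """Classic token bucket: tokens refill at `rate` per second up to
    `capacity`; each write spends one token or is rejected.
    """
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.timestamp = 0.0

    def allow(self, now):
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.timestamp) * self.rate)
        self.timestamp = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                # caller defers or drops the write

bucket = TokenBucket(rate=2, capacity=2)
```

Capping the write rate this way smooths disk I/O spikes; combined with asynchronous Whisper writes, bursts are absorbed rather than hitting the disk directly.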

In summary, the article highlights that metric collection and storage are foundational for monitoring systems, discusses the trade‑offs of using Whisper versus alternatives like ClickHouse, and notes a strategic shift toward Go‑based, cloud‑native components for future development.

Distributed Systems · Monitoring · Cloud Native · Performance Optimization · CI/CD · Metrics
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
