Inside Qunar’s Watcher: Building a Scalable Monitoring System for Millions of Metrics

The article details how Qunar’s operations team designed and implemented the Watcher monitoring platform—based on Graphite, Grafana, and Nagios—to achieve high availability, horizontal scalability, and rich visualization for over six million system metrics and two million business metrics.

ITPUB

Why Monitoring Matters

Monitoring is essential for four reasons: real‑time alerts when servers or services fail, early warning of performance degradation, rapid root‑cause analysis using detailed metric relationships, and capacity planning based on historical trends.

Origins of Watcher

Qunar initially used Cacti, which offered extensive plugins but suffered from a single‑point‑of‑failure architecture, poor horizontal scalability, limited storage accuracy, and weak visualization/API support. By the end of 2014 the team decided to build a new system that was reliable, highly available, easy to expand, and provided accurate data and rich visualizations.

Design and Technology Choices

The team evaluated OpenTSDB, InfluxDB, and Graphite. OpenTSDB depended on a Hadoop/HBase stack, which meant a high startup cost, and InfluxDB was still unstable at that early stage. After internal testing and a recommendation from Douban’s Professor Hong, the team selected Graphite for its powerful data collection, friendly Render API, simple Whisper file storage, and natively distributed design that scales out horizontally.

Watcher Architecture

Watcher combines Graphite (carbon, whisper, graphite‑api) with Grafana for dashboards and Nagios for alerting. Carbon receives metric data over a simple text protocol and stores it in Whisper files, while graphite‑api serves data and rendered images through RESTful URLs. The architecture is fully distributed, so each layer can be load‑balanced and expanded independently.
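
To make the two interfaces concrete, here is a minimal Python sketch of Carbon's plaintext line format (metric path, value, Unix timestamp, newline) and of a Render API query URL. The host names and metric path are placeholders rather than Qunar's actual endpoints, and 2003 is Carbon's default plaintext port, which a real deployment may change.

```python
import socket
import time
from urllib.parse import urlencode

CARBON_HOST = "carbon.example.internal"  # placeholder host, not from the article
CARBON_PORT = 2003                       # Carbon's default plaintext listener port


def format_metric(path, value, timestamp):
    """Build one plaintext-protocol line: path, value, Unix timestamp, newline."""
    return f"{path} {value} {timestamp}\n"


def send_metric(path, value, timestamp=None, host=CARBON_HOST, port=CARBON_PORT):
    """Open a TCP connection to Carbon and push a single data point."""
    ts = int(time.time()) if timestamp is None else timestamp
    line = format_metric(path, value, ts)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


def render_url(base_url, target, since="-1h", fmt="json"):
    """Build a Render API URL that returns the target's recent data points."""
    query = urlencode({"target": target, "from": since, "format": fmt})
    return f"{base_url}/render?{query}"


if __name__ == "__main__":
    # Only formats and prints; nothing is sent over the network here.
    print(format_metric("servers.web01.load.shortterm", 0.42, 1700000000), end="")
    print(render_url("http://graphite.example.internal", "servers.web01.load.shortterm"))
```

Calling `send_metric` requires a reachable Carbon listener. The formatting helpers work offline, which makes it easy to check metric names before anything is pushed.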

Data Flow and Cluster Layout

Data collection uses collectd on Linux, SSC Serv on Windows, and Qmonitor Server for business metrics. Agents push data either with a single nc command or programmatically. A first layer of carbon relays distributes incoming data across multiple servers using consistent hashing, with redundancy so that data survives the loss of a server. A second relay layer forwards the data to carbon caches, which write the Whisper files. Users retrieve metrics through graphite‑api, which feeds the Grafana dashboards, while Nagios and a custom checker handle alerting.
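
The relay layer's routing idea can be sketched as a generic consistent-hash ring with replication. This is an illustration of the technique, not carbon-relay's actual implementation, and the node names, the virtual-node count, and the replication factor of 2 are all assumptions, since the article does not state them.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent-hash ring; a sketch, not carbon-relay's own code."""

    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` points on the ring to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]
        self._node_count = len(set(nodes))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def get_nodes(self, metric, replicas=2):
        """Return up to `replicas` distinct nodes that should store `metric`."""
        if not self._ring:
            return []
        wanted = min(replicas, self._node_count)
        # First ring point clockwise from the metric's hash, wrapping at the end.
        i = bisect.bisect(self._positions, self._hash(metric)) % len(self._ring)
        chosen = []
        while len(chosen) < wanted:
            node = self._ring[i][1]
            if node not in chosen:
                chosen.append(node)
            i = (i + 1) % len(self._ring)
        return chosen


if __name__ == "__main__":
    ring = HashRing(["cache-a", "cache-b", "cache-c"])  # hypothetical node names
    print(ring.get_nodes("servers.web01.cpu.user"))
```

The property that matters for horizontal scaling is that adding a node only moves the metrics that now hash to that node, while the rest stay where they were. Writing each metric to more than one node is what lets reads continue when a server is lost.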

Use Cases at Qunar

Watcher supports system‑level monitoring (CPU, load, network traffic) with auto‑deployed agents and templated dashboards, as well as business‑level monitoring (API latency, request counts, booking volumes) collected via Qmonitor client libraries embedded in applications. Middleware components such as MySQL, JVM, Redis, and Memcached are also monitored, and a separate log‑monitoring platform handles log collection, analysis, and search.
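
The article does not show the Qmonitor client API, so the following is a hypothetical sketch of how an embedded library might aggregate business metrics in process and emit them as Graphite-style lines. The class name, the metric prefix, and the `count`/`mean_ms` suffixes are invented for illustration only.

```python
from collections import defaultdict


class MetricBuffer:
    """Hypothetical in-process aggregator; NOT the Qmonitor client API."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._counts = defaultdict(int)
        self._total_ms = defaultdict(float)

    def record(self, name, elapsed_ms):
        """Record one request and its latency in milliseconds."""
        self._counts[name] += 1
        self._total_ms[name] += elapsed_ms

    def flush(self, timestamp):
        """Return one count line and one mean-latency line per name, then reset."""
        lines = []
        for name in sorted(self._counts):
            count = self._counts[name]
            mean_ms = self._total_ms[name] / count
            lines.append(f"{self.prefix}.{name}.count {count} {timestamp}")
            lines.append(f"{self.prefix}.{name}.mean_ms {mean_ms:.3f} {timestamp}")
        self._counts.clear()
        self._total_ms.clear()
        return lines


if __name__ == "__main__":
    buf = MetricBuffer("business.booking")  # hypothetical prefix
    buf.record("search_api", 120.0)
    buf.record("search_api", 80.0)
    for line in buf.flush(1700000000):
        print(line)
```

Aggregating per interval inside the application keeps the number of points sent per metric constant regardless of request volume, which matters when the backend stores one value per metric per time slot.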

Current Status and Future Challenges

Watcher now covers over 6 million system metrics and 2 million business metrics, handling roughly 15 million metric points per minute. While it meets the original goals of high availability, horizontal scalability, and flexible visualization, the team plans to improve rapid scaling procedures and operational efficiency.
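
A quick back-of-the-envelope calculation shows what these figures imply for ingest rate. The inputs are the article's rounded numbers ("over" 6 million and "roughly" 15 million), so the results are order-of-magnitude estimates only.

```python
SYSTEM_METRICS = 6_000_000      # "over 6 million" in the article
BUSINESS_METRICS = 2_000_000    # "2 million" in the article
POINTS_PER_MINUTE = 15_000_000  # "roughly 15 million" in the article


def ingest_summary(system, business, points_per_minute):
    """Derive total metrics, points per second, and average points per metric."""
    total = system + business
    return {
        "total_metrics": total,
        "points_per_second": points_per_minute / 60,
        "points_per_metric_per_minute": points_per_minute / total,
    }


if __name__ == "__main__":
    for key, value in ingest_summary(
        SYSTEM_METRICS, BUSINESS_METRICS, POINTS_PER_MINUTE
    ).items():
        print(f"{key}: {value}")
```

With these inputs the cluster absorbs about 250,000 points per second, and each metric reports a little under twice a minute on average. Individual metrics may well use different intervals.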
