Inside Qunar’s Watcher: Building a Scalable Monitoring System for Millions of Metrics
This article describes how Qunar’s operations team designed and built Watcher, a monitoring platform based on Graphite, Grafana, and Nagios, to provide high availability, horizontal scalability, and rich visualization for more than six million system metrics and two million business metrics.
Why Monitoring Matters
Monitoring is essential for four reasons: real‑time alerts when servers or services fail, early warning of performance degradation, rapid root‑cause analysis using detailed metric relationships, and capacity planning based on historical trends.
Origins of Watcher
Qunar initially used Cacti, which offered extensive plugins but suffered from a single‑point‑of‑failure architecture, poor horizontal scalability, limited storage accuracy, and weak visualization/API support. By the end of 2014 the team decided to build a new system that was reliable, highly available, easy to expand, and provided accurate data and rich visualizations.
Design and Technology Choices
The team evaluated OpenTSDB, InfluxDB, and Graphite. OpenTSDB depended on Hadoop and carried a high startup cost, and InfluxDB was still unstable in its early releases. After internal testing and a recommendation from Douban’s Professor Hong, the team selected Graphite for its powerful data collection, friendly Render API, simple Whisper file storage, and natively distributed design that scales out horizontally.
Watcher Architecture
Watcher combines Graphite (carbon, whisper, graphite‑api) with Grafana for dashboards and Nagios for alerting. Carbon receives metric data via a simple text protocol, stores it as Whisper files, and graphite‑api serves data and images through RESTful URLs. The architecture is fully distributed, allowing each layer to be load‑balanced and expanded independently.
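As a rough illustration of the two interfaces described above, the following Python sketch formats a data point in Carbon's plaintext protocol (one line of metric path, value, and Unix timestamp) and builds a Render API query URL. The metric path, host names, and function names are hypothetical and do not come from Watcher itself. Port 2003 is Carbon's default plaintext port, which a given deployment may change.

```python
import socket
import time
from urllib.parse import urlencode


def format_metric(path: str, value: float, timestamp: int) -> bytes:
    """Build one newline-terminated plaintext line: path, value, Unix timestamp."""
    return f"{path} {value} {timestamp}\n".encode("ascii")


def send_metric(host: str, port: int, path: str, value: float) -> None:
    """Push a single data point to a Carbon listener over TCP."""
    line = format_metric(path, value, int(time.time()))
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line)


def render_url(base: str, target: str, start: str = "-1h", fmt: str = "json") -> str:
    """Build a Render API URL for one target over a relative time range."""
    query = urlencode({"target": target, "from": start, "format": fmt})
    return f"{base}/render?{query}"


if __name__ == "__main__":
    # Hypothetical metric path and host, for illustration only.
    print(format_metric("servers.web01.cpu.user", 12.5, 1700000000))
    print(render_url("http://graphite.example.com", "servers.web01.cpu.user"))
    # send_metric("carbon.example.com", 2003, "servers.web01.cpu.user", 12.5)
```

The `send_metric` call is left commented out because it needs a reachable Carbon listener. The line it would send is the same text a shell user would pipe into nc.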
Data Flow and Cluster Layout
Data collection uses collectd on Linux, SSC Serv on Windows, and Qmonitor Server for business metrics. Agents push data either with a single nc command or programmatically. The first layer of carbon relays distributes incoming data across multiple servers using consistent hashing with redundancy. A second relay layer forwards data to carbon caches, which write Whisper files. Users retrieve metrics through graphite‑api, which feeds Grafana dashboards; Nagios and a custom checker handle alerting.
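The relay behaviour described above (consistent hashing plus redundancy) can be sketched as a small hash ring in Python. This is a generic illustration of the technique, not Carbon's actual implementation. The class name, node names, virtual-node count, and the use of MD5 are all assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=100):
        # Place each node on the ring many times to even out the distribution.
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]
        self._node_count = len(set(nodes))

    @staticmethod
    def _hash(key):
        # Use the first 32 bits of an MD5 digest as the ring position.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest()[:8], 16)

    def get_nodes(self, metric, replicas=2):
        """Return up to `replicas` distinct nodes for a metric, walking clockwise."""
        replicas = min(replicas, self._node_count)
        if replicas <= 0:
            return []
        start = bisect.bisect(self._keys, self._hash(metric))
        chosen = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == replicas:
                    break
        return chosen


if __name__ == "__main__":
    # Hypothetical backend names.
    ring = HashRing(["cache-a", "cache-b", "cache-c"])
    print(ring.get_nodes("servers.web01.cpu.user", replicas=2))
```

Because a metric name always hashes to the same ring position, every relay that shares the same node list routes that metric to the same set of backends. Adding or removing a node moves only the metrics near that node's ring positions.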
Use Cases at Qunar
Watcher supports system‑level monitoring (CPU, load, network traffic) with auto‑deployed agents and templated dashboards, as well as business‑level monitoring (API latency, request counts, booking volumes) collected via Qmonitor client libraries embedded in applications. Middleware components such as MySQL, JVM, Redis, and Memcached are also monitored, and a separate log‑monitoring platform handles log collection, analysis, and search.
Current Status and Future Challenges
Watcher now covers over 6 million system metrics and 2 million business metrics, handling roughly 15 million metric points per minute. While it meets the original goals of high availability, horizontal scalability, and flexible visualization, the team plans to improve rapid scaling procedures and operational efficiency.
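To put these figures in perspective, the short Python calculation below derives per-second throughput and the average reporting rate per metric. It treats the article's rounded numbers ("over 6 million", "roughly 15 million") as exact, so the results are only approximate.

```python
# Figures quoted in the article, treated as exact for a back-of-envelope estimate.
SYSTEM_METRICS = 6_000_000
BUSINESS_METRICS = 2_000_000
POINTS_PER_MINUTE = 15_000_000

total_metrics = SYSTEM_METRICS + BUSINESS_METRICS
points_per_second = POINTS_PER_MINUTE / 60
points_per_metric_per_minute = POINTS_PER_MINUTE / total_metrics

print(total_metrics)                 # 8000000
print(points_per_second)             # 250000.0
print(points_per_metric_per_minute)  # 1.875
```

In other words, the cluster ingests about 250,000 points per second, and an average metric reports a little under twice per minute.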
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.