
Qunar's Watcher Monitoring System: Design, Implementation, and Operational Practices

Zhang Yue, a Qunar operations engineer, discusses the design, selection, architecture, scalability challenges, visualization, alert strategies, and future plans of the company's in‑house monitoring platform Watcher, highlighting lessons learned from migrating from Cacti to a Graphite‑based, Grafana‑enhanced solution.

Qunar Tech Salon

At the 2016 APMCon conference, Qunar operations engineer Zhang Yue presented "Watcher"—the company's home‑grown monitoring system—covering its development journey, design choices, and operational experience.

Watcher evolved from a two‑person effort to a three‑person team responsible for monitoring most of Qunar's core services. The original solution, Cacti, proved inadequate as metric volume grew, suffering from single‑point failures, poor horizontal scalability, limited visualization, and lack of an open API.

To address these issues, the team set four primary goals for the new system: high availability, horizontal scalability, enhanced visualization, and an open API.

For data accuracy and scalability, Watcher adopted Graphite. Graphite supports a scale‑out architecture, which lets the system keep longer time ranges uncompressed and use sampled monitoring while maintaining precision. Visualization is powered by a customized Grafana instance that integrates Qunar's product hierarchy and user system, offering multi‑dimensional dashboards and flexible data displays.
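As a rough illustration of how a metric reaches Graphite, the sketch below formats a sample in Graphite's plaintext protocol (one line of the form "path value timestamp") and sends it to a Carbon listener. The host name and metric path are hypothetical and are not taken from the talk; port 2003 is Carbon's default plaintext port. How Watcher actually ships metrics is not described in the source.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one sample as '<path> <value> <timestamp>\\n' (Graphite plaintext protocol)."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(line, host="carbon.example.internal", port=2003):
    """Send one formatted line to a Carbon plaintext listener.

    The host is a placeholder (assumption); 2003 is Carbon's default plaintext port.
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Hypothetical metric path, for illustration only.
line = graphite_line("example.orders.cancel.count", 12, 1466000000)
print(line, end="")
```

Calling send_metric(line) would deliver the sample, provided a Carbon instance is reachable at the given host.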

Alerting in Watcher is rule‑based, supporting static thresholds, week‑over‑week comparisons, frequency checks, multi‑trigger alerts, time‑window specific rules, temporary rules, on‑call rotations, callbacks, and multiple notification channels.
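To make a few of these rule types concrete, here is a minimal sketch of a static threshold rule, a week‑over‑week comparison rule, and a check that fires only after several consecutive breaches. The class names, the function name, and the reading of "multi‑trigger" as "N consecutive breaches" are assumptions made for illustration; the source does not describe Watcher's rule engine at this level.

```python
from dataclasses import dataclass


@dataclass
class StaticThreshold:
    """Breached when the current value exceeds a fixed limit."""
    limit: float

    def breached(self, current, week_ago=None):
        return current > self.limit


@dataclass
class WeekOverWeek:
    """Breached when the value deviates from the same time last week
    by more than max_change (0.5 means 50%)."""
    max_change: float

    def breached(self, current, week_ago):
        if not week_ago:  # no baseline, or a zero baseline: cannot compare
            return False
        return abs(current - week_ago) / abs(week_ago) > self.max_change


def should_alert(rule, samples, consecutive=3):
    """samples is a list of (current, week_ago) pairs, oldest first.

    Alert only if the most recent `consecutive` samples all breach the rule,
    which suppresses one-off spikes.
    """
    if len(samples) < consecutive:
        return False
    return all(rule.breached(cur, prev) for cur, prev in samples[-consecutive:])


recent = [(90, 100), (160, 100), (170, 100), (180, 100)]
print(should_alert(StaticThreshold(150), recent))   # last three all exceed 150
print(should_alert(WeekOverWeek(0.5), recent))      # last three all deviate by more than 50%
```

Requiring several consecutive breaches trades a short detection delay for fewer false alarms; the right count depends on how noisy the metric is.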

Future focus areas include cost optimization, given that the system handles over 8 million metrics and ingests 1.5 million per minute (about 25,000 per second), and improving personnel efficiency through refined alarm response processes.

The automation and ops stack includes Graphite, Grafana, Collectd, and various infrastructure tools such as LVS, HAProxy, Docker, Mesos, SaltStack, Ansible, and Ceph, selected based on scenario fit and maturity.

To monitor order cancellations, for example, the team counts cancellation events and sets threshold alerts on that count, including alerts that compare the current count against an earlier period.
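The counting step can be sketched as follows: cancellation event timestamps are bucketed into per‑minute counts, and each count is checked against a fixed limit and against the count for the same minute one week earlier. The function names, the limit, and the one‑week baseline are illustrative assumptions, not details from the talk.

```python
from collections import Counter

WEEK_SECONDS = 7 * 24 * 60 * 60


def count_per_minute(event_timestamps):
    """Bucket Unix timestamps (seconds) into counts keyed by the start of each minute."""
    return Counter(int(ts) // 60 * 60 for ts in event_timestamps)


def cancellation_alerts(counts, baseline, limit, max_increase):
    """Return (minute, reason) pairs for minutes that look abnormal.

    counts and baseline map minute-start timestamps to event counts.
    limit is a fixed ceiling; max_increase is the allowed relative growth
    over the same minute one week earlier (1.0 means +100%).
    """
    alerts = []
    for minute, n in sorted(counts.items()):
        if n > limit:
            alerts.append((minute, "threshold"))
        base = baseline.get(minute - WEEK_SECONDS)
        if base and (n - base) / base > max_increase:
            alerts.append((minute, "week_over_week"))
    return alerts


# Three cancellations in the first minute, one in each of the next two.
print(count_per_minute([0, 10, 59, 60, 125]))
```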

In her APMCon talk, Zhang shares Watcher's design, selection rationale, architecture, the challenges the team encountered, and practical lessons for building monitoring systems with open‑source components.

