Design and Implementation of a Second-Level Monitoring System for Qunar Travel
This article details the background, overall architecture, challenges, and step‑by‑step redesign of Qunar Travel's Watcher monitoring platform to achieve second‑level (per‑second) data collection, storage, and alerting, including storage engine selection, client and server optimizations, deployment strategies, and operational outcomes.
The rapid growth of hotel order volume at Qunar Travel highlighted the limitations of the existing minute‑level Watcher monitoring system, prompting a shift to second‑level precision to detect faults faster and reduce order losses.
The original architecture comprised five components: data collection via the custom qmonitor, storage in Graphite, API querying, Grafana‑based dashboards, and alerting through Icinga. Each operated at minute granularity.
Three key challenges were identified: excessive disk I/O and space consumption caused by pre‑allocation in Whisper (Graphite's storage format), the need to remain compatible with the Graphite protocol, and the need for a full‑stack overhaul to support per‑second metrics.
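The storage pressure from pre‑allocation can be estimated with simple arithmetic. The sketch below assumes a single Whisper archive per metric and a 30‑day retention period (the retention figure is an assumption for illustration, not a value from the original system); it relies on Whisper's 12‑byte data point, which is a 4‑byte timestamp plus an 8‑byte value.

```go
package main

import "fmt"

// Whisper stores every data point as a 4-byte timestamp plus an 8-byte
// float64 value, and allocates all slots of an archive when the file is created.
const bytesPerPoint = 12

// archiveBytes returns the pre-allocated size of one Whisper archive
// (file header excluded) for a resolution and retention given in seconds.
func archiveBytes(resolutionSec, retentionSec int64) int64 {
	return retentionSec / resolutionSec * bytesPerPoint
}

func main() {
	const retention = 30 * 24 * 3600 // assumed 30-day retention, in seconds

	perMinute := archiveBytes(60, retention)
	perSecond := archiveBytes(1, retention)

	fmt.Printf("minute-level: %d bytes per metric\n", perMinute)
	fmt.Printf("second-level: %d bytes per metric\n", perSecond)
	fmt.Printf("growth factor: %dx\n", perSecond/perMinute)
}
```

Under these assumptions each metric grows from roughly 0.5 MB to roughly 31 MB on disk, a 60‑fold increase that is paid at file creation whether or not the metric ever receives data.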
To address storage concerns, a comparative evaluation of M3DB and VictoriaMetrics was performed; VictoriaMetrics was selected for its high compression, scalability, and native Graphite support, despite performance degradation in complex aggregation scenarios.
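Graphite compatibility matters because existing senders can keep emitting the Graphite plaintext format, one "path value timestamp" line per sample. The following is a minimal sketch of producing per‑second samples in that format; the metric name is hypothetical, and in a real deployment the lines would be written to the storage system's Graphite listener (in VictoriaMetrics this is enabled with the -graphiteListenAddr flag) rather than printed.

```go
package main

import (
	"fmt"
	"strconv"
)

// graphiteLine formats one sample in the Graphite plaintext protocol:
// "<metric.path> <value> <unix-timestamp-seconds>\n".
func graphiteLine(path string, value float64, unixSec int64) string {
	return path + " " +
		strconv.FormatFloat(value, 'f', -1, 64) + " " +
		strconv.FormatInt(unixSec, 10) + "\n"
}

func main() {
	base := int64(1700000000) // arbitrary starting timestamp for the example
	// Emit three consecutive per-second samples for a hypothetical metric.
	for i := int64(0); i < 3; i++ {
		fmt.Print(graphiteLine("hotel.order.create.count", float64(10+i), base+i))
	}
}
```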
Client‑side improvements introduced a dual‑layer approach. A new calculation layer handles data sampling and determines which metrics require second‑level collection, while a snapshot manager generates multiple snapshots of the same data. Together they allow core order‑related metrics to be captured every second while all other metrics remain at minute‑level granularity.
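The split between per‑second and per‑minute snapshots can be illustrated with a small counter registry. This is a sketch under stated assumptions, not the actual qmonitor code: the Registry type, its methods, and the metric names are all hypothetical. Each counter keeps one cumulative total, and each snapshot level remembers the total it last saw, so the two levels can be read independently.

```go
package main

import (
	"fmt"
	"sync"
)

// Registry is a hypothetical stand-in for the calculation layer and
// snapshot manager described in the text.
type Registry struct {
	mu          sync.Mutex
	totals      map[string]int64 // cumulative count per metric
	lastSecond  map[string]int64 // totals seen at the previous per-second snapshot
	lastMinute  map[string]int64 // totals seen at the previous per-minute snapshot
	secondLevel map[string]bool  // metrics selected for per-second collection
}

// NewRegistry creates a registry; the named metrics are collected per second.
func NewRegistry(secondLevel ...string) *Registry {
	r := &Registry{
		totals:      map[string]int64{},
		lastSecond:  map[string]int64{},
		lastMinute:  map[string]int64{},
		secondLevel: map[string]bool{},
	}
	for _, name := range secondLevel {
		r.secondLevel[name] = true
	}
	return r
}

// Inc records one event for the named metric.
func (r *Registry) Inc(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.totals[name]++
}

// SnapshotSecond returns deltas since the last per-second snapshot,
// for the selected second-level metrics only.
func (r *Registry) SnapshotSecond() map[string]int64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := map[string]int64{}
	for name := range r.secondLevel {
		out[name] = r.totals[name] - r.lastSecond[name]
		r.lastSecond[name] = r.totals[name]
	}
	return out
}

// SnapshotMinute returns deltas since the last per-minute snapshot for every metric.
func (r *Registry) SnapshotMinute() map[string]int64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := map[string]int64{}
	for name, total := range r.totals {
		out[name] = total - r.lastMinute[name]
		r.lastMinute[name] = total
	}
	return out
}

func main() {
	r := NewRegistry("order.create")
	r.Inc("order.create")
	r.Inc("page.view")
	fmt.Println("second:", r.SnapshotSecond())
	fmt.Println("minute:", r.SnapshotMinute())
}
```

Keeping a separate "last seen" map per level means taking a per‑second snapshot never disturbs the minute‑level delta, which is the property the multi‑snapshot design needs.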
Server‑side redesign moved scheduling responsibilities to the worker nodes, adopted Go and its goroutines for high concurrency, and used etcd for task partitioning. The result is a scalable master‑worker model capable of both minute‑level and second‑level data collection.
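The idea of partitioned, concurrent collection can be sketched as follows. This example is illustrative only: it substitutes a simple hash‑modulo assignment for the etcd‑based partitioning the article describes, assumes the worker list has already been obtained from a coordination service, and uses hypothetical worker and target names with a fake pull function.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"sync"
)

// ownerOf maps a collection target to one worker by hashing the target name.
// The workers slice must be non-empty.
func ownerOf(target string, workers []string) string {
	h := fnv.New32a()
	h.Write([]byte(target))
	return workers[int(h.Sum32()%uint32(len(workers)))]
}

// partition groups targets by owning worker. The worker list is sorted first
// so that every node computes the same assignment regardless of input order.
func partition(targets, workers []string) map[string][]string {
	sorted := append([]string(nil), workers...)
	sort.Strings(sorted)
	out := make(map[string][]string)
	for _, t := range targets {
		w := ownerOf(t, sorted)
		out[w] = append(out[w], t)
	}
	return out
}

// collect pulls all targets of one worker concurrently, one goroutine per target.
func collect(targets []string, pull func(string) int) map[string]int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	out := make(map[string]int, len(targets))
	for _, t := range targets {
		wg.Add(1)
		go func(t string) {
			defer wg.Done()
			v := pull(t)
			mu.Lock()
			out[t] = v
			mu.Unlock()
		}(t)
	}
	wg.Wait()
	return out
}

func main() {
	workers := []string{"worker-b", "worker-a"}
	targets := []string{"app1:8080", "app2:8080", "app3:8080", "app4:8080"}
	for w, ts := range partition(targets, workers) {
		// The pull function is a placeholder; a real worker would fetch metrics.
		res := collect(ts, func(t string) int { return len(t) })
		fmt.Println(w, res)
	}
}
```

A per‑second worker would run collect from a one‑second ticker; hash‑modulo assignment reshuffles many targets when the worker set changes, which is one reason a production system would prefer a coordinated scheme.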
Operational enhancements included a whitelist mechanism for second‑level metric activation, addition of a second‑level data source in dashboards, and refined alerting templates (SL1, SL2) synchronized with per‑second sampling to improve detection speed and reduce false alarms.
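One common way per‑second alerting limits false alarms is to require several consecutive breaching samples before firing. The sketch below shows that pattern only; the Rule type, its fields, and the threshold values are hypothetical and are not the definitions of the SL1 or SL2 templates, which the article does not spell out.

```go
package main

import "fmt"

// Rule is a hypothetical per-second alert rule: it fires once the value has
// exceeded Threshold for Consecutive samples in a row.
type Rule struct {
	Threshold   float64
	Consecutive int
	streak      int // current run of breaching samples
}

// Observe feeds one per-second sample and reports whether the rule fires on
// this sample. It fires once per streak, at the moment the streak reaches
// Consecutive, and can fire again only after a non-breaching sample resets it.
func (r *Rule) Observe(v float64) bool {
	if v > r.Threshold {
		r.streak++
	} else {
		r.streak = 0
	}
	return r.streak == r.Consecutive
}

func main() {
	rule := &Rule{Threshold: 100, Consecutive: 3}
	samples := []float64{120, 90, 110, 130, 150, 160} // made-up per-second values
	for sec, v := range samples {
		if rule.Observe(v) {
			fmt.Printf("alert fired at second %d\n", sec)
		}
	}
}
```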
Post‑deployment results showed fault detection time reduced from four minutes to under one minute, increased alert accuracy, and broader adoption across multiple business lines, with plans to incorporate real‑time anomaly detection algorithms for proactive issue identification.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.