How We Upgraded Watcher to Second‑Level Monitoring for Real‑Time Order Alerts
This article details the end‑to‑end redesign of Qunar Travel's Watcher monitoring platform from minute‑level to second‑level precision. It covers the architectural changes, the storage engine migration, client‑side metric collection, server‑side scheduling, dashboard and alarm adaptations, and the resulting operational improvements.
Background
Rapid growth in hotel order volume exposed the limitation of Watcher’s minute‑level metrics: fault detection lagged by several minutes. The target was a "1‑minute detection, 5‑minute diagnosis, 10‑minute recovery" workflow, which requires second‑level monitoring to reduce order loss.
Original Watcher Architecture
Watcher consisted of four layers:
Data collection : qmonitor client/server/manager reported metrics to Graphite (Whisper storage).
Data query : Enhanced Graphite API with a metadata DB for cluster‑aware queries.
Dashboard : Grafana‑based UI for visualisation.
Alarm : Icinga‑driven notifications (phone, QTalk, SMS, etc.).
All components operated on minute‑level data points.
Challenges for Second‑Level Monitoring
Storage I/O and space : Whisper’s pre‑allocation and write amplification caused high disk I/O and large storage consumption.
Graphite protocol compatibility : Existing collectors and APIs depended on Graphite; second‑level support had to remain transparent to users.
Full‑stack modifications : Every link in the chain, from collection through storage to alarming, required redesign to handle the increased data frequency.
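For context on the compatibility constraint: Graphite's plaintext protocol carries one "path value timestamp" triple per line, with the timestamp in epoch seconds. The wire format therefore already allows one point per second, and the resolution limit comes from how the points are stored. A minimal sketch of formatting second‑level points in that format (the metric name and helper names are hypothetical, not Watcher's):

```python
def graphite_line(path: str, value: float, epoch_seconds: int) -> str:
    """Format one data point in Graphite's plaintext protocol."""
    return f"{path} {value} {epoch_seconds}\n"


def second_level_batch(path: str, values, start_epoch: int) -> str:
    """Emit one line per second, starting at start_epoch."""
    return "".join(
        graphite_line(path, v, start_epoch + i) for i, v in enumerate(values)
    )


# Three consecutive one-second samples of a hypothetical order counter.
batch = second_level_batch("hotel.order.count", [12, 15, 9], 1700000000)
```

Because the payload shape is unchanged, existing collectors can in principle keep speaking the same protocol while the backend changes underneath.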
Storage Refactoring
Two time‑series databases supporting the Graphite protocol were evaluated:
M3DB : High compression, scalable, Graphite‑compatible, but deployment and maintenance are complex.
VictoriaMetrics (VM) : Single‑node read/write throughput up to 10 million points per second, easy deployment, active community; performance degrades on heavy aggregation queries.
Benchmark on a 32‑CPU, 64 GB RAM, 3.2 TB SSD server:
Write load : 10 million points per minute.
Query load : 2000 QPS.
Average response time : ~100 ms.
Disk usage : ~40 GB per day.
Complex aggregation queries still showed noticeable latency.
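The benchmark figures can be turned into per‑second and per‑point numbers with simple arithmetic. The sketch below assumes that the ~40 GB per day was measured under the same 10 million points per minute write load and that GB means 10^9 bytes; the source states neither explicitly.

```python
POINTS_PER_MINUTE = 10_000_000
DISK_BYTES_PER_DAY = 40 * 10**9  # assumption: decimal gigabytes

# Sustained ingest rate implied by the write load.
points_per_second = POINTS_PER_MINUTE / 60

# Total points written in 24 hours at that rate.
points_per_day = POINTS_PER_MINUTE * 60 * 24

# Approximate on-disk cost of one data point.
bytes_per_point = DISK_BYTES_PER_DAY / points_per_day

print(f"{points_per_second:,.0f} points/s")
print(f"{points_per_day:,} points/day")
print(f"{bytes_per_point:.2f} bytes/point")
```

Under those assumptions the load is roughly 167 thousand points per second and each point costs just under 3 bytes on disk.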
Compute‑Storage Separation
To mitigate query‑performance issues, a split architecture was adopted:
Simple metrics are stored and queried directly in VM.
Complex aggregation queries are delegated to CarbonAPI, an open‑source Graphite‑compatible layer that can be extended for custom processing.
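The split above can be pictured as a small routing decision in front of the two backends. The source does not say how Watcher classifies a query as complex, so the rule used here (any Graphite function call in the target goes to CarbonAPI, a bare metric path goes to VM) is purely an assumption for illustration, as are the function and backend names.

```python
import re

# A Graphite function call looks like name(...); a bare metric path has no call.
_FUNCTION_CALL = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\(")


def choose_backend(target: str) -> str:
    """Return 'carbonapi' for targets containing a function call, else 'vm'."""
    return "carbonapi" if _FUNCTION_CALL.search(target) else "vm"


# A plain metric path is served by the storage engine directly.
simple = choose_backend("hotel.order.count")

# An aggregation across many series is handed to the compute layer.
heavy = choose_backend("sumSeries(hotel.order.*.count)")
```

Keeping the decision in one place means the classification rule can be tightened later without touching dashboards or alarm definitions.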
Client‑Side Metric Collection Optimization
Current Issues
The existing client generated a snapshot every minute, preventing real‑time second‑level data.
Solution
A white‑list mechanism was introduced. A dedicated second‑level metric metadata DB records metric names and query URLs. When a metric appears on the white‑list, the SDK records both minute‑level and second‑level points without code changes.
Refactoring Details
Added a computation layer that performs data sampling and decides whether a metric requires second‑level collection.
Extended the scheduler with a snapshot manager to handle multiple snapshots and push real‑time configuration to clients.
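A minimal sketch of the white‑list idea, assuming a counter‑style metric: every increment feeds the minute‑level bucket, and only white‑listed names also feed a second‑level bucket, so call sites in business code stay unchanged. The class and method names are hypothetical and are not the qmonitor SDK's API.

```python
from collections import defaultdict


class MetricRecorder:
    def __init__(self, second_level_whitelist=()):
        self.whitelist = set(second_level_whitelist)
        self.minute = defaultdict(int)  # (name, minute start) -> count
        self.second = defaultdict(int)  # (name, second) -> count

    def update_whitelist(self, names):
        """Apply a white-list pushed from the server at runtime."""
        self.whitelist = set(names)

    def incr(self, name, epoch_seconds, n=1):
        # Minute-level recording always happens.
        self.minute[(name, epoch_seconds - epoch_seconds % 60)] += n
        # Second-level recording only for white-listed metrics.
        if name in self.whitelist:
            self.second[(name, epoch_seconds)] += n


rec = MetricRecorder({"order.count"})
rec.incr("order.count", 120)
rec.incr("order.count", 121)
rec.incr("search.count", 121)  # not white-listed: minute-level only
```

Since the white‑list is plain data, enabling second‑level collection for another metric is a configuration push rather than a release.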
Server‑Side Metric Collection Optimization
Current Issues
The master‑worker model introduced task dispatch delays of up to 12 seconds, and its Python‑based multi‑process aggregation caused high CPU usage.
Solution
Task scheduling was moved to the worker nodes, which makes the workers stateful. Workers now fetch their tasks directly, with etcd used for coordination and health checks. The system was rewritten in Go, using goroutines for high concurrency.
Post‑Refactor Architecture
Removed the message queue.
Introduced task partitioning via etcd; the master only assigns partitions.
Workers listen to etcd events, fetch assigned tasks, cache them in memory, and execute them concurrently.
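The partitioning scheme can be sketched without etcd itself. In the simulation below a plain dict stands in for the assignment the master would write to etcd, and recomputing it stands in for the events workers would react to. The partition count, the CRC32 hashing, and the round‑robin assignment are all assumptions for illustration; the source does not describe the actual algorithm.

```python
import zlib

NUM_PARTITIONS = 8  # hypothetical partition count


def partition_of(task_id: str) -> int:
    """Map a task to a partition with a hash that is stable across processes."""
    return zlib.crc32(task_id.encode()) % NUM_PARTITIONS


def assign_partitions(workers):
    """Master side: spread partitions over the live workers round-robin."""
    assignment = {w: [] for w in workers}
    for p in range(NUM_PARTITIONS):
        assignment[workers[p % len(workers)]].append(p)
    return assignment


def tasks_for(worker, assignment, task_ids):
    """Worker side: keep only the tasks that fall in partitions this worker owns."""
    owned = set(assignment[worker])
    return [t for t in task_ids if partition_of(t) in owned]


tasks = [f"task-{i}" for i in range(20)]
assignment = assign_partitions(["w1", "w2", "w3"])

# If w3 disappears, the master reassigns partitions and the survivors pick up its tasks.
after_failure = assign_partitions(["w1", "w2"])
```

The master only ever handles partition numbers, so its work stays constant as the task count grows, which matches the division of labour described above.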
Differences from Minute‑Level Monitoring
Second‑level monitoring increases sampling frequency, leading to higher storage and CPU demands. Data is stored in VictoriaMetrics instead of Whisper, delivering roughly 10× faster queries and 8× better compression.
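To illustrate why the higher sampling frequency is costly under Whisper's pre‑allocation, the sketch below estimates the fixed size of a single‑archive Whisper file from the format's record sizes: a 16‑byte metadata header, 12 bytes of header per archive, and 12 bytes per data point. The one‑day retention is a hypothetical setting, not Watcher's actual schema.

```python
METADATA_BYTES = 16      # file-level header
ARCHIVE_INFO_BYTES = 12  # header entry per archive
POINT_BYTES = 12         # 4-byte timestamp + 8-byte double


def whisper_file_bytes(archives):
    """archives: list of (seconds_per_point, retention_seconds) pairs."""
    points = sum(retention // step for step, retention in archives)
    return METADATA_BYTES + ARCHIVE_INFO_BYTES * len(archives) + POINT_BYTES * points


ONE_DAY = 86400
minute_level = whisper_file_bytes([(60, ONE_DAY)])  # one point per minute
second_level = whisper_file_bytes([(1, ONE_DAY)])   # one point per second

print(minute_level, second_level, round(second_level / minute_level, 1))
```

The file is allocated at full size when the metric is created, so each second‑level metric would reserve about 1 MB per retained day whether or not any points arrive.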
Second‑Level Use Cases
Design dashboards that focus on core business metrics (order volume, failure rate, etc.).
Upgrade SDKs to emit second‑level metrics via a white‑list, avoiding code changes.
Add a second‑level data source in Watcher’s dashboard and distinguish panels by source.
Configure finer‑grained alarm rules with separate notification templates (SL1, SL2) aligned to second‑level sampling.
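One way a second‑level alarm rule might differ from a minute‑level one is by requiring several consecutive one‑second points over a threshold before firing, which limits noise from single spikes. The rule shape, the threshold, and the window length below are assumptions for illustration; the source does not specify how Watcher's rules are evaluated.

```python
def should_alert(values, threshold, consecutive):
    """Fire only if the most recent `consecutive` points all exceed the threshold."""
    if len(values) < consecutive:
        return False
    return all(v > threshold for v in values[-consecutive:])


# Hypothetical per-second order failure rates.
sustained = should_alert([0.01, 0.20, 0.30, 0.25], threshold=0.1, consecutive=3)
recovered = should_alert([0.20, 0.30, 0.05], threshold=0.1, consecutive=3)
```

With one point per second, a window of a few points still detects a sustained failure within seconds, where the minute‑level equivalent would need several minutes.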
Operational Impact
After deployment, fault detection time dropped from ~4 minutes to under 1 minute, and alert accuracy improved. The solution has been rolled out across multiple business lines.
Future Plans
Implement real‑time anomaly detection algorithms that automatically identify both single‑point anomalies and sustained abnormal trends, and use them to trigger intelligent alerts.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
