Redesigning Database Monitoring: From Push to Pull for Smarter Alerts
This article analyzes the shortcomings of the legacy database monitoring system, explains the transition from a push‑based to a pull‑based architecture, outlines comprehensive metric collection, intelligent alert strategies, and self‑healing mechanisms, and showcases the performance improvements achieved with the new solution.
Introduction
The existing online alerting system for databases has been stable overall, but it suffers from missed alerts, false alarms, and fragmented metric collection across servers. Addressing these issues requires a systematic review of the current risks and targeted solutions for each.
Glossary
collectd – a daemon that periodically gathers system and application metrics.
Nagios – an open-source monitoring system for IT infrastructure.
NRPE – a Nagios plugin that allows remote command execution.
Prometheus – a cloud-native monitoring system that stores metrics as time-series data.
exporter – a Prometheus component that runs on target hosts and exposes metrics.
Current System Bottlenecks
3.1 Current State
Database monitoring is essential; however, most Qunar databases still run on physical machines with multiple instances per host, using a collectd-based data collection pipeline and a Nagios-based alerting pipeline.
Architecture of the metric collection system:
Each database server runs a collectd agent that pushes data via UDP to a collectd server, which stores the data for web queries.
Architecture of the alarm system:
Each server also runs an NRPE service; Nagios polls NRPE, which executes check scripts and returns the results. Alerts are sent to an alert service that enriches them with metadata before notifying the owners.
While these systems meet most needs, they cannot satisfy evolving requirements and expose several pain points.
3.2 Pain Points
Collectd’s push mode uses UDP, leading to occasional data loss and difficult root‑cause tracing.
Limited metric set – only basic DB and server metrics are collected.
Missing metadata in the storage layer prevents high‑level aggregation.
Alert granularity is coarse (minute‑level), causing missed rapid anomalies.
Fixed collection intervals cannot capture transient spikes.
Agent restarts are required for every new metric, increasing operational cost.
Technical Choice: From Push to Pull
The push model pushes data from agents to a server, while the pull model lets the server request data from agents. Advantages and disadvantages are compared:
Push Advantages: simple agent deployment, stateless server, lower management cost.
Push Disadvantages: uniform collection intervals, UDP data loss, no tagging support.
Pull Advantages: fine‑grained interval control, server can add tags for high‑level aggregation, fewer network hops reduce data loss.
Pull Disadvantages: server becomes stateful, scaling and HA are more complex, agents need dedicated ports.
Given the analysis, the pull model is selected for its flexibility and reliability.
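As a minimal sketch of the pull model, each agent can expose its metrics over HTTP and let the central server scrape them on its own schedule. The `collect_metrics()` contents, the `/metrics` path, and the port in the comment are illustrative assumptions, not details of the production system:

```python
# Pull-model agent sketch: the server requests data instead of the agent
# pushing it, so the scrape interval is controlled centrally and TCP
# replaces lossy UDP. collect_metrics() is a hypothetical placeholder.
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect_metrics():
    # A real agent would query MySQL, Redis, /proc, etc.
    return {"mysql_threads_running": 12, "replication_lag_seconds": 0}


def render_metrics(metrics):
    # Prometheus-style text exposition: one "name value" pair per line.
    return "".join(f"{name} {value}\n" for name, value in metrics.items())


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect_metrics()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)


# A real agent would block on a dedicated port, e.g.:
# HTTPServer(("0.0.0.0", 9104), MetricsHandler).serve_forever()
```

Because the server initiates each scrape, it can attach tags (host, cluster, owner) at ingestion time, which is what enables the high-level aggregation discussed below.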
New System Goals
Comprehensive metric collection without gaps.
Configurable collection frequencies to avoid data loss.
Tagging of metrics for high‑level aggregation.
Reasonable storage cost.
Timely fault detection, following the "1-5-10" response model (detect within 1 minute, locate within 5, recover within 10).
Reduce false positives and missed alerts.
Full‑Scale Collection: Grouped and Tiered
Metrics are categorized by type:
stat – instantaneous values (e.g., thread count, replication lag) – stored as Prometheus Gauge.
util – derived rates (e.g., CPU usage) – stored as Counter, Histogram, or Summary.
info – static information (e.g., DB version) – stored as latest snapshot.
Collection scope includes MySQL, Redis, and OS metrics, both internal (process-level) and external (resource-level).
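The three categories behave differently in storage, which a toy sketch in plain Python can illustrate (the production system stores these as Prometheus types; these class names are illustrative only):

```python
# Toy versions of the three metric categories:
#   stat -> keep the latest value (Gauge-like)
#   util -> derive a rate from cumulative samples (Counter-like)
#   info -> keep a static snapshot
class StatMetric:
    """Instantaneous value, e.g. thread count or replication lag."""
    def __init__(self):
        self.value = None

    def set(self, value):
        self.value = value


class UtilMetric:
    """Rate derived from two cumulative samples, e.g. CPU usage."""
    def __init__(self):
        self.last_total = None
        self.last_time = None

    def observe(self, total, timestamp):
        rate = None
        if self.last_total is not None:
            rate = (total - self.last_total) / (timestamp - self.last_time)
        self.last_total, self.last_time = total, timestamp
        return rate


class InfoMetric:
    """Static information, e.g. the database version."""
    def __init__(self):
        self.snapshot = {}

    def update(self, **fields):
        self.snapshot.update(fields)
```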
Intelligent Alert Strategy
An alert is defined by two parameters, alert_threshold and trigger_count: it fires only when the metric exceeds the threshold for the configured number of consecutive checks. Static thresholds are insufficient because workloads vary over the day, so time-based thresholds are introduced (e.g., a higher threshold during scheduled batch jobs).
Threshold table:

| Time Slot   | Threshold | Note                          |
| ----------- | --------- | ----------------------------- |
| 8:00-12:30  | 5         | Higher due to scheduled tasks |
| Other times | 2         | Normal operation              |
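A time-based threshold reduces to a simple lookup; a minimal sketch, assuming the table above is the only slot configured (the `SLOTS` structure is illustrative):

```python
from datetime import time

# Hypothetical encoding of the threshold table: (start, end, threshold).
SLOTS = [(time(8, 0), time(12, 30), 5)]
DEFAULT_THRESHOLD = 2


def threshold_for(now: time) -> int:
    """Return the alert threshold in effect at the given time of day."""
    for start, end, threshold in SLOTS:
        if start <= now <= end:
            return threshold
    return DEFAULT_THRESHOLD
```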
Dynamic detection intervals are used: when a spike is observed, the next check interval is shortened, reducing alert latency.
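One plausible way to implement such an adaptive schedule, as a sketch (the interval constants and halving policy are assumptions, not the system's actual parameters):

```python
BASE_INTERVAL = 60  # seconds between checks under normal conditions
MIN_INTERVAL = 5    # floor so checks cannot run unboundedly often


def next_interval(value, threshold, current_interval):
    """Halve the next check interval when a spike is observed,
    and fall back to the base interval once the metric looks normal."""
    if value > threshold:
        return max(MIN_INTERVAL, current_interval // 2)
    return BASE_INTERVAL
```

Shortening the interval only after a spike keeps the steady-state load low while cutting the latency between an anomaly appearing and the alert confirming it.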
Detection logic examples:
For stat metrics, the sampled value is compared against the threshold directly:

```python
stat = collect()
if stat > threshold:
    alert()
```

For util metrics, a rate is derived from two samples:

```python
last_time = now()
last = collect()
...
current_time = now()
current = collect()
if (current - last) / (current_time - last_time) > threshold:
    alert()
```

Self-Healing
Beyond alerting, the system attempts automatic remediation. When an alert is raised, a heal module drills down into metric details, executes predefined actions, and logs results. Complex actions still require human confirmation.
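The dispatch from alert to remediation can be sketched as a registry of heal actions keyed by alert name; the action names, the decorator, and the alert shape here are all illustrative assumptions:

```python
# Hypothetical alert -> heal dispatch. Alerts with a registered action are
# remediated automatically; anything else is escalated to a human.
HEAL_ACTIONS = {}


def heal_action(alert_name):
    """Register a remediation function for a given alert name."""
    def register(fn):
        HEAL_ACTIONS[alert_name] = fn
        return fn
    return register


@heal_action("disk_usage_high")
def purge_old_binlogs(alert):
    # A real action would drill into metric details before acting.
    return f"purged binlogs on {alert['host']}"


def handle_alert(alert):
    action = HEAL_ACTIONS.get(alert["name"])
    if action is None:
        return ("escalate", "no automatic remediation; paging owner")
    return ("healed", action(alert))
```

Keeping the registry explicit makes it easy to audit which alerts self-heal and to route everything unregistered to human confirmation.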
Results: Sample Dashboards
Various dashboards demonstrate the new system’s capabilities, such as:
MySQL QPS and latency per SQL.
Process‑level CPU, read/write I/O.
Server disk capacity planning.
Host CPU and network traffic monitoring.
Department‑level Redis usage with custom tags.
Conclusion and Outlook
The new collection system provides richer, tagged data that supports higher‑level analysis, reporting, and proactive operations. Optimized detection pipelines and plugin‑based monitoring enable rapid issue identification, meeting the “1‑5‑10” incident response goal while laying groundwork for future enhancements.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
