Improving Qunar.com Database Monitoring and Alert System with a Kafka‑Based Alarm Program
The article describes how Qunar.com upgraded its Nagios/NRPE‑based database monitoring by inserting a Kafka‑driven alarm component, centralizing alert configuration in MySQL, adding flexible shielding and multi‑channel notifications, and exploring intelligent features such as slow‑query and disk‑space management.
Background
Qunar.com originally used Nagios together with the NRPE plugin to monitor MySQL instances, collecting metrics such as CPU load and disk usage. The architecture relied on check_nrpe calls from Nagios to remote hosts, with alerts sent via notification plugins to email or phone.
The existing system suffered from inflexible alert thresholds, rigid severity levels, limited shielding options, delayed mute periods, and a single notification channel that made it hard to prioritize critical alerts.
Improvement Roadmap
Rather than replacing the whole stack, the team added a Kafka‑based pipeline. The notification plugin now writes monitoring data to a Kafka topic; a new alarm program consumes this data, evaluates alert rules stored in MySQL, and dispatches alerts through appropriate channels.
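As a rough illustration of the hand-off described above, the sketch below shows how a notification plugin might package one check result before writing it to Kafka. The field names, the topic name, and the `send(topic, value)` callable are assumptions made for this example; the article does not specify the actual message schema or client library.

```python
import json
import time
from typing import Callable, Optional


def build_monitor_message(host: str, template: str, metric: str, value: float,
                          timestamp: Optional[float] = None) -> bytes:
    """Serialize one check result as JSON (field names are hypothetical)."""
    payload = {
        "host": host,
        "template": template,
        "metric": metric,
        "value": value,
        # Use the caller's timestamp if given, otherwise the current time.
        "timestamp": time.time() if timestamp is None else timestamp,
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def publish(send: Callable[[str, bytes], None], topic: str, message: bytes) -> None:
    """Hand the message to any Kafka producer wrapped as send(topic, value)."""
    send(topic, message)


if __name__ == "__main__":
    # Stand-in for a real producer: collect messages in a list.
    outbox = []
    msg = build_monitor_message("db-host-01", "mysql_disk", "disk_used_pct", 91.5)
    publish(lambda topic, value: outbox.append((topic, value)), "db-monitor", msg)
    print(outbox[0][0], json.loads(outbox[0][1])["metric"])
```

Passing the producer in as a callable keeps the sketch independent of any particular Kafka client; in practice `send` would be a thin wrapper around whichever producer the plugin uses.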
Alarm Program Details
The alarm service performs the following steps:
Consume monitoring messages from Kafka (including metric values, timestamps, host name, and template name).
Query MySQL for the corresponding alert template configuration.
Apply comparison methods, regexes, and thresholds to determine the alert level.
Check mute periods and shielding rules (host‑level, instance‑level, or metric‑level).
Group alerts by severity and send them via QTalk, phone calls, or other channels.
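The rule-evaluation steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the alarm program's actual code: the `Template` fields, the `"*"` wildcard convention for shields, and the level names are all hypothetical, and the template is built in memory where the real service would load it from MySQL.

```python
import operator
import re
from dataclasses import dataclass, field
from datetime import datetime, time as dtime
from typing import List, Optional, Set, Tuple

COMPARATORS = {">": operator.gt, ">=": operator.ge,
               "<": operator.lt, "<=": operator.le}


@dataclass
class Template:
    """Hypothetical alert template; the real one would be a MySQL row."""
    name: str
    enabled: bool = True
    compare: str = ">="
    # (level, threshold) pairs, listed most severe first.
    levels: List[Tuple[str, float]] = field(default_factory=list)
    pattern: Optional[str] = None      # optional regex on the check's text output
    pattern_level: str = "critical"
    mute_start: Optional[dtime] = None
    mute_end: Optional[dtime] = None


def alert_level(tpl: Template, value: float, text: str = "") -> Optional[str]:
    """Return the first matching level, or None if nothing fires."""
    if not tpl.enabled:
        return None
    if tpl.pattern and re.search(tpl.pattern, text):
        return tpl.pattern_level
    compare = COMPARATORS[tpl.compare]
    for level, threshold in tpl.levels:
        if compare(value, threshold):
            return level
    return None


def is_muted(tpl: Template, now: datetime) -> bool:
    """True if `now` falls inside the template's daily mute window."""
    if tpl.mute_start is None or tpl.mute_end is None:
        return False
    t = now.time()
    if tpl.mute_start <= tpl.mute_end:
        return tpl.mute_start <= t < tpl.mute_end
    return t >= tpl.mute_start or t < tpl.mute_end  # window wraps past midnight


def is_shielded(shields: Set[Tuple[str, str, str]],
                host: str, instance: str, metric: str) -> bool:
    """Shields are (host, instance, metric) tuples; "*" matches anything."""
    key = (host, instance, metric)
    return any(all(s in ("*", k) for s, k in zip(shield, key))
               for shield in shields)


def evaluate(tpl: Template, shields: Set[Tuple[str, str, str]], host: str,
             instance: str, metric: str, value: float,
             now: datetime) -> Optional[str]:
    """Combine level calculation with mute and shield checks."""
    level = alert_level(tpl, value)
    if level is None or is_muted(tpl, now) or is_shielded(shields, host, instance, metric):
        return None
    return level
```

In this sketch a host-level shield is written as `("db1", "*", "*")`, an instance-level shield as `("db1", "3306", "*")`, and a metric-level shield names all three parts. Grouping by severity and dispatching to QTalk or phone would then operate on the level that `evaluate` returns.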
Alert templates are stored centrally in MySQL, allowing DBAs to modify thresholds, enable or disable templates, and define shielding windows without touching scripts. High availability is achieved by deploying multiple alarm instances.
Intelligent Exploration
The upgraded system now supports:
Alert Management: flexible shielding, granular control, and statistical analysis without losing raw monitoring data.
Slow‑Query Management: detection of long‑running queries, automatic retrieval of query details, and optional kill actions presented to DBAs.
Disk‑Space Management: template‑driven classification of directories (e.g., binlog vs. log files) and automated cleanup based on usage thresholds.
These features lay the foundation for future work such as historical alert analytics, multi‑metric correlation, and anomaly detection on metric spikes.
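To make the disk-space item more concrete, here is one possible shape for template-driven classification and threshold-based cleanup planning. The rule format, the glob patterns, and the trigger and target percentages are assumptions for illustration only; the sketch just returns a list of candidate paths and deletes nothing, and the article does not describe how the real cleanup selects files.

```python
import fnmatch
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileInfo:
    path: str
    size: int      # bytes
    mtime: float   # modification time, seconds since the epoch


@dataclass
class Rule:
    """Hypothetical template row: which paths belong to which category."""
    pattern: str     # glob matched against the full path
    category: str    # e.g. "binlog" or "log"
    cleanable: bool  # whether automated cleanup may touch this category


def classify(path: str, rules: List[Rule]) -> Optional[Rule]:
    """Return the first rule whose pattern matches the path, if any."""
    for rule in rules:
        if fnmatch.fnmatch(path, rule.pattern):
            return rule
    return None


def is_cleanable(info: FileInfo, rules: List[Rule]) -> bool:
    rule = classify(info.path, rules)
    return rule is not None and rule.cleanable


def plan_cleanup(files: List[FileInfo], rules: List[Rule], used_bytes: int,
                 total_bytes: int, trigger_pct: float,
                 target_pct: float) -> List[str]:
    """Pick the oldest cleanable files until usage would drop to target_pct."""
    if used_bytes / total_bytes * 100 < trigger_pct:
        return []  # below the usage threshold: nothing to do
    target_bytes = total_bytes * target_pct / 100
    candidates = sorted((f for f in files if is_cleanable(f, rules)),
                        key=lambda f: f.mtime)
    plan = []
    for info in candidates:
        if used_bytes <= target_bytes:
            break
        plan.append(info.path)
        used_bytes -= info.size
    return plan
```

Returning a plan and leaving deletion to a separate step makes it possible to show the proposed actions to a DBA first, in the same spirit as the optional kill actions for slow queries.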
Future Directions
Statistical analysis of alert frequencies and distribution across instances.
Joint analysis of multiple metrics to reduce false positives.
Detection of sudden metric spikes (e.g., QPS/TPS, connection count) for early anomaly warning.
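The article does not say which technique would be used for spike detection. As one simple possibility, the sketch below flags a sample when it deviates from a rolling window of recent values by more than a chosen number of standard deviations (a plain z-score test). The window size, threshold, and class name are illustrative assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class SpikeDetector:
    """Rolling z-score check over recent samples of one metric (e.g. QPS)."""

    def __init__(self, window: int = 30, z_threshold: float = 3.0,
                 min_samples: int = 5):
        self.history = deque(maxlen=window)
        self.z_threshold = z_threshold
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        """Record a sample and report whether it looks like a spike."""
        is_spike = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            if sigma > 0:
                is_spike = abs(value - mu) / sigma > self.z_threshold
            else:
                # A perfectly flat history: treat any change as a spike.
                is_spike = value != mu
        self.history.append(value)
        return is_spike


if __name__ == "__main__":
    detector = SpikeDetector()
    for qps in [100, 101, 99, 100, 102, 98, 100, 500]:
        print(qps, detector.observe(qps))
```

Because the spike itself is appended to the history, one extreme sample widens the window's spread for a while; a production version would likely need to handle that, for example by excluding flagged samples from the baseline.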
The article concludes with a call for further development and a recruitment notice.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.