Transforming MySQL Monitoring: From Nagios to Kafka‑Powered Alerts
Qunar’s DBA team overhauled its MySQL monitoring and alerting system, originally built on Nagios and NRPE, by adding a Kafka‑based pipeline, a custom alarm service, and alert templates stored in MySQL. The result is flexible thresholds, granular silencing, highly available alert processing, and a first step toward intelligent handling of alerts, slow queries, and disk space.
Background
Qunar’s database monitoring and alerting platform originally relied on Nagios combined with NRPE (Nagios Remote Plugin Executor). Nagios scheduled checks through the check_nrpe plugin, which called the NRPE daemon on each monitored host to run local scripts and return metrics. Alerts were then delivered by a notification plugin via email or phone.
Problems with the Existing Setup
Alert thresholds and severity levels could not be adjusted flexibly.
Alert silencing lacked granularity and incurred long delays.
All alerts were emitted through a single channel, making it impossible to prioritize critical alerts.
Changing alert configurations required editing multiple scripts and often restarting NRPE, a cumbersome and error‑prone process for DBAs.
Improvement Strategy
Instead of replacing the whole monitoring stack, the team extended the existing architecture by adding a Kafka‑based pipeline and a custom alarm service. The notification plugin was modified to write monitoring data into Kafka, where the alarm program consumes the stream, evaluates thresholds, determines severity, and dispatches alerts.
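As an illustration of what the modified notification plugin might hand to Kafka, the sketch below serializes one check result as JSON. The field names (host, instance, metric, value, ts) and the choice of the host name as the message key are assumptions for this example, not details taken from Qunar's implementation. The actual producer call is left out so the snippet runs without a Kafka client.

```python
import json
import time


def build_metric_message(host, instance, metric, value, ts=None):
    """Serialize one check result into a (key, payload) pair of bytes.

    The field names are hypothetical; the source does not describe the
    real message schema.
    """
    record = {
        "host": host,
        "instance": instance,
        "metric": metric,
        "value": value,
        "ts": int(ts if ts is not None else time.time()),
    }
    # Keying by host means Kafka's default partitioner sends all messages
    # for one host to the same partition, which preserves their order.
    key = host.encode("utf-8")
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return key, payload


def parse_metric_message(payload):
    """Inverse of build_metric_message, as the alarm program would use it."""
    return json.loads(payload.decode("utf-8"))
```

The plugin would pass the returned key and payload to whichever Kafka producer client is in use, and the alarm program would call parse_metric_message on each consumed record.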
Alarm Program Details
Everything upstream of the notification plugin (Nagios scheduling, check_nrpe, and the NRPE scripts) remains unchanged.
The notification plugin now only writes metrics to Kafka.
An alarm program reads from Kafka, applies flexible threshold rules, determines alert levels, and sends notifications.
Alert configurations (thresholds, levels, templates, silencing windows) are stored in a MySQL database, allowing changes via simple table updates.
To avoid a single point of failure, the alarm service can be deployed on multiple hosts (e.g., alarm1, alarm2) for high availability.
Existing NRPE scripts and thresholds can stay unchanged or be unified under the new standard.
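The idea of keeping thresholds and levels in a table can be sketched as follows. This is a minimal example under stated assumptions: the alert_rule table and its columns are hypothetical, and an in-memory SQLite database stands in for MySQL so the snippet runs on its own.

```python
import sqlite3

# Hypothetical schema; the source does not publish the real table layout.
SCHEMA = """
CREATE TABLE alert_rule (
    metric    TEXT NOT NULL,
    level     TEXT NOT NULL,     -- e.g. 'warning' or 'critical'
    priority  INTEGER NOT NULL,  -- higher priority is checked first
    threshold REAL NOT NULL
)
"""


def demo_connection():
    """Build an in-memory SQLite database as a stand-in for MySQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO alert_rule VALUES (?, ?, ?, ?)",
        [
            ("threads_running", "warning", 1, 50),
            ("threads_running", "critical", 2, 100),
        ],
    )
    return conn


def load_rules(conn):
    """Return {metric: [(level, threshold), ...]}, most severe rule first."""
    rules = {}
    query = (
        "SELECT metric, level, threshold FROM alert_rule "
        "ORDER BY priority DESC"
    )
    for metric, level, threshold in conn.execute(query):
        rules.setdefault(metric, []).append((level, threshold))
    return rules


def evaluate(rules, metric, value):
    """Return the most severe level whose threshold is reached, or None."""
    for level, threshold in rules.get(metric, []):
        if value >= threshold:
            return level
    return None
```

With this layout, changing a threshold is a single UPDATE on alert_rule followed by a reload of the rules, with no script edits or NRPE restart.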
Intelligent Features
1. Alert Management
Fine‑grained silencing is supported at host, instance, or metric level, with configurable durations. Silencing occurs only in the alarm stage, preserving Nagios scheduling and metric collection, which ensures complete data for later analysis.
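A silencing check of this kind might look like the sketch below. The Silence record and its fields are assumptions made for illustration. A value of None for instance or metric is treated as "match everything at that level", which gives host-level, instance-level, and metric-level silences from one structure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Silence:
    """Hypothetical silencing rule; field names are illustrative only."""
    host: str
    instance: Optional[str]  # None silences every instance on the host
    metric: Optional[str]    # None silences every metric
    start: int               # epoch seconds
    duration: int            # seconds

    def matches(self, host, instance, metric, now):
        # The silence only applies inside its time window.
        if not (self.start <= now < self.start + self.duration):
            return False
        if self.host != host:
            return False
        if self.instance is not None and self.instance != instance:
            return False
        if self.metric is not None and self.metric != metric:
            return False
        return True


def is_silenced(silences, host, instance, metric, now):
    """True if any active silence covers this host/instance/metric."""
    return any(s.matches(host, instance, metric, now) for s in silences)
```

Because the check runs only when the alarm program is about to send a notification, the metric itself is still collected and stored while the silence is active.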
2. Slow‑Query Management
When a slow‑query metric is detected, the alarm service connects to the relevant instance, retrieves the longest‑running unfinished query, displays its details (user, SQL text), and generates a kill command. This reduces manual SSH work and the risk of accidental operations.
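The selection step could be sketched as below, assuming the alarm service reads rows from information_schema.PROCESSLIST. The helper names and the shape of the returned summary are hypothetical. The sketch only builds the KILL statement as text and never executes it, leaving the decision to the DBA.

```python
# Query the alarm service might run against the instance; excluding
# CONNECTION_ID() keeps the monitoring session itself out of the result.
PROCESSLIST_SQL = (
    "SELECT ID, USER, COMMAND, TIME, INFO "
    "FROM information_schema.PROCESSLIST "
    "WHERE ID <> CONNECTION_ID()"
)


def longest_running_query(rows):
    """Pick the active statement with the largest TIME value.

    rows: dicts keyed by the PROCESSLIST column names used above.
    Idle connections (COMMAND = 'Sleep') and rows without SQL text
    are ignored.
    """
    candidates = [r for r in rows if r["COMMAND"] == "Query" and r["INFO"]]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["TIME"])


def kill_suggestion(row):
    """Summarize the query and build a KILL statement for review."""
    return {
        "user": row["USER"],
        "seconds": row["TIME"],
        "sql": row["INFO"],
        "kill_command": "KILL {};".format(int(row["ID"])),
    }
```

A caller would fetch the rows with PROCESSLIST_SQL through its MySQL client, then include the kill_suggestion output in the alert message.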
3. Disk‑Space Management
The system distinguishes between directory‑level and file‑level space statistics. Directory‑level stats target batch‑cleanable files such as MySQL binlogs, while file‑level stats handle irregular files like error logs. Configurable templates drive automated collection and reporting, enabling quick, low‑cost cleanup actions.
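The two kinds of statistics can be illustrated with a small template-driven collector. The template format (a list of entries with name, kind, and path) is an assumption for this sketch; the source does not show the real template structure.

```python
import os


def directory_usage(path):
    """Total size in bytes of all regular files under a directory."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def file_usage(path):
    """Size in bytes of a single file, or 0 if it does not exist."""
    return os.path.getsize(path) if os.path.isfile(path) else 0


def collect(template):
    """Run every template entry and return {name: bytes_used}.

    Each entry is a dict with hypothetical keys 'name', 'kind'
    ('directory' or 'file'), and 'path'.
    """
    report = {}
    for entry in template:
        if entry["kind"] == "directory":
            report[entry["name"]] = directory_usage(entry["path"])
        elif entry["kind"] == "file":
            report[entry["name"]] = file_usage(entry["path"])
        else:
            raise ValueError("unknown kind: {!r}".format(entry["kind"]))
    return report
```

A directory entry would point at a location such as the binlog directory, while a file entry would point at an individual file such as the error log.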
Future Directions
Statistical analysis of historical alerts to identify frequent issue types and instance‑level distributions.
Multi‑metric correlation to reduce false positives and detect complex degradation patterns.
Detection of abrupt metric spikes (e.g., QPS/TPS, connection count) for early anomaly identification.
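Spike detection is listed only as a future direction, so the source gives no method. One simple possibility, shown purely as an assumption, is to flag a sample that deviates from the recent mean by more than a chosen number of standard deviations. The function name, window handling, and default parameters are hypothetical.

```python
from statistics import mean, pstdev


def is_spike(history, current, min_points=5, sigmas=3.0):
    """Flag `current` if it is far from the mean of recent samples.

    history: recent values of one metric (e.g. QPS or connection count).
    Returns False while there is too little history to judge.
    """
    if len(history) < min_points:
        return False
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A perfectly flat history: any change at all counts as a deviation.
        return current != mu
    return abs(current - mu) > sigmas * sigma
```

A fixed sigma rule like this is sensitive to seasonality and noisy metrics, so a production version would likely need a longer baseline or a more robust statistic.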
dbaplus Community
