Transforming MySQL Monitoring: From Nagios to Kafka‑Powered Alerts

Qunar’s DBA team rebuilt its MySQL monitoring and alerting system, originally based on Nagios and NRPE, by adding a Kafka‑based pipeline, a custom alarm service, and alert templates stored in MySQL. The result is flexible thresholds, fine‑grained silencing, highly available alert processing, and a first set of intelligent features for managing alerts, slow queries, and disk space.

dbaplus Community

Background

Qunar’s database monitoring and alerting platform originally relied on Nagios combined with the NRPE plugin. Nagios scheduled checks via check_nrpe, which called the NRPE daemon on monitored hosts to run scripts and return metrics. Alerts were sent through a notification plugin to email or phone.
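
As an illustration of what such a check script looks like, here is a minimal sketch that follows the standard Nagios plugin convention (exit code 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN, with performance data after a `|`). The metric name `threads_connected` and the threshold values are assumptions chosen for the example, not taken from Qunar's scripts.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_threads_connected(value, warn, crit):
    """Return (exit_code, status_line) in Nagios plugin format.

    Assumes a metric where a higher value is worse.
    """
    if value >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif value >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    # Text before '|' is for humans; text after it is performance data
    # in the form name=value;warn;crit.
    line = (
        f"{label} - threads_connected={value} | "
        f"threads_connected={value};{warn};{crit}"
    )
    return code, line
```

A real script would print the status line and exit with the returned code. Because the thresholds live in the script or in its NRPE command definition, changing them means touching files on the monitored hosts, which is the friction described in the next section.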

Problems with the Existing Setup

Alert thresholds and severity levels could not be adjusted flexibly.

Alert silencing lacked granularity and incurred long delays.

All alerts were emitted through a single channel, making it impossible to prioritize critical alerts.

Changing alert configurations required editing multiple scripts and often restarting NRPE, a cumbersome and error‑prone process for DBAs.

Improvement Strategy

Instead of replacing the whole monitoring stack, the team extended the existing architecture by adding a Kafka‑based pipeline and a custom alarm service. The notification plugin was modified to write monitoring data into Kafka, where the alarm program consumes the stream, evaluates thresholds, determines severity, and dispatches alerts.
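
The producer side of this pipeline might look like the sketch below. The message fields, the topic name, and the `host:instance` key are assumptions made for illustration; the article does not describe the actual message format. The `producer` argument can be any object with a `send(topic, key=..., value=...)` method, which is the shape of `KafkaProducer.send` in the kafka-python library.

```python
import json
import time


def build_metric_message(host, instance, metric, value, ts=None):
    """Serialize one check result as a JSON-encoded message value."""
    payload = {
        "host": host,
        "instance": instance,
        "metric": metric,
        "value": value,
        # Use the caller's timestamp if given, else the current time.
        "ts": int(time.time()) if ts is None else ts,
    }
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def publish_metric(producer, topic, host, instance, metric, value, ts=None):
    """Send one metric; no threshold logic lives on this side."""
    # Hypothetical keying choice: messages with the same key go to the
    # same partition, so readings for one instance stay in order.
    key = f"{host}:{instance}".encode("utf-8")
    message = build_metric_message(host, instance, metric, value, ts)
    producer.send(topic, key=key, value=message)
```

Keeping the plugin this thin means every decision about thresholds and severity moves to the consumer, where it can change without touching the monitored hosts.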

Figure: Improved monitoring architecture

Alarm Program Details

Everything upstream of the notification plugin (Nagios scheduling, check_nrpe, and the NRPE scripts) remains unchanged.

The notification plugin now only writes metrics to Kafka.

An alarm program reads from Kafka, applies flexible threshold rules, determines alert levels, and sends notifications.

Alert configurations (thresholds, levels, templates, silencing windows) are stored in a MySQL database, allowing changes via simple table updates.

To avoid a single point of failure, the alarm service can be deployed on multiple hosts (e.g., alarm1, alarm2) for high availability.

Existing NRPE scripts and thresholds can stay unchanged or be unified under the new standard.
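
The threshold evaluation described above can be sketched as follows. The rule fields, the instance-specific override, and the severity names are assumptions for illustration; in the real system the rules would be loaded from the MySQL configuration tables, whose schema the article does not give. The sketch also assumes metrics where a higher value is worse.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ThresholdRule:
    """One row of a hypothetical threshold table."""
    metric: str
    warning: float
    critical: float
    instance: Optional[str] = None  # None: applies to every instance


def pick_rule(rules, metric, instance):
    """Prefer an instance-specific rule over the metric-wide default."""
    default = None
    for rule in rules:
        if rule.metric != metric:
            continue
        if rule.instance == instance:
            return rule
        if rule.instance is None:
            default = rule
    return default


def evaluate(rules, metric, instance, value):
    """Return 'critical', 'warning', 'ok', or None when no rule exists."""
    rule = pick_rule(rules, metric, instance)
    if rule is None:
        return None
    if value >= rule.critical:
        return "critical"
    if value >= rule.warning:
        return "warning"
    return "ok"
```

With rules held as data, raising the threshold for one busy instance is a single row change rather than a script edit and an NRPE restart.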

Intelligent Features

1. Alert Management

Fine‑grained silencing is supported at the host, instance, or metric level, with configurable durations. Silencing is applied only at the alarm stage, so Nagios keeps scheduling checks and collecting metrics, and the data stays complete for later analysis.
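
A minimal sketch of how such silencing could be matched, assuming each silence is a record with a time window and optional host, instance, and metric fields where an unset field acts as a wildcard. The field names are hypothetical; the article does not describe the actual silencing table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Silence:
    start: int                      # Unix seconds, inclusive
    end: int                        # Unix seconds, exclusive
    host: Optional[str] = None      # None acts as a wildcard
    instance: Optional[str] = None
    metric: Optional[str] = None

    def matches(self, host, instance, metric, now):
        """True if this silence is active and covers the given alert."""
        if not (self.start <= now < self.end):
            return False
        return (
            self.host in (None, host)
            and self.instance in (None, instance)
            and self.metric in (None, metric)
        )


def is_silenced(silences, host, instance, metric, now):
    """True when any active silence covers this alert."""
    return any(s.matches(host, instance, metric, now) for s in silences)
```

The alarm program would call `is_silenced` only after the metric has already been consumed and stored, so a silenced alert suppresses the notification without dropping the data point.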

2. Slow‑Query Management

When a slow‑query metric is detected, the alarm service connects to the relevant instance, retrieves the longest‑running unfinished query, displays its details (user, SQL text), and generates a kill command. This reduces manual SSH work and the risk of accidental operations.
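
The selection step can be sketched as below, assuming the alarm service has already fetched rows from `information_schema.PROCESSLIST` as dictionaries keyed by column name. The helper names are hypothetical, and the sketch only builds the `KILL` statement as text, leaving the decision to run it with the DBA.

```python
# Query the service might run on the target instance. The column names
# are the standard ones in information_schema.PROCESSLIST.
PROCESSLIST_SQL = (
    "SELECT ID, USER, HOST, DB, COMMAND, TIME, INFO "
    "FROM information_schema.PROCESSLIST"
)


def longest_running_query(rows, own_connection_id=None):
    """Return the row of the longest-running active query, or None.

    Idle sessions (COMMAND != 'Query'), rows with no SQL text, and the
    monitoring connection itself are skipped.
    """
    candidates = [
        row for row in rows
        if row["COMMAND"] == "Query"
        and row["INFO"] is not None
        and row["ID"] != own_connection_id
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row["TIME"])


def kill_statement(row):
    """Build the KILL statement for a row without executing it."""
    return f"KILL {int(row['ID'])};"
```

Showing the user and SQL text next to the generated statement lets the DBA confirm the target before running anything.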

3. Disk‑Space Management

The system distinguishes between directory‑level and file‑level space statistics. Directory‑level stats target batch‑cleanable files such as MySQL binlogs, while file‑level stats handle irregular files like error logs. Configurable templates drive automated collection and reporting, enabling quick, low‑cost cleanup actions.
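
The two kinds of statistics could be collected along the lines of the sketch below. The template layout (a `directories` list with `path` and `prefix`, and a `files` list) is an assumption for illustration; the article does not show the real template format.

```python
import os


def directory_usage(path, prefix=""):
    """Total bytes of regular files directly in `path` whose names
    start with `prefix` (for example, a binlog base name)."""
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False) and entry.name.startswith(prefix):
                total += entry.stat(follow_symlinks=False).st_size
    return total


def file_usage(paths):
    """Size in bytes of each listed file; unreadable or missing files
    are reported as None."""
    sizes = {}
    for path in paths:
        try:
            sizes[path] = os.path.getsize(path)
        except OSError:
            sizes[path] = None
    return sizes


def build_report(template):
    """Collect both kinds of statistics for one hypothetical template."""
    return {
        "directories": {
            d["path"]: directory_usage(d["path"], d.get("prefix", ""))
            for d in template.get("directories", [])
        },
        "files": file_usage(template.get("files", [])),
    }
```

A directory entry gives one number for a whole family of files that can be purged together, while a file entry tracks a single file that needs individual handling.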

Figure: Disk space management example

Future Directions

Statistical analysis of historical alerts to identify frequent issue types and instance‑level distributions.

Multi‑metric correlation to reduce false positives and detect complex degradation patterns.

Detection of abrupt metric spikes (e.g., QPS/TPS, connection count) for early anomaly identification.
