
Redesigning Database Monitoring: From Push to Pull for Smarter Alerts

This article analyzes the shortcomings of the legacy database monitoring system, explains the transition from a push‑based to a pull‑based architecture, outlines comprehensive metric collection, intelligent alert strategies, and self‑healing mechanisms, and showcases the performance improvements achieved with the new solution.

Qunar Tech Salon

Introduction

The existing online database alerting system has been stable, but it suffers from missed alerts, false alarms, and metric collection fragmented across servers. Addressing these issues requires a systematic review of the current risks and targeted solutions.

Glossary

collectd – daemon that periodically gathers system and application metrics.

Nagios – open‑source monitoring system for IT infrastructure.

NRPE – Nagios Remote Plugin Executor, an add‑on that lets Nagios run check commands on remote hosts.

Prometheus – cloud‑native monitoring system that stores metrics as time‑series data.

exporter – agent in the Prometheus ecosystem that runs on or beside target hosts and exposes their metrics over HTTP for scraping.

Current System Bottlenecks

Current State

Database monitoring is essential; however, most Qunar databases still run on physical machines with multiple instances per host, monitored by a collectd‑based collection pipeline and a Nagios‑based alerting pipeline.

Architecture of the metric collection system:

Each database server runs a collectd agent that pushes data via UDP to a collectd server, which stores the data for web queries.

Architecture of the alarm system:

Each server also runs an NRPE service; Nagios polls NRPE, which executes check scripts locally and returns the results. Alerts are routed to an alert service that enriches them with metadata before notifying the owners.

While these systems meet most needs, they cannot satisfy evolving requirements and expose several pain points.

Pain Points

UDP‑based push – collectd pushes metrics over UDP, so data is occasionally lost and root causes are hard to trace.

Limited metric set – only basic database and server metrics are collected.

Missing metadata – the storage layer carries no tags, which prevents high‑level aggregation.

Coarse alert granularity – minute‑level checks miss rapid anomalies.

Fixed collection intervals – transient spikes between samples go unrecorded.

Costly metric changes – the agent must be restarted for every new metric, raising operational cost.

Technical Choice: From Push to Pull

In the push model, agents send data to the server on their own schedule; in the pull model, the server requests data from agents when it needs it. The trade‑offs compare as follows:

Push Advantages: simple agent deployment, stateless server, lower management cost.

Push Disadvantages: uniform collection intervals, UDP data loss, no tagging support.

Pull Advantages: fine‑grained interval control, server can add tags for high‑level aggregation, fewer network hops reduce data loss.

Pull Disadvantages: server becomes stateful, scaling and HA are more complex, agents need dedicated ports.

Given the analysis, the pull model is selected for its flexibility and reliability.
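As a minimal sketch of what a pull‑mode agent looks like (this uses the Python prometheus_client library; the metric name, port, and sampling logic are illustrative assumptions, not the article's actual agent):

import time
from prometheus_client import Gauge, start_http_server

# Hypothetical metric; a real agent would read it from SHOW GLOBAL STATUS.
threads_connected = Gauge('mysql_threads_connected',
                          'Current number of connected MySQL threads')

def sample_threads_connected():
    return 42  # placeholder for the real query

if __name__ == '__main__':
    # Expose /metrics on a dedicated port; the Prometheus server decides
    # when and how often to scrape it, and the agent never pushes anything.
    start_http_server(9104)
    while True:
        threads_connected.set(sample_threads_connected())
        time.sleep(5)

Because the server initiates every scrape over HTTP, it can attach tags (host, cluster, department) at scrape time and vary the interval per target, which are two of the pull advantages listed above.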

New System Goals

Comprehensive metric collection without gaps.

Configurable collection frequencies so that transient spikes are not missed.

Tagging of metrics for high‑level aggregation.

Reasonable storage cost.

Timely fault detection, following the "1‑5‑10" response model (discover an incident within 1 minute, locate it within 5, recover within 10).

Reduce false positives and missed alerts.

Full‑Scale Collection: Grouped and Tiered

Metrics are categorized by type:

stat – instantaneous values (e.g., thread count, replication lag) – stored as a Prometheus Gauge.

util – derived rates (e.g., CPU usage) – stored as a Counter, Histogram, or Summary.

info – static information (e.g., DB version) – stored as the latest snapshot only.

Collection scope covers MySQL, Redis, and OS metrics, both internal (process‑level) and external (resource‑level).
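A hedged sketch of this mapping with the Python prometheus_client library (metric and label names are illustrative, not the article's actual metrics):

from prometheus_client import Counter, Gauge, Info

# stat: instantaneous value that can rise and fall -> Gauge
replication_lag = Gauge('mysql_replication_lag_seconds',
                        'Seconds the replica is behind the primary')

# util: cumulative counter; rates such as CPU usage are derived from it
# at query time (or bucketed with a Histogram/Summary for latencies)
cpu_seconds = Counter('mysql_process_cpu_seconds',
                      'Total CPU seconds consumed by the mysqld process')

# info: static description, kept as the latest snapshot -> Info
build_info = Info('mysql_build', 'Static MySQL build information')
build_info.info({'version': '5.7.30'})

# Tags enable the high-level aggregation the article calls for, e.g.
# summing Redis memory per department across all hosts.
redis_memory = Gauge('redis_memory_used_bytes',
                     'Redis memory usage', ['department', 'host'])
redis_memory.labels(department='hotel', host='db-101').set(2.5e9)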

Intelligent Alert Strategy

An alert rule combines an alert_threshold with a trigger_count: the alert fires only after the threshold has been breached trigger_count consecutive times. Static thresholds are insufficient because workloads vary over time, so time‑based thresholds are introduced (e.g., a higher threshold during scheduled batch jobs).

Threshold table:

| Time Slot   | Threshold | Note                          |
|-------------|-----------|-------------------------------|
| 8:00‑12:30  | 5         | Higher due to scheduled tasks |
| Other times | 2         | Normal operation              |
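A minimal sketch of how such a rule could be evaluated, using the values from the table; the trigger count and function names are assumptions, since the article only names the concepts:

from datetime import datetime

TRIGGER_COUNT = 3   # assumed: consecutive breaches required before alerting
_breaches = 0

def threshold_for(t: datetime) -> int:
    # Time-based threshold from the table: 8:00-12:30 uses the relaxed
    # value because scheduled batch jobs legitimately push the metric up.
    in_batch_window = (8 <= t.hour < 12) or (t.hour == 12 and t.minute < 30)
    return 5 if in_batch_window else 2

def check(value: float) -> bool:
    # alert_threshold combined with trigger_count: fire only after the
    # time-dependent threshold is breached TRIGGER_COUNT times in a row.
    global _breaches
    _breaches = _breaches + 1 if value > threshold_for(datetime.now()) else 0
    return _breaches >= TRIGGER_COUNT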

Dynamic detection intervals are used: when a spike is observed, the next check interval is shortened, reducing alert latency.

Detection logic examples:

# Gauge-style check: compare an instantaneous value with the threshold.
stat = collect()
if stat > threshold:
    alert()

# Rate-style check: derive a rate from two samples of a cumulative counter.
last_time = now()
last = collect()
...
current_time = now()
current = collect()
if (current - last) / (current_time - last_time) > threshold:
    alert()
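The dynamic‑interval behavior can be sketched in the same pseudocode style; the interval values and the spike heuristic are assumptions, and collect(), alert(), and threshold are the primitives from the examples above:

import time

BASE_INTERVAL = 60   # seconds between checks under normal conditions
FAST_INTERVAL = 5    # assumed shortened interval after a suspicious reading

while True:
    value = collect()
    if value > threshold:
        alert()
    # A reading above 80% of the threshold counts as a spike (assumed
    # heuristic): schedule the next check sooner so a breach is caught
    # with FAST_INTERVAL latency instead of BASE_INTERVAL.
    spike = value > 0.8 * threshold
    time.sleep(FAST_INTERVAL if spike else BASE_INTERVAL)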

Self‑Healing

Beyond alerting, the system attempts automatic remediation. When an alert is raised, a heal module drills down into metric details, executes predefined actions, and logs results. Complex actions still require human confirmation.
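A hedged sketch of such a heal module; the alert types, action table, and helper functions are illustrative placeholders, not Qunar's actual implementation:

# Hypothetical remediation helpers standing in for real operational scripts.
def purge_old_binlogs(host):  return f'purged binlogs on {host}'
def kill_idle_sessions(host): return f'killed idle sessions on {host}'

# Whitelist of actions considered safe to run without a human in the loop.
SAFE_ACTIONS = {
    'disk_space_low':       purge_old_binlogs,
    'too_many_connections': kill_idle_sessions,
}

def heal(alert):
    action = SAFE_ACTIONS.get(alert['type'])
    if action is None:
        # Complex or unrecognized faults are escalated for confirmation.
        print(f"escalating {alert['type']} on {alert['host']} to the owner")
        return
    result = action(alert['host'])
    print(f"auto-heal: {result}")   # every automatic action is logged

heal({'type': 'disk_space_low', 'host': 'db-101'})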

Results: Sample Dashboards

Various dashboards demonstrate the new system’s capabilities, such as:

MySQL QPS and latency per SQL.

Process‑level CPU, read/write I/O.

Server disk capacity planning.

Host CPU and network traffic monitoring.

Department‑level Redis usage with custom tags.


Conclusion and Outlook

The new collection system provides richer, tagged data that supports higher‑level analysis, reporting, and proactive operations. Optimized detection pipelines and plugin‑based monitoring enable rapid issue identification, meeting the “1‑5‑10” incident response goal while laying groundwork for future enhancements.
