
From Monitoring to Decision: MySQL Capacity Planning with Prometheus & Grafana

This guide walks through building a Prometheus‑Grafana monitoring stack for MySQL, selecting exporters, defining key metric groups, leveraging Performance Schema for deep insights, configuring tiered alerts, and applying trend‑based capacity planning to anticipate resource needs.

Senior Xiao Ying

Feb 28, 2026


Prometheus + Grafana Monitoring

Goal: Build a complete MySQL monitoring stack and understand the business meaning behind each chart.

Exporter selection

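mysqld_exporter (official, prometheus/mysqld_exporter): By default it collects SHOW GLOBAL STATUS, SHOW GLOBAL VARIABLES, and replica status. Further performance_schema and information_schema collectors are opt-in through --collect.* flags.

percona/mysqld_exporter: Percona's fork, used by Percona Monitoring and Management (PMM), which adds more performance_schema and information_schema collectors. This is the author's recommended choice.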

Key metric groups for Grafana dashboards

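① Throughput & Load

mysql_global_status_questions: Cumulative count of statements sent by clients. QPS is its per-second rate, e.g. rate(mysql_global_status_questions[1m]).

mysql_global_status_threads_connected: Currently open connections (connection utilization).

mysql_global_status_threads_running: Threads actively executing (the real concurrent load).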

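② Latency & Response

mysql_global_status_slow_queries: Number of statements that took longer than long_query_time.

mysql_global_status_created_tmp_tables: Internal temporary tables created. Sustained high values often point to GROUP BY, ORDER BY, or DISTINCT queries that lack a usable index.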

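③ InnoDB Engine State

mysql_global_status_innodb_data_reads / mysql_global_status_innodb_data_writes: Physical read and write operations.

mysql_global_status_innodb_row_lock_waits: Number of times a row-lock request had to wait.

mysql_global_status_innodb_log_waits: Number of times a write had to wait because the redo log buffer was full. The cause is either an innodb_log_buffer_size that is too small or redo I/O that is too slow.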

④ Resource Utilization

CPU, memory, and disk usage, collected by node_exporter.

Disk I/O pressure: CPU iowait time and per-device I/O latency.

Layered monitoring speeds up root-cause identification. The business layer watches QPS and connections. The database layer watches locks, temporary tables, and slow queries. The system layer watches I/O.
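Most of the status values above are cumulative counters, so a chart only becomes meaningful as a per-second rate. Connection saturation, by contrast, is a simple ratio. The minimal Python sketch below shows both calculations. It assumes samples arrive as (unix_seconds, value) pairs scraped from a counter such as mysql_global_status_questions. The function names `counter_rate` and `saturation` are hypothetical, and the sample values are made up.

```python
"""Sketch: turning exporter counters into rates and ratios (hypothetical helpers)."""


def counter_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase over the window, tolerating counter resets.

    Roughly what PromQL rate() does, minus its extrapolation to the window edges.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means mysqld restarted and the counter began again from zero.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def saturation(threads_connected: float, max_connections: float) -> float:
    """Open connections as a fraction of max_connections."""
    return threads_connected / max_connections


if __name__ == "__main__":
    # Made-up Questions samples 15 s apart, with a restart before the last one.
    questions = [(0, 1_000), (15, 4_000), (30, 7_000), (45, 600)]
    print(f"QPS ~ {counter_rate(questions):.1f}")
    print(f"saturation = {saturation(170, 200):.2f}")
```

The reset branch treats the post-restart value as the increase since zero. This keeps a restart from showing up as a large negative QPS dip.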

Performance Schema Deep Dive

Goal: Expose MySQL’s internal state as queryable metrics.

Core configuration

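[mysqld]
# On by default since MySQL 5.6.6; cannot be changed at runtime
performance_schema=ON
# Feeds the events_statements_summary_by_digest table (ON by default)
performance-schema-consumer-statements_digest=ON
# Long history tables are OFF by default and cost memory; they only record
# when the matching events_*_current consumer is also enabled
performance-schema-consumer-events_statements_history_long=ON
performance-schema-consumer-events_transactions_history_long=ON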
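Turning on a history_long consumer is not enough by itself. Per the consumer hierarchy in the MySQL manual, a consumer only collects data when all of its parents are enabled too. The Python sketch below checks this against setup_consumers output. It assumes the hierarchy as summarized in its docstring and only covers the consumer names listed there. The function names are hypothetical.

```python
"""Sketch: which Performance Schema consumers actually collect data.

Assumed hierarchy, as summarized from the MySQL manual:
  global_instrumentation
    thread_instrumentation
      events_<type>_current
        events_<type>_history, events_<type>_history_long
    statements_digest
Input mimics: SELECT NAME, ENABLED FROM performance_schema.setup_consumers;
Names outside this hierarchy raise ValueError.
"""
from typing import Optional


def parent_of(name: str) -> Optional[str]:
    """Return the consumer that must also be enabled for `name` to collect."""
    if name == "global_instrumentation":
        return None
    if name in ("thread_instrumentation", "statements_digest"):
        return "global_instrumentation"
    if name.endswith("_current"):
        return "thread_instrumentation"
    if name.endswith(("_history", "_history_long")):
        # events_statements_history_long -> events_statements_current
        return name.rsplit("_history", 1)[0] + "_current"
    raise ValueError(f"unknown consumer: {name}")


def effective(enabled: dict[str, str]) -> dict[str, bool]:
    """A consumer is effective only if it and every ancestor are set to YES."""
    result = {}
    for name in enabled:
        node: Optional[str] = name
        on = True
        while node is not None:
            on = on and enabled.get(node) == "YES"
            node = parent_of(node)
        result[name] = on
    return result


if __name__ == "__main__":
    setup_consumers = {
        "global_instrumentation": "YES",
        "thread_instrumentation": "YES",
        "statements_digest": "YES",
        "events_statements_current": "NO",
        "events_statements_history_long": "YES",
        "events_transactions_current": "YES",
        "events_transactions_history_long": "YES",
    }
    for name, on in effective(setup_consumers).items():
        print(f"{name:34} {'collecting' if on else 'not collecting'}")
```

In this sample, events_statements_history_long is set to YES but stays idle, because events_statements_current is off.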

Four practical scenarios

Scenario A – Who is eating CPU?

-- Non-idle threads, longest-running first (runtime is a proxy for CPU use)
SELECT THREAD_ID, PROCESSLIST_ID, PROCESSLIST_USER, PROCESSLIST_DB,
       PROCESSLIST_COMMAND, PROCESSLIST_TIME, PROCESSLIST_INFO
FROM performance_schema.threads
WHERE PROCESSLIST_COMMAND != 'Sleep'
ORDER BY PROCESSLIST_TIME DESC;

Scenario B – Find the worst SQL

-- Rank statement digests by total rows examined (a proxy for logical reads)
-- Timer columns are in picoseconds, so divide by 1e9 to get milliseconds
SELECT DIGEST_TEXT, COUNT_STAR, AVG_TIMER_WAIT/1e9 AS avg_ms,
       SUM_ROWS_EXAMINED, SUM_ROWS_SENT,
       (SUM_ROWS_EXAMINED / SUM_ROWS_SENT) AS scan_ratio
FROM performance_schema.events_statements_summary_by_digest
WHERE SUM_ROWS_SENT > 0
ORDER BY SUM_ROWS_EXAMINED DESC
LIMIT 10;
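Digest rows are often exported and post-processed outside MySQL. Two things are easy to get wrong there: the picosecond unit of the timer columns and the zero-row filter. Below is a minimal Python sketch that applies the same filter and ordering as the query. The `DigestRow` type and `worst` function are hypothetical, and the row values are invented.

```python
"""Sketch: ranking events_statements_summary_by_digest rows in Python."""
from dataclasses import dataclass

PS_PER_MS = 1_000_000_000  # Performance Schema timers count picoseconds


@dataclass
class DigestRow:
    digest_text: str
    count_star: int
    avg_timer_wait: int  # picoseconds
    sum_rows_examined: int
    sum_rows_sent: int

    @property
    def avg_ms(self) -> float:
        return self.avg_timer_wait / PS_PER_MS

    @property
    def scan_ratio(self) -> float:
        # Rows read per row returned; large values suggest a missing index.
        return self.sum_rows_examined / self.sum_rows_sent


def worst(rows: list[DigestRow], n: int = 10) -> list[DigestRow]:
    """Skip digests that returned no rows, then rank by rows examined."""
    candidates = [r for r in rows if r.sum_rows_sent > 0]
    return sorted(candidates, key=lambda r: r.sum_rows_examined, reverse=True)[:n]


if __name__ == "__main__":
    rows = [
        DigestRow("SELECT * FROM orders WHERE status = ?", 1200, 45_000_000_000, 9_600_000, 1200),
        DigestRow("SELECT name FROM users WHERE id = ?", 50000, 200_000_000, 50000, 50000),
        DigestRow("UPDATE stock SET qty = qty - ? WHERE sku = ?", 8000, 1_000_000_000, 8000, 0),
    ]
    for r in worst(rows):
        print(f"{r.avg_ms:8.2f} ms  ratio={r.scan_ratio:8.1f}  {r.digest_text}")
```

The UPDATE is dropped because it sends no rows. This mirrors the WHERE SUM_ROWS_SENT > 0 clause, which also keeps scan_ratio from dividing by zero.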

Scenario C – Transaction lock wait analysis

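-- Metadata (MDL) locks held or requested, e.g. an ALTER TABLE stuck behind a long transaction
SELECT * FROM performance_schema.metadata_locks;
-- InnoDB row locks, and which transaction is waiting on which (MySQL 8.0)
SELECT * FROM performance_schema.data_locks;
SELECT * FROM performance_schema.data_lock_waits;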
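Lock-wait tables list direct waiter/blocker pairs. When waits form a chain, the transaction to investigate first is the one that blocks others but is not itself waiting. The minimal Python sketch below walks that chain. It assumes rows have already been fetched from performance_schema.data_lock_waits as (REQUESTING_ENGINE_TRANSACTION_ID, BLOCKING_ENGINE_TRANSACTION_ID) pairs. The function name and the IDs are hypothetical.

```python
"""Sketch: finding root blockers in lock-wait pairs."""
from collections import defaultdict


def root_blockers(waits: list[tuple[int, int]]) -> dict[int, set[int]]:
    """Map each root blocker to every transaction transitively stuck behind it."""
    blocked_by: dict[int, set[int]] = defaultdict(set)  # blocker -> direct waiters
    waiting: set[int] = set()
    for requester, blocker in waits:
        blocked_by[blocker].add(requester)
        waiting.add(requester)

    roots = {}
    for blocker in blocked_by:
        if blocker in waiting:
            continue  # it is waiting on someone else, so it is not a root
        victims: set[int] = set()
        stack = [blocker]
        while stack:
            # .get() does not insert keys, so the outer iteration is safe
            for waiter in blocked_by.get(stack.pop(), ()):
                if waiter not in victims:
                    victims.add(waiter)
                    stack.append(waiter)
        roots[blocker] = victims
    return roots


if __name__ == "__main__":
    # 103 waits on 102, which waits on 101; 104 also waits on 101.
    pairs = [(102, 101), (103, 102), (104, 101), (201, 200)]
    for root, victims in root_blockers(pairs).items():
        print(f"trx {root} blocks {sorted(victims)}")
```

A pure cycle produces no root, because every member is also waiting. That case is a deadlock, which InnoDB detects and resolves by itself.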

Scenario D – Memory allocation tracing (MySQL 8.0)

-- Identify memory components with highest consumption
SELECT EVENT_NAME, CURRENT_NUMBER_OF_BYTES_USED
FROM performance_schema.memory_summary_global_by_event_name
ORDER BY 2 DESC
LIMIT 10;

Key Metric Alert Configuration

Goal: Set effective thresholds to reduce alert noise.

Prometheus alerting rules (routed through Alertmanager)

P0 – Immediate Action

groups:
- name: mysql_critical
  rules:
  - alert: MySQLDown
    expr: mysql_up == 0
    for: 1m
  - alert: MySQLConnectionSaturation
    expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.85
    for: 2m

P1 – Needs Attention

# Rule items: add under the rules: key of an alert group
- alert: ReplicationLag
  expr: mysql_slave_status_seconds_behind_master > 60
  for: 1m
- alert: SuddenSlowQueryIncrease
  expr: rate(mysql_global_status_slow_queries[5m]) > 0.5

P2 – Routine Optimization

# Rule item: add under the rules: key of an alert group
- alert: TempTableDiskSpill
  expr: rate(mysql_global_status_created_tmp_disk_tables[5m]) > 10

Core alerting principles

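Trend over static threshold: Prefer an alert on "disk usage growing > 5%/day" to a fixed "disk > 80%" alarm. The trend tells you how much time is left to react.

Duration guard: Use for: 2m so that a condition must hold continuously for two minutes before the alert fires. This filters out transient spikes.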
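The effect of a for: clause is easiest to see by stepping through rule evaluations. Below is a minimal Python sketch of a simplified model of Prometheus rule evaluation. The condition is checked every `interval` seconds. The alert turns pending on the first true result and fires only once it has stayed true for `for_seconds`. Any false result resets it. The function name is hypothetical, and the model ignores details such as notification timing.

```python
"""Sketch: how a `for:` duration suppresses short spikes (simplified model)."""
from typing import Optional


def alert_states(conditions: list[bool], interval: int, for_seconds: int) -> list[str]:
    """Return the alert state after each evaluation of the rule expression."""
    states: list[str] = []
    pending_since: Optional[int] = None
    for i, cond in enumerate(conditions):
        now = i * interval
        if not cond:
            pending_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if pending_since is None:
            pending_since = now  # first true evaluation: start pending
        states.append("firing" if now - pending_since >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    # Connection saturation > 0.85, evaluated every 30 s, with for: 2m.
    spike_then_sustained = [False, True, True, False, True, True, True, True, True, False]
    print(alert_states(spike_then_sustained, interval=30, for_seconds=120))
```

The one-minute spike at the start never fires. The sustained breach fires on the evaluation where it has been true for a full 120 seconds.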

Capacity Planning: From Reactive Fire‑fighting to Proactive Defense

Goal: Answer "When to add memory?" and "When to shard?"

Trend extrapolation methodology

Use historical monitoring data to predict when a resource will be exhausted.

Example: Disk capacity planning

Collect node_filesystem_free_bytes for the past 90 days.

Model: Fit a linear regression to obtain the daily rate of decline.

Predict: Assuming the rate stays constant, calculate when free space reaches zero.
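The three steps above fit in a few lines of Python. The sketch below fits free bytes against day number with ordinary least squares, the simple linear regression that predict_linear is documented to use, and then solves for the zero crossing. The function names are hypothetical. The 90-day series is synthetic and stands in for daily node_filesystem_free_bytes samples.

```python
"""Sketch: linear-trend forecast of when a filesystem runs out of space."""
from typing import Optional


def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    if len(xs) < 2:
        raise ValueError("need at least two points")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(free_bytes_by_day: list[float]) -> Optional[float]:
    """Days after the last sample until the fitted line reaches zero free bytes.

    Returns None when free space is flat or growing (no exhaustion forecast).
    """
    days = [float(d) for d in range(len(free_bytes_by_day))]
    slope, intercept = fit_line(days, free_bytes_by_day)
    if slope >= 0:
        return None
    zero_day = -intercept / slope
    return zero_day - days[-1]


if __name__ == "__main__":
    # Synthetic: 500 GB free on day 0, shrinking by 2 GB per day for 90 days.
    history = [500e9 - 2e9 * d for d in range(90)]
    print(f"disk full in about {days_until_full(history):.0f} days")
```

Real series are noisy, so the forecast is an estimate. Re-running it as new data arrives, as predict_linear does on every evaluation, matters more than the precision of any single fit.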

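Prometheus function used. The expression below fits the last 7 days of samples and fires when free space is projected to fall below zero within the next 7 days (86400×7 seconds):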

predict_linear(node_filesystem_free_bytes{device!~'rootfs'}[7d], 86400*7) < 0

Four‑dimension capacity model

Business‑cycle based planning

E‑commerce: Before Double‑11 (the November 11 shopping festival), stress test at 1.5× the peak TPS.

SaaS: Estimate load as monthly active users × average SQL statements per user × projected two‑year growth factor.
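Both rules of thumb are plain arithmetic. The only trap is treating growth as a multiplier rather than a percentage. The Python sketch below states them explicitly. The function names and all input values are hypothetical placeholders. The time unit of `sql_per_user` (per day, per peak second, and so on) sets the unit of the result.

```python
"""Sketch: the two business-cycle sizing rules as arithmetic."""


def stress_test_target_tps(observed_peak_tps: float, headroom: float = 1.5) -> float:
    """E-commerce rule: load-test at 1.5x the observed peak TPS."""
    return observed_peak_tps * headroom


def saas_sql_volume(monthly_active_users: int, sql_per_user: float, growth_factor: float) -> float:
    """SaaS rule: MAU x average SQL per user x two-year growth factor.

    growth_factor is a multiplier, e.g. 1.5 ** 2 for +50% per year over two years.
    """
    return monthly_active_users * sql_per_user * growth_factor


if __name__ == "__main__":
    print(stress_test_target_tps(20_000))
    print(saas_sql_volume(200_000, 50, 1.5 ** 2))
```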


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

monitoring Prometheus MySQL capacity planning Performance Schema Grafana Alertmanager

Written by

Senior Xiao Ying

Dedicated to sharing Java backend technical experience and original tutorials, offering career transition advice and resume editing. Recognized as a rising star in CSDN's Java backend community and ranked Top 3 in the 2022 New Star Program for Java backend.
