From Monitoring to Decision: MySQL Capacity Planning with Prometheus & Grafana
This guide walks through building a Prometheus‑Grafana monitoring stack for MySQL, selecting exporters, defining key metric groups, leveraging Performance Schema for deep insights, configuring tiered alerts, and applying trend‑based capacity planning to anticipate resource needs.
Prometheus + Grafana Monitoring
Goal: Build a complete MySQL monitoring stack and understand the business meaning behind each chart.
Exporter selection
mysqld_exporter: Official exporter that collects SHOW GLOBAL STATUS and SHOW GLOBAL VARIABLES.
percona/mysqld_exporter: Percona‑customized version that also gathers performance_schema and information_schema. Recommended.
Key metric groups for Grafana dashboards
① Throughput & Load
mysql_global_status_questions: Cumulative statement counter; QPS (queries per second) is its per‑second rate.
mysql_global_status_threads_connected: Connection count (water‑level).
mysql_global_status_threads_running: Running threads (actual load).
② Latency & Response
mysql_global_status_slow_queries: Slow‑query count.
mysql_global_status_created_tmp_tables: Temporary tables; high values often point to missing or unused indexes.
③ InnoDB Engine State
mysql_global_status_innodb_data_reads / mysql_global_status_innodb_data_writes: Physical I/O.
mysql_global_status_innodb_row_lock_waits: Row‑lock wait count.
mysql_global_status_innodb_log_waits: Redo‑log write waits (log buffer too small or an I/O bottleneck).
④ Resource Utilization
CPU / Memory / Disk collected by node_exporter.
Disk I/O latency (iowait).
Layered monitoring speeds root‑cause identification: the business layer watches QPS and connections, the database layer watches locks, temporary tables, and slow queries, and the system layer watches I/O.
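Most of the mysql_global_status_* metrics above are cumulative counters, so a dashboard panel shows their per‑second rate rather than the raw value. The following is a minimal Python sketch of that calculation, assuming two or more scrapes of a counter. The function names and sample values are hypothetical, and the logic is a simplification of PromQL's rate() (it handles counter resets but does not extrapolate to the window boundaries).

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix_timestamp_seconds, counter_value)


def per_second_rate(samples: Sequence[Sample]) -> float:
    """Average per-second increase of a cumulative counter.

    A drop in value is treated as a counter reset (for example a mysqld
    restart); the post-reset value is then counted as the increase.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    return increase / elapsed


def connection_utilization(threads_connected: float, max_connections: float) -> float:
    """Fraction of max_connections currently in use (the 'water level')."""
    return threads_connected / max_connections


# Hypothetical scrapes of mysql_global_status_questions, 15 seconds apart.
questions = [(0.0, 1_000.0), (15.0, 4_000.0), (30.0, 7_000.0)]
qps = per_second_rate(questions)

# Hypothetical scrapes that include a restart between the 2nd and 3rd sample.
with_reset = [(0.0, 100.0), (10.0, 150.0), (20.0, 20.0)]
rate_after_reset = per_second_rate(with_reset)

print(f"QPS: {qps:.1f}, utilization: {connection_utilization(170, 200):.0%}")
```

Threads_connected and Threads_running are gauges, not counters, so they are plotted as‑is or divided by a limit as in connection_utilization.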
Performance Schema Deep Use
Goal: Expose MySQL’s internal state as queryable metrics.
Core configuration
[mysqld]
performance_schema=ON
# events_statements_summary_by_digest is a summary table, not a consumer; it is populated by the statements_digest consumer enabled below
performance-schema-consumer-events_statements_history_long=ON
performance-schema-consumer-events_transactions_history_long=ON
performance-schema-consumer-statements_digest=ON
Four practical scenarios
Scenario A – Who is eating CPU?
-- Show currently executing threads ordered by longest runtime
SELECT THREAD_ID, PROCESSLIST_ID, PROCESSLIST_USER, PROCESSLIST_DB,
PROCESSLIST_COMMAND, PROCESSLIST_TIME, PROCESSLIST_INFO
FROM performance_schema.threads
WHERE PROCESSLIST_COMMAND != 'Sleep'
ORDER BY PROCESSLIST_TIME DESC;
Scenario B – Find the worst SQL
-- Rank statements by rows examined (a proxy for logical reads)
-- Timer columns are in picoseconds, so divide by 1e9 to get milliseconds
SELECT DIGEST_TEXT, COUNT_STAR, AVG_TIMER_WAIT/1e9 AS avg_ms,
SUM_ROWS_EXAMINED, SUM_ROWS_SENT,
(SUM_ROWS_EXAMINED / SUM_ROWS_SENT) AS scan_ratio
FROM performance_schema.events_statements_summary_by_digest
WHERE SUM_ROWS_SENT > 0
ORDER BY SUM_ROWS_EXAMINED DESC
LIMIT 10;
Scenario C – Transaction lock wait analysis
-- Show transactions holding or waiting for locks
SELECT * FROM performance_schema.metadata_locks;
SELECT * FROM performance_schema.data_locks; -- MySQL 8.0
Scenario D – Memory allocation tracing (MySQL 8.0)
-- Identify memory components with highest consumption
SELECT EVENT_NAME, CURRENT_NUMBER_OF_BYTES_USED
FROM performance_schema.memory_summary_global_by_event_name
ORDER BY 2 DESC
LIMIT 10;
Key Metric Alert Configuration
Goal: Set effective thresholds to reduce alert noise.
Prometheus alerting rules example (for use with Alertmanager)
P0 – Immediate Action
groups:
  - name: mysql_critical
    rules:
      - alert: MySQLDown
        expr: mysql_up == 0
        for: 1m
      - alert: MySQLConnectionSaturation
        expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.85
        for: 2m
P1 – Needs Attention
      - alert: ReplicationLag
        expr: mysql_slave_status_seconds_behind_master > 60
        for: 1m
      - alert: SuddenSlowQueryIncrease
        expr: rate(mysql_global_status_slow_queries[5m]) > 0.5
P2 – Routine Optimization
      - alert: TempTableDiskSpill
        expr: rate(mysql_global_status_created_tmp_disk_tables[5m]) > 10
Core alerting principles
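Trend over static threshold: Prefer a trend‑based alert such as "disk growth rate > 5%/day" to a fixed "disk > 80%" alarm.
Time‑window aggregation: Set a for: duration (for example for: 2m) so that a condition must hold continuously before the alert fires, which filters out transient spikes.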
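To make the effect of a for: duration concrete, here is a small Python sketch that replays a series of rule evaluations and reports when an alert would be firing. It is an illustration of the idea only, not Prometheus's implementation; the function name, the 30‑second evaluation interval, and the sample data are assumptions.

```python
from typing import Iterable, List, Optional, Tuple


def firing_times(
    evaluations: Iterable[Tuple[float, bool]], for_seconds: float
) -> List[float]:
    """Return the evaluation timestamps at which the alert is firing.

    evaluations: (timestamp_seconds, condition_is_true) per rule evaluation.
    The alert fires only after the condition has been true continuously
    for at least for_seconds; any false evaluation resets the clock.
    """
    firing: List[float] = []
    pending_since: Optional[float] = None
    for ts, breached in evaluations:
        if not breached:
            pending_since = None  # condition cleared, start over
            continue
        if pending_since is None:
            pending_since = ts  # condition just became true
        if ts - pending_since >= for_seconds:
            firing.append(ts)
    return firing


# Hypothetical evaluations every 30 s: one transient spike at t=30,
# then a sustained breach starting at t=90.
evaluations = [
    (0.0, False), (30.0, True), (60.0, False), (90.0, True),
    (120.0, True), (150.0, True), (180.0, True), (210.0, True),
]

with_for = firing_times(evaluations, for_seconds=120)
without_for = firing_times(evaluations, for_seconds=0)
print(with_for)
print(without_for)
```

With a two‑minute window the transient spike at t=30 never pages anyone, while the sustained breach does once it has lasted 120 seconds. With no window, every breached evaluation fires.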
Capacity Planning: From Reactive Fire‑fighting to Proactive Defense
Goal: Answer "When to add memory?" and "When to shard?"
Trend extrapolation methodology
Use historical monitoring data to predict when a resource will be exhausted.
Example: Disk capacity planning
Collect: node_filesystem_free_bytes samples for the past 90 days.
Model: Linear regression to compute the daily decline rate.
Predict: Assuming the rate stays constant, calculate when free space reaches zero.
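The three steps above can be sketched offline in a few lines of Python. This is a minimal illustration using an ordinary least‑squares fit; the function names are hypothetical, and the 90 daily samples are synthetic data (500 GiB free, shrinking by 2 GiB per day), not measurements.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]  # (day_index, free_space_in_gib)


def fit_line(points: Sequence[Point]) -> Tuple[float, float]:
    """Ordinary least-squares fit: free = slope * day + intercept."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all samples share the same timestamp")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(points: Sequence[Point]) -> Optional[float]:
    """Days after the last sample until projected free space reaches zero.

    Returns None when free space is flat or growing.
    """
    slope, intercept = fit_line(points)
    if slope >= 0:
        return None
    zero_day = -intercept / slope  # where the fitted line crosses zero
    return zero_day - points[-1][0]


# Synthetic history: 90 daily samples, 500 GiB free, losing 2 GiB per day.
history = [(float(day), 500.0 - 2.0 * day) for day in range(90)]
daily_decline, _ = fit_line(history)
remaining_days = days_until_full(history)
print(f"decline: {-daily_decline:.2f} GiB/day, full in ~{remaining_days:.0f} days")
```

Real disk usage is rarely this linear, so the result is best read as an order‑of‑magnitude estimate that is recomputed as new samples arrive.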
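Prometheus function used (this example fits the last 7 days and projects 7 days ahead; widen the range to fit a longer history):
predict_linear(node_filesystem_free_bytes{device!~'rootfs'}[7d], 86400*7) < 0
Four‑dimension capacity model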
Business‑cycle based planning
E‑commerce: Before Double‑11 (the November 11 shopping festival), stress test at "peak TPS × 1.5".
SaaS: Estimate using "monthly active users × average SQL per user × 2‑year growth rate".
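Both rules of thumb reduce to simple arithmetic, sketched below in Python. All input figures are made up for illustration, and the "2‑year growth rate" is interpreted here as compound annual growth over two years, which is an assumption rather than something the formula above specifies.

```python
def stress_test_target_tps(peak_tps: float, headroom: float = 1.5) -> float:
    """Load to apply in a pre-event stress test: observed peak times headroom."""
    return peak_tps * headroom


def projected_monthly_statements(
    monthly_active_users: float,
    sql_per_user_per_month: float,
    annual_growth: float,
    years: int = 2,
) -> float:
    """Monthly SQL volume after compounding annual_growth for the given years."""
    growth_factor = (1.0 + annual_growth) ** years
    return monthly_active_users * sql_per_user_per_month * growth_factor


SECONDS_PER_MONTH = 30 * 86_400  # 30-day month, an approximation

# Hypothetical inputs.
target_tps = stress_test_target_tps(peak_tps=8_000)
monthly_sql = projected_monthly_statements(
    monthly_active_users=200_000,
    sql_per_user_per_month=3_000,
    annual_growth=0.5,  # 50% per year, so 2.25x after two years
)
average_qps = monthly_sql / SECONDS_PER_MONTH

print(f"stress-test target: {target_tps:.0f} TPS")
print(f"projected volume: {monthly_sql:.3g} statements/month (~{average_qps:.0f} QPS average)")
```

The projected figure is a monthly average; peak load is typically several times higher, so the result is a floor for sizing rather than a target.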
Senior Xiao Ying
Dedicated to sharing Java backend technical experience and original tutorials, offering career transition advice and resume editing. Recognized as a rising star in CSDN's Java backend community and ranked Top 3 in the 2022 New Star Program for Java backend.
