How to Diagnose and Fix MySQL Production Performance Drops
This comprehensive guide walks you through identifying the root causes of MySQL performance degradation in production, from gathering baseline metrics and analyzing slow‑query logs to examining execution plans, tuning parameters, applying configuration changes, and verifying improvements with concrete monitoring and backup strategies.
Overview
Production MySQL performance can degrade for many reasons: slow‑query accumulation, insufficient InnoDB buffer pool, redo‑log bottlenecks, lock contention, replication lag, or mis‑configured connection‑pool parameters. Without a baseline the typical reaction is to add more machines, which often masks the real cause.
Preparation
System snapshot
Collect a consistent set of OS and MySQL metrics before any change. The following commands give a quick view of CPU, memory, disk I/O, and key MySQL status variables.
date
hostname -f
uptime
free -h
df -h
iostat -xz 1 3
vmstat 1 5
mysql -e "SELECT VERSION();"
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_running';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Questions';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';"
mysql -e "SHOW PROCESSLIST;"
For master‑slave setups, also run SHOW SLAVE STATUS\G (or SHOW REPLICA STATUS\G on MySQL 8.0.22+) and SHOW GLOBAL STATUS LIKE 'Seconds_Behind_Master';
Install diagnostic tools
# Debian / Ubuntu
sudo apt update
sudo apt install -y percona-toolkit sysstat iotop jq
# RHEL / CentOS / Rocky
sudo yum install -y percona-toolkit sysstat iotop jq
Enable the slow‑query log if it is not already on; it is essential for production debugging.
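Before an incident is a bad time to discover a missing tool. A minimal sketch that checks whether the diagnostic tools used in this guide are actually on the PATH (the tool list mirrors the packages above plus mysqldumpslow, which ships with the server packages):

```shell
# check_tools: print any tools from the argument list that are not on PATH.
# Empty output means everything is installed.
check_tools() {
    missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    # echo without quotes to collapse the leading space
    echo $missing
}

# Check the tools used throughout this guide:
check_tools pt-query-digest mysqldumpslow iostat vmstat iotop jq
```

On a healthy host the call prints nothing; any output names packages to install before you start digging.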
Diagnosis workflow
Step 1 – Establish a baseline
Query the most relevant global status variables to identify whether CPU, I/O, locks, or connection buildup is the primary bottleneck.
SHOW GLOBAL STATUS LIKE 'Threads_running';
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL STATUS LIKE 'Questions';
SHOW GLOBAL STATUS LIKE 'Slow_queries';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';
SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_time';
SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables';
SHOW GLOBAL STATUS LIKE 'Sort_merge_passes';
How to read the results:
Threads_running high → thread pile‑up, likely slow SQL.
Innodb_buffer_pool_reads rising → buffer pool too small.
Created_tmp_disk_tables high → temp tables spilling to disk.
Innodb_row_lock_time high → lock‑wait amplification.
await/util from iostat high → storage layer bottleneck.
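Innodb_buffer_pool_reads (physical reads) is most useful relative to Innodb_buffer_pool_read_requests (logical reads): together they give the buffer pool hit ratio. A sketch that computes it from SHOW GLOBAL STATUS‑style output; the sample values at the bottom are illustrative, not from a real server:

```shell
# bp_hit_ratio: compute the InnoDB buffer pool hit ratio from
# SHOW GLOBAL STATUS output piped on stdin (name<TAB>value lines).
bp_hit_ratio() {
    awk '
        $1 == "Innodb_buffer_pool_read_requests" { requests = $2 }
        $1 == "Innodb_buffer_pool_reads"         { reads = $2 }
        END {
            if (requests > 0)
                printf "hit ratio: %.2f%%\n", 100 * (1 - reads / requests)
        }'
}

# Illustrative sample: 1,000,000 logical reads, 20,000 went to disk.
printf 'Innodb_buffer_pool_read_requests\t1000000\nInnodb_buffer_pool_reads\t20000\n' | bp_hit_ratio
```

In production you would feed it real output, e.g. `mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';" | bp_hit_ratio`. Sustained ratios well below ~99 % on an OLTP workload usually point at an undersized pool.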
Step 2 – Enable and analyze the slow‑query log
Set a reasonable threshold (1 s for most workloads, 500 ms for latency‑critical paths) and keep the log on.
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 1;
SET GLOBAL log_queries_not_using_indexes = OFF;
SHOW VARIABLES LIKE 'slow_query_log%';
SHOW VARIABLES LIKE 'long_query_time';
Persist the settings in my.cnf:
[mysqld]
slow_query_log=ON
slow_query_log_file=/var/log/mysql/slow.log
long_query_time=1
log_output=FILE
min_examined_row_limit=100
Analyze the log with Percona Toolkit or the built‑in utilities:
mysqldumpslow -s t -t 20 /var/log/mysql/slow.log
pt-query-digest /var/log/mysql/slow.log --limit=20
pt‑query‑digest groups identical query templates and shows total execution time, which is more representative of production impact than a single outlier.
Step 3 – Review execution plans
For each hot template, examine the plan before adding indexes.
EXPLAIN SELECT * FROM orders WHERE user_id = 1024 AND created_at >= '2026-03-08 00:00:00';
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 1024 AND created_at >= '2026-03-08 00:00:00';
SHOW INDEX FROM orders;
What to look for:
type = ALL or index → full scan, high cost.
rows large → index ineffective.
Extra contains Using temporary or Using filesort → focus areas.
filtered low → index does not cover filter conditions.
Common pitfalls:
Index exists but column order is wrong, causing range‑scan truncation.
Stale statistics lead the optimizer to choose a bad plan. Refresh with ANALYZE TABLE tbl_name; when needed.
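The red flags above can be scanned for mechanically during triage. A rough sketch that flags the usual suspects in the tab‑separated output of `mysql -e "EXPLAIN ..."`; the sample row is illustrative and the matching is deliberately loose (a table literally named "index" would false‑positive):

```shell
# explain_flags: report common warning signs in tab-separated EXPLAIN output.
explain_flags() {
    awk -F'\t' '
        {
            for (i = 1; i <= NF; i++) {
                # type = ALL / index means a full table or full index scan
                if ($i == "ALL" || $i == "index")
                    print "row " NR ": full scan (type=" $i ")"
                if ($i ~ /Using filesort/)
                    print "row " NR ": filesort"
                if ($i ~ /Using temporary/)
                    print "row " NR ": temporary table"
            }
        }'
}

# Illustrative EXPLAIN row (id, select_type, table, type, possible_keys, key, Extra):
printf '1\tSIMPLE\tproduct\tALL\tidx_cat\tNULL\tUsing where; Using filesort\n' | explain_flags
```

Piping `mysql -e "EXPLAIN SELECT ..."` through this gives a quick yes/no on whether a template needs index work before you read the full plan.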
Applying changes and verification
Dynamic vs static variables
Dynamic variables can be changed online; static variables require a restart. Always back up before modifying.
mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
mysql -e "SET GLOBAL max_connections=800;"
mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit=1;"
mysql -e "SHOW VARIABLES LIKE 'max_connections';"
If a restart is needed, schedule it outside peak hours:
sudo systemctl status mysqld --no-pager || sudo systemctl status mysql --no-pager
Verification criteria
Threads_running clearly reduced.
Slow‑query volume drops; hot templates no longer dominate.
Disk await and util return to baseline levels.
Application response time recovers.
Sample configuration (my.cnf)
# File: /etc/my.cnf
[mysqld]
user=mysql
port=3306
bind-address=0.0.0.0
skip-name-resolve=ON
max_connections=800
max_connect_errors=1000
back_log=1024
table_open_cache=4096
table_definition_cache=4096
open_files_limit=65535
innodb_buffer_pool_size=24G
innodb_buffer_pool_instances=8
innodb_log_file_size=2G
innodb_log_files_in_group=2
innodb_flush_log_at_trx_commit=1
innodb_flush_method=O_DIRECT
innodb_io_capacity=2000
innodb_io_capacity_max=4000
innodb_flush_neighbors=0
innodb_read_io_threads=8
innodb_write_io_threads=8
tmp_table_size=256M
max_heap_table_size=256M
sort_buffer_size=4M
join_buffer_size=4M
read_buffer_size=2M
read_rnd_buffer_size=4M
slow_query_log=ON
slow_query_log_file=/var/log/mysql/slow.log
long_query_time=1
log_output=FILE
min_examined_row_limit=100
performance_schema=ON
character_set_server=utf8mb4
collation_server=utf8mb4_0900_ai_ci
[client]
default-character-set=utf8mb4
Key points:
skip-name-resolve=ON removes DNS jitter on connections.
Allocate 60‑75 % of RAM to innodb_buffer_pool_size on a dedicated DB host.
Adjust tmp_table_size / max_heap_table_size to avoid excessive on‑disk temp tables.
Per‑connection buffers (join_buffer_size, sort_buffer_size) should stay modest.
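A quick way to see why per‑connection buffers must stay modest is a worst‑case memory estimate: the buffer pool plus every connection allocating its session buffers at once. A sketch using the sample values from the config above (real usage is lower, since these buffers are allocated on demand):

```shell
# estimate_mb: worst-case memory footprint in MB.
# $1 = buffer pool MB, $2 = max_connections, $3 = per-connection buffer MB
estimate_mb() {
    echo $(( $1 + $2 * $3 ))
}

# 24 GB pool, 800 connections, 4+4+2+4 = 14 MB of session buffers each:
estimate_mb 24576 800 14
```

That comes to 35776 MB (about 35 GB) worst case, which would not fit on a 32 GB host; this is exactly why raising sort_buffer_size or join_buffer_size globally is risky.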
Helper script for on‑call snapshots
#!/usr/bin/env bash
set -euo pipefail
OUT_DIR="/var/log/mysql-perf/$(date +%F_%H%M%S)"
mkdir -p "$OUT_DIR"
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_running';" > "$OUT_DIR/threads_running.txt"
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';" > "$OUT_DIR/threads_connected.txt"
mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';" > "$OUT_DIR/slow_queries.txt"
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';" > "$OUT_DIR/buffer_pool_reads.txt"
mysql -e "SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables';" > "$OUT_DIR/tmp_disk_tables.txt"
mysql -e "SHOW GLOBAL STATUS LIKE 'Sort_merge_passes';" > "$OUT_DIR/sort_merge_passes.txt"
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_time';" > "$OUT_DIR/row_lock_time.txt"
mysql -e "SHOW FULL PROCESSLIST;" > "$OUT_DIR/processlist.txt"
mysql -e "SHOW ENGINE INNODB STATUS\G" > "$OUT_DIR/innodb_status.txt"
mysql -e "SELECT * FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 20;" > "$OUT_DIR/digest_top20.txt" 2>/dev/null || true
tail -200 /var/log/mysql/slow.log > "$OUT_DIR/slow_tail.txt" 2>/dev/null || true
iostat -xz 1 3 > "$OUT_DIR/iostat.txt"
vmstat 1 5 > "$OUT_DIR/vmstat.txt"
date > "$OUT_DIR/timestamp.txt"
echo "snapshot saved in $OUT_DIR"
Case studies
Case 1 – Slow‑SQL template spike during promotion
During a flash‑sale the product‑list API latency rose from 120 ms to 2.8 s. Threads_running jumped from ~20 to 180 while CPU stayed at 55 %.
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_running';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';"
pt-query-digest /var/log/mysql/slow.log --limit=10
mysql -e "EXPLAIN SELECT * FROM product WHERE category_id=8 AND status=1 ORDER BY updated_at DESC LIMIT 20;"
Findings:
Template executed ~23 k times per minute.
Single‑column index on category_id could not satisfy the ORDER BY updated_at DESC, causing Using filesort and scanning ~800 k rows per execution.
Solution – add a covering composite index and shorten the cache window:
ALTER TABLE product ADD INDEX idx_category_status_updated (category_id, status, updated_at DESC);
(Descending index key parts take effect on MySQL 8.0+; earlier versions parse DESC but ignore it.)
Result – P99 latency reduced to 210 ms, Threads_running fell to 26, and slow‑query count returned to near zero.
Case 2 – Buffer pool too small during month‑end settlement
Order‑service latency spiked (1‑3 s) while CPU was idle. Disk await rose to 18‑25 ms.
iostat -xz 1 5
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';"
mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
Findings:
Innodb_buffer_pool_reads surged.
Buffer pool was only 8 GB on a 32 GB instance.
Batch settlement jobs competed with live traffic for hot pages.
Actions:
Throttle low‑priority settlement jobs.
Increase innodb_buffer_pool_size to 24 GB during off‑peak.
Split settlement batches to avoid simultaneous hot‑page contention.
Result – Disk await dropped to 2‑4 ms and order‑query P95 returned to ~120 ms.
Best practices & cautions
Performance
Keep the slow‑query log always on and set long_query_time to 1 s (or 500 ms for latency‑critical paths).
Prioritize index creation and SQL rewrite before adjusting global buffers; per‑connection buffers (sort_buffer_size, join_buffer_size, tmp_table_size) are not global limits.
Allocate 60‑75 % of RAM to the buffer pool on a dedicated DB host; avoid swapping at all costs.
Security
Restrict changes to high‑risk parameters such as innodb_flush_log_at_trx_commit, sync_binlog, and read_only / super_read_only.
Test backup and point‑in‑time recovery procedures before any structural or parameter change.
Limit large unindexed queries and full‑table exports in production; enforce audit and quota.
High availability
Use primary‑replica or Group Replication; support read‑write failover.
Monitor replication lag and cut read traffic from lagging replicas promptly.
Maintain physical backups plus binlog for point‑in‑time recovery; snapshot before any change.
Monitoring & alerting
Key metrics to monitor
Threads_running / Threads_connected
QPS / TPS
Slow_queries
Innodb_buffer_pool_read_requests vs Innodb_buffer_pool_reads
Created_tmp_disk_tables
Innodb_row_lock_waits / Innodb_row_lock_time
Redo‑log fsync latency, disk await / util
Replication lag and binlog generation rate
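Most of these metrics are cumulative counters, so dashboards and manual checks should look at rates, which is what rate() computes in Prometheus. A sketch of the same calculation from two Slow_queries samples taken an interval apart (sample numbers are illustrative):

```shell
# slow_query_rate: per-second rate from two cumulative Slow_queries samples.
# $1 = first sample, $2 = second sample, $3 = seconds between samples
slow_query_rate() {
    awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN { printf "%.2f/s\n", (b - a) / t }'
}

# Counter moved from 1200 to 1800 over 60 s:
slow_query_rate 1200 1800 60
```

Here the rate works out to 10 slow queries per second, which would comfortably trip a 5/s alert threshold.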
Prometheus alert example
# File: prometheus/rules/mysql_perf.yml
groups:
- name: mysql-performance
rules:
- alert: MySQLThreadsRunningHigh
expr: mysql_global_status_threads_running > 64
for: 5m
labels:
severity: warning
annotations:
summary: "MySQL running threads too high"
description: "{{ $labels.instance }} Threads_running continuously above 64"
- alert: MySQLSlowQueriesSpike
expr: rate(mysql_global_status_slow_queries[5m]) > 5
for: 5m
labels:
severity: critical
annotations:
summary: "MySQL slow query surge"
description: "{{ $labels.instance }} more than 5 slow queries per second"
- alert: MySQLBufferPoolMissHigh
expr: rate(mysql_global_status_innodb_buffer_pool_reads[5m]) > 100
for: 10m
labels:
severity: warning
annotations:
summary: "MySQL Buffer Pool random reads high"
description: "{{ $labels.instance }} high physical reads from Buffer Pool"
Backup and recovery workflow
Backup script (pre‑change snapshot)
#!/usr/bin/env bash
set -euo pipefail
OUT_DIR="/var/backups/mysql-prechange/$(date +%F_%H%M%S)"
mkdir -p "$OUT_DIR"
mysql -e "SHOW MASTER STATUS\G" > "$OUT_DIR/master_status.txt" 2>/dev/null || true
mysql -e "SHOW BINARY LOGS;" > "$OUT_DIR/binary_logs.txt"
mysql -e "SHOW VARIABLES;" > "$OUT_DIR/variables.txt"
mysql -e "SHOW GLOBAL STATUS;" > "$OUT_DIR/global_status.txt"
tar czf "${OUT_DIR}.tar.gz" "$OUT_DIR"
echo "backup saved to ${OUT_DIR}.tar.gz"
Recovery steps
Stop high‑risk changes (DDL, bulk imports, large cleanup jobs).
Rollback my.cnf or dynamic variables to the pre‑change version.
Validate instance start‑up, replication health, key tables read/write, and business sanity checks.
If needed, perform point‑in‑time restore using the latest physical backup plus binlog.
Conclusion
Always establish a baseline before any change.
Focus on slow‑SQL templates rather than isolated outliers.
Execution plans must align with business filters; correct index order matters.
Parameter tuning must consider the workload model (connections, I/O, Buffer Pool).
CPU alone is not a reliable health indicator; monitor I/O, locks, and replication.
Never modify production without a tested backup and rollback plan.