How to Diagnose and Fix MySQL Production Performance Drops
This comprehensive guide walks you through identifying the root causes of MySQL performance degradation in production, from gathering baseline metrics and analyzing slow‑query logs to examining execution plans, tuning parameters, applying configuration changes, and verifying improvements with concrete monitoring and backup strategies.
Overview
Production MySQL performance can degrade for many reasons: slow‑query accumulation, insufficient InnoDB buffer pool, redo‑log bottlenecks, lock contention, replication lag, or mis‑configured connection‑pool parameters. Without a baseline the typical reaction is to add more machines, which often masks the real cause.
Preparation
System snapshot
Collect a consistent set of OS and MySQL metrics before any change. The following commands give a quick view of CPU, memory, disk I/O, and key MySQL status variables.
date
hostname -f
uptime
free -h
df -h
iostat -xz 1 3
vmstat 1 5
mysql -e "SELECT VERSION();"
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_running';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Questions';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';"
mysql -e "SHOW PROCESSLIST;"
For master‑slave setups, also run SHOW SLAVE STATUS\G (or SHOW REPLICA STATUS\G on MySQL 8.0.22+) and SHOW GLOBAL STATUS LIKE 'Seconds_Behind_Master';
Install diagnostic tools
# Debian / Ubuntu
sudo apt update
sudo apt install -y percona-toolkit sysstat iotop jq
# RHEL / CentOS / Rocky
sudo yum install -y percona-toolkit sysstat iotop jq
Enable the slow‑query log if it is not already on; it is essential for production debugging.
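Before an incident is a bad time to discover a missing tool. A minimal sketch that checks whether the diagnostic tools used in this guide are actually on the PATH (the tool list mirrors the packages above plus mysqldumpslow, which ships with the server packages):

```shell
# check_tools: print any tools from the argument list that are not on PATH.
# Empty output means everything is installed.
check_tools() {
    missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    # echo without quotes to collapse the leading space
    echo $missing
}

# Check the tools used throughout this guide:
check_tools pt-query-digest mysqldumpslow iostat vmstat iotop jq
```

On a healthy host the call prints nothing; any output names packages to install before you start digging.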
Diagnosis workflow
Step 1 – Establish a baseline
Query the most relevant global status variables to identify whether CPU, I/O, locks, or connection buildup is the primary bottleneck.
SHOW GLOBAL STATUS LIKE 'Threads_running';
SHOW GLOBAL STATUS LIKE 'Threads_connected';
SHOW GLOBAL STATUS LIKE 'Questions';
SHOW GLOBAL STATUS LIKE 'Slow_queries';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';
SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_time';
SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables';
SHOW GLOBAL STATUS LIKE 'Sort_merge_passes';
How to read the results:
Threads_running high → thread pile‑up, likely slow SQL.
Innodb_buffer_pool_reads rising → buffer pool too small.
Created_tmp_disk_tables high → temp tables spilling to disk.
Innodb_row_lock_time high → lock‑wait amplification.
await/util from iostat high → storage layer bottleneck.
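Innodb_buffer_pool_reads (physical reads) is most useful relative to Innodb_buffer_pool_read_requests (logical reads): together they give the buffer pool hit ratio. A sketch that computes it from SHOW GLOBAL STATUS‑style output; the sample values at the bottom are illustrative, not from a real server:

```shell
# bp_hit_ratio: compute the InnoDB buffer pool hit ratio from
# SHOW GLOBAL STATUS output piped on stdin (name<TAB>value lines).
bp_hit_ratio() {
    awk '
        $1 == "Innodb_buffer_pool_read_requests" { requests = $2 }
        $1 == "Innodb_buffer_pool_reads"         { reads = $2 }
        END {
            if (requests > 0)
                printf "hit ratio: %.2f%%\n", 100 * (1 - reads / requests)
        }'
}

# Illustrative sample: 1,000,000 logical reads, 20,000 went to disk.
printf 'Innodb_buffer_pool_read_requests\t1000000\nInnodb_buffer_pool_reads\t20000\n' | bp_hit_ratio
```

In production you would feed it real output, e.g. `mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';" | bp_hit_ratio`. Sustained ratios well below ~99 % on an OLTP workload usually point at an undersized pool.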
Step 2 – Enable and analyze the slow‑query log
Set a reasonable threshold (1 s for most workloads, 500 ms for latency‑critical paths) and keep the log on.
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 1;
SET GLOBAL log_queries_not_using_indexes = OFF;
SHOW VARIABLES LIKE 'slow_query_log%';
SHOW VARIABLES LIKE 'long_query_time';
Persist the settings in my.cnf:
[mysqld]
slow_query_log=ON
slow_query_log_file=/var/log/mysql/slow.log
long_query_time=1
log_output=FILE
min_examined_row_limit=100
Analyze the log with Percona Toolkit or the built‑in utilities:
mysqldumpslow -s t -t 20 /var/log/mysql/slow.log
pt-query-digest /var/log/mysql/slow.log --limit=20
pt‑query‑digest groups identical query templates and shows total execution time, which is more representative of production impact than a single outlier.
Step 3 – Review execution plans
For each hot template, examine the plan before adding indexes.
EXPLAIN SELECT * FROM orders WHERE user_id = 1024 AND created_at >= '2026-03-08 00:00:00';
EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 1024 AND created_at >= '2026-03-08 00:00:00';
SHOW INDEX FROM orders;
What to look for:
type = ALL or index → full scan, high cost.
rows large → index ineffective.
Extra contains Using temporary or Using filesort → focus areas.
filtered low → index does not cover filter conditions.
Common pitfalls:
Index exists but column order is wrong, causing range‑scan truncation.
Stale statistics lead the optimizer to choose a bad plan. Refresh with ANALYZE TABLE tbl_name; when needed.
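The red flags above can be scanned for mechanically during triage. A rough sketch that flags the usual suspects in the tab‑separated output of `mysql -e "EXPLAIN ..."`; the sample row is illustrative and the matching is deliberately loose (a table literally named "index" would false‑positive):

```shell
# explain_flags: report common warning signs in tab-separated EXPLAIN output.
explain_flags() {
    awk -F'\t' '
        {
            for (i = 1; i <= NF; i++) {
                # type = ALL / index means a full table or full index scan
                if ($i == "ALL" || $i == "index")
                    print "row " NR ": full scan (type=" $i ")"
                if ($i ~ /Using filesort/)
                    print "row " NR ": filesort"
                if ($i ~ /Using temporary/)
                    print "row " NR ": temporary table"
            }
        }'
}

# Illustrative EXPLAIN row (id, select_type, table, type, possible_keys, key, Extra):
printf '1\tSIMPLE\tproduct\tALL\tidx_cat\tNULL\tUsing where; Using filesort\n' | explain_flags
```

Piping `mysql -e "EXPLAIN SELECT ..."` through this gives a quick yes/no on whether a template needs index work before you read the full plan.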
Applying changes and verification
Dynamic vs static variables
Dynamic variables can be changed online; static variables require a restart. Always back up before modifying.
mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
mysql -e "SET GLOBAL max_connections=800;"
mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit=1;"
mysql -e "SHOW VARIABLES LIKE 'max_connections';"
If a restart is needed, schedule it outside peak hours:
sudo systemctl status mysqld --no-pager || sudo systemctl status mysql --no-pager
Verification criteria
Threads_running clearly reduced.
Slow‑query volume drops; hot templates no longer dominate.
Disk await and util return to baseline levels.
Application response time recovers.
Sample configuration (my.cnf)
# File: /etc/my.cnf
[mysqld]
user=mysql
port=3306
bind-address=0.0.0.0
skip-name-resolve=ON
max_connections=800
max_connect_errors=1000
back_log=1024
table_open_cache=4096
table_definition_cache=4096
open_files_limit=65535
innodb_buffer_pool_size=24G
innodb_buffer_pool_instances=8
innodb_log_file_size=2G
innodb_log_files_in_group=2
innodb_flush_log_at_trx_commit=1
innodb_flush_method=O_DIRECT
innodb_io_capacity=2000
innodb_io_capacity_max=4000
innodb_flush_neighbors=0
innodb_read_io_threads=8
innodb_write_io_threads=8
tmp_table_size=256M
max_heap_table_size=256M
sort_buffer_size=4M
join_buffer_size=4M
read_buffer_size=2M
read_rnd_buffer_size=4M
slow_query_log=ON
slow_query_log_file=/var/log/mysql/slow.log
long_query_time=1
log_output=FILE
min_examined_row_limit=100
performance_schema=ON
character_set_server=utf8mb4
collation_server=utf8mb4_0900_ai_ci
[client]
default-character-set=utf8mb4
Key points:
skip-name-resolve=ON removes DNS jitter on connections.
Allocate 60‑75 % of RAM to innodb_buffer_pool_size on a dedicated DB host.
Adjust tmp_table_size / max_heap_table_size to avoid excessive on‑disk temp tables.
Per‑connection buffers (join_buffer_size, sort_buffer_size) should stay modest.
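A quick way to see why per‑connection buffers must stay modest is a worst‑case memory estimate: the buffer pool plus every connection allocating its session buffers at once. A sketch using the sample values from the config above (real usage is lower, since these buffers are allocated on demand):

```shell
# estimate_mb: worst-case memory footprint in MB.
# $1 = buffer pool MB, $2 = max_connections, $3 = per-connection buffer MB
estimate_mb() {
    echo $(( $1 + $2 * $3 ))
}

# 24 GB pool, 800 connections, 4+4+2+4 = 14 MB of session buffers each:
estimate_mb 24576 800 14
```

That comes to 35776 MB (about 35 GB) worst case, which would not fit on a 32 GB host; this is exactly why raising sort_buffer_size or join_buffer_size globally is risky.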
Helper script for on‑call snapshots
#!/usr/bin/env bash
set -euo pipefail
OUT_DIR="/var/log/mysql-perf/$(date +%F_%H%M%S)"
mkdir -p "$OUT_DIR"
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_running';" > "$OUT_DIR/threads_running.txt"
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';" > "$OUT_DIR/threads_connected.txt"
mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';" > "$OUT_DIR/slow_queries.txt"
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';" > "$OUT_DIR/buffer_pool_reads.txt"
mysql -e "SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables';" > "$OUT_DIR/tmp_disk_tables.txt"
mysql -e "SHOW GLOBAL STATUS LIKE 'Sort_merge_passes';" > "$OUT_DIR/sort_merge_passes.txt"
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_time';" > "$OUT_DIR/row_lock_time.txt"
mysql -e "SHOW FULL PROCESSLIST;" > "$OUT_DIR/processlist.txt"
mysql -e "SHOW ENGINE INNODB STATUS\G" > "$OUT_DIR/innodb_status.txt"
mysql -e "SELECT * FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 20;" > "$OUT_DIR/digest_top20.txt" 2>/dev/null || true
tail -200 /var/log/mysql/slow.log > "$OUT_DIR/slow_tail.txt" 2>/dev/null || true
iostat -xz 1 3 > "$OUT_DIR/iostat.txt"
vmstat 1 5 > "$OUT_DIR/vmstat.txt"
date > "$OUT_DIR/timestamp.txt"
echo "snapshot saved in $OUT_DIR"
Case studies
Case 1 – Slow‑SQL template spike during promotion
During a flash‑sale the product‑list API latency rose from 120 ms to 2.8 s. Threads_running jumped from ~20 to 180 while CPU stayed at 55 %.
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_running';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';"
pt-query-digest /var/log/mysql/slow.log --limit=10
mysql -e "EXPLAIN SELECT * FROM product WHERE category_id=8 AND status=1 ORDER BY updated_at DESC LIMIT 20;"
Findings:
Template executed ~23 k times per minute.
Single‑column index on category_id could not satisfy the ORDER BY updated_at DESC, causing Using filesort and scanning ~800 k rows per execution.
Solution – add a covering composite index and shorten the cache window:
ALTER TABLE product ADD INDEX idx_category_status_updated (category_id, status, updated_at DESC);
(Descending index key parts take effect on MySQL 8.0+; earlier versions parse DESC but ignore it.)
Result – P99 latency reduced to 210 ms, Threads_running fell to 26, and slow‑query count returned to near zero.
Case 2 – Buffer pool too small during month‑end settlement
Order‑service latency spiked (1‑3 s) while CPU was idle. Disk await rose to 18‑25 ms.
iostat -xz 1 5
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';"
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';"
mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size';"
Findings:
Innodb_buffer_pool_reads surged.
Buffer pool was only 8 GB on a 32 GB instance.
Batch settlement jobs competed with live traffic for hot pages.
Actions:
Throttle low‑priority settlement jobs.
Increase innodb_buffer_pool_size to 24 GB during off‑peak.
Split settlement batches to avoid simultaneous hot‑page contention.
Result – Disk await dropped to 2‑4 ms and order‑query P95 returned to ~120 ms.
Best practices & cautions
Performance
Keep the slow‑query log always on and set long_query_time to 1 s (or 500 ms for latency‑critical paths).
Prioritize index creation and SQL rewrite before adjusting global buffers; per‑connection buffers (sort_buffer_size, join_buffer_size, tmp_table_size) are not global limits.
Allocate 60‑75 % of RAM to the buffer pool on a dedicated DB host; avoid swapping at all costs.
Security
Restrict changes to high‑risk parameters such as innodb_flush_log_at_trx_commit, sync_binlog, and read_only / super_read_only.
Test backup and point‑in‑time recovery procedures before any structural or parameter change.
Limit large unindexed queries and full‑table exports in production; enforce audit and quota.
High availability
Use primary‑replica or Group Replication; support read‑write failover.
Monitor replication lag and cut read traffic from lagging replicas promptly.
Maintain physical backups plus binlog for point‑in‑time recovery; snapshot before any change.
Monitoring & alerting
Key metrics to monitor
Threads_running / Threads_connected
QPS / TPS
Slow_queries
Innodb_buffer_pool_read_requests vs Innodb_buffer_pool_reads
Created_tmp_disk_tables
Innodb_row_lock_waits / Innodb_row_lock_time
Redo‑log fsync latency, disk await / util
Replication lag and binlog generation rate
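Most of these metrics are cumulative counters, so dashboards and manual checks should look at rates, which is what rate() computes in Prometheus. A sketch of the same calculation from two Slow_queries samples taken an interval apart (sample numbers are illustrative):

```shell
# slow_query_rate: per-second rate from two cumulative Slow_queries samples.
# $1 = first sample, $2 = second sample, $3 = seconds between samples
slow_query_rate() {
    awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN { printf "%.2f/s\n", (b - a) / t }'
}

# Counter moved from 1200 to 1800 over 60 s:
slow_query_rate 1200 1800 60
```

Here the rate works out to 10 slow queries per second, which would comfortably trip a 5/s alert threshold.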
Prometheus alert example
# File: prometheus/rules/mysql_perf.yml
groups:
- name: mysql-performance
rules:
- alert: MySQLThreadsRunningHigh
expr: mysql_global_status_threads_running > 64
for: 5m
labels:
severity: warning
annotations:
summary: "MySQL running threads too high"
description: "{{ $labels.instance }} Threads_running continuously above 64"
- alert: MySQLSlowQueriesSpike
expr: rate(mysql_global_status_slow_queries[5m]) > 5
for: 5m
labels:
severity: critical
annotations:
summary: "MySQL slow query surge"
description: "{{ $labels.instance }} more than 5 slow queries per second"
- alert: MySQLBufferPoolMissHigh
expr: rate(mysql_global_status_innodb_buffer_pool_reads[5m]) > 100
for: 10m
labels:
severity: warning
annotations:
summary: "MySQL Buffer Pool random reads high"
description: "{{ $labels.instance }} high physical reads from Buffer Pool"
Backup and recovery workflow
Backup script (pre‑change snapshot)
#!/usr/bin/env bash
set -euo pipefail
OUT_DIR="/var/backups/mysql-prechange/$(date +%F_%H%M%S)"
mkdir -p "$OUT_DIR"
mysql -e "SHOW MASTER STATUS\G" > "$OUT_DIR/master_status.txt" 2>/dev/null || true
mysql -e "SHOW BINARY LOGS;" > "$OUT_DIR/binary_logs.txt"
mysql -e "SHOW VARIABLES;" > "$OUT_DIR/variables.txt"
mysql -e "SHOW GLOBAL STATUS;" > "$OUT_DIR/global_status.txt"
tar czf "${OUT_DIR}.tar.gz" "$OUT_DIR"
echo "backup saved to ${OUT_DIR}.tar.gz"
Recovery steps
Stop high‑risk changes (DDL, bulk imports, large cleanup jobs).
Rollback my.cnf or dynamic variables to the pre‑change version.
Validate instance start‑up, replication health, key tables read/write, and business sanity checks.
If needed, perform point‑in‑time restore using the latest physical backup plus binlog.
Conclusion
Always establish a baseline before any change.
Focus on slow‑SQL templates rather than isolated outliers.
Execution plans must align with business filters; correct index order matters.
Parameter tuning must consider the workload model (connections, I/O, Buffer Pool).
CPU alone is not a reliable health indicator; monitor I/O, locks, and replication.
Never modify production without a tested backup and rollback plan.