Databases 44 min read

How to Completely Resolve MySQL CPU Spikes: Real‑World Fault Replay and Optimization Guide

This article walks you through a systematic, step‑by‑step process for diagnosing and fixing MySQL CPU usage spikes—from identifying the symptoms and gathering system metrics, to pinpointing problematic queries, analyzing locks and buffers, applying index and configuration tweaks, and validating the performance gains with real‑world examples and command‑line tools.

Ops Community
Ops Community
Ops Community
How to Completely Resolve MySQL CPU Spikes: Real‑World Fault Replay and Optimization Guide

Problem Background and Applicable Scenarios

MySQL CPU usage can spike suddenly and stay high, causing slow responses, connection pile‑up and service outages. Typical triggers include:

Flash‑sale traffic bursts

Accumulated slow queries during peak hours

Poor schema or missing indexes leading to full‑table scans and temporary tables

Lock contention or long‑running transactions

Mis‑configured buffers, connections or thread pools

Massive UPDATE/DELETE or ALTER TABLE operations

Applicable environments: Linux servers running MySQL 5.7/8.0, cloud‑hosted instances (Alibaba Cloud, Tencent Cloud, AWS, etc.), and DBAs/DevOps engineers who need a systematic troubleshooting methodology.

Common Causes of MySQL CPU Spikes

Query‑level Issues

Slow query : execution time exceeds long_query_time, often with full‑table scans, temporary tables or file sorts. Typical culprits are missing indexes, mismatched join types, SELECT *, or lack of pagination.

Full table scan : optimizer chooses a scan over an index, causing row‑by‑row parsing.

Temporary table & file sort : GROUP BY / ORDER BY that cannot use an index creates on‑disk temporary tables or in‑memory sorts, increasing CPU.

Cartesian join : missing or wrong join conditions produce a Cartesian product, exploding result size and CPU usage.

Lock & Concurrency Issues

Long‑running transactions hold row or table locks, blocking other writes.

Coarse lock granularity : table locks or large range scans under REPEATABLE‑READ increase contention.

Gap locks & next‑key locks during unique‑index range scans can cause heavy competition on bulk inserts.

Hot‑row contention : many transactions updating the same row cause rollbacks and CPU waste.

Connection & Thread Issues

Max connections exhausted : low max_connections or connection leaks queue threads, consuming CPU.

Frequent thread creation/destruction : using thread_handling=0 (one thread per connection) without a thread pool makes short‑lived connections expensive.

Improper thread‑pool settings : mis‑configured thread_pool_size leads to contention or under‑utilisation of CPU cores.

Buffer & Cache Issues

InnoDB buffer pool too small : frequent page swaps increase I/O wait and indirectly raise CPU.

Query cache abuse (MySQL 5.7‑and‑earlier) : writes invalidate cache entries, causing lock contention.

Binlog sync overhead : sync_binlog=1 forces a disk flush on every transaction, adding CPU under high write rates.

Architecture & Infrastructure Issues

Multiple MySQL instances sharing the same server or competing with other CPU‑intensive processes.

NUMA memory allocation imbalance when innodb_numa_interleave=OFF.

Disk I/O bottlenecks (HDD vs SSD, RAID mis‑configuration).

CPU hardware problems or BIOS power‑management settings disabled.

Troubleshooting Process and Practical Steps

Step 1 – Confirm Symptoms and Impact Scope

When an alert appears, first verify the scope before logging into MySQL.

Verify alert information (monitoring tools such as Prometheus, Grafana, Zabbix): CPU usage (user, system, iowait), current MySQL connections vs max_connections, average and P99 query latency, QPS/TPS, slow‑query count in the last 5 / 15 minutes. Treat CPU > 80 % for > 5 minutes as a serious issue.

# Check MySQL service status
systemctl status mysql
# Or
ps -ef | grep mysqld

# Show MySQL version
mysql -e "SELECT VERSION();"

# Count current connections
mysql -e "SHOW PROCESSLIST;" | wc -l

Ask the application team whether all APIs are slow or only specific ones, and whether reads or writes are affected.

Step 2 – System‑Level Checks (Exclude Non‑MySQL Factors)

Process CPU : top -c or pidstat -p $(pidof mysqld) -u 1 5. If mysqld is the top CPU consumer, the problem is inside MySQL.

Load average : uptime. A 1‑minute load far higher than 5‑ and 15‑minute averages indicates a recent burst.

CPU core utilisation : mpstat -P ALL 1 3. Single‑thread 100 % usage suggests insufficient concurrency; many fully‑busy cores indicate query‑driven load.

Disk I/O : iostat -xz 1 3. Pay attention to %util (near 100 % = bottleneck), await (>10 ms SSD, >50 ms HDD) and avgqu‑sz .

Memory : free -h. If available memory is tiny and swap is used, swapping can dramatically increase CPU load.

Step 3 – MySQL‑Level Investigation (Locate the Problematic Query)

Running queries :

SHOW FULL PROCESSLIST;
-- MySQL 8.0 preferred view
SELECT p.id, p.user, p.host, p.db, p.command, p.time, p.state,
       LEFT(p.info,100) AS current_query,
       ROUND(t.timer_wait/1000000000000,3) AS exec_time_sec
FROM performance_schema.events_statements_current c
JOIN performance_schema.threads t ON c.thread_id = t.thread_id
JOIN information_schema.processlist p ON p.id = t.processlist_id
WHERE p.command != 'Sleep'
ORDER BY exec_time_sec DESC
LIMIT 20;

Explain plan for the identified slow query:

EXPLAIN SELECT u.id, u.username, u.email, o.order_id, o.total_amount
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
WHERE u.created_at > '2024-01-01'
ORDER BY u.id DESC
LIMIT 100;

Key fields: type (ALL = full scan), key (actual index), rows (estimated rows), Extra (filesort, temporary, using where, using index condition).

EXPLAIN ANALYZE (MySQL 8.0) runs the query and returns actual execution statistics (use on a test replica).

Enable slow‑query log if not already on:

SHOW VARIABLES LIKE 'slow_query%';
SHOW VARIABLES LIKE 'long_query_time';
SET GLOBAL log_output = 'TABLE';
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 1;  -- seconds
SET GLOBAL log_queries_not_using_indexes = ON;
SELECT * FROM mysql.slow_log ORDER BY start_time DESC LIMIT 20;

Aggregate analysis with Percona Toolkit:

# yum install percona-toolkit -y
pt-query-digest /var/log/mysql/slow.log \
  --since='2024-06-15 10:00:00' \
  --until='2024-06-15 11:00:00' \
  --limit=20 \
  --report-format=profile

Step 4 – Deep Analysis (Locks, Connections, Buffer Pool)

InnoDB lock waits :

SELECT r.trx_id AS waiting_trx_id,
       r.trx_mysql_thread_id AS waiting_thread,
       r.trx_query AS waiting_query,
       b.trx_id AS blocking_trx_id,
       b.trx_mysql_thread_id AS blocking_thread,
       b.trx_query AS blocking_query,
       b.trx_started AS blocking_started,
       b.trx_rows_locked AS blocking_rows_locked,
       b.trx_tables_locked AS blocking_tables_locked
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX b ON w.blocking_trx_id = b.trx_id
JOIN information_schema.INNODB_TRX r ON w.requesting_trx_id = r.trx_id;

SELECT * FROM information_schema.INNODB_TRX WHERE trx_state='LOCK WAIT' ORDER BY trx_started;

Connection statistics :

SHOW STATUS LIKE 'Threads%';
-- Example output:
-- Threads_connected: 245
-- Threads_created: 1234
-- Threads_running: 15
-- Threads_cached: 20

High Threads_running indicates heavy concurrency.

InnoDB buffer pool health :

SHOW STATUS LIKE 'Innodb_buffer_pool%';
-- Calculate hit rate: 1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)
-- Hit rate < 99 % suggests the pool is undersized.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'innodb_buffer_pool_instances';

Temporary table and sort statistics :

SHOW STATUS LIKE 'Created_tmp%';
SHOW STATUS LIKE 'Sort%';
-- High Created_tmp_disk_tables indicates many on‑disk temp tables.
-- High Sort_merge_passes suggests increasing <code>sort_buffer_size</code>.

Step 5 – Index Optimization and SQL Rewrites

Covering index example :

CREATE INDEX idx_orders_covering ON orders(status, created_at)
INCLUDE (order_id, user_id, total_amount);

Left‑most prefix rule : in a composite index the query must filter on the leftmost columns.

SQL rewrite tips :

Avoid SELECT *; fetch only needed columns.

Break large scans into primary‑key ranges.

Use LIMIT with a primary‑key condition for deep pagination.

Replace IN (subquery) with EXISTS or a JOIN when the subquery returns many rows.

Example rewrite:

-- Inefficient
SELECT * FROM users WHERE id IN (SELECT user_id FROM orders WHERE total_amount>1000);

-- Efficient
SELECT u.* FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id=u.id AND o.total_amount>1000);

Step 6 – Configuration Tuning

Connection & thread settings (my.cnf) :

max_connections = 2000
thread_cache_size = 50
wait_timeout = 600
interactive_timeout = 600
max_allowed_packet = 64M

Thread‑cache hit rate = (1 - Threads_created/Connections) * 100 %.

InnoDB core settings :

innodb_buffer_pool_size = 20G
innodb_buffer_pool_instances = 8
innodb_log_file_size = 2G
innodb_flush_log_at_trx_commit = 1   -- 0/1/2 trade‑off safety vs performance
sync_binlog = 1                     -- safest; increase N for less sync overhead
innodb_thread_concurrency = 0       -- let InnoDB auto‑tune
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000
innodb_lock_wait_timeout = 50
innodb_deadlock_detect = ON

Query cache (MySQL 5.7 and earlier) :

query_cache_type = 0
query_cache_size = 0

MySQL 8.0 has removed the feature.

Step 7 – Fix, Verify, and Rollback

Backup (logical or physical) before any change.

# Logical backup
mysqldump -uroot -p --single-transaction --routines --triggers --events \
  --master-data=2 --flush-logs --all-databases > /backup/mysql/full_$(date +%Y%m%d_%H%M%S).sql

# Physical hot backup (Percona XtraBackup)
xtrabackup --backup --target-dir=/backup/mysql/ --user=root --password=xxx

Gray‑scale test on a staging environment; verify execution plan and performance.

Production window : schedule during low‑traffic hours (e.g., 02:00‑06:00).

Deploy and monitor CPU, connections, QPS, slow‑query count, latency.

Rollback plan : keep original my.cnf lines commented, prepare DROP INDEX scripts, or use online schema‑change tools for safe revert.

# Drop index if needed
DROP INDEX idx_orders_status_created ON orders;

Verification :

Compare slow‑query execution time before/after (e.g., SET profiling=1;).

Check buffer‑pool hit rate: SHOW STATUS LIKE 'Innodb_buffer_pool%';.

Run a real‑world benchmark (sysbench) to compare QPS and latency.

sysbench /usr/share/sysbench/oltp_read_only.lua \
  --mysql-host=localhost --mysql-port=3306 \
  --mysql-user=root --mysql-password=xxx \
  --mysql-db=test --tables=10 --table-size=1000000 \
  --threads=16 --time=60 run

Real‑World Case: MySQL CPU Spike Incident

Symptom

At 03:00 AM CPU usage exceeded 90 % for 10 minutes; order‑creation API latency jumped from ~50 ms to 800‑2000 ms with occasional timeouts.

Initial Confirmation

# Verify with top
top -b -n 1 | grep mysqld
# Load average
uptime
# Example: load average 18.42, 15.36, 12.36 on a 16‑core box

MySQL‑Level Findings

SHOW FULL PROCESSLIST;
-- About 30 connections in state "Sending data" with execution times 8‑45 s.

SELECT order_id, user_id, product_name, quantity, price, created_at
FROM orders o
JOIN products p ON o.product_id = p.id
WHERE o.status = 'pending'
  AND o.created_at >= '2024-06-01'
ORDER BY o.created_at DESC;

Root‑Cause Identification

EXPLAIN SELECT ...;
-- type: ALL (full table scan)
-- key: NULL
-- rows: 15,238,472
-- Extra: Using filesort

The orders table (≈15 M rows) lacked indexes on status and created_at, forcing a full scan and a filesort, which drove the CPU spike.

Fix Implementation

# Online index creation without locking
pt-online-schema-change \
  --alter "ADD INDEX idx_orders_status_created (status, created_at) INCLUDE (order_id, user_id, product_id)" \
  --user=root --password=xxx \
  D=shop,t=orders \
  --execute

Verification

EXPLAIN SELECT ...;
-- type: range
-- key: idx_orders_status_created
-- rows: ~28,530
-- Extra: Using index condition (no filesort)

pt-query-digest /var/log/mysql/slow.log --since='2024-06-15 00:00:00' --until='2024-06-15 01:00:00' --limit=20
-- Average execution time dropped from 12.5 s to 0.08 s (≈150× faster).
-- CPU usage fell from 90 % to 25 % within 5 minutes, stabilising around 15 %.

Root‑Cause Summary

No index on status and created_at caused a full‑table scan.

Filesort added heavy CPU consumption.

The nightly batch query on 15 M rows amplified the problem.

Command Cheat Sheet

System‑Level Commands

# Process CPU
top -c
htop
pidstat -p $(pidof mysqld) -u 1 5

# Load average
uptime
w

# CPU per core
mpstat -P ALL 1 3

# Disk I/O
iostat -xz 1 3

# Memory
free -h
cat /proc/meminfo

# Swap
swapon -s

MySQL‑Level Commands

# Version
SELECT VERSION();

# All connections
SHOW FULL PROCESSLIST;

# Connection statistics
SHOW STATUS LIKE 'Threads%';

# InnoDB status
SHOW ENGINE INNODB STATUS;

# Lock waits
SELECT * FROM information_schema.INNODB_LOCK_WAITS;
SELECT * FROM information_schema.INNODB_TRX WHERE trx_state='LOCK WAIT';

# Buffer pool statistics
SHOW STATUS LIKE 'Innodb_buffer_pool%';
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'innodb_buffer_pool_instances';

# Slow‑query configuration
SHOW VARIABLES LIKE 'slow_query%';
SHOW VARIABLES LIKE 'long_query_time';
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 1;
SET GLOBAL log_queries_not_using_indexes = ON;
SET GLOBAL log_output = 'TABLE';
SELECT * FROM mysql.slow_log ORDER BY start_time DESC LIMIT 20;

# Explain plan
EXPLAIN SELECT ...;
EXPLAIN ANALYZE SELECT ...;  -- runs the query

# Index inspection
SHOW INDEX FROM orders;
ANALYZE TABLE orders;

# Online DDL (no lock)
pt-online-schema-change --alter "ADD INDEX idx_xxx (col1, col2)" --user=root --password=xxx D=db,t=table --execute
gh-ost --database="db" --table="table" --alter="ADD INDEX idx_xxx (col1, col2)" --user="root" --password="xxx" --execute

Backup & Restore

# Logical backup
mysqldump -uroot -p --single-transaction --routines --triggers --events \
  --master-data=2 --flush-logs --all-databases > /backup/mysql/full_$(date +%Y%m%d_%H%M%S).sql

# Physical hot backup (Percona XtraBackup)
xtrabackup --backup --target-dir=/backup/mysql/ --user=root --password=xxx
xtrabackup --prepare --target-dir=/backup/mysql/

# Restore logical backup
mysql -uroot -p < /backup/mysql/full_20240615_030000.sql

Performance Stress Test

# sysbench preparation
sysbench /usr/share/sysbench/oltp_read_only.lua \
  --mysql-host=localhost --mysql-port=3306 \
  --mysql-user=root --mysql-password=xxx \
  --mysql-db=test --tables=10 --table-size=1000000 prepare

# Run test
sysbench /usr/share/sysbench/oltp_read_only.lua \
  --mysql-host=localhost --mysql-port=3306 \
  --mysql-user=root --mysql-password=xxx \
  --mysql-db=test --tables=10 --table-size=1000000 \
  --threads=16 --time=60 run

# Real‑time MySQL status
mysqladmin -uroot -p -i 1 extended-status

Precautions and Risk Points

Production‑Level Risks

Backup first : any schema change, index addition or configuration tweak must be preceded by a full backup.

Online DDL : use pt-online-schema-change or gh‑ost for large tables to avoid table locks.

Index impact on writes : each new index adds write overhead; for write‑heavy tables consider adding indexes during low‑traffic windows or after bulk loads.

Configuration changes (e.g., innodb_buffer_pool_size) often require a MySQL restart; test in a staging environment first.

Investigation Risks

EXPLAIN ANALYZE executes the query : on large tables it can add load; run on a replica or test server.

Enabling log_queries_not_using_indexes may generate massive log files; monitor disk usage.

PROCESSLIST queries can acquire internal locks under extreme concurrency; avoid frequent polling when the server is already saturated.

Validation Risks

Test vs production differences : hardware, data volume and concurrency differ; always perform a gray‑scale rollout before full production deployment.

Metric lag : buffer‑pool statistics and index statistics need time to warm up; give at least 30 minutes before drawing conclusions.

Summary

MySQL CPU spikes are multi‑layered problems that require a disciplined, hierarchical approach:

Confirm the symptom – use monitoring to gauge severity and scope.

System‑level diagnostics – rule out OS, I/O or memory bottlenecks.

MySQL‑level investigation – locate offending sessions, analyse execution plans, and examine locks, threads, and buffer‑pool health.

Deep analysis – dig into lock waits, connection leaks, and temporary‑table usage.

Fix and verify – apply covering indexes, rewrite SQL, tune configuration, and validate with real‑world metrics and stress tests.

Principles – diagnose before changing, treat slow queries as the primary enemy, prefer indexing over configuration tweaks, and always have a backup and rollback plan.

By regularly reviewing slow‑query logs, monitoring index effectiveness and keeping configuration aligned with workload characteristics, teams can shift from reactive firefighting to proactive performance stewardship, ensuring MySQL remains stable and responsive under load.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

databaseperformance tuningmysqltroubleshootingCPUIndex Optimization
Ops Community
Written by

Ops Community

A leading IT operations community where professionals share and grow together.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.