How to Diagnose and Fix MySQL CPU Spikes to 100% in Production
This guide walks you through a complete, step‑by‑step process for identifying why MySQL CPU usage jumps to 100%, from initial symptom verification and data‑flow analysis to locating slow queries, killing them, optimizing SQL, adding indexes, and setting up monitoring and alerts to prevent recurrence.
Scenario
In a production environment MySQL CPU usage suddenly spikes to 100%, causing slow or timed‑out responses. The guide assumes MySQL 5.7+ / MariaDB 10.3+ on Linux with Performance Schema enabled and root/DBA privileges.
Diagnostic Workflow
The workflow consists of six steps: confirm the symptom, locate the problematic layer, apply rapid mitigation, perform root‑cause analysis, implement permanent fixes, and set up monitoring to prevent recurrence.
Step 1 – Immediate Confirmation
Verify that MySQL is the source of the CPU spike.
Overall CPU: top -bn1 | head -20
Per-core usage: mpstat -P ALL 1 5
MySQL process CPU: pidstat -p $(pgrep mysqld) 1 5
Disk I/O: iostat -xm 2 5
Memory: free -h
Network connections: ss -s
Key metric meanings:
%us – user-mode CPU (query execution)
%sy – kernel-mode CPU (system calls)
%wa – I/O wait
%idle – idle CPU
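The checks above can be combined into one pass. The sketch below is a minimal triage helper, not a hardened tool: all binaries are the standard procps/sysstat/iproute2 ones, and any that are missing are simply skipped.

```shell
#!/bin/bash
# triage.sh - minimal sketch combining the Step 1 checks into one snapshot.
triage() {
    echo "=== CPU triage snapshot: $(date) ==="
    command -v top    >/dev/null && top -bn1 | head -20   # overall CPU
    command -v free   >/dev/null && free -h               # memory
    command -v mpstat >/dev/null && mpstat -P ALL 1 2     # per-core usage
    command -v iostat >/dev/null && iostat -xm 2 2        # disk I/O
    command -v ss     >/dev/null && ss -s                 # socket summary
    local pids=""
    command -v pgrep  >/dev/null && pids=$(pgrep -d, mysqld)
    # per-process CPU for mysqld, only if it is actually running
    [ -n "$pids" ] && command -v pidstat >/dev/null && pidstat -p "$pids" 1 2
    echo "=== snapshot complete ==="
}
triage
```

Redirect the output to a timestamped file so you have a before/after record of the incident.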
Capture a snapshot before making changes:
which top mpstat iostat pidstat
pgrep -a mysqld
Step 2 – Locate Slow Queries (MySQL Layer)
Show all running queries:
mysql -uroot -p -e "SHOW FULL PROCESSLIST\G"
Count connections per state:
SELECT state, COUNT(*) FROM information_schema.PROCESSLIST WHERE command!='Sleep' GROUP BY state;
Top 10 longest-running queries:
SELECT id, user, host, db, time, LEFT(info,100) FROM information_schema.PROCESSLIST WHERE command!='Sleep' ORDER BY time DESC LIMIT 10;
Historical CPU-time aggregation via Performance Schema:
SELECT digest_text, count_star, ROUND(sum_timer_wait/1e12,2) AS total_sec FROM performance_schema.events_statements_summary_by_digest ORDER BY sum_timer_wait DESC LIMIT 10;
Real-time slow statements (execution >1 s):
SELECT t.processlist_id, ec.event_name, ec.sql_text, ROUND(ec.timer_wait/1e12,2) AS exec_sec FROM performance_schema.events_statements_current ec JOIN performance_schema.threads t ON ec.thread_id=t.thread_id WHERE ec.timer_wait>1e12 ORDER BY ec.timer_wait DESC LIMIT 10;
(Performance Schema thread IDs are not PROCESSLIST IDs, so the join must go through performance_schema.threads.)
Important columns: Time, State, rows_examined, rows_sent, digest_text.
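Beyond raw time, the digest table can also rank statements by how many rows they examine per row returned, a quick index-miss signal. Column names below are as defined by the Performance Schema; the 100 000-row threshold is an arbitrary starting point:

```sql
-- Statements that examine far more rows than they send back
SELECT digest_text,
       count_star,
       sum_rows_examined,
       sum_rows_sent,
       ROUND(sum_rows_examined / GREATEST(sum_rows_sent, 1)) AS scan_ratio
FROM performance_schema.events_statements_summary_by_digest
WHERE sum_rows_examined > 100000
ORDER BY scan_ratio DESC
LIMIT 10;
```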
Step 3 – Emergency Mitigation (Kill or Rate‑Limit)
Kill a specific query:
KILL 987654;
Batch kill queries running >30 s:
SELECT CONCAT('KILL ',id,';') FROM information_schema.PROCESSLIST WHERE time>30 AND command='Query';
This only generates the KILL statements; pipe the output back into mysql (or use pt-kill) to execute them.
Temporarily limit user connections (prevent connection storms):
ALTER USER 'app_user'@'%' WITH MAX_USER_CONNECTIONS 20;
Application-level rate limiting (e.g., ProxySQL query rules):
INSERT INTO mysql_query_rules (rule_id, active, match_pattern, delay, apply) VALUES (1,1,'^SELECT.*FROM orders',100,1);
LOAD MYSQL QUERY RULES TO RUNTIME;
(mysql_query_rules has no max_qps column; delay adds the given number of milliseconds to every matching query, which caps effective throughput.)
Verify mitigation:
# CPU drop – store credentials in ~/.my.cnf first; an interactive -p prompt would block watch
watch -n 2 'mysql -uroot -e "SHOW PROCESSLIST" | grep "Sending data" | wc -l'
# Active threads
mysql -uroot -p -e "SHOW STATUS LIKE 'Threads_running';"
# Business response time
time curl -s "http://api.example.com/orders?user_id=123" > /dev/null
Step 4 – Root-Cause Analysis (EXPLAIN + Slow-Query Log)
Assume the problematic SQL is:
SLOW_SQL="SELECT * FROM orders o LEFT JOIN order_items oi ON o.id=oi.order_id WHERE o.create_time>'2024-01-01' ORDER BY o.create_time DESC"
Run EXPLAIN:
mysql -uroot -p ecommerce -e "EXPLAIN $SLOW_SQL\G"
Typical problematic output:
type=ALL – full table scan
key=NULL – no index used
Rows scanned in the millions
Extra: Using temporary; Using filesort
For MySQL 8.0.18+ use EXPLAIN ANALYZE to see actual timing.
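A sketch of the 8.0.18+ variant against the running example (table and column names taken from the scenario above):

```sql
-- Unlike plain EXPLAIN, EXPLAIN ANALYZE executes the statement and prints
-- actual row counts and per-step timings alongside the optimizer's estimates
EXPLAIN ANALYZE
SELECT o.id, o.user_id, o.create_time
FROM orders o
WHERE o.create_time > '2024-01-01'
ORDER BY o.create_time DESC
LIMIT 1000;
```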
Inspect table definition and existing indexes:
mysql -uroot -p ecommerce -e "SHOW CREATE TABLE orders\G"
mysql -uroot -p ecommerce -e "SHOW INDEX FROM orders\G"
Analyze the slow-query log with Percona Toolkit:
pt-query-digest --since 1h /var/lib/mysql/slow.log
Step 5 – SQL Optimization & Index Creation
Typical findings for the example query:
Missing index on orders.create_time (used in WHERE and ORDER BY)
Missing index on order_items.order_id (join predicate)
SELECT * returns unnecessary columns
Fixes:
# Create index on create_time
mysql -uroot -p ecommerce -e "CREATE INDEX idx_create_time ON orders(create_time);"
# Create index on order_id for the join
mysql -uroot -p ecommerce -e "CREATE INDEX idx_order_id ON order_items(order_id);"
# Verify indexes
mysql -uroot -p ecommerce -e "SHOW INDEX FROM orders WHERE Key_name='idx_create_time'\G"
# Re‑run EXPLAIN to confirm index usage
mysql -uroot -p ecommerce -e "EXPLAIN $SLOW_SQL\G"
# Rewrite SQL to select only needed columns and add LIMIT
OPTIMIZED_SQL="SELECT o.id, o.user_id, o.create_time, o.status, oi.product_id, oi.quantity FROM orders o LEFT JOIN order_items oi ON o.id=oi.order_id WHERE o.create_time>'2024-01-01' ORDER BY o.create_time DESC LIMIT 1000;"
# Benchmark before/after
time mysql -uroot -p ecommerce -e "$SLOW_SQL" > /dev/null
time mysql -uroot -p ecommerce -e "$OPTIMIZED_SQL" > /dev/null
Typical result: execution time drops from >200 s to <1 s.
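If a plain CREATE INDEX would block writes for too long on a large table, the online-DDL form (MySQL 5.6+) makes the expectations explicit and fails immediately if they cannot be met:

```sql
-- INPLACE avoids a full table copy; LOCK=NONE keeps the table readable and
-- writable during the build. The statement errors out instead of silently
-- falling back to a blocking copy.
ALTER TABLE orders
  ADD INDEX idx_create_time (create_time),
  ALGORITHM=INPLACE, LOCK=NONE;
```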
Common pitfalls:
Online DDL may still lock the table for a long time – use ALGORITHM=INPLACE, LOCK=NONE on MySQL 5.6+
Insufficient disk space for the new index
Out‑of‑date statistics – run ANALYZE TABLE after index creation
Implicit type conversion (e.g., comparing a VARCHAR column to an unquoted number) prevents index use
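A sketch of the implicit-conversion pitfall, assuming a hypothetical users table with an indexed VARCHAR phone column:

```sql
-- Number literal: MySQL must cast the column for every row, so the index
-- on phone is NOT used (type=ALL in EXPLAIN)
EXPLAIN SELECT * FROM users WHERE phone = 13800138000;
-- String literal matching the column type: index used
EXPLAIN SELECT * FROM users WHERE phone = '13800138000';
```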
Step 6 – Monitoring, Validation & Prevention
Continuously verify that CPU usage stays low and that slow‑query count does not grow.
CPU usage (every 5 s):
# Assumes credentials in ~/.my.cnf; an interactive -p prompt would block watch
watch -n 5 "top -bn1 | grep mysqld | awk '{print \$9}'"
Slow-query count (every 10 s):
watch -n 10 'mysql -uroot -e "SHOW GLOBAL STATUS LIKE \"Slow_queries\";"'
Active threads (every 5 s):
watch -n 5 'mysql -uroot -e "SHOW STATUS LIKE \"Threads_running\";"'
Enable long-running slow-query logging permanently:
SET GLOBAL slow_query_log=1;
SET GLOBAL long_query_time=2;
SET GLOBAL log_queries_not_using_indexes=1;
These SET GLOBAL settings do not survive a restart – persist them in the [mysqld] section of my.cnf as well.
Automate daily analysis with a cron job:
# /usr/local/bin/analyze_slow_log.sh
#!/bin/bash
REPORT=/var/log/mysql_slow_report_$(date +%Y%m%d).txt
pt-query-digest --limit 20 /var/lib/mysql/slow.log > $REPORT
mail -s "MySQL Slow Query Daily Report" [email protected] < $REPORT
Install the cron entry (daily at 08:00):
0 8 * * * /usr/local/bin/analyze_slow_log.sh
Common Faults & Troubleshooting
#1 – CPU 100 % but no slow queries
Cause: connection‑pool size too large or leak.
Quick fix: restart application, set wait_timeout=60.
Permanent fix: tune pool size, enable automatic connection reclamation.
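The quick fix above can be applied and verified from the MySQL prompt (the 60 s value is illustrative):

```sql
-- How many connections are idle right now?
SELECT COUNT(*) AS sleeping
FROM information_schema.PROCESSLIST
WHERE command = 'Sleep';
-- Reclaim idle connections after 60 s (affects new connections only)
SET GLOBAL wait_timeout = 60;
SET GLOBAL interactive_timeout = 60;
```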
#2 – Single core maxed out
Cause: one long‑running query or lock wait.
Quick fix: KILL the offending query.
Permanent fix: rewrite SQL, split transaction, adjust isolation level.
#3 – High CPU & high I/O
Cause: disk I/O bottleneck.
Quick fix: increase innodb_io_capacity or move data to SSD.
Permanent fix: hardware upgrade, query optimization.
#4 – Optimized query still slow
Cause: low‑selectivity index or stale statistics.
Quick fix: use FORCE INDEX.
Permanent fix: create a better composite index, refresh stats.
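A sketch of both fixes applied to the running example (index name from Step 5):

```sql
-- Quick fix: pin the optimizer to the intended index
SELECT o.id, o.user_id, o.create_time
FROM orders o FORCE INDEX (idx_create_time)
WHERE o.create_time > '2024-01-01'
ORDER BY o.create_time DESC
LIMIT 1000;
-- Follow up by refreshing the optimizer statistics
ANALYZE TABLE orders;
```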
#5 – Scheduled task spikes CPU
Cause: nightly statistics or batch job.
Quick fix: pause or reschedule the task.
Permanent fix: rewrite job to run incrementally or use materialized view.
#6 – Index exists but not used
Cause: implicit type conversion, functions on indexed columns, OR conditions.
Quick fix: rewrite SQL to avoid conversions, match data types.
Permanent fix: enforce coding standards, enable SQL audit.
Gray‑Release & Rollback Strategy
Apply changes safely using a staged approach.
# Stage 1 – Verify on test environment
# Stage 2 – Create index on replica first
mysql -h192.168.1.11 -uroot -p -e "CREATE INDEX idx_create_time ON orders(create_time);"
# Stage 3 – Apply on primary during low‑traffic window (online DDL)
mysql -h192.168.1.10 -uroot -p -e "CREATE INDEX idx_create_time ON orders(create_time) ALGORITHM=INPLACE LOCK=NONE;"
# Stage 4 – Deploy application changes gradually (10% → 100%)
Rollback conditions:
Index creation locks table >1 min
Disk free space drops below 10 % after index
QPS degrades >20 % post‑optimization
Rollback steps:
# Drop the new index
mysql -uroot -p -e "DROP INDEX idx_create_time ON orders;"
# Revert application code/configuration if needed
# Verify removal
mysql -uroot -p -e "SHOW INDEX FROM orders;"
Backup Script (Run Before Optimization)
#!/bin/bash
BACKUP_DIR=/data/backup/mysql_$(date +%Y%m%d_%H%M%S)
mkdir -p $BACKUP_DIR
# Table schema
mysqldump -uroot -p --no-data ecommerce orders order_items > $BACKUP_DIR/schema.sql
# Index definitions
mysql -uroot -p -e "SHOW INDEX FROM orders" > $BACKUP_DIR/indexes_orders.txt
mysql -uroot -p -e "SHOW INDEX FROM order_items" > $BACKUP_DIR/indexes_order_items.txt
# Table statistics
mysql -uroot -p -e "SHOW TABLE STATUS FROM ecommerce WHERE Name IN ('orders','order_items')" > $BACKUP_DIR/table_stats.txt
# Slow-query log snapshot (copies the whole file)
cp /var/lib/mysql/slow.log $BACKUP_DIR/slow_log_before.log
echo "Backup completed: $BACKUP_DIR"
Best Practices
Enable slow‑query log with long_query_time=2 and log_queries_not_using_indexes=1.
Design indexes following the left‑most prefix rule; keep per‑table index count < 5.
Avoid SELECT *; select only required columns.
Refresh statistics regularly: ANALYZE TABLE orders, order_items;
Configure connection pools sensibly (e.g., HikariCP maximum-pool-size 20).
Use the query cache only on MySQL 5.7 and earlier, and with caution – it was deprecated in 5.7 and removed entirely in 8.0.
Profile individual queries with SET profiling=1; and SHOW PROFILES;.
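A minimal profiling session looks like this (the profiling feature is deprecated in favor of the Performance Schema but still works; the query number comes from the SHOW PROFILES listing):

```sql
SET profiling = 1;
SELECT COUNT(*) FROM orders WHERE create_time > '2024-01-01';
SHOW PROFILES;                  -- lists recent queries with their Query_ID
SHOW PROFILE CPU FOR QUERY 1;   -- per-stage CPU breakdown for that query
```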
FAQ
Q1: Is high CPU always caused by slow queries? Not necessarily – it can also be caused by connection storms, lock contention, replication lag, or OS‑level processes.
Q2: Why does EXPLAIN sometimes look fine but the query is still slow? EXPLAIN shows the planned execution path, not actual runtime statistics. Stale statistics, data skew, or I/O bottlenecks can make the plan inaccurate. Use EXPLAIN ANALYZE (MySQL 8.0.18+) for real execution metrics.
Q3: When should I add an index? Add an index when:
Query execution time > 2 s.
EXPLAIN shows type=ALL (full table scan).
Rows examined > 10 k with low filtered percentage.
Extra contains Using filesort or Using temporary.
Q4: How to design a composite index? Follow these principles:
Equality columns first, then range columns.
High‑selectivity columns should lead the index.
Include covering columns when possible to avoid back‑table lookups.
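Applying the three principles to a hypothetical lookup on the example schema (equality on user_id, range on create_time, status read but not filtered):

```sql
-- Equality column leads, range column second, covering column appended
-- so the query can be answered from the index alone.
CREATE INDEX idx_user_time_status ON orders (user_id, create_time, status);
```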
Q5: How to avoid deep‑pagination performance issues? Use delayed join or bookmark pagination, or fetch primary keys first and then join to the full rows.
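A bookmark (keyset) pagination sketch against the example orders table; the bookmark values below are placeholders. Each page remembers the last key it saw instead of using a growing OFFSET:

```sql
-- Page 1
SELECT id, user_id, create_time FROM orders
ORDER BY create_time DESC, id DESC
LIMIT 20;
-- Page 2: pass the last (create_time, id) of page 1 as the bookmark;
-- the row-constructor comparison can use an index on (create_time, id)
SELECT id, user_id, create_time FROM orders
WHERE (create_time, id) < ('2024-01-01 12:00:00', 98765)
ORDER BY create_time DESC, id DESC
LIMIT 20;
```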
Q6: How to handle sudden traffic spikes? Apply rate‑limiting, degrade non‑critical features, scale out read replicas, or trigger circuit‑breaker logic.
Q7: When should I NOT add an index? Avoid indexes on:
Very small tables (< 1 k rows).
Low‑cardinality columns (e.g., gender).
Columns that are frequently updated.
BLOB/TEXT columns without a prefix.
Prometheus Monitoring Example
# /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: 'mysqld_exporter'
    static_configs:
      - targets: ['192.168.1.10:9104']
        labels:
          instance: 'mysql-master'
          cluster: 'production'

# /etc/prometheus/rules/mysql_cpu_alerts.yml
groups:
  - name: mysql_performance
    interval: 10s
    rules:
      - alert: MySQLCPUHigh
        # Note: process_cpu_seconds_total from mysqld_exporter measures the
        # exporter process itself; for the database host's CPU, scrape
        # node_exporter and alert on node_cpu_seconds_total instead.
        expr: rate(process_cpu_seconds_total{job="mysqld_exporter"}[1m])*100 > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "MySQL CPU usage too high"
          description: "Instance {{ $labels.instance }} has CPU >80% for 5 minutes"
          action: "Check slow queries, kill or add indexes"
Key References
MySQL Performance Optimization – https://dev.mysql.com/doc/refman/8.0/en/optimization.html
MySQL EXPLAIN Documentation – https://dev.mysql.com/doc/refman/8.0/en/explain-output.html
Percona Toolkit – https://www.percona.com/software/database-tools/percona-toolkit
mysqld_exporter – https://github.com/prometheus/mysqld_exporter
PMM – https://www.percona.com/software/database-tools/percona-monitoring-and-management
MaGe Linux Operations