How to Quickly Diagnose and Fix a Frozen MySQL in Production: 5 Proven Steps
Facing a MySQL server that suddenly becomes unresponsive in production? This article walks through a five-step investigative process: checking process status, examining connections, locating lock waits, analyzing slow queries and system bottlenecks, and applying emergency recovery, illustrated with real-world examples and command-line snippets.
MySQL Freeze in Production? These 5 Steps Saved Me
Introduction
At 2 a.m. the phone rang: “MySQL is frozen, all services are down!” A sudden production‑level MySQL stall can cause business interruption, data loss, and customer complaints. After several incidents I distilled a reliable five‑step troubleshooting method to locate and resolve the problem quickly.
Technical Background: What "MySQL Freeze" Means
What is a MySQL freeze?
The term isn’t a formal technical phrase; it usually describes extremely slow or completely unresponsive database behavior. Common causes include:
Lock wait: table, row, or metadata locks blocking many threads
Slow query backlog: complex SQL consuming resources, exhausting the connection pool
Disk I/O bottleneck: full disks, insufficient I/O performance, or storage failures
Memory exhaustion: buffer pool or temporary tablespace shortage
Connection saturation: max_connections reached, new connections cannot be created
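A quick status check usually hints at which of these causes is in play before any deeper digging. A minimal sketch using standard status counters (assumes the mysql client can log in without prompting, e.g. via ~/.my.cnf; interpret the values against your own baseline):
# Which cause is in play? Compare the counters against your normal baseline.
mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
  ('Threads_connected','Threads_running',
   'Innodb_row_lock_current_waits','Innodb_buffer_pool_wait_free');"
mysql -e "SHOW VARIABLES LIKE 'max_connections';"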
MySQL thread state mechanism
Each client connection maps to a thread that can be in various states, such as Sleep, Query, Locked, Waiting for table metadata lock, Sending data, Sorting result, Creating tmp table, etc. When many threads stay in a non‑normal state for a long time, the server appears “frozen”.
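To see how connection threads are distributed across these states at a glance, performance_schema can aggregate them. A minimal sketch (performance_schema is enabled by default on 5.7 and later):
mysql -e "SELECT processlist_state, COUNT(*) AS threads
  FROM performance_schema.threads
  WHERE processlist_state IS NOT NULL
  GROUP BY processlist_state
  ORDER BY threads DESC;"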
Real‑world incident
During the 2023 Double‑11 promotion, traffic surged and MySQL stalled. Connections jumped from 200 to 2000 in ten seconds, and all API requests timed out. The root cause was an unindexed join on a large table that held a table lock for too long, causing a cascade effect.
Core Content: Detailed 5‑Step Troubleshooting
Step 1 – Quickly verify MySQL process status
Goal: Within 30 seconds confirm the MySQL process is alive and check basic resource usage.
Commands
# Check MySQL process status
ps aux | grep mysqld
# View resource consumption of the MySQL process
top -p $(pidof mysqld)
# Quick system load check
uptime
# Disk space check
df -h
Key indicators
Process existence – if missing, MySQL has crashed; check error logs and restart.
CPU usage – near 100% suggests slow queries or full scans; low usage may indicate lock wait or I/O bottleneck.
Memory usage – approaching physical limit or OOM?
Disk space – data or system disk full?
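If memory looks tight, also check whether the kernel OOM killer has already hit mysqld and whether the error log shows a recent restart. A quick sketch (log paths vary by distribution):
# Has the kernel OOM killer terminated mysqld?
dmesg -T | grep -i -E 'out of memory|killed process' | tail -n 20
# Recent restarts or crash-recovery messages
grep -i -E 'ready for connections|crash|signal' /var/log/mysql/error.log | tail -n 20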
Practical tip
I keep two terminals: one running SHOW PROCESSLIST continuously, the other for analysis and KILL commands, allowing real‑time observation of thread state changes.
# Continuous monitoring
watch -n 2 'mysql -e "SHOW PROCESSLIST\G" | grep -E "(Id|User|State|Time|Info)"'
Step 2 – Inspect current connections and thread states
Goal: Identify running SQL and blocking situations.
Core commands
-- View all connections
SHOW FULL PROCESSLIST;
-- Threads running longer than 2 seconds
SELECT * FROM information_schema.PROCESSLIST WHERE COMMAND != 'Sleep' AND TIME > 2 ORDER BY TIME DESC;
-- Count threads per state
SELECT STATE, COUNT(*) AS count FROM information_schema.PROCESSLIST GROUP BY STATE ORDER BY count DESC;
Key fields in SHOW PROCESSLIST
Id: thread ID, usable with KILL
User: connection user
Host: client IP
db: current database
Command: type of command (Query/Sleep…)
Time: duration in seconds
State: thread state
Info: executing SQL statement
Danger signals
Many threads in Locked or Waiting for table metadata lock
A single query with Time > 60 s still running
Info shows obvious slow query or full table scan
Sleep connections approaching max_connections
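All four signals can be checked in one pass against information_schema.PROCESSLIST. A minimal sketch (the 60-second threshold is illustrative):
mysql -e "SELECT
  SUM(STATE LIKE 'Waiting for%lock%')   AS waiting_on_locks,
  SUM(COMMAND <> 'Sleep' AND TIME > 60) AS long_running,
  SUM(COMMAND = 'Sleep')                AS sleeping,
  COUNT(*)                              AS total_connections
  FROM information_schema.PROCESSLIST;"
mysql -e "SHOW VARIABLES LIKE 'max_connections';"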
Step 3 – Locate lock waits and deadlocks
Goal: Identify lock conflicts and the thread holding the lock.
InnoDB lock queries (MySQL 5.7; the information_schema.innodb_lock_waits table used below was removed in 8.0, see the sys-schema variant after the scenarios)
SELECT r.trx_id AS waiting_trx_id,
r.trx_mysql_thread_id AS waiting_thread,
r.trx_query AS waiting_query,
b.trx_id AS blocking_trx_id,
b.trx_mysql_thread_id AS blocking_thread,
b.trx_query AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;
Metadata lock (MDL) inspection
SELECT object_schema, object_name, waiting_lock_type,
waiting_pid, waiting_query, blocking_pid
FROM sys.schema_table_lock_waits;
Typical scenarios
Scenario 1: Uncommitted transaction holding locks
A developer ran BEGIN; UPDATE users SET status=1; and forgot to commit, leaving the connection in Sleep. Subsequent updates were blocked.
SELECT * FROM information_schema.innodb_trx WHERE trx_state = 'RUNNING';
KILL <thread_id>;
Scenario 2: Large-table DDL causing MDL
Running ALTER TABLE while long‑running queries are active blocks the DDL and all later accesses to that table.
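Note that on MySQL 8.0 the information_schema.innodb_lock_waits table used in the query above no longer exists; the sys schema exposes the same picture on both 5.7 and 8.0 and even generates the KILL statement for the blocker. A minimal sketch:
mysql -e "SELECT waiting_pid, waiting_query,
  blocking_pid, blocking_query,
  wait_age_secs, sql_kill_blocking_connection
  FROM sys.innodb_lock_waits;"
# sql_kill_blocking_connection contains a ready-made statement such as KILL <blocking_pid>;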
Step 4 – Analyze slow queries and system bottlenecks
Goal: Identify resource‑intensive queries and underlying system constraints.
Slow‑query log analysis
# Use pt‑query‑digest
pt-query-digest /var/log/mysql/slow.log | head -n 100
# Or mysqldumpslow
mysqldumpslow -s t -t 10 /var/log/mysql/slow.log
Real-time performance diagnostics
SHOW GLOBAL STATUS LIKE '%thread%';
SHOW GLOBAL STATUS LIKE '%connection%';
SHOW ENGINE INNODB STATUS\G
# Key metrics to watch
# 1. Threads_connected vs max_connections
# 2. Threads_running (should not exceed 2× CPU cores)
# 3. Innodb_row_lock_waits
# 4. Innodb_buffer_pool_wait_free
System-level checks
# I/O performance
iostat -x 1 10
# Look for %util near 100% and high await
# MySQL I/O
iotop -p $(pidof mysqld)
# Memory
free -h
cat /proc/$(pidof mysqld)/status | grep -E '(VmSize|VmRSS)'
Step 5 – Emergency handling and recovery
Goal: Quickly restore service while avoiding secondary damage.
Urgent actions
1. Kill blocking threads
SELECT CONCAT('KILL ', id, ';') FROM information_schema.PROCESSLIST WHERE USER='webapp' AND TIME>60;
2. Temporarily adjust parameters
SET GLOBAL max_connections = 2000;
SET GLOBAL innodb_lock_wait_timeout = 10;
SET GLOBAL wait_timeout = 600;
3. Restart MySQL (last resort)
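Before any restart, check how much dirty data InnoDB still has to flush; a large backlog means a graceful shutdown may take minutes, and a forced kill pushes that work into crash recovery at startup. A quick check:
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';"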
# Graceful restart
systemctl restart mysql
# Forceful if needed
killall -9 mysqld
systemctl start mysql
Post-recovery verification
SHOW GLOBAL STATUS;
SHOW VARIABLES;
SHOW SLAVE STATUS\G
SELECT COUNT(*) FROM critical_business_table;
tail -f /var/log/mysql/error.log
Post-mortem checklist
Record timeline and screenshots
Save the processlist output, InnoDB status, and other diagnostics (see the capture sketch after this list)
Export relevant slow‑query logs
Analyze root cause
Define preventive measures and optimization plan
Update monitoring and alert rules
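A small helper script makes the evidence capture repeatable under pressure; a sketch, with an illustrative output directory:
#!/bin/bash
# Snapshot the key diagnostics for the post-mortem
OUT=/var/tmp/mysql-incident-$(date +%Y%m%d-%H%M%S)
mkdir -p "$OUT"
mysql -e "SHOW FULL PROCESSLIST\G"     > "$OUT/processlist.txt"
mysql -e "SHOW ENGINE INNODB STATUS\G" > "$OUT/innodb_status.txt"
mysql -e "SHOW GLOBAL STATUS;"         > "$OUT/global_status.txt"
cp /var/log/mysql/slow.log "$OUT/" 2>/dev/null
echo "Evidence saved to $OUT"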
Best Practices & Preventive Measures
Configuration recommendations
# my.cnf key settings
[mysqld]
max_connections = 2000
max_connect_errors = 10000
wait_timeout = 600
interactive_timeout = 600
innodb_lock_wait_timeout = 10
lock_wait_timeout = 5
table_open_cache = 4000
slow_query_log = 1
long_query_time = 1
log_queries_not_using_indexes = 1
# roughly 70% of RAM on a dedicated database host; adjust to your server
innodb_buffer_pool_size = 8G
innodb_log_file_size = 1G
innodb_flush_log_at_trx_commit = 2
Monitoring & alerting
Connection count: alert when current > 0.7 × max_connections (check every 30 s)
Long‑running queries: alert when execution > 60 s
Lock wait: alert when wait time > 5 s
Thread pile‑up: alert when Threads_running > 2 × CPU cores
Disk usage: alert when data‑disk > 80 %
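The connection-count rule above is easy to wire into a cron-based check; a minimal sketch (the echo is a placeholder for your own alerting hook):
#!/bin/bash
# Alert when connections exceed 70% of max_connections
max=$(mysql -N -e "SELECT @@max_connections;")
cur=$(mysql -N -e "SELECT COUNT(*) FROM information_schema.PROCESSLIST;")
if [ "$cur" -gt $(( max * 7 / 10 )) ]; then
  echo "ALERT: $cur/$max MySQL connections in use"
fi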
Routine maintenance
Daily
Review slow‑query log and optimize top offenders
Check error log for warnings or restarts
Verify backup jobs
Weekly
Analyze table growth trends and plan capacity
Optimize fragmented tables
Audit new SQL for proper indexes
Monthly
Comprehensive performance and stress testing
Update runbooks and emergency procedures
Review all incidents from the past month
Conclusion & Outlook
MySQL freezes are among the most common and urgent incidents for operations engineers. By following the five‑step method—process verification, connection inspection, lock detection, bottleneck analysis, and emergency recovery—most issues can be pinpointed and services restored swiftly. Continuous learning, systematic monitoring, and regular post‑mortems are essential to prevent future freezes.