
How to Quickly Diagnose and Fix a Frozen MySQL in Production: 5 Proven Steps

Facing a MySQL server that suddenly becomes unresponsive in production? This article walks through the exact five‑step investigative process—checking process status, examining connections, locating lock waits, analyzing slow queries and system bottlenecks, and applying emergency recovery—illustrated with real‑world examples and command‑line snippets.

Ops Community

MySQL Freeze in Production? These 5 Steps Saved Me

Introduction

At 2 a.m. the phone rang: “MySQL is frozen, all services are down!” A sudden production‑level MySQL stall can cause business interruption, data loss, and customer complaints. After several incidents I distilled a reliable five‑step troubleshooting method to locate and resolve the problem quickly.

Technical Background: What "MySQL Freeze" Means

What is a MySQL freeze?

The term isn’t a formal technical phrase; it usually describes extremely slow or completely unresponsive database behavior. Common causes include:

Lock wait: table, row, or metadata locks blocking many threads

Slow query backlog: complex SQL consuming resources, exhausting the connection pool

Disk I/O bottleneck: full disks, insufficient I/O performance, or storage failures

Memory exhaustion: buffer pool or temporary tablespace shortage

Connection saturation: max_connections reached, new connections cannot be created

MySQL thread state mechanism

Each client connection maps to a thread that can be in various states, such as Sleep, Query, Locked, Waiting for table metadata lock, Sending data, Sorting result, Creating tmp table, etc. When many threads stay in a non‑normal state for a long time, the server appears “frozen”.

Real‑world incident

During the 2023 Double‑11 promotion, traffic surged and MySQL stalled. Connections jumped from 200 to 2000 in ten seconds, and all API requests timed out. The root cause was an unindexed join on a large table that held a table lock for too long, causing a cascade effect.

Core Content: Detailed 5‑Step Troubleshooting

Step 1 – Quickly verify MySQL process status

Goal: Within 30 seconds confirm the MySQL process is alive and check basic resource usage.

Commands

# Check MySQL process status
ps aux | grep mysqld

# View resource consumption of the MySQL process
top -p $(pidof mysqld)

# Quick system load check
uptime

# Disk space check
df -h

Key indicators

Process existence – if missing, MySQL has crashed; check error logs and restart.

CPU usage – near 100% suggests slow queries or full scans; low usage may indicate lock wait or I/O bottleneck.

Memory usage – approaching physical limit or OOM?

Disk space – data or system disk full?

Practical tip

I keep two terminals: one running SHOW PROCESSLIST continuously, the other for analysis and KILL commands, allowing real‑time observation of thread state changes.

# Continuous monitoring
watch -n 2 'mysql -e "SHOW PROCESSLIST\G" | grep -E "(Id|User|State|Time|Info)"'
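The disk check from the indicator list can be scripted so it only prints when something is wrong. A minimal sketch, assuming POSIX `df -P` output; `flag_full` is a hypothetical helper name:

```shell
# Sketch: flag filesystems above a usage threshold (percent),
# reading POSIX `df -P` output on stdin.
flag_full() {
  awk -v t="$1" 'NR > 1 { gsub("%", "", $5); if ($5 + 0 > t) print $6 }'
}

df -P | flag_full 80   # prints any mount point over 80% full
```

Silence means all filesystems are under the threshold, which makes it easy to drop into a cron job or a larger triage script.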

Step 2 – Inspect current connections and thread states

Goal: Identify running SQL and blocking situations.

Core commands

-- View all connections
SHOW FULL PROCESSLIST;

-- Threads running longer than 2 seconds
SELECT * FROM information_schema.PROCESSLIST WHERE COMMAND != 'Sleep' AND TIME > 2 ORDER BY TIME DESC;

-- Count threads per state
SELECT STATE, COUNT(*) AS count FROM information_schema.PROCESSLIST GROUP BY STATE ORDER BY count DESC;

Key fields in SHOW PROCESSLIST

Id: thread ID, usable with KILL

User: connection user

Host: client IP

db: current database

Command: type of command (Query/Sleep…)

Time: duration in seconds

State: thread state

Info: executing SQL statement

Danger signals

Many threads in Locked or Waiting for table metadata lock

Single query with Time > 60 s still running

Info shows obvious slow query or full table scan

Sleep connections approaching max_connections
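To spot the last signal quickly, the processlist can be tallied per Command. A hypothetical awk helper (`by_command` is an assumed name) that consumes tab-separated `mysql -e "SHOW PROCESSLIST"` output:

```shell
# Sketch: tally threads per Command from tab-separated
# `mysql -e "SHOW PROCESSLIST"` output (first line is the header).
by_command() {
  awk -F'\t' 'NR > 1 { n[$5]++ } END { for (c in n) print c, n[c] }'
}

# Usage: mysql -e "SHOW PROCESSLIST" | by_command
```

A Sleep count creeping toward max_connections is the saturation warning described above.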

Step 3 – Locate lock waits and deadlocks

Goal: Identify lock conflicts and the thread holding the lock.

InnoDB lock queries (MySQL 5.7; these information_schema tables were removed in 8.0)

SELECT r.trx_id AS waiting_trx_id,
       r.trx_mysql_thread_id AS waiting_thread,
       r.trx_query AS waiting_query,
       b.trx_id AS blocking_trx_id,
       b.trx_mysql_thread_id AS blocking_thread,
       b.trx_query AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id
JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id;
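On MySQL 8.0, where the information_schema lock tables above no longer exist, the equivalent view lives in the sys schema (also available on 5.7) and even generates the KILL statement for you:

```sql
-- MySQL 8.0: use the sys view instead of information_schema.innodb_lock_waits
SELECT waiting_pid, waiting_query,
       blocking_pid, blocking_query,
       sql_kill_blocking_query
FROM sys.innodb_lock_waits;
```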

Metadata lock (MDL) inspection (on 5.7, enable the wait/lock/metadata/sql/mdl instrument in performance_schema first)

SELECT locked_table_schema, locked_table_name, locked_type,
       waiting_processlist_id, waiting_query, blocking_processlist_id
FROM sys.schema_table_lock_waits;

Typical scenarios

Scenario 1: Uncommitted transaction holding locks

A developer ran BEGIN; UPDATE users SET status=1; and forgot to commit, leaving the connection in Sleep. Subsequent updates were blocked.

SELECT * FROM information_schema.innodb_trx WHERE trx_state = 'RUNNING';
KILL <thread_id>;
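This pattern can be surfaced proactively by looking for transactions that have been open for a while on a connection that is idle. A sketch; the 60-second threshold is an arbitrary example:

```sql
-- Transactions open > 60 s whose connection sits idle (Command = 'Sleep')
SELECT t.trx_mysql_thread_id AS thread_id,
       t.trx_started,
       p.USER, p.HOST
FROM information_schema.innodb_trx t
JOIN information_schema.PROCESSLIST p ON p.ID = t.trx_mysql_thread_id
WHERE p.COMMAND = 'Sleep'
  AND t.trx_started < NOW() - INTERVAL 60 SECOND;
```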

Scenario 2: Large‑table DDL causing MDL

Running ALTER TABLE while long‑running queries are active blocks the DDL and all later accesses to that table.
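This scenario can be made much less dangerous by failing fast instead of queueing behind the MDL. A hedged sketch; `orders` is a placeholder table name:

```sql
-- Cap how long DDL waits for the metadata lock (the default is one year!)
SET SESSION lock_wait_timeout = 5;

-- Request an online DDL; MySQL errors out rather than fall back to a table lock.
-- Note: even LOCK=NONE briefly takes the MDL at start and end, so the cap matters.
ALTER TABLE orders ADD COLUMN note VARCHAR(64), ALGORITHM=INPLACE, LOCK=NONE;
```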

Step 4 – Analyze slow queries and system bottlenecks

Goal: Identify resource‑intensive queries and underlying system constraints.

Slow‑query log analysis

# Use pt‑query‑digest
pt-query-digest /var/log/mysql/slow.log | head -n 100

# Or mysqldumpslow
mysqldumpslow -s t -t 10 /var/log/mysql/slow.log
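Before reaching for the digest tools, a rough count of entries over a threshold gives a quick "how bad is it" number. A hypothetical awk helper (`slow_over`), assuming the standard `# Query_time:` header lines of the slow log:

```shell
# Sketch: count slow-log entries whose Query_time exceeds a threshold (seconds)
slow_over() {
  awk -v t="$1" '$1 == "#" && $2 == "Query_time:" && $3 + 0 > t { n++ } END { print n + 0 }'
}

# Usage: slow_over 5 < /var/log/mysql/slow.log
```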

Real‑time performance diagnostics

SHOW GLOBAL STATUS LIKE '%thread%';
SHOW GLOBAL STATUS LIKE '%connection%';
SHOW ENGINE INNODB STATUS\G

# Key metrics to watch
# 1. Threads_connected vs max_connections
# 2. Threads_running (should not exceed 2× CPU cores)
# 3. Innodb_row_lock_waits
# 4. Innodb_buffer_pool_wait_free
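The Threads_running rule of thumb from metric 2 can be wrapped in a tiny check (`pileup` is a hypothetical helper name):

```shell
# Sketch: Threads_running beyond ~2x CPU cores usually means queries are
# piling up faster than they finish.
pileup() {
  # $1 = Threads_running, $2 = CPU cores
  if [ "$1" -gt $(( 2 * $2 )) ]; then echo ALERT; else echo OK; fi
}

# Live usage (assumes mysql client auth is already configured):
# pileup "$(mysql -N -e "SHOW GLOBAL STATUS LIKE 'Threads_running'" | awk '{print $2}')" "$(nproc)"
```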

System‑level checks

# I/O performance
iostat -x 1 10
# Look for %util near 100% and high await

# MySQL I/O
iotop -p $(pidof mysqld)

# Memory
free -h
grep -E '(VmSize|VmRSS)' /proc/$(pidof mysqld)/status

Step 5 – Emergency handling and recovery

Goal: Quickly restore service while avoiding secondary damage.

Urgent actions

1. Kill blocking threads

-- Generate KILL statements; review the output before executing it!
SELECT CONCAT('KILL ', id, ';') FROM information_schema.PROCESSLIST WHERE USER='webapp' AND TIME>60;

2. Temporarily adjust parameters

SET GLOBAL max_connections = 2000;
SET GLOBAL innodb_lock_wait_timeout = 10;
SET GLOBAL wait_timeout = 600;

3. Restart MySQL (last resort)

# Graceful restart
systemctl restart mysql

# Forceful if needed (forces InnoDB crash recovery on next start; truly a last resort)
killall -9 mysqld
systemctl start mysql

Post‑recovery verification

SHOW GLOBAL STATUS;
SHOW VARIABLES;
SHOW SLAVE STATUS\G
SELECT COUNT(*) FROM critical_business_table;
tail -f /var/log/mysql/error.log

Post‑mortem checklist

Record timeline and screenshots

Save processlist, InnoDB status, etc.

Export relevant slow‑query logs

Analyze root cause

Define preventive measures and optimization plan

Update monitoring and alert rules

Best Practices & Preventive Measures

Configuration recommendations

# my.cnf key settings
[mysqld]
max_connections = 2000
max_connect_errors = 10000
wait_timeout = 600
interactive_timeout = 600
innodb_lock_wait_timeout = 10
lock_wait_timeout = 5
table_open_cache = 4000
slow_query_log = 1
long_query_time = 1
log_queries_not_using_indexes = 1
innodb_buffer_pool_size = 22G  # ~70% of RAM; example value for a 32 GB host
innodb_log_file_size = 1G
innodb_flush_log_at_trx_commit = 2  # trades up to ~1 s of durability for write throughput

Monitoring & alerting

Connection count: alert when current > 0.7 × max_connections (check every 30 s)

Long‑running queries: alert when execution > 60 s

Lock wait: alert when wait time > 5 s

Thread pile‑up: alert when Threads_running > 2 × CPU cores

Disk usage: alert when data‑disk > 80 %
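The connection-count rule above can be expressed as a small check for an alerting script (a sketch; `conn_alert` is a hypothetical name):

```shell
# Sketch: ALERT when current connections exceed 70% of max_connections
conn_alert() {
  # $1 = Threads_connected, $2 = max_connections
  if [ "$1" -gt $(( $2 * 7 / 10 )) ]; then echo ALERT; else echo OK; fi
}
```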

Routine maintenance

Daily

Review slow‑query log and optimize top offenders

Check error log for warnings or restarts

Verify backup jobs

Weekly

Analyze table growth trends and plan capacity

Optimize fragmented tables

Audit new SQL for proper indexes

Monthly

Comprehensive performance and stress testing

Update runbooks and emergency procedures

Review all incidents from the past month

Conclusion & Outlook

MySQL freezes are among the most common and urgent incidents for operations engineers. By following the five‑step method—process verification, connection inspection, lock detection, bottleneck analysis, and emergency recovery—most issues can be pinpointed and services restored swiftly. Continuous learning, systematic monitoring, and regular post‑mortems are essential to prevent future freezes.

Tags: operations, MySQL, database troubleshooting, production incident
Written by Ops Community

A leading IT operations community where professionals share and grow together.