How a Single Slow Query Triggered a Database Avalanche – Full SQL Optimization Walkthrough
A real‑world MySQL incident where a batch UPDATE with an IN‑subquery caused a full‑table scan, connection pool exhaustion, and a system‑wide outage, and the step‑by‑step investigation, emergency mitigation, and comprehensive optimization that reduced query time from 45 seconds to 0.3 seconds.
Incident Overview
At 02:15 a monitoring alarm reported MySQL CPU ≈ 98 %, connections at the max limit (500/500), and P99 response time > 8 s. Users experienced homepage timeouts, unavailable search, order page "Connection timeout" errors, and partial login failures, indicating a full‑scale outage.
Initial Investigation
Database status
mysql -h db-master -u app_user -p
SHOW STATUS LIKE 'Threads_connected'; -- 500/500 (MAX)
SHOW STATUS LIKE 'Max_used_connections'; -- 523
SHOW PROCESSLIST;The processlist contained thousands of connections in Sleep state for 2000‑3000 s, all from the same application server IP.
Load inspection
SHOW FULL PROCESSLIST;Most rows were Sleep, suggesting a connection‑pool leak or long‑running transactions.
InnoDB status
SHOW ENGINE INNODB STATUS\GTransaction history list length was 15890 (normal range is a few hundred), confirming many uncommitted or long‑running transactions.
Slow‑query log
tail -100 /var/log/mysql/slow-query.logTwo queries around 02:10 am ran for 45 s and 38 s, both involving UPDATE and SELECT on the orders table.
Root‑Cause Analysis
Identify longest running queries
SELECT id,user,host,db,command,time,state,LEFT(info,100) AS query
FROM information_schema.processlist
WHERE command!='Daemon'
ORDER BY time DESC
LIMIT 10;The top entry was a 156‑second UPDATE:
UPDATE products p
INNER JOIN order_items oi ON p.id = oi.product_id
SET p.stock = p.stock - oi.quantity,
p.sales_count = p.sales_count + oi.quantity,
p.last_sale_time = NOW()
WHERE oi.order_id IN (
SELECT id FROM orders
WHERE status='pending' AND created_at < DATE_SUB(NOW(),INTERVAL 1 HOUR)
);Query analysis
The subquery scans the orders table (6.5 M rows) because the status index idx_status has low selectivity (≈ 500 k pending rows), causing a full scan.
Execution plan
EXPLAIN UPDATE products p
INNER JOIN order_items oi ON p.id = oi.product_id
SET …
WHERE oi.order_id IN (
SELECT id FROM orders
WHERE status='pending' AND created_at < DATE_SUB(NOW(),INTERVAL 1 HOUR)
);Key findings: products – type=ALL (full scan, ~50 000 rows)
Subquery on orders – type=index (full index scan, ~5 000 rows)
Estimated rows scanned ≈ 50 000
Table statistics
SELECT table_name, table_rows, data_length/1024/1024 AS data_mb,
index_length/1024/1024 AS index_mb
FROM information_schema.tables
WHERE table_schema='app'
ORDER BY data_length DESC;Important tables: orders: 6.5 M rows, 1.2 GB order_items: 15 M rows products: 50 k rows
Root‑cause summary
Batch UPDATE uses IN (SELECT …), forcing a full scan of the large orders table. idx_status has very low selectivity; the optimizer chooses a full scan.
Indirect contributors:
Missing composite index (status, created_at) on orders.
No rate‑limiting on the batch operation.
Slow‑query threshold was 5 s, so the problem was not detected earlier.
Emergency Handling
Terminate runaway queries
SELECT CONCAT('KILL ',id,';') AS kill_cmd
FROM information_schema.processlist
WHERE command='Query' AND time>30;Execute the generated KILL statements (avoid KILL ALL).
Temporary query rewrite
UPDATE products p
INNER JOIN (
SELECT oi.product_id,
SUM(oi.quantity) AS total_qty,
COUNT(DISTINCT oi.order_id) AS order_count
FROM order_items oi
INNER JOIN orders o ON oi.order_id = o.id
WHERE o.status='pending' AND o.created_at < DATE_SUB(NOW(),INTERVAL 1 HOUR)
GROUP BY oi.product_id
) sub ON p.id = sub.product_id
SET p.stock = p.stock - sub.total_qty,
p.sales_count = p.sales_count + sub.order_count,
p.last_sale_time = NOW();Add urgent index
ALTER TABLE orders ADD INDEX idx_status_created (status, created_at);Verify fix
EXPLAIN SELECT id FROM orders WHERE status='pending' AND created_at < DATE_SUB(NOW(),INTERVAL 1 HOUR);Result: type=range, rows≈5 000 (down from ~50 000).
Complete Optimization Plan
Batch‑processing SQL rewrite
-- 1. Get pending order IDs (limit 1000)
SELECT oi.order_id
FROM order_items oi
INNER JOIN orders o ON oi.order_id = o.id
WHERE o.status='pending' AND o.created_at < DATE_SUB(NOW(),INTERVAL 1 HOUR)
LIMIT 1000;
-- 2. Update products in batches
UPDATE products p
INNER JOIN (
SELECT oi.product_id,
SUM(oi.quantity) AS total_qty,
COUNT(DISTINCT oi.order_id) AS order_count
FROM order_items oi
INNER JOIN orders o ON oi.order_id = o.id
WHERE o.status='pending' AND o.created_at < DATE_SUB(NOW(),INTERVAL 1 HOUR)
AND oi.order_id IN (/* 1000 ids */)
GROUP BY oi.product_id
) sub ON p.id = sub.product_id
SET p.stock = p.stock - sub.total_qty,
p.sales_count = p.sales_count + sub.order_count,
p.last_sale_time = NOW();
-- 3. Update order status
UPDATE orders
SET status='processing', updated_at=NOW()
WHERE status='pending' AND created_at < DATE_SUB(NOW(),INTERVAL 1 HOUR)
LIMIT 1000;Index redesign
ALTER TABLE orders ADD INDEX idx_pending_orders (status, created_at, id);
ALTER TABLE order_items ADD INDEX idx_order_product (order_id, product_id);Application‑level batch processing (Python)
def process_pending_orders(batch_size=1000):
while True:
order_ids = get_pending_order_ids(limit=batch_size)
if not order_ids:
break
update_product_sales(order_ids)
update_order_status(order_ids)
commit()
logger.info(f"Processed {len(order_ids)} orders")
time.sleep(0.1)MySQL configuration tweaks
# my.cnf
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow-query.log
long_query_time = 1
log_queries_not_using_indexes = 1
max_connections = 1000
wait_timeout = 60
innodb_buffer_pool_size = 16G
innodb_log_file_size = 2G
innodb_flush_log_at_trx_commit = 2
innodb_lock_wait_timeout = 30Monitoring script (bash)
#!/bin/bash
MYSQL_HOST="db-master"
MYSQL_USER="monitor"
MYSQL_PASS="monitor_password"
ALERT_EMAIL="[email protected]"
THRESHOLD=5
SLOW_QUERIES=$(mysql -h $MYSQL_HOST -u $MYSQL_USER -p$MYSQL_PASS -e "SELECT COUNT(*) FROM mysql.slow_log WHERE start_time > DATE_SUB(NOW(),INTERVAL 5 MINUTE) AND query_time > $THRESHOLD;" 2>/dev/null | tail -1)
if [ "$SLOW_QUERIES" -gt 10 ]; then
echo "Warning: $SLOW_QUERIES slow queries in last 5 minutes" | mail -s "[WARNING] MySQL slow query alert" $ALERT_EMAIL
fiVerification & Summary
Performance comparison
# Before optimization: 45 s
# After optimization: 0.3 s
EXPLAIN UPDATE products p
INNER JOIN (… ) sub ON p.id = sub.product_id;
-- rows reduced from 50 000 to 5 000, using range scanKey improvements:
Index idx_pending_orders used for range scan.
Index idx_order_product used for ref scan.
Estimated rows dropped from 50 000 to 5 000.
Actual execution time dropped from 45 s to 0.3 s.
Incident timeline
02:00 – Scheduled batch job starts
02:10 – Slow queries accumulate
02:12 – Queries exceed 30 s
02:14 – Connection pool exhausted
02:15 – Monitoring alarm fires
02:17 – Engineer intervenes
02:20 – Kill long‑running queries
02:25 – Add missing index
02:30 – Service restoredLoss assessment
Service outage ≈ 15 minutes
≈ 3 000 orders affected
≈ 2 000 users impacted
Direct financial loss ≈ ¥100 000
SQL Optimization Methodology
Investigation steps
1. Monitoring alarm
2. Check DB health (CPU, connections)
3. Examine PROCESSLIST for long queries
4. Analyse slow‑query log
5. Run EXPLAIN on suspect statements
6. Review table schema and indexes
7. Identify root cause
8. Design optimization plan
9. Implement changes
10. Validate resultsExecution‑plan analysis tips
Common type values and their performance impact:
ALL – full table scan (worst)
index – full index scan (poor)
range – range scan (medium)
ref – index lookup (good)
eq_ref – unique index lookup (excellent)
const – constant lookup (best)
Important Extra hints: Using filesort – requires extra sorting. Using temporary – creates a temporary table. Using index – covering index, optimal. Using where – server‑side filtering needed.
Common optimization techniques
Full table scan → add appropriate index.
Subquery → replace with JOIN or EXISTS.
ORDER BY → ensure index supports ordering or reduce result set.
GROUP BY → use covering index or limit rows.
DISTINCT → leverage index coverage.
LIKE with leading wildcard → use full‑text index or redesign query.
OR conditions → split into UNION or add composite index.
NULL checks → prefer IS NULL / IS NOT NULL.
Preventive measures
SQL review with EXPLAIN before release.
Set realistic slow‑query thresholds.
Regular index audit (remove unused, add missing).
Connection‑pool timeout management.
Rate‑limit large batch operations.
Periodic performance inspections.
Handy diagnostic commands
SHOW FULL PROCESSLIST;
SELECT * FROM mysql.slow_log ORDER BY start_time DESC LIMIT 10;
SELECT * FROM performance_schema.table_io_waits_summary_by_index_usage WHERE object_schema='app';
SELECT * FROM information_schema.innodb_lock_waits;
SELECT * FROM information_schema.innodb_trx WHERE trx_state='LOCK WAIT';Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Ops Community
A leading IT operations community where professionals share and grow together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
