Databases 20 min read

How a Single Slow Query Triggered a Database Avalanche – Full SQL Optimization Walkthrough

A real‑world MySQL incident where a batch UPDATE with an IN‑subquery caused a full‑table scan, connection pool exhaustion, and a system‑wide outage, and the step‑by‑step investigation, emergency mitigation, and comprehensive optimization that reduced query time from 45 seconds to 0.3 seconds.

Ops Community
Ops Community
Ops Community
How a Single Slow Query Triggered a Database Avalanche – Full SQL Optimization Walkthrough

Incident Overview

At 02:15 a monitoring alarm reported MySQL CPU ≈ 98 %, connections at the max limit (500/500), and P99 response time > 8 s. Users experienced homepage timeouts, unavailable search, order page "Connection timeout" errors, and partial login failures, indicating a full‑scale outage.

Initial Investigation

Database status

mysql -h db-master -u app_user -p
SHOW STATUS LIKE 'Threads_connected';   -- 500/500 (MAX)
SHOW STATUS LIKE 'Max_used_connections'; -- 523
SHOW PROCESSLIST;

The processlist contained thousands of connections in Sleep state for 2000‑3000 s, all from the same application server IP.

Load inspection

SHOW FULL PROCESSLIST;

Most rows were Sleep, suggesting a connection‑pool leak or long‑running transactions.

InnoDB status

SHOW ENGINE INNODB STATUS\G

Transaction history list length was 15890 (normal range is a few hundred), confirming many uncommitted or long‑running transactions.

Slow‑query log

tail -100 /var/log/mysql/slow-query.log

Two queries around 02:10 am ran for 45 s and 38 s, both involving UPDATE and SELECT on the orders table.

Root‑Cause Analysis

Identify longest running queries

SELECT id,user,host,db,command,time,state,LEFT(info,100) AS query
FROM information_schema.processlist
WHERE command!='Daemon'
ORDER BY time DESC
LIMIT 10;

The top entry was a 156‑second UPDATE:

UPDATE products p
INNER JOIN order_items oi ON p.id = oi.product_id
SET p.stock = p.stock - oi.quantity,
    p.sales_count = p.sales_count + oi.quantity,
    p.last_sale_time = NOW()
WHERE oi.order_id IN (
  SELECT id FROM orders
  WHERE status='pending' AND created_at < DATE_SUB(NOW(),INTERVAL 1 HOUR)
);

Query analysis

The subquery scans the orders table (6.5 M rows) because the status index idx_status has low selectivity (≈ 500 k pending rows), causing a full scan.

Execution plan

EXPLAIN UPDATE products p
INNER JOIN order_items oi ON p.id = oi.product_id
SET …
WHERE oi.order_id IN (
  SELECT id FROM orders
  WHERE status='pending' AND created_at < DATE_SUB(NOW(),INTERVAL 1 HOUR)
);

Key findings: products – type=ALL (full scan, ~50 000 rows)

Subquery on orders – type=index (full index scan, ~5 000 rows)

Estimated rows scanned ≈ 50 000

Table statistics

SELECT table_name, table_rows, data_length/1024/1024 AS data_mb,
       index_length/1024/1024 AS index_mb
FROM information_schema.tables
WHERE table_schema='app'
ORDER BY data_length DESC;

Important tables: orders: 6.5 M rows, 1.2 GB order_items: 15 M rows products: 50 k rows

Root‑cause summary

Batch UPDATE uses IN (SELECT …), forcing a full scan of the large orders table. idx_status has very low selectivity; the optimizer chooses a full scan.

Indirect contributors:

Missing composite index (status, created_at) on orders.

No rate‑limiting on the batch operation.

Slow‑query threshold was 5 s, so the problem was not detected earlier.

Emergency Handling

Terminate runaway queries

SELECT CONCAT('KILL ',id,';') AS kill_cmd
FROM information_schema.processlist
WHERE command='Query' AND time>30;

Execute the generated KILL statements (avoid KILL ALL).

Temporary query rewrite

UPDATE products p
INNER JOIN (
  SELECT oi.product_id,
         SUM(oi.quantity) AS total_qty,
         COUNT(DISTINCT oi.order_id) AS order_count
  FROM order_items oi
  INNER JOIN orders o ON oi.order_id = o.id
  WHERE o.status='pending' AND o.created_at < DATE_SUB(NOW(),INTERVAL 1 HOUR)
  GROUP BY oi.product_id
) sub ON p.id = sub.product_id
SET p.stock = p.stock - sub.total_qty,
    p.sales_count = p.sales_count + sub.order_count,
    p.last_sale_time = NOW();

Add urgent index

ALTER TABLE orders ADD INDEX idx_status_created (status, created_at);

Verify fix

EXPLAIN SELECT id FROM orders WHERE status='pending' AND created_at < DATE_SUB(NOW(),INTERVAL 1 HOUR);

Result: type=range, rows≈5 000 (down from ~50 000).

Complete Optimization Plan

Batch‑processing SQL rewrite

-- 1. Get pending order IDs (limit 1000)
SELECT oi.order_id
FROM order_items oi
INNER JOIN orders o ON oi.order_id = o.id
WHERE o.status='pending' AND o.created_at < DATE_SUB(NOW(),INTERVAL 1 HOUR)
LIMIT 1000;

-- 2. Update products in batches
UPDATE products p
INNER JOIN (
  SELECT oi.product_id,
         SUM(oi.quantity) AS total_qty,
         COUNT(DISTINCT oi.order_id) AS order_count
  FROM order_items oi
  INNER JOIN orders o ON oi.order_id = o.id
  WHERE o.status='pending' AND o.created_at < DATE_SUB(NOW(),INTERVAL 1 HOUR)
    AND oi.order_id IN (/* 1000 ids */)
  GROUP BY oi.product_id
) sub ON p.id = sub.product_id
SET p.stock = p.stock - sub.total_qty,
    p.sales_count = p.sales_count + sub.order_count,
    p.last_sale_time = NOW();

-- 3. Update order status
UPDATE orders
SET status='processing', updated_at=NOW()
WHERE status='pending' AND created_at < DATE_SUB(NOW(),INTERVAL 1 HOUR)
LIMIT 1000;

Index redesign

ALTER TABLE orders ADD INDEX idx_pending_orders (status, created_at, id);
ALTER TABLE order_items ADD INDEX idx_order_product (order_id, product_id);

Application‑level batch processing (Python)

def process_pending_orders(batch_size=1000):
    while True:
        order_ids = get_pending_order_ids(limit=batch_size)
        if not order_ids:
            break
        update_product_sales(order_ids)
        update_order_status(order_ids)
        commit()
        logger.info(f"Processed {len(order_ids)} orders")
        time.sleep(0.1)

MySQL configuration tweaks

# my.cnf
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow-query.log
long_query_time = 1
log_queries_not_using_indexes = 1
max_connections = 1000
wait_timeout = 60
innodb_buffer_pool_size = 16G
innodb_log_file_size = 2G
innodb_flush_log_at_trx_commit = 2
innodb_lock_wait_timeout = 30

Monitoring script (bash)

#!/bin/bash
MYSQL_HOST="db-master"
MYSQL_USER="monitor"
MYSQL_PASS="monitor_password"
ALERT_EMAIL="[email protected]"
THRESHOLD=5
SLOW_QUERIES=$(mysql -h $MYSQL_HOST -u $MYSQL_USER -p$MYSQL_PASS -e "SELECT COUNT(*) FROM mysql.slow_log WHERE start_time > DATE_SUB(NOW(),INTERVAL 5 MINUTE) AND query_time > $THRESHOLD;" 2>/dev/null | tail -1)
if [ "$SLOW_QUERIES" -gt 10 ]; then
  echo "Warning: $SLOW_QUERIES slow queries in last 5 minutes" | mail -s "[WARNING] MySQL slow query alert" $ALERT_EMAIL
fi

Verification & Summary

Performance comparison

# Before optimization: 45 s
# After optimization: 0.3 s
EXPLAIN UPDATE products p
INNER JOIN (… ) sub ON p.id = sub.product_id;
-- rows reduced from 50 000 to 5 000, using range scan

Key improvements:

Index idx_pending_orders used for range scan.

Index idx_order_product used for ref scan.

Estimated rows dropped from 50 000 to 5 000.

Actual execution time dropped from 45 s to 0.3 s.

Incident timeline

02:00 – Scheduled batch job starts
02:10 – Slow queries accumulate
02:12 – Queries exceed 30 s
02:14 – Connection pool exhausted
02:15 – Monitoring alarm fires
02:17 – Engineer intervenes
02:20 – Kill long‑running queries
02:25 – Add missing index
02:30 – Service restored

Loss assessment

Service outage ≈ 15 minutes

≈ 3 000 orders affected

≈ 2 000 users impacted

Direct financial loss ≈ ¥100 000

SQL Optimization Methodology

Investigation steps

1. Monitoring alarm
2. Check DB health (CPU, connections)
3. Examine PROCESSLIST for long queries
4. Analyse slow‑query log
5. Run EXPLAIN on suspect statements
6. Review table schema and indexes
7. Identify root cause
8. Design optimization plan
9. Implement changes
10. Validate results

Execution‑plan analysis tips

Common type values and their performance impact:

ALL – full table scan (worst)

index – full index scan (poor)

range – range scan (medium)

ref – index lookup (good)

eq_ref – unique index lookup (excellent)

const – constant lookup (best)

Important Extra hints: Using filesort – requires extra sorting. Using temporary – creates a temporary table. Using index – covering index, optimal. Using where – server‑side filtering needed.

Common optimization techniques

Full table scan → add appropriate index.

Subquery → replace with JOIN or EXISTS.

ORDER BY → ensure index supports ordering or reduce result set.

GROUP BY → use covering index or limit rows.

DISTINCT → leverage index coverage.

LIKE with leading wildcard → use full‑text index or redesign query.

OR conditions → split into UNION or add composite index.

NULL checks → prefer IS NULL / IS NOT NULL.

Preventive measures

SQL review with EXPLAIN before release.

Set realistic slow‑query thresholds.

Regular index audit (remove unused, add missing).

Connection‑pool timeout management.

Rate‑limit large batch operations.

Periodic performance inspections.

Handy diagnostic commands

SHOW FULL PROCESSLIST;
SELECT * FROM mysql.slow_log ORDER BY start_time DESC LIMIT 10;
SELECT * FROM performance_schema.table_io_waits_summary_by_index_usage WHERE object_schema='app';
SELECT * FROM information_schema.innodb_lock_waits;
SELECT * FROM information_schema.innodb_trx WHERE trx_state='LOCK WAIT';
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

monitoringindexingperformance tuningMySQLSQL optimizationslow querydatabase avalanche
Ops Community
Written by

Ops Community

A leading IT operations community where professionals share and grow together.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.