Databases 14 min read

How We Resolved a Sudden DB Load Spike: Root Cause, Fixes, and SQL Optimization Lessons

This article details a November 2022 database outage caused by a sudden CPU and load surge, explains how the team diagnosed the issue, outlines the emergency steps taken, and shares practical SQL performance optimization recommendations to prevent similar incidents.

Ziru Technology
Ziru Technology
Ziru Technology
How We Resolved a Sudden DB Load Spike: Root Cause, Fixes, and SQL Optimization Lessons

Fault Basic Information

Fault occurred on 2022‑11‑09 at 17:50. An alert robot reported a load alarm and the system was almost unavailable. CPU usage spiked to nearly 100% while the Oracle process %CPU was not high, indicating high concurrency rather than a single slow SQL.

Fault Response

1. Fault定位: Logged into the DB server, found the Oracle instance had terminated unexpectedly, and restarted it immediately.

2. Initial analysis suggested that a burst of messages in the corporate WeChat group caused a massive number of concurrent logins, increasing load.

3. After restart, CPU and load remained high, prompting further investigation of running transactions.

Specific steps:

1.1 View currently executing SQL statements:

SELECT A.SQL_ID, A.SQL_TEXT, COUNT(1) FROM V$SESSION S, V$SQLAREA A WHERE S.SQL_ID = A.SQL_ID AND S.SQL_ID IS NOT NULL GROUP BY A.SQL_ID, A.SQL_TEXT ORDER BY 3 DESC

1.2 Sample result (data masked):

select column1 FROM demo_table where customer_id = 1000 AND year_month LIKE '%202210%';
select column1 FROM demo_table where customer_id = 1001 AND year_month LIKE '%202210%';
select column1 FROM demo_table where customer_id = 1002 AND year_month LIKE '%202210%';

1.3 Calculate each SQL's load percentage using ASH history:

select B.*, C.sql_fulltext sql_text from (select A.sql_id, sum(sample_count) sample_count, count(1) sql_exec_count, trunc(sum(tm_delta_dbtime_ms/1000),2) delta_dbtime_sec, trunc(sum(tm_delta_dbtime_ms)/count(1)/1000,2) sql_ms_per, round(ratio_to_report(sum(tm_delta_dbtime_ms)) over(),4)*100 || '%' db_time_percentage, min(min_sample_time) min_sample_time, max(max_sample_time) max_sample_time from (SELECT sql_id, sql_exec_id, count(1) sample_count, trunc(sum(tm_delta_db_time/1000)) tm_delta_dbtime_ms, to_char(min(sample_time),'yyyymmdd hh24:mi:ss') min_sample_time, to_char(max(sample_time),'yyyymmdd hh24:mi:ss') max_sample_time FROM V$ACTIVE_SESSION_HISTORY ASH WHERE SQL_ID is not null and session_type='FOREGROUND' and ash.sample_time >= to_Date(&begin_Time,'yyyymmdd hh24:mi:ss') and ash.sample_time < to_Date(&end_Time,'yyyymmdd hh24:mi:ss') group by sql_id, sql_exec_id having trunc(sum(tm_delta_db_time/1000))>0) A group by A.sql_id) B, v$sqlarea C where B.sql_id=C.sql_id(+ ) order by B.delta_dbtime_sec desc;

The analysis showed many similar SQLs without bind variables, each contributing a small portion of the load; a manual SELECT took about 300 ms, matching the statistics.

Correction Direction

Three urgent fixes were requested from the development team:

Use bind variables.

Pass customer_id as a string instead of a numeric value.

Store year_month as a six‑digit yyyymm value, eliminating the need for LIKE.

The fix was deployed at 18:43, after which the load dropped dramatically, as shown by the following resource usage graphs:

Fault Review

1. Fault handling insights

High DB load is often caused by slow SQL, but a large number of “not‑slow” SQLs under high concurrency can also cripple the system.

If SQL rewriting or index tuning is insufficient, architectural adjustments may be required.

High CPU can be user‑CPU (memory‑intensive) or iowait‑CPU (disk‑intensive).

Frequent SQLs should use bind variables to reduce parsing overhead.

2. SQL execution efficiency considerations

MySQL execution flow:

If the query cache is enabled, it is checked first (generally disabled in production).

If cache miss, the parser performs lexical and syntactic analysis.

The optimizer generates the best execution plan based on cost estimation (I/O + CPU).

The executor interacts with the storage engine to return results.

Common SQL Optimization Recommendations

3.1 Left‑most prefix rule

KEY `idx_shopid_orderno` (`shop_id`,`order_no`)
select * from _t where orderno=''

Query must include shop_id or reorder the composite index to use order_no.

3.2 Implicit conversion

KEY `idx_mobile` (`mobile`)
select * from _user where mobile=12345678901

Store mobile as a string and query with quoted value to avoid index loss.

3.3 Pagination

KEY `idx_a_b_c` (`a`,`b`,`c`)
select * from _t where a=1 and b=2 order by c desc limit 10000,10;

For large offsets, consider using the last retrieved c value (e.g., c < ?) or delayed join with covering indexes.

3.4 IN + ORDER BY

KEY `idx_shopid_status_created` (`shop_id`,`order_status`,`created_at`)
select * from _order where shop_id=1 and order_status in (1,2,3) order by created_at desc limit 10;

MySQL evaluates IN by scanning each value; large IN lists may cause inaccurate cost estimation. Swapping index column order or using delayed join can help.

3.5 Avoid NOT, <>, OR, LIKE '%...%'

These patterns force full table scans; prefer BETWEEN, LIKE 'prefix%', or rewrite with UNION ALL or EXISTS.

3.6 Use specific column lists instead of SELECT *

Returning unnecessary columns wastes I/O and CPU.

3.7 Limit the number of indexes

Too many indexes degrade INSERT/UPDATE performance; keep indexes to a reasonable number (e.g., ≤6 per table).

Article reference: https://developer.aliyun.com/article/980780

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

databasePerformance TuningMySQLincident responseSQL OptimizationOracle
Ziru Technology
Written by

Ziru Technology

Ziru Official Tech Account

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.