How We Resolved a Sudden DB Load Spike: Root Cause, Fixes, and SQL Optimization Lessons
This article details a November 2022 database outage caused by a sudden CPU and load surge, explains how the team diagnosed the issue, outlines the emergency steps taken, and shares practical SQL performance optimization recommendations to prevent similar incidents.
Basic Fault Information
Fault occurred on 2022‑11‑09 at 17:50. An alert robot reported a load alarm and the system was almost unavailable. CPU usage spiked to nearly 100% while the Oracle process %CPU was not high, indicating high concurrency rather than a single slow SQL.
Fault Response
1. Fault localization: Logged into the DB server, found the Oracle instance had terminated unexpectedly, and restarted it immediately.
2. Initial analysis suggested that a burst of messages in the corporate WeChat group caused a massive number of concurrent logins, increasing load.
3. After restart, CPU and load remained high, prompting further investigation of running transactions.
Specific steps:
1.1 View currently executing SQL statements:
SELECT A.SQL_ID, A.SQL_TEXT, COUNT(1)
  FROM V$SESSION S, V$SQLAREA A
 WHERE S.SQL_ID = A.SQL_ID
   AND S.SQL_ID IS NOT NULL
 GROUP BY A.SQL_ID, A.SQL_TEXT
 ORDER BY 3 DESC;
1.2 Sample result (data masked):
select column1 FROM demo_table where customer_id = 1000 AND year_month LIKE '%202210%';
select column1 FROM demo_table where customer_id = 1001 AND year_month LIKE '%202210%';
select column1 FROM demo_table where customer_id = 1002 AND year_month LIKE '%202210%';
1.3 Calculate each SQL's load percentage using ASH history:
select b.*, c.sql_fulltext sql_text
  from (select a.sql_id,
               sum(sample_count) sample_count,
               count(1) sql_exec_count,
               trunc(sum(tm_delta_dbtime_ms / 1000), 2) delta_dbtime_sec,
               trunc(sum(tm_delta_dbtime_ms) / count(1) / 1000, 2) sql_ms_per,
               round(ratio_to_report(sum(tm_delta_dbtime_ms)) over (), 4) * 100 || '%' db_time_percentage,
               min(min_sample_time) min_sample_time,
               max(max_sample_time) max_sample_time
          from (select sql_id,
                       sql_exec_id,
                       count(1) sample_count,
                       trunc(sum(tm_delta_db_time / 1000)) tm_delta_dbtime_ms,
                       to_char(min(sample_time), 'yyyymmdd hh24:mi:ss') min_sample_time,
                       to_char(max(sample_time), 'yyyymmdd hh24:mi:ss') max_sample_time
                  from v$active_session_history ash
                 where sql_id is not null
                   and session_type = 'FOREGROUND'
                   and ash.sample_time >= to_date('&begin_time', 'yyyymmdd hh24:mi:ss')
                   and ash.sample_time <  to_date('&end_time', 'yyyymmdd hh24:mi:ss')
                 group by sql_id, sql_exec_id
                having trunc(sum(tm_delta_db_time / 1000)) > 0) a
         group by a.sql_id) b,
       v$sqlarea c
 where b.sql_id = c.sql_id(+)
 order by b.delta_dbtime_sec desc;
The analysis showed many similar SQLs without bind variables, each contributing a small portion of the load; a manual SELECT of one statement took about 300 ms, matching the ASH statistics.
Correction Direction
Three urgent fixes were requested from the development team:
Use bind variables.
Pass customer_id as a string matching the column type, instead of a numeric value that triggers an implicit conversion.
Store year_month as a six‑digit yyyymm value, eliminating the need for LIKE.
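The three fixes above can be sketched as a rewritten query. This is illustrative only; the table and column names follow the masked samples above, and the bind variable names are placeholders:

```sql
-- Hypothetical rewrite of the sampled query:
--   * literals replaced with bind variables, so all executions share one cursor
--   * year_month stored as a six-digit yyyymm value, so equality replaces LIKE
SELECT column1
  FROM demo_table
 WHERE customer_id = :customer_id   -- bound as a string, matching the column type
   AND year_month  = :year_month;   -- e.g. '202210'; no leading-wildcard LIKE
```

With bind variables, the thousands of near-identical statements hard parse once and then share the same plan, which is what removes the parsing-driven CPU load described above.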
The fix was deployed at 18:43, after which CPU usage and load dropped dramatically, as shown by the resource usage graphs captured at the time.
Fault Review
1. Fault handling insights
High DB load is often caused by slow SQL, but a large number of “not‑slow” SQLs under high concurrency can also cripple the system.
If SQL rewriting or index tuning is insufficient, architectural adjustments may be required.
High "CPU" readings can mean high user CPU (compute- and memory-bound work) or high iowait (time spent waiting on disk I/O); the two point to different bottlenecks.
Frequent SQLs should use bind variables to reduce parsing overhead.
2. SQL execution efficiency considerations
MySQL execution flow:
If the query cache is enabled, it is checked first (generally disabled in production, and removed entirely in MySQL 8.0).
If cache miss, the parser performs lexical and syntactic analysis.
The optimizer generates the best execution plan based on cost estimation (I/O + CPU).
The executor interacts with the storage engine to return results.
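The optimizer's choice can be inspected with EXPLAIN. A minimal sketch, using a hypothetical orders table defined here only for illustration:

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE orders (
  id         INT PRIMARY KEY,
  shop_id    INT,
  created_at DATETIME,
  KEY idx_shopid_created (shop_id, created_at)
);

-- EXPLAIN reports the access type, chosen index, and estimated rows,
-- exposing the cost-based decision the optimizer made for this statement.
EXPLAIN SELECT * FROM orders
 WHERE shop_id = 1
 ORDER BY created_at DESC
 LIMIT 10;
```

Checking the `type`, `key`, and `rows` columns of the EXPLAIN output is usually the first step when applying the recommendations below.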
Common SQL Optimization Recommendations
3.1 Left‑most prefix rule
KEY `idx_shopid_orderno` (`shop_id`, `order_no`)
select * from _t where order_no = '';
This query cannot use the composite index because it skips the leading column. Either include shop_id in the WHERE clause, or define an index that leads with order_no.
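A sketch of the left-most prefix behavior on the same index (values are placeholders):

```sql
KEY `idx_shopid_orderno` (`shop_id`, `order_no`)

-- Can use the index: the leading column shop_id is constrained.
select * from _t where shop_id = 1 and order_no = 'A001';

-- Cannot use the composite index: the leading column shop_id is skipped.
select * from _t where order_no = 'A001';

-- If lookups by order_no alone are common, lead with it instead:
-- KEY `idx_orderno_shopid` (`order_no`, `shop_id`)
```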
3.2 Implicit conversion
KEY `idx_mobile` (`mobile`)
select * from _user where mobile = 12345678901;
Comparing a string column with a numeric literal triggers an implicit conversion and defeats the index. Store mobile as a string and query with a quoted value.
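A before/after sketch of the implicit-conversion trap, assuming mobile is a VARCHAR column:

```sql
-- Numeric literal: MySQL casts the mobile column for comparison,
-- so idx_mobile cannot be used and the table is scanned.
select * from _user where mobile = 12345678901;

-- Quoted literal matches the column type; idx_mobile can be used.
select * from _user where mobile = '12345678901';
```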
3.3 Pagination
KEY `idx_a_b_c` (`a`, `b`, `c`)
select * from _t where a = 1 and b = 2 order by c desc limit 10000, 10;
For large offsets, MySQL still reads and discards the first 10,000 rows. Instead, carry the last retrieved c value forward (e.g., c < ?), or use a delayed join over a covering index.
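Both alternatives can be sketched as follows; this assumes _t has a primary key id, which is not stated in the original example:

```sql
-- Keyset pagination: filter by the last c value seen on the previous page,
-- so the index is entered directly at the right position.
select * from _t
 where a = 1 and b = 2 and c < :last_c
 order by c desc
 limit 10;

-- Delayed join: page through the covering index (a, b, c) first,
-- then fetch the full rows only for the 10 ids that survive the offset.
select t.*
  from _t t
  join (select id from _t
         where a = 1 and b = 2
         order by c desc
         limit 10000, 10) p
    on t.id = p.id;
```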
3.4 IN + ORDER BY
KEY `idx_shopid_status_created` (`shop_id`, `order_status`, `created_at`)
select * from _order where shop_id = 1 and order_status in (1, 2, 3) order by created_at desc limit 10;
MySQL expands the IN list into separate range probes, so the index order no longer matches the created_at sort, and large IN lists can mislead the cost estimator. Reordering the index columns or using a delayed join can help.
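One possible rewrite is to run one ordered query per status and merge the results, so each branch walks (shop_id, order_status, created_at) in index order and stops after 10 rows; the outer sort then works on at most 30 rows:

```sql
(select * from _order where shop_id = 1 and order_status = 1
  order by created_at desc limit 10)
union all
(select * from _order where shop_id = 1 and order_status = 2
  order by created_at desc limit 10)
union all
(select * from _order where shop_id = 1 and order_status = 3
  order by created_at desc limit 10)
order by created_at desc
limit 10;
```

This trades one query for three cheap index-ordered scans; it pays off when the IN list is small and each status has many rows.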
3.5 Avoid NOT, <>, OR, LIKE '%...%'
These patterns usually prevent index use and force full table scans; prefer BETWEEN, LIKE 'prefix%', or rewrites using UNION ALL or EXISTS.
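A sketch of the UNION ALL rewrite for an OR across different columns, assuming separate indexes exist on a and b:

```sql
-- OR across different columns often prevents a single index from being used:
select * from _t where a = 1 or b = 2;

-- Each branch can use its own index; the second branch excludes rows
-- already returned by the first so the union stays duplicate-free.
select * from _t where a = 1
union all
select * from _t where b = 2 and a <> 1;
```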
3.6 Use specific column lists instead of SELECT *
Returning unnecessary columns wastes I/O and CPU.
3.7 Limit the number of indexes
Too many indexes degrade INSERT/UPDATE performance; keep indexes to a reasonable number (e.g., ≤6 per table).
Article reference: https://developer.aliyun.com/article/980780
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
