How One SQL Crashed a Core System for 12 Hours – Oracle Performance Troubleshooting
A single runaway SQL caused a 12‑hour outage of a core Oracle system; the article walks through the fault symptoms, AWR and ASH analysis, abnormal SQL identification, execution‑plan changes caused by partition operations, and the step‑by‑step remediation that restored performance.
1. Fault Symptoms
During a weekend, the application began lagging at 4 am and by 2 pm the core system was completely stalled, with the database showing massive wait events.
enq:TX index contention
cursor: pin S wait on X
direct path readMonitoring revealed CPU and I/O usage above 90 %, indicating a severe performance problem.
2. Investigation Process
2.1 AWR Analysis
The first step was to generate an AWR report and correlate it with the ASH report.
Direct Path Read ranked first, consuming 63 % of DB time. This wait occurs when large, rarely‑used tables are scanned and each row is read directly from disk into the PGA, bypassing the shared SGA buffer cache.
2.2 Locating Abnormal SQL
Analysis of the top SQL showed high execution frequency; a plan change was suspected, and the root cause needed to be identified to prevent recurrence.
2.3 Analyzing Execution Plan
Using the SQL ID, the in‑memory execution plan was examined and a clear plan change was observed, leading to full‑table scans and massive I/O.
-- From shared pool
select * from table
(dbms_xplan.display_cursor('&sql_id',null,'typical'));
-- From AWR
select * from table(dbms_xplan.display_awr('&sql_id'));Historical plans showed that since the outage began, the plan had changed and the average CPU time per execution rose to over a hundred seconds.
2.4 Fault Localization
Business confirmation revealed that early morning staff had freed storage by dropping a partition, which altered the table structure. The resulting plan change, combined with missing statistics for the affected partition, caused the performance degradation.
3. Resolution Steps
1. Locate the SQL in memory and flush its plan
select address,hash_value,executions,parse_calls
from v$sqlarea where sql_id='4ca86dg34xg62';
exec sys.dbms_shared_pool.purge('C000000A4C502F40,4103674309','C');
2. Gather partition statistics
BEGIN
DBMS_STATS.GATHER_TABLE_STATS(
ownname => 'YOUR_SCHEMA',
tabname => 'YOUR_PARTITIONED_TABLE',
cascade => TRUE,
estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE,
method_opt => 'FOR ALL COLUMNS SIZE AUTO',
degree => DBMS_STATS.DEFAULT_DEGREE);
END;
/Re‑checking the execution plan showed it had returned to the optimal state.
4. Index Maintenance Topology
Partition index loss can trigger plan changes.
1. TRUNCATE or DROP can invalidate global indexes while leaving partition indexes valid.
2. SPLIT on a partition with data makes both global and partition indexes UNUSABLE; SPLIT on an empty partition does not.
3. MOVE on a partition invalidates both global and partition indexes.
4. Besides ADD, operations like TRUNCATE, DROP, EXCHANGE, SPLIT cause global index loss, but UPDATE GLOBAL INDEXES can preserve them.Before Oracle 12c, DDL on partitioned tables (DROP, TRUNCATE, EXCHANGE, SPLIT, MOVE) would invalidate global indexes, requiring costly UPDATE GLOBAL INDEXES or ONLINE clauses, which could introduce long locks.
Oracle 12c introduced asynchronous global index maintenance, allowing fast DDL while a background job later repairs the indexes.
5. Summary
Beware of the hidden Oracle performance killer – Direct Path Read. If the root cause isn’t identified, the same issue is likely to recur.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
