How to Rescue an Oracle Database from 100% CPU: A Real‑World Troubleshooting Walkthrough
When a critical Oracle 10g database on AIX spikes to 100% CPU, this detailed case study walks through emergency actions, AWR analysis, hidden bottlenecks, and practical tuning steps that reduced CPU usage by half and restored system stability.
A client reported that a key Oracle 10g database (version 10.2.0.5) running on an AIX host with 48 virtual CPUs and 96 GB RAM was experiencing sustained 100% CPU usage during a peak business period, causing severe performance degradation and even loss of SSH access.
1. Emergency Handling
Upon arrival the system was still at 100% CPU. The host showed ~92% user time and ~7% system time, with negligible I/O wait and no swapping. The immediate mitigation was to kill idle sessions without active transactions, which brought CPU down to roughly 90% and allowed further analysis.
2. AWR Report Information
An AWR snapshot for the incident period (about 1.5 hours) showed an average active session count (AAS) of 84 and high redo generation, but no severe spikes in hard parses, logins, or sorts. Typical DBA practice would be to query SQL ordered by CPU Time or SQL ordered by Gets to locate top‑SQL statements.
The top‑SQL list did not reveal any single statement dominating CPU; the highest‑consuming SQL accounted for only 1.6% of total CPU time.
3. Was Hard Parsing Masking the Top‑SQL?
Using the force_matching_signature flag can reveal whether missing bind variables hide a high‑impact SQL. After forcing bind variables, the top‑SQL still contributed less than 3% of CPU time, confirming that focusing on a single statement would not solve the problem.
4. Shifting Perspective: Object‑Level Analysis
When SQL‑level diagnostics stall, aggregating at the object level (access‑path analysis) can be fruitful. Querying Segments by Logical Reads highlighted that the top object contributed only 16% of activity, indicating that a few SQL tweaks would not be sufficient.
5. Reality Check
The author notes that the on‑site observations largely mirrored the client DBA’s own findings, and that the real issue required a broader view beyond isolated SQL statements.
6. Investigating Wait Events
The top‑5 wait events were dominated by User I/O and db file scattered read, each with an average latency of ~2 ms—far from being a bottleneck. Traditional I/O‑centric diagnostics (e.g., SQL ordered by Reads) were therefore not the primary path.
7. Additional Pitfalls
Log file synchronization delays were observed: log file sync ~39 ms and log file parallel write ~20 ms. The environment lacked the _use_adaptive_log_file_sync parameter (a 10g feature), prompting a check of storage performance.
Although storage latency can affect log file parallel write, the metric is also influenced by LGWR scheduling under high CPU load, making it an unreliable sole indicator of storage health (see Oracle Support Doc 34583.1).
8. Decision Making
The CTO requested immediate actions. The author proposed a conservative expectation: a 20% CPU reduction by adding four indexes on three heavily accessed tables, without changing any initialization parameters.
Subsequent analysis of the indexed tables showed a SQL with >40 concurrent sessions (ASS ≈ 40) that, after aggregation via force_matching_signature, rose into the top‑20 of AWR samples but still did not appear in the top‑10 SQL list, illustrating the limitation of AWR for such bursty workloads.
Manual scripts aggregating dba_hist_active_sess_history and dba_hist_sqlstat were recommended for deeper insight.
9. Final Tuning Approach
The team focused on high‑IO SQL statements that inflated redo generation, causing LGWR CPU contention and exacerbating DML latency. Instead of modifying the _high_priority_processes parameter (which requires a restart), a temporary renice of the LGWR process was used to alleviate scheduling pressure.
10. Results
The next day CPU usage dropped dramatically, hovering around 60% without any need for further renicing. A new AWR snapshot showed significant improvement in log file sync and log file parallel write timings, and AAS fell to around 30.
11. Takeaways
Regular health checks and SQL reviews can surface many of the latent issues that only become critical during incidents. Investing time in proactive monitoring and knowledge‑base updates yields high returns when emergencies arise.
Key recommendations:
Adjust cpu_count (or cpu_count-2) for single‑instance databases to leave headroom for OS tasks.
In RAC environments, consider instance‑level CPU isolation.
Use Resource Manager to prioritize critical workloads and protect transaction processing from report‑driven batch jobs.
When CPU is >80% or CPU count is high, set _high_priority_processes for LGWR (requires restart) or apply temporary renice to improve scheduling.
These practices helped transform a 100% CPU crisis into a stable, well‑performing system.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
