Databases 13 min read

How to Rescue an Oracle Database from 100% CPU: A Real‑World Troubleshooting Walkthrough

When a critical Oracle 10g database on AIX spikes to 100% CPU, this detailed case study walks through emergency actions, AWR analysis, hidden bottlenecks, and practical tuning steps that reduced CPU usage by half and restored system stability.

dbaplus Community
dbaplus Community
dbaplus Community
How to Rescue an Oracle Database from 100% CPU: A Real‑World Troubleshooting Walkthrough

A client reported that a key Oracle 10g database (version 10.2.0.5) running on an AIX host with 48 virtual CPUs and 96 GB RAM was experiencing sustained 100% CPU usage during a peak business period, causing severe performance degradation and even loss of SSH access.

1. Emergency Handling

Upon arrival the system was still at 100% CPU. The host showed ~92% user time and ~7% system time, with negligible I/O wait and no swapping. The immediate mitigation was to kill idle sessions without active transactions, which brought CPU down to roughly 90% and allowed further analysis.

2. AWR Report Information

An AWR snapshot for the incident period (about 1.5 hours) showed an average active session count (AAS) of 84 and high redo generation, but no severe spikes in hard parses, logins, or sorts. Typical DBA practice would be to query SQL ordered by CPU Time or SQL ordered by Gets to locate top‑SQL statements.

The top‑SQL list did not reveal any single statement dominating CPU; the highest‑consuming SQL accounted for only 1.6% of total CPU time.

3. Was Hard Parsing Masking the Top‑SQL?

Using the force_matching_signature flag can reveal whether missing bind variables hide a high‑impact SQL. After forcing bind variables, the top‑SQL still contributed less than 3% of CPU time, confirming that focusing on a single statement would not solve the problem.

4. Shifting Perspective: Object‑Level Analysis

When SQL‑level diagnostics stall, aggregating at the object level (access‑path analysis) can be fruitful. Querying Segments by Logical Reads highlighted that the top object contributed only 16% of activity, indicating that a few SQL tweaks would not be sufficient.

5. Reality Check

The author notes that the on‑site observations largely mirrored the client DBA’s own findings, and that the real issue required a broader view beyond isolated SQL statements.

6. Investigating Wait Events

The top‑5 wait events were dominated by User I/O and db file scattered read, each with an average latency of ~2 ms—far from being a bottleneck. Traditional I/O‑centric diagnostics (e.g., SQL ordered by Reads) were therefore not the primary path.

7. Additional Pitfalls

Log file synchronization delays were observed: log file sync ~39 ms and log file parallel write ~20 ms. The environment lacked the _use_adaptive_log_file_sync parameter (a 10g feature), prompting a check of storage performance.

Although storage latency can affect log file parallel write, the metric is also influenced by LGWR scheduling under high CPU load, making it an unreliable sole indicator of storage health (see Oracle Support Doc 34583.1).

8. Decision Making

The CTO requested immediate actions. The author proposed a conservative expectation: a 20% CPU reduction by adding four indexes on three heavily accessed tables, without changing any initialization parameters.

Subsequent analysis of the indexed tables showed a SQL with >40 concurrent sessions (ASS ≈ 40) that, after aggregation via force_matching_signature, rose into the top‑20 of AWR samples but still did not appear in the top‑10 SQL list, illustrating the limitation of AWR for such bursty workloads.

Manual scripts aggregating dba_hist_active_sess_history and dba_hist_sqlstat were recommended for deeper insight.

9. Final Tuning Approach

The team focused on high‑IO SQL statements that inflated redo generation, causing LGWR CPU contention and exacerbating DML latency. Instead of modifying the _high_priority_processes parameter (which requires a restart), a temporary renice of the LGWR process was used to alleviate scheduling pressure.

10. Results

The next day CPU usage dropped dramatically, hovering around 60% without any need for further renicing. A new AWR snapshot showed significant improvement in log file sync and log file parallel write timings, and AAS fell to around 30.

11. Takeaways

Regular health checks and SQL reviews can surface many of the latent issues that only become critical during incidents. Investing time in proactive monitoring and knowledge‑base updates yields high returns when emergencies arise.

Key recommendations:

Adjust cpu_count (or cpu_count-2) for single‑instance databases to leave headroom for OS tasks.

In RAC environments, consider instance‑level CPU isolation.

Use Resource Manager to prioritize critical workloads and protect transaction processing from report‑driven batch jobs.

When CPU is >80% or CPU count is high, set _high_priority_processes for LGWR (requires restart) or apply temporary renice to improve scheduling.

These practices helped transform a 100% CPU crisis into a stable, well‑performing system.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

SQLperformance tuningDatabase OptimizationOracleAWRAIXHigh CPU
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.