How a Hidden Oracle RAC Bug Caused a Database Hang and the Steps We Took to Resolve It
A production Oracle RAC database on IBM Power8 hung because checkpoints could not complete, a condition traced to an Oracle bug triggered by cross-node access to a large index. This write-up covers the analysis, the root cause, and a three-part remediation plan: an index rebuild, application isolation, and a hidden-parameter workaround. The first two measures restored full service.
1. Fault Phenomenon
During production hours a client’s core system showed a session backlog and numerous abnormal wait events. All online redo log groups remained ACTIVE because their checkpoints could not complete, so log switches stalled and the database instance hung.
A similar issue recurred the following week, but a pre‑planned response limited impact and allowed more data collection for a final resolution.
2. Environment
IBM Power8 E880, AIX 7.1, Oracle 4‑node RAC 11.2.0.3.15.
3. Fault Pre‑Plan
Deploy SMS alerts that monitor the number of ACTIVE log groups and raise a warning when the count exceeds four.
When the issue appears, add usable log groups to delay the hang and buy time for diagnosis.
If the node still hangs, perform an emergency stop of that node; RAC’s high-availability architecture lets the remaining nodes keep serving, so the stopped node is not a single point of failure.
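The first two pre-plan steps can be sketched as follows. This is a minimal illustration, not the team's actual tooling: it assumes the ACTIVE count is evaluated per redo thread (the source does not say whether the threshold is per instance or cluster-wide), that rows come from the standard V$LOG view, and that redo logs live in an ASM disk group. The group number 21, the disk group name +DATA, and the 1024 MB size are hypothetical placeholders; the SMS delivery itself is left out.

```python
from collections import Counter

# Rows are expected in this shape, e.g. fetched with any Oracle client.
ACTIVE_LOG_SQL = "SELECT thread#, group#, status FROM v$log"

# From the pre-plan: warn when more than four groups are ACTIVE.
ACTIVE_THRESHOLD = 4


def threads_over_threshold(rows, threshold=ACTIVE_THRESHOLD):
    """Return the redo threads whose ACTIVE group count exceeds the threshold.

    rows: iterable of (thread#, group#, status) tuples.
    Counting per thread is an assumption of this sketch.
    """
    counts = Counter(thread for thread, _group, status in rows if status == "ACTIVE")
    return sorted(thread for thread, n in counts.items() if n > threshold)


def add_logfile_statement(thread, group, disk_group, size_mb):
    """Build the SQL that adds one redo log group to a thread (all values are placeholders)."""
    return (
        f"ALTER DATABASE ADD LOGFILE THREAD {thread} GROUP {group} "
        f"('{disk_group}') SIZE {size_mb}M"
    )


if __name__ == "__main__":
    # Hypothetical snapshot: thread 2 has five ACTIVE groups, thread 1 is healthy.
    sample = [(2, g, "ACTIVE") for g in range(5, 10)]
    sample += [(1, 1, "CURRENT"), (1, 2, "INACTIVE")]
    for thread in threads_over_threshold(sample):
        print(f"WARN thread {thread}: too many ACTIVE log groups")
        print(add_logfile_statement(thread, 21, "+DATA", 1024))
```

Generating the ALTER DATABASE statement rather than executing it keeps a human in the loop during an incident, which seems the safer default for an emergency procedure.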
4. Fault Handling Process
To enable post-mortem analysis, a hanganalyze was run and a system dump was started. Because the system was very busy, the dump was expected to take 60–90 minutes; after about ten minutes it was aborted and node 2 was stopped, allowing the database to recover by 11:48 AM.
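For reference, the diagnostic collection described above is typically driven from a SYSDBA SQL*Plus session with oradebug. The sketch below only assembles such a script as text. The levels used (3 for hanganalyze, 266 for the system state dump) are commonly used values and an assumption here; the source does not state which levels the team chose.

```python
def hang_diagnostics_script(hang_level=3, systemstate_level=266):
    """Return an oradebug script for cluster-wide hang diagnostics.

    The default levels are assumptions of this sketch, not values
    taken from the incident.
    """
    lines = [
        "oradebug setmypid",  # attach oradebug to the current session's process
        "oradebug unlimit",   # lift the trace file size limit
        f"oradebug -g all hanganalyze {hang_level}",            # all RAC instances
        f"oradebug -g all dump systemstate {systemstate_level}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(hang_diagnostics_script())
```

On a heavily loaded system the system state dump is the expensive step, which matches the long estimate quoted above; running hanganalyze first means some evidence survives even if the dump has to be abandoned.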
4.1 Resource Usage Analysis
Comparing Friday’s load with normal days showed overall CPU and I/O usage remained within normal ranges across all four nodes.
5. Fault Cause Analysis
5.1 Checkpoint Incompletion
Initial suspicion fell on the archive log directory, but space and read/write checks were fine. The dump was then examined for deeper clues.
5.2 System Dump Findings
Four blocking sources were identified:
Sessions waiting on row cache locks for two sequence objects (SEQ_LNCH_XXXX_TRIGID and SEQ_LNCH_XXXX_TRIGLOGID), indicating data-dictionary contention.
An INSERT statement waiting on log file switch (checkpoint incomplete), with its blocker traced to a session on node 2 (the LGWR process).
UPDATE statements blocked while waiting for the log switch to complete.
The log‑switch checkpoint, initiated by CKPT, requires LGWR to read the control file to obtain the checkpoint queue.
Analysis showed the DBWR processes waiting on ‘rdbms ipc message’ (an idle wait), which suggests the system believed there were no dirty buffers left to write. The dump, however, revealed a dirty block queue containing an index object (IDX_HQMHLCD_RZ_1) belonging to table UDP_PROD31.QFCHL_XXXXACTION_LOG.
Further log inspection showed that SQL statements using the index suffered ‘gc current request’ and ‘gc cr multi block request’ waits, confirming heavy cross-instance contention.
The index was a large, fragmented local partitioned index created in 2013 and accessed concurrently from nodes 3 and 4.
Cross-node access to this index triggered Oracle bug 16344544, which can cause a deadlock and hang in RAC environments when ‘gc current request’ and ‘gc cr multi block request’ waits contend with each other.
The bug affects Oracle 11.2.0.3 and 11.2.0.4 (releases earlier than 12.1). No patch was available; the only workaround is to set the hidden parameter _gc_bypass_readers=FALSE.
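One way to look for the same cross-node pattern is to group global cache waits by object and instance. The sketch below is an illustration under stated assumptions: it reads from GV$ACTIVE_SESSION_HISTORY (which requires the Diagnostics Pack license), whereas the team worked from logs and dumps, and the object IDs and sample counts in the usage example are invented.

```python
from collections import defaultdict

# The two wait events named in bug 16344544.
GC_EVENTS = ("gc current request", "gc cr multi block request")

# Assumed data source for this sketch; not the method used in the incident.
GC_WAIT_SQL = """
SELECT inst_id, event, current_obj#, COUNT(*) AS samples
FROM gv$active_session_history
WHERE event IN ('gc current request', 'gc cr multi block request')
GROUP BY inst_id, event, current_obj#
"""


def cross_node_hot_objects(rows):
    """Map object id -> sorted instance ids, for objects with GC waits on 2+ instances.

    rows: iterable of (inst_id, event, object_id, samples) tuples.
    """
    instances = defaultdict(set)
    for inst_id, event, object_id, _samples in rows:
        if event in GC_EVENTS:
            instances[object_id].add(inst_id)
    return {obj: sorted(ids) for obj, ids in instances.items() if len(ids) > 1}


if __name__ == "__main__":
    # Hypothetical rows: object 90001 is contended from instances 3 and 4.
    sample = [
        (3, "gc current request", 90001, 120),
        (4, "gc cr multi block request", 90001, 95),
        (1, "gc current request", 90002, 3),
    ]
    print(cross_node_hot_objects(sample))
```

An object that shows both events from different instances is a candidate for the kind of contention described here; it is a lead to investigate, not proof that the bug was hit.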
6. Solution
Rebuild index IDX_HQMHLCD_RZ_1 to reduce fragmentation and improve access efficiency.
Isolate application workloads to avoid cross‑node access that triggers GC contention.
Set the hidden parameter _gc_bypass_readers=FALSE, but only after thorough testing ahead of any production rollout.
The team first rebuilt the index and then proceeded with application isolation, which fully resolved the issue.
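The rebuild and the parameter workaround translate into statements along the following lines. This is a hedged sketch: a local partitioned index is rebuilt one partition at a time, and the partition names P2013 and P2014 are hypothetical (real names would come from DBA_IND_PARTITIONS). It also assumes the index is owned by the same schema as its table, UDP_PROD31, and that the parameter change is written to the spfile for all instances and takes effect after a restart.

```python
def rebuild_partition_statements(owner, index, partitions, online=True):
    """Build one ALTER INDEX ... REBUILD PARTITION statement per partition."""
    suffix = " ONLINE" if online else ""
    return [
        f"ALTER INDEX {owner}.{index} REBUILD PARTITION {name}{suffix}"
        for name in partitions
    ]


def bypass_readers_statement():
    """Build the workaround statement; hidden parameters must be double-quoted."""
    return "ALTER SYSTEM SET \"_gc_bypass_readers\"=FALSE SCOPE=SPFILE SID='*'"


if __name__ == "__main__":
    # Index owner and partition names are assumptions for illustration.
    for stmt in rebuild_partition_statements(
        "UDP_PROD31", "IDX_HQMHLCD_RZ_1", ["P2013", "P2014"]
    ):
        print(stmt + ";")
    print(bypass_readers_statement() + ";")
```

Since the team resolved the incident with the rebuild and application isolation alone, the ALTER SYSTEM statement is shown only for completeness; changing a hidden parameter is normally done under guidance from Oracle Support.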
7. Outcome
After applying the index rebuild and isolation measures, the database returned to normal operation with no further hangs. The three‑week investigation demonstrated the value of systematic fault analysis and collaboration among DBA teams.
dbaplus Community