How a Hidden Oracle RAC Bug Caused a Database Hang and the Steps We Took to Resolve It

A production Oracle RAC database on IBM Power8 hung because its checkpoints could not complete, a symptom eventually traced to an Oracle bug triggered by cross‑node access to an index. This article covers the analysis, the root cause, and a three‑step remediation (index rebuild, application isolation, and a hidden parameter change) that restored full service.

dbaplus Community

1. Fault Phenomenon

During production hours a client’s core system showed a session backlog and numerous abnormal wait events. All online redo log groups remained ACTIVE because their checkpoints could not complete, so no group was available for the next log switch and the database instance hung.

A similar issue recurred the following week, but a pre‑planned response limited impact and allowed more data collection for a final resolution.

2. Environment

IBM Power8 E880, AIX 7.1, Oracle 4‑node RAC 11.2.0.3.15.

3. Fault Pre‑Plan

Deploy SMS alerts that monitor the number of ACTIVE log groups and trigger a warning when it exceeds four.

When the issue appears, add online redo log groups to delay the hang and buy time for diagnosis.

As a last resort, stop the hung node; RAC’s high‑availability architecture keeps the service running on the remaining nodes.
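The alert rule in the first item can be sketched as a small check. This is a minimal sketch, not the team's monitoring script: it assumes the rows come from a query such as `SELECT thread#, group#, status FROM v$log`, that the count is evaluated per redo thread (the article does not say whether the threshold is per thread or overall), and that alert delivery is handled elsewhere. The function name and the sample rows are hypothetical.

```python
from collections import Counter

# Threshold from the pre-plan: warn when more than four groups are ACTIVE.
ACTIVE_THRESHOLD = 4

# Assumed source of the rows; fetching them with a database driver is out of scope.
LOG_STATUS_SQL = "SELECT thread#, group#, status FROM v$log"


def threads_over_threshold(rows, threshold=ACTIVE_THRESHOLD):
    """Return {thread: active_count} for redo threads whose ACTIVE count exceeds threshold.

    rows is an iterable of (thread, group, status) tuples.
    """
    active = Counter(thread for thread, _group, status in rows if status == "ACTIVE")
    return {thread: count for thread, count in active.items() if count > threshold}


# Hypothetical sample: thread 2 has five ACTIVE groups, thread 1 has one.
sample = [
    (1, 1, "CURRENT"), (1, 2, "ACTIVE"), (1, 3, "INACTIVE"),
    (2, 4, "ACTIVE"), (2, 5, "ACTIVE"), (2, 6, "ACTIVE"),
    (2, 7, "ACTIVE"), (2, 8, "ACTIVE"), (2, 9, "CURRENT"),
]
alerts = threads_over_threshold(sample)
```

A non-empty result would be the trigger for the SMS warning and, per the second item, the cue to add log groups.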

4. Fault Handling Process

To enable post‑mortem analysis, a hanganalyze was run and a system dump was started. Because the system was very busy, the dump was expected to take 60–90 minutes, so it was aborted after about ten minutes and node 2 was stopped. The database recovered by 11:48 AM.

4.1 Resource Usage Analysis

Comparing Friday’s load with normal days showed overall CPU and I/O usage remained within normal ranges across all four nodes.

5. Fault Cause Analysis

5.1 Checkpoint Incompletion

Initial suspicion fell on the archive log directory, but space and read/write checks were fine. The dump was then examined for deeper clues.

5.2 System Dump Findings

Four blocking sources were identified:

Two sequence objects (SEQ_LNCH_XXXX_TRIGID and SEQ_LNCH_XXXX_TRIGLOGID) waiting on row cache locks, indicating data‑dictionary contention.

An INSERT statement whose blocker was a session on node 2 (the LGWR process), with the wait recorded as log file switch (checkpoint incomplete).

UPDATE statements blocked while waiting for log switch completion.

The log‑switch checkpoint, initiated by CKPT, requires LGWR to read the control file to obtain the checkpoint queue.
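Tracing each waiting session back to the head of its blocking chain is the core of this kind of analysis. The sketch below illustrates the idea only; the team worked from hanganalyze output and the system dump, not from this code. It assumes the waiter-to-blocker pairs have already been extracted, for example from the `INST_ID`, `SID`, `BLOCKING_INSTANCE` and `BLOCKING_SESSION` columns of `gv$session`. The session IDs are hypothetical.

```python
def root_blockers(blocked_by):
    """Map each waiting session to the session at the head of its blocking chain.

    blocked_by maps (instance, sid) -> (instance, sid) of the direct blocker.
    A session that does not appear as a key is treated as a chain head.
    """
    roots = {}
    for session in blocked_by:
        seen = {session}
        current = session
        while current in blocked_by:
            blocker = blocked_by[current]
            if blocker in seen:
                # Cycle (deadlock): stop at the last session before the loop closes.
                break
            seen.add(blocker)
            current = blocker
        roots[session] = current
    return roots


# Hypothetical chain: a session on node 4 waits on one on node 3,
# which in turn waits on a session on node 2.
waits = {
    (4, 202): (3, 101),
    (3, 101): (2, 55),
}
roots = root_blockers(waits)
```

Grouping waiters by their root makes it easier to see which few sessions are holding up everything else.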

The DBWR processes were waiting on ‘rdbms ipc message’, an idle wait, which suggests they saw no dirty buffers to write. Yet the dump revealed a dirty block queue containing an index object (IDX_HQMHLCD_RZ_1) belonging to table UDP_PROD31.QFCHL_XXXXACTION_LOG.

Further log inspection showed that the SQL statements using this index suffered ‘gc current request’ and ‘gc cr multi‑block request’ waits, confirming heavy cross‑instance contention.

The index was a large, fragmented, locally partitioned index created in 2013 and accessed concurrently from nodes 3 and 4.
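Spotting objects that are touched from more than one instance can be sketched as a simple grouping step. This is an illustrative sketch under assumptions: it takes (instance, object name) samples as input, such as might be derived from active session history, and the article does not describe how the team actually established the access pattern. `OTHER_IDX` is a hypothetical object name.

```python
from collections import defaultdict


def objects_shared_across_nodes(samples):
    """Return {object_name: sorted instance ids} for objects seen on two or more instances.

    samples is an iterable of (instance_id, object_name) tuples.
    """
    seen = defaultdict(set)
    for instance_id, object_name in samples:
        seen[object_name].add(instance_id)
    return {name: sorted(ids) for name, ids in seen.items() if len(ids) > 1}


# Hypothetical samples: the index from the article is touched on nodes 3 and 4,
# while another object stays on a single node.
samples = [
    (3, "IDX_HQMHLCD_RZ_1"),
    (4, "IDX_HQMHLCD_RZ_1"),
    (3, "IDX_HQMHLCD_RZ_1"),
    (1, "OTHER_IDX"),
]
shared = objects_shared_across_nodes(samples)
```

Objects that appear in the result are the candidates for global cache contention between instances.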

Cross‑node access to this index triggered Oracle bug 16344544, which can cause deadlock and hang in RAC environments when ‘gc current request’ and ‘gc cr multi‑block request’ contend.

The bug affects Oracle 11.2.0.3 and 11.2.0.4 (releases earlier than 12.1). No patch exists; the only workaround is to set the hidden parameter _gc_bypass_readers=FALSE.

6. Solution

Rebuild index IDX_HQMHLCD_RZ_1 to reduce fragmentation and improve access efficiency.

Isolate application workloads to avoid cross‑node access that triggers GC contention.

Set the hidden parameter _gc_bypass_readers=FALSE, but only after thorough testing and before any production rollout.

The team first rebuilt the index and then proceeded with application isolation, which fully resolved the issue.
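The statements behind the first and third measures might look roughly like the output of the sketch below. It rests on several assumptions: the index owner is taken to be the table's schema (UDP_PROD31), the partition names are hypothetical, an online rebuild is assumed to be acceptable, and the parameter change is assumed to go to the SPFILE for all instances, with the restart handled separately. The article states none of these details.

```python
def rebuild_statements(owner, index_name, partitions):
    """Build one ALTER INDEX ... REBUILD PARTITION statement per partition.

    A locally partitioned index is rebuilt partition by partition.
    """
    return [
        f"ALTER INDEX {owner}.{index_name} REBUILD PARTITION {partition} ONLINE"
        for partition in partitions
    ]


def workaround_statement():
    """Build the statement that sets the hidden parameter named in the article.

    Hidden (underscore) parameters must be double-quoted in ALTER SYSTEM.
    SCOPE and SID are assumptions, not taken from the article.
    """
    return "ALTER SYSTEM SET \"_gc_bypass_readers\" = FALSE SCOPE = SPFILE SID = '*'"


# Hypothetical partition names; real ones would come from the data dictionary.
statements = rebuild_statements("UDP_PROD31", "IDX_HQMHLCD_RZ_1", ["P2013", "P2014"])
statements.append(workaround_statement())
```

Generating the statements for review, rather than executing them directly, fits the article's advice to test thoroughly before changing production.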

7. Outcome

After applying the index rebuild and isolation measures, the database returned to normal operation with no further hangs. The three‑week investigation demonstrated the value of systematic fault analysis and collaboration among DBA teams.
