Sudden Daily Accounting Lag: DBA Forensics Reveal Oracle RAC Log I/O Bottleneck

On February 10, 2016, a provincial accounting database suffered severe delays in its daily batch while other applications ran normally. A senior DBA collected alert logs, AWR reports, ASH dumps, and OSW metrics, and found degraded log file sync and redo I/O, a rise in rollbacks, and a failed SAN link that turned out to be the root cause.

dbaplus Community

Problem Overview

On 2016-02-10, the daily accounting batch job (the daily ledger run) of a provincial accounting system took dramatically longer to execute, while other applications remained unaffected.

Data Collection

The DBA gathered the following evidence from the Oracle 11g R2 RAC cluster:

Alert logs from both nodes.

AWR reports for the incident period, plus reports from normal periods for comparison.

OSW logs covering the incident hour and one hour before/after.

All trace files generated during the incident.

ASH dump data for the incident window.

Log Analysis

Key findings from the collected data:

LGWR trace showed log‑write latency exceeding 500 ms, indicating redo I/O problems.

The ASH dump revealed that 44% of sessions were waiting on log file sync; LGWR, which those sessions depend on, was itself waiting on log file parallel write.

AWR reports showed log file sync wait times up to 431 ms, a surge in Global Cache (GC) waits, and occurrences of log buffer space waits.

The rollback rate jumped from 0.5/s to 3.5/s, a sevenfold increase, after the incident.

GC receive block count rose 20 % and private network traffic grew from 21 MB to 26 MB, suggesting possible application‑side changes.

vmstat showed more than 10 processes blocked waiting on I/O (b column > 10) during the batch window.

iostat showed several disks at 100% busy with average service times above 80 ms.
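The vmstat check in the findings above can be automated. The following is a minimal sketch, assuming Linux-style vmstat output with a header row that names the `r` and `b` columns; the function name `blocked_samples` and the sample rows are hypothetical and are not data from the incident. Column layouts differ between platforms, so the header lookup may need adjusting.

```python
from typing import List, Tuple


def blocked_samples(vmstat_lines: List[str], threshold: int = 10) -> List[Tuple[int, int]]:
    """Return (sample_index, b_value) for samples whose 'b' column exceeds threshold."""
    b_index = None
    flagged = []
    sample = 0
    for line in vmstat_lines:
        fields = line.split()
        if not fields:
            continue
        if "r" in fields and "b" in fields:
            # Header row: remember where the 'b' (blocked processes) column sits.
            b_index = fields.index("b")
            continue
        if b_index is None or not fields[0].isdigit() or len(fields) <= b_index:
            # Banner lines, timestamps, or anything before the first header.
            continue
        b_value = int(fields[b_index])
        if b_value > threshold:
            flagged.append((sample, b_value))
        sample += 1
    return flagged


# Made-up sample rows for illustration only.
sample_output = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  1      0 812340  10240 402112    0    0   120   340  900 1500 12  3 80  5  0
 3 14      0 809112  10240 402300    0    0   150  9800 1100 2100 10  4 30 56  0
 1 12      0 808900  10240 402310    0    0   140  9900 1050 2000  9  4 28 59  0
"""

flagged = blocked_samples(sample_output.splitlines())
print(flagged)
```

With the made-up rows above, the second and third samples are flagged because their `b` values (14 and 12) exceed the threshold of 10.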

Root‑Cause Investigation

Joint review with the host, storage, and application teams initially found no obvious anomalies on the host or storage side. Closer inspection of the storage architecture, which uses Veritas mirroring, showed that one of the two SAN links had failed. The failure halved the write rate to the disaster-recovery copy and caused the observed redo I/O slowdown.
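To illustrate why a slower mirror leg can delay redo writes, here is a toy model. It assumes that a mirrored write completes only when both copies have been written and that transfer time is simply size divided by bandwidth. The function names and the bandwidth figures (200 MB/s per leg, dropping to 100 MB/s) are hypothetical and are not measurements from the incident.

```python
def write_time_ms(size_kb: float, bandwidth_mb_s: float) -> float:
    """Transfer time in milliseconds, ignoring fixed per-request latency."""
    return size_kb / 1024.0 / bandwidth_mb_s * 1000.0


def mirrored_write_ms(size_kb: float, primary_mb_s: float, mirror_mb_s: float) -> float:
    """Assumed synchronous mirror: the write finishes when the slower leg finishes."""
    return max(write_time_ms(size_kb, primary_mb_s),
               write_time_ms(size_kb, mirror_mb_s))


# Hypothetical figures: both legs healthy, then the mirror leg at half bandwidth.
healthy = mirrored_write_ms(1024, 200, 200)
degraded = mirrored_write_ms(1024, 200, 100)
slowdown = degraded / healthy
print(f"healthy={healthy:.1f} ms degraded={degraded:.1f} ms slowdown={slowdown:.1f}x")
```

Under these assumptions, halving the bandwidth of one leg roughly doubles the completion time of every mirrored write, even though the primary leg is unchanged. Real latency also depends on queueing and fixed overheads, which this sketch ignores.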

Conclusion

The failed SAN link created a severe redo I/O bottleneck, which surfaced as degraded log file sync and log file parallel write waits and slowed the daily batch. The rise in GC traffic points to possible application-side changes and warrants further investigation, and the sevenfold increase in rollbacks should be analyzed jointly with the application team.
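The ratios quoted in the analysis can be checked with a few lines of arithmetic. This is a small sketch using only the figures reported above (rollback rate 0.5/s to 3.5/s, private network traffic 21 MB to 26 MB); the helper names are illustrative.

```python
def fold_change(before: float, after: float) -> float:
    """How many times larger 'after' is than 'before'."""
    if before <= 0:
        raise ValueError("before must be positive")
    return after / before


def percent_change(before: float, after: float) -> float:
    """Relative growth from 'before' to 'after', in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (after - before) / before * 100.0


rollback_fold = fold_change(0.5, 3.5)       # rollbacks per second
traffic_growth = percent_change(21, 26)     # private network traffic, MB
print(f"rollbacks: {rollback_fold:.1f}x, interconnect traffic: +{traffic_growth:.1f}%")
```

The rollback figure works out to exactly sevenfold, and the interconnect traffic growth to about 24%, slightly above the 20% rise in GC receive blocks.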