Sudden Daily Accounting Lag: DBA Forensics Reveal Oracle RAC Log I/O Bottleneck
On February 10, 2016, the daily batch of a provincial accounting database ran far longer than usual while other applications ran normally. A senior DBA collected alert logs, AWR reports, ASH dumps, and OSWatcher (OSW) metrics, and found degraded log file sync and redo I/O together with a higher rollback rate. The investigation ultimately traced the slowdown to a failed SAN link.
Problem Overview
On 2016-02-10, the daily ledger batch job of a provincial accounting system took dramatically longer to run than usual, while other applications remained unaffected.
Data Collection
The DBA gathered the following evidence from the 11g R2 RAC cluster:
Alert logs from both nodes.
AWR reports for the incident period, plus reports from normal periods for comparison.
OSWatcher (OSW) logs covering the incident hour and the hour before and after it.
All trace files generated during the incident.
ASH dump data for the incident window.
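The OSW collection window described in the list above (the incident hour plus one hour on either side) can be sketched as a small helper. This is only an illustration: the source gives the date but not the hour of the batch run, so the start time below is a placeholder, and `collection_window` and `hourly_slots` are hypothetical names.

```python
from datetime import datetime, timedelta


def collection_window(incident_start, incident_hours=1, padding_hours=1):
    """Return (start, end) covering the incident plus padding on both sides."""
    padding = timedelta(hours=padding_hours)
    start = incident_start - padding
    end = incident_start + timedelta(hours=incident_hours) + padding
    return start, end


def hourly_slots(start, end):
    """List the start of every whole hour that falls in [start, end)."""
    slots = []
    current = start.replace(minute=0, second=0, microsecond=0)
    while current < end:
        slots.append(current)
        current += timedelta(hours=1)
    return slots


# Placeholder hour: the article states the date (2016-02-10) but not the time.
incident = datetime(2016, 2, 10, 9, 0)
window_start, window_end = collection_window(incident)
for slot in hourly_slots(window_start, window_end):
    print(slot.strftime("%Y-%m-%d %H:00"))
```

The same window can then be used to select matching AWR snapshots and trace files, so that all evidence covers the same period.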
Log Analysis
Key findings from the collected data:
The LGWR trace file recorded log writes taking more than 500 ms, indicating redo I/O problems.
The ASH dump showed 44% of sessions waiting on log file sync; the LGWR process they depended on was itself waiting on log file parallel write.
AWR reports showed log file sync wait times up to 431 ms, a surge in Global Cache (GC) waits, and occurrences of log buffer space waits.
Rollback rate jumped from 0.5 /s to 3.5 /s (≈7× increase) after the incident.
GC receive block count rose 20 % and private network traffic grew from 21 MB to 26 MB, suggesting possible application‑side changes.
vmstat showed more than 10 processes blocked on I/O (b column > 10) during the batch window, meaning the CPUs were largely stalled waiting for I/O.
iostat showed several disks at 100% busy with average service times above 80 ms.
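The baseline-versus-incident comparisons in the findings above reduce to simple arithmetic, which is worth checking explicitly. The sketch below uses only the figures quoted in the list; the function names are hypothetical.

```python
def fold_change(baseline, incident):
    """How many times larger the incident value is than the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return incident / baseline


def percent_increase(baseline, incident):
    """Relative growth from baseline to incident, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (incident - baseline) / baseline * 100.0


# Figures quoted in the findings above.
rollback_fold = fold_change(0.5, 3.5)             # rollbacks per second
interconnect_pct = percent_increase(21.0, 26.0)   # private network traffic, MB

print(f"rollback rate: {rollback_fold:.1f}x baseline")
print(f"interconnect traffic: +{interconnect_pct:.1f}%")
```

The interconnect growth works out to roughly 24%, the same order as the 20% rise in GC blocks received, which is consistent with the two metrics moving together.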
Root‑Cause Investigation
Initial checks with the host, storage, and application teams found no obvious anomalies on the host or storage side. Further investigation showed, however, that the storage layer uses Veritas mirroring and that one of the two SAN links had failed. The failure halved the write rate to the disaster-recovery copy, which slowed redo writes and produced the observed redo I/O degradation.
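A toy model may help show why a slow mirror leg affects every redo write. It assumes the mirror is synchronous, so a write completes only when the slower of the two copies has been written; the source does not state the mirroring mode, and all sizes and rates below are invented for illustration.

```python
def leg_write_ms(size_kb, rate_kb_per_ms):
    """Time to push one write down a single storage path."""
    if rate_kb_per_ms <= 0:
        raise ValueError("rate must be positive")
    return size_kb / rate_kb_per_ms


def mirrored_write_ms(size_kb, primary_rate, mirror_rate):
    """Assumption: synchronous mirroring, so the slower leg sets the latency."""
    return max(leg_write_ms(size_kb, primary_rate),
               leg_write_ms(size_kb, mirror_rate))


# Hypothetical numbers, not measurements from the incident.
WRITE_KB = 1000.0
healthy = mirrored_write_ms(WRITE_KB, primary_rate=100.0, mirror_rate=100.0)
degraded = mirrored_write_ms(WRITE_KB, primary_rate=100.0, mirror_rate=50.0)

print(f"healthy:  {healthy:.1f} ms")
print(f"degraded: {degraded:.1f} ms ({degraded / healthy:.1f}x)")
```

This model only captures the direct effect of halved throughput. On a saturated path, queued requests can push observed latency well beyond a simple doubling, which would be in line with the multi-hundred-millisecond log writes seen in the LGWR trace.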
Conclusion
A redo I/O bottleneck, visible as degraded log file sync and log file parallel write waits, caused the batch slowdown. The higher GC traffic may reflect application-side changes and should be investigated further. The seven-fold increase in rollbacks also warrants joint analysis with the application team.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
