Why Did Oracle 10g Hang? Uncovering Log File Sync Bottlenecks and Storage I/O Issues
A detailed Oracle 10g case study reveals how excessive log file sync waits, caused by I/O performance problems and frequent log switches on aging mechanical storage, lead to database hangs, and provides step‑by‑step analysis and practical mitigation tips.
This case study examines a severe performance incident on an Oracle 10g database where the entire application stalled at 10 am due to the database becoming unresponsive, a problem traced back to storage issues identified during a prior health check.
1. Abnormal Wait Analysis
Automatic Workload Repository (AWR) data shows that log file sync accounts for 64.2% of waits; it belongs to the Commit wait class.
What is log file sync? When a user session commits, all redo generated by the transaction must be flushed from the redo log buffer in memory to the redo log file to guarantee durability. The session waits on log file sync until LGWR confirms that the write has completed.
2. Root‑Cause Investigation
Why does log file sync wait become so high?
The two most common causes are:
1. I/O performance problems affecting LGWR
2. Excessive application commits
2.1 Analyze Program Commits
Compute the average number of user calls per commit or rollback to determine whether commit frequency is abnormal.
The calculated average user calls/(user commits+user rollbacks) is 60.85, meaning roughly one commit or rollback occurs every 60.85 calls, so commit frequency is not excessive.
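The ratio described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the three counters are interval deltas of the "user calls", "user commits" and "user rollbacks" statistics; the function name and the sample numbers are hypothetical and are not taken from the incident.

```python
def calls_per_transaction(user_calls: int, user_commits: int, user_rollbacks: int) -> float:
    """Average number of user calls per completed transaction (commit or rollback)."""
    transactions = user_commits + user_rollbacks
    if transactions == 0:
        raise ValueError("no commits or rollbacks in the interval")
    return user_calls / transactions


# Hypothetical interval deltas, chosen only to show the arithmetic.
ratio = calls_per_transaction(user_calls=120_000, user_commits=1_900, user_rollbacks=100)
print(f"{ratio:.2f} user calls per commit/rollback")
```

A low value would mean the application commits after very few calls; a value around 60, as in this case, does not point to over-committing.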
Next, verify whether the redo log switch rate is abnormal.
Oracle recommends a log switch every 15‑20 minutes (3‑4 times per hour). A per‑hour count exceeding this suggests the redo log file is too small.
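One way to apply this guideline is to bucket log switch timestamps by hour and flag the hours that exceed it. The sketch below assumes the timestamps have already been fetched (for example from the FIRST_TIME column of V$LOG_HISTORY); the helper names, the limit constant and the sample timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime

MAX_SWITCHES_PER_HOUR = 4  # upper end of the 3-4 switches per hour guideline


def switches_per_hour(switch_times):
    """Count log switches in each clock hour."""
    return Counter(t.replace(minute=0, second=0, microsecond=0) for t in switch_times)


def busy_hours(switch_times, limit=MAX_SWITCHES_PER_HOUR):
    """Return {hour: count} for hours with more switches than the limit."""
    counts = switches_per_hour(switch_times)
    return {hour: n for hour, n in sorted(counts.items()) if n > limit}


# Hypothetical switch times: 3 switches in the 09:00 hour, 10 in the 10:00 hour.
sample = [datetime(2024, 1, 1, 9, m) for m in (5, 25, 45)]
sample += [datetime(2024, 1, 1, 10, m) for m in range(0, 60, 6)]

for hour, count in busy_hours(sample).items():
    print(hour.strftime("%Y-%m-%d %H:00"), count)
```

Hours reported by `busy_hours` are candidates for a closer look at redo log sizing.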
2.2 Analyze I/O Performance
Compare the average wait times of “log file sync” and “log file parallel write”.
The majority of log file sync time is spent in log file parallel write, indicating that I/O delays in writing redo are the primary bottleneck.
Experience shows that an average “log file parallel write” time exceeding 5‑10 ms typically points to storage I/O problems.
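The comparison above can be expressed as a small calculation. This is a rough sketch, assuming each event is described by a wait count and a total wait time in microseconds (as exposed by the TOTAL_WAITS and TIME_WAITED_MICRO columns of V$SYSTEM_EVENT); the function names, the 10 ms default threshold and the sample figures are assumptions for illustration, not measurements from this incident.

```python
def avg_wait_ms(total_waits: int, time_waited_micro: int) -> float:
    """Average wait per event occurrence, in milliseconds."""
    if total_waits == 0:
        return 0.0
    return time_waited_micro / total_waits / 1000.0


def assess_redo_io(sync, write, slow_ms=10.0):
    """Compare 'log file sync' with 'log file parallel write'.

    sync and write are (total_waits, time_waited_micro) tuples.
    write_share is only an approximation of how much of an average
    commit wait is explained by the redo write itself.
    """
    sync_ms = avg_wait_ms(*sync)
    write_ms = avg_wait_ms(*write)
    share = write_ms / sync_ms if sync_ms else 0.0
    return {
        "sync_ms": sync_ms,
        "write_ms": write_ms,
        "write_share": share,
        "io_suspect": write_ms > slow_ms,  # upper end of the 5-10 ms rule of thumb
    }


# Hypothetical figures: 40 ms average sync, 35 ms average parallel write.
result = assess_redo_io(sync=(50_000, 2_000_000_000), write=(40_000, 1_400_000_000))
print(result)
```

When `write_share` is close to 1 and `io_suspect` is true, the redo write itself is slow, which matches the diagnosis in this case.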
Blocking-session analysis also shows many sessions in “log file sync” held up behind LGWR waiting on “log file parallel write”, confirming disk I/O degradation.
The client later confirmed that the storage device was a mechanical disk that had failed, causing a severe drop in I/O performance.
3. Alert Log Examination
The alert.log displayed errors that reinforced the above diagnosis.
When the database switches logs, all private strands must be flushed to the current log before the switch can proceed; the message indicates that a switch was attempted before all of that redo had been fully written.
Private strands, introduced in Oracle 10g, are private redo areas, each protected by its own redo allocation latch, so that multiple sessions can generate redo concurrently with less latch contention before it reaches the redo log buffer.
4. Conclusions
Do not place redo logs on legacy mechanical disks; peak write loads can cause severe log file sync waits, leading to database instability or hangs.
Monitor other processes that may write to the same path and ensure the disk provides sufficient bandwidth for the required workload.
Keep LOG_BUFFER at a reasonable size: an excessively large buffer can make each flush larger and increase flush wait times.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
