Why Oracle RAC GC Waits Spike: Uncovering Hidden Network and IRQ Bottlenecks
A detailed DBA interview walks through the diagnosis of excessive Oracle RAC GC wait events, revealing that cross‑instance queries, network latency, NIC bonding mode, and unbalanced IRQ handling together caused the slowdown, and shows step‑by‑step fixes that finally eliminated the problem.
In a conversational interview, DBA A explains that Oracle RAC’s shared‑disk architecture can generate GC buffer busy acquire waits when a node requests data cached on a remote instance, especially during heavy cross‑instance access.
The problematic SQL (identified by sql_id='05uqdabhzncdc') shows many gc buffer busy acquire events in its execution plan, confirming the GC wait symptom.
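A quick way to confirm which waits dominate for that statement is to sample ASH for the sql_id. A minimal sketch, assuming SYSDBA access on one node and the standard gv$ views (the sql_id is the one quoted above):

sqlplus -s / as sysdba <<'EOF'
-- count GC-related waits sampled for the suspect statement, per instance
select inst_id, event, count(*) as samples
from   gv$active_session_history
where  sql_id = '05uqdabhzncdc'
and    event like 'gc%'
group  by inst_id, event
order  by samples desc;
EOF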
Initial suspicion points to network latency. An AWR report for a two‑node RAC cluster reveals unusually high ping latencies (30 ms for 500‑byte pings and 10 ms for 8 KB pings) between nodes. Traceroute output shows occasional spikes up to 5 ms, while idle measurements are near 0 ms, indicating the latency is load‑dependent.
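The same latency figures can be reproduced from the OS with plain ping and traceroute against the other node's private interconnect address. A rough sketch, where 192.168.10.2 is a hypothetical interconnect IP:

ping -c 20 -s 500  192.168.10.2    # ~500-byte payload, matches the AWR 500B ping statistic
ping -c 20 -s 8192 192.168.10.2    # 8 KB payload, roughly the size of a GC block transfer
traceroute 192.168.10.2            # should be a single hop; watch for intermittent spikes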
Further investigation shows the database’s private NIC (bond0) handling over 100 MB/s of traffic, and the server uses NIC bonding mode 4 (LACP), unlike other servers that use mode 0, 1, or 6. After changing the bonding mode and replacing hardware, latency remains high.
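Both the bonding mode and the interconnect throughput can be read directly from the OS. A small check, assuming the private interface really is bond0 as stated:

grep -i "bonding mode" /proc/net/bonding/bond0   # mode 4 shows as "IEEE 802.3ad Dynamic link aggregation"
sar -n DEV 1 5 | grep -E "IFACE|bond0"           # rxkB/s + txkB/s; ~100 MB/s was observed here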
OSWatcher data uncovers that CPU 15's %soft usage hits 100 % during peak hours, suggesting soft-IRQ overload. The article explains that the NIC's interrupts are delivered to a single CPU core, overwhelming it, and that the irqbalance service alone cannot resolve the issue because the interrupts are pinned.
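The same symptom can be watched live outside OSWatcher. A quick look, assuming the sysstat package is installed:

mpstat -P ALL 1 5                      # the %soft column per core; CPU 15 was pegged at 100 % here
grep -E "CPU|NET_RX" /proc/softirqs    # NET_RX counters show which core absorbs the NIC interrupts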
To rebalance the IRQs, the following steps are performed (a consolidated sketch follows the list):

1. Stop the irqbalance service: service irqbalance stop
2. Identify the NIC's interrupt numbers: cat /proc/interrupts | grep -i ethx
3. Check the current CPU affinity of each interrupt: cat /proc/irq/126/smp_affinity
4. Reassign interrupts to different CPUs, e.g.: echo 16 > /proc/irq/126/smp_affinity (smp_affinity takes a hexadecimal CPU mask)
5. If manual rebinding is cumbersome, a Huawei driver script can automate the process.
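The individual steps can be tied together in a small script. This is a hedged sketch, not the Huawei tool mentioned above: the interface name ethx and the CPU list are placeholders, and the IRQ numbers are read from /proc/interrupts at run time.

#!/bin/bash
# Spread the NIC's interrupts round-robin over a chosen set of cores.
service irqbalance stop                        # keep irqbalance from undoing the pinning

cpus=(8 9 10 11 12 13 14)                      # cores picked to relieve the overloaded CPU 15
i=0
for irq in $(grep -i ethx /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
    cpu=${cpus[$((i % ${#cpus[@]}))]}
    printf "%x" $((1 << cpu)) > /proc/irq/$irq/smp_affinity   # smp_affinity takes a hex CPU mask
    echo "IRQ $irq -> CPU $cpu"
    i=$((i + 1))
done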
After reassigning IRQs, soft‑IRQ load drops but still concentrates on a single core. Further research suggests two additional optimizations:
1. Enable adaptive IRQ coalescing to batch interrupts and reduce CPU wake-ups: ethtool -C ethx adaptive-rx on
2. Configure UDP flow hashing so packets are distributed across receive queues: ethtool --config-ntuple ethx rx-flow-hash udp4 sdfn (hash on source/destination address and port)

Applying both settings reduces soft-IRQ usage to 30-60 %, network latency falls below 0.01 ms, and the GC wait events disappear.
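Both settings can be verified after the change; ethx again stands in for the real interface name:

ethtool -c ethx | grep -i adaptive     # interrupt coalescing: expect "Adaptive RX: on"
ethtool -n ethx rx-flow-hash udp4      # lists the UDP header fields now used for RX queue hashing
mpstat -P ALL 1 5                      # %soft should now stay in the 30-60 % band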
The case demonstrates that diagnosing RAC performance issues requires looking beyond the SQL itself to network topology, NIC bonding modes, and low-level interrupt handling, and that systematic IRQ rebalancing and kernel-level tuning can resolve seemingly “mystical” GC bottlenecks.
