
How a Faulty Heartbeat Cable Crashed Our Oracle DB – A Detailed Post‑mortem

Over a New Year holiday, an Oracle database hung repeatedly on gc buffer busy acquire wait events. The root cause turned out to be a faulty heartbeat network cable, but index unusable errors and unstable SQL plans were investigated first. This post-mortem walks through each troubleshooting step, the common causes examined along the way, and the preventive measures that followed.

dbaplus Community

1. Initial Diagnosis

The outage began in the early morning, when the core application froze. The Oracle database showed gc buffer busy acquire wait events along with index and row lock contention. The alert log contained index unusable errors, which pointed to a bug triggered by large numbers of partition indexes becoming unusable (Doc ID 849070.1). The immediate action was to stop the service and rebuild the indexes.

-- A global or local (partition) index becomes UNUSABLE when an operation moves data:
1) TRUNCATE or DROP on a partition that contains data invalidates the global index; the partition index remains valid.
   ADD on a partition does not affect any index.
2) EXCHANGE marks both the global and the partition index UNUSABLE; if INCLUDING INDEXES is used, only the global index is affected.
3) SPLIT on a partition that contains data marks both indexes UNUSABLE; if the target partition is empty, the indexes stay valid.
4) MOVE invalidates both the global and the partition index.
5) An index can also be marked UNUSABLE manually: ALTER INDEX IND_OBJECT_ID UNUSABLE;
Note: on partitioned tables, TRUNCATE, DROP, EXCHANGE, and SPLIT invalidate global indexes by default; adding the UPDATE GLOBAL INDEXES clause keeps them usable.
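
The rebuild step can be scripted once the unusable indexes are known. The following is a minimal sketch, not the team's actual procedure: it assumes the list of unusable indexes has already been fetched (for example from the DBA_INDEXES and DBA_IND_PARTITIONS dictionary views, filtered on STATUS = 'UNUSABLE') and only generates the DDL text. The function name, the APP owner, the IND_LOCAL index, and the P1 partition are hypothetical.

```python
def rebuild_statements(unusable):
    """Build ALTER INDEX ... REBUILD statements.

    `unusable` is an iterable of (owner, index_name, partition_name) tuples.
    partition_name is None for a non-partitioned index; a partitioned index
    has to be rebuilt one partition at a time.
    """
    statements = []
    for owner, index_name, partition in unusable:
        target = f"{owner}.{index_name}"
        if partition is None:
            statements.append(f"ALTER INDEX {target} REBUILD;")
        else:
            statements.append(f"ALTER INDEX {target} REBUILD PARTITION {partition};")
    return statements


# Hypothetical input rows, for illustration only.
SAMPLE_UNUSABLE = [
    ("APP", "IND_OBJECT_ID", None),
    ("APP", "IND_LOCAL", "P1"),
]

if __name__ == "__main__":
    for stmt in rebuild_statements(SAMPLE_UNUSABLE):
        print(stmt)
```

Generating the statements instead of executing them directly leaves room to review the list before running it against a production database.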

2. Secondary Diagnosis

After the indexes were rebuilt, the gc buffer busy acquire waits persisted. ADDM reports showed SQL statements with heavy I/O and highly variable execution plans. The team gathered fresh statistics and fixed the execution plans in place, then found that parallel execution was excessive and reduced the degree of parallelism.
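
One way to spot "highly variable execution plans" is to look for statements that ran under more than one plan. The sketch below assumes (sql_id, plan_hash_value) pairs have already been exported, for instance from DBA_HIST_SQLSTAT; the function name and the sample values are hypothetical and are not taken from the incident.

```python
from collections import defaultdict


def statements_with_multiple_plans(rows):
    """Return {sql_id: sorted plan hashes} for statements seen with more than one plan.

    `rows` is an iterable of (sql_id, plan_hash_value) pairs.
    """
    plans = defaultdict(set)
    for sql_id, plan_hash in rows:
        plans[sql_id].add(plan_hash)
    return {
        sql_id: sorted(hashes)
        for sql_id, hashes in plans.items()
        if len(hashes) > 1
    }


# Hypothetical sample rows, for illustration only.
SAMPLE_ROWS = [
    ("sqlid_a", 111),
    ("sqlid_a", 222),
    ("sqlid_b", 333),
    ("sqlid_a", 111),
]

if __name__ == "__main__":
    print(statements_with_multiple_plans(SAMPLE_ROWS))
```

Statements flagged this way are candidates for fresh statistics or a fixed plan, which is the kind of action described above.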

3. Final Diagnosis

The remaining wait events were finally traced to the network. AWR reports indicated high heartbeat latency on instance 2, the system logs showed a network card repeatedly going down and coming back up, and ping tests between the nodes confirmed heartbeat latency of up to 358 ms.
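
The ping check is easy to automate. The sketch below parses Linux-style ping output and lists the replies slower than a threshold. The output format, the 192.168.10.2 address, the sample lines, and the 1 ms threshold are all assumptions for illustration; only the 358 ms figure comes from the incident.

```python
import re

# Matches the "time=0.181 ms" field of a Linux-style ping reply line.
RTT_PATTERN = re.compile(r"time[=<]([\d.]+) ?ms")


def parse_rtts(ping_output):
    """Extract round-trip times in milliseconds from ping output."""
    return [float(m.group(1)) for m in RTT_PATTERN.finditer(ping_output)]


def slow_replies(ping_output, threshold_ms):
    """Return the round-trip times that exceed threshold_ms."""
    return [rtt for rtt in parse_rtts(ping_output) if rtt > threshold_ms]


# Hypothetical sample output; the address and most values are made up.
SAMPLE_PING = """\
64 bytes from 192.168.10.2: icmp_seq=1 ttl=64 time=0.181 ms
64 bytes from 192.168.10.2: icmp_seq=2 ttl=64 time=358 ms
64 bytes from 192.168.10.2: icmp_seq=3 ttl=64 time=0.204 ms
"""

if __name__ == "__main__":
    # The threshold is an arbitrary illustrative value, not a recommendation.
    print(slow_replies(SAMPLE_PING, threshold_ms=1.0))
```

Run periodically against the heartbeat address of the peer node, a check like this could surface latency spikes before they show up as database waits.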

4. Reflection

The root cause was a hardware-level fault: a poorly seated heartbeat cable between the two database nodes. Global cache (gc) waits depend on blocks being shipped between the instances over this link, so the latency it introduced surfaced as gc buffer busy acquire waits and eventually froze the database. The investigation initially focused on SQL and index issues and overlooked the network layer.

5. Remediation Measures

To avoid a recurrence, the direct-connect heartbeat line will be replaced with a bonded NIC configuration or moved to a switch-based heartbeat network, so that one physical connection is no longer a single point of failure. The direct-connect cable has three drawbacks:

1) A loose or damaged cable can destabilize the cluster and cause node eviction.
2) It limits the cluster to two nodes, which prevents scaling.
3) A cable that works loose again can re-trigger the GC wait events.
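
Until the cabling is changed, link flaps on the heartbeat interface are worth watching for. The sketch below counts "link down" events per network interface in system log lines. It assumes Linux kernel messages of the form "<nic> NIC Link is Down", which many NIC drivers emit; the host name, driver, interface name, and timestamps in the sample are hypothetical.

```python
import re
from collections import Counter

# Assumed message shape: "<nic> NIC Link is Up|Down" (an optional word such as
# "Copper" may appear before "Link"). Other drivers may log differently.
LINK_EVENT = re.compile(r"(?P<nic>\S+) NIC (?:\S+ )?Link is (?P<state>Up|Down)")


def count_link_downs(log_lines):
    """Count 'Link is Down' events per interface name."""
    downs = Counter()
    for line in log_lines:
        match = LINK_EVENT.search(line)
        if match and match.group("state") == "Down":
            downs[match.group("nic")] += 1
    return dict(downs)


# Hypothetical log excerpt, for illustration only.
SAMPLE_LOG = [
    "Jan  1 03:12:01 db2 kernel: igb: eth1 NIC Link is Down",
    "Jan  1 03:12:04 db2 kernel: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex",
    "Jan  1 03:15:40 db2 kernel: igb: eth1 NIC Link is Down",
    "Jan  1 03:15:43 db2 kernel: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex",
]

if __name__ == "__main__":
    print(count_link_downs(SAMPLE_LOG))
```

Any non-zero count on the heartbeat interface is a reason to inspect the physical connection.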

Conclusion

Comprehensive monitoring and thorough data collection enable faster fault localization and resolution, turning chaotic incidents into actionable lessons.
