Choosing Low‑Risk Strategies for Critical DBA Outages
When a major operational incident strikes, the safest approach is to favor simple, low‑risk actions and to limit the scope of responsibility you take on. Two real DBA lessons illustrate this: Oracle RAC node failures and a data‑center power‑loss disaster.
Principle 1: Choose the simplest, lowest‑risk remediation
When several remediation options exist, prioritize the one that introduces the least operational risk and requires the smallest scope of responsibility. If system performance is degraded but still within acceptable business thresholds, it may be safer to tolerate the degradation rather than perform a high‑risk fix.
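The selection rule above can be sketched as a small function. This is only an illustration: the Option class, the 1 to 5 scores, and the "tolerate unless the action is minimal-risk" cutoff are hypothetical assumptions, not something the original article defines.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Option:
    name: str
    risk: int   # hypothetical score: 1 (low) to 5 (high) operational risk
    scope: int  # hypothetical score: 1 (narrow) to 5 (broad) responsibility


def choose_remediation(options: List[Option],
                       degradation_acceptable: bool) -> Optional[Option]:
    """Return the option to pursue, or None to tolerate the degradation."""
    if not options:
        return None
    # Lowest risk first; scope of responsibility breaks ties.
    best = min(options, key=lambda o: (o.risk, o.scope))
    # Assumed policy: while the business can live with the slowdown,
    # only a minimal-risk action (score 1) is worth taking.
    if degradation_acceptable and best.risk > 1:
        return None
    return best


options = [
    Option("restart failed instance now", risk=4, scope=3),
    Option("kill runaway sessions", risk=2, scope=1),
]
print(choose_remediation(options, degradation_acceptable=True))
print(choose_remediation(options, degradation_acceptable=False).name)
```

With these made-up scores, the function declines to act while degradation is acceptable and otherwise picks the lower-risk, narrower option.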
Typical scenario – Oracle RAC node failure
For a high‑load RAC cluster where a node crashes, the recommended procedure is:
Inspect the alert logs and trace files of the surviving nodes for errors that could indicate an imminent failure.
Monitor active sessions, session counts, CPU load, I/O statistics, and wait events on the surviving nodes.
If any risk indicators are found, terminate the offending sessions (e.g., ALTER SYSTEM KILL SESSION 'sid,serial#'; in RAC, append ,@inst_id to target a session on another instance) to stabilize the cluster before further analysis.
If risk cannot be assessed and the incident occurs during peak business hours, defer a node restart until after the peak window.
Avoid restarting a failed node immediately after a failover while the workload has not yet stabilized; many severe incidents stem from this premature action.
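The checklist above can be read as a decision sequence. The following is a minimal sketch of that reading, not a definitive runbook: the function names, the boolean inputs, and the returned action strings are hypothetical, and the inputs are assumed to come from the log and monitoring checks in the first two steps. The helper only builds the statement text and does not connect to a database; the optional ,@inst_id suffix and the IMMEDIATE keyword are standard Oracle syntax for ALTER SYSTEM KILL SESSION.

```python
from typing import Optional


def kill_session_sql(sid: int, serial: int,
                     inst_id: Optional[int] = None) -> str:
    """Build a KILL SESSION statement; inst_id targets another RAC instance."""
    target = f"{sid},{serial}"
    if inst_id is not None:
        target += f",@{inst_id}"
    return f"ALTER SYSTEM KILL SESSION '{target}' IMMEDIATE"


def next_action(risk_found: bool, risk_assessable: bool,
                in_peak_hours: bool, workload_stable: bool) -> str:
    """Map the checklist to one suggested action (hypothetical labels)."""
    if risk_found:
        # Stabilize the surviving nodes before any further analysis.
        return "kill offending sessions"
    if not risk_assessable and in_peak_hours:
        return "defer restart until after peak"
    if not workload_stable:
        # Restarting right after failover is the premature action to avoid.
        return "wait for workload to stabilize"
    return "restart can be considered"


print(kill_session_sql(123, 4567, inst_id=2))
print(next_action(risk_found=False, risk_assessable=False,
                  in_peak_hours=True, workload_stable=True))
```

Note that restarting the failed node is the last branch: it is reached only when no risk indicators are present, the situation can be assessed or is outside peak hours, and the workload has settled.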
Principle 2: Do not assume full control of the environment
DBAs operate in complex data‑center environments with many unknown variables (power, cooling, storage, network). Decisions should leave room for uncertainty and avoid “optimal‑on‑paper” solutions that ignore hidden constraints.
Case study – Dual power‑loss event in a data center
Approximately fifteen years ago, a data center lost both utility feeds at the same time because the two upstream substations shared a single high‑voltage source. The outage lasted several hours, longer than the UPS could sustain the core storage environment.
Two conflicting strategies were proposed:
DBA recommendation: Shut down the core database servers and storage immediately, keep peripheral systems running, and rely on the UPS only as a short‑term bridge. This limits heat buildup in the server hall and keeps the storage arrays from triggering their automatic over‑temperature protection.
IT manager decision: Keep core systems online, use ice trucks to cool the hall, and only shut down peripheral equipment.
The IT manager's approach was followed. The hall temperature rose beyond design limits and triggered the automatic over‑temperature protection of the core storage arrays, which resulted in bad blocks, tape damage, and loss of database files.
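A rough calculation shows why shedding the core load matters. All figures below are hypothetical and are not taken from the incident. The sketch assumes a simple linear model (runtime equals usable battery energy divided by load), which ignores real battery discharge behavior, and relies on the fact that nearly all power drawn by IT equipment ends up as heat in the hall.

```python
def ups_runtime_hours(usable_energy_kwh: float, load_kw: float) -> float:
    """Approximate runtime under a linear model (an assumption)."""
    if load_kw <= 0:
        raise ValueError("load_kw must be positive")
    return usable_energy_kwh / load_kw


# Hypothetical figures, for illustration only.
USABLE_KWH = 100.0     # usable UPS battery energy
CORE_KW = 40.0         # core database servers and storage
PERIPHERAL_KW = 10.0   # peripheral systems

everything_on = ups_runtime_hours(USABLE_KWH, CORE_KW + PERIPHERAL_KW)
core_off = ups_runtime_hours(USABLE_KWH, PERIPHERAL_KW)
peripherals_off = ups_runtime_hours(USABLE_KWH, CORE_KW)

print(f"everything on:   {everything_on:.1f} h, heat {CORE_KW + PERIPHERAL_KW:.0f} kW")
print(f"core off:        {core_off:.1f} h, heat {PERIPHERAL_KW:.0f} kW")
print(f"peripherals off: {peripherals_off:.1f} h, heat {CORE_KW:.0f} kW")
```

Under these assumed numbers, shutting down the core stretches the bridge from 2 to 10 hours and cuts the heat released into the hall by 80 percent, whereas shutting down only the peripherals gains half an hour and leaves most of the heat load in place.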
Recovery required:
Force the databases open, using BBED (or an equivalent low‑level recovery tool) to bypass normal startup checks.
Export whatever data could be read.
Re‑create the databases and reload the exported data, supplementing missing rows from backups.
Core services were restored after two days for internal users and after one week for external customers.
The incident illustrates that, in severe outages, the safest course is to stay within one’s competence, choose low‑risk actions, and accept a limited scope of responsibility to protect both the system and the DBA’s professional standing.
ITPUB
