Why Salesforce Lost 5 Hours of Data and How Oracle GoldenGate Can Prevent It
The article examines Salesforce's 2016 five‑hour data loss caused by a data‑center outage, explains why Oracle‑based backups failed, and presents Oracle GoldenGate and Active Data Guard as robust disaster‑recovery solutions for cloud databases.
Background
Salesforce runs a multi‑tenant SaaS platform (Force.com, Heroku, Wave) on Oracle databases. Its data centers are located on both U.S. coasts and in Japan, Singapore, and Dublin.
2016 NA14 data‑center outage
On 10 May 2016 a power failure at the NA14 facility caused a service interruption that lasted more than 24 hours. Salesforce later confirmed that any data written between 09:53 UTC and 14:53 UTC could not be restored, resulting in roughly five hours of permanent customer‑data loss.
Root cause and lessons
The incident exposed a weakness in the backup and disaster‑recovery (DR) process: the primary database was restored to an earlier point without a reliable standby or continuous replication, and the restoration attempt itself introduced additional data loss. The case demonstrates that even large cloud providers need a proven, near‑zero‑RPO DR architecture.
Solution 1: Oracle GoldenGate
GoldenGate provides real‑time logical replication and change data capture (CDC). It streams committed transactions from the source to one or more target databases, keeping the standby in sync with sub‑second latency.
Typical workload: production environments with up to 1 TB of daily redo/log generation per instance have been successfully protected.
Key components:
Extract – reads the redo logs on the primary and captures committed changes.
Trail files – portable files that hold the captured changes.
Replicat – applies the changes on the target.
Implementation steps:
Enable ARCHIVELOG mode and ensure sufficient redo log size.
Install GoldenGate on both primary and standby servers.
Create an Extract process that reads the primary redo logs.
Configure Trail files on a shared storage or network location.
Set up a Replicat process on the standby to apply the captured changes.
Validate data consistency with GGSCI> INFO REPLICAT <group>, DETAIL and periodic checksum comparisons.
Schedule monthly failover drills to verify that the standby can be promoted without data loss.
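The steps above can be sketched as a minimal configuration. The group names (extprd, repstb), the trail prefix ./dirdat/aa, the GoldenGate user ggadmin, and the replicated schema hr are placeholders, not values from the incident; a real deployment would also need supplemental logging enabled on the source tables.

```
-- On the primary (GGSCI): register the Extract and its local trail.
GGSCI> DBLOGIN USERID ggadmin, PASSWORD ********
GGSCI> ADD EXTRACT extprd, TRANLOG, BEGIN NOW
GGSCI> ADD EXTTRAIL ./dirdat/aa, EXTRACT extprd

-- Extract parameter file (dirprm/extprd.prm):
EXTRACT extprd
USERID ggadmin, PASSWORD ********
EXTTRAIL ./dirdat/aa
TABLE hr.*;

-- On the standby (GGSCI): register the Replicat with a checkpoint table.
GGSCI> DBLOGIN USERID ggadmin, PASSWORD ********
GGSCI> ADD REPLICAT repstb, EXTTRAIL ./dirdat/aa, CHECKPOINTTABLE ggadmin.ckpt

-- Replicat parameter file (dirprm/repstb.prm):
REPLICAT repstb
USERID ggadmin, PASSWORD ********
ASSUMETARGETDEFS
MAP hr.*, TARGET hr.*;
```

ASSUMETARGETDEFS is appropriate only when source and target table structures match; heterogeneous targets need definition files instead.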
Operational considerations:
Strict change‑control procedures to avoid configuration drift.
Monitoring of lag (seconds) and throughput (MB/s) via GGSCI commands such as INFO EXTRACT <group>, DETAIL and LAG EXTRACT <group>.
Integration with Oracle Enterprise Manager for alerting.
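A routine health check along these lines can be scripted against GGSCI; the group names extprd and repstb are the same placeholders used above.

```
GGSCI> INFO EXTRACT extprd, DETAIL    -- status, checkpoint position, trail files
GGSCI> LAG EXTRACT extprd             -- delay between redo generation and capture
GGSCI> LAG REPLICAT repstb            -- delay between trail write and apply
GGSCI> STATS REPLICAT repstb, LATEST  -- per-table insert/update/delete counts
```

Persisting these numbers over time (rather than checking ad hoc) is what makes it possible to alert when lag trends toward the RPO budget.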
Solution 2: Oracle Active Data Guard (ADG)
Active Data Guard creates a physical standby database that continuously receives and applies redo data from the primary. It can be opened read‑only for reporting while still applying changes, and supports a configurable apply‑delay to protect against logical errors.
Version requirement: Oracle Database 11g Release 2 or later.
Key features:
Automatic block‑level redo transport and apply.
Fast switchover and failover with minimal RTO.
Read‑only access to the standby for reporting or backup without impacting the primary.
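Opening the standby read‑only while redo continues to apply is what distinguishes Active Data Guard from a plain physical standby. A minimal sketch of the sequence on the standby, assuming managed recovery is currently running:

```
-- Stop managed recovery, open the standby read-only,
-- then restart real-time apply so reporting queries see near-current data.
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
ALTER DATABASE OPEN READ ONLY;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
  USING CURRENT LOGFILE DISCONNECT FROM SESSION;
```

Note that querying an open standby with apply running requires an Active Data Guard license on top of Enterprise Edition.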
Implementation steps:
Create a physical standby using RMAN DUPLICATE or manual datafile copy.
Enable Data Guard broker (optional) for simplified management.
Configure redo transport (SYNC or ASYNC) and apply settings (e.g., LOG_ARCHIVE_DEST_2).
Set the FAL_SERVER and FAL_CLIENT parameters so the standby can fetch missing archive logs (FAL = fetch archive log, used for gap resolution).
Test switchover with DGMGRL> SWITCHOVER TO <standby_db_unique_name>; and failover with DGMGRL> FAILOVER TO <standby_db_unique_name>;.
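The transport and role‑change steps can be sketched as follows. The database unique names prodcdb and stbycdb are placeholders; SYNC transport gives near‑zero RPO at the cost of commit latency, while ASYNC trades a small RPO for performance.

```
-- On the primary: ship redo to the standby and enable gap resolution.
ALTER SYSTEM SET LOG_ARCHIVE_DEST_2=
  'SERVICE=stbycdb SYNC AFFIRM VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE)
   DB_UNIQUE_NAME=stbycdb' SCOPE=BOTH;
ALTER SYSTEM SET FAL_SERVER=stbycdb SCOPE=BOTH;

-- Broker-managed role change (DGMGRL):
DGMGRL> CONNECT sys@prodcdb
DGMGRL> SWITCHOVER TO stbycdb;
```

With the broker enabled, it manages the transport destinations itself, so the ALTER SYSTEM settings above belong to the manual (non‑broker) variant of the setup.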
Combining ADG with GoldenGate: Use ADG for fast physical standby and GoldenGate for logical replication to heterogeneous targets or for multi‑site active‑active architectures.
Implementation checklist
Document the RPO/RTO objectives (e.g., RPO ≤ 5 seconds, RTO ≤ 15 minutes).
Choose the primary DR technology (ADG, GoldenGate, or ADG + GoldenGate) based on workload and budget.
Provision standby hardware or cloud instances with matching CPU, memory, and storage capacity.
Configure network bandwidth to support continuous redo transport (minimum 1 Gbps recommended for high‑volume environments).
Automate backup retention, archivelog cleanup, and standby health checks.
Schedule regular (monthly) failover and switchover drills; record results and adjust procedures.
Monitor key metrics: redo lag, apply lag, replication throughput, and error alerts.
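For the Data Guard side of the checklist, the standing lag metrics can be read directly from the standby; this query uses the standard V$DATAGUARD_STATS view.

```
-- On the standby: current transport and apply lag as reported by Data Guard.
SELECT name, value, time_computed
FROM   v$dataguard_stats
WHERE  name IN ('transport lag', 'apply lag');
```

Feeding these two values into the alerting system closes the loop on the RPO/RTO objectives documented at the top of the checklist.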
Conclusion
The 2016 Salesforce outage illustrates that reliance on a single Oracle primary without continuous replication can lead to irreversible data loss. Deploying Oracle GoldenGate for real‑time logical replication, Oracle Active Data Guard for physical standby protection, or a combined ADG + GoldenGate architecture provides an enterprise‑grade DR solution that meets stringent RPO/RTO requirements and safeguards critical SaaS data.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.