Achieving Multi‑Active Disaster Recovery with Distributed Databases in Finance
Amid rising cloud outages and strict financial regulations, this article examines traditional multi‑active database solutions such as Oracle RAC and IBM GDPS, contrasts them with modern distributed database designs, and details SequoiaDB’s multi‑active architecture and concrete disaster‑recovery procedures for single‑node, site‑wide, and network failures.
Background
Regulators, especially in the financial sector, require architectures that guarantee continuous service and prevent data loss. Requirements such as “two‑site three‑center” and “dual‑active” mean that the failure of a single data center must not interrupt business operations.
Traditional Multi‑Active Solutions
Typical on‑premises multi‑active designs include:
Oracle RAC: Multiple Oracle instances run in a single data center and share a SAN. Transactions, locks and other coordination are handled over a high‑speed intra‑center network.
IBM GDPS: IBM DB2 for z/OS clusters are deployed across several data centers. Data is replicated with QRep, and a workload controller distributes tasks to provide active‑active service.
Distributed Database Multi‑Active Architecture
Distributed databases separate compute (SQL parsing/execution) from storage and transaction control. In a three‑replica configuration deployed across multiple data centers, each site hosts a local SQL service node. Applications connect via JDBC to the nearest node and are unaware of any master‑slave topology.
All nodes are peers; each can accept reads and writes. Consistency, lock handling and replication are managed internally by the database engine.
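The "connect to the nearest node" behaviour can be illustrated with a minimal client-side sketch. It is not SequoiaDB or JDBC driver code: the `SqlNode` type, the host names, the site labels and the `is_healthy` callback are all hypothetical, and a real deployment might instead rely on a load balancer or driver-level settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SqlNode:
    host: str  # hypothetical host name of a SQL service node
    site: str  # hypothetical label of the data center hosting it


def order_endpoints(nodes, local_site):
    """Put nodes in the caller's site first; sorted() is stable, so the
    original order is preserved within each group."""
    return sorted(nodes, key=lambda node: node.site != local_site)


def pick_node(nodes, local_site, is_healthy):
    """Return the first healthy node, preferring the local site."""
    for node in order_endpoints(nodes, local_site):
        if is_healthy(node):
            return node
    raise RuntimeError("no healthy SQL node available")


# Illustrative topology: one SQL service node per site.
NODES = [
    SqlNode("sql-a1.example", "site-a"),
    SqlNode("sql-b1.example", "site-b"),
]
```

Because every node accepts reads and writes, the fallback to a node in another site needs no special handling in the application; only latency changes.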
SequoiaDB Multi‑Active Implementation
SequoiaDB adopts a three‑replica model for same‑city disaster recovery:
Two replicas reside in the production site, one replica in a standby site.
Strong data‑synchronization (synchronous replication) ensures that an update is acknowledged only after all live nodes have persisted the change, achieving zero data loss.
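The acknowledgment rule described above can be modelled in a few lines. This is a toy in-memory simulation under stated assumptions, not SequoiaDB's replication code: the `Replica` class and `replicated_write` function are invented for illustration, and "persist" here just appends to a list.

```python
class Replica:
    """Toy stand-in for one replica of a data group."""

    def __init__(self, name, site):
        self.name = name
        self.site = site
        self.alive = True
        self.log = []  # records this replica has persisted

    def persist(self, record):
        self.log.append(record)
        return True


def replicated_write(replicas, record):
    """Acknowledge a write only after every live replica has persisted it."""
    live = [replica for replica in replicas if replica.alive]
    if not live:
        raise RuntimeError("no live replica; write rejected")
    acks = [replica.persist(record) for replica in live]
    return all(acks)


# Placement from the text: two replicas in production, one in standby.
group = [
    Replica("r1", "production"),
    Replica("r2", "production"),
    Replica("r3", "standby"),
]
```

In this model a failed replica is simply excluded from the set that must acknowledge, which is why writes continue after a single-node failure while every surviving replica still holds each acknowledged record.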
For cross‑city disaster recovery, a separate SequoiaDB cluster is deployed in a remote data center with a single replica. Structured data is synchronized by replaying the transaction logs generated by the same‑city standby cluster.
Disaster‑Recovery Scenarios
Single‑Node Failure – The remaining two replicas continue to serve reads and writes. The failed node is repaired and automatically resynchronized, or manually restored if needed.
Whole Production Site Failure – When the production data center fails, two of the three replicas are lost at once. The surviving standby replica can then be split off into a single‑node cluster that provides full read‑write service. The split operation typically completes within ten minutes.
Remote Disaster Site Failure – Because each data group retains two replicas in the production site, the system continues to accept reads and writes. Failed remote nodes are repaired and resynchronized as usual.
Network Partition – If the link between production and standby sites is lost, applications continue using the two local replicas. Once connectivity is restored, normal synchronization resumes.
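The four scenarios above can be condensed into a small decision function. The sketch below is one reading of the text, not SequoiaDB logic: the rule "two or more reachable replicas keep serving" is inferred from the scenarios rather than stated in them, and the names `PLACEMENT` and `assess` are hypothetical. Combinations the article does not discuss are reported as not covered instead of being guessed.

```python
# Replica placement from the text: two in production, one in standby.
PLACEMENT = {"production": 2, "standby": 1}


def assess(failed):
    """Map the number of failed or unreachable replicas per site to the
    action the scenarios above describe.

    `failed` is a dict such as {"production": 1}; missing sites mean 0.
    """
    surviving = {
        site: count - failed.get(site, 0) for site, count in PLACEMENT.items()
    }
    total = sum(surviving.values())
    if total >= 2:
        # Single-node failure, standby-site failure, or a partition seen
        # from the production side: the remaining replicas keep serving.
        return "serve"
    if total == 1 and surviving["standby"] == 1:
        # Whole production site lost: split the standby replica off.
        return "split-standby"
    return "not-covered"
```

A network partition is modelled here from the production side, where the standby replica is unreachable, so it takes the same branch as a standby-site failure.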
Operational Details
Cluster Split / Takeover: When a production site becomes unavailable, SequoiaDB’s splitCluster command creates an independent single‑node cluster from the standby replica. The operation is fast (≈10 minutes) and restores write capability.
Cluster Merge: After the production site is restored, the split cluster must be merged using SequoiaDB’s mergeCluster command before both clusters can be started together, preventing split‑brain scenarios.
Strong Consistency Configuration: Enabling strong consistency forces the database to wait for acknowledgment from all live replicas before confirming a write. This guarantees zero data loss (RPO≈0) at the cost of higher write latency.
Log Replay for Remote Sync: The remote single‑replica cluster receives transaction logs from the same‑city standby cluster. Replaying these logs keeps the remote site up‑to‑date without requiring a full synchronous link.
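Log replay for the remote cluster can be sketched as follows. This is a generic illustration of the technique, not SequoiaDB's log format or tooling: the `(lsn, key, value)` entry shape, the use of `None` to mark a delete, and the `replay` function are assumptions made for the example.

```python
def replay(state, applied_lsn, log_entries):
    """Apply log entries to `state` in sequence-number order.

    Each entry is a (lsn, key, value) tuple; value None marks a delete.
    Entries at or below `applied_lsn` are skipped, so replaying the same
    batch twice has no further effect. Returns the new high-water mark.
    """
    for lsn, key, value in sorted(log_entries, key=lambda entry: entry[0]):
        if lsn <= applied_lsn:
            continue  # already applied during an earlier replay
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
        applied_lsn = lsn
    return applied_lsn


# Entries may arrive out of order; replay sorts them before applying.
batch = [
    (3, "acct:1", None),
    (1, "acct:1", 100),
    (2, "acct:2", 250),
]
remote_state = {}
high_water = replay(remote_state, 0, batch)
```

Tracking the last applied sequence number is what lets the remote site resume after a link interruption: it can request or re-read logs from that point without re-applying earlier changes.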
Performance and Availability Outcomes
The three‑replica, compute‑storage‑separated design provides:
Second‑level RTO (recovery time objective) for node‑level failures, because the surviving replicas keep serving; a site‑wide takeover through a cluster split takes about ten minutes.
Near‑zero RPO (recovery point objective) due to synchronous replication.
Continuous read‑write capability across sites without application‑level master‑slave awareness.
dbaplus Community
