
How Shandong Mobile Achieved Zero‑Downtime Dual‑Active Disaster Recovery in the Cloud

This article details Shandong Mobile's cloud‑based dual‑active architecture, covering virtualization, VMware HA/DRS, Oracle GoldenGate and ADG optimizations, large Layer‑2 network designs, heartbeat mechanisms to avoid split‑brain, comprehensive testing scenarios, and the resulting reductions in downtime and hardware costs.

dbaplus Community

Background and Cloud Transition

After migrating to the cloud, Shandong Mobile faced new challenges: virtualization and the shift to clustered x86 environments made the original design unsuitable, so a dual‑active solution that preserves business continuity was required.

Scenario 1: Third‑Generation CRM with EBUS Cross‑Center Active‑Active Cluster

The third‑generation CRM introduced a distributed service bus layer (EBUS). Because EBUS runs as a service cluster with extensive configuration and strict consistency requirements, its dual‑active design needs a distributed coordination mechanism.

Scenario 2: VMware Virtualization Platform Dual‑Active Design

Using storage‑array dual‑active and VMware cross‑site clustering, the solution leverages VMware HA for automatic failover and DRS for intelligent resource distribution. Key infrastructure components include:

Servers in both data centers forming a single cluster, with HA providing high availability and DRS providing dynamic resource allocation.

10 GbE links carrying heartbeat and vMotion traffic; all servers must comply with cluster compatibility rules.

Key Point 1: Large Layer‑2 Network Options

All scenarios except design 4 require a cross‑center large Layer‑2 network. Options evaluated:

OTV (Overlay Transport Virtualization) to extend VLANs across a Layer‑3 network.

Direct fiber connections.

MPLS‑based VPLS interconnect.

Analysis shows the direct‑connect solution offers the highest efficiency, followed by overlay methods and MPLS.

Key Point 2: Oracle GoldenGate Dual‑Active Data‑Sync Optimization

Performance bottlenecks were identified in the data‑replication pipeline:

Extract process: Log generation 30‑50 GB/h, CPU 1.9 %. The bottleneck is the LCR‑to‑UDF conversion. Recommendations: split the work across several Extract processes, group tables by schema, tune parameters (eofdelay, flushsecs), and lengthen both the log read interval (to 3 s) and the memory flush interval.

Pump process: Log generation 15‑30 GB/h, CPU 7 %, bandwidth 1 GB/min (10‑15 MB/s). Recommendations: ensure primary tables have primary keys or unique indexes, enable data compression, enlarge TCP buffers.

Replicat process: Typical throughput is about 1 GB of queue (trail) data per hour. Optimizations: merge small transactions, split large transactions (maxtransops), partition Replicat by table or by range.
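The two Replicat ideas above, splitting large transactions and partitioning the apply work by range, can be sketched in a few lines of Python. This is only an illustration of the concepts, not GoldenGate code: `split_transaction` and `assign_replicat` are hypothetical helper names, and the CRC32 hash is an assumption chosen to give a stable key-to-group mapping.

```python
import zlib
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def split_transaction(ops: Sequence[T], max_ops: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most max_ops operations.

    Mirrors the idea behind a "maximum operations per transaction" limit:
    one very large source transaction is applied as several smaller ones.
    """
    if max_ops <= 0:
        raise ValueError("max_ops must be positive")
    for start in range(0, len(ops), max_ops):
        yield list(ops[start:start + max_ops])


def assign_replicat(key: str, groups: int) -> int:
    """Map a row key to one of `groups` apply groups (1-based).

    The same key always lands in the same group, so changes to a given
    row are applied in order by a single group.
    """
    if groups <= 0:
        raise ValueError("groups must be positive")
    return zlib.crc32(key.encode("utf-8")) % groups + 1


if __name__ == "__main__":
    big_txn = [f"op-{i}" for i in range(10)]
    for chunk in split_transaction(big_txn, max_ops=4):
        print(len(chunk), chunk[0], "...", chunk[-1])
    for key in ("cust-1001", "cust-1002", "cust-1003"):
        print(key, "-> replicat group", assign_replicat(key, groups=3))
```

Note the trade-off: once a source transaction is split, the target no longer applies it atomically, so a failure partway through can leave the target between two source-consistent states until the remaining chunks are applied.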

Key Point 3: Oracle ADG Dual‑Active Performance

For Oracle 11g ADG, observed metrics include:

Daily archived log volume 1300 GB (600 GB on node 1, 700 GB on node 2).

Peak hourly log volume 183 GB.

Network: 1 GbE link, average bandwidth 16.24 MB/s, peak 52 MB/s.

Transport + Apply lag average 0.65 s, typically <10 s during normal operations.
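As a sanity check, the log volumes above can be converted into sustained transfer rates and compared with the reported bandwidth figures. The sketch below assumes binary units (1 GB = 1024 MB) and treats a 1 GbE link as 10^9 bits per second; both are assumptions, since the source does not state which convention it uses.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_HOUR = 3_600
MB_PER_GB = 1024  # assumption: binary units


def rate_mb_per_s(volume_gb: float, seconds: int) -> float:
    """Average transfer rate in MB/s needed to ship volume_gb in `seconds`."""
    return volume_gb * MB_PER_GB / seconds


# Figures taken from the list above.
daily_avg = rate_mb_per_s(1300, SECONDS_PER_DAY)
peak_hour = rate_mb_per_s(183, SECONDS_PER_HOUR)

# Raw capacity of a 1 GbE link in the same units, ignoring protocol overhead.
link_mb_per_s = 1_000_000_000 / 8 / (1024 * 1024)

if __name__ == "__main__":
    print(f"daily average : {daily_avg:.2f} MB/s")
    print(f"peak hour     : {peak_hour:.2f} MB/s")
    print(f"peak / link   : {peak_hour / link_mb_per_s:.0%}")
```

Under these assumptions the peak hour works out to roughly 52 MB/s, in line with the reported peak, and the daily average to roughly 15.4 MB/s, slightly below the reported 16.24 MB/s. The gap may reflect transport overhead or a different measurement window; the source does not say.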

Key Point 4: Oracle Extended RAC + IBM GPFS A‑A Parameters

Critical parameters ensure RAC disk arbitration occurs after GPFS arbitration, guaranteeing GPFS makes the decision first during network failures.
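The ordering requirement can be expressed as a simple check on timeouts: the file system must finish its arbitration before either RAC timer expires. The sketch below is a simplified model, not the project's actual configuration. `gpfs_decision_s` is a hypothetical name for the worst-case GPFS arbitration time, and all numeric values are illustrative placeholders rather than figures from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArbitrationTimeouts:
    gpfs_decision_s: float    # hypothetical: worst-case time for GPFS to arbitrate
    rac_misscount_s: float    # RAC network heartbeat timeout
    rac_disktimeout_s: float  # RAC disk arbitration timeout


def gpfs_decides_first(t: ArbitrationTimeouts, margin_s: float = 0.0) -> bool:
    """True if GPFS arbitration completes, with margin, before any RAC timer fires."""
    earliest_rac_action = min(t.rac_misscount_s, t.rac_disktimeout_s)
    return t.gpfs_decision_s + margin_s < earliest_rac_action


if __name__ == "__main__":
    # Illustrative values only.
    ok = ArbitrationTimeouts(gpfs_decision_s=60, rac_misscount_s=120, rac_disktimeout_s=300)
    bad = ArbitrationTimeouts(gpfs_decision_s=60, rac_misscount_s=30, rac_disktimeout_s=200)
    print("ordered correctly:", gpfs_decides_first(ok, margin_s=30))
    print("ordered correctly:", gpfs_decides_first(bad, margin_s=30))
```

A check like this could run as part of configuration review, so that a later change to one timer does not silently invert the intended order.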

Key Point 5: In‑Memory Database Synchronization Performance

Tests of Oracle ↔ TimesTen (TT) synchronization showed:

When Oracle updates <150 k rows, TT cache refreshes within 30 s (≈600 MB base tables).

Large batch loads cause non‑linear performance degradation; splitting them into smaller transactions is recommended.

TT master‑slave asynchronous mode peaks at ~1 GB/min; beyond this, backlog occurs.

Large transactions (≈10 k rows, 35 MB) may trigger timeouts; in “friendly” mode, the standby aborts the transaction and continues, while in strict mode the primary waits, causing blockage.
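The backlog behavior of asynchronous replication can be illustrated with a simple queue model: whenever the change rate exceeds what the replication path can ship, the excess accumulates and drains only once the rate falls below capacity. The sketch below assumes the ~1 GB/min peak reported above as the capacity; the function name and the per-minute load values are made up for illustration.

```python
from typing import Iterable, List


def replication_backlog(
    incoming_gb_per_min: Iterable[float],
    capacity_gb_per_min: float = 1.0,
) -> List[float]:
    """Return the backlog (GB) at the end of each minute.

    Each minute the queue grows by the incoming volume and shrinks by
    at most `capacity_gb_per_min`; it never drops below zero.
    """
    backlog = 0.0
    history = []
    for incoming in incoming_gb_per_min:
        backlog = max(0.0, backlog + incoming - capacity_gb_per_min)
        history.append(backlog)
    return history


if __name__ == "__main__":
    # Hypothetical load: a two-minute burst above capacity, then a quiet period.
    load = [0.5, 1.5, 2.0, 0.5, 0.2]
    for minute, queued in enumerate(replication_backlog(load), start=1):
        print(f"minute {minute}: backlog {queued:.1f} GB")
```

In this model a burst lasting two minutes takes more than three further minutes to drain, which is one reason to spread large batch loads over time instead of submitting them at once.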

Key Point 6: Heartbeat and Network Design to Prevent Split‑Brain

Due to long distances between data centers, additional redundancy is required for network, internal network, and SAN links. Design recommendations include:

Ring‑topology with full redundancy and a robust arbitration mechanism.

VPLEX witness device to act as a tie‑breaker, ensuring only one site continues I/O during a link failure.

RAC heartbeat parameters: misscount (the network heartbeat timeout) and disktimeout (the disk arbitration timeout).
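The role of a witness as tie-breaker can be sketched as a small decision function. This is a deliberately simplified model of the general technique, not the actual VPLEX algorithm: the `Site` enum, the `preferred` parameter, and the choice to suspend I/O at both sites when the witness reaches neither are all assumptions made for illustration.

```python
from enum import Enum
from typing import Set


class Site(Enum):
    A = "A"
    B = "B"


def surviving_sites(
    inter_site_link_up: bool,
    witness_sees_a: bool,
    witness_sees_b: bool,
    preferred: Site = Site.A,
) -> Set[Site]:
    """Decide which sites may continue I/O.

    While the inter-site link is up, both sites stay active. When it is
    down, at most one site continues, which prevents split-brain.
    """
    if inter_site_link_up:
        return {Site.A, Site.B}
    reachable = {
        site
        for site, seen in ((Site.A, witness_sees_a), (Site.B, witness_sees_b))
        if seen
    }
    if not reachable:
        return set()          # conservative choice in this model: suspend both
    if preferred in reachable:
        return {preferred}    # both reachable, or only the preferred one
    return reachable          # only the non-preferred site is reachable


if __name__ == "__main__":
    print(surviving_sites(True, True, True))     # normal operation
    print(surviving_sites(False, True, True))    # link cut, both sites alive
    print(surviving_sites(False, False, True))   # site A lost
```

The key property is that every branch taken after a link failure returns at most one site, so the two data centers can never both accept writes while unable to see each other.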

Key Point 7: Comprehensive Planned and Unplanned Test Scenarios

Dual‑active systems involve cross‑center network, data, and storage layers, leading to more complex fault scenarios than traditional architectures. Thorough testing of inter‑dependencies and failure modes is essential.

Dual‑Active Results

Pilot deployments achieved:

Zero‑cutover capability, reducing BOSS system downtime windows by 40‑60 % (from ~150 h/year to ~70 h/year).

Recovery time reduced from ~30 minutes to under 5 minutes.

Hardware investment savings of ~40 % by leveraging disaster‑recovery resources.

Overall, the project lowered operational risk and improved customer satisfaction.
