How to Build Real-Time Active‑Active Disaster Recovery for OLAP MPP Clusters

This article explains why disaster‑recovery and active‑active architectures are essential for OLAP MPP data‑warehouse clusters, outlines the specific RPO/RTO requirements for batch and real‑time workloads, and compares several data‑synchronization techniques and active‑active deployment models with their advantages and drawbacks.

ITPUB

Mar 14, 2023

Background

The 14th Five‑Year Plan for Big Data Industry Development emphasizes big data as a core pillar of the digital economy. Consequently, comprehensive disaster‑recovery (DR) and active‑active solutions for OLAP MPP (Massively Parallel Processing) clusters have become essential.

DR Requirements for OLAP MPP Clusters

Traditional batch‑oriented data warehouses (e.g., T+1, T+0.X) typically tolerate hour‑level RPO (Recovery Point Objective) and RTO (Recovery Time Objective). Real‑time analytical warehouses, however, demand minute‑ or second‑level RPO/RTO, often approaching zero, to guarantee continuous service and data consistency.
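
As a rough illustration of how the two workload classes differ, the sketch below checks a replication lag against an RPO target. The `DrObjective` type, the `meets_rpo` helper, and the concrete targets are assumptions made for this example, not values from the original article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrObjective:
    """Recovery targets for one workload class (values below are illustrative)."""
    rpo: timedelta  # maximum tolerated data loss
    rto: timedelta  # maximum tolerated downtime


def meets_rpo(last_synced_at: datetime, failure_at: datetime, objective: DrObjective) -> bool:
    """Everything written after the last successful sync is lost on failover."""
    data_loss_window = failure_at - last_synced_at
    return data_loss_window <= objective.rpo


# Hypothetical targets: hour-level for batch, second-level for real-time.
BATCH = DrObjective(rpo=timedelta(hours=1), rto=timedelta(hours=4))
REALTIME = DrObjective(rpo=timedelta(seconds=5), rto=timedelta(minutes=1))

failure = datetime(2023, 3, 14, 12, 0, 0)
last_sync = failure - timedelta(minutes=10)

batch_ok = meets_rpo(last_sync, failure, BATCH)        # a 10-minute lag fits an hour-level RPO
realtime_ok = meets_rpo(last_sync, failure, REALTIME)  # but not a second-level RPO
```

The same ten‑minute lag is acceptable for the batch warehouse and a violation for the real‑time one, which is why the two need different synchronization techniques.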

Why OLAP MPP Differs from OLTP

OLTP databases rely on write‑ahead logging (WAL) for backup and replication, which makes DR designs straightforward. OLAP MPP systems prioritize high‑throughput analytical workloads and cannot rely on pure transaction‑log replication for DR without severe performance penalties. OLAP DR solutions therefore center on data‑block or file‑level synchronization rather than on transaction logs alone.

Common Data‑Synchronization Methods

Transaction‑log & data‑block sync (Method 1.1)

Pros: Syncs only changed data blocks, minimizing data transfer; suitable for warehouses with low change rates.

Cons: Intrusive to the primary cluster; under heavy load the sync latency grows, and it does not scale for large‑volume changes.
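
The idea of shipping only changed blocks can be sketched as follows. This is a minimal model, assuming fixed‑size blocks compared by hash and data that only changes in place or grows; the function names and the tiny block size are hypothetical, and real MPP products track changed blocks internally rather than rehashing files.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size so the example is easy to follow; real blocks are far larger


def block_digests(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Split data into fixed-size blocks and hash each block."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(primary: bytes, standby_digests: list[str],
                   block_size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Return {block index: block bytes} for blocks the standby lacks or holds stale."""
    changed = {}
    for index, digest in enumerate(block_digests(primary, block_size)):
        if index >= len(standby_digests) or standby_digests[index] != digest:
            start = index * block_size
            changed[index] = primary[start:start + block_size]
    return changed


def apply_blocks(standby: bytes, blocks: dict[int, bytes],
                 block_size: int = BLOCK_SIZE) -> bytes:
    """Overwrite or append the shipped blocks on the standby copy."""
    parts = [standby[i:i + block_size] for i in range(0, len(standby), block_size)]
    for index in sorted(blocks):
        if index < len(parts):
            parts[index] = blocks[index]
        else:
            parts.append(blocks[index])
    return b"".join(parts)


standby_data = b"aaaabbbbcccc"
primary_data = b"aaaaXbbbccccdd"  # block 1 modified, block 3 appended

delta = changed_blocks(primary_data, block_digests(standby_data))
synced = apply_blocks(standby_data, delta)
```

Only two of the four blocks cross the network here. When most blocks change, the delta approaches the full data size, which matches the scaling limit noted above.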

Backup‑restore sync (Method 1.2)

Pros: Leverages existing backup tools; simple to implement when incremental backup is supported.

Cons: Requires the database to expose incremental backup APIs; cannot achieve RPO = 0; long RTO due to full/partial restore; high storage cost for backup files.
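
To see why this method cannot reach RPO = 0 and tends toward a long RTO, consider the following sketch of choosing a restore chain. The `Backup` record and both helpers are hypothetical and stand in for whatever catalog a real backup tool maintains; times are plain minutes to keep the arithmetic visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    taken_at: int  # minutes since an arbitrary starting point
    kind: str      # "full" or "incremental"


def restore_chain(backups: list[Backup], failure_at: int) -> list[Backup]:
    """Latest full backup before the failure, plus every later incremental."""
    usable = sorted((b for b in backups if b.taken_at <= failure_at),
                    key=lambda b: b.taken_at)
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup available before the failure")
    base = fulls[-1]
    incrementals = [b for b in usable
                    if b.kind == "incremental" and b.taken_at > base.taken_at]
    return [base] + incrementals


def data_loss_minutes(chain: list[Backup], failure_at: int) -> int:
    """Changes made after the last backup in the chain cannot be recovered."""
    return failure_at - chain[-1].taken_at


catalog = [
    Backup(0, "full"),
    Backup(60, "incremental"),
    Backup(120, "incremental"),
    Backup(180, "full"),  # taken after the failure, so unusable
]
chain = restore_chain(catalog, failure_at=170)
lost = data_loss_minutes(chain, failure_at=170)
```

In this example the standby must replay three backup files in order, and the last 50 minutes of changes are lost. Both numbers shrink with more frequent backups but never reach zero.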

Import‑export sync (Method 1.3)

Pros: Table‑level synchronization via exported files; allows heterogeneous primary and standby engines (e.g., different product versions).

Cons: High coupling with application scheduling and schema design; operationally complex; difficult to automate.
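
A table‑level export and import can be sketched with Python's standard `sqlite3` and `csv` modules standing in for the two clusters. The table name, columns, and helper functions are made up for illustration; a real MPP engine would use its own bulk unload and load utilities.

```python
import csv
import io
import sqlite3


def export_table(conn: sqlite3.Connection, table: str) -> str:
    """Dump one table to CSV text, header row first."""
    cursor = conn.execute(f"SELECT * FROM {table}")  # table name is trusted in this sketch
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()


def import_table(conn: sqlite3.Connection, table: str, csv_text: str) -> None:
    """Replace the table contents on the standby with the exported rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(header)
    placeholders = ", ".join("?" for _ in header)
    with conn:  # one transaction: the old rows vanish only if the load succeeds
        conn.execute(f"DELETE FROM {table}")
        conn.executemany(
            f"INSERT INTO {table} ({columns}) VALUES ({placeholders})", reader
        )


primary = sqlite3.connect(":memory:")
standby = sqlite3.connect(":memory:")
for conn in (primary, standby):
    conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
primary.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)", [(1, "east", 10.5), (2, "west", 7.0)]
)

import_table(standby, "sales", export_table(primary, "sales"))
standby_rows = standby.execute("SELECT * FROM sales ORDER BY id").fetchall()
```

The sketch also hints at the coupling problem: the import relies on the standby schema to turn CSV text back into numbers, and a plain CSV file cannot distinguish NULL from an empty string without an extra convention.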

Active‑Active DR Modes

ETL dual‑processing mode (Mode 2.1)

Pros: Independent of native DB DR capabilities; application controls two parallel ETL pipelines.

Cons: Duplicates CPU, memory, and I/O consumption on both clusters; nondeterministic SQL constructs (e.g., NOW(), RANDOM(), or ROW_NUMBER() over a non‑unique ordering) can cause the two clusters' data to diverge. Users must rewrite such calls so that both pipelines produce identical results.
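
One common way to remove this kind of nondeterminism is to compute the volatile value once in the scheduler and pass it to both pipelines as a parameter. The sketch below assumes two in‑memory SQLite databases as stand‑ins for the clusters; the `fact_orders` table, `load_batch`, and `surrogate_key` are hypothetical names.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone


def surrogate_key(order_id: str) -> str:
    """Deterministic replacement for a RANDOM()-based key: same input, same key."""
    return hashlib.sha256(order_id.encode("utf-8")).hexdigest()[:16]


def load_batch(conn: sqlite3.Connection, orders: list[tuple[str, float]], batch_ts: str) -> None:
    """Load one batch; batch_ts comes from the scheduler, never from the cluster clock."""
    with conn:
        conn.executemany(
            "INSERT INTO fact_orders (order_key, order_id, amount, loaded_at) "
            "VALUES (?, ?, ?, ?)",
            [(surrogate_key(oid), oid, amount, batch_ts) for oid, amount in orders],
        )


def table_rows(conn: sqlite3.Connection) -> list[tuple]:
    return conn.execute("SELECT * FROM fact_orders ORDER BY order_id").fetchall()


clusters = [sqlite3.connect(":memory:") for _ in range(2)]
for conn in clusters:
    conn.execute(
        "CREATE TABLE fact_orders (order_key TEXT, order_id TEXT, amount REAL, loaded_at TEXT)"
    )

batch_ts = datetime.now(timezone.utc).isoformat()  # evaluated once, shared by both loads
orders = [("A-100", 19.9), ("A-101", 5.0)]
for conn in clusters:
    load_batch(conn, orders, batch_ts)

rows_a, rows_b = (table_rows(conn) for conn in clusters)
```

Had each load called the database's own clock or random generator, `loaded_at` and `order_key` would differ between the clusters even though the input batch was identical.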

Product‑provided middleware scheduling (Mode 2.2)

Pros: Transparent to applications; the middleware handles task dispatch, validation, and synchronization using the underlying transaction‑log/data‑block sync.

Cons: Middleware becomes a single point of failure; its own HA must be engineered.
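
The validation step such a middleware performs might look like the sketch below, which fingerprints each table on both clusters and reports the ones that need resynchronization. This is an assumption about one plausible approach, not a description of any vendor's middleware; SQLite stands in for the clusters and all names are hypothetical.

```python
import hashlib
import sqlite3


def table_fingerprint(conn: sqlite3.Connection, table: str, key: str) -> tuple[int, str]:
    """Row count plus a hash over all rows in key order."""
    digest = hashlib.sha256()
    count = 0
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key}"):
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()


def diverged_tables(primary: sqlite3.Connection, standby: sqlite3.Connection,
                    tables: dict[str, str]) -> list[str]:
    """Tables (name -> ordering key) whose fingerprints differ between the clusters."""
    return [
        table for table, key in tables.items()
        if table_fingerprint(primary, table, key) != table_fingerprint(standby, table, key)
    ]


primary = sqlite3.connect(":memory:")
standby = sqlite3.connect(":memory:")
for conn in (primary, standby):
    conn.execute("CREATE TABLE dim_region (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE fact_sales (id INTEGER, amount REAL)")
    conn.execute("INSERT INTO dim_region VALUES (1, 'east')")
    conn.execute("INSERT INTO fact_sales VALUES (1, 10.0)")
primary.execute("INSERT INTO fact_sales VALUES (2, 4.5)")  # not yet replicated to the standby

needs_resync = diverged_tables(primary, standby, {"dim_region": "id", "fact_sales": "id"})
```

Scanning full tables is expensive on a large warehouse, so a production validator would more likely compare per‑partition or per‑batch fingerprints.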

Application‑provided middleware scheduling (Mode 2.3)

Pros: Application‑level control enables real‑time or near‑real‑time consistency; primary cluster focuses on heavy data processing while standby serves query traffic.

Cons: Increases application complexity and coupling with DR logic.

Strong‑consistency real‑time sync via virtual & mirror clusters (Mode 2.4)

Pros: A unified scheduling cluster presents a single logical endpoint; mirror clusters maintain exact copies, enabling true real‑time active‑active across same‑city data centers.

Cons: Requires high‑speed (10 GbE) inter‑cluster networking; unsuitable for wide‑area deployments.
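
One way to picture the strong‑consistency requirement is a single endpoint that acknowledges a write only after every mirror has applied it. The sketch below is a simplified model under that assumption, with SQLite connections as mirrors and a hypothetical `VirtualCluster` class; it is not the actual mechanism of any product, and a real system would need a proper two‑phase commit because the final commit loop here is not itself atomic.

```python
import sqlite3


class VirtualCluster:
    """Single logical endpoint that applies each write to all mirrors or to none."""

    def __init__(self, mirrors: list[sqlite3.Connection]):
        self.mirrors = mirrors

    def write(self, sql: str, params: tuple = ()) -> None:
        try:
            for conn in self.mirrors:
                conn.execute(sql, params)  # applied inside an open transaction
        except sqlite3.Error:
            for conn in self.mirrors:
                conn.rollback()            # undo on every mirror, including the ones that succeeded
            raise
        for conn in self.mirrors:
            conn.commit()                  # acknowledge only after all mirrors applied the write

    def read(self, sql: str, params: tuple = ()) -> list[tuple]:
        return self.mirrors[0].execute(sql, params).fetchall()


mirror_a = sqlite3.connect(":memory:")
mirror_b = sqlite3.connect(":memory:")
for conn in (mirror_a, mirror_b):
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")

cluster = VirtualCluster([mirror_a, mirror_b])
cluster.write("INSERT INTO t VALUES (?)", (1,))

# Simulate a fault: mirror_b already holds id 2, so the mirrored write fails there.
mirror_b.execute("INSERT INTO t VALUES (2)")
mirror_b.commit()
try:
    cluster.write("INSERT INTO t VALUES (?)", (2,))
    write_failed = False
except sqlite3.IntegrityError:
    write_failed = True

rows_a = mirror_a.execute("SELECT id FROM t ORDER BY id").fetchall()
```

Because every write waits for all mirrors, each round trip between data centers adds directly to write latency, which is consistent with the networking requirement listed above.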

Implementation Highlights

For most OLAP MPP DR scenarios, the transaction‑log & data‑block synchronization (Method 1.1) offers the best trade‑off between latency and reliability. Selecting an active‑active mode depends on:

Network bandwidth and latency (especially for Mode 2.4).

Resource availability on primary and standby clusters (CPU, memory, I/O).

Tolerance for middleware complexity and potential single points of failure.

Need for deterministic query results (avoid nondeterministic functions).

The original article includes representative diagrams that are not reproduced here.
