MySQL Disaster Recovery: Two-Region Three-Center Replication and RTO/RPO Optimization
This article explains the principles of disaster recovery for MySQL, covering RTO/RPO metrics, China's national disaster-backup level standard, common master-slave topologies, a comparative analysis of high-availability solutions, and a detailed two-region three-center replication design with a code patch that avoids replication loops.
1 Background Introduction
In the era of cloud-network convergence and big data, data has become a critical production factor. Following high-profile data-security incidents such as the PRISM revelations, the ransomware attacks built on EternalBlue, and the data losses caused by the Wenchuan earthquake, China introduced laws such as the 2016 Cybersecurity Law and the 2021 Data Security Law to regulate data protection.
When disasters occur, disaster‑recovery backup ensures data is not lost. Achieving application disaster recovery relies on real‑time database synchronization and replication so that a failure at site A can be smoothly switched to site B, meeting business continuity requirements despite some latency.
2 Overview of Disaster Recovery
Definition
Disaster recovery ensures that when a data center suffers an unforeseen disaster, data loss is minimized and IT services either continue running or are quickly restored.
Disaster Metrics
Two key reliability indicators are RTO and RPO.
RTO (Recovery Time Objective) is the maximum tolerable downtime from the moment a system outage occurs until the system is restored to support business operations.
RPO (Recovery Point Objective) is the maximum tolerable data loss, i.e., the point in time to which data can be restored after a disaster.
RTO concerns service downtime, while RPO concerns data loss; the two are independent metrics. For example, if backups are taken every 30 minutes, the worst-case RPO is 30 minutes, even if service itself is restored within seconds.
Disaster Levels
Effective since 1 Nov 2007, China's national standard GB/T 20988-2007 defines six disaster-backup capability levels.
| Level | Description |
| --- | --- |
| Level 1 | Basic – off-site backup media, safety assurance, periodic verification. |
| Level 2 | Backup-site support – network and business systems can be deployed to the backup center within a scheduled time. |
| Level 3 | Electronic transmission and partial equipment support – backup center equipped with some business-processing and network devices. |
| Level 4 | Electronic transmission and full equipment support – periodic batch data transfer; network and systems kept ready (warm standby). |
| Level 5 | Real-time data transfer and full equipment support – remote replication; automatic or centralized network switching; business systems ready or running. |
| Level 6 | Zero data loss and remote cluster support – real-time backup with zero loss; remote cluster with automatic switchover; users can access both primary and backup centers. |
Relationship Between Disaster Levels, RTO and RPO
| Disaster Recovery Capability Level | RTO | RPO |
| --- | --- | --- |
| 1 | More than 2 days | 1–7 days |
| 2 | More than 24 hours | 1–7 days |
| 3 | More than 12 hours | Several hours to 1 day |
| 4 | Several hours to 2 days | Several hours to 1 day |
| 5 | Several minutes to 2 days | 0–30 minutes |
| 6 | Several minutes | 0 |
Two‑Region Three‑Center Disaster Recovery
The "two-region three-center" model combines local high availability, a same-city disaster-recovery center, and a remote disaster-recovery center to improve availability and business continuity. The data centers operate in primary-backup relationships, so the response and switchover strategy can be matched to the severity of the incident.
3 MySQL Common Master‑Slave Forms
MySQL's built-in master-slave replication handles data consistency, checkpointing of replication positions, and reliable network transmission, enabling high-availability switchover and read/write splitting.
One Master One Slave
This topology provides a backup node: if the primary fails, the slave can be promoted. Note that under asynchronous replication, transactions committed on the primary but not yet shipped to the slave can still be lost.
One Master Multiple Slaves
This common architecture is simple, provides HA, and enables read/write separation to improve concurrency.
Multiple Masters One Slave
Multiple masters replicate to a single high‑performance storage server for unified analysis.
Dual‑Master Replication
Each master acts as a slave of the other, allowing seamless failover with second‑level switching.
Cascading Replication
In cascading mode, some slaves replicate from other slaves, reducing load on the primary while preserving data consistency.
4 Two‑Region Three‑Center MySQL Master‑Slave Replication
Common High‑Availability Solutions – Advantages and Disadvantages
| HA Solution | Advantages | Disadvantages |
| --- | --- | --- |
| Master-slave + Keepalived | Simple to deploy; no primary election needed after a failure. | After failover, the remaining slaves must be reconfigured to point at the new master. |
| MHA | Supports one master with multiple slaves; works to avoid data inconsistency when the primary crashes. | Relies on SSH, which carries security risks; no longer actively maintained. |
| Group Replication (MGR) | Strong data consistency with low replication lag. | Strong dependency on the network; works only in GTID mode; large transactions and DDL may block the cluster. |
| MySQL InnoDB Cluster | Adds the automated failover that plain Group Replication lacks. | Many components; relatively few mature production cases. |
| Orchestrator | Supports one master with multiple slaves; avoids a management-node single point of failure; offers both a CLI and a web UI. | Feature-rich and therefore complex; integration takes effort. |
MySQL Master‑Slave Initialization Messages
During channel establishment, the slave runs checks on the network connection, authentication, instance uniqueness, clocks, character sets, and binlog configuration. In particular, the slave fetches the master's server_id to verify that the two instances are distinct.
MySQL Binlog Structure
Replication relies on binlog files composed of binlog events, each consisting of a header, a data section, and a footer. The header records the server_id of the instance that originally generated the event, which is used to identify the event's source.
Two‑Region Three‑Center MySQL Replication Scheme 1
This scheme builds native MySQL master-master replication within each data center, so failover amounts to toggling read_only. It avoids third-party components, offering high maturity and scalability, but replication loops can arise when multiple master-master links are chained together.
Two‑Region Three‑Center MySQL Replication Scheme 2
To eliminate loops, the boundary MySQL instance checks the remote master's server_id and replicates only the binlog events generated by that master, discarding cascaded events relayed from other masters.
Boundary Node MySQL Replication Logic Patch
The patch, based on MySQL 5.7.40 Community Edition, modifies sys_vars.cc to add a replicate_server_mode system variable (default 0). When set to 1, a replication channel applies only the binlog events generated by that channel's remote master.
It also updates log_event.cc, altering Log_event::do_shall_skip to compare each event's server_id with the remote master's server_id; mismatched events are skipped, preventing loop-induced data conflicts.
5 Summary
The proposed data synchronization scheme enhances MySQL's native log replication with multi-channel master-master replication, reducing the complexity of disaster-recovery drills and cut-overs. Each channel syncs only the binlog events generated by its nearest master, which solves the loop problem and supports two-region three-center disaster recovery without third-party HA components, while keeping the code change under 100 lines and confining the risk to boundary nodes. The approach lowers failover time from 30 minutes to 5 minutes (under 2 minutes with automation), raises the recovery success rate above 98%, and can be extended to higher-level disaster-recovery scenarios.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.