MySQL Disaster Recovery: Two-Region Three-Center Replication and RTO/RPO Optimization
This article explains the principles of disaster recovery for MySQL, covering RTO/RPO metrics, China's national disaster-backup level standard, common master-slave topologies, a comparative analysis of high-availability solutions, and a detailed two-region three-center replication design with a code patch that avoids replication loops.
1 Background Introduction
In the era of cloud-network convergence and big data, data has become a critical production factor. Following high-profile data-security incidents such as the PRISM revelations, the ransomware attacks built on EternalBlue, and the data losses caused by the Wenchuan earthquake, China introduced laws such as the 2016 Cybersecurity Law and the 2021 Data Security Law to regulate data protection.
When disasters occur, disaster‑recovery backup ensures data is not lost. Achieving application disaster recovery relies on real‑time database synchronization and replication so that a failure at site A can be smoothly switched to site B, meeting business continuity requirements despite some latency.
2 Overview of Disaster Recovery
Definition
Disaster recovery ensures that when a data center suffers an unforeseen disaster, data loss is minimized and IT services either continue running or are quickly restored.
Disaster Metrics
Two key reliability indicators are RTO and RPO.
RTO (Recovery Time Objective) is the maximum tolerable downtime from the moment a system outage occurs until the system is restored to support business operations.
RPO (Recovery Point Objective) is the maximum tolerable data loss, i.e., the point in time to which data can be restored after a disaster.
RTO concerns service downtime, while RPO concerns data loss; the two are independent metrics. For example, if backups are taken every 30 minutes, the worst-case RPO is 30 minutes, even if service itself is restored within seconds.
Disaster Levels
Effective since 1 Nov 2007, China's national standard GB/T 20988-2007 defines six disaster-backup capability levels.
| Level | Description |
| --- | --- |
| Level 1 | Basic – off-site backup media, safety assurance, periodic verification. |
| Level 2 | Backup-site support – network and business systems can be deployed to the backup center within a scheduled time. |
| Level 3 | Electronic transmission and partial equipment support – backup center equipped with some business-processing and network devices. |
| Level 4 | Electronic transmission and full equipment support – periodic batch data transfer; network and systems kept ready (warm standby). |
| Level 5 | Real-time data transfer and full equipment support – remote replication; automatic or centralized network switching; business systems ready or running. |
| Level 6 | Zero data loss and remote cluster support – real-time backup with zero loss; remote cluster with automatic switchover; users can access both primary and backup centers. |
Relationship Between Disaster Levels, RTO and RPO
| Disaster Recovery Capability Level | RTO | RPO |
| --- | --- | --- |
| 1 | More than 2 days | 1–7 days |
| 2 | More than 24 hours | 1–7 days |
| 3 | More than 12 hours | Several hours to 1 day |
| 4 | Several hours to 2 days | Several hours to 1 day |
| 5 | Several minutes to 2 days | 0–30 minutes |
| 6 | Several minutes | 0 |
Two‑Region Three‑Center Disaster Recovery
The "two-region three-center" model combines local high availability, a same-city disaster-recovery center, and a remote disaster-recovery center to improve availability and business continuity. The data centers operate in primary-backup relationships, so the response and switchover strategy can be matched to the severity of the incident.
3 MySQL Common Master‑Slave Forms
MySQL's built-in master-slave replication handles data consistency, checkpointing of replication positions, and reliable network transmission, enabling high-availability switchover and read/write splitting.
One Master One Slave
This topology provides a backup node: if the primary fails, the slave can be promoted. Note that under asynchronous replication, transactions committed on the primary but not yet shipped to the slave can still be lost.
One Master Multiple Slaves
This common architecture is simple, provides HA, and enables read/write separation to improve concurrency.
Multiple Masters One Slave
Multiple masters replicate to a single high‑performance storage server for unified analysis.
Dual‑Master Replication
Each master acts as a slave of the other, allowing seamless failover with second‑level switching.
Cascading Replication
In cascading mode, some slaves replicate from other slaves, reducing load on the primary while preserving data consistency.
4 Two‑Region Three‑Center MySQL Master‑Slave Replication
Common High‑Availability Solutions – Advantages and Disadvantages
| HA Solution | Advantages | Disadvantages |
| --- | --- | --- |
| Master-slave + Keepalived | Simple to deploy; no primary election needed after a failure. | After failover, the remaining slaves must be reconfigured to point at the new master. |
| MHA | Supports one master with multiple slaves; works to avoid data inconsistency when the primary crashes. | Relies on SSH, which carries security risks; no longer actively maintained. |
| Group Replication (MGR) | Strong data consistency with low replication lag. | Strong dependency on the network; works only in GTID mode; large transactions and DDL may block the cluster. |
| MySQL InnoDB Cluster | Adds the automated failover that plain Group Replication lacks. | Many components; relatively few mature production cases. |
| Orchestrator | Supports one master with multiple slaves; avoids a management-node single point of failure; offers both a CLI and a web UI. | Feature-rich and therefore complex; integration takes effort. |
MySQL Master‑Slave Initialization Messages
During channel establishment, the slave runs checks on the network connection, authentication, instance uniqueness, clocks, character sets, and binlog configuration. In particular, the slave fetches the master's server_id to verify that the two instances are distinct.
MySQL Binlog Structure
Replication relies on binlog files composed of binlog events, each consisting of a header, a data section, and a footer. The header records the server_id of the instance that originally generated the event, which is used to identify the event's source.
Two‑Region Three‑Center MySQL Replication Scheme 1
This scheme builds native MySQL master-master replication within each data center, so failover amounts to toggling read_only. It avoids third-party components, offering high maturity and scalability, but replication loops can arise when multiple master-master links are chained together.
Two‑Region Three‑Center MySQL Replication Scheme 2
To eliminate loops, the boundary MySQL instance checks the remote master's server_id and replicates only the binlog events generated by that master, discarding cascaded events relayed from other masters.
Boundary Node MySQL Replication Logic Patch
The patch, based on MySQL 5.7.40 Community Edition, modifies sys_vars.cc to add a replicate_server_mode system variable (default 0). When set to 1, a replication channel applies only the binlog events generated by that channel's remote master.
It also updates log_event.cc, altering Log_event::do_shall_skip to compare each event's server_id with the remote master's server_id; mismatched events are skipped, preventing loop-induced data conflicts.
5 Summary
The proposed data synchronization scheme enhances MySQL's native log replication with multi-channel master-master replication, reducing the complexity of disaster-recovery drills and cut-overs. Each channel syncs only the binlog events generated by its nearest master, which solves the loop problem and supports two-region three-center disaster recovery without third-party HA components, while keeping the code change under 100 lines and confining the risk to boundary nodes. The approach lowers failover time from 30 minutes to 5 minutes (under 2 minutes with automation), raises the recovery success rate above 98%, and can be extended to higher-level disaster-recovery scenarios.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.