Understanding MySQL Semi‑Synchronous Replication and Master Switch Challenges
This article examines MySQL's semi‑synchronous replication, analyzes consistency scenarios across master‑slave failures and switches, and highlights the difficulties of data rollback and multi‑master conflicts that affect high‑availability clusters.
MySQL Overview
MySQL is a relational database management system (RDBMS) developed by MySQL AB and now owned by Oracle. Its small size, high speed, low cost, and open‑source nature have made it popular among major internet companies such as Tencent, Alibaba, Baidu, Google, and Facebook.
Importance of Data Disaster Recovery
With the rapid growth of the internet, service availability and data disaster recovery have become critical. In disaster recovery, ensuring data consistency across database clusters is a key challenge, especially for financial services that rely on MySQL as a core database.
Evolution of MySQL Replication
MySQL has progressed from asynchronous replication to Google‑developed semi‑synchronous replication, and finally to the lossless semi‑synchronous replication introduced in MySQL 5.7, which aims to improve cluster consistency.
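The behavioural difference between the pre-5.7 and 5.7 lossless modes is visible directly in the stock semi-synchronous plugin's configuration. As a rough illustration (the variable names are from the standard plugin; the layout is an example, not a tuning recommendation):

```ini
# my.cnf fragment (illustrative)
[mysqld]
plugin-load = "rpl_semi_sync_master=semisync_master.so;rpl_semi_sync_slave=semisync_slave.so"
rpl_semi_sync_master_enabled = 1
# MySQL 5.7 "lossless" mode: the master waits for the slave ACK
# *before* the storage-engine commit (AFTER_SYNC). The pre-5.7
# behaviour corresponds to AFTER_COMMIT, where the engine commit
# happens first and the ACK is awaited afterwards.
rpl_semi_sync_master_wait_point = AFTER_SYNC
```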
Remaining Consistency Issues
Despite these improvements, consistency problems persist, leading companies to create custom patches such as Tencent's TDSQL, PhxSQL, Alibaba's AliSQL, and NetEase's InnoSQL. Although MySQL 5.7 claims “zero loss,” it does not fully resolve all consistency concerns.
MySQL Semi‑Synchronous Replication Issues
Figure 1 illustrates the binlog semi‑synchronous process. After the master sends the binlog to the slave, it must wait for the slave’s ACK before executing Engine Commit to persist data.
When MySQL first starts, before any semi‑synchronous slave has connected, the Wait ACK step is skipped and Engine Commit runs directly, which can lead to inconsistency.
Consistency Analysis Scenarios
The following analysis assumes that semi‑synchronous replication does not fall back to asynchronous mode.
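In the stock plugin, semi‑synchronous replication falls back to asynchronous mode once no ACK arrives within `rpl_semi_sync_master_timeout` milliseconds, so the no‑fallback assumption roughly corresponds to making that timeout effectively infinite. A minimal sketch:

```sql
-- Illustrative only: approximate "never fall back to async" by setting
-- the ACK timeout (in milliseconds) to a very large value.
SET GLOBAL rpl_semi_sync_master_timeout = 100000000;

-- Check whether semi-sync is still active (i.e. no fallback occurred):
SHOW STATUS LIKE 'Rpl_semi_sync_master_status';
```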
Scenario 1 – Normal Master Operation
Master replicates data to the slave, and both remain consistent.
Scenario 2 – Master Crash Without Switch
2.1 Master Received ACK and Executed Engine Commit
Data has already been replicated to at least one slave, so consistency is maintained.
2.2 Master Crashed During Wait ACK (Pending Binlog)
After restart, the master executes Engine Commit and re‑replicates the binlog to the slave. In MySQL 5.7 this restores consistency; in MySQL 5.6 and earlier it does not.
Scenario 3 – Master Crash With Switch to New Master
3.1 Old Master Had ACK from at Least One Slave
Data is already present on a slave, keeping consistency.
3.2 Old Master Crashed During Wait ACK and a New Master Is Chosen
3.2.1 Binlog Send Failed (No Slave Received It)
When the old master restarts, it commits the pending binlog, causing data divergence between the old and new masters. Rolling back the pending binlog is required to restore consistency.
3.2.2 Binlog Sent Successfully but Not Yet Committed
In this case, the pending binlog is committed after restart, and the data on the old master becomes a subset of the new master’s data, allowing the old master to pull the latest data from the new master.
When both the old master and a slave fail simultaneously, a master switch can lead to data loss.
Cluster Size Considerations
For small clusters (≤ 3 nodes), the failure of two nodes leaves too few survivors to satisfy the semi‑synchronous ACK requirement, making the cluster unusable. For larger clusters (> 3 nodes), the inability to determine which surviving node holds the latest data also prevents service continuity.
Increasing the number of required ACKs mitigates data loss but introduces rollback complexity.
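MySQL 5.7 exposes the required ACK count as a server variable, so this durability/availability trade‑off can be adjusted directly. A sketch, assuming a 5.7+ master:

```sql
-- Require ACKs from two slaves instead of the default one before a
-- transaction is considered safely replicated (MySQL 5.7+ variable).
SET GLOBAL rpl_semi_sync_master_wait_for_slave_count = 2;
```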
Master Switch Procedure and Challenges
1. Pause the old master.
2. Start the new master.
3. Repoint MySQL client connections to the new master's IP.
Key problems include handling an isolated old master, locating the node with the latest binlog, performing data rollback, and updating client connection strings. In practice, updating all clients at exactly the same moment is impossible, so there is a window during which both the old and new masters receive writes, which can cause inconsistency in a semi‑synchronous setup.
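Locating the surviving node with the latest binlog amounts to comparing the (file, position) pairs each candidate reports (e.g. via SHOW MASTER STATUS). The helper below is a hypothetical sketch; it assumes the standard fixed-width `mysql-bin.NNNNNN` naming, so comparing file names as strings agrees with their numeric order:

```python
def latest_binlog_node(candidates):
    """Pick the node whose (binlog file, position) pair is most advanced.

    `candidates` maps a node name to a (file, position) tuple. Because
    binlog files carry a fixed-width numeric suffix (mysql-bin.000042),
    tuples can be compared lexicographically: later file first, then
    the byte offset within that file.
    """
    return max(candidates, key=lambda node: candidates[node])


positions = {
    "slave-1": ("mysql-bin.000012", 120),
    "slave-2": ("mysql-bin.000012", 4500),
    "slave-3": ("mysql-bin.000011", 9999),
}
print(latest_binlog_node(positions))  # slave-2 holds the newest binlog
```

Real failover tooling must also handle ties, unreachable nodes, and GTID-based comparison, which this sketch deliberately omits.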
Summary
MySQL's semi‑synchronous replication and master‑switch processes both suffer from data rollback difficulties and multi‑master conflicts. Resolving these two major issues is essential for guaranteeing data consistency in MySQL clusters.
WeChat Backend Team
Official account of the WeChat backend development team, sharing their experience in large-scale distributed system development.
