
Understanding MySQL Semi‑Synchronous Replication and Master Switch Challenges

This article examines MySQL's semi‑synchronous replication, analyzes consistency scenarios across master‑slave failures and switches, and highlights the difficulties of data rollback and multi‑master conflicts that affect high‑availability clusters.

WeChat Backend Team

MySQL Overview

MySQL is a relational database management system (RDBMS) developed by MySQL AB and now owned by Oracle. Its small size, high speed, low cost, and open‑source nature have made it popular among major internet companies such as Tencent, Alibaba, Baidu, Google, and Facebook.

Importance of Data Disaster Recovery

With the rapid growth of the internet, service availability and data disaster recovery have become critical. In disaster recovery, ensuring data consistency across database clusters is a key challenge, especially for financial services that rely on MySQL as a core database.

Evolution of MySQL Replication

MySQL replication has progressed from asynchronous replication, to the semi‑synchronous replication originally contributed as a Google patch and shipped as a plugin since MySQL 5.5, and finally to the lossless semi‑synchronous replication introduced in MySQL 5.7, which aims to improve cluster consistency.
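
To make the mechanism concrete, the statements below sketch how semi‑synchronous replication is typically enabled and how the MySQL 5.7 lossless wait point is selected. This configuration is illustrative and not taken from the original article.

    -- On the master: load the semi-synchronous plugin and switch it on.
    INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
    SET GLOBAL rpl_semi_sync_master_enabled = 1;

    -- MySQL 5.7 "lossless" mode: wait for the slave ACK after the binlog is written
    -- but before the storage-engine commit. AFTER_COMMIT gives the older behaviour
    -- of committing first and waiting for the ACK afterwards.
    SET GLOBAL rpl_semi_sync_master_wait_point = 'AFTER_SYNC';

    -- On each slave: load and enable the slave-side plugin.
    INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
    SET GLOBAL rpl_semi_sync_slave_enabled = 1;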

Remaining Consistency Issues

Despite these improvements, consistency problems persist, leading companies to maintain their own patches and branches, such as Tencent's TDSQL and PhxSQL, Alibaba's AliSQL, and NetEase's InnoSQL. Although MySQL 5.7 claims “zero loss,” it does not fully resolve all consistency concerns.

MySQL Semi‑Synchronous Replication Issues

Figure 1: MySQL semi‑synchronous flow

Figure 1 illustrates the binlog semi‑synchronous process. After the master sends the binlog to the slave, it must wait for the slave’s ACK before executing Engine Commit to persist data.

When MySQL restarts after a crash, however, crash recovery commits any transaction whose binlog has already been written without going back through the Wait ACK step; Engine Commit runs directly, which can lead to inconsistency.

Consistency Analysis Scenarios

The following analysis assumes that semi‑synchronous replication does not fall back to asynchronous mode.
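
By default, stock MySQL does fall back: when no ACK arrives within rpl_semi_sync_master_timeout (10,000 ms by default), the master silently degrades to asynchronous replication. One common way to approximate the "no fallback" assumption is to raise the timeout to an effectively infinite value, as sketched below.

    -- Keep the master waiting for a slave ACK indefinitely instead of degrading
    -- to asynchronous replication (value is in milliseconds).
    SET GLOBAL rpl_semi_sync_master_timeout = 100000000000;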

Scenario 1 – Normal Master Operation

Master replicates data to the slave, and both remain consistent.

Scenario 2 – Master Crash Without Switch

2.1 Master Received ACK and Executed Engine Commit

Data has already been replicated to at least one slave, so consistency is maintained.

2.2 Master Crashed During Wait ACK (Pending Binlog)

After restart, the master executes Engine Commit and re‑replicates the binlog to the slave. In MySQL 5.7 this restores consistency; in MySQL 5.6 and earlier it does not.

Figure 2: Engine Commit after master restart

Scenario 3 – Master Crash With Switch to New Master

3.1 Old Master Had ACK from at Least One Slave

Data is already present on a slave, keeping consistency.

3.2 Old Master Crashed During Wait ACK and a New Master Is Chosen

3.2.1 Binlog Send Failed (No Slave Received It)

Figure 3: Inconsistent data after restart
Figure 4: Retry transaction X

Here the binlog for transaction X never reached any slave. When the old master restarts, crash recovery commits the pending binlog locally, while the client, which never received an acknowledgement, may retry transaction X against the new master (Figure 4). The data on the old and new masters now diverges, and the pending transaction on the old master must be rolled back before consistency can be restored.
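
When GTIDs are enabled, one way to see what has to be rolled back is to compare executed GTID sets. The statements below are only a sketch; the placeholder strings stand for the gtid_executed values captured on the old and new masters.

    -- On each server, capture what has actually been committed.
    SELECT @@GLOBAL.gtid_executed;

    -- Transactions the old master committed that the new master never received;
    -- these are the candidates that must be rolled back (or flashed back) before
    -- the old master can rejoin the cluster.
    SELECT GTID_SUBTRACT('<old_master_gtid_executed>',
                         '<new_master_gtid_executed>') AS must_roll_back;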

3.2.2 Binlog Sent Successfully but Not Yet Committed

Figure 6: Immediate Engine Commit after restart

In this case, the pending binlog is committed after restart, and the data on the old master becomes a subset of the new master’s data, allowing the old master to pull the latest data from the new master.
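
Assuming the old master's data really is a subset, reattaching it is straightforward. A minimal GTID-based sketch follows; the host name and credentials are placeholders.

    -- Run on the restarted old master to make it replicate from the new master.
    CHANGE MASTER TO
        MASTER_HOST = 'new-master.example.com',
        MASTER_PORT = 3306,
        MASTER_USER = 'repl',
        MASTER_PASSWORD = '***',
        MASTER_AUTO_POSITION = 1;   -- let GTID auto-positioning find the gap
    START SLAVE;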

Figure 7: Data loss when two machines fail

If the old master and the slave that received the binlog fail at the same time, none of the surviving nodes holds the pending transaction, so a master switch can lead to data loss.

Cluster Size Considerations

For small clusters (≤ 3 nodes), the failure of two nodes leaves semi‑synchronous replication unable to operate. For larger clusters (> 3 nodes), there is no reliable way to tell which surviving node holds the latest data, which likewise prevents service continuity.
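
With GTIDs enabled, the surviving nodes can at least be ranked against one another, as in the illustrative comparison below; note that this only identifies the most advanced survivor and cannot prove that a failed node did not hold newer data.

    -- Returns 1 if survivor A's executed transactions are all contained in
    -- survivor B's set, i.e. B is at least as up to date as A. Repeat pairwise
    -- across the surviving nodes.
    SELECT GTID_SUBSET('<survivor_a_gtid_executed>',
                       '<survivor_b_gtid_executed>') AS b_contains_a;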

Increasing the number of required ACKs mitigates data loss but introduces rollback complexity.

Figure 8: ACK count impact
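
In MySQL 5.7 the required ACK count is controlled by a single variable. The value below is purely illustrative, not a recommendation from the original article.

    -- Require ACKs from two slaves instead of one before a transaction is treated
    -- as safely replicated. Higher values reduce the chance of loss but require
    -- more machines to survive together.
    SET GLOBAL rpl_semi_sync_master_wait_for_slave_count = 2;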

Master Switch Procedure and Challenges

A typical switch involves three steps:

1. Pause the old master.
2. Start the new master.
3. Repoint MySQL client connections to the new master's IP.

Key problems include handling an isolated old master, locating the node with the latest binlog, performing data rollback, and updating every client's connection string. In practice, all clients cannot be repointed at exactly the same instant, so there is a window during which both the old and new masters accept writes, which can cause inconsistency in a semi‑synchronous setup.

Figure 10: Multi‑master write conflict
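
As a rough illustration of how the old master can be fenced off before clients are repointed, the statements below sketch the first two steps; host-level details are placeholders, and a real switch also needs the rollback and catch-up work discussed above.

    -- Step 1, on the old master (if still reachable): stop accepting writes.
    SET GLOBAL read_only = ON;
    SET GLOBAL super_read_only = ON;   -- MySQL 5.7+: also blocks users with SUPER

    -- Step 2, on the chosen new master: stop replicating from the old master
    -- (after verifying its relay log has been fully applied) and open it for writes.
    STOP SLAVE;
    RESET SLAVE ALL;
    SET GLOBAL read_only = OFF;

    -- Step 3 is outside MySQL: repoint client connection strings, VIPs, or proxies
    -- to the new master, which cannot happen on every client at the same instant.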

Summary

MySQL's semi‑synchronous replication and master‑switch processes both suffer from data rollback difficulties and multi‑master conflicts. Resolving these two major issues is essential for guaranteeing data consistency in MySQL clusters.


Tags: high availability, data consistency, MySQL, disaster recovery, semi‑synchronous replication, master switch
Written by the WeChat Backend Team, the official account of the WeChat backend development team, sharing their experience in large-scale distributed system development.