Understanding Data Consistency in MySQL Semi‑Synchronous Replication and HA Failover
This article explains the principles of MySQL semi‑synchronous replication, analyzes how data consistency is maintained during high‑availability failover, presents a detailed, step‑by‑step transaction flow, discusses scenarios that cause GTID divergence, and offers testing methods and remediation techniques for DBA practitioners.
Understanding MySQL Semi‑Synchronous Replication
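MySQL 5.7 does not enable semi‑synchronous replication out of the box: the semi‑sync plugins must be installed and switched on. Once enabled, 5.7 defaults to the lossless mode (rpl_semi_sync_master_wait_point=AFTER_SYNC). During a transaction commit, the master writes the binlog and waits for an ACK from at least one slave before proceeding. If no ACK arrives within the timeout (rpl_semi_sync_master_timeout), the master falls back to asynchronous replication.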
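The wait‑then‑fall‑back behavior described above can be sketched as a toy state machine. This is an illustrative model only, not MySQL source code: the class ToyMaster and its methods are hypothetical names, and the only real identifiers referenced are the rpl_semi_sync_master_timeout variable and the Rpl_semi_sync_master_status status flag.

```python
# Toy model of the commit-time ACK wait; not MySQL source code.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyMaster:
    timeout_ms: int = 10000     # mirrors rpl_semi_sync_master_timeout (milliseconds)
    semi_sync_on: bool = True   # mirrors Rpl_semi_sync_master_status

    def commit(self, ack_latency_ms: Optional[int]) -> str:
        """Return how the commit completed; None means no ACK ever arrives."""
        if not self.semi_sync_on:
            return "async"                  # already degraded: do not wait for an ACK
        if ack_latency_ms is not None and ack_latency_ms <= self.timeout_ms:
            return "semi-sync"              # a slave acknowledged in time
        self.semi_sync_on = False           # timed out: fall back to asynchronous mode
        return "async-after-timeout"

    def slave_caught_up(self) -> None:
        """Semi-sync resumes once a semi-sync slave has caught up again."""
        self.semi_sync_on = True


master = ToyMaster(timeout_ms=1000)
results = [
    master.commit(5),      # ACK arrives after 5 ms
    master.commit(None),   # slave unreachable, the wait times out
    master.commit(5),      # still degraded, commits without waiting
]
master.slave_caught_up()
results.append(master.commit(5))
print(results)
# ['semi-sync', 'async-after-timeout', 'async', 'semi-sync']
```

The point of the model is that a commit acknowledged in asynchronous mode carries no guarantee that any slave has received it, which is why the timeout value matters for failover safety.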
Configuration for Reliable Semi‑Sync
sync_binlog=1
innodb_flush_log_at_trx_commit=1
...(etc.)
The author asserts that these settings provide the most reliable semi‑sync configuration.
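As a rough aid, the two durability settings named above can be checked in an option file with a few lines of Python. This is a minimal sketch under stated assumptions: the function check_durability and the sample text are hypothetical, the file is assumed to be simple enough for configparser to read, and !include directives are not followed. The settings elided by "...(etc.)" are not covered.

```python
import configparser

# The two settings listed in the article; other settings are elided there.
REQUIRED = {"sync_binlog": "1", "innodb_flush_log_at_trx_commit": "1"}


def check_durability(cnf_text: str, section: str = "mysqld") -> dict:
    """Return {setting: found_value} for every setting that is not explicitly
    set to the required value. None means the setting is absent from this
    text, in which case the server default applies."""
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None
    )
    parser.read_string(cnf_text)
    found = {}
    if parser.has_section(section):
        # Option files accept both dashes and underscores in option names.
        found = {k.replace("-", "_"): v for k, v in parser[section].items()}
    return {k: found.get(k) for k, want in REQUIRED.items() if found.get(k) != want}


sample = """
[mysqld]
sync_binlog = 1
innodb-flush-log-at-trx-commit = 2
"""
problems = check_durability(sample)
print(problems)
# {'innodb_flush_log_at_trx_commit': '2'}
```

A check against the running server (for example with SHOW VARIABLES) is more authoritative than reading the file, since values can be changed at runtime.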
Key Terminology
Terms such as lossless semi‑sync, enhanced semi‑sync, and the parameter rpl_semi_sync_master_wait_point=AFTER_SYNC all refer to the same mode, which avoids data loss after a high‑availability switch.
Potential Inconsistency Scenarios
Two main cases can cause data divergence after a master‑slave switch:
Old master retains more GTIDs than the new master (the typical case discussed).
New master may have more GTIDs if sync_binlog is not set to 1, due to unflushed binlog entries or timing differences.
The article breaks down the replication process into phases A and B, with sub‑phases 2aa and 2ab, illustrating how GTID gaps arise.
Testing Methodology
Set up a one‑master‑one‑slave semi‑sync cluster.
Run sysbench to generate load (up to 800 TPS after tuning).
Kill the master's mysqld process with kill -9 (for example, kill -9 $(pidof mysqld)).
Prevent automatic restart, then compare GTID sets on master and slave.
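The comparison in the last step amounts to subtracting one GTID set from the other in both directions. The sketch below does this offline in Python; the helper names parse_gtid_set and gtid_subtract are hypothetical, and the old master's set (1-283033) is an assumed value chosen to match the figures reported in this article. On a live server, MySQL's built-in GTID_SUBTRACT() function performs the same operation.

```python
def parse_gtid_set(text: str) -> dict:
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: set of transaction ids}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        ids = result.setdefault(uuid.lower(), set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            # A single id such as '7' has no end, so reuse the start.
            ids.update(range(int(start), int(end or start) + 1))
    return result


def gtid_subtract(a: str, b: str) -> dict:
    """Return the GTIDs present in set a but missing from set b."""
    left, right = parse_gtid_set(a), parse_gtid_set(b)
    diff = {}
    for uuid, ids in left.items():
        extra = sorted(ids - right.get(uuid, set()))
        if extra:
            diff[uuid] = extra
    return diff


UUID = "ffc43852-1d82-11ed-a65f-000c29375703"
old_master = f"{UUID}:1-283033"   # assumed: inferred from the binlog contents
slave = f"{UUID}:1-283030"        # Executed_Gtid_Set reported by the slave

print(gtid_subtract(old_master, slave))  # GTIDs only the old master has
print(gtid_subtract(slave, old_master))  # GTIDs only the slave has
```

With these inputs the first call lists 283031, 283032, and 283033 and the second is empty. Expanding every interval into a Python set is simple but memory‑hungry; for very large GTID ranges an interval‑based comparison would be a better fit.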
The test eventually reproduced a situation where the old master had three extra GTIDs.
# mysql -uadmin -pGta@2019 -S /database/mysql/data/3306/mysqld.sock -e "show slave status\G" | grep "ffc43852-1d82-11ed-a65f-000c29375703"
Master_UUID: ffc43852-1d82-11ed-a65f-000c29375703
Retrieved_Gtid_Set: ffc43852-1d82-11ed-a65f-000c29375703:210837-283030
Executed_Gtid_Set: ffc43852-1d82-11ed-a65f-000c29375703:1-283030
Parsing the master binlog revealed three additional GTIDs (283031 through 283033) beyond 283030, the last GTID the slave had executed.
# cat 16.txt | grep GTID | grep "ffc43852-1d82-11ed-a65f-000c29375703:28303"
SET @@SESSION.GTID_NEXT='ffc43852-1d82-11ed-a65f-000c29375703:283030';
SET @@SESSION.GTID_NEXT='ffc43852-1d82-11ed-a65f-000c29375703:283031';
SET @@SESSION.GTID_NEXT='ffc43852-1d82-11ed-a65f-000c29375703:283032';
SET @@SESSION.GTID_NEXT='ffc43852-1d82-11ed-a65f-000c29375703:283033';
Why the Scenario Is Hard to Simulate
The 2aa window is very short; high TPS and slow I/O (e.g., using a deliberately slow disk for the binlog) increase the chance of reproducing the issue.
Repair Strategies
Restart before failover: If the old master restarts quickly enough that no switch is triggered, the slave reconnects and catches up from it, keeping GTIDs synchronized.
Catch‑up after failover: Use MHA or similar tools to let the new master pull missing binlogs from the old master.
Flashback the old master: Roll back extra GTIDs on the old master before re‑adding it as a replica.
Both “slave catch‑up” and “master rollback” are valid; the choice depends on the operational context.
Conclusion
Under lossless semi‑synchronous replication, business‑level data appears consistent after a high‑availability switch, but underlying binlog/GTID differences can exist. DBAs must understand the replication phases, be able to reproduce the edge cases, and apply appropriate remediation to ensure true data consistency.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.