Analysis of Inconsistencies in MySQL Replication after Slave Crash and Configuration Recommendations
This article examines how slave crashes in MySQL replication can cause inconsistencies between master info and relay log positions, leading to duplicate events and errors such as 1062 and 1032, analyzes underlying mechanisms, presents case studies, and proposes configuration settings to achieve server‑crash‑safe replication.
Part 1 – Background
When a MySQL slave crashes due to configuration issues, the subsequent re‑establishment of replication may raise error 1062 or 1032. Local reproduction shows that some binlog events are duplicated before and after the crash, suggesting that the IO thread may have copied certain events twice.
Two pieces of state are relevant to replication: master info , which stores the persisted relay‑log position of the slave IO thread, and relay‑log info , which stores the position that the slave SQL thread has already applied. Their storage format is controlled by master_info_repository and relay_log_info_repository , while the sync frequency depends on sync_master_info and sync_relay_log_info .
Stored Information
master info
relay log info
Stored position information
Relay‑log position persisted by the slave IO thread
Relay‑log position already applied by the slave SQL thread
Storage‑format control variable
master_info_repository relay_log_info_repositorySync‑frequency variable
sync_master_info sync_relay_log_infoThe article focuses on two inconsistency scenarios that may appear after a slave crash and the problems they cause for master‑slave synchronization:
During sync_master_info , the persisted relay‑log position in master info may diverge from the actual relay‑log position.
During sync_relay_log_info , the position recorded in relay‑log info may diverge from the position actually applied in the relay log.
Part 2 – Analysis
After a slave crash, the mismatch of state information can be traced to MySQL’s non‑transactional operations: sync_master_info and sync_relay_log_info . The following sections analyze each operation.
2.1 sync_master_info inconsistency
Default configuration:
# default configuration
relay_log_recovery = OFF
sync_master_info=10000
sync_relay_log_info=10000
slave_parallel_type=DATABASECall stack for sync_master_info :
# stack‑sync_master_info
/-- handle_slave_io # executed for each received event
/-- flush_master_info
/-- Relay_log_info::flush_current_log # writes event to relay log file
/-- Master_info::flush_info
/-- Rpl_info_handler::flush_info # persists the current position to master infoIf a crash occurs between writing the relay‑log file ( my_b_flush_io_cache ) and the subsequent my_sync , the master info may miss the last position while the relay log already contains the event, causing the slave to re‑pull the same event after restart.
2.1.1 master_info_repository = FILE
When master_info_repository=FILE , the slave stores the position in master.info under the data directory. The flush sequence is:
/-- Rpl_info_handler::flush_info
/-- Rpl_info_file::do_flush_info
/-- flush_io_cache
/-- my_b_flush_io_cache # writes master.info but does not sync
/-- my_sync # executed every
sync_master_info
callsA crash between my_b_flush_io_cache and my_sync leaves the file unsynced; a crash between Relay_log_info::flush_current_log and my_b_flush_io_cache can cause the relay log to contain an event that master info does not record, leading to duplicate events after restart.
Two concrete cases observed:
Case ID
Events before crash
master.info position at crash
Relay log persisted at crash
Binlog start point after restart
Replication success
Case 1
130
master-bin.0000001:5479
master-bin.0000001:0~5524
master-bin.0000001:5479
Y
Case 2
131
master-bin.0000001:5524
master-bin.0000001:0~5564
master-bin.0000001:5524
Y
Both cases cause the slave to rotate a new relay‑log file after re‑establishment, and because the start point is earlier than the last persisted position, an event range (e.g., master-bin.0000001:5479~5524 ) is duplicated. In Case 1 the duplicated event is a Table_map_event without any row‑changing event, so it does not affect data. In Case 2 the duplicated event is a Write_rows_event , but the corresponding Table_map_event is missing in the new relay log, so the row‑change is also ignored.
#210220 14:18:31 server id 1 end_log_pos 5609 CRC32 0x808ed9e0 Table_map: `test`.`t1` mapped to number 221 #210220 14:18:31 server id 1 end_log_pos 5564 CRC32 0x44309595 Write_rows: table id 221 flags: STMT_END_FIf such duplicated events belong to an UPDATE or DELETE transaction, the missing Table_map_event can cause the transaction to be partially applied, leading to error 1032 (record not found) or 1062 (duplicate entry).
2.1.2 master_info_repository = TABLE
When master_info_repository=TABLE , the slave persists the position in the mysql.slave_master_info table. The flush flow is:
/-- Rpl_info_handler::flush_info
/-- Rpl_info_table::do_flush_info
/-- handler::ha_update_row # updates the table; executed every
sync_master_info
(default 10000) eventsOnly after sync_master_info events are processed does the table get updated, so events between two sync points are not recorded. After a restart, the slave resumes from the last recorded position, potentially re‑processing a large block of events and causing duplicate execution.
In GTID mode with MASTER_AUTO_POSITION enabled, the slave uses GTID sets (stored in InnoDB tables) instead of master info, which are updated atomically and avoid the above inconsistency.
2.2 sync_relay_log_info inconsistency
When relay_log_info_repository=FILE , the flush sequence during a transaction commit is:
/-- Xid_apply_log_event::do_apply_event # called for each XID (COMMIT)
/-- Xid_log_event::do_commit
/-- Relay_log_info::flush_info
/-- Rpl_info_handler::flush_info
/-- Rpl_info_file::do_flush_info
/-- flush_io_cache
/-- my_b_flush_io_cache # writes file but does not sync
/-- my_sync # executed every
sync_relay_log_info
(default 10000) callsIf a crash occurs between Xid_log_event::do_commit and my_b_flush_io_cache , the relay‑log info may not reflect the committed transaction, causing the same transaction to be applied again after restart, which can also trigger 1062/1032 errors.
master-bin.0000001:120 Query (begin) # trx1
...
master-bin.0000001:150 Xid (commit)
master-bin.0000001:160 Query (begin) # trx2
...
master-bin.0000001:190 Xid (commit)If the crash happens after the commit of trx2 but before the flush, the slave will re‑apply trx2 on restart, resulting in duplicate execution.
Part 3 – Conclusion
Under certain configurations, the mismatches between master‑info positions, relay‑log persistence, and relay‑log‑info positions can cause events or whole transactions to be executed twice, leading to duplicate‑key errors (1062) or missing‑record errors (1032).
To achieve a server‑crash‑safe replication setup, the following configuration recommendations are provided:
Non‑GTID mode: enable relay_log_info_repository=TABLE and relay_log_recovery=ON .
GTID mode: enable MASTER_AUTO_POSITION=ON and relay_log_recovery=ON (the COM_BINLOG_DUMP_GTID protocol ensures safety).
Tencent Database Technology Team supports internal services such as QQ Space, WeChat Red Packets, Tencent Ads, Tencent Music, Tencent News, and external products on Tencent Cloud like TDSQL‑C (formerly CynosDB), TencentDB for MySQL (CDB), CTSDB, MongoDB, CES, etc. This public account aims to share professional database knowledge with enthusiasts.
Tencent Database Technology
Tencent's Database R&D team supports internal services such as WeChat Pay, WeChat Red Packets, Tencent Advertising, and Tencent Music, and provides external support on Tencent Cloud for TencentDB products like CynosDB, CDB, and TDSQL. This public account aims to promote and share professional database knowledge, growing together with database enthusiasts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.