Analysis of Inconsistencies in MySQL Slave Crash Recovery and Sync Master/Relay Log Info
This article analyzes how MySQL slave crashes can cause inconsistencies between master info and relay log positions during sync_master_info and sync_relay_log_info processes, presents case studies of duplicated events, explains the impact of different storage repositories (FILE vs TABLE), and recommends configuration settings to achieve server‑crash‑safe replication.
When a MySQL slave crashes and restarts, the replication may report errors such as 1062 or 1032. The root cause is often that some binlog events are duplicated because the slave's IO thread persisted events to the relay log while the master‑info or relay‑log‑info records were not yet synchronized.
Replication relies on two pieces of persistent state: master info (the position of the relay log persisted by the slave IO thread) and relay log info (the position of the relay log already applied by the slave SQL thread). Their storage format is controlled by master_info_repository and relay_log_info_repository , while the synchronization frequency is governed by sync_master_info and sync_relay_log_info .
2.1 sync_master_info inconsistency
The purpose of sync_master_info is to write the current relay‑log position into the master‑info file or table. A typical default configuration is:
# 默认配置
relay_log_recovery = OFF
sync_master_info=10000
sync_relay_log_info=10000
slave_parallel_type=DATABASEThe call stack for each received event looks like:
# stack‑sync_master_info
/-- handle_slave_io # executed for every event
/-- flush_master_info
/-- Relay_log_info::flush_current_log # writes event to relay log
/-- Master_info::flush_info
/-- Rpl_info_handler::flush_info # persists position to master infoIf a crash occurs between writing the relay log and persisting the position, the master‑info may miss the last event position while the relay log already contains that event, leading to duplicate processing after restart.
2.1.1 master_info_repository = FILE
When master_info_repository=FILE , the slave stores the position in master.info . The flush sequence is:
/-- Rpl_info_handler::flush_info
/-- Rpl_info_file::do_flush_info
/-- flush_io_cache # writes master.info but does not sync
/-- my_sync # performed every
sync_master_info
callsIf a crash happens after my_b_flush_io_cache but before my_sync , the file may contain unwritten data, causing the master‑info to lag behind the relay log. This manifests as an event that exists in the relay log but not in master‑info, so the slave will re‑pull that event and duplicate it.
Two concrete cases were observed:
Case ID
Events before crash
master‑info position at crash
Relay log persisted
Re‑pull start point
Replication success
Case 1
130
master‑bin.0000001:5479
master‑bin.0000001:0~5524
master‑bin.0000001:5479
Y
Case 2
131
master‑bin.0000001:5524
master‑bin.0000001:0~5564
master‑bin.0000001:5524
Y
Both cases resulted in a duplicated event range (5479‑5524). In Case 1 the duplicated event was a Table_map without any row‑changing event, so it caused no error. In Case 2 a Write_rows event was duplicated, which could trigger a duplicate‑key error (1062) if the row already existed.
2.1.2 master_info_repository = TABLE
When master_info_repository=TABLE , the slave writes to mysql.slave_master_info . The flush flow is:
/-- Rpl_info_handler::flush_info
/-- Rpl_info_table::do_flush_info
/-- handler::ha_update_row # updates the table every
sync_master_info
eventsOnly after sync_master_info events are processed does the table get updated. Consequently, if the slave has processed 11 131 events but sync_master_info is 10 000, the table will record only up to event 10 000. After a restart the slave will start from event 10 001, causing the events 10 001‑11 131 to be applied twice, which may lead to duplicate‑key (1062) or not‑found (1032) errors.
2.2 sync_relay_log_info inconsistency
The relay_log_info_repository can also be FILE or TABLE. When set to FILE, the flush sequence for each transaction commit is:
/-- Xid_apply_log_event::do_apply_event
/-- Xid_log_event::do_commit
/-- Relay_log_info::flush_info
/-- Rpl_info_handler::flush_info
/-- Rpl_info_file::do_flush_info
/-- flush_io_cache # writes file without sync
/-- my_sync # executed every
sync_relay_log_info
callsIf a crash occurs between Xid_log_event::do_commit and my_b_flush_io_cache , the relay‑log‑info may lag behind the actual applied position, causing the same transaction to be re‑executed after restart, again producing duplicate‑key (1062) or not‑found (1032) errors.
Conclusion and Recommendations
Inconsistent master‑info or relay‑log‑info positions can cause events or whole transactions to be replayed, leading to duplicate‑key (1062) or record‑not‑found (1032) errors. To achieve a server‑crash‑safe replication setup, the following configurations are recommended:
For non‑GTID mode: set relay_log_info_repository = TABLE and enable relay_log_recovery = ON .
For GTID mode: enable MASTER_AUTO_POSITION = ON and relay_log_recovery = ON , and use the GTID‑based binlog dump protocol.
Tencent Architect
We share insights on storage, computing, networking and explore leading industry technologies together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.