Databases 15 min read

Analysis of Inconsistencies in MySQL Slave Crash Recovery and Sync Master/Relay Log Info

This article analyzes how MySQL slave crashes can cause inconsistencies between master info and relay log positions during sync_master_info and sync_relay_log_info processes, presents case studies of duplicated events, explains the impact of different storage repositories (FILE vs TABLE), and recommends configuration settings to achieve server‑crash‑safe replication.

Tencent Architect
Tencent Architect
Tencent Architect
Analysis of Inconsistencies in MySQL Slave Crash Recovery and Sync Master/Relay Log Info

When a MySQL slave crashes and restarts, the replication may report errors such as 1062 or 1032. The root cause is often that some binlog events are duplicated because the slave's IO thread persisted events to the relay log while the master‑info or relay‑log‑info records were not yet synchronized.

Replication relies on two pieces of persistent state: master info (the position of the relay log persisted by the slave IO thread) and relay log info (the position of the relay log already applied by the slave SQL thread). Their storage format is controlled by master_info_repository and relay_log_info_repository , while the synchronization frequency is governed by sync_master_info and sync_relay_log_info .

2.1 sync_master_info inconsistency

The purpose of sync_master_info is to write the current relay‑log position into the master‑info file or table. A typical default configuration is:

# 默认配置
relay_log_recovery = OFF
sync_master_info=10000
sync_relay_log_info=10000
slave_parallel_type=DATABASE

The call stack for each received event looks like:

# stack‑sync_master_info
/-- handle_slave_io   # executed for every event
    /-- flush_master_info
        /-- Relay_log_info::flush_current_log   # writes event to relay log
        /-- Master_info::flush_info
            /-- Rpl_info_handler::flush_info   # persists position to master info

If a crash occurs between writing the relay log and persisting the position, the master‑info may miss the last event position while the relay log already contains that event, leading to duplicate processing after restart.

2.1.1 master_info_repository = FILE

When master_info_repository=FILE , the slave stores the position in master.info . The flush sequence is:

/-- Rpl_info_handler::flush_info
    /-- Rpl_info_file::do_flush_info
        /-- flush_io_cache   # writes master.info but does not sync
        /-- my_sync          # performed every
sync_master_info
calls

If a crash happens after my_b_flush_io_cache but before my_sync , the file may contain unwritten data, causing the master‑info to lag behind the relay log. This manifests as an event that exists in the relay log but not in master‑info, so the slave will re‑pull that event and duplicate it.

Two concrete cases were observed:

Case ID

Events before crash

master‑info position at crash

Relay log persisted

Re‑pull start point

Replication success

Case 1

130

master‑bin.0000001:5479

master‑bin.0000001:0~5524

master‑bin.0000001:5479

Y

Case 2

131

master‑bin.0000001:5524

master‑bin.0000001:0~5564

master‑bin.0000001:5524

Y

Both cases resulted in a duplicated event range (5479‑5524). In Case 1 the duplicated event was a Table_map without any row‑changing event, so it caused no error. In Case 2 a Write_rows event was duplicated, which could trigger a duplicate‑key error (1062) if the row already existed.

2.1.2 master_info_repository = TABLE

When master_info_repository=TABLE , the slave writes to mysql.slave_master_info . The flush flow is:

/-- Rpl_info_handler::flush_info
    /-- Rpl_info_table::do_flush_info
        /-- handler::ha_update_row   # updates the table every
sync_master_info
events

Only after sync_master_info events are processed does the table get updated. Consequently, if the slave has processed 11 131 events but sync_master_info is 10 000, the table will record only up to event 10 000. After a restart the slave will start from event 10 001, causing the events 10 001‑11 131 to be applied twice, which may lead to duplicate‑key (1062) or not‑found (1032) errors.

2.2 sync_relay_log_info inconsistency

The relay_log_info_repository can also be FILE or TABLE. When set to FILE, the flush sequence for each transaction commit is:

/-- Xid_apply_log_event::do_apply_event
    /-- Xid_log_event::do_commit
        /-- Relay_log_info::flush_info
            /-- Rpl_info_handler::flush_info
                /-- Rpl_info_file::do_flush_info
                    /-- flush_io_cache   # writes file without sync
                    /-- my_sync          # executed every
sync_relay_log_info
calls

If a crash occurs between Xid_log_event::do_commit and my_b_flush_io_cache , the relay‑log‑info may lag behind the actual applied position, causing the same transaction to be re‑executed after restart, again producing duplicate‑key (1062) or not‑found (1032) errors.

Conclusion and Recommendations

Inconsistent master‑info or relay‑log‑info positions can cause events or whole transactions to be replayed, leading to duplicate‑key (1062) or record‑not‑found (1032) errors. To achieve a server‑crash‑safe replication setup, the following configurations are recommended:

For non‑GTID mode: set relay_log_info_repository = TABLE and enable relay_log_recovery = ON .

For GTID mode: enable MASTER_AUTO_POSITION = ON and relay_log_recovery = ON , and use the GTID‑based binlog dump protocol.

MySQLReplicationRelay Logdatabase consistencySlave CrashSync Master Info
Tencent Architect
Written by

Tencent Architect

We share insights on storage, computing, networking and explore leading industry technologies together.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.