Root Cause Analysis of Slave IO Thread Hang in MySQL Semi‑Sync Replication with Multiple Slaves
When using MySQL 5.7 with rpl_semi_sync_master_wait_for_slave_count=1, starting a second semi‑sync replica often causes the slave_io_thread to stall because the master cannot launch the corresponding dump thread, a bug that was later fixed in MySQL 5.7.23.
Problem description
In MySQL 5.7 environments (observed on 5.7.16, 5.7.17, and 5.7.21) where the master runs with rpl_semi_sync_master_wait_for_slave_count=1 and multiple semi‑sync replicas are attached, the first replica starts normally, but the second one frequently reports Slave_IO_Running: Yes and Slave_SQL_Running: Yes while its binlog is not being synchronized.
Reproduction steps
Configure the master with:
rpl_semi_sync_master_wait_for_slave_count = 1
rpl_semi_sync_master_wait_no_slave = OFF
rpl_semi_sync_master_enabled = ON
rpl_semi_sync_master_wait_point = AFTER_SYNC
Start semi‑sync replication on replica A ( start slave ) – replication works.
Start semi‑sync replication on replica B – the IO thread runs but the replica does not receive binlog updates.
Analysis process
After starting replica A, the master shows one semi‑sync client:
show global status like '%semi%';
+--------------------------------------------+-----------+
| Variable_name | Value |
+--------------------------------------------+-----------+
| Rpl_semi_sync_master_clients | 1 |
| ... | ... |
| Rpl_semi_sync_master_status | ON |
+--------------------------------------------+-----------+
When replica B is started, the master still reports only one client, indicating that the second dump thread never starts.
Thread dumps from the master reveal three dump threads, but the two newly created ones remain in the starting state. The error log shows attempts to start these dump threads and messages about killing a “zombie” dump thread.
2018-05-25T11:31:59.586214+08:00 21847 [Note] Start binlog_dump to master_thread_id(21847) slave_server(873074711), pos(, 4)
2018-05-25T11:32:59.642278+08:00 21850 [Note] While initializing dump thread for slave with UUID
, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(21847).
2018-05-25T11:32:59.642452+08:00 21850 [Note] Start binlog_dump to master_thread_id(21850) slave_server(873074711), pos(, 4)
Further investigation with gstack shows both the old and new dump threads waiting on the Ack_receiver lock, while thread 21875 holds the lock and is blocked in a select() call.
Thread 15 (Thread 0x7f0bce7fc700 (LWP 21875)):
#0 0x00007f0c028c9bd3 in select () from /lib64/libc.so.6
#1 0x00007f0be7589070 in Ack_receiver::run (this=0x7f0be778dae0 <ack_receiver>) at .../semisync_master_ack_receiver.cc:261
#2 0x00007f0be75893f9 in ack_receive_handler (arg=0x7f0be778dae0 <ack_receiver>) at .../semisync_master_ack_receiver.cc:34
#3 0x00000000011cf5f4 in pfs_spawn_thread (arg=0x2d68f00) at .../pfs.cc:2188
#4 0x00007f0c03c08dc5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f0c028d276d in clone () from /lib64/libc.so.6
The problematic code in semisync_master_ack_receiver.cc continuously locks m_mutex, performs a select(), and unlocks, without yielding the CPU, causing other threads to starve:
void Ack_receiver::run() {
  while (1) {
    mysql_mutex_lock(&m_mutex);
    ...
    select(...);
    ...
    mysql_mutex_unlock(&m_mutex);
  }
}
Adding a sched_yield() call after mysql_mutex_unlock() alleviates the issue.
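The lock, wait, unlock, yield pattern can be tried outside MySQL with a small standalone sketch. This is not the MySQL source: std::mutex stands in for m_mutex, a short sleep stands in for select(), std::this_thread::yield() stands in for sched_yield(), and the function name run_until_waiter_gets_lock is hypothetical. A yield is only a scheduling hint and does not guarantee fairness, which is consistent with the workaround alleviating the problem rather than eliminating it.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the ack receiver loop: hold the mutex while
// "waiting for ACKs", release it, then yield before locking again.
// Returns how many loop iterations the receiver completed before a second
// thread managed to take the same mutex and stop it.
long run_until_waiter_gets_lock() {
    std::mutex m;                        // plays the role of m_mutex
    std::atomic<bool> stop{false};
    std::atomic<long> iterations{0};

    std::thread receiver([&] {
        while (!stop.load()) {
            {
                std::lock_guard<std::mutex> guard(m);
                // Stand-in for select(): block briefly while holding the lock.
                std::this_thread::sleep_for(std::chrono::microseconds(200));
                ++iterations;
            }
            // Give waiting threads a chance to run before re-locking.
            std::this_thread::yield();
        }
    });

    // Wait until the receiver loop is demonstrably running.
    while (iterations.load() == 0) {
        std::this_thread::yield();
    }

    // Stand-in for a dump thread that needs the same mutex to register.
    {
        std::lock_guard<std::mutex> guard(m);
        stop.store(true);
    }
    receiver.join();
    return iterations.load();
}
```

Calling run_until_waiter_gets_lock() returns once the second thread has acquired the mutex. Removing the yield() call and lengthening the sleep makes the waiter's delay easier to observe, although the exact behaviour depends on the platform's mutex and scheduler.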
Conclusion
The slave_io_thread stall is caused by the master’s dump thread failing to start because it cannot acquire the Ack_receiver lock.
When rpl_semi_sync_master_wait_for_slave_count=1, the first replica’s ack_receiver holds the lock continuously after the required ACK count is reached, preventing the second dump thread from launching.
The bug was reported to MySQL (bug 89370) and a patch was submitted; MySQL confirmed a fix in version 5.7.23, though the official fix differs from the community patch.
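As a small illustration of acting on the fixed-in version, the following sketch checks whether a server version string belongs to the 5.7 series at 5.7.23 or later. The function name is hypothetical, and the check deliberately says nothing about other release series, which this article does not cover.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Returns true only for 5.7.x version strings with x >= 23, the release in
// which the fix was confirmed. Strings from other series, or strings that
// cannot be parsed, yield false (meaning "not confirmed by this check").
bool is_57_release_with_fix(const std::string& version) {
    int major = 0, minor = 0, patch = 0;
    // Accepts suffixes such as "5.7.23-log"; only the leading numbers are read.
    if (std::sscanf(version.c_str(), "%d.%d.%d", &major, &minor, &patch) != 3) {
        return false;
    }
    return major == 5 && minor == 7 && patch >= 23;
}
```

The version string would typically come from the server itself, for example the output of select version().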
For environments that cannot upgrade immediately, reducing the probability of occurrence may involve adjusting the wait‑for‑slave count or applying the community patch manually.
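Until an upgrade is possible, the zombie dump thread notes quoted in the analysis are one symptom that can be watched for in the master's error log. The sketch below extracts the killed thread id from such a line; the function name is hypothetical, and the match relies only on the message text shown in the log excerpt above, so it is an assumption that other builds word the note identically.

```cpp
#include <cassert>
#include <string>

// Returns the id of the dump thread being killed if `line` contains the
// zombie-dump-thread note, or -1 if the note is absent or malformed.
long zombie_dump_thread_id(const std::string& line) {
    static const std::string marker =
        "Master is killing the zombie dump thread(";
    const std::size_t start = line.find(marker);
    if (start == std::string::npos) return -1;

    const std::size_t digits = start + marker.size();
    const std::size_t close = line.find(')', digits);
    if (close == std::string::npos || close == digits) return -1;

    long id = 0;
    for (std::size_t i = digits; i < close; ++i) {
        if (line[i] < '0' || line[i] > '9') return -1;  // not a plain number
        id = id * 10 + (line[i] - '0');
    }
    return id;
}
```

Feeding each error-log line through this function and counting the non-negative results gives a rough indicator of how often replicas are reconnecting and replacing stuck dump threads.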
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.