
Root Cause Analysis of Slave IO Thread Hang in MySQL Semi‑Sync Replication with rpl_semi_sync_master_wait_for_slave_count=1

This investigation covers a hang in MySQL semi‑sync replication: with rpl_semi_sync_master_wait_for_slave_count=1, starting a second slave can leave the master’s new dump thread unable to start, which stalls that slave’s slave_io_thread. The analysis includes reproduction steps, status checks, thread stack traces, and a patch that uses sched_yield to relieve the lock contention.

Aikesheng Open Source Community

Preface: This article is part of the MySQL column series from the Aikesheng operations team, sharing practical experience on MySQL features, optimization, architecture, HA, monitoring, etc.

Problem description: In environments with multiple semi‑sync replicas and rpl_semi_sync_master_wait_for_slave_count=1, the first replica starts normally, but starting a second replica often causes its slave_io_thread to hang: Slave_IO_Running and Slave_SQL_Running both show Yes, yet the master’s binlog is not synchronized.

Reproduction steps:

1. Configure the master with the following parameters:

rpl_semi_sync_master_wait_for_slave_count = 1
rpl_semi_sync_master_wait_no_slave = OFF
rpl_semi_sync_master_enabled = ON
rpl_semi_sync_master_wait_point = AFTER_SYNC

2. Start semi‑sync replication on slave A and verify normal replication.

3. Start semi‑sync replication on slave B; the replication thread runs but does not sync the master binlog.

Analysis process:

Check the master’s semi‑sync status after slave A starts:

show global status like '%semi%';
+--------------------------------------------+-----------+
| Variable_name                              | Value     |
+--------------------------------------------+-----------+
| Rpl_semi_sync_master_clients               | 1         |
| Rpl_semi_sync_master_status                | ON        |
+--------------------------------------------+-----------+

Inspect the master’s dump thread via performance_schema:

select * from performance_schema.threads where PROCESSLIST_COMMAND='Binlog Dump GTID'\G

Review the master’s error log, which shows the dump thread (21824) starting successfully and the semi‑sync replication being switched ON.

After starting slave B, the master still reports only one semi‑sync client, and a new dump thread fails to start, leaving the slave_io_thread stalled.

Thread stack traces (gstack) reveal that both the existing and new dump threads are waiting on the Ack_receiver lock, while thread 21875 holds the lock and blocks on select().

Thread 15 (Thread 0x7f0bce7fc700 (LWP 21875)):
#0 0x00007f0c028c9bd3 in select () from /lib64/libc.so.6
#1 0x00007f0be7589070 in Ack_receiver::run (this=0x7f0be778dae0) at .../semisync_master_ack_receiver.cc:261
#2 0x00007f0be75893f9 in ack_receive_handler (arg=0x7f0be778dae0) at .../semisync_master_ack_receiver.cc:34
#3 0x00000000011cf5f4 in pfs_spawn_thread (arg=0x2d68f00) at .../pfs.cc:2188
#4 0x00007f0c03c08dc5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f0c028d276d in clone () from /lib64/libc.so.6

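Each iteration of the Ack_receiver loop holds the mutex while it waits in select() for acks, and after releasing the mutex the loop reacquires it without yielding. The new dump thread, which needs the same lock, therefore gets almost no chance to acquire it.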

Proposed fix: add sched_yield(); after mysql_mutex_unlock(&m_mutex); in Ack_receiver::run() so other threads get scheduling opportunities. This change eliminates the hang.
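
The placement of the yield can be illustrated with a small standalone model. This is a sketch under stated assumptions, not MySQL source code: ReceiverModel, add_slave, and simulate_second_slave are hypothetical names, std::mutex stands in for the plugin's mutex, and a 1 ms sleep stands in for the select() call. The loop is capped so the program always terminates; it shows where the yield sits rather than proving that starvation is eliminated.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <thread>
#include <sched.h>  // sched_yield() (POSIX)

// Hypothetical stand-in for Ack_receiver; this is not MySQL's class.
struct ReceiverModel {
    std::mutex m_mutex;                 // plays the role of Ack_receiver's m_mutex
    std::atomic<bool> stopping{false};  // set by the caller to end the loop early
    int slave_count = 0;                // guarded by m_mutex

    // Receive loop: hold the mutex across a short blocking wait, release it,
    // then yield before the next iteration takes the mutex again.
    // The iteration cap guarantees the loop ends even if stopping is never set.
    void run(int max_iterations) {
        for (int i = 0; i < max_iterations && !stopping.load(); ++i) {
            m_mutex.lock();
            // Stand-in for select() with a timeout, performed under the mutex.
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
            m_mutex.unlock();
            sched_yield();  // let threads waiting on m_mutex be scheduled
        }
    }

    // What a new dump thread needs to do: take the mutex to register its slave.
    void add_slave() {
        std::lock_guard<std::mutex> guard(m_mutex);
        ++slave_count;
    }
};

// Registers two slaves while the receive loop is running and returns the
// number of registrations that completed.
int simulate_second_slave() {
    ReceiverModel receiver;
    std::thread ack_thread(&ReceiverModel::run, &receiver, 200);
    receiver.add_slave();  // slave A
    receiver.add_slave();  // slave B
    receiver.stopping.store(true);
    ack_thread.join();
    return receiver.slave_count;
}
```

Calling simulate_second_slave() returns 2 once both registrations have taken the mutex. Removing the sched_yield() line gives the shape of the original loop, in which the receiver thread can retake the mutex before a waiter is scheduled.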

Conclusion:

The root cause of the slave_io_thread stall is the master’s dump thread failing to start due to lock contention in Ack_receiver.

When rpl_semi_sync_master_wait_for_slave_count=1, the first semi‑sync slave causes Ack_receiver to constantly hold the lock, blocking the creation of a second dump thread.

A bug report (MySQL bug #89370) and a patch were submitted; the issue was fixed in MySQL 5.7.23.

Users who cannot upgrade can apply the sched_yield() change to Ack_receiver::run() as a temporary workaround.
