Why MySQL Replication Shows Zero Lag Yet Fails to Sync Changes
This article explains a rare MySQL replication scenario where Seconds_Behind_Master stays at zero while the slave cannot receive updates, analyzes the underlying push‑based replication mechanism, and offers practical steps and configuration tweaks to detect and prevent the issue.
Scenario
MySQL exposes Seconds_Behind_Master in the output of SHOW SLAVE STATUS as its measure of replication delay. We encountered a case where this value stayed at 0 even though the slave was receiving no changes from the master. The IO and SQL threads both appeared normal, yet the master’s updates were not replicated until the slave automatically reconnected about an hour later. The behavior is not specific to one distribution: it applies to MySQL and its compatible forks (Percona Server, MariaDB). Although rare, it is worth knowing about, because it shows how MySQL’s replication retry mechanism works.
Reproduction Steps
Set up master‑slave replication, then temporarily cut the master’s network and, while it is cut, kill the binlog dump thread on the master. On the slave, SHOW SLAVE STATUS still reports:
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0
After the network is restored, changes made on the master are still not propagated, even though the slave continues to report both threads as running and a delay of 0. Monitoring that relies only on these three fields will not detect the problem.
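To see why such monitoring stays green, here is a minimal Python sketch of a typical health check. The function name naive_replication_ok, the threshold, and the dict input are hypothetical; only the three keys come from the SHOW SLAVE STATUS output quoted above.

```python
def naive_replication_ok(status: dict, max_lag_seconds: int = 30) -> bool:
    """Typical check: both threads running and reported lag under a threshold."""
    lag = status.get("Seconds_Behind_Master")
    return (
        status.get("Slave_IO_Running") == "Yes"
        and status.get("Slave_SQL_Running") == "Yes"
        and lag is not None
        and int(lag) <= max_lag_seconds
    )


# The values reported by the stalled slave in the reproduction.
stalled_status = {
    "Slave_IO_Running": "Yes",
    "Slave_SQL_Running": "Yes",
    "Seconds_Behind_Master": 0,
}

# The check passes even though no events are being replicated.
print(naive_replication_ok(stalled_status))
```

Every input this check looks at is reported by the slave itself, so it cannot reveal a stall that the slave is unaware of.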
Principle Analysis
Once a connection is established, MySQL replication is push‑based: the slave does not poll for changes. The slave’s IO thread connects and tells the master which binlog file and position it has reached, and the master starts a binlog dump thread that sends events from that point onward as they are written. If the binlog dump thread is killed and the slave never learns of it, the slave simply receives no events and assumes the master is idle, so Seconds_Behind_Master stays at 0. From the slave’s side, a terminated dump thread and a master with no changes look identical.
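The slave’s point of view can be illustrated with a toy Python model. This is a deliberate simplification and not the server’s actual implementation: the function name and its inputs are hypothetical, and details such as clock differences between the two hosts are ignored.

```python
from typing import Optional


def seconds_behind_master(now: float, executing_event_ts: Optional[float]) -> int:
    """Toy model: lag is measured against the event currently being applied.

    With no pending event there is nothing to compare against, so the
    reported value is 0.
    """
    if executing_event_ts is None:
        return 0
    return max(0, int(now - executing_event_ts))


# Case 1: the master is idle, so no event arrives.
idle_master = seconds_behind_master(now=1000.0, executing_event_ts=None)

# Case 2: the dump thread is dead; the master is writing, but no event arrives.
dead_dump_thread = seconds_behind_master(now=1000.0, executing_event_ts=None)

# Both cases feed the slave the same input, so both report the same lag.
print(idle_master, dead_dump_thread)
```

In this model the value only becomes non-zero once an event actually reaches the slave, which is exactly what does not happen during the stall.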
Cause Analysis
When the binlog dump thread is killed while the network is blocked (or the notification is lost for some other reason), the slave never learns that the thread has terminated. It therefore continues to show a normal status while missing every update.
MySQL guards against this by retrying the connection:
If no data is received for slave-net-timeout seconds, the slave treats the connection as broken and makes its first reconnect attempt.
After that it retries every master-connect-retry seconds, up to master-retry-count times.
The default values are slave-net-timeout = 3600 s, master-connect-retry = 60 s, and master-retry-count = 86400. (MySQL 5.7.7 and later lower the slave-net-timeout default to 60 s.) This is why, in our test, the slave waited about an hour before reconnecting.
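The arithmetic behind these defaults can be worked through in a short Python sketch. The function name reconnect_timeline is hypothetical, and the retry window is an approximation that ignores the time each connection attempt itself takes.

```python
def reconnect_timeline(
    slave_net_timeout: int = 3600,
    master_connect_retry: int = 60,
    master_retry_count: int = 86400,
) -> tuple:
    """Return (seconds of silence before the first retry,
    approximate seconds spent retrying before giving up)."""
    first_retry_after = slave_net_timeout
    retry_window = master_connect_retry * master_retry_count
    return first_retry_after, retry_window


first_retry_after, retry_window = reconnect_timeline()
print(first_retry_after / 3600)   # hours of silence before the first retry: 1.0
print(retry_window / 86400)       # days of retrying with the defaults: 60.0
```

With the defaults quoted above, the silent stall lasts up to one hour, after which the slave keeps retrying for roughly 60 days. Lowering slave-net-timeout shortens only the first figure.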
Prevention Strategies
Passive handling: adjust monitoring so that it detects this silent stall, for example with pt‑heartbeat, or by writing timestamps on the master and comparing them on the slave. When a stall is detected, restart replication with stop slave; start slave;.
Active prevention: set appropriate retry parameters when configuring the slave:
--master-retry-count
--master-connect-retry
--slave-net-timeout
The first two can be specified in CHANGE MASTER TO (as MASTER_RETRY_COUNT and MASTER_CONNECT_RETRY); slave-net-timeout can be adjusted at runtime. Reducing slave-net-timeout speeds up reconnection, but may cause frequent reconnects if the master changes infrequently.
Our monitoring tool (Q Monitor) takes a heartbeat approach similar to Percona’s pt‑heartbeat rather than relying on Seconds_Behind_Master, which gives more reliable delay detection.
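The timestamp comparison behind a heartbeat check can be sketched in Python as follows. This is not the code of pt‑heartbeat or Q Monitor: the function names and the 30-second threshold are hypothetical, and the sketch assumes that some job writes the current time to a table on the master at a fixed interval and that the monitor reads the replicated value back from the slave.

```python
from datetime import datetime, timedelta, timezone


def heartbeat_lag_seconds(last_heartbeat: datetime, now: datetime) -> float:
    """Age of the newest heartbeat row that has reached the slave."""
    return max(0.0, (now - last_heartbeat).total_seconds())


def is_stalled(last_heartbeat: datetime, now: datetime,
               threshold_seconds: float = 30.0) -> bool:
    """Flag replication as stalled when the heartbeat is older than the threshold."""
    return heartbeat_lag_seconds(last_heartbeat, now) > threshold_seconds


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

# Healthy: the master's most recent heartbeat arrived 2 seconds ago.
print(is_stalled(now - timedelta(seconds=2), now))

# Silent stall: the last heartbeat to arrive is 10 minutes old.
print(is_stalled(now - timedelta(minutes=10), now))
```

Because the heartbeat is written on the master, its age on the slave keeps growing whenever events stop arriving, regardless of what the slave believes about its own connection.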
MaGe Linux Operations
