Root Cause Analysis of MySQL Replica Synchronization Hang and Mitigation Recommendations
This article investigates a MySQL 8.0.27 replica that repeatedly hangs during show replica status, analyzes worker thread states, InnoDB status, checkpoint and I/O load, identifies an oversized Redo Log usage as the root cause, and provides configuration and upgrade suggestions to resolve the issue.
Recently a MySQL replica (version 8.0.27, master‑slave architecture) began reporting synchronization errors; the show replica status command was suspended along with flush logs . Important configuration details include binlog_transaction_dependency_tracking=WRITESET and replica_parallel_workers=16 .
1 Preliminary Analysis
1.1 Connection Situation
Sixteen worker threads were observed: 4 in Waiting for preceding transaction to commit , 11 in Applying batch of row changes , and 1 in Executing event . The total wait time reached about 38 hours, and both show replica status and flush logs remained blocked.
1.2 InnoDB Status Output
The InnoDB status showed no obvious issues such as mis‑configured innodb_thread_concurrency . The focus shifted to row operations and file I/O.
1.3 Load Situation
The ib_log_checkpt thread consumed 100% CPU while other threads were idle.
1.4 Error Log
No relevant error messages were found.
1.5 Slow Query Log
No slow queries were recorded.
1.6 Preliminary Conclusion
Although MySQL 8.0 has a known MTS bug (Bug 103636) that can cause workers to wait indefinitely, the observed thread states do not match the bug’s typical pattern, so the bug is unlikely the cause.
2 Source‑Code‑Based Deep Dive
The MTS workflow was traced through the source files sql/rpl_replica.cc , sql/log_event.cc , and related commit‑order manager code. Key steps include allocation of a commit order manager, registration of transactions, waiting on MDL locks in ordered_commit , and the eventual release of locks after commit.
// STEP‑1: allocate commit_order_manager if replica_preserve_commit_order is on
if (opt_replica_preserve_commit_order && !rli->is_parallel_exec() &&
rli->opt_replica_parallel_workers > 1) {
commit_order_mngr = new Commit_order_manager(rli->opt_replica_parallel_workers);
rli->set_commit_order_manager(commit_order_mngr);
}
// ... later in ordered_commit the worker may wait on MDL lock
Commit_order_manager::wait(...);Workers are classified as:
A class: 4 workers in Waiting for preceding transaction to commit
B class: 11 workers in Applying batch of row changes
C class: 1 worker in Executing event
Argument 1
If Bug 103636 were responsible, all workers would be stuck in Waiting for preceding transaction to commit , which is not the case.
Argument 2
It is more plausible that other factors prevent B and C workers from committing, causing A workers to wait for a smaller commit sequence number.
3 Root‑Cause Analysis
3.1 ib_log_checkpt Stack Analysis
Top, stack traces, and perf reports show the checkpoint thread repeatedly executing dirty page flush via buf_pool_get_oldest_modification_approx , suggesting a bottleneck in locating the oldest dirty page.
3.2 System I/O Load
IO monitoring images indicate that overall disk I/O is not saturated, contradicting the hypothesis of slow async IO.
3.3 Re‑examination of InnoDB Status
The redo log usage was calculated: LSN difference = 2.68 GB, while the total redo log size (3 × 1 GB) is 3 GB, resulting in an 89 % utilization—well above the recommended 75 % threshold.
4 Problem Summary and Recommendations
4.1 Summary
The root cause is an undersized Redo Log configuration; during the incident the replica’s redo usage exceeded safe limits, preventing checkpoints from completing and causing worker threads to block on MDL locks in ordered commit.
4.2 Suggested Solutions
Increase the size of Redo Log files.
Upgrade to the latest MySQL 8.0 release to obtain fixes for Bug 103636 and related issues.
Enlarge innodb_buffer_pool_size to provide more memory for InnoDB.
4.3 Data‑Collection Guidance for Complex Issues
A comprehensive set of diagnostic commands is provided, covering process lists, global variables, status, slave status, performance schema threads, InnoDB transactions, mutexes, data locks, stack traces, and perf recordings. The commands are wrapped in pre blocks to preserve formatting.
su - mysql
currdt=`date +%Y%m%d_%H%M%S`
mkdir /tmp/diag_info_`hostname -i`_$currdt
# ... (rest of the script as in the original article) ...These steps help reproduce the environment, capture relevant metrics, and facilitate root‑cause identification for similar MySQL replication stalls.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.