Databases 14 min read

Root Cause Analysis of MySQL Replica Synchronization Hang and Mitigation Recommendations

This article investigates a MySQL 8.0.27 replica that repeatedly hangs during show replica status, analyzes worker thread states, InnoDB status, checkpoint and I/O load, identifies an oversized Redo Log usage as the root cause, and provides configuration and upgrade suggestions to resolve the issue.

Aikesheng Open Source Community

Apr 2, 2024

Root Cause Analysis of MySQL Replica Synchronization Hang and Mitigation Recommendations

Recently a MySQL replica (version 8.0.27, master‑slave architecture) began reporting synchronization errors; the show replica status command was suspended along with flush logs. Important configuration details include binlog_transaction_dependency_tracking=WRITESET and replica_parallel_workers=16.

1 Preliminary Analysis

1.1 Connection Situation

Sixteen worker threads were observed: 4 in Waiting for preceding transaction to commit , 11 in Applying batch of row changes , and 1 in Executing event . The total wait time reached about 38 hours, and both show replica status and flush logs remained blocked.

1.2 InnoDB Status Output

The InnoDB status showed no obvious issues such as mis‑configured innodb_thread_concurrency. The focus shifted to row operations and file I/O.

1.3 Load Situation

The ib_log_checkpt thread consumed 100% CPU while other threads were idle.

1.4 Error Log

No relevant error messages were found.

1.5 Slow Query Log

No slow queries were recorded.

1.6 Preliminary Conclusion

Although MySQL 8.0 has a known MTS bug (Bug 103636) that can cause workers to wait indefinitely, the observed thread states do not match the bug’s typical pattern, so the bug is unlikely the cause.

2 Source‑Code‑Based Deep Dive

The MTS workflow was traced through the source files sql/rpl_replica.cc, sql/log_event.cc, and related commit‑order manager code. Key steps include allocation of a commit order manager, registration of transactions, waiting on MDL locks in ordered_commit, and the eventual release of locks after commit.

// STEP‑1: allocate commit_order_manager if replica_preserve_commit_order is on
if (opt_replica_preserve_commit_order && !rli->is_parallel_exec() &&
    rli->opt_replica_parallel_workers > 1) {
    commit_order_mngr = new Commit_order_manager(rli->opt_replica_parallel_workers);
    rli->set_commit_order_manager(commit_order_mngr);
}
// ... later in ordered_commit the worker may wait on MDL lock
Commit_order_manager::wait(...);

Workers are classified as:

A class: 4 workers in Waiting for preceding transaction to commit

B class: 11 workers in Applying batch of row changes

C class: 1 worker in Executing event

Argument 1

If Bug 103636 were responsible, all workers would be stuck in Waiting for preceding transaction to commit , which is not the case.

Argument 2

It is more plausible that other factors prevent B and C workers from committing, causing A workers to wait for a smaller commit sequence number.

3 Root‑Cause Analysis

3.1 ib_log_checkpt Stack Analysis

Top, stack traces, and perf reports show the checkpoint thread repeatedly executing dirty page flush via buf_pool_get_oldest_modification_approx, suggesting a bottleneck in locating the oldest dirty page.

3.2 System I/O Load

IO monitoring images indicate that overall disk I/O is not saturated, contradicting the hypothesis of slow async IO.

3.3 Re‑examination of InnoDB Status

The redo log usage was calculated: LSN difference = 2.68 GB, while the total redo log size (3 × 1 GB) is 3 GB, resulting in an 89 % utilization—well above the recommended 75 % threshold.

4 Problem Summary and Recommendations

4.1 Summary

The root cause is an undersized Redo Log configuration; during the incident the replica’s redo usage exceeded safe limits, preventing checkpoints from completing and causing worker threads to block on MDL locks in ordered commit.

4.2 Suggested Solutions

Increase the size of Redo Log files.

Upgrade to the latest MySQL 8.0 release to obtain fixes for Bug 103636 and related issues.

Enlarge innodb_buffer_pool_size to provide more memory for InnoDB.

4.3 Data‑Collection Guidance for Complex Issues

A comprehensive set of diagnostic commands is provided, covering process lists, global variables, status, slave status, performance schema threads, InnoDB transactions, mutexes, data locks, stack traces, and perf recordings. The commands are wrapped in pre blocks to preserve formatting.

su - mysql
currdt=`date +%Y%m%d_%H%M%S`
mkdir /tmp/diag_info_`hostname -i`_$currdt
# ... (rest of the script as in the original article) ...

These steps help reproduce the environment, capture relevant metrics, and facilitate root‑cause identification for similar MySQL replication stalls.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

performance InnoDB mysql Replication MTS RootCause

Written by

Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.