Why MySQL 8.0 MTS Hangs After 2³¹ Transactions and How to Reproduce the Bug
MySQL versions before 8.0.28 can hang in multi-threaded slave replication when the internal commit sequence number overflows after 2³¹ transactions. Once that happens, workers stop waking each other and the replica stalls. This article explains how slave_preserve_commit_order is implemented, describes the overflow bug, and gives a step-by-step reproduction method.
Problem description
In MySQL versions before 8.0.28, the multi-threaded slave (MTS) may hang after running for a while. The hang is triggered when an internal 32-bit signed integer overflows after 2³¹ transactions, so it acts as a hidden time bomb.
Relevant bug
https://bugs.mysql.com/bug.php?id=103636
Implementation of slave_preserve_commit_order
The slave_preserve_commit_order parameter ensures that transactions commit on the replica in the same order as on the primary. Its implementation consists of two main steps:
During SQL thread dispatch, each transaction’s order is recorded using a sequence number derived from the primary’s binlog order.
Worker threads execute transactions out of order, but must commit in the recorded order.
A global Commit_order_manager tracks this information. Each worker node contains the following key fields:
m_commit_sequence_nr: the commit sequence number.
value_type m_worker_id: the worker thread ID.
MDL_context *m_mdl_context: MDL lock information.
memory::Aligned_atomic<Commit_order_queue::enum_worker_stage> m_stage: the worker's stage (state information).
The commit queue (m_commit_queue) is a lock-free FIFO that stores worker IDs in dispatch order. Before committing, a worker checks the head of this queue; if the head ID is not its own, the worker changes its state to REQUESTED_GRANT and waits.
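The ordering rule above can be modelled in a few lines. The sketch below is a simplified, hypothetical C++ model (CommitOrderGate and its methods are illustrative names, not MySQL's API). It uses a mutex and a condition variable instead of MySQL's lock-free queue, but it enforces the same rule: worker IDs are recorded in dispatch order, and a worker may commit only when its ID is at the head of the queue.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical, simplified model of commit ordering. MySQL uses a
// lock-free queue; this sketch uses a mutex and a condition variable.
class CommitOrderGate {
 public:
  // Coordinator (SQL thread): record the worker ID in dispatch order.
  void register_trx(std::size_t worker_id) {
    std::lock_guard<std::mutex> lock(mu_);
    queue_.push_back(worker_id);
  }

  // Non-blocking check: is this worker at the head of the queue?
  bool is_turn(std::size_t worker_id) {
    std::lock_guard<std::mutex> lock(mu_);
    return !queue_.empty() && queue_.front() == worker_id;
  }

  // Worker: block until it is this worker's turn to commit
  // (the counterpart of waiting in the REQUESTED_GRANT state).
  void wait_for_turn(std::size_t worker_id) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [&] {
      return !queue_.empty() && queue_.front() == worker_id;
    });
  }

  // Worker: after committing, remove itself from the head and wake
  // every waiter so the next worker in line can re-check its turn.
  void finish(std::size_t worker_id) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      assert(!queue_.empty() && queue_.front() == worker_id);
      queue_.pop_front();
    }
    cv_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<std::size_t> queue_;
};
```

Waking every waiter with notify_all keeps the model simple at the cost of spurious wake-ups; MySQL instead wakes only the next worker, after the checks described in the next section, and that targeted wake-up is where the bug lives.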
Wake‑up logic
When a worker finishes applying a transaction, Commit_order_manager::finish_one attempts to wake the next waiting worker. The wake‑up succeeds only if three conditions are met:
A worker thread exists.
The next worker’s node state is FINISHED_APPLYING or REQUESTED_GRANT.
The next worker’s m_commit_sequence_nr equals the current worker’s m_commit_sequence_nr + 1.
The bug lies in the third condition. Sequence numbers come from a generator of type unsigned long long, but the expected next number (next_seq_nr) is held in an int. When the current worker's number reaches 2,147,483,647, adding one wraps the int to -2,147,483,648. The equality check then fails, the next worker is never woken, and no further commits can proceed.
Bug reproduction
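To reproduce the issue quickly, the source code is modified so that the global commit sequence generator starts near the overflow point (e.g., 2,147,483,640), and WRITESET dependency tracking is enabled so that the replica can apply transactions in parallel (these variables control the dependency information the server writes to its binary log, so they take effect on the primary):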
mysql> set global transaction_write_set_extraction=XXHASH64;
mysql> set global binlog_transaction_dependency_tracking=WRITESET;
Eight parallel worker threads are enabled on the replica, with slave_preserve_commit_order turned on. After compiling the modified source and restarting, the hang can be observed. For debugging, set a breakpoint around line 286 of rpl_slave_commit_order_manager.cc (inside Commit_order_manager::finish_one) and inspect the sequence numbers:
(gdb) p *(this->m_workers[next_worker].m_commit_sequence_nr.m_underlying)
(gdb) p next_seq_nr
The output shows m_commit_sequence_nr as 2,147,483,648 while next_seq_nr is -2,147,483,648, confirming the overflow.
Impact and mitigation
The issue is tied to the slave_preserve_commit_order parameter: disabling it avoids the bug, because the commit order manager is then not initialized.
Restarting the primary and replica resets the global counter, which only postpones the hang until the counter overflows again.
The bug was fixed in MySQL 8.0.28; the patch changes the type of cs::apply::Commit_order_queue::sequence_type to unsigned long long and adjusts related code.
Conclusion
The hang is a multi‑threaded replication bug caused by a 32‑bit integer overflow in the commit sequence generator.
It only manifests on a busy multi-threaded replica, once more than 2,147,483,647 transactions have been processed since the counter was last reset.
Disabling slave_preserve_commit_order or upgrading to MySQL 8.0.28+ resolves the problem.
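How long the time bomb takes to go off is simple arithmetic: 2³¹ divided by the sustained apply rate. The sketch below assumes the counter starts at zero and the rate is constant (days_until_int32_overflow is a hypothetical helper); the rates in the usage note are illustrative, not measurements from the article.

```cpp
#include <cassert>

// Days until a counter starting at 0 passes 2^31 - 1, assuming a
// constant apply rate in transactions per second.
double days_until_int32_overflow(double trx_per_second) {
  assert(trx_per_second > 0);
  const double kOverflowAt = 2147483648.0;  // 2^31
  const double kSecondsPerDay = 86400.0;
  return kOverflowAt / trx_per_second / kSecondsPerDay;
}
```

At an assumed 10,000 transactions per second this gives about 2.5 days; at 1,000 per second, about 25 days.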
dbaplus Community
