Why Flush Tables with Read Lock Can Deadlock MySQL Replication and How to Fix It
The article analyzes a rare three‑thread deadlock caused by FLUSH TABLES WITH READ LOCK during parallel replication on MySQL 5.7, explains the MDL lock interactions, shows how to reproduce the issue, and provides practical kill‑thread and configuration solutions to avoid it.
Background
A production MySQL 5.7.18 instance (with minor internal patches) experienced a backup‑related replica stall. The SHOW FULL PROCESSLIST output revealed two deadlocks involving the backup thread and parallel replication workers.
Deadlock Cases
Case 1 (Figure 1) : Thread 162 runs innobackup and holds a global read lock (MDL::global read lock). Thread 144 is the parallel replication coordinator, while threads 145 and 146 are workers. Thread 145 waits for the global read lock to be released, thread 146 holds a global commit lock, and thread 162 blocks on the commit lock held by thread 146, forming a 145→162→146→145 cycle.
Case 2 (Figure 2) : Thread 183 (backup) holds the global read lock, thread 165 is the coordinator, and threads 166/167 are workers. Thread 165 waits for a dependent transaction, while thread 183 blocks on the commit lock held by thread 167, and thread 166 blocks on the read lock held by thread 183, creating a 183→167→166→183 cycle.
MDL Lock Analysis
The deadlocks involve two service‑level MDL locks:
MDL::global read lock (shared)
MDL::global commit lock (shared)
During any data‑modifying transaction MySQL also requests an intention‑exclusive lock on the global MDL key and, before commit, an intention‑exclusive lock on the commit MDL key. These locks are not compatible with the shared locks held by the backup thread.
Reproduction Steps
On the primary, create parallel‑replication transactions and synchronize them with DEBUG_SYNC to pause at the flush stage.
On the replica, modify the source code to insert sleep(20) at critical points in Xid_apply_log_event::do_apply_event_worker and the worker dispatch loop, then rebuild the server.
Example debug commands:
session 1
SET DEBUG_SYNC='waiting_in_the_middle_of_flush_stage SIGNAL s1 WAIT_FOR f';
INSERT INTO test.test VALUES (13); // Transaction A
session 2
SET DEBUG_SYNC='now WAIT_FOR s1';
SET DEBUG_SYNC='bgc_after_enrolling_for_flush_stage SIGNAL f';
INSERT INTO test.test VALUES (16); // Transaction BModified code snippet (inserted sleeps):
// Xid_apply_log_event::do_apply_event_worker
if (w->id == 0) {
std::cout << "before commit" << std::endl;
sleep(20);
}
// pop_jobs_item
if (worker->id == 0) {
sleep(20);
}Resolution
For the first deadlock, killing the backup thread (thread 162) is the safest because the waiting worker (thread 146) is stuck in mysql_cond_wait and cannot be killed.
For the second deadlock, also kill the backup thread (thread 183) to break the cycle.
Prevention
Set kill‑long‑queries‑timeout in XtraBackup to automatically kill long‑running queries that block the global lock.
Use the safe‑slave‑backup option to pause the SQL thread during backup (though this is aggressive).
Disable ordered commit on the replica by setting slave_preserve_commit_order=0, which removes the ordering constraint that contributes to the deadlock.
Adjust kill‑long‑query‑type explicitly (e.g., ALL or SELECT) because the default may not take effect without a proper value.
Summary
The root cause of the replica stall is that FLUSH TABLES WITH READ LOCK acquires two MDL locks (global read and global commit) which are not atomic with transaction execution; combined with parallel replication and ordered commit, three threads can deadlock. The article also uncovers a subtle bug in XtraBackup’s kill‑long‑query‑type handling, emphasizing the need to specify the type explicitly.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
