Investigation of Deadlock Issues in MySQL Replication with Delayed Slave and Parallel Apply
This article analyzes a MySQL master‑slave setup where a delayed replica experiences growing replication lag due to deadlocks caused by REPLACE INTO statements, examines the lock behavior, reproduces the issue in a test environment, and proposes configuration and coding mitigations.
Background
A production MySQL cluster with one master and two slaves (one of which is configured with a 1‑day delayed replication) showed that the delayed slave’s replication lag exceeded one day and kept increasing, even though the server status appeared normal.
Running show slave status revealed that both IO and SQL threads reported YES, but the SQL thread actually had errors such as Last_Errno: 1205 and a long Seconds_Behind_Master value.
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 1205
Last_Error: Coordinator stopped because there were error(s) in the worker(s)...
Seconds_Behind_Master: 451836
SQL_Delay: 86400Attempting to stop the slave with stop slave caused the session to hang; only a kill -9 on the MySQL server could terminate it.
The instance runs MySQL 5.7.31 with the following relevant parameters:
slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 16
slave_preserve_commit_order = on
binlog_transaction_dependency_tracking = WRITESET
transaction_isolation = READ-COMMITTEDProblem Investigation
Executing show full processlist displayed many system‑user connections waiting for workers to exit or for preceding transactions to commit, indicating that the parallel workers were blocked.
+----------+-------------+-----------------+------+---------+--------+---------------------------------------------+--------------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----------+-------------+-----------------+------+---------+--------+---------------------------------------------+--------------------------------------------------------------------+
| 75846950 | system user | | NULL | Connect | 3193 | Waiting for workers to exit | NULL |
| ... (many similar rows) ...
+----------+-------------+-----------------+------+---------+--------+---------------------------------------------+--------------------------------------------------------------------+Querying performance_schema.replication_applier_status_by_worker showed that workers 8, 11, and 12 repeatedly failed with error 1205 (lock wait timeout) while processing REPLACE INTO statements.
WORKER_ID: 11
LAST_ERROR_NUMBER: 1205
LAST_ERROR_MESSAGE: Worker 11 failed executing transaction 'ANONYMOUS' at master log binlog.000338, end_log_pos 40433204; Lock wait timeout exceeded; try restarting transactionThe failing GTID corresponded to REPLACE INTO statements that internally perform a DELETE followed by an INSERT due to unique‑key conflicts.
delete from test where id=30 and a='a30' and b='b30' and c='c30' and d='d30';
insert into test values(30,'a30','b30','c30','d30');Using show engine innodb status revealed the locks held by the transaction, including IX lock on the table, X lock on the primary key row, and Next‑key locks on the unique index entries.
---TRANSACTION 5568158, ACTIVE 10 sec
TABLE LOCK table `kk`.`test` trx id 5568158 lock mode IX
RECORD LOCKS ... index PRIMARY ... lock_mode X locks rec but not gap
RECORD LOCKS ... index u_k ... lock_mode X locks rec but not gapTest Cases
A test table test with a composite primary key, a composite unique index, and a secondary index was created and populated with five rows. Four REPLACE INTO scenarios were executed:
replace into test values(20,'a20','b20','c20','d20');
replace into test values(25,'a25','b25','c25','d25');
replace into test values(35,'a35','b35','c35','d35');
replace into test values(40,'a40','b40','c40','d40');
Each session was blocked by the transaction from session 1, waiting for Next‑key or X locks on the relevant unique‑index entries, confirming that REPLACE INTO can lock adjacent rows when a unique‑key conflict occurs.
---TRANSACTION 5568170, ACTIVE 3 sec inserting
replace into test values(20,'a20','b20','c20','d20')
... waiting for X lock on (a='a30',b='b30',id=30)Analysis of Replication Deadlock
With binlog_transaction_dependency_tracking = WRITESET, non‑conflicting transactions can be applied in parallel on the replica. However, because slave_preserve_commit_order = on, the replica must commit transactions in the same order as the master. When two parallel workers hold Next‑key locks on the same unique‑index entry (e.g., id = 40) while waiting for each other’s commit, a deadlock occurs.
In MySQL 5.7 this deadlock can trigger a bug where the worker thread loses its signal, causing the entire replication process to hang, as documented in bugs 87796, 89247, 95249, and 99440.
Mitigation Strategies
Reduce slave_parallel_workers if the workload permits.
Disable commit‑order preservation by setting slave_preserve_commit_order = off.
Replace REPLACE INTO with explicit SELECT‑then‑INSERT or UPDATE logic to avoid Next‑key locks.
Upgrade to MySQL 8.0.23 or later where the bug is fixed; the fix for 5.7 is not yet scheduled.
References
MySQL 5.7 REPLACE documentation: https://dev.mysql.com/doc/refman/5.7/en/replace.html
Related analysis articles and bug reports (links provided in the original text).
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
