Databases 17 min read

Investigation of Deadlock Issues in MySQL Replication with Delayed Slave and Parallel Apply

This article analyzes a MySQL master‑slave setup where a delayed replica experiences growing replication lag due to deadlocks caused by REPLACE INTO statements, examines the lock behavior, reproduces the issue in a test environment, and proposes configuration and coding mitigations.

Aikesheng Open Source Community
Aikesheng Open Source Community
Aikesheng Open Source Community
Investigation of Deadlock Issues in MySQL Replication with Delayed Slave and Parallel Apply

Background

A production MySQL cluster with one master and two slaves (one of which is configured with a 1‑day delayed replication) showed that the delayed slave’s replication lag exceeded one day and kept increasing, even though the server status appeared normal.

Running show slave status revealed that both IO and SQL threads reported YES, but the SQL thread actually had errors such as Last_Errno: 1205 and a long Seconds_Behind_Master value.

Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 1205
Last_Error: Coordinator stopped because there were error(s) in the worker(s)...
Seconds_Behind_Master: 451836
SQL_Delay: 86400

Attempting to stop the slave with stop slave caused the session to hang; only a kill -9 on the MySQL server could terminate it.

The instance runs MySQL 5.7.31 with the following relevant parameters:

slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 16
slave_preserve_commit_order = on
binlog_transaction_dependency_tracking = WRITESET
transaction_isolation = READ-COMMITTED

Problem Investigation

Executing show full processlist displayed many system‑user connections waiting for workers to exit or for preceding transactions to commit, indicating that the parallel workers were blocked.

+----------+-------------+-----------------+------+---------+--------+---------------------------------------------+--------------------------------------------------------------------+
| Id       | User        | Host            | db   | Command | Time   | State                                       | Info                                                               |
+----------+-------------+-----------------+------+---------+--------+---------------------------------------------+--------------------------------------------------------------------+
| 75846950 | system user |                 | NULL | Connect | 3193   | Waiting for workers to exit                | NULL                                                               |
| ... (many similar rows) ...
+----------+-------------+-----------------+------+---------+--------+---------------------------------------------+--------------------------------------------------------------------+

Querying performance_schema.replication_applier_status_by_worker showed that workers 8, 11, and 12 repeatedly failed with error 1205 (lock wait timeout) while processing REPLACE INTO statements.

WORKER_ID: 11
LAST_ERROR_NUMBER: 1205
LAST_ERROR_MESSAGE: Worker 11 failed executing transaction 'ANONYMOUS' at master log binlog.000338, end_log_pos 40433204; Lock wait timeout exceeded; try restarting transaction

The failing GTID corresponded to REPLACE INTO statements that internally perform a DELETE followed by an INSERT due to unique‑key conflicts.

delete from test where id=30 and a='a30' and b='b30' and c='c30' and d='d30';
insert into test values(30,'a30','b30','c30','d30');

Using show engine innodb status revealed the locks held by the transaction, including IX lock on the table, X lock on the primary key row, and Next‑key locks on the unique index entries.

---TRANSACTION 5568158, ACTIVE 10 sec
TABLE LOCK table `kk`.`test` trx id 5568158 lock mode IX
RECORD LOCKS ... index PRIMARY ... lock_mode X locks rec but not gap
RECORD LOCKS ... index u_k ... lock_mode X locks rec but not gap

Test Cases

A test table test with a composite primary key, a composite unique index, and a secondary index was created and populated with five rows. Four REPLACE INTO scenarios were executed:

replace into test values(20,'a20','b20','c20','d20');

replace into test values(25,'a25','b25','c25','d25');

replace into test values(35,'a35','b35','c35','d35');

replace into test values(40,'a40','b40','c40','d40');

Each session was blocked by the transaction from session 1, waiting for Next‑key or X locks on the relevant unique‑index entries, confirming that REPLACE INTO can lock adjacent rows when a unique‑key conflict occurs.

---TRANSACTION 5568170, ACTIVE 3 sec inserting
replace into test values(20,'a20','b20','c20','d20')
... waiting for X lock on (a='a30',b='b30',id=30)

Analysis of Replication Deadlock

With binlog_transaction_dependency_tracking = WRITESET, non‑conflicting transactions can be applied in parallel on the replica. However, because slave_preserve_commit_order = on, the replica must commit transactions in the same order as the master. When two parallel workers hold Next‑key locks on the same unique‑index entry (e.g., id = 40) while waiting for each other’s commit, a deadlock occurs.

In MySQL 5.7 this deadlock can trigger a bug where the worker thread loses its signal, causing the entire replication process to hang, as documented in bugs 87796, 89247, 95249, and 99440.

Mitigation Strategies

Reduce slave_parallel_workers if the workload permits.

Disable commit‑order preservation by setting slave_preserve_commit_order = off.

Replace REPLACE INTO with explicit SELECT‑then‑INSERT or UPDATE logic to avoid Next‑key locks.

Upgrade to MySQL 8.0.23 or later where the bug is fixed; the fix for 5.7 is not yet scheduled.

References

MySQL 5.7 REPLACE documentation: https://dev.mysql.com/doc/refman/5.7/en/replace.html

Related analysis articles and bug reports (links provided in the original text).

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

performancedatabasemysqlReplicationParallel Apply
Aikesheng Open Source Community
Written by

Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.