
Why Percona XtraBackup Can Deadlock MySQL 5.6 Replicas and How to Fix It

This article analyzes two deadlock cases that occur when running Percona‑XtraBackup on a MySQL 5.6 replica, explains the underlying lock interactions, shows how to reproduce the issue with SystemTap, and provides safe remediation steps for operators.

ITPUB

Background

The article analyses deadlock situations that occur when a physical backup is performed on a MySQL 5.6 replica using Percona‑XtraBackup (PXB). The deadlock is caused by interactions between the backup thread, the replication SQL thread and, in a second case, a user thread that issues STOP SLAVE.

Percona‑XtraBackup workflow

Copy InnoDB redo logs in a dedicated thread until the backup finishes.

Copy all InnoDB .ibd files.

Acquire a global read lock with FLUSH TABLES WITH READ LOCK (FTWRL). This takes shared MDL locks in two namespaces: the GLOBAL lock, which blocks writes, and the COMMIT lock, which blocks transaction commits.

Copy non-InnoDB files, such as .frm table definitions and MyISAM .MYD/.MYI data and index files.

Obtain replication positions by executing SHOW SLAVE STATUS and SHOW MASTER STATUS.

Release the lock with UNLOCK TABLES.

Perform final cleanup.
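The FTWRL step is the part of this workflow that interacts with replication. The Python below is a minimal sketch of why a session holding FTWRL stalls both writes and commits. It assumes that only the shared (S) and intention-exclusive (IX) modes of MDL scoped locks matter here. Real MySQL has more modes and namespaces. Mode, compatible and blocked_by_ftwrl are hypothetical names, not MySQL APIs.

```python
from enum import Enum

class Mode(Enum):
    """Two MDL scoped-lock modes (simplified; real MySQL has more)."""
    IX = "intention exclusive"  # writers on GLOBAL, committers on COMMIT
    S = "shared"                # taken by FTWRL on both GLOBAL and COMMIT

def compatible(held: Mode, requested: Mode) -> bool:
    """S conflicts with IX; two S or two IX locks can coexist."""
    return held == requested

# Scoped locks held by a session that has completed FLUSH TABLES WITH READ LOCK.
FTWRL_HOLDS = {"GLOBAL": Mode.S, "COMMIT": Mode.S}

def blocked_by_ftwrl(namespace: str, requested: Mode) -> bool:
    """True if a request in this namespace must wait for the FTWRL session."""
    held = FTWRL_HOLDS.get(namespace)
    return held is not None and not compatible(held, requested)

print(blocked_by_ftwrl("GLOBAL", Mode.IX))  # True: DML is blocked
print(blocked_by_ftwrl("COMMIT", Mode.IX))  # True: commits, including the SQL thread's, are blocked
print(blocked_by_ftwrl("COMMIT", Mode.S))   # False: a second FTWRL can share the lock
```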

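Because the global read lock blocks all writes for as long as the non-InnoDB files are being copied, backups are usually taken on a replica rather than on the primary. This copy can take a long time when there are many MyISAM tables.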

Deadlock analysis

Case 1

During the backup, SHOW SLAVE STATUS hangs and the replication SQL thread shows the state "Waiting for commit lock". Backtraces reveal the following cycle:

# Backup thread backtrace (waiting in SHOW SLAVE STATUS)
#0 __lll_lock_wait
#1 _L_lock_974
#2 __GI___pthread_mutex_lock
#3 inline_mysql_mutex_lock
#4 show_slave_status
#5 mysql_execute_command
# ...

# SQL thread backtrace (waiting for COMMIT lock)
#0 pthread_cond_timedwait
#1 inline_mysql_cond_timedwait
#2 MDL_wait::timed_wait
#3 MDL_context::acquire_lock
#4 ha_commit_trans
#5 trans_commit
#6 Xid_log_event::do_commit
#7 Xid_log_event::do_apply_event
#8 Log_event::apply_event
#9 apply_event_and_update_pos
#10 exec_relay_log_event
#11 handle_slave_sql
# ...

Sequence of events:

The backup thread executes FTWRL and acquires the COMMIT MDL lock.

The SQL thread reaches an Xid event, tries to acquire the COMMIT lock and blocks.

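The backup thread calls SHOW SLAVE STATUS, which needs the rli->data_lock mutex. The SQL thread already holds that mutex while it applies the Xid event. The backup thread therefore blocks and never reaches UNLOCK TABLES, so it never releases the COMMIT lock.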

rli->data_lock is a plain pthread mutex rather than an MDL lock. MySQL's MDL deadlock detector therefore cannot see this edge of the cycle, and the deadlock is never detected.
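Conceptually, the MDL deadlock detector searches a wait-for graph that contains only MDL waits. The toy Python model below encodes Case 1 as two edges. It runs a plain depth-first cycle search twice: once over the full graph and once over the MDL-only subset. The thread names and edge kinds are illustrative assumptions, not MySQL data structures.

```python
# Hypothetical wait-for graph for Case 1: (waiter, holder, kind of wait).
EDGES = [
    ("sql_thread", "backup_thread", "mdl"),    # Xid commit waits for COMMIT MDL held by FTWRL
    ("backup_thread", "sql_thread", "mutex"),  # SHOW SLAVE STATUS waits for rli->data_lock
]

def find_cycle(edges):
    """Return one cycle as a list of nodes, or None, using depth-first search."""
    graph = {}
    for waiter, holder, _kind in edges:
        graph.setdefault(waiter, []).append(holder)

    def dfs(node, path, on_path):
        for nxt in graph.get(node, []):
            if nxt in on_path:
                return path[path.index(nxt):]  # back edge closes a cycle
            cycle = dfs(nxt, path + [nxt], on_path | {nxt})
            if cycle:
                return cycle
        return None

    for start in graph:
        cycle = dfs(start, [start], {start})
        if cycle:
            return cycle
    return None

# The full graph contains the cycle...
print(find_cycle(EDGES))  # ['sql_thread', 'backup_thread']
# ...but a detector that only sees MDL edges finds nothing.
print(find_cycle([e for e in EDGES if e[2] == "mdl"]))  # None
```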

Case 2

After the bug behind Case 1 was fixed, a new deadlock appears when a user issues STOP SLAVE while the replica is in the Case 1 state, that is, while the SQL thread is waiting for the COMMIT lock held by the backup thread. The backtrace of the STOP SLAVE thread shows it waiting for the SQL thread to exit, while the SQL thread is still blocked on the COMMIT lock.

#0 pthread_cond_timedwait
#1 inline_mysql_cond_timedwait
#2 terminate_slave_thread
#3 terminate_slave_threads
#4 stop_slave
#5 mysql_execute_command
# ...

Deadlock chain (three threads):

Backup thread holds the COMMIT lock (acquired by FTWRL).

SQL thread requests the COMMIT lock during the Xid event and blocks.

User thread runs STOP SLAVE, holds LOCK_active_mi, and waits for the SQL thread to exit.

Backup thread runs SHOW SLAVE STATUS, which needs LOCK_active_mi. It waits for the user thread, which closes the cycle.

Because this situation is caused by the order of administrative commands, the recommendation is to avoid issuing STOP SLAVE while the replica is waiting for a commit lock.
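Operational tooling could enforce this recommendation with a pre-check before STOP SLAVE. The sketch below makes two assumptions. First, rows are shaped like a subset of SHOW PROCESSLIST output (Id, User, State). Second, replication threads report their User as "system user", as they do in MySQL 5.6. ProcessRow and stop_slave_is_safe are hypothetical names.

```python
from typing import Iterable, NamedTuple

class ProcessRow(NamedTuple):
    """Hypothetical subset of SHOW PROCESSLIST columns."""
    id: int
    user: str
    state: str

def stop_slave_is_safe(rows: Iterable[ProcessRow]) -> bool:
    """Return False while a replication thread waits for the COMMIT lock.

    In that state some session (typically the backup) holds FTWRL, and
    STOP SLAVE would add the third edge of the Case 2 cycle.
    """
    return not any(
        row.user == "system user" and row.state == "Waiting for commit lock"
        for row in rows
    )

rows = [
    ProcessRow(1, "system user", "Waiting for master to send event"),
    ProcessRow(2, "system user", "Waiting for commit lock"),
]
print(stop_slave_is_safe(rows))  # False
```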

Deadlock resolution

If the deadlock occurs, the practical remedy is to kill the blocked SQL thread. This is safe when the SQL thread is blocked at the Xid event. At that point the transaction has not yet committed, so it is rolled back and applied again from the relay log when replication restarts. However, MySQL 5.6 versions prior to 5.6.21 contain a bug: killing a thread that is waiting for the COMMIT lock can cause the transaction to be skipped, so the replica diverges from the primary. Upgrading to 5.6.21 or later eliminates this risk.
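The remedy can be wrapped in a hypothetical helper. It finds the blocked SQL thread in processlist rows and emits the KILL statement. It refuses on versions before the 5.6.21 cutoff described here. The row shape, the VERSION() parsing and all function names are assumptions for illustration.

```python
import re
from typing import Iterable, NamedTuple, Optional, Tuple

class ProcessRow(NamedTuple):
    """Hypothetical subset of SHOW PROCESSLIST columns."""
    id: int
    user: str
    state: str

# Per this article: before 5.6.21, killing a thread that waits for the
# COMMIT lock can skip the transaction and make the replica diverge.
FIRST_SAFE_VERSION = (5, 6, 21)

def parse_version(version: str) -> Tuple[int, int, int]:
    """Turn a VERSION() string such as '5.6.20-log' into (5, 6, 20)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)

def kill_statement(rows: Iterable[ProcessRow], version: str) -> Optional[str]:
    """Return 'KILL <id>' for the SQL thread stuck on the COMMIT lock, or None."""
    if parse_version(version) < FIRST_SAFE_VERSION:
        raise RuntimeError(
            f"MySQL {version} may skip the transaction when killed; upgrade first"
        )
    for row in rows:
        if row.user == "system user" and row.state == "Waiting for commit lock":
            return f"KILL {row.id}"
    return None

rows = [
    ProcessRow(1, "system user", "Waiting for master to send event"),
    ProcessRow(2, "system user", "Waiting for commit lock"),
    ProcessRow(9, "backup", "init"),
]
print(kill_statement(rows, "5.6.27-log"))  # KILL 2
```

Refusing outright on older releases is a deliberately conservative choice. The helper only builds the statement and never connects to MySQL itself.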

Reproducing the deadlock

To reproduce the deadlock for testing:

Simply running FLUSH TABLES WITH READ LOCK on the replica is unreliable. It usually blocks on the global read lock rather than the COMMIT lock, so the narrow timing window in which the deadlock forms is rarely hit.

Use MySQL's debug‑sync feature to pause threads at specific points; this requires a debug build of mysqld.

Modify the source to artificially delay do_commit (e.g., insert a sleep) and recompile.

If source changes are not feasible, SystemTap can be used to inject a delay at runtime.
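Before reaching for a debug build or SystemTap, the lock cycle itself can be illustrated with a toy Python model. Two threading.Lock objects stand in for the COMMIT MDL lock and rli->data_lock. Two events force the same interleaving that an injected delay produces. This models only the lock ordering; it does not reproduce the deadlock against mysqld. The short timeouts stand in for what would be an indefinite wait in the server, and all names are illustrative.

```python
import threading

# Toy stand-ins for the two locks in Case 1 (illustrative names, not MySQL's).
commit_lock = threading.Lock()  # plays the COMMIT MDL lock taken by FTWRL
data_lock = threading.Lock()    # plays rli->data_lock
sql_has_data_lock = threading.Event()
backup_has_commit_lock = threading.Event()
both_tried = threading.Barrier(2)  # keep both locks held until both attempts end
WAIT = 0.5  # seconds; the real server would wait indefinitely instead
results = {}

def sql_thread():
    with data_lock:                    # held while applying the Xid event
        sql_has_data_lock.set()
        backup_has_commit_lock.wait()  # forced interleaving, like the injected delay
        got = commit_lock.acquire(timeout=WAIT)  # Xid commit needs the COMMIT lock
        results["sql_got_commit_lock"] = got
        both_tried.wait()
        if got:
            commit_lock.release()

def backup_thread():
    with commit_lock:                  # FTWRL has taken the COMMIT lock
        backup_has_commit_lock.set()
        sql_has_data_lock.wait()
        got = data_lock.acquire(timeout=WAIT)  # SHOW SLAVE STATUS needs rli->data_lock
        results["backup_got_data_lock"] = got
        both_tried.wait()
        if got:
            data_lock.release()

def run():
    threads = [threading.Thread(target=sql_thread), threading.Thread(target=backup_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results["sql_got_commit_lock"], results["backup_got_data_lock"]

print(run())  # (False, False): each thread timed out waiting for the other
```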

Using SystemTap to trigger the deadlock

SystemTap can probe user‑space functions when mysqld is compiled with debug symbols (-g). To list every function probe point in the binary:

sudo stap -L 'process("/usr/sbin/mysqld").function("*")'

List probes for Xid_log_event functions:

sudo stap -L 'process("/usr/sbin/mysqld").function("*Xid_log_event::*")'

Inject a 3‑second delay in Xid_log_event::do_commit. Replace 16011 with the PID of the replica's mysqld process, and adjust the binary path to match your installation. The -g flag enables guru mode, which mdelay() requires:

sudo stap -v -g -d /usr/sbin/mysqld --ldd \
  -e 'probe process(16011).function("Xid_log_event::do_commit") { printf("got it\n"); mdelay(3000) }'

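When the probe fires, the SQL thread stalls for 3 seconds just before committing, while still holding rli->data_lock. If the backup session runs FTWRL and then SHOW SLAVE STATUS within that window, the backup thread acquires the COMMIT lock before the SQL thread does. This reproduces the Case 1 deadlock.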
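Hand-quoting the SystemTap script is error-prone, because the \n inside printf() is easy to lose in copy and paste. A hypothetical Python helper can build the argument vector instead. The helper name and the default binary path are assumptions.

```python
import shlex
from typing import List

def stap_delay_command(pid: int, delay_ms: int = 3000,
                       binary: str = "/usr/sbin/mysqld") -> List[str]:
    """Build the argv for the SystemTap delay probe (hypothetical helper).

    Passing a list to subprocess avoids shell quoting entirely, so the
    literal \\n inside printf() reaches stap intact.
    """
    if pid <= 0 or delay_ms <= 0:
        raise ValueError("pid and delay_ms must be positive")
    script = (
        f'probe process({pid}).function("Xid_log_event::do_commit") '
        f'{{ printf("got it\\n"); mdelay({delay_ms}) }}'
    )
    # -g enables guru mode (needed for mdelay); -d adds symbol data for
    # mysqld and --ldd adds it for the shared libraries mysqld links against.
    return ["sudo", "stap", "-v", "-g", "-d", binary, "--ldd", "-e", script]

# Print a copy-pasteable shell version of the command.
print(shlex.join(stap_delay_command(16011)))
```

Passing the list directly to subprocess.run(..., check=True) avoids the shell entirely. Here shlex.join is used only to print a version that can be pasted into a terminal.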

References

Original article: http://www.kancloud.cn/taobaomysql/monthly/117960

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
