Why Does Percona XtraDB Cluster Hang? Uncovering the wsrep_group_commit Deadlock
This article investigates a recurring deadlock crash in Percona XtraDB Cluster whose root cause, the wsrep_group_commit_queue component, was removed in version 8.0.41. It analyzes error logs and core dumps with GDB, reveals a circular wait among internal threads, and provides reproduction steps and debugging tips for DBAs.
Percona XtraDB Cluster (PXC) version 8.0.41 removed the wsrep_group_commit_queue component because several deadlock bugs were traced back to it.
Symptom Description
A customer runs two PXC clusters in separate data centers, synchronizing via MySQL native replication. During heavy load, multiple nodes in the PXC cluster hang and crash after 600 seconds, producing a fatal semaphore wait error.
Error Log Analysis
The MySQL error.log contains entries such as:
2025-07-07T05:10:25.772284Z 0 [ERROR] [MY-012872] [InnoDB] [FATAL] Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
2025-07-07T05:10:25Z UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
This is a typical crash triggered by the InnoDB monitor thread when a semaphore wait exceeds 600 seconds.
Corefile Analysis
Using GDB on the core file, the thread with pthread ID 140395374548736 (labeled thd1 below) is found waiting on the condition variable COND_wsrep_group_commit. The backtrace shows the thread stuck in wsrep_wait_for_turn_in_group_commit at wsrep_binlog.cc line 468:
#0 0x00007fb0b324c48c in pthread_cond_wait@@GLIBC_2.3.2 ()
#1 0x00000000012578f0 in native_cond_wait (mutex=<optimized out>, cond=<optimized out>) at thr_cond.h:161
#2 my_cond_wait (mp=<optimized out>, cond=<optimized out>) at thr_cond.h:161
#3 inline_mysql_cond_wait (src_file=".../sql/wsrep_binlog.cc", src_line=468) at mysql_cond.h:198
#4 wsrep_wait_for_turn_in_group_commit (thd=0x7fb08800d430) at wsrep_binlog.cc:468
The condition variable COND_wsrep_group_commit is used in a loop that checks whether the current thread is at the front of wsrep_group_commit_queue:
while (true) {
  if (thd == wsrep_group_commit_queue.front()) {
    break;
  } else {
    mysql_cond_wait(&COND_wsrep_group_commit, &LOCK_wsrep_group_commit);
  }
}
wsrep_group_commit_queue is defined as a std::queue<THD *>, a FIFO queue of thread descriptors.
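The turn-taking mechanism above can be modeled in a few dozen lines of standalone C++. The names below are simplified stand-ins for the server's LOCK_wsrep_group_commit, COND_wsrep_group_commit, and wsrep_group_commit_queue (with THD * reduced to an int id):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::mutex lock_group_commit;               // stands in for LOCK_wsrep_group_commit
std::condition_variable cond_group_commit;  // stands in for COND_wsrep_group_commit
std::queue<int> group_commit_queue;         // stands in for wsrep_group_commit_queue

void register_in_group_commit(int thd) {
  std::lock_guard<std::mutex> g(lock_group_commit);
  group_commit_queue.push(thd);
}

// Same shape as the loop in wsrep_binlog.cc: sleep until this thread
// reaches the front of the FIFO queue.
void wait_for_turn_in_group_commit(int thd) {
  std::unique_lock<std::mutex> g(lock_group_commit);
  cond_group_commit.wait(g, [thd] { return group_commit_queue.front() == thd; });
}

// Pop the finished thread and wake all waiters so the new front can run.
// If this step is ever skipped for a queued thread, every thread behind
// it waits forever -- the deadlock analyzed in this article.
void unregister_from_group_commit() {
  std::lock_guard<std::mutex> g(lock_group_commit);
  group_commit_queue.pop();
  cond_group_commit.notify_all();
}

// Run n concurrent "commits" and return the order in which they finished.
std::vector<int> run_group_commit(int n) {
  std::vector<int> order;
  for (int i = 0; i < n; ++i) register_in_group_commit(i);  // FIFO order fixed here
  std::vector<std::thread> workers;
  for (int i = 0; i < n; ++i)
    workers.emplace_back([i, &order] {
      wait_for_turn_in_group_commit(i);
      order.push_back(i);  // the "commit"; only the front thread gets here
      unregister_from_group_commit();
    });
  for (auto &t : workers) t.join();
  return order;
}
```

However the OS schedules the worker threads, the commit order always matches the registration order, which is exactly why a single stuck entry stalls every thread queued behind it.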
Deadlock Scenario
Three threads are identified:
thd1 (pthread ID 0x7fb0586d9700), waiting on COND_wsrep_group_commit for thd2.
thd2 (pthread ID 0x7fb0585d7700), waiting on m_stage_cond_binlog for thd3.
thd3 (the leader), processing the commit queue but never removing thd2.
This creates a circular wait: thd1 → thd2 → thd3 → thd1, resulting in a deadlock that eventually triggers the semaphore timeout.
Interaction with finish_transaction_in_engines
The function finish_transaction_in_engines iterates over the commit queue using the next_to_commit pointer and calls wsrep_unregister_from_group_commit to pop the thread from the queue and broadcast COND_wsrep_group_commit. However, in the observed deadlock, thd2 never reaches the unregister step, leaving it stuck in the queue.
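A stripped-down sketch of that leader loop follows; Thd and the unregister stub are hypothetical simplifications of THD and wsrep_unregister_from_group_commit, not the server's actual code:

```cpp
// Hypothetical, minimal stand-in for THD and its commit-group linkage.
struct Thd {
  int id = 0;
  Thd *next_to_commit = nullptr;  // commit group, leader first
  bool unregistered = false;      // popped from the group-commit queue?
};

// Stand-in for wsrep_unregister_from_group_commit(): in the server this
// pops the THD from wsrep_group_commit_queue and broadcasts
// COND_wsrep_group_commit.
void unregister_from_group_commit(Thd *thd) { thd->unregistered = true; }

// Sketch of the leader's walk in finish_transaction_in_engines: every
// member of the group must be unregistered. In the observed deadlock,
// the follower carrying the empty binlog event (thd2) never reached
// this step, so it stayed queued and blocked all threads behind it.
void leader_commit_group(Thd *leader) {
  for (Thd *thd = leader; thd != nullptr; thd = thd->next_to_commit) {
    // ... engine commit of thd's transaction would happen here ...
    unregister_from_group_commit(thd);
  }
}
```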
Reproduction Steps
1. Deploy a PXC cluster and a MySQL replication cluster (e.g., using dbdeployer).
2. Configure the PXC node as a replica of the MySQL master.
3. Create a test table and run concurrent UPDATE statements on both clusters.
4. Run sysbench against the PXC node while the master continuously updates the same row.
5. Observe that TPS drops to zero and SHOW ENGINE INNODB STATUS shows long semaphore waits.
The key conditions for reproducing the bug are:
The SQL thread updates a row but the new value is identical to the existing one, producing an empty binlog event.
There is at least one other thread behind the SQL thread in the group commit queue.
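The two conditions above can be provoked with a workload along these lines (the table name and values are hypothetical):

```sql
-- On the source (MySQL master), with the PXC node replicating from it:
CREATE TABLE t1 (id INT PRIMARY KEY, c INT);
INSERT INTO t1 VALUES (1, 0);

-- Repeatedly issue a no-op update: the new value equals the stored one,
-- so the replicated transaction arrives at the PXC replica's SQL thread
-- as an effectively empty binlog event.
UPDATE t1 SET c = 0 WHERE id = 1;
```

Meanwhile, sysbench traffic on the PXC node keeps other threads queued behind the SQL thread in the group commit queue.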
Findings and Recommendations
The deadlock is caused by the leader thread not removing the follower (thd2) from wsrep_group_commit_queue.
Empty binlog events (generated when an UPDATE does not change data) are associated with the stuck thread.
Enabling wsrep_debug and adding GDB breakpoints helps trace the registration, waiting, and unregistering of threads.
Understanding the application workload (e.g., frequent no‑op updates) is essential for reproducing and fixing the issue.
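The GDB workflow mentioned above can be sketched as follows; the binary path, core file name, and thread/frame numbers are illustrative:

```
$ gdb /usr/sbin/mysqld core.mysqld
(gdb) info threads          # list all threads with their pthread IDs
(gdb) thread apply all bt   # dump every backtrace; look for wsrep_wait_for_turn_in_group_commit
(gdb) thread 3              # switch to the stuck thread
(gdb) bt full               # full backtrace with local variables
(gdb) frame 4               # select the wsrep_wait_for_turn_in_group_commit frame
(gdb) print thd             # inspect the waiting THD pointer
```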
Related Records
PXC‑4390
PXC‑4318
References
[1] Percona XtraDB Cluster – https://www.percona.com/resources/datasheets/percona-xtradb-cluster
[2] std::queue – https://en.cppreference.com/w/cpp/container/queue
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise-grade open-source MySQL tools and services, and releases a premium open-source component every year on "1024" (Programmers' Day), which it continuously operates and maintains.
