How to Diagnose and Fix a Galera Cluster Node Failure in Percona XtraDB
This article walks through a real‑world Galera cluster outage, explains the replication and flow‑control mechanisms, details the step‑by‑step analysis of thread‑running spikes and wsrep delays, identifies a network‑induced port latency as the root cause, and describes the recovery actions taken to restore the Percona XtraDB Cluster.
Background
The author, a DBA at China Mobile, manages a production Percona XtraDB Cluster (PXC) – a MySQL‑compatible Galera‑based cluster that provides synchronous replication using write‑set broadcasting and certification.
Galera Replication Overview
Key points of Galera replication:
Transactions execute locally under an optimistic strategy; at commit time the write‑set is broadcast to all nodes, and only after a successful broadcast is it certified (checked for conflicts).
If a conflict is detected, the local transaction is rolled back.
Each node processes the write‑set queue independently and asynchronously.
Once a transaction commits on the originating node, the other nodes are guaranteed to apply it, but possibly after a short delay. This is why Galera replication is described as virtually synchronous.
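The certification step in the list above can be illustrated with a toy model. This is a minimal sketch, not Galera's actual implementation: the `WriteSet` and `Certifier` classes, the key format, and the rule that every delivered write-set takes the next sequence number are assumptions made for illustration. It shows only the first-committer-wins idea: a write-set fails if one of its keys was modified by a write-set that the originating node had not yet seen.

```python
from dataclasses import dataclass, field


@dataclass
class WriteSet:
    """Hypothetical, simplified write-set: the keys it modifies and the last
    global seqno the originating node had applied when it executed."""
    keys: frozenset
    last_seen_seqno: int


@dataclass
class Certifier:
    """Toy first-committer-wins certification (illustrative only)."""
    next_seqno: int = 1
    last_writer: dict = field(default_factory=dict)  # key -> seqno of last certified write

    def certify(self, ws: WriteSet) -> bool:
        # In this toy model every delivered write-set takes the next seqno.
        seqno = self.next_seqno
        self.next_seqno += 1
        # Conflict: a key was changed by a write-set the originator had not seen.
        for key in ws.keys:
            if self.last_writer.get(key, 0) > ws.last_seen_seqno:
                return False  # the originating node rolls the transaction back
        for key in ws.keys:
            self.last_writer[key] = seqno
        return True


certifier = Certifier()
# Two nodes update the same row concurrently; both had seen seqno 0.
first = certifier.certify(WriteSet(frozenset({"accounts:42"}), last_seen_seqno=0))
second = certifier.certify(WriteSet(frozenset({"accounts:42"}), last_seen_seqno=0))
# A later transaction that has already seen seqno 1 touches the same row.
third = certifier.certify(WriteSet(frozenset({"accounts:42"}), last_seen_seqno=1))
print(first, second, third)
```

Because every node sees the write-sets in the same order and applies the same rule, each node reaches the same pass or fail decision without further coordination.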
Galera implements flow control to prevent the execution queue on slow nodes from growing indefinitely. When a node’s queue exceeds a threshold, it broadcasts an FC_PAUSE message, causing all nodes to pause broadcasting until the queue shrinks.
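The flow-control behavior can be sketched with a small simulation. This is an illustrative model under stated assumptions, not Galera's code: the `SlowNode` class, the limit of 4, the rule of resuming at half the limit, and the `RESUME` label are all hypothetical. Only the `FC_PAUSE` message name comes from the description above.

```python
from collections import deque

FC_LIMIT = 4  # hypothetical queue threshold for this sketch


class SlowNode:
    """Toy model of a node whose apply queue triggers flow control."""

    def __init__(self, limit=FC_LIMIT):
        self.limit = limit
        self.recv_queue = deque()
        self.paused = False
        self.events = []  # flow-control messages this node would broadcast

    def receive(self, write_set):
        self.recv_queue.append(write_set)
        # Queue grew past the threshold: ask the cluster to stop broadcasting.
        if not self.paused and len(self.recv_queue) > self.limit:
            self.paused = True
            self.events.append("FC_PAUSE")

    def apply_one(self):
        if self.recv_queue:
            self.recv_queue.popleft()
        # Assumed resume rule: continue once the queue drains to half the limit.
        if self.paused and len(self.recv_queue) <= self.limit // 2:
            self.paused = False
            self.events.append("RESUME")


node = SlowNode()
for seqno in range(6):       # a burst of six write-sets arrives
    node.receive(seqno)

applied = 0
while node.paused:           # writers stay blocked until the node catches up
    node.apply_one()
    applied += 1

print(node.events, applied)
```

In this run the fifth write-set pushes the queue over the limit, and four write-sets must be applied before replication resumes.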
Problem Description
On January 29 at 10:08 am, Zabbix alerted that two nodes of the production cluster reported active thread counts exceeding thresholds and continuously rising. Business queries failed, while CPU, I/O, and memory on the servers appeared idle.
Analysis
Investigation steps:
Checked SHOW GLOBAL STATUS LIKE 'Threads_running' on both nodes – values were 100 and 110, respectively.
Examined SHOW PROCESSLIST and found many threads stuck in the state "wsrep in pre-commit stage", indicating that their write‑sets had been broadcast but were still waiting for certification and queue execution on the other nodes.
Queried SHOW GLOBAL STATUS LIKE '%wsrep%' – no queue blockage or flow‑control flags were observed.
Observed wsrep_evs_delayed reporting delayed connections to node 3 on port 4567. Error logs showed reconnection attempts between nodes 1, 2 and node 3.
Port 4567 is Galera's default group communication port, set through gmcast.listen_addr, and carries intra‑cluster traffic (handshake, authentication, write‑set broadcast).
Root cause: network packet loss between the core switch and access switch caused severe latency on port 4567, breaking timely replication from nodes 1 & 2 to node 3. Consequently, threads on nodes 1 & 2 could not finish.
+----------------------------+------------+
| Variable_name              | Value      |
+----------------------------+------------+
| wsrep_local_recv_queue     | 0          |
| wsrep_local_recv_queue_avg | 0.008711   |
| wsrep_flow_control_paused  | 0.000000   |
| wsrep_flow_control_sent    | 0          |
| wsrep_flow_control_recv    | 0          |
| wsrep_evs_delayed          | node3:4567 |
+----------------------------+------------+
Resolution
During the network outage, node 3 was expelled from the cluster due to severe delay, and the remaining nodes began to experience split‑brain. The team shut down all three nodes, then ran mysqld_safe --wsrep-recover on each to find its last committed transaction ID. Based on those positions, one node was selected as the primary and started in single‑node mode.
After the network issue was fixed, nodes 2 and 3 were started sequentially, and the cluster returned to normal operation.
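The node selection described above can be sketched as a small helper that compares recovered positions. It assumes the recovery run logs a line of the form `Recovered position: <uuid>:<seqno>`, and that the node with the highest seqno is the one to start first. The function names, node names, UUID, and log excerpts are hypothetical illustrations, not output from the incident.

```python
import re

# Assumed log format: "... WSREP: Recovered position: <uuid>:<seqno>"
RECOVERED = re.compile(r"Recovered position:\s*([0-9a-fA-F-]+):(-?\d+)")


def recovered_position(log_text):
    """Return (uuid, seqno) from the last recovery line, or None if absent."""
    matches = RECOVERED.findall(log_text)
    if not matches:
        return None
    uuid, seqno = matches[-1]
    return uuid, int(seqno)


def pick_bootstrap_node(logs_by_node):
    """Pick the node with the highest recovered seqno (hypothetical helper)."""
    positions = {}
    for node, text in logs_by_node.items():
        pos = recovered_position(text)
        if pos is not None:
            positions[node] = pos
    if not positions:
        raise ValueError("no recovered position found in any log")
    uuids = {uuid for uuid, _ in positions.values()}
    if len(uuids) != 1:
        # Different cluster UUIDs mean the seqnos are not comparable.
        raise ValueError(f"nodes disagree on cluster UUID: {sorted(uuids)}")
    return max(positions, key=lambda node: positions[node][1])


# Made-up log excerpts for illustration only.
CLUSTER = "11111111-2222-3333-4444-555555555555"
logs = {
    "node1": f"[Note] WSREP: Recovered position: {CLUSTER}:1048",
    "node2": f"[Note] WSREP: Recovered position: {CLUSTER}:1052",
    "node3": f"[Note] WSREP: Recovered position: {CLUSTER}:997",
}
print(pick_bootstrap_node(logs))
```

Starting from the most advanced node means the remaining nodes only need to catch up when they rejoin, rather than overwrite newer data.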
Key Takeaways
When a Galera cluster loses connectivity on the replication port (default 4567), promptly isolate or shut down excess nodes to avoid split‑brain.
Monitoring wsrep_evs_delayed alongside network metrics can provide early warning of replication latency.
In multi‑master clusters, disabling automatic split‑brain detection with SET GLOBAL wsrep_provider_options='pc.ignore_sb=true' is possible but not recommended for production.
dbaplus Community
