
How to Diagnose and Fix a Galera Cluster Node Failure in Percona XtraDB

This article walks through a real‑world Galera cluster outage, explains the replication and flow‑control mechanisms, details the step‑by‑step analysis of thread‑running spikes and wsrep delays, identifies a network‑induced port latency as the root cause, and describes the recovery actions taken to restore the Percona XtraDB Cluster.

dbaplus Community

Background

The author, a DBA at China Mobile, manages a production Percona XtraDB Cluster (PXC) – a MySQL‑compatible Galera‑based cluster that provides synchronous replication using write‑set broadcasting and certification.

Galera Replication Overview

Key points of Galera replication:

Transactions execute locally under an optimistic strategy; at commit time the write‑set is broadcast to all nodes, and conflict detection (certification) runs only after the broadcast succeeds.

If a conflict is detected, the local transaction is rolled back.

Each node processes the write‑set queue independently and asynchronously.

After a transaction commits on the originating node, the other nodes are guaranteed to apply it, but they may do so with a short delay. This is why Galera replication is described as virtually synchronous.
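The certification step described above can be illustrated with a toy model. This is only a sketch: the `WriteSet` and `Certifier` names are hypothetical, and real Galera certification works on key hashes inside the provider library, not on Python sets. The idea it shows is that every node sees write‑sets in the same order, so every node reaches the same pass/fail decision without extra coordination.

```python
from dataclasses import dataclass, field


@dataclass
class WriteSet:
    """Toy write-set: the row keys a transaction modified, plus the last
    global sequence number it had seen when it executed locally."""
    keys: frozenset
    last_seen_seqno: int


@dataclass
class Certifier:
    """Toy certification index: remembers which seqno last modified each key."""
    seqno: int = 0
    last_writer: dict = field(default_factory=dict)

    def certify(self, ws: WriteSet) -> bool:
        # Every delivered write-set gets the next position in the total order.
        self.seqno += 1
        # Conflict: a key was changed by a write-set ordered after the
        # snapshot this transaction executed against.
        for key in ws.keys:
            if self.last_writer.get(key, 0) > ws.last_seen_seqno:
                return False  # the originating node rolls the transaction back
        # No conflict: record this write-set as the latest writer of its keys.
        for key in ws.keys:
            self.last_writer[key] = self.seqno
        return True


certifier = Certifier()
first = WriteSet(frozenset({"orders:1"}), last_seen_seqno=0)
second = WriteSet(frozenset({"orders:1"}), last_seen_seqno=0)  # same row, same snapshot
print(certifier.certify(first))   # True: first writer wins
print(certifier.certify(second))  # False: conflicts with the first write-set
```

Two transactions that touched the same row from the same snapshot cannot both pass; the one ordered second fails certification and is rolled back on its originating node.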

Galera implements flow control to prevent the receive queue on slow nodes from growing indefinitely. When a node’s queue exceeds a threshold (set by the gcs.fc_limit provider option), it broadcasts an FC_PAUSE message, and all nodes pause replication until the queue shrinks.
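A rough simulation can make the pause/resume behaviour concrete. The function below is a toy model, not Galera's implementation: `simulate_flow_control` is a hypothetical name, the parameters `fc_limit` and `fc_factor` are borrowed from the `gcs.fc_limit` and `gcs.fc_factor` provider options, and the per-tick arrival and apply rates are invented for illustration.

```python
def simulate_flow_control(arrivals, apply_rate, fc_limit, fc_factor=1.0):
    """Return one (queue_length, paused) tuple per tick for a slow node.

    arrivals:   write-sets the cluster tries to replicate in each tick
    apply_rate: write-sets the slow node can apply per tick
    """
    queue = 0
    paused = False
    history = []
    for incoming in arrivals:
        if not paused:
            queue += incoming  # replication proceeds only while not paused
        queue = max(0, queue - apply_rate)  # the slow node keeps applying
        if not paused and queue > fc_limit:
            paused = True  # the node would broadcast FC_PAUSE here
        elif paused and queue <= fc_limit * fc_factor:
            paused = False  # queue has drained enough; replication resumes
        history.append((queue, paused))
    return history


for tick, (queue, paused) in enumerate(
    simulate_flow_control([5, 5, 5, 5, 5, 5], apply_rate=2, fc_limit=8, fc_factor=0.5),
    start=1,
):
    print(tick, queue, "PAUSED" if paused else "running")
```

While the cluster is paused, the originating nodes cannot commit, which is why a single slow node can stall writes everywhere. In this incident the flow-control counters stayed at zero, which is what pointed the investigation away from a slow applier.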

Problem Description

On January 29 at 10:08 am, Zabbix alerted that the active thread count on two nodes of the production cluster had exceeded its threshold and was still rising. Business queries failed, while CPU, I/O, and memory on the servers appeared idle.

Analysis

Investigation steps:

Checked SHOW GLOBAL STATUS LIKE 'Threads_running' on both nodes – values were 100 and 110, respectively.

Examined SHOW PROCESSLIST and found many threads stuck in the "wsrep in pre‑commit stage" state, meaning their write‑sets had been broadcast but they were still waiting for certification and queue execution on the other nodes.

Queried SHOW GLOBAL STATUS LIKE '%wsrep%' – no queue blockage or flow‑control flags were observed.

Observed wsrep_evs_delayed reporting delayed connections to node 3 on port 4567. Error logs showed reconnection attempts between nodes 1, 2 and node 3.

Port 4567 is the default port of gmcast.listen_addr and carries intra‑cluster group communication (handshake, authentication, write‑set broadcast).
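When a delayed peer is suspected, a quick TCP connect probe against the replication port can help confirm whether the network path is slow or lossy. The sketch below uses only the Python standard library; the function names `tcp_connect_latency` and `probe` are hypothetical, and a successful connect only shows that the port is reachable, not that Galera itself is healthy.

```python
import socket
import time


def tcp_connect_latency(host, port=4567, timeout=2.0):
    """Return the seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:  # covers refused connections and timeouts
        return None


def probe(host, port=4567, attempts=5, timeout=2.0):
    """Probe several times; return (failed_attempts, worst_latency_seconds)."""
    samples = [tcp_connect_latency(host, port, timeout) for _ in range(attempts)]
    succeeded = [s for s in samples if s is not None]
    worst = max(succeeded) if succeeded else None
    return len(samples) - len(succeeded), worst
```

Calling `probe("node3")` from nodes 1 and 2 (with `node3` standing in for the real host name) and comparing the result with a probe between nodes 1 and 2 would show whether only the path to node 3 is affected.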

Root cause: packet loss between the core switch and the access switch caused severe latency on port 4567, so write‑sets from nodes 1 and 2 could not be replicated to node 3 in time. Consequently, the committing threads on nodes 1 and 2 could not finish.

Relevant wsrep status values from the checks above:

+------------------------------+-------------+
| Variable_name                | Value       |
+------------------------------+-------------+
| wsrep_local_recv_queue       | 0           |
| wsrep_local_recv_queue_avg   | 0.008711    |
| wsrep_flow_control_paused    | 0.000000    |
| wsrep_flow_control_sent      | 0           |
| wsrep_flow_control_recv      | 0           |
| wsrep_evs_delayed            | node3:4567  |
+------------------------------+-------------+
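The reasoning applied to these status values can be sketched as a small script: parse the table printed by the mysql client, then check the flow-control, receive-queue, and delayed-peer indicators in turn. The helper names `parse_status_table` and `diagnose` are hypothetical, and the checks are deliberately simple rather than a complete health check.

```python
SAMPLE = """
+------------------------------+-------------+
| Variable_name                | Value       |
+------------------------------+-------------+
| wsrep_local_recv_queue       | 0           |
| wsrep_flow_control_paused    | 0.000000    |
| wsrep_evs_delayed            | node3:4567  |
+------------------------------+-------------+
"""


def parse_status_table(text):
    """Parse mysql client table output into a {Variable_name: Value} dict."""
    status = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +---+ borders and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) == 2 and cells[0] != "Variable_name":
            status[cells[0]] = cells[1]
    return status


def diagnose(status):
    """Return human-readable findings for the three indicators used above."""
    findings = []
    if float(status.get("wsrep_flow_control_paused", 0)) > 0:
        findings.append("flow control has paused replication")
    if int(status.get("wsrep_local_recv_queue", 0)) > 0:
        findings.append("receive queue is backed up")
    if status.get("wsrep_evs_delayed", ""):
        findings.append("delayed peer(s): " + status["wsrep_evs_delayed"])
    return findings


print(diagnose(parse_status_table(SAMPLE)))
```

For the values in this incident the only finding is the delayed peer, which matches the conclusion that the nodes themselves were not overloaded and the problem lay in the network path to node 3.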

Resolution

During the network outage, node 3 was evicted from the cluster because of the severe delay, and the remaining nodes began to show split‑brain symptoms. The team shut down all three nodes, ran mysqld_safe --wsrep-recover on each to find its last committed transaction position (reported as UUID:seqno), and then chose one node as the primary and started it alone as a single‑node cluster.
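Comparing the recovered positions by hand is error-prone when several nodes are involved, so the comparison can be scripted. The sketch below assumes the recovery log contains a line of the form `WSREP: Recovered position: <uuid>:<seqno>` and assumes the usual practice of bootstrapping from the node with the highest seqno; the function names and the sample values in the usage note are hypothetical.

```python
import re

# Matches e.g. "WSREP: Recovered position: <36-char uuid>:<seqno>"
RECOVERED = re.compile(r"Recovered position:\s*([0-9a-fA-F-]{36}):(-?\d+)")


def recovered_position(log_text):
    """Return (cluster_uuid, seqno) from the last matching line, or None."""
    matches = RECOVERED.findall(log_text)
    if not matches:
        return None
    uuid, seqno = matches[-1]
    return uuid, int(seqno)


def pick_bootstrap_node(logs_by_node):
    """Given {node_name: log_text}, return the node with the highest seqno."""
    positions = {}
    for node, text in logs_by_node.items():
        position = recovered_position(text)
        if position is not None:
            positions[node] = position
    if not positions:
        raise ValueError("no recovered position found in any log")
    # Seqnos are only comparable within the same cluster state UUID.
    if len({uuid for uuid, _ in positions.values()}) != 1:
        raise ValueError("nodes report different cluster UUIDs")
    return max(positions, key=lambda node: positions[node][1])
```

For example, if the logs of three nodes report seqnos 1052, 1049, and 987 under the same UUID, `pick_bootstrap_node` returns the node that reported 1052; the other nodes then rejoin and catch up through state transfer.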

After the network issue was fixed, nodes 2 and 3 were started sequentially, and the cluster returned to normal operation.

Key Takeaways

When a Galera cluster loses connectivity on the replication port (default 4567), promptly isolate or shut down excess nodes to avoid split‑brain.

Monitoring wsrep_evs_delayed alongside network metrics can provide early warning of replication latency.

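In multi‑master clusters, split‑brain handling can be overridden with SET GLOBAL wsrep_provider_options='pc.ignore_sb=true', which lets nodes keep operating through a split‑brain. This is possible but not recommended for production, because partitioned nodes can then accept conflicting writes.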
