Analysis of MySQL Group Replication Data Inconsistency Caused by GTID Mismatch and Paxos Proposal Conflict
This article examines a real-world MySQL Group Replication failure in which network jitter prevented an INSERT on the primary from replicating, which led to GTID divergence and a secondary node leaving the cluster. It then explains the underlying Paxos-based proposal conflict, with code excerpts.
Introduction: This article, part of the MySQL column series by the Aikesheng operations team, presents a real‑world case where a MySQL Group Replication (MGR) cluster experienced data inconsistency and a secondary node left the group due to network jitter.
Problem phenomenon: In a production single‑primary MGR cluster, the primary executed an INSERT on table world.IC_WB_RELEASE with GTID 86afb16f‑1b8c‑11e8‑812f‑0050568912a4:57305280, but the secondary’s binlog shows a DELETE on the same GTID, indicating the INSERT was not replicated.
SET @@SESSION.GTID_NEXT='86afb16f-1b8c-11e8-812f-0050568912a4:57305280'; ...
Problem analysis: The missing INSERT caused GTID divergence. The subsequent DELETE could not find the row on the secondary, so the secondary aborted and left the group. The root cause is a Paxos-based proposal conflict: a secondary that had missed the primary's learn_op issued a no-op proposal with a higher ballot, which overrode the primary's transaction.
Related background: MGR’s Xcom component implements Paxos. Proposers send prepare requests, acceptors respond with ack_prepare/ack_accept, and learners confirm the value. The ballot consists of a numeric part and a node identifier; the node identifier determines priority when numbers are equal.
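The ballot ordering described above can be sketched in a few lines. This is a minimal Python model, not the XCom source: the `Ballot` class and its field names `cnt` and `node` are assumptions chosen for illustration, and only the function name `gt_ballot` is taken from the excerpt quoted in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ballot:
    cnt: int   # numeric part of the ballot (assumed field name)
    node: int  # node identifier, used only to break ties (assumed field name)


def gt_ballot(x: Ballot, y: Ballot) -> bool:
    """Return True if ballot x is strictly greater than ballot y."""
    # Compare the numeric part first; fall back to the node identifier
    # only when the numeric parts are equal.
    return x.cnt > y.cnt or (x.cnt == y.cnt and x.node > y.node)


# Hypothetical values: the primary's original proposal versus the
# no-op proposal with ballot 1.1 mentioned in the analysis.
primary_ballot = Ballot(cnt=0, node=0)
noop_ballot = Ballot(cnt=1, node=1)
```

With these values, `gt_ballot(noop_ballot, primary_ballot)` is true, which is the ordering that lets the later no-op proposal take precedence.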
Analysis process: (1) The primary sends a prepare for the INSERT, receives acks from a majority, and proceeds to accept. (2) A secondary misses the learn_op and starts a new prepare with a no-op value and a higher ballot (1.1). (3) The higher ballot wins, the no-op is committed, and the primary's INSERT is never applied on that node. (4) GTID counters continue to increase on both sides, so later transactions carry the same GTID but different payloads on different nodes, which is the observed inconsistency.
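Step (4) is easier to see with a toy model. The sketch below assumes, purely for illustration, that each node hands out GTID sequence numbers in order to every non-no-op value it learns. The helper `assign_gtids` and the statement labels are hypothetical and do not reflect how MGR certification is implemented.

```python
UUID = "86afb16f-1b8c-11e8-812f-0050568912a4"
NOOP = None  # stands in for the no-op value that won the Paxos round


def assign_gtids(learned_values, first_seq):
    """Map GTIDs to payloads, skipping no-ops (toy model, not MGR code)."""
    binlog = {}
    seq = first_seq
    for value in learned_values:
        if value is NOOP:
            continue  # a no-op consumes no GTID sequence number
        binlog[f"{UUID}:{seq}"] = value
        seq += 1
    return binlog


# The primary applied its own INSERT, then the DELETE.
primary_binlog = assign_gtids(["INSERT", "DELETE"], 57305280)
# The secondary learned a no-op in place of the INSERT, then the DELETE.
secondary_binlog = assign_gtids([NOOP, "DELETE"], 57305280)

# GTIDs present on both nodes, with the payload each node recorded.
shared = sorted(set(primary_binlog) & set(secondary_binlog))
for gtid in shared:
    print(gtid, primary_binlog[gtid], secondary_binlog[gtid])
```

In this model the GTID ending in 57305280 maps to the INSERT on the primary and to the DELETE on the secondary, matching the symptom described under "Problem phenomenon".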
handle_ack_prepare has the following code:
/* If the incoming message carries a higher ballot than the proposer's
   current message, the proposer's message is replaced with it. */
if (gt_ballot(m->proposal, p->proposer.msg->proposal)) {
  replace_pax_msg(&p->proposer.msg, m);
  ...
}
Conclusion: The issue is fixed in MySQL 5.7.26 and 8.0.16 (community editions) and through a hotfix for Enterprise. Until you upgrade, manually verify that a kicked-out node's binlog GTID information matches the new primary before rejoining it to the group.
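Because the failure leaves identical GTIDs pointing at different payloads, the manual check has to compare what each GTID contains, not only which GTIDs exist. The following is a minimal sketch of such a comparison. It assumes you have already extracted a mapping from GTID to a statement summary on each node (for example from decoded binlog output); the function `find_payload_mismatches`, the placeholder `uuid` prefix, and the sample data are all hypothetical.

```python
def find_payload_mismatches(candidate, primary):
    """Return GTIDs on the candidate whose payload differs from the primary.

    Both arguments map a GTID string to a payload summary. A GTID that is
    missing on the primary is reported with an expected value of None.
    """
    mismatches = {}
    for gtid, payload in candidate.items():
        expected = primary.get(gtid)
        if payload != expected:
            mismatches[gtid] = (payload, expected)
    return mismatches


# Hypothetical data: the kicked-out node skipped one transaction, so its
# second GTID carries a different statement than on the new primary.
new_primary = {"uuid:1": "INSERT t1", "uuid:2": "DELETE t1", "uuid:3": "UPDATE t2"}
kicked_node = {"uuid:1": "INSERT t1", "uuid:2": "UPDATE t2"}

problems = find_payload_mismatches(kicked_node, new_primary)
```

An empty result suggests that every GTID on the candidate matches the primary in this model. Any entry is a reason not to rejoin the node until it has been rebuilt or repaired.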
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.