
Understanding Orchestrator's RegroupReplicasGTID and Candidate Replica Selection in MySQL Failover

This article explains how Orchestrator selects a candidate replica during MySQL master failover, detailing the GetCandidateReplica and RegroupReplicasGTID functions, their sorting logic, promotion rules, GTID-based regrouping, and differences from MHA, while highlighting potential data loss issues and related bugs.

Aikesheng Open Source Community

The article provides an in‑depth analysis of Orchestrator's failover process for MySQL, focusing on how a candidate replica is chosen and how replicas are regrouped using GTID.

GetCandidateReplica

GetCandidateReplica first reads the master instance (identified only by hostname and port) from the backend database_instance table, then retrieves all its replicas into a slice of type [](*Instance). It sorts these replicas by executed binlog coordinates using sortedReplicasDataCenterHint, which internally calls StopReplicas and sortInstancesDataCenterHint. The sorting prioritises the most up-to-date replica, and when coordinates are equal it prefers replicas in the same data center as the dead master.

func GetCandidateReplica(masterKey *InstanceKey, forRematchPurposes bool) (*Instance, [](*Instance), [](*Instance), [](*Instance), [](*Instance), error) {
    // ... read master, get replicas, sort, choose candidate ...
}

sortedReplicasDataCenterHint

This function stops replication on all replicas (using StopReplicationNice with a configurable timeout) and then sorts them by ExecBinlogCoordinates. The most up-to-date replica ends up at index 0.

func sortedReplicasDataCenterHint(replicas [](*Instance), stopReplicationMethod StopReplicationMethod, dataCenterHint string) [](*Instance) {
    // stop replication, remove nils, sort, return
}
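The sorting behaviour can be sketched as follows. This is a minimal, self-contained model, not Orchestrator's actual code: the Instance and BinlogCoordinates types are stripped down to illustrative fields, and replication is assumed to be already stopped.

```go
package main

import (
	"fmt"
	"sort"
)

// Simplified stand-ins for Orchestrator's types; field names are
// illustrative, not the real structs.
type BinlogCoordinates struct {
	LogFile string
	LogPos  int64
}

// SmallerThan reports whether c is behind other in the binlog stream.
func (c BinlogCoordinates) SmallerThan(other BinlogCoordinates) bool {
	if c.LogFile != other.LogFile {
		return c.LogFile < other.LogFile
	}
	return c.LogPos < other.LogPos
}

type Instance struct {
	Host                  string
	DataCenter            string
	ExecBinlogCoordinates BinlogCoordinates
}

// sortedReplicasDataCenterHint mirrors the described behaviour:
// the most up-to-date replica sorts to index 0; on equal coordinates,
// a replica in the hinted (dead master's) data center is preferred.
func sortedReplicasDataCenterHint(replicas []*Instance, dataCenterHint string) []*Instance {
	sort.SliceStable(replicas, func(i, j int) bool {
		a, b := replicas[i], replicas[j]
		if a.ExecBinlogCoordinates != b.ExecBinlogCoordinates {
			// More advanced coordinates sort first.
			return b.ExecBinlogCoordinates.SmallerThan(a.ExecBinlogCoordinates)
		}
		return a.DataCenter == dataCenterHint && b.DataCenter != dataCenterHint
	})
	return replicas
}

func main() {
	replicas := []*Instance{
		{Host: "r1", DataCenter: "dc2", ExecBinlogCoordinates: BinlogCoordinates{"bin.000003", 100}},
		{Host: "r2", DataCenter: "dc1", ExecBinlogCoordinates: BinlogCoordinates{"bin.000003", 200}},
		{Host: "r3", DataCenter: "dc1", ExecBinlogCoordinates: BinlogCoordinates{"bin.000003", 100}},
	}
	for _, r := range sortedReplicasDataCenterHint(replicas, "dc1") {
		fmt.Println(r.Host)
	}
}
```

Here r2 sorts first because it has executed the furthest; r3 beats r1 on the data-center tie-break even though their coordinates are equal.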

chooseCandidateReplica

After sorting, chooseCandidateReplica iterates over the replicas to find the first one that passes several checks (validity, not banned, matching the most common major version and binlog format). If none passes, it falls back to the first non‑banned replica.

func chooseCandidateReplica(replicas [](*Instance)) (candidateReplica *Instance, aheadReplicas, equalReplicas, laterReplicas, cannotReplicateReplicas [](*Instance), err error) {
    // ... select candidate, classify other replicas ...
}
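The selection-and-classification step can be modelled like this. It is a hedged sketch with simplified types: isGenerallyValidAsCandidate collapses the real eligibility checks (validity, not banned, common major version and binlog format) into a single flag, and coordinates are reduced to one position.

```go
package main

import "fmt"

// Illustrative stand-ins; the real Instance carries many more fields.
type Coordinates struct{ Pos int64 }

func (c Coordinates) SmallerThan(o Coordinates) bool { return c.Pos < o.Pos }

type Instance struct {
	Host   string
	Exec   Coordinates
	Banned bool
}

// isGenerallyValidAsCandidate stands in for the eligibility checks
// (valid, not banned, matching common major version and binlog format).
func isGenerallyValidAsCandidate(r *Instance) bool { return !r.Banned }

// chooseCandidate walks replicas already sorted most-advanced-first,
// picks the first eligible one, and buckets the rest relative to it.
func chooseCandidate(replicas []*Instance) (candidate *Instance, ahead, equal, later []*Instance) {
	for _, r := range replicas {
		if isGenerallyValidAsCandidate(r) {
			candidate = r
			break
		}
	}
	if candidate == nil {
		return
	}
	for _, r := range replicas {
		if r == candidate {
			continue
		}
		switch {
		case candidate.Exec.SmallerThan(r.Exec):
			ahead = append(ahead, r) // executed past the candidate
		case r.Exec.SmallerThan(candidate.Exec):
			later = append(later, r) // behind: must catch up via GTID
		default:
			equal = append(equal, r)
		}
	}
	return
}

func main() {
	candidate, ahead, equal, later := chooseCandidate([]*Instance{
		{Host: "r1", Exec: Coordinates{300}, Banned: true}, // most advanced but ineligible
		{Host: "r2", Exec: Coordinates{200}},
		{Host: "r3", Exec: Coordinates{200}},
		{Host: "r4", Exec: Coordinates{100}},
	})
	fmt.Println(candidate.Host, len(ahead), len(equal), len(later))
}
```

Note that when the most advanced replica is ineligible, the chosen candidate ends up with one replica ahead of it; that "ahead" bucket is exactly where data loss can originate later in the flow.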

RegroupReplicasGTID

This function calls GetCandidateReplica to obtain a candidate and then moves the remaining replicas (except those ahead or unable to replicate) under the candidate using GTID. It optionally postpones the GTID move if a user‑provided callback deems the candidate ideal.

func RegroupReplicasGTID(masterKey *InstanceKey, returnReplicaEvenOnFailureToRegroup bool, startReplicationOnCandidate bool, onCandidateReplicaChosen func(*Instance), postponedFunctionsContainer *PostponedFunctionsContainer, postponeAllMatchOperations func(*Instance, bool) bool) (lostReplicas [](*Instance), movedReplicas [](*Instance), cannotReplicateReplicas [](*Instance), candidateReplica *Instance, err error) {
    // ... get candidate, classify replicas, move via GTID ...
}
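The overall regroup flow reduces to the sketch below. This is a model under stated assumptions, not the real function: moveBelowGTID is a hypothetical stand-in for Orchestrator's GTID-based repointing, and the classification buckets are taken as given from the candidate-selection step.

```go
package main

import "fmt"

type Instance struct{ Host string }

// moveBelowGTID stands in for Orchestrator's GTID repointing of a
// replica under a new master; here it only records the move.
func moveBelowGTID(replica, newMaster *Instance) error {
	fmt.Printf("repoint %s under %s\n", replica.Host, newMaster.Host)
	return nil
}

// regroupReplicasGTID models the described flow: replicas ahead of
// the candidate, or unable to replicate, are declared lost; the rest
// are moved under the candidate via GTID.
func regroupReplicasGTID(candidate *Instance, ahead, equal, later, cannotReplicate []*Instance) (lost, moved []*Instance) {
	lost = append(lost, ahead...)
	lost = append(lost, cannotReplicate...)
	for _, r := range append(append([]*Instance{}, equal...), later...) {
		if err := moveBelowGTID(r, candidate); err != nil {
			lost = append(lost, r)
			continue
		}
		moved = append(moved, r)
	}
	return
}

func main() {
	candidate := &Instance{Host: "r2"}
	lost, moved := regroupReplicasGTID(candidate,
		[]*Instance{{Host: "r1"}}, // ahead of the candidate
		[]*Instance{{Host: "r3"}}, // equal coordinates
		[]*Instance{{Host: "r4"}}, // behind the candidate
		nil)
	fmt.Println(len(lost), len(moved))
}
```

The "lost" bucket makes the availability trade-off concrete: a replica that executed past the candidate is abandoned rather than blocking the promotion.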

Promotion Rules and Sorting Details

The InstancesSorterByExec.Less method defines the final ordering when ExecBinlogCoordinates are equal. It prefers replicas with logging updates enabled, smaller major version, smaller binlog format, same data center as the dead master, no errant GTID, and finally a better PromotionRule.

func (this *InstancesSorterByExec) Less(i, j int) bool {
    // ... series of comparisons ...
    return this.instances[i].ExecBinlogCoordinates.SmallerThan(&this.instances[j].ExecBinlogCoordinates)
}
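The chain of tie-breakers can be written out explicitly as below. This is a simplified sketch, not the real Less method: the field names, the promotion-rule ranks, and the binlog-format ordering are illustrative stand-ins encoding the preferences described above, with preferred-first semantics rather than the sorter's Less convention.

```go
package main

import "fmt"

// Simplified stand-in; the real Less method compares the actual
// Instance fields and Orchestrator's promotion-rule constants.
type Instance struct {
	ExecPos         int64
	LogSlaveUpdates bool
	MajorVersion    int
	BinlogFormat    string
	DataCenter      string
	HasErrantGTID   bool
	PromotionRank   int // illustrative: 0=prefer, 1=neutral, 2=prefer_not, 3=must_not
}

// formatRank orders binlog formats so that "smaller" formats win.
func formatRank(f string) int {
	switch f {
	case "STATEMENT":
		return 0
	case "MIXED":
		return 1
	default: // "ROW"
		return 2
	}
}

// preferred applies the tie-breakers in order once execution
// coordinates are equal: log_slave_updates enabled, smaller major
// version, smaller binlog format, same data center as the dead
// master, no errant GTID, better promotion rule.
func preferred(a, b *Instance, deadMasterDC string) bool {
	if a.ExecPos != b.ExecPos {
		return a.ExecPos > b.ExecPos // most up-to-date wins outright
	}
	if a.LogSlaveUpdates != b.LogSlaveUpdates {
		return a.LogSlaveUpdates
	}
	if a.MajorVersion != b.MajorVersion {
		return a.MajorVersion < b.MajorVersion
	}
	if formatRank(a.BinlogFormat) != formatRank(b.BinlogFormat) {
		return formatRank(a.BinlogFormat) < formatRank(b.BinlogFormat)
	}
	if (a.DataCenter == deadMasterDC) != (b.DataCenter == deadMasterDC) {
		return a.DataCenter == deadMasterDC
	}
	if a.HasErrantGTID != b.HasErrantGTID {
		return !a.HasErrantGTID
	}
	return a.PromotionRank < b.PromotionRank
}

func main() {
	a := &Instance{ExecPos: 100, LogSlaveUpdates: true, MajorVersion: 5, BinlogFormat: "ROW"}
	b := &Instance{ExecPos: 100, LogSlaveUpdates: false, MajorVersion: 5, BinlogFormat: "STATEMENT"}
	fmt.Println(preferred(a, b, "dc1"))
}
```

Because the comparisons run in a fixed order, an earlier criterion (log_slave_updates) decides even when a later one (binlog format) would favour the other replica.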

Comparison with MHA

Unlike MHA, which shuts down all replica I/O threads before promotion, Orchestrator performs a graceful stop (StopReplicationNice) with a timeout, then proceeds with GTID‑based regrouping. This approach favours availability over strict data consistency, which can lead to data loss if the most up‑to‑date replica lacks required configuration.

Known Issues

The article also points out a bug in DelayMasterPromotionIfSQLThreadNotUpToDate: after the master dies, the code waits for the candidate's SQL thread to catch up with its relay log but never starts the SQL thread, so the wait can only end in a timeout.

Overall, the piece explains the complete flow from candidate selection to replica regrouping, the sorting criteria, promotion rule handling, and the trade‑offs between availability and data integrity in Orchestrator's MySQL failover mechanism.

MySQL · Replication · Failover · GTID · Orchestrator
Written by

Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.
