DeadMaster Recovery Process in Orchestrator
This article explains Orchestrator's complete DeadMaster recovery workflow: how the system selects the appropriate check-and-recover function, handles emergency grace periods, reads topology information, registers recovery attempts, validates promotion constraints, executes the failover itself, and runs post-recovery hooks, illustrated with extensive Go code examples.
The article walks through the DeadMaster recovery flow used by Orchestrator, an HA tool for MySQL. Recovery starts with a call to `getCheckAndRecoverFunction`, which returns the appropriate `checkAndRecoverFunction` for the given analysis code.
When the analysis code is `inst.DeadMaster` or `inst.DeadMasterAndSomeReplicas`, the function first checks whether the instance is inside an emergency operation graceful period using `isInEmergencyOperationGracefulPeriod`. This check looks up the `emergencyOperationGracefulPeriodMap` cache, which is populated while the master is in the `UnreachableMaster` state, during `executeCheckAndRecoverFunction` → `runEmergentOperations`.
```go
func getCheckAndRecoverFunction(analysisCode inst.AnalysisCode, analyzedInstanceKey *inst.InstanceKey) (
	checkAndRecoverFunction func(analysisEntry inst.ReplicationAnalysis, candidateInstanceKey *inst.InstanceKey, forceInstanceRecovery bool, skipProcesses bool) (recoveryAttempted bool, topologyRecovery *TopologyRecovery, err error),
	isActionableRecovery bool,
) {
	// ...
}

func isInEmergencyOperationGracefulPeriod(instanceKey *inst.InstanceKey) bool {
	_, found := emergencyOperationGracefulPeriodMap.Get(instanceKey.StringCode())
	return found
}
```

If the instance is not in the grace period, `executeCheckAndRecoverFunction` proceeds to call `runEmergentOperations`, which launches `emergentlyReadTopologyInstance` and `emergentlyReadTopologyInstanceReplicas` in separate goroutines. These functions add the instance key to `emergencyReadTopologyInstanceMap` before reading the topology.
```go
func emergentlyReadTopologyInstance(instanceKey *inst.InstanceKey, analysisCode inst.AnalysisCode) (instance *inst.Instance, err error) {
	if existsInCacheError := emergencyReadTopologyInstanceMap.Add(instanceKey.StringCode(), true, cache.DefaultExpiration); existsInCacheError != nil {
		// Already read recently by another goroutine; skip.
		return nil, nil
	}
	instance, err = inst.ReadTopologyInstance(instanceKey)
	// ...
}
```

The `emergencyOperationGracefulPeriodMap` cache is created during package initialization with a 5-second expiration:
```go
emergencyOperationGracefulPeriodMap = cache.New(time.Second*5, time.Millisecond*500)
```

After the emergent reads, `executeCheckAndRecoverFunction` obtains the `checkAndRecoverFunction` (for DeadMaster this is `checkAndRecoverDeadMaster`) and checks whether global recovery is disabled via `IsRecoveryDisabled`. If recovery is disabled and not forced, the function returns without attempting recovery.
```go
func executeCheckAndRecoverFunction(analysisEntry inst.ReplicationAnalysis, candidateInstanceKey *inst.InstanceKey, forceInstanceRecovery bool, skipProcesses bool) (recoveryAttempted bool, topologyRecovery *TopologyRecovery, err error) {
	// ...
}
```

When recovery proceeds, `checkAndRecoverDeadMaster` first verifies that the cluster allows automated master recovery (via `RecoverMasterClusterFilters`). It then attempts to register a recovery entry with `AttemptRecoveryRegistration`, which blocks duplicate recoveries within the `RecoveryPeriodBlockMinutes` window.
```go
func checkAndRecoverDeadMaster(analysisEntry inst.ReplicationAnalysis, candidateInstanceKey *inst.InstanceKey, forceInstanceRecovery bool, skipProcesses bool) (recoveryAttempted bool, topologyRecovery *TopologyRecovery, err error) {
	// ...
}
```

If registration succeeds, the function calls `recoverDeadMaster`. This function determines the recovery type (GTID, Pseudo-GTID, or binlog server) and regroups replicas, using `RegroupReplicasGTID` in the GTID case. It also builds a closure, `promotedReplicaIsIdeal`, to decide whether the promoted replica satisfies geographic and lag constraints; if not, the topology is reorganised to promote the preferred replica.
```go
func recoverDeadMaster(topologyRecovery *TopologyRecovery, candidateInstanceKey *inst.InstanceKey, skipProcesses bool) (recoveryAttempted bool, promotedReplica *inst.Instance, lostReplicas [](*inst.Instance), err error) {
	// ...
}
```

Before the actual promotion, several configuration-driven checks are performed: `PreventCrossDataCenterMasterFailover`, `PreventCrossRegionMasterFailover`, `FailMasterPromotionOnLagMinutes`, `FailMasterPromotionIfSQLThreadNotUpToDate`, and `DelayMasterPromotionIfSQLThreadNotUpToDate`. If any check fails, the promotion is aborted.
On successful promotion, Orchestrator optionally runs `RESET SLAVE ALL` and sets `read_only=0` on the new master (controlled by `ApplyMySQLPromotionAfterMasterFailover`), detaches the old master, and may execute `PostMasterFailoverProcesses` hooks. Lost replicas can be detached in parallel if `DetachLostReplicasAfterMasterFailover` is enabled.
The article concludes that DeadMaster recovery involves many coordinated steps, including cache checks, recovery registration, promotion validation, topology regrouping, and post-recovery hooks, and hints at a follow-up Part III with deeper source-code analysis.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.