
Orchestrator Failover Process Source Code Analysis – Simulating Faults and Understanding ContinuousDiscovery

This article walks through a simulated MySQL 3307 cluster failure, examines Orchestrator's source code to explain the ContinuousDiscovery loop, discovery queues, health ticks, caretaking tasks, raft coordination, topology snapshots, and the logic distinguishing UnreachableMaster from DeadMaster states.

Aikesheng Open Source Community

The author, a DBA expert, simulates a fault on a MySQL 3307 cluster consisting of one master (centos-1) and two replicas (centos-2, centos-3). The master is stopped with systemctl stop mysql3307, and the resulting Orchestrator logs show connection errors followed by detection of an UnreachableMaster condition.

The source-code analysis starts from the log entry executeCheckAndRecoverFunction: proceeding with, which is emitted by the function executeCheckAndRecoverFunction. Tracing the call chain back from that function leads to the HTTP entry point, started with orchestrator -config orchestrator.conf.json -debug http, which calls go logic.ContinuousDiscovery().

ContinuousDiscovery starts an infinite asynchronous discovery process. It launches a goroutine handleDiscoveryRequests() that iterates over the discoveryQueue channel, calling DiscoverInstance for each entry. DiscoverInstance eventually connects to MySQL instances via ReadTopologyInstanceBufferable and writes metrics to the database_instance table.
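The queue-consumer pattern described above can be sketched as follows. This is a minimal illustration rather than Orchestrator's code: it assumes a plain Go channel for the queue, and drainQueue, the worker count, and the discover callback are hypothetical names. The channel is closed here only so the demo terminates, whereas the real loop runs indefinitely.

```go
package main

import (
	"fmt"
	"sync"
)

// InstanceKey identifies a MySQL instance by host and port.
type InstanceKey struct {
	Hostname string
	Port     int
}

// drainQueue is a hypothetical stand-in for handleDiscoveryRequests: it starts
// a fixed number of workers that read keys from the queue and call discover
// for each one. It returns once the queue is closed and fully drained.
func drainQueue(queue <-chan InstanceKey, workers int, discover func(InstanceKey)) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range queue {
				discover(key)
			}
		}()
	}
	wg.Wait()
}

func main() {
	queue := make(chan InstanceKey, 3)
	for _, host := range []string{"centos-1", "centos-2", "centos-3"} {
		queue <- InstanceKey{Hostname: host, Port: 3307}
	}
	close(queue)

	var mu sync.Mutex
	seen := 0
	drainQueue(queue, 2, func(key InstanceKey) {
		// A real implementation would connect to MySQL here.
		mu.Lock()
		seen++
		mu.Unlock()
		fmt.Printf("discovering %s:%d\n", key.Hostname, key.Port)
	})
	fmt.Println("instances discovered:", seen)
}
```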

The discovery queue is populated in two ways: manual triggers (via the discover API), which push ReplicaKey and MasterKey, and a periodic healthTick (default 1 s), which runs onHealthTick() to push instances whose last check has expired.

Other periodic ticks include instancePollTick (default 5 s) for routine operations, autoPseudoGTIDTick (every 5 s) for pseudo‑GTID injection, caretakingTick (every minute) for maintenance tasks such as forgetting unseen instances, expiring audits, and cleaning up hostname resolves, and raftCaretakingTick (every 10 min) for Raft‑mode synchronization.
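A loop that multiplexes several tickers, as described above, is usually written in Go as a single select statement. The sketch below is a time-boxed miniature with only two tickers. The runLoop function, the tickCounts type, and the shortened intervals are assumptions made for the demo, and the counters stand in for the real work done on each tick.

```go
package main

import (
	"fmt"
	"time"
)

// tickCounts records how often each branch of the loop fired.
type tickCounts struct {
	Health, Poll int
}

// runLoop is a hypothetical, time-boxed miniature of a multi-ticker loop.
// It returns after total has elapsed so the example terminates.
func runLoop(healthEvery, pollEvery, total time.Duration) tickCounts {
	healthTick := time.NewTicker(healthEvery)
	defer healthTick.Stop()
	pollTick := time.NewTicker(pollEvery)
	defer pollTick.Stop()
	deadline := time.After(total)

	var counts tickCounts
	for {
		select {
		case <-healthTick.C:
			counts.Health++ // where onHealthTick() would queue expired instances
		case <-pollTick.C:
			counts.Poll++ // where routine per-poll work would run
		case <-deadline:
			return counts
		}
	}
}

func main() {
	// Intervals are scaled down from seconds to milliseconds for the demo.
	c := runLoop(10*time.Millisecond, 50*time.Millisecond, 130*time.Millisecond)
	fmt.Printf("health ticks: %d, poll ticks: %d\n", c.Health, c.Poll)
}
```

Because every ticker is handled in the same loop, slow work on one branch would delay the others, which is one reason such loops typically hand the actual work off to goroutines.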

When SnapshotTopologiesIntervalHours is greater than zero, a snapshotTopologiesTick saves the current topology into database_instance_topology_history at the configured interval.
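One common Go idiom for an optional tick like this is to leave the channel nil when the feature is disabled, since a receive from a nil channel blocks forever and the matching select case never fires. The sketch below illustrates that idiom. The snapshotTick helper is hypothetical, and whether Orchestrator structures its code exactly this way is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// snapshotTick is a hypothetical helper. It returns a ticking channel when
// intervalHours is positive and a nil channel otherwise. Receiving from a nil
// channel blocks forever, so a select case on it is effectively disabled.
func snapshotTick(intervalHours uint) <-chan time.Time {
	if intervalHours == 0 {
		return nil
	}
	return time.Tick(time.Duration(intervalHours) * time.Hour)
}

func main() {
	disabled := snapshotTick(0)
	select {
	case <-disabled:
		fmt.Println("snapshot tick fired")
	case <-time.After(20 * time.Millisecond):
		fmt.Println("snapshots disabled: the nil channel never fires")
	}
	fmt.Println("enabled channel is nil:", snapshotTick(24) == nil)
}
```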

The recovery loop runs on recoveryTick (an interval configurable in seconds). It clears stale detections, runs CheckAndRecover, and ensures only one recovery runs at a time using atomic.CompareAndSwapInt64(&recoveryEntrance, 0, 1). CheckAndRecover first obtains a ReplicationAnalysis slice via GetReplicationAnalysis, which classifies each instance (e.g., UnreachableMaster, DeadMaster) based on attributes such as IsMaster, LastCheckValid, CountValidReplicas, and CountValidReplicatingReplicas.
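The compare-and-swap guard mentioned above can be illustrated in isolation. In this minimal sketch, tryRecover is a hypothetical wrapper, not an Orchestrator function. Only the atomic.CompareAndSwapInt64 call on an int64 flag is taken from the text.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// tryRecover is a hypothetical wrapper. It runs recoverFn only if no other
// recovery currently holds the entrance flag, and reports whether it ran.
func tryRecover(entrance *int64, recoverFn func()) bool {
	// Atomically flip the flag from 0 to 1; fail if it is already 1.
	if !atomic.CompareAndSwapInt64(entrance, 0, 1) {
		return false
	}
	// Release the flag when the recovery round finishes.
	defer atomic.StoreInt64(entrance, 0)
	recoverFn()
	return true
}

func main() {
	var recoveryEntrance int64

	outer := tryRecover(&recoveryEntrance, func() {
		// An attempt made while a recovery is in progress is rejected.
		inner := tryRecover(&recoveryEntrance, func() {})
		fmt.Println("overlapping attempt ran:", inner)
	})
	fmt.Println("first attempt ran:", outer)
	fmt.Println("later attempt ran:", tryRecover(&recoveryEntrance, func() {}))
}
```

Unlike a mutex, this guard does not queue the rejected caller. A tick that arrives during a running recovery is simply skipped, and the next tick tries again.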

For a master that has just gone down, the analysis yields UnreachableMaster because the master cannot be reached but at least one replica is still replicating. The corresponding recovery function, checkAndRecoverGenericProblem, does nothing, so the system waits. Once all replicas stop replicating, the analysis switches to DeadMaster, and checkAndRecoverDeadMaster is invoked to perform the failover.
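The distinction between the two states can be expressed as a small decision function. The sketch below is a deliberately simplified model that uses only the attributes named in this article. The classify function, the AnalysisCode type, and the exact conditions are assumptions for illustration, and the real analysis covers more states and more conditions.

```go
package main

import "fmt"

// AnalysisCode names the outcome of a replication analysis.
type AnalysisCode string

const (
	NoProblem         AnalysisCode = "NoProblem"
	UnreachableMaster AnalysisCode = "UnreachableMaster"
	DeadMaster        AnalysisCode = "DeadMaster"
)

// ReplicationAnalysis holds only the attributes mentioned in the article.
type ReplicationAnalysis struct {
	IsMaster                      bool
	LastCheckValid                bool
	CountValidReplicas            uint
	CountValidReplicatingReplicas uint
}

// classify is a simplified, hypothetical decision function. States other
// than the two discussed here are not modeled and fall back to NoProblem.
func classify(a ReplicationAnalysis) AnalysisCode {
	switch {
	case !a.IsMaster || a.LastCheckValid:
		return NoProblem
	case a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas > 0:
		// Master unreachable, but replicas still replicate: wait and see.
		return UnreachableMaster
	case a.CountValidReplicas > 0 && a.CountValidReplicatingReplicas == 0:
		// Master unreachable and no replica replicates: fail over.
		return DeadMaster
	default:
		return NoProblem
	}
}

func main() {
	justStopped := ReplicationAnalysis{IsMaster: true, CountValidReplicas: 2, CountValidReplicatingReplicas: 2}
	replicasBroken := ReplicationAnalysis{IsMaster: true, CountValidReplicas: 2, CountValidReplicatingReplicas: 0}
	fmt.Println(classify(justStopped))
	fmt.Println(classify(replicasBroken))
}
```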

The article also explains key columns in the database_instance table (last_checked, last_seen, last_check_partial_success, last_attempted_check) and how they influence the detection logic.

The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.