Master Election in Elasticsearch Pre‑7.x: Bully Algorithm and Source Code Analysis
This article explains the Bully election algorithm used by Elasticsearch versions prior to 7.x, describes phenomena such as false death and split‑brain, and provides a detailed walkthrough of the master‑election source code including key Java methods and configuration parameters.
Earlier articles in this Elasticsearch series explained that the master node manages cluster state, shards, and indices, and that only nodes with node.master=true can be elected as master. Before version 7.x, Elasticsearch used a variant of the Bully algorithm for master election; version 7.0 and later use a new cluster coordination subsystem with a Raft-like consensus algorithm.
Bully Algorithm
The Bully algorithm requires every participating node to have a unique ID and selects the alive node with the highest ID as the master. The election process involves three message types: Election (initiate election), Answer/Alive (response), and Coordinator/Victory (announce winner). Elections are triggered when a process recovers from an error or when the current leader fails.
The election steps are:
If the node has the highest ID, it broadcasts a Victory message; otherwise it sends Election messages to nodes with larger IDs.
If no Alive response is received, the node declares itself master with a Victory message.
If an Alive message from a higher‑ID node is received, the node stops sending messages and waits for a Victory.
If an Election message from a lower‑ID node arrives, the node replies with Alive and restarts the election.
Upon receiving a Victory message, the node accepts the sender as the leader.
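The steps above can be traced with a small in-memory model. This is only a sketch of the textbook algorithm, not Elasticsearch code: the class name BullyElectionSketch, the integer IDs, and the simplification that each election hands over to the lowest responding higher-ID node are all assumptions made for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, in-memory model of one Bully election; not Elasticsearch code.
final class BullyElectionSketch {

    /**
     * Returns the ID of the node that ends up broadcasting Victory when
     * `initiator` starts an election among the currently alive nodes.
     */
    static int elect(int initiator, Set<Integer> aliveIds) {
        if (!aliveIds.contains(initiator)) {
            throw new IllegalArgumentException("initiator must be alive");
        }
        int current = initiator;
        while (true) {
            // Send Election to every node with a larger ID; only alive nodes answer.
            TreeSet<Integer> responders = new TreeSet<>();
            for (int id : aliveIds) {
                if (id > current) {
                    responders.add(id);
                }
            }
            if (responders.isEmpty()) {
                // No Alive reply arrived: this node declares itself master (Victory).
                return current;
            }
            // An Alive reply arrived: stop, and let a higher-ID node run its own election.
            current = responders.first();
        }
    }

    public static void main(String[] args) {
        // Nodes 1..5 exist, but node 5 (the old master) has crashed.
        Set<Integer> alive = Set.of(1, 2, 3, 4);
        System.out.println(elect(2, alive)); // prints 4
    }
}
```

Whichever node starts the election, the highest alive ID ends up as master, which is the defining property of the algorithm.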
False Death
When a minority of nodes think the master is down (e.g., only one node cannot ping the master while others can), a "false death" occurs, potentially causing frequent elections. Elasticsearch mitigates this by requiring more than half of the nodes to agree that the master is down before starting a new election.
Split‑Brain
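Network partitions can cause each partition to elect its own master, leading to multiple masters. Elasticsearch guards against this by requiring a majority (⌊N/2⌋+1, where N is the number of master-eligible nodes) to acknowledge a master before it is accepted. The guarantee only holds when discovery.zen.minimum_master_nodes is actually set to that majority.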
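The majority rule is easy to check by hand. The following sketch uses hypothetical helper names (QuorumSketch, quorum, canElect) and shows why two partitions of the same cluster can never both reach ⌊N/2⌋+1.

```java
// Hypothetical helper illustrating the majority rule; not Elasticsearch code.
final class QuorumSketch {

    /** Majority of n master-eligible nodes: floor(n / 2) + 1. */
    static int quorum(int masterEligibleNodes) {
        return masterEligibleNodes / 2 + 1;
    }

    /** A partition may elect a master only if it holds a majority. */
    static boolean canElect(int partitionSize, int masterEligibleNodes) {
        return partitionSize >= quorum(masterEligibleNodes);
    }

    public static void main(String[] args) {
        int n = 5;
        // A 5-node cluster split into partitions of 3 and 2 nodes.
        System.out.println(quorum(n));      // prints 3
        System.out.println(canElect(3, n)); // prints true
        System.out.println(canElect(2, n)); // prints false
    }
}
```

With an even node count, for example 4 nodes split 2/2, neither side reaches the quorum of 3, so the cluster has no master until the partition heals. That is the price paid for never having two.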
Elasticsearch Master Election Source Code Analysis
Master election is triggered during node startup/shutdown, periodic master ping failures, and when other nodes cannot ping the master.
Main Process
The core logic resides in the ZenDiscovery module, specifically the innerJoinCluster method:
private void innerJoinCluster() {
    DiscoveryNode masterNode = null;
    final Thread currentThread = Thread.currentThread();
    // start a new election round and accept votes
    nodeJoinController.startElectionContext();
    // loop until a temporary master is found
    while (masterNode == null && joinThreadControl.joinThreadActive(currentThread)) {
        masterNode = findMaster();
    }
    if (transportService.getLocalNode().equals(masterNode)) {
        // if this node becomes temporary master, wait for other nodes to confirm
        final int requiredJoins = Math.max(0, electMaster.minimumMasterNodes() - 1);
        nodeJoinController.waitToBeElectedAsMaster(requiredJoins, masterElectionWaitForJoinsTimeout,
            new NodeJoinController.ElectionCallback() {
                @Override
                public void onElectedAsMaster(ClusterState state) {
                    // election succeeded, finish this round
                    synchronized (stateMutex) {
                        joinThreadControl.markThreadAsDone(currentThread);
                    }
                }
                @Override
                public void onFailure(Throwable t) {
                    // election failed, start a new round
                    synchronized (stateMutex) {
                        joinThreadControl.markThreadAsDoneAndStartNew(currentThread);
                    }
                }
            }
        );
    } else {
        // not elected as master, stop accepting votes and vote for the elected master
        nodeJoinController.stopElectionContext(masterNode + " elected");
        final boolean success = joinElectedMaster(masterNode);
        synchronized (stateMutex) {
            if (success) {
                DiscoveryNode currentMasterNode = this.clusterState().getNodes().getMasterNode();
                if (currentMasterNode == null) {
                    // still no master, start a new election
                    joinThreadControl.markThreadAsDoneAndStartNew(currentThread);
                } else if (!currentMasterNode.equals(masterNode)) {
                    // master switched, re-join
                    joinThreadControl.stopRunningThreadAndRejoin("master_switched_while_finalizing_join");
                }
                joinThreadControl.markThreadAsDone(currentThread);
            } else {
                // voting failed, start a new election
                joinThreadControl.markThreadAsDoneAndStartNew(currentThread);
            }
        }
    }
}

The temporary master counts as one vote itself, which is why requiredJoins is minimumMasterNodes() - 1. Setting discovery.zen.minimum_master_nodes to a majority of the master-eligible nodes (master-eligible node count / 2 + 1, using integer division) is what prevents split-brain.
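The relationship between minimum_master_nodes and the number of joins a temporary master waits for can be modeled with a latch and a timeout. This is a simplified sketch rather than the NodeJoinController implementation: the class JoinQuorumSketch and the parameter joinsArriving are hypothetical, and the joins are counted synchronously instead of arriving over the network.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical model of "wait for enough joins"; not the NodeJoinController implementation.
final class JoinQuorumSketch {

    /** The would-be master counts as one vote, so it needs minimumMasterNodes - 1 joins. */
    static int requiredJoins(int minimumMasterNodes) {
        return Math.max(0, minimumMasterNodes - 1);
    }

    /** Returns true if enough joins arrive before the timeout, false otherwise. */
    static boolean waitForJoins(int minimumMasterNodes, int joinsArriving, long timeoutMillis) {
        CountDownLatch pending = new CountDownLatch(requiredJoins(minimumMasterNodes));
        for (int i = 0; i < joinsArriving; i++) {
            pending.countDown(); // one join request from another master-eligible node
        }
        try {
            return pending.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // minimum_master_nodes = 2 (three master-eligible nodes): one join is enough.
        System.out.println(waitForJoins(2, 1, 100)); // prints true
        // minimum_master_nodes = 3: a single join is not enough, so the wait times out.
        System.out.println(waitForJoins(3, 1, 100)); // prints false
    }
}
```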
Waiting for Other Nodes to Confirm
public void waitToBeElectedAsMaster(int requiredMasterJoins, TimeValue timeValue, final ElectionCallback callback) {
    final CountDownLatch done = new CountDownLatch(1);
    final ElectionCallback wrapperCallback = new ElectionCallback() {
        @Override
        public void onElectedAsMaster(ClusterState state) {
            done.countDown();
            callback.onElectedAsMaster(state);
        }
        @Override
        public void onFailure(Throwable t) {
            done.countDown();
            callback.onFailure(t);
        }
    };
    // set minimum votes and start election
    electionContext.onAttemptToBeElected(requiredMasterJoins, wrapperCallback);
    checkPendingJoinsAndElectIfNeeded();
    if (done.await(timeValue.millis(), TimeUnit.MILLISECONDS)) {
        return;
    }
    // timeout handling ...
}

Selecting a Temporary Master
The findMaster method pings other nodes, collects their view of the current master, and decides the master based on the gathered information:
private DiscoveryNode findMaster() {
    List<ZenPing.PingResponse> fullPingResponses = pingAndWait(pingTimeout).toList();
    final DiscoveryNode localNode = transportService.getLocalNode();
    // add the local node's own view to the responses
    fullPingResponses.add(new ZenPing.PingResponse(localNode, null, this.clusterState()));
    final List<ZenPing.PingResponse> pingResponses = filterPingResponses(fullPingResponses, masterElectionIgnoreNonMasters, logger);
    // masters that other nodes currently follow (the local node is never counted)
    List<DiscoveryNode> activeMasters = new ArrayList<>();
    for (ZenPing.PingResponse pingResponse : pingResponses) {
        if (pingResponse.master() != null && !localNode.equals(pingResponse.master())) {
            activeMasters.add(pingResponse.master());
        }
    }
    // master-eligible nodes together with their cluster state version
    List<ElectMasterService.MasterCandidate> masterCandidates = new ArrayList<>();
    for (ZenPing.PingResponse pingResponse : pingResponses) {
        if (pingResponse.node().isMasterNode()) {
            masterCandidates.add(new ElectMasterService.MasterCandidate(pingResponse.node(), pingResponse.getClusterStateVersion()));
        }
    }
    if (activeMasters.isEmpty()) {
        if (electMaster.hasEnoughCandidates(masterCandidates)) {
            ElectMasterService.MasterCandidate winner = electMaster.electMaster(masterCandidates);
            return winner.getNode();
        } else {
            return null; // election failed
        }
    } else {
        return electMaster.tieBreakActiveMasters(activeMasters);
    }
}

When no active master exists, the algorithm elects the candidate with the highest cluster state version and, if versions are tied, the smallest node ID. Note that this differs from the textbook Bully algorithm described earlier, which prefers the highest ID.
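The branching in findMaster can be reproduced with plain data types. The sketch below is a simplified stand-in, not the ZenDiscovery code: the Ping class, its fields, and the minimumMasterNodes parameter are hypothetical, and the "enough candidates" check is assumed to be a simple size comparison.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Simplified stand-in for the decision made in findMaster; all types here are hypothetical.
final class FindMasterSketch {

    /** What one node reported in its ping response. */
    static final class Ping {
        final String nodeId;
        final boolean masterEligible;
        final long clusterStateVersion;
        final String knownMasterId; // null if the node currently sees no master

        Ping(String nodeId, boolean masterEligible, long clusterStateVersion, String knownMasterId) {
            this.nodeId = nodeId;
            this.masterEligible = masterEligible;
            this.clusterStateVersion = clusterStateVersion;
            this.knownMasterId = knownMasterId;
        }
    }

    static Optional<String> findMaster(String localId, List<Ping> pings, int minimumMasterNodes) {
        List<String> activeMasters = new ArrayList<>();
        List<Ping> candidates = new ArrayList<>();
        for (Ping p : pings) {
            // A report naming the local node as master is ignored.
            if (p.knownMasterId != null && !p.knownMasterId.equals(localId)) {
                activeMasters.add(p.knownMasterId);
            }
            if (p.masterEligible) {
                candidates.add(p);
            }
        }
        if (!activeMasters.isEmpty()) {
            // Some node already follows a master: pick the one with the smallest ID.
            return Optional.of(Collections.min(activeMasters));
        }
        if (candidates.isEmpty() || candidates.size() < minimumMasterNodes) {
            return Optional.empty(); // not enough candidates, the caller pings again
        }
        Ping best = candidates.get(0);
        for (Ping p : candidates) {
            boolean newer = p.clusterStateVersion > best.clusterStateVersion;
            boolean sameVersionSmallerId = p.clusterStateVersion == best.clusterStateVersion
                && p.nodeId.compareTo(best.nodeId) < 0;
            if (newer || sameVersionSmallerId) {
                best = p;
            }
        }
        return Optional.of(best.nodeId);
    }
}
```

An empty result corresponds to the null return in the original method, which makes the loop in innerJoinCluster ping again.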
Ping Mechanism
The pingAndWait method blocks until the configured ping timeout expires, collecting ping responses via a CompletableFuture:
private ZenPing.PingCollection pingAndWait(TimeValue timeout) {
    final CompletableFuture<ZenPing.PingCollection> response = new CompletableFuture<>();
    try {
        // the ping implementation completes the future when the ping round ends
        zenPing.ping(response::complete, timeout);
    } catch (Exception ex) {
        response.completeExceptionally(ex);
    }
    // handling of the checked exceptions thrown by get() is omitted in this excerpt
    return response.get();
}

Unicast pings are sent concurrently using the unicastZenPingExecutorService. The number of concurrent connections is controlled by discovery.zen.ping.unicast.concurrent_connects, and the overall ping timeout by discovery.zen.ping_timeout.
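The "start an asynchronous round, then block on a future" pattern used by pingAndWait can be tried in isolation. In this sketch the ping method, the host names, and the single-threaded scheduler are all hypothetical stand-ins; a real ping round performs network I/O concurrently, whereas this one only pretends that every host answers at once.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical model of an asynchronous ping round; not the ZenPing implementation.
final class PingAndWaitSketch {

    /** Gathers replies in the background and reports them once the timeout elapses. */
    static void ping(List<String> hosts, long timeoutMillis, Consumer<List<String>> onDone) {
        List<String> replies = new CopyOnWriteArrayList<>();
        // One thread keeps the sketch deterministic; a real pinger would use a pool.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        for (String host : hosts) {
            // Pretend every host answers immediately.
            scheduler.execute(() -> replies.add(host));
        }
        // When the round is over, hand the collected replies to the caller.
        scheduler.schedule(() -> {
            onDone.accept(List.copyOf(replies));
            scheduler.shutdown();
        }, timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Blocks the calling thread until the ping round has finished. */
    static List<String> pingAndWait(List<String> hosts, long timeoutMillis) {
        CompletableFuture<List<String>> response = new CompletableFuture<>();
        try {
            ping(hosts, timeoutMillis, response::complete);
        } catch (Exception ex) {
            response.completeExceptionally(ex);
        }
        try {
            return response.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return List.of();
        } catch (ExecutionException e) {
            return List.of(); // the round failed; treat it as "no responses"
        }
    }

    public static void main(String[] args) {
        System.out.println(pingAndWait(List.of("node-a", "node-b"), 100));
    }
}
```

Because the caller always waits for the full round, a shorter timeout speeds up elections but risks missing slow nodes.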
Choosing the Master
After gathering ping results, the algorithm either elects a master from the qualified candidates when the cluster has no master, or breaks ties among active masters by selecting the node with the smallest ID:
public MasterCandidate electMaster(Collection<MasterCandidate> candidates) {
    assert hasEnoughCandidates(candidates);
    List<MasterCandidate> sortedCandidates = new ArrayList<>(candidates);
    sortedCandidates.sort(MasterCandidate::compare);
    return sortedCandidates.get(0);
}

public static int compare(MasterCandidate c1, MasterCandidate c2) {
    // arguments are swapped so that the higher cluster state version sorts first
    int ret = Long.compare(c2.clusterStateVersion, c1.clusterStateVersion);
    if (ret == 0) {
        ret = compareNodes(c1.getNode(), c2.getNode());
    }
    return ret;
}

private static int compareNodes(DiscoveryNode o1, DiscoveryNode o2) {
    // master-eligible nodes sort before other nodes, then the smaller ID wins
    if (o1.isMasterNode() && !o2.isMasterNode()) return -1;
    if (!o1.isMasterNode() && o2.isMasterNode()) return 1;
    return o1.getId().compareTo(o2.getId());
}

Thus, the master election speed depends largely on the duration of the pingAndWait call, which is bounded by the configured ping timeout.
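To see what the ordering defined by compare produces on concrete data, the sketch below ranks a few sample candidates. The Candidate class, the node IDs, and the version numbers are invented for illustration, and the master-eligibility check is left out because every candidate is assumed to be master-eligible.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical re-creation of the candidate ordering; not the ElectMasterService code.
final class CandidateOrderSketch {

    static final class Candidate {
        final String id;
        final long clusterStateVersion;

        Candidate(String id, long clusterStateVersion) {
            this.id = id;
            this.clusterStateVersion = clusterStateVersion;
        }
    }

    // Higher cluster state version first; ties go to the lexicographically smaller ID.
    static final Comparator<Candidate> ORDER =
        Comparator.comparingLong((Candidate c) -> c.clusterStateVersion).reversed()
            .thenComparing(c -> c.id);

    /** Returns the candidate IDs from most to least preferred. */
    static List<String> rank(List<Candidate> candidates) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(ORDER);
        List<String> ids = new ArrayList<>();
        for (Candidate c : sorted) {
            ids.add(c.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
            new Candidate("b", 10), new Candidate("a", 10), new Candidate("c", 12));
        System.out.println(rank(candidates)); // prints [c, a, b]
    }
}
```

Node "c" wins despite having the largest ID because its cluster state is newer. Preferring the freshest state means a newly elected master is less likely to publish stale metadata.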
政采云技术