
Understanding and Handling ZooKeeper Split‑Brain Issues

This article explains the causes of ZooKeeper split‑brain situations, why odd‑numbered node deployments are preferred, how the quorum (majority) rule prevents split‑brain, and outlines practical methods such as quorum configuration, redundant communication, fencing, and pause‑before‑failover to handle and avoid the issue.

Architecture Digest

ZooKeeper is a distributed coordination service that provides a high‑performance synchronization kernel for building complex distributed functionality. A split‑brain situation arises when a network partition severs connectivity between parts of the cluster and each partition elects its own leader, leaving two nodes that both believe they are in charge.

Why Deploy an Odd Number of Nodes? ZooKeeper tolerates failures only as long as the surviving nodes form a strict majority of the total. An odd‑sized ensemble achieves the same fault tolerance as the next even size with one fewer machine: 5 nodes tolerate 2 failures, while 6 nodes still tolerate only 2, so the sixth node adds cost without adding resilience.
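The arithmetic behind this can be sketched in a few lines. These helper functions are illustrative, not part of any ZooKeeper API:

```python
def quorum_size(n: int) -> int:
    """Smallest node count that is a strict majority of n."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """How many nodes can fail while a majority still survives."""
    return n - quorum_size(n)

for n in range(3, 8):
    print(f"{n} nodes: quorum {quorum_size(n)}, "
          f"tolerates {tolerated_failures(n)} failures")
# 5 nodes tolerate 2 failures; 6 nodes also tolerate only 2,
# so the sixth node adds cost without adding fault tolerance.
```

Running the loop makes the even/odd asymmetry obvious: fault tolerance only increases when moving to the next odd size.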

ZooKeeper’s majority rule states that a leader can be elected only if it receives votes from more than half of the nodes. This prevents split‑brain because a minority partition cannot obtain a majority and therefore cannot elect a leader.

In a multi‑data‑center deployment, if the network between data centers fails, each center may still have internal communication and could each elect a leader if the majority rule were not enforced, leading to two independent “brains”.

In other words, the majority rule (surviving node count > n/2) guarantees that at any moment either exactly one leader exists or none does, which rules out split‑brain.
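A small simulation shows why at most one side of a partition can ever elect a leader. The function below is a sketch of the quorum check only, not ZooKeeper's actual election protocol:

```python
def can_elect_leader(partition_size: int, total: int) -> bool:
    """A partition may elect a leader only if it holds a strict majority."""
    return partition_size > total // 2

total = 5
# A network fault splits a 5-node ensemble into partitions of 3 and 2.
partitions = [3, 2]
electable = [p for p in partitions if can_elect_leader(p, total)]
print(electable)  # [3]: only the majority side can elect a leader

# An even split of a 6-node ensemble (3 and 3) elects no leader at all,
# since neither side holds a strict majority.
print([p for p in [3, 3] if can_elect_leader(p, 6)])  # []
```

The 3/3 case also illustrates the cost of even-sized ensembles: a symmetric partition leaves the whole cluster without a leader.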

Handling Split‑Brain

Typical mitigation methods include:

Quorum (majority) configuration – only a majority can elect a leader.

Redundant communication channels – multiple network paths to reduce single‑point failures.

Fencing (shared resource locking) – only the node holding the lock can act as leader.

Arbitration mechanisms – external reference IP or service to break ties.

Disk lock mechanisms – prevent a split‑brain node from accessing shared resources.
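The fencing idea above can be sketched with monotonically increasing tokens: the shared resource rejects any request carrying a token older than the newest one it has seen, so a deposed leader cannot corrupt shared state. The class and method names here are illustrative, not a real ZooKeeper interface:

```python
class FencedStore:
    """Shared resource that rejects writes from stale (fenced) leaders."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.highest_token:
            return False  # request from a deposed leader: fence it off
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
print(store.write(1, "config", "v1"))  # True: first leader, token 1
print(store.write(2, "config", "v2"))  # True: new leader, token 2
print(store.write(1, "config", "v3"))  # False: old leader is fenced
```

Each newly elected leader must obtain a token greater than any issued before, which is exactly the kind of monotonic counter a coordination service like ZooKeeper can provide.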

Additionally, when a follower detects an apparent leader failure, it should wait for one ZooKeeper session‑timeout interval before attempting to become leader, giving the old leader time to recognize that it has lost quorum and shut down cleanly.
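This pause‑before‑failover rule can be sketched as follows; the timeout value, function names, and the `leader_alive` probe are all illustrative assumptions, not ZooKeeper internals:

```python
import time

SESSION_TIMEOUT = 0.1  # seconds; real deployments use much larger values

def attempt_takeover(leader_alive) -> bool:
    """Claim leadership only after waiting out the session timeout.

    leader_alive is a zero-argument probe returning True if the current
    leader is still reachable.
    """
    if leader_alive():
        return False  # leader is fine, nothing to do
    time.sleep(SESSION_TIMEOUT)  # let the old leader step down cleanly
    return not leader_alive()    # re-check before claiming leadership

print(attempt_takeover(lambda: False))  # True: leader really is gone
print(attempt_takeover(lambda: True))   # False: leader was never down
```

The re-check after the sleep matters: if the "failure" was a transient network blip, the old leader reappears within the timeout and no takeover happens.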

By employing these strategies, ZooKeeper can maintain high availability while preventing the dangerous consequences of split‑brain, such as data inconsistency and client confusion.

Original source: https://www.cnblogs.com/kevingrace/p/12433503.html

Tags: distributed systems, High Availability, ZooKeeper, Cluster Management, Quorum, Split-Brain
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
