
How to Diagnose and Fix a Dual‑Leader ZooKeeper Cluster

This article walks through a real‑world ZooKeeper incident where a five‑node cluster showed two leaders, explains the election rules, analyzes log and configuration mismatches, assesses business impact, and provides a step‑by‑step recovery plan to restore normal service without data loss.

dbaplus Community

1. Problem Background

A recently taken‑over ZooKeeper cluster with five nodes exhibited two leader nodes simultaneously. The cluster served multiple critical business systems, so the issue required a solution that restored service while preserving all data.

The cluster was deployed across three data centers for disaster recovery (IP addresses are anonymized here).

2. ZooKeeper Election Principles

ZooKeeper selects a leader based on three rules:

More than half of the nodes must be up for the cluster to operate.

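During an election, each node compares candidate votes by epoch, then by latest transaction ID (zxid), and finally by myid. When all nodes hold the same data, as on a fresh start, nodes with a smaller myid end up voting for the node with the larger myid until one candidate holds a majority.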

After a leader is elected, all other nodes become followers.

For a five‑node cluster (myid 1‑5) the election proceeds as nodes start one by one, eventually reaching a stable state with one leader and four followers.
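
The rules above can be sketched as a small simulation. This is a simplified model, not ZooKeeper's actual election code: it assumes every node starts with the same epoch and zxid (so the tie-break falls through to myid), and the names `Vote`, `quorum_size`, and `elect_on_sequential_start` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    epoch: int
    zxid: int
    myid: int

    def key(self):
        # Candidates are ordered by epoch, then zxid, then myid.
        return (self.epoch, self.zxid, self.myid)


def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def elect_on_sequential_start(ensemble_size: int, start_order: list):
    """Return (leader_myid, nodes_up_at_election), or None if no quorum forms.

    Assumption: all nodes start with epoch 0 and zxid 0, so only myid
    decides the winner among the nodes that are up.
    """
    up = []
    for myid in start_order:
        up.append(Vote(epoch=0, zxid=0, myid=myid))
        if len(up) >= quorum_size(ensemble_size):
            # The first time a majority is up, the best candidate wins.
            leader = max(up, key=Vote.key)
            return leader.myid, [v.myid for v in up]
    return None
```

Under these assumptions, starting nodes 1 through 5 in order elects node 3 as soon as the third node is up; nodes 4 and 5 then join an already established leader as followers. With only two of five nodes running, the function returns `None`, which corresponds to the loss of quorum described in the first rule.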

3. Issue Analysis

Log inspection on node 4 (IP 192.176.238.219) showed it had a quorum of supporters (nodes 1, 2, 4) and reported LEADING status, while the other nodes reported FOLLOWING. Data comparison revealed that nodes 1, 2, 3, 5 shared identical data, whereas node 4’s data differed.
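
A quick way to cross-check the roles reported in the logs is to ask every node for its mode. The sketch below uses ZooKeeper's `srvr` four-letter command, whose reply contains a `Mode:` line. It assumes the command is reachable on the client port (on ZooKeeper 3.5 and later it must be allowed through `4lw.commands.whitelist`); the helper names and the node labels in the usage note are hypothetical.

```python
import socket


def query_srvr(host: str, port: int, timeout: float = 3.0) -> str:
    """Send the `srvr` four-letter command and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(srvr_output: str):
    """Extract the value of the `Mode:` line, or None if it is absent."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def leaders(modes: dict) -> list:
    """Return the labels of all nodes whose reported mode is `leader`."""
    return sorted(name for name, mode in modes.items() if mode == "leader")
```

Collecting `parse_mode(query_srvr(host, port))` for each node into a dictionary and passing it to `leaders` should yield exactly one entry for a healthy ensemble; two or more entries reproduce the symptom described in this article.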

Further investigation uncovered that node 4's internal communication (quorum and election) ports differed from those of the rest of the cluster, causing it to belong to a separate ZooKeeper ensemble. That ensemble used ports 2888:3888 rather than the ports node 4 was supposed to use (2889:3889, the value applied in the resolution steps).
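
A mismatch like this can be found mechanically by comparing the `server.N=host:quorumPort:electionPort` entries in each node's `zoo.cfg`. The following is a minimal sketch under that assumption; the function names are illustrative, and how the configuration text is collected from each node is left open.

```python
import re

# Matches e.g. "server.4=zk4:2888:3888" (any trailing suffix is ignored).
SERVER_LINE = re.compile(r"^server\.(\d+)\s*=\s*([^:\s]+):(\d+):(\d+)")


def parse_servers(cfg_text: str) -> dict:
    """Map myid -> (host, quorum_port, election_port) from zoo.cfg text."""
    servers = {}
    for raw in cfg_text.splitlines():
        m = SERVER_LINE.match(raw.strip())
        if m:
            myid, host, qport, eport = m.groups()
            servers[int(myid)] = (host, int(qport), int(eport))
    return servers


def diff_configs(configs: dict) -> list:
    """Compare each node's view of the ensemble against the first node's.

    `configs` maps a node label to the text of that node's zoo.cfg.
    Returns (node_label, myid, expected, actual) tuples for every mismatch.
    """
    labels = list(configs)
    baseline = parse_servers(configs[labels[0]])
    mismatches = []
    for label in labels[1:]:
        current = parse_servers(configs[label])
        for myid in sorted(set(baseline) | set(current)):
            if baseline.get(myid) != current.get(myid):
                mismatches.append(
                    (label, myid, baseline.get(myid), current.get(myid))
                )
    return mismatches
```

An empty result means every node agrees on the ensemble definition. Any tuple in the result points at the node and the `server.N` entry that need a closer look.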

4. Impact

Two business groups (A and B) originally accessed separate clusters on ports 2181 and 2182. Because the misconfigured cluster (cluster 2) could serve both ports, any data loss in this cluster would affect both groups, potentially causing widespread service disruption.

5. Resolution Steps

Back up snapshots and transaction logs of clusters 1, 2, and 3.

Extract data from cluster 2 and classify it by business type.

Shut down node 4 of cluster 2 (192.176.238.219:2181). With only two nodes left, the cluster loses quorum and triggers a new election; nodes 1 and 2 then join cluster 3 as followers, forming a five‑node cluster.

Change node 4’s internal election ports to 2889:3889 to avoid conflict.

Restart the instance on 192.176.238.219:2181; it now matches cluster 1’s configuration and joins as a follower, completing a five‑node cluster.

Re‑import the previously classified data into the appropriate clusters (cluster 1 for business A, cluster 3 for business B).

Verify cluster health, data integrity, and perform business‑level testing.

After these actions, all clusters returned to normal operation, and no data loss was observed.
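
The backup in the first step can be as simple as copying each node's data directory before anything is stopped or reconfigured. ZooKeeper keeps snapshots and transaction logs in a `version-2` subdirectory of `dataDir` (and of `dataLogDir`, if one is configured). The sketch below assumes those directories are readable from where the script runs; the function names, paths, and labels are hypothetical, and files copied from a running server may be caught mid-write, so a second copy after shutdown is a sensible precaution.

```python
import shutil
import time
from pathlib import Path


def backup_data_dir(data_dir: str, backup_root: str, label: str) -> Path:
    """Copy a ZooKeeper data directory into a timestamped backup folder."""
    src = Path(data_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"data directory not found: {src}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{label}-{stamp}"
    # copytree refuses to write into an existing directory,
    # so an earlier backup is never overwritten.
    shutil.copytree(src, dest)
    return dest


def same_files(src: Path, dest: Path) -> bool:
    """Check that both trees hold the same relative paths and file sizes."""
    def listing(root: Path):
        return sorted(
            (p.relative_to(root).as_posix(), p.stat().st_size)
            for p in root.rglob("*")
            if p.is_file()
        )
    return listing(src) == listing(dest)
```

Running `backup_data_dir` once per node and checking the result with `same_files` gives a restorable copy for every instance before the ensemble membership is changed.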

6. Conclusion

The root cause was a simple configuration mismatch, but because the system had been running for a long time, the error was not initially suspected. Restarting the cluster alone would have masked the problem while risking data loss. Proper diagnosis, careful backup, and systematic recovery are essential to avoid recurring failures.

