Troubleshooting and Recovery of ZooKeeper Election Port Failure in a Codis Cache Cluster

While a ZooKeeper observer was being added to a Codis cache cluster, the election port (3888) turned out to be unreachable because the QuorumCnxManager listener thread had vanished. After telnet and log checks, the cluster was recovered through a rolling upgrade to ZooKeeper 3.4.13: rebuilding the data directory, performing a rolling restart, and decommissioning the temporary node, which restored full cluster quorum and normal Codis‑Proxy operation.

Sohu Tech Products

The author describes a production incident where adding a ZooKeeper observer node failed due to the election port (3888) being unreachable, while the existing cluster remained operational.
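
The reachability test described above can be scripted instead of run by hand with telnet. The following is a minimal sketch, assuming bash and the coreutils timeout command are available; the function name check_port and the 3-second limit are illustrative choices, not taken from the original incident.

```shell
#!/usr/bin/env bash
# check_port HOST PORT
# Returns 0 if a TCP connection to HOST:PORT opens within 3 seconds,
# non-zero otherwise. Uses bash's built-in /dev/tcp redirection, so no
# telnet or nc binary is needed. (Helper name and timeout are assumptions.)
check_port() {
  local host="$1" port="$2"
  # In the child shell, $0 is the host and $1 is the port.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example, check_port leader 3888 || echo "election port unreachable" would report the symptom seen here, where leader stands for the hostname of the current leader.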

Initial troubleshooting included telnet checks, log examination, and monitoring, which revealed that the QuorumCnxManager$Listener thread responsible for accepting election connections was missing on the non‑restarted nodes.
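
One way to confirm a missing listener is to look for its stack frame in a JVM thread dump. The sketch below assumes the dump comes from the JDK's jstack tool and that a live listener shows a frame from the QuorumCnxManager$Listener class; the helper name has_election_listener is hypothetical.

```shell
#!/usr/bin/env bash
# has_election_listener
# Reads a JVM thread dump on stdin and returns 0 only if some thread has a
# stack frame inside QuorumCnxManager$Listener, the class that accepts
# connections on the election port. (Helper name is an assumption.)
has_election_listener() {
  # -F matches the string literally, so the "$" needs no escaping.
  grep -qF 'QuorumCnxManager$Listener'
}
```

A possible invocation, assuming a single ZooKeeper process on the host, is jstack "$(pgrep -f QuorumPeerMain)" | has_election_listener || echo "listener thread missing".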

With help from the wider community, the root cause was identified: the listener thread had exited, so nothing was accepting connections on the election port and new nodes could not join the quorum.

The article then outlines two recovery strategies tested in a staging environment: (1) an in‑place rolling upgrade to ZooKeeper 3.4.13, and (2) building a new three‑node cluster from the offline nodes and later reintegrating the remaining original nodes.
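
For a rolling upgrade, a common convention is to restart the non-leader nodes first and the leader last, so that only one leader election is forced. The article does not state which order was used, so the following is only a sketch of that convention; the function name rolling_order and the "host mode" input format are assumptions.

```shell
#!/usr/bin/env bash
# rolling_order
# Reads "host mode" pairs on stdin (mode is leader, follower, or observer)
# and prints the hosts in a restart order that puts the leader last.
# (Helper name and input format are assumptions.)
rolling_order() {
  awk '
    $2 != "leader" { print $1 }             # non-leaders keep input order
    $2 == "leader" { last = last $1 "\n" }  # hold the leader back
    END            { printf "%s", last }
  '
}
```

The mode of each node can be read from the Mode line of the srvr four-letter command before the list is built.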

Key validation steps covered version compatibility, Codis‑Proxy temporary (ephemeral) node registration, behavior with multiple clients, and log rotation configured through log4j.
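
The article does not show its log rotation settings. As a sketch of one way to enable rotation in ZooKeeper 3.4.x, the environment file below switches the root logger to the ROLLINGFILE appender that ships in conf/log4j.properties. The log directory is an assumed value; the size and backup count are controlled by log4j.appender.ROLLINGFILE.MaxFileSize and log4j.appender.ROLLINGFILE.MaxBackupIndex in that properties file.

```shell
# conf/java.env, sourced by bin/zkEnv.sh when the server starts.
# Both values below are illustrative assumptions, not the article's settings.

# Directory that receives zookeeper.log and its rotated copies.
export ZOO_LOG_DIR="/var/log/zookeeper"

# Log level and appender: ROLLINGFILE is a size-based RollingFileAppender.
export ZOO_LOG4J_PROP="INFO,ROLLINGFILE"
```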

The chosen solution involved preparing a new ZooKeeper 3.4.13 working directory, updating node IDs, copying data and transaction logs, performing a rolling restart, and finally decommissioning the temporary node to restore a five‑node cluster.
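
The directory preparation step can be sketched as follows. This assumes that snapshots and transaction logs both live under version-2 inside dataDir (no separate dataLogDir) and that the source server is stopped before copying; the function name and all paths are hypothetical.

```shell
#!/usr/bin/env bash
# prepare_data_dir SRC_DATADIR NEW_DATADIR MYID
# Copies the snapshot and transaction log directory into a fresh data
# directory and writes the node ID file that ZooKeeper reads at startup.
# (Helper name and directory layout are assumptions.)
prepare_data_dir() {
  local src="$1" dst="$2" id="$3"
  mkdir -p "$dst"
  # -a preserves timestamps and permissions on snapshot.* and log.* files.
  cp -a "$src/version-2" "$dst/"
  # myid must hold the number used in the matching server.N line of zoo.cfg.
  printf '%s\n' "$id" > "$dst/myid"
}
```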

Post‑recovery verification showed a healthy leader, fully synced followers, normal Codis‑Proxy connections, and working log rotation, confirming that the cluster had been restored.
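
These checks can be read from the output of the mntr four-letter command, which prints tab-separated key/value pairs. The helper below is a small sketch for extracting a single field; the name mntr_value is hypothetical.

```shell
#!/usr/bin/env bash
# mntr_value KEY
# Reads mntr output on stdin and prints the value of KEY,
# for example zk_server_state or zk_synced_followers.
# (Helper name is an assumption.)
mntr_value() {
  awk -F '\t' -v key="$1" '$1 == key { print $2 }'
}
```

For example, echo mntr | nc localhost 2181 | mntr_value zk_server_state prints the role of the local node. On the leader of a five-node ensemble with no observers, zk_synced_followers would be expected to read 4 once every follower has caught up.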

The author concludes with four lessons: monitor the health of the election port, prefer a version upgrade when a bug fix is available, validate recovery plans rigorously before applying them, and always have a rollback plan.

Relevant commands: telnet leader 3888 (tests the election port on the leader) and echo mntr | nc localhost 2181 (queries server metrics through the client port).

Tags: ZooKeeper, Version Upgrade, Cluster Recovery, election port, log rotation, QuorumCnxManager
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
