Investigation of Zookeeper 3.4.6 Election Port (3888) Failure Caused by Malformed Packets

This article walks through the investigation of a Zookeeper 3.4.6 cluster whose election port 3888 became unresponsive after malformed packets triggered a NegativeArraySizeException. It covers the diagnostic steps and the root-cause analysis, and recommends upgrading to a newer version to fix the issue.

Zhuanzhuan Tech

0 Conclusion

Conclusion first: Zookeeper 3.4.6 and earlier versions have a critical bug in which the election port (default 3888) can stop accepting connections, which prevents leader election. The related issues are:

issue-3016: Follower QuorumCnxManager$Listener thread died due to incorrect client packet

issue-2186: QuorumCnxManager#receiveConnection may crash with random input

In short, Zookeeper's election port 3888 may throw a NegativeArraySizeException when it receives malformed packets, causing the listener thread QuorumCnxManager$Listener to exit and thus halting leader election. The cluster can still read/write, but a restarted node cannot re‑join.
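
The failure mode can be reproduced in isolation. The following is a minimal sketch, not the Zookeeper source: the class name ElectionLengthGuard, both method names, and the MAX_BUFFER bound are assumptions made for illustration. Only the idea of a length field read from the packet (num_remaining_bytes in the Zookeeper code discussed in this article) is taken from the source. It shows that allocating an array from an unchecked length throws, and that a simple range check lets the caller reject the packet instead.

```java
import java.nio.ByteBuffer;

class ElectionLengthGuard {
    // Hypothetical upper bound chosen for this sketch only.
    static final int MAX_BUFFER = 2048;

    /** Unguarded read: trusts the length field supplied by the peer. */
    static byte[] readUnguarded(ByteBuffer in) {
        int numRemainingBytes = in.getInt();
        // Throws NegativeArraySizeException when the peer sends a negative length.
        return new byte[numRemainingBytes];
    }

    /** Guarded read: returns null to signal "reject this packet and close the connection". */
    static byte[] readGuarded(ByteBuffer in) {
        int numRemainingBytes = in.getInt();
        if (numRemainingBytes < 0 || numRemainingBytes > MAX_BUFFER) {
            return null;
        }
        return new byte[numRemainingBytes];
    }

    public static void main(String[] args) {
        // Four bytes that decode to -1, standing in for random scanner input.
        byte[] malformed = ByteBuffer.allocate(4).putInt(-1).array();

        try {
            readUnguarded(ByteBuffer.wrap(malformed));
        } catch (NegativeArraySizeException e) {
            System.out.println("unguarded read failed: " + e);
        }

        boolean rejected = readGuarded(ByteBuffer.wrap(malformed)) == null;
        System.out.println("guarded read rejected the packet: " + rejected);
    }
}
```

In this sketch the exception is caught in main so that the program can continue. In the scenario described by this article nothing handles it, so it ends the thread that was accepting election connections.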

1 Problem Background

On December 20, during a routine check, the operations team noticed that two Zookeeper nodes could not join the cluster. The conversation between the operator ("Big C") and the author ("I") went as follows:

Big C: "Our ZK cluster has two nodes that cannot join."

I: "Two of our five nodes are down and three remain; if one more fails the whole cluster will be down! What happened?"

Big C: "We didn't change any configuration. It was just a normal restart, and now they can't join."

I: "Impossible, we have no experience with ZK!"

Big C: "ZK is written in Java, and we ops rarely use Java. You architects should know better."

With that, the author began the analysis.

2 Phenomenon

2.1 Environment Configuration

The zoo.cfg of the five-node cluster is as follows:

<code>
server.6=10.40.xx.81:2888:3888
server.7=10.40.xx.41:2888:3888
server.8=10.40.xx.51:2888:3888
server.9=10.40.xx.111:2888:3888
server.10=10.40.xx.121:2888:3888
</code>

2.2 Observed Symptoms

Nodes 6 and 8 were restarted and cannot re-join. Nodes 7, 9 and 10 were not restarted; they continue to serve reads and writes, with node 10 acting as the Leader.

2.2.1 Restarted Nodes

The logs on node 6 show that it cannot reach port 3888 on the non-restarted nodes to take part in the election. Running zkCli.sh on node 6 also fails, confirming the node is isolated.

2.2.2 Non-restarted Nodes

Node 10 can still read and write via zkCli.sh, confirming its client functionality is intact.

3 Investigation Process

3.1 Status of Port 3888 on Node 10

netstat on node 10 shows port 3888 as LISTENING, but with many CLOSE_WAIT connections, many of them from IPs starting with 10.177 (later identified as a security scanner). Although the port is LISTENING, telnet cannot establish a connection to it. netstat shows many connections stuck in SYN_SENT, meaning the SYN packets are sent but never acknowledged. The socket is still open, but nothing is accepting connections on it, which indicates the listener thread is effectively dead.
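
The telnet check on port 3888 can also be scripted. The following is a rough sketch of such a probe; the class name ElectionPortProbe, the method canConnect, and the 2-second timeout are all assumptions made for illustration and are not part of Zookeeper or of the original investigation. It only reports whether a TCP handshake completes in time. A refused connection and a timed-out connection both return false, and the timeout case is what a client stuck in SYN_SENT would see.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class ElectionPortProbe {
    /** Returns true if a TCP connection to host:port is established within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers both a refused connection and a SocketTimeoutException.
            return false;
        }
    }

    public static void main(String[] args) {
        // Host and port default to placeholder values; pass real ones as arguments.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 3888;
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 2000));
    }
}
```

A successful connect only proves that the handshake completed; it says nothing about whether a thread will ever read from the connection. That is why the investigation went on to inspect the threads with jstack.
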

3.2 jstack of Node 10

Comparing the jstack output of node 10 with that of a colleague's ZK cluster revealed that the QuorumCnxManager$Listener thread was missing. This is the thread responsible for accepting election connections on port 3888.

3.3 Port 3888 on Node 6 Also Down

Node 6 exhibited the same CLOSE_WAIT pattern. Its logs contain a NegativeArraySizeException, which caused the listener thread to terminate. After node 6 was restarted, the thread reappeared.

4 Root-Cause Analysis

4.1 Online Verification

The connection-handling code run by the QuorumCnxManager$Listener thread (QuorumCnxManager#receiveConnection, the method named in issue-2186) reads an int named num_remaining_bytes from the incoming packet and allocates a byte array of that size. When a malformed packet supplies a negative value, Java throws a NegativeArraySizeException, which terminates the listener thread.

The malformed packets came from a security-scanning tool that sent random data to the election port.

4.2 Fix in Newer Versions

Issue-2186 indicates that Zookeeper 3.4.7 adds a guard that validates num_remaining_bytes before allocating the array, which eliminates the exception. The current cluster runs 3.4.6, which lacks this protection.

5 Summary

Encountering unfamiliar problems is an opportunity for growth, and thorough investigation is essential. The most valuable outcome is the troubleshooting methodology, not just the fix. Seek help from knowledgeable peers when needed; the ITCP community was instrumental in this case. Even robust-looking components can be fragile under unexpected inputs.

Recommendation: upgrade Zookeeper to 3.4.7 or later to avoid the election-port crash caused by malformed packets.

References

[1] https://issues.apache.org/jira/browse/ZOOKEEPER-3016

[2] https://issues.apache.org/jira/browse/ZOOKEEPER-2186
