Resolving Zookeeper and HBase Master Crash Caused by jute.maxbuffer Misconfiguration
The article details a step‑by‑step investigation of a Zookeeper outage and subsequent HBase master failure caused by an outdated Zookeeper version bug and an excessively large jute.maxbuffer setting, explaining how to identify the issue, adjust configurations, and improve region assignment performance.
Just before leaving work, the author received alerts that Zookeeper and HBase were down, which prompted an urgent investigation.
The HBase master log showed that the Zookeeper node /hbase/replication/rs could not be retrieved. The Zookeeper logs indicated that all connections had dropped and the ensemble had stopped serving requests.
Research showed that the java.nio.channels.CancelledKeyException seen in the logs stems from a known bug in older Zookeeper releases, reportedly fixed in version 3.4.10; the author's cluster was running version 3.4.8, which was still affected.
After all Zookeeper nodes and HBase were restarted, a new error appeared: java.io.IOException: Packet len6075380 is out of range!, which caused the master to exit. The rejected packet was 6075380 bytes, roughly 5.8 MB.
Drawing on previous experience, the author set the JVM option -Djute.maxbuffer=41943040 (40 MB), but the error persisted. Repeatedly increasing the value did not help, and even an absurdly large number (hundreds of billions) still failed.
Inspecting the Zookeeper source showed that the method readLength() throws an IOException when the packet length exceeds packetLen. The default packetLen is 4 MB (4194304 bytes), and the configured value must fit in an int, whose maximum is 2147483647. The author's setting of hundreds of billions exceeded that range, so it was invalid and did not raise the limit.
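The effect of an out-of-range value can be sketched in a few lines of Java. This is a minimal illustration, not Zookeeper's actual code: it assumes the limit is read with Integer.getInteger, which returns the supplied default whenever the property is missing or cannot be parsed as an int. The class and method names (JuteMaxBufferCheck, effectivePacketLen, accepts) are hypothetical.

```java
public class JuteMaxBufferCheck {
    // 4 MB default cited in the article (4096 * 1024 = 4194304).
    static final int DEFAULT_PACKET_LEN = 4096 * 1024;

    // Assumption: the limit is read with Integer.getInteger, which returns the
    // default when the property is missing or does not parse as an int.
    static int effectivePacketLen(String configured) {
        String key = "jute.maxbuffer";
        if (configured == null) {
            System.clearProperty(key);
        } else {
            System.setProperty(key, configured);
        }
        return Integer.getInteger(key, DEFAULT_PACKET_LEN);
    }

    // Simplified stand-in for the length check: reject negative or oversized packets.
    static boolean accepts(int len, int packetLen) {
        return len >= 0 && len < packetLen;
    }

    public static void main(String[] args) {
        int packet = 6075380; // packet length taken from the error message
        String[] settings = {null, "10485760", "200000000000"};
        for (String s : settings) {
            int limit = effectivePacketLen(s);
            System.out.println(s + " -> limit " + limit
                    + ", accepts packet: " + accepts(packet, limit));
        }
    }
}
```

Under these assumptions, both the unset property and the value 200000000000 resolve to the 4194304-byte default, which rejects the 6075380-byte packet, while 10485760 (10 MB) accepts it. That matches the behavior the author observed: making the number larger past the int range silently put the limit back at 4 MB.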
After the author corrected the setting to 10 MB (10485760 bytes), restarted Zookeeper, and deleted the /hbase node, the system started successfully.
However, adjusting this parameter alone does not address the root cause. Given the cluster's massive data volume (over 200 billion records) and region count (over 100,000), restarting the master is a heavyweight operation because region state changes are managed through Zookeeper. When region assignment is slow, some nodes under /hbase can accumulate excessive data, which leads to further exceptions such as the packet length error above.
To mitigate this, the author suggests two actions: (1) partition regions sensibly, and (2) increase the number of threads used for region assignment by raising hbase.assignment.threads.max (default 30) to a higher value, e.g., 100.
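The thread setting described above would typically go in hbase-site.xml. The fragment below is a sketch that uses only the property name and example value from the text; the right value depends on the cluster and should be treated as an assumption to validate, not a recommendation.

```xml
<!-- hbase-site.xml: raise the region assignment thread limit.
     The text gives 30 as the default; 100 is the article's example value. -->
<property>
  <name>hbase.assignment.threads.max</name>
  <value>100</value>
</property>
```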
After applying the new thread configuration and restarting HBase, the system started smoothly.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies
