Postmortem Analysis of a 10‑Node HBase Cluster Outage and Mitigation Measures
This article presents a detailed post‑mortem of a 10‑node HBase cluster failure caused by excessive region count and memstore pressure, analyzes HDFS and datanode log errors, and outlines configuration adjustments and operational recommendations that restored the service and prevented future outages.
The report describes a severe outage of a 10‑node HBase cluster supporting hundreds of terabytes of data, where the number of regions grew beyond 23,000, leading to frequent memstore flushes, massive HFile creation, and overwhelming HDFS datanodes.
Incident Site
The cluster consisted of ten regionservers, each configured with a 32 GB JVM heap. Over time, the region count reached roughly 2,300 regions per server, and write latency began to increase before the complete crash.
Crash Logs
Key excerpts from the regionserver logs show failures to create new HDFS blocks and broken pipeline connections:
WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
java.io.IOException: Unable to create new block.
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1633)
...
WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/data/db1/.../recovered.edits/0000000000023724243.temp" - Aborting...
INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as xx.xx.xx.xx:50010
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1722)
...
Datanode logs reveal EOFExceptions and the Xceiver thread limit being exceeded:
INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: ...
java.io.EOFException: Premature EOF: no length prefix available
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2272)
...
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: {host}:50010:DataXceiverServer:
java.io.IOException: Xceiver count 4097 exceeds the limit of concurrent xcievers: 4096
...
Root Cause Analysis
HBase stores its data on HDFS; each region persists its writes as HFiles. The logs indicate that flush operations were extremely frequent because aggregate memstore usage repeatedly hit the global memstore threshold, triggering forced flushes. With more than 23,000 regions across ten servers (roughly 2,300 per regionserver), the heap available to each region's memstore was tiny, so flushes fired after only about 10 MB of data and generated a huge number of small HFiles.
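The per-region memstore squeeze is easy to show with a back-of-envelope calculation. The figures below are assumptions drawn from this report (32 GB heap, ~2,300 regions per server) plus the common default global memstore fraction of 0.4 (hbase.regionserver.global.memstore.size); the result is consistent in order of magnitude with the ~10 MB flushes described above.

```python
# Back-of-envelope estimate of the effective flush size per region.
# Assumptions: 32 GB regionserver heap, default global memstore
# fraction of 0.4, and ~2,300 regions hosted per server.

heap_gb = 32
global_memstore_fraction = 0.4      # hbase.regionserver.global.memstore.size (default)
regions_per_server = 2300

global_memstore_mb = heap_gb * 1024 * global_memstore_fraction
per_region_mb = global_memstore_mb / regions_per_server

print(f"Global memstore budget: {global_memstore_mb:.0f} MB")
print(f"Average memstore per region: {per_region_mb:.1f} MB")

# With only a few MB available per region, the default 128 MB
# per-region flush threshold (hbase.hregion.memstore.flush.size)
# is never reached; flushes are instead forced by global memstore
# pressure, producing a flood of tiny HFiles.
```

Under these assumptions each region gets well under 10 MB of memstore on average, so every burst of writes forces a flush and a new small HFile.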
These small files triggered continuous minor compactions, further stressing HDFS. Datanodes could not keep up, hitting the Xceiver thread limit (default 4096), which manifested as the observed exceptions and ultimately caused the cluster crash.
Recovery Steps
Several HBase and Hadoop parameters were tuned to increase concurrency and memory capacity:
dfs.datanode.max.transfer.threads – increased from the default 4,096 to 32,768 (a more conservative 2–4× increase is a reasonable starting point).
dfs.datanode.handler.count – raised from 3 to 8 (default Hadoop value is 10).
hbase.regionserver.thread.compaction.small – increased from 1 to 5 to accelerate minor compactions.
hbase.regionserver.global.memstore.size.lower.limit – raised from 0.38 to 0.95, reducing premature flushes. (In current HBase versions this property is a fraction of the global memstore size, not of the heap, so 0.95 means forced flushing begins only when the global memstore is nearly full.)
RegionServer JVM heap – expanded from 32 GB to 64 GB to provide more memstore space.
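Taken together, the tuning above corresponds to fragments like the following in hdfs-site.xml and hbase-site.xml (values taken from this report; adjust them to your own hardware). The enlarged regionserver heap is set separately, e.g. via HBASE_HEAPSIZE in hbase-env.sh.

```xml
<!-- hdfs-site.xml -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>32768</value>
</property>
<property>
  <name>dfs.datanode.handler.count</name>
  <value>8</value>
</property>

<!-- hbase-site.xml -->
<property>
  <name>hbase.regionserver.thread.compaction.small</name>
  <value>5</value>
</property>
<property>
  <name>hbase.regionserver.global.memstore.size.lower.limit</name>
  <value>0.95</value>
</property>
```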
After applying these changes, HDFS was restarted (a lengthy operation due to the large number of blocks), followed by HBase. The cluster returned to normal operation, and additional architectural adjustments were made, such as moving some data directly onto HDFS and planning for cluster scaling.
Master Initialization Timeout
During the HBase restart, the active Master failed to initialize within the default 900,000 ms (15 minute) timeout because assigning the massive number of regions took longer than expected. The namespace table assignment also timed out after 300,000 ms (5 minutes).
Configuration adjustments resolved the issue:
<property>
  <name>hbase.master.namespace.init.timeout</name>
  <value>86400000</value>
</property>
<property>
  <name>hbase.master.initializationmonitor.timeout</name>
  <value>86400000</value>
</property>
<property>
  <name>hbase.bulk.assignment.waiton.empty.rit</name>
  <value>3600000</value>
</property>
<property>
  <name>hbase.bulk.assignment.perregion.open.time</name>
  <value>30000</value>
</property>
Relevant source code in HMaster shows the corresponding timeout constants and the initialization-monitor implementation.
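As a quick sanity check on these values, the millisecond settings convert to human-readable units as follows (names and values are the ones configured above):

```python
# Convert the HBase master timeout settings from milliseconds to hours.
timeouts_ms = {
    "hbase.master.namespace.init.timeout": 86_400_000,
    "hbase.master.initializationmonitor.timeout": 86_400_000,
    "hbase.bulk.assignment.waiton.empty.rit": 3_600_000,
    "hbase.bulk.assignment.perregion.open.time": 30_000,
}
for name, ms in timeouts_ms.items():
    print(f"{name}: {ms / 3_600_000:.4g} h")
# 86,400,000 ms is a full 24 hours, i.e. effectively "wait as long
# as region assignment takes" rather than aborting initialization.
```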
Summary
The outage was a classic case of a small HBase cluster being overloaded by excessive region count and high‑frequency writes, leading to memstore pressure, frequent flushes, minor compactions, and ultimately HDFS thread‑pool exhaustion. Proper capacity planning, parameter tuning, and operational monitoring are essential to prevent similar failures.