
Postmortem Analysis of a 10‑Node HBase Cluster Outage and Mitigation Measures

This article presents a detailed post‑mortem of a 10‑node HBase cluster failure caused by excessive region count and memstore pressure, analyzes HDFS and datanode log errors, and outlines configuration adjustments and operational recommendations that restored the service and prevented future outages.

Big Data Technology Architecture

This report describes a severe outage of a 10‑node HBase cluster supporting hundreds of terabytes of data: the region count grew beyond 23,000, leading to constant memstore flushes, massive small‑HFile creation, and overwhelmed HDFS datanodes.

Incident Site

The cluster consisted of ten regionservers, each configured with a 32 GB JVM heap. Over time, the region count reached roughly 2,300 regions per server, and write latency began to increase before the complete crash.

Crash Logs

Key excerpts from the regionserver logs show failures to create new HDFS blocks and broken pipeline connections:

WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception
java.io.IOException: Unable to create new block.
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1633)
    ...
WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase/data/db1/.../recovered.edits/0000000000023724243.temp" - Aborting...
INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream
java.io.IOException: Bad connect ack with firstBadLink as xx.xx.xx.xx:50010
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1722)
    ...

Datanode logs reveal EOFExceptions and Xceiver count limits being exceeded:

INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: ...
java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2272)
    ...
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: {host}:50010:DataXceiverServer:
java.io.IOException: Xceiver count 4097 exceeds the limit of concurrent xcievers: 4096
    ...

Root Cause Analysis

HBase stores its data on HDFS, and each region persists memstore contents as HFiles. The logs indicate that flush operations were extremely frequent: the aggregate memstore size repeatedly hit the global memstore threshold, triggering forced flushes. With more than 23,000 regions across the cluster (roughly 2,300 per regionserver), each region's share of the heap‑backed memstore was tiny, so flushes fired after only about 10 MB of data per region and generated a huge number of small HFiles.
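A back‑of‑the‑envelope calculation illustrates the pressure. Assuming the default global memstore fraction (hbase.regionserver.global.memstore.size = 0.4; the cluster's actual value is not stated in the logs), each regionserver's 32 GB heap reserves about 12.8 GB for memstores, split across roughly 2,300 regions:

```latex
\frac{0.4 \times 32\,\text{GB}}{2300\ \text{regions}} \approx 5.7\,\text{MB per region}
```

Even before the global threshold forces flushes, each region can buffer only a few megabytes — far below the default 128 MB hbase.hregion.memstore.flush.size — so virtually every flush produces a tiny HFile.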

These small files triggered continuous minor compactions, further stressing HDFS. Datanodes could not keep up, hitting the Xceiver thread limit (default 4096), which manifested as the observed exceptions and ultimately caused the cluster crash.

Recovery Steps

Several HBase and Hadoop parameters were tuned to increase concurrency and memory capacity:

dfs.datanode.max.transfer.threads – increased from the default 4096 to 32768 (recommended to start with 2–4× increase).

dfs.datanode.handler.count – raised from 3 to 8 (default Hadoop value is 10).

hbase.regionserver.thread.compaction.small – increased from 1 to 5 to accelerate minor compactions.

hbase.regionserver.global.memstore.size.lower.limit – raised from 0.38 to 0.95, reducing premature flushes. (The 0.38 figure corresponds to the older hbase.regionserver.global.memstore.lowerLimit, expressed as a fraction of the heap; the newer property is expressed as a fraction of the global memstore size, for which 0.95 is the usual default.)

RegionServer JVM heap – expanded from 32 GB to 64 GB to provide more memstore space.
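In configuration form, the tuned values above would look roughly like this (a sketch of the settings described in this postmortem; the RegionServer heap itself is set via HBASE_REGIONSERVER_OPTS in hbase-env.sh rather than in XML):

```xml
<!-- hdfs-site.xml -->
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>32768</value> <!-- up from the 4096 default -->
</property>
<property>
  <name>dfs.datanode.handler.count</name>
  <value>8</value>
</property>

<!-- hbase-site.xml -->
<property>
  <name>hbase.regionserver.thread.compaction.small</name>
  <value>5</value> <!-- more threads draining the minor-compaction queue -->
</property>
<property>
  <name>hbase.regionserver.global.memstore.size.lower.limit</name>
  <value>0.95</value>
</property>
```

Both services must be restarted for the HDFS-side changes to take effect, which is why the recovery sequence below restarts HDFS first and then HBase.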

After applying these changes, HDFS was restarted (a lengthy operation due to the large number of blocks), followed by HBase. The cluster returned to normal operation, and additional architectural adjustments were made, such as moving some data directly onto HDFS and planning for cluster scaling.

Master Initialization Timeout

During the HBase restart, the active Master failed to initialize within the default 900,000 ms (15‑minute) timeout because assigning the massive number of regions took longer than expected. Namespace table initialization also timed out after 300,000 ms (5 minutes).

Configuration adjustments resolved the issue:

<property>
  <name>hbase.master.namespace.init.timeout</name>
  <value>86400000</value>
</property>
<property>
  <name>hbase.master.initializationmonitor.timeout</name>
  <value>86400000</value>
</property>
<property>
  <name>hbase.bulk.assignment.waiton.empty.rit</name>
  <value>3600000</value>
</property>
<property>
  <name>hbase.bulk.assignment.perregion.open.time</name>
  <value>30000</value>
</property>

These timeouts correspond to constants and the initialization‑monitor thread defined in the HMaster source code.

Summary

The outage was a classic case of a small HBase cluster being overloaded by excessive region count and high‑frequency writes, leading to memstore pressure, frequent flushes, minor compactions, and ultimately HDFS thread‑pool exhaustion. Proper capacity planning, parameter tuning, and operational monitoring are essential to prevent similar failures.
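For capacity planning, the HBase reference guide offers a rule of thumb for a sensible region count per regionserver. As a sketch (assuming one column family, the default 0.4 memstore fraction, and the default 128 MB flush size), the original 32 GB heap supports:

```latex
\text{regions per RS} \approx
\frac{\text{RS heap} \times \text{memstore fraction}}
     {\text{flush size} \times \text{\#column families}}
= \frac{32\,\text{GB} \times 0.4}{128\,\text{MB} \times 1}
\approx 100
```

That is more than an order of magnitude below the roughly 2,300 regions per server this cluster carried, which is why normal write traffic degenerated into a storm of tiny flushes and compactions.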

Big Data · Compaction · HBase · parameter tuning · HDFS · Memstore · Cluster Outage
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies
