
JVM Tuning, Region Split, BlockCache, and Compaction Strategies for HBase

This article explains how to configure JVM memory, choose appropriate garbage‑collector settings, tune HBase region split policies, optimize BlockCache implementations, and select suitable compaction strategies to improve HBase performance on clusters of various sizes.


JVM Tuning

In a default HBase installation the Master and RegionServer each receive only 1 GB of heap, and the Memstore's 40 % share of that amounts to just 0.4 GB, which is insufficient for most production RegionServers.

export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -Xms2g -Xmx2g"
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xms8g -Xmx8g"

This example is illustrative; actual clusters may require different settings, and at least 10 % of the machine memory should be reserved for the operating system.

For a 16 GB node running MapReduce, RegionServer and DataNode, a typical allocation could be:

2 GB for system processes.

8 GB for MapReduce (roughly 1 GB per slot, e.g. 6 map slots and 2 reduce slots).

4 GB for HBase RegionServer.

1 GB for TaskTracker.

1 GB for DataNode.

If MapReduce is not running on the node, its 8 GB share can be reallocated, allowing a substantially larger RegionServer heap.
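As a sanity check, the example allocation above can be tallied in a short script. All values are the article's illustrative figures for a 16 GB node, not recommendations:

```shell
# Tally the example 16 GB node allocation (values in MB).
TOTAL_MB=16384
SYSTEM_MB=2048        # system processes
MAPREDUCE_MB=8192     # MapReduce slots
REGIONSERVER_MB=4096  # HBase RegionServer heap
TASKTRACKER_MB=1024   # TaskTracker
DATANODE_MB=1024      # DataNode

SUM_MB=$((SYSTEM_MB + MAPREDUCE_MB + REGIONSERVER_MB + TASKTRACKER_MB + DATANODE_MB))
MIN_OS_MB=$((TOTAL_MB / 10))  # the recommended 10 % floor for the OS

echo "allocated ${SUM_MB} MB of ${TOTAL_MB} MB"
echo "OS reserve ${SYSTEM_MB} MB (floor ${MIN_OS_MB} MB)"
```

The 2 GB system share comfortably clears the 10 % operating-system floor mentioned earlier.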

Full GC Tuning

Memory pressure usually appears on the RegionServer rather than the Master. The JVM provides four garbage collectors relevant to HBase:

SerialGC

ParallelGC (default for JDK 8, optimised for the young generation)

Concurrent Mark‑Sweep (CMS) – optimised for the old generation

G1GC – optimised for large heaps (≥ 32 GB)

Two common combinations are:

ParallelGC + CMS:

export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xms8g -Xmx8g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"

G1GC:

export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xms8g -Xmx8g -XX:+UseG1GC -XX:MaxGCPauseMillis=100"

Choose G1GC only for very large heaps (32–64 GB). For heaps < 4 GB use the ParallelGC+CMS combo; for 4–32 GB test both and add the diagnostic flags:

-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy
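The rule of thumb above can be sketched as a small helper that picks flags by heap size. The 4 GB and 32 GB thresholds come from the text; the function name is made up for illustration:

```shell
# Pick GC flags by RegionServer heap size, following the rule of thumb above:
# ParNew+CMS below 4 GB, G1 at 32 GB and above, and "benchmark both" in between.
pick_gc() {
  heap_gb=$1
  if [ "$heap_gb" -lt 4 ]; then
    echo "-XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
  elif [ "$heap_gb" -ge 32 ]; then
    echo "-XX:+UseG1GC -XX:MaxGCPauseMillis=100"
  else
    echo "benchmark both with -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy"
  fi
}

pick_gc 2
pick_gc 8
pick_gc 64
```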

MSLAB and In‑Memory Compaction (HBase 2.x)

HBase implements a Memstore-local allocation buffer (MSLAB) to reduce allocation overhead and heap fragmentation. Relevant parameters include hbase.hregion.memstore.mslab.enabled, hbase.hregion.memstore.mslab.chunksize, hbase.hregion.memstore.mslab.max.allocation, and hbase.hregion.memstore.chunkpool.maxsize.

In‑Memory Compaction (available from HBase 2.0) further improves write throughput. It can be enabled with:

hbase.hregion.compacting.memstore.type=BASIC # options: NONE/BASIC/EAGER
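A sketch of these parameters in the same key=value style; the values shown are common defaults or illustrative examples, so verify them against your HBase version:

```properties
hbase.hregion.memstore.mslab.enabled=true          # MSLAB on (the usual default)
hbase.hregion.memstore.mslab.chunksize=2097152     # 2 MB chunks (typical default)
hbase.hregion.memstore.mslab.max.allocation=262144 # cells above 256 KB bypass MSLAB
hbase.hregion.memstore.chunkpool.maxsize=0.5       # fraction of chunks kept pooled (example value)
hbase.hregion.compacting.memstore.type=BASIC       # in-memory compaction: NONE/BASIC/EAGER
```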

Region Auto‑Split

Regions can be split automatically or manually. Several split policies exist:

ConstantSizeRegionSplitPolicy

Uses hbase.hregion.max.filesize to define a fixed size threshold; exceeding it triggers a split.

IncreasingToUpperBoundRegionSplitPolicy (default)

Calculates a dynamic size limit with the formula:

Math.min(tableRegionCount^3 * initialSize, defaultRegionMaxFileSize)

where tableRegionCount is the number of the table's regions hosted on the current RegionServer, initialSize defaults to twice the Memstore flush size, and defaultRegionMaxFileSize is the hbase.hregion.max.filesize value from the previous policy.

KeyPrefixRegionSplitPolicy

Ensures rows sharing the same prefix stay in the same region by using the KeyPrefixRegionSplitPolicy.prefix_length parameter.

DelimitedKeyPrefixRegionSplitPolicy

Similar to the previous policy but splits based on a delimiter (e.g., '_') rather than a fixed-length prefix.

BusyRegionSplitPolicy

Targets hot regions by splitting them, useful when hotspot mitigation is required but introduces nondeterministic split locations.

DisabledRegionSplitPolicy

Disables automatic splitting; manual split points can be defined in advance.
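To make the IncreasingToUpperBoundRegionSplitPolicy formula concrete, the sketch below computes the dynamic limit for the first few regions, assuming the common defaults of a 128 MB Memstore flush size (so initialSize = 256 MB) and a 10 GB hbase.hregion.max.filesize:

```shell
# Dynamic split threshold per IncreasingToUpperBoundRegionSplitPolicy, in MB:
# min(tableRegionCount^3 * initialSize, defaultRegionMaxFileSize)
INITIAL_MB=256   # 2 x 128 MB flush size (assumed default)
MAX_MB=10240     # hbase.hregion.max.filesize = 10 GB (assumed default)

split_limit_mb() {
  count=$1
  dynamic=$((count * count * count * INITIAL_MB))
  if [ "$dynamic" -lt "$MAX_MB" ]; then echo "$dynamic"; else echo "$MAX_MB"; fi
}

for count in 1 2 3 4; do
  echo "regions=$count limit=$(split_limit_mb "$count") MB"
done
```

With these defaults the first split happens at 256 MB, the second at 2 GB, the third at 6912 MB, and from the fourth region on the fixed 10 GB cap governs.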

Manual splitting can be performed via pre‑splitting or forced splits.

Recommended Approach

Start with pre‑splitting to load initial data, then let HBase manage automatic splitting. Do not disable automatic splitting.
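For example, a table can be pre-split at creation time and a region split forced later from the HBase shell (the table name, column family, and split keys here are hypothetical):

```
create 'testTable', 'cf', SPLITS => ['10', '20', '30']
split 'testTable', '25'
```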

BlockCache Optimization

Each RegionServer hosts a single BlockCache. A read first consults the Memstore and the BlockCache; on a miss it falls back to reading HFiles from disk. The cache is enabled by default.

To disable BlockCache for a column family:

alter 'testTable', {NAME => 'cf', BLOCKCACHE => 'false'}

Cache implementations:

LRUBlockCache – an on-heap cache divided into three priority segments (single-access, multi-access, and in-memory), loosely analogous to the JVM's young, old, and permanent generations.

SlabCache – now deprecated.

BucketCache – can use heap, off‑heap, or file storage; default is off‑heap.

BucketCache configuration includes hbase.bucketcache.ioengine, hbase.bucketcache.combinedcache.enabled, hbase.bucketcache.size, and hbase.bucketcache.bucket.sizes. With the off-heap engine, the JVM flag -XX:MaxDirectMemorySize must be set larger than the BucketCache size.

CombinedBlockCache places index and bloom blocks in LRUCache and data blocks in BucketCache, forming a two‑level cache hierarchy (memory → SSD → disk).
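An illustrative off-heap BucketCache configuration might look like the following; the 4 GB size is an example, not a recommendation:

```properties
hbase.bucketcache.ioengine=offheap   # heap / offheap / file:/path/to/cache
hbase.bucketcache.size=4096          # BucketCache size in MB (example value)
hbase.bucketcache.combinedcache.enabled=true
```

The RegionServer would then also need a direct-memory limit above that size in HBASE_REGIONSERVER_OPTS, e.g. -XX:MaxDirectMemorySize=5g.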

HFile Compaction

Two compaction types exist:

Minor Compaction – merges several HFiles, removes expired data (TTL), but retains manually deleted cells.

Major Compaction – merges all HFiles, removes manually deleted cells and excess versions; runs by default every 7 days.
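The major-compaction interval is controlled by hbase.hregion.majorcompaction:

```properties
# 7 days in milliseconds (the default); setting it to 0 disables time-triggered
# major compactions so they can be scheduled manually during off-peak hours.
hbase.hregion.majorcompaction=604800000
```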

Compaction Policies

RatioBasedCompactionPolicy

Deprecated due to excessive merging; selects files whose size is less than 1.2 × the total size of newer files.

ExploringCompactionPolicy (default since 0.96)

Chooses files where fileSize < (totalSize - fileSize) * 1.2. Files smaller than hbase.hstore.compaction.min.size are always selected.

FIFOCompactionPolicy

Deletes fully expired HFiles without merging; unsuitable when tables have no TTL or have a non‑zero MIN_VERSIONS.

DateTieredCompactionPolicy

Groups files by age windows (default 6 h) and merges within each tier, ideal for workloads that read recent data frequently.

StripeCompactionPolicy

Divides a region into many sub‑regions (stripes); works best for large regions (> 2 GB) with uniformly distributed rowkeys.
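The ExploringCompactionPolicy selection test above can be sketched as a simple predicate. The real policy also enumerates candidate windows and honours min/max file counts; the file sizes below are invented for illustration:

```shell
# A file joins the candidate set when fileSize < (totalSize - fileSize) * ratio.
RATIO_X10=12   # hbase.hstore.compaction.ratio = 1.2, scaled by 10 for integer math

selects() {
  file_mb=$1
  total_mb=$2
  rest_mb=$((total_mb - file_mb))
  if [ $((file_mb * 10)) -lt $((rest_mb * RATIO_X10)) ]; then echo yes; else echo no; fi
}

# Example store: files of 150, 30, and 20 MB (total 200 MB).
# The 150 MB file is excluded; compacting it would mostly rewrite old data.
for f in 150 30 20; do
  echo "file=${f}MB selected=$(selects "$f" 200)"
done
```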

Summary of Policy Selection

If data has a short TTL and recent data is accessed most, use DateTieredCompactionPolicy.

If data lacks TTL or has a long TTL, StripeCompactionPolicy provides more stable performance.

FIFOCompactionPolicy is rarely needed; consider DateTieredCompactionPolicy first.

Overall, choose the split and compaction policies that match your data distribution, access patterns, and cluster size to achieve optimal HBase performance.

Tags: JVM, Compaction, HBase, Database Performance, Memory Tuning, BlockCache, Region Split
Written by Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies