JVM Tuning, Region Split, BlockCache, and Compaction Strategies for HBase
This article explains how to configure JVM memory, choose appropriate garbage‑collector settings, tune HBase region split policies, optimize BlockCache implementations, and select suitable compaction strategies to improve HBase performance on clusters of various sizes.
JVM Tuning
In a default HBase installation the Master and RegionServer each receive only 1 GB of heap memory, of which the MemStore is allotted 40 % (0.4 GB), which is insufficient for most production RegionServers. Increase the heaps in hbase-env.sh:
export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -Xms2g -Xmx2g"
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xms8g -Xmx8g"
This example is illustrative; actual clusters may require different settings, and at least 10 % of the machine's memory should be reserved for the operating system.
For a 16 GB node running MapReduce, RegionServer and DataNode, a typical allocation could be:
2 GB for system processes.
8 GB for MapReduce (roughly 6 map slots and 2 reduce slots at 1 GB each).
4 GB for HBase RegionServer.
1 GB for TaskTracker.
1 GB for DataNode.
If MapReduce is not running, the RegionServer memory can be reduced roughly by half.
Full GC Tuning
Memory pressure usually appears on the RegionServer rather than the Master. The JVM offers four garbage collectors that HBase can run with:
SerialGC
ParallelGC (default for JDK 8, optimised for the young generation)
Concurrent Mark‑Sweep (CMS) – optimised for the old generation
G1GC – optimised for large heaps (≥ 32 GB)
Two common combinations are:
ParNew + CMS (ParNew is the parallel young-generation collector that pairs with CMS; note that CMS was removed in later JDK releases):
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xms8g -Xmx8g -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
G1GC:
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xms8g -Xmx8g -XX:+UseG1GC -XX:MaxGCPauseMillis=100"
Choose G1GC only for very large heaps (32–64 GB). For heaps under 4 GB use the ParNew + CMS combination; for 4–32 GB test both, and add the diagnostic flags:
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintAdaptiveSizePolicy
MSLAB and In‑Memory Compaction (HBase 2.x)
HBase implements a MemStore-local allocation buffer (MSLAB) to reduce the heap fragmentation caused by frequent small allocations. Relevant parameters include hbase.hregion.memstore.mslab.enabled, hbase.hregion.memstore.mslab.chunksize, hbase.hregion.memstore.mslab.max.allocation, and hbase.hregion.memstore.chunkpool.maxsize.
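As an illustration, these parameters live in hbase-site.xml; the values below mirror common defaults and are examples, not tuning recommendations:

```xml
<!-- hbase-site.xml: MSLAB settings (example values, not recommendations) -->
<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>
<property>
  <name>hbase.hregion.memstore.mslab.chunksize</name>
  <value>2097152</value> <!-- 2 MB chunks -->
</property>
<property>
  <name>hbase.hregion.memstore.mslab.max.allocation</name>
  <value>262144</value> <!-- cells larger than 256 KB bypass the MSLAB -->
</property>
<property>
  <name>hbase.hregion.memstore.chunkpool.maxsize</name>
  <value>1.0</value> <!-- fraction of MemStore memory kept as a reusable chunk pool -->
</property>
```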
In‑Memory Compaction (available from HBase 2.0) further improves write throughput. It can be enabled with:
hbase.hregion.compacting.memstore.type=BASIC # options: NONE/BASIC/EAGER
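On a per-table basis, in-memory compaction can also be set from the HBase shell as a column-family attribute; the table and family names here are placeholders:

```
alter 'testTable', {NAME => 'cf', IN_MEMORY_COMPACTION => 'BASIC'}
```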
Region Auto‑Split
Regions can be split automatically or manually. Several split policies exist:
ConstantSizeRegionSplitPolicy
Uses hbase.hregion.max.filesize to define a fixed size threshold; exceeding it triggers a split.
IncreasingToUpperBoundRegionSplitPolicy (default)
Calculates a dynamic size limit with the formula:
Math.min(tableRegionCount^3 * initialSize, defaultRegionMaxFileSize)
where tableRegionCount is the number of regions of the table hosted on the current RegionServer, initialSize defaults to twice the MemStore flush size (hbase.hregion.memstore.flush.size), and defaultRegionMaxFileSize is hbase.hregion.max.filesize, the same threshold used by ConstantSizeRegionSplitPolicy.
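As a sketch of how the threshold grows, the loop below evaluates the formula with the default 128 MB flush size and 10 GB maximum file size; both values are assumptions about an untuned cluster:

```shell
# Split threshold under IncreasingToUpperBoundRegionSplitPolicy:
#   min(regionCount^3 * initialSize, defaultRegionMaxFileSize)
flush_size=$((128 * 1024 * 1024))           # hbase.hregion.memstore.flush.size default: 128 MB
initial_size=$((2 * flush_size))            # initialSize = 2 x flush size = 256 MB
max_file_size=$((10 * 1024 * 1024 * 1024))  # hbase.hregion.max.filesize default: 10 GB

for n in 1 2 3 4; do
  t=$((n * n * n * initial_size))
  if [ "$t" -gt "$max_file_size" ]; then t="$max_file_size"; fi
  echo "regions=$n threshold=$((t / 1024 / 1024)) MB"
done
```

With these defaults the threshold rises through 256 MB, 2 GB, and 6.75 GB before capping at the 10 GB maximum.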
KeyPrefixRegionSplitPolicy
Ensures rows sharing the same prefix stay in the same region by using the KeyPrefixRegionSplitPolicy.prefix_length parameter.
DelimitedKeyPrefixRegionSplitPolicy
Similar to the previous policy but splits based on a delimiter (e.g., ‘_’) rather than a fixed‑length prefix.
BusyRegionSplitPolicy
Targets hot regions by splitting them, useful when hotspot mitigation is required but introduces nondeterministic split locations.
DisabledRegionSplitPolicy
Disables automatic splitting; manual split points can be defined in advance.
Manual splitting can be performed via pre‑splitting or forced splits.
Recommended Approach
Start with pre‑splitting to load initial data, then let HBase manage automatic splitting. Do not disable automatic splitting.
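For example, split points can be supplied at table-creation time, and a forced split can be issued later, both from the HBase shell; the table name, family, and split keys are illustrative:

```
create 'testTable', 'cf', SPLITS => ['a', 'm', 'x']   # pre-split into 4 regions
split 'testTable', 'm'                                # force a split at a given key
```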
BlockCache Optimization
Each RegionServer has a single BlockCache. Reads consult the MemStore and BlockCache first; on a miss they fall back to reading HFiles from HDFS. The cache is enabled by default.
To disable BlockCache for a column family:
alter 'testTable', {NAME => 'cf', BLOCKCACHE => 'false'}
Cache implementations:
LRUBlockCache – an on-heap cache divided into three areas (single-access 25 %, multi-access 50 %, in-memory 25 %), loosely analogous to the JVM young, old, and permanent generations.
SlabCache – now deprecated.
BucketCache – can store blocks on the heap, off-heap, or in a file (e.g., on SSD); the off-heap engine is the most common deployment.
BucketCache configuration includes hbase.bucketcache.ioengine , hbase.bucketcache.combinedcache.enabled , hbase.bucketcache.size , and hbase.bucketcache.bucket.sizes . The JVM flag -XX:MaxDirectMemorySize must be larger than the BucketCache size.
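A minimal off-heap BucketCache configuration might look as follows; the sizes are illustrative only:

```xml
<!-- hbase-site.xml: off-heap BucketCache (example sizes) -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value> <!-- alternatives include file:/path/on/ssd -->
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>4096</value> <!-- cache size in MB -->
</property>
```

In hbase-env.sh the direct-memory limit must then exceed the 4 GB cache:
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:MaxDirectMemorySize=5g"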
CombinedBlockCache places index and Bloom-filter blocks in the LRUBlockCache and data blocks in the BucketCache, forming a tiered cache hierarchy (heap memory → off-heap memory or SSD → disk).
HFile Compaction
Two compaction types exist:
Minor Compaction – merges several adjacent small HFiles into one larger file and can drop TTL-expired cells, but retains manually deleted cells.
Major Compaction – merges all HFiles, removes manually deleted cells and excess versions; runs by default every 7 days.
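Because a major compaction rewrites every HFile, many operators disable the periodic run (hbase.hregion.majorcompaction=0) and instead trigger it during off-peak hours from the HBase shell; the table and family names are placeholders:

```
major_compact 'testTable'         # compact the whole table
major_compact 'testTable', 'cf'   # compact a single column family
```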
Compaction Policies
RatioBasedCompactionPolicy
The former default, superseded because it tends to select too many files and trigger excessively large merges; it chooses files whose size is less than 1.2 × the total size of all newer files.
ExploringCompactionPolicy (default since 0.96)
Chooses files where fileSize < (totalSize − fileSize) × 1.2. Files smaller than hbase.hstore.compaction.min.size are always eligible.
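The eligibility test can be sketched with hypothetical file sizes and the default ratio of 1.2 (hbase.hstore.compaction.ratio):

```shell
# ExploringCompactionPolicy eligibility: fileSize < (totalSize - fileSize) * ratio
sizes="30 40 50 600"   # hypothetical HFile sizes in MB
total=0
for s in $sizes; do total=$((total + s)); done

for s in $sizes; do
  # integer arithmetic: s < (total - s) * 1.2  <=>  s * 10 < (total - s) * 12
  if [ $((s * 10)) -lt $(( (total - s) * 12 )) ]; then
    echo "${s} MB: eligible"
  else
    echo "${s} MB: skipped"   # the old, large 600 MB file is excluded from the merge
  fi
done
```

The three small files are selected while the 600 MB file is skipped, which is precisely how the policy avoids repeatedly rewriting large, already-compacted data.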
FIFOCompactionPolicy
Deletes fully expired HFiles without merging; unsuitable when tables have no TTL or have a non‑zero MIN_VERSIONS.
DateTieredCompactionPolicy
Groups files by age windows (default 6 h) and merges within each tier, ideal for workloads that read recent data frequently.
StripeCompactionPolicy
Divides a region into many sub‑regions (stripes); works best for large regions (> 2 GB) with uniformly distributed rowkeys.
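Either of the last two policies is selected per table by switching the store engine from the HBase shell; the table name below is a placeholder, and substituting org.apache.hadoop.hbase.regionserver.StripeStoreEngine would enable stripe compaction instead:

```
alter 'testTable', CONFIGURATION => {
  'hbase.hstore.engine.class' => 'org.apache.hadoop.hbase.regionserver.DateTieredStoreEngine'
}
```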
Summary of Policy Selection
If data has a short TTL and recent data is accessed most, use DateTieredCompactionPolicy.
If data lacks TTL or has a long TTL, StripeCompactionPolicy provides more stable performance.
FIFOCompactionPolicy is rarely needed; consider DateTieredCompactionPolicy first.
Overall, choose the split and compaction policies that match your data distribution, access patterns, and cluster size to achieve optimal HBase performance.