HBase Read and Write Performance Optimization Guide
This guide details practical server‑side and client‑side techniques for improving HBase read and write throughput, covering rowkey design, BlockCache configuration, HFile management, compaction tuning, scan cache sizing, bulkload usage, WAL policies, and SSD storage options.
HBase Read Performance Optimization
1. Server‑side tuning
1.1 Balance read requests – For high‑throughput workloads, hash the rowkey and pre‑split tables: use hash pre‑splits for get‑heavy workloads, and range pre‑splits with evenly distributed rowkeys for scan‑heavy workloads.
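One common way to balance requests is to prefix the rowkey with a hash‑derived salt so that sequential business keys (e.g. timestamps) spread across pre‑split regions. A minimal sketch; the bucket count, key format, and class name are illustrative assumptions, not from the original text:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class RowkeySalter {
    // Hypothetical bucket count; match it to the number of pre-split regions.
    static final int BUCKETS = 16;

    // Prefix the business key with a stable two-digit hash bucket so that
    // monotonically increasing keys spread evenly across regions.
    static String salt(String businessKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(businessKey.getBytes(StandardCharsets.UTF_8));
            int bucket = (digest[0] & 0xFF) % BUCKETS;
            return String.format("%02d_%s", bucket, businessKey);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Pre‑split the table on the same bucket prefixes (00_, 01_, …) so each bucket maps to its own region. Gets recompute the salt from the business key; scans must fan out across all buckets.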
1.2 BlockCache configuration – Use LRUBlockCache when the JVM heap is under about 20 GB; for larger heaps, choose BucketCache in off‑heap mode to reduce GC pressure. In heap mode, the block cache defaults to roughly 40 % of the heap (hfile.block.cache.size).
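For large heaps, off‑heap BucketCache can be enabled in hbase-site.xml; a sketch, where the 8192 MB size is an illustrative value to be sized for your hardware:

```
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>8192</value>
</property>
```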
1.3 Limit HFile count – Excessive HFiles increase I/O latency. Adjust hbase.hstore.compactionThreshold (default 3) and hbase.hstore.compaction.max.size to control compaction behavior.
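These thresholds can be tuned in hbase-site.xml; the values below are illustrative (compact only after six HFiles accumulate, and exclude files over 2 GB from minor compaction selection):

```
<property>
  <name>hbase.hstore.compactionThreshold</name>
  <value>6</value>
</property>
<property>
  <name>hbase.hstore.compaction.max.size</name>
  <value>2147483648</value>
</property>
```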
1.4 Control compaction resource usage – Disable automatic major compaction for large regions (>100 GB) and trigger it manually during low‑traffic periods; limit compaction throughput for smaller regions.
1.5 Improve data locality – Avoid unnecessary region migrations; for nodes with low locality, run major compaction during off‑peak times.
2. Client‑side tuning
2.1 Scan cache size – Increase scan cache from the default 100 to 500–1000 for large scans to reduce RPC calls.
2.2 Batch get – Use the batch get API to lower RPC overhead and boost read throughput.
2.3 Specify column families or columns – Restrict queries to needed column families/columns for more precise reads.
2.4 Disable cache for offline bulk scans – Set scan.setCacheBlocks(false) for full‑table offline scans (e.g., MapReduce) to avoid cache thrashing.
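The client‑side read techniques above (2.1–2.4) can be sketched with the Java client API. Table, family, and row names are hypothetical, and this assumes an hbase-client dependency on the classpath and a running cluster:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadTuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_profile"))) {

            // 2.1 / 2.3 / 2.4: large offline scan with a bigger cache,
            // restricted to one column family, without polluting BlockCache.
            Scan scan = new Scan();
            scan.setCaching(500);                  // rows per RPC (default 100)
            scan.addFamily(Bytes.toBytes("info")); // only the needed family
            scan.setCacheBlocks(false);            // skip BlockCache for full scans
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    // process each row ...
                }
            }

            // 2.2: batch get -- one request instead of one RPC per row.
            List<Get> gets = new ArrayList<>();
            gets.add(new Get(Bytes.toBytes("row1")));
            gets.add(new Get(Bytes.toBytes("row2")));
            Result[] results = table.get(gets);
        }
    }
}
```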
3. Column‑family design – Enable Bloom filters at the column‑family level to let RegionServers skip HFiles that cannot contain the requested cell: ROW suits rowkey‑only gets, while ROWCOL helps when queries always specify both the rowkey and the column qualifier.
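Bloom filters are set per column family, for example via the HBase shell (the table and family names here are hypothetical):

```
alter 'user_profile', {NAME => 'info', BLOOMFILTER => 'ROWCOL'}
```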
HBase Write Performance Optimization
1. Server‑side tuning
1.1 Sufficient region count – Ensure the number of regions exceeds the number of RegionServer nodes; split hot regions and redistribute them for balanced load.
1.2 Write request balancing – Design rowkeys and pre‑split strategies to evenly distribute writes (hash pre‑splits for get‑heavy tables, range pre‑splits for scan‑heavy tables).
1.3 Use SSD for WAL – Store WAL files on SSD to dramatically improve write latency. Example configuration:
<property>
  <name>hbase.wal.storage.policy</name>
  <value>ONE_SSD</value>
</property>
Set hbase.wal.storage.policy to ONE_SSD (one replica on SSD) or ALL_SSD (all replicas on SSD).
2. Client‑side tuning
2.1 Bulkload – Use the MapReduce bulkload tool to generate HFiles for massive offline data imports; offers high throughput but limited real‑time capability.
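A typical bulkload flow first generates HFiles with a MapReduce job (for example ImportTsv), then hands them to the RegionServers; the paths and table name below are illustrative, and a running cluster is assumed:

```shell
# 1. Generate HFiles instead of writing through the normal RPC path
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,info:name \
  -Dimporttsv.bulk.output=/tmp/hfiles user_profile /tmp/input

# 2. Atomically move the generated HFiles into the table's regions
hbase completebulkload /tmp/hfiles user_profile
```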
2.2 WAL usage and sync – The write path offers four durability levels, from weakest to strongest: SKIP_WAL, ASYNC_WAL, SYNC_WAL (the default), and FSYNC_WAL. Disable the WAL (SKIP_WAL) only if occasional data loss is acceptable; otherwise choose among the remaining levels based on durability needs.
2.3 Synchronous batch puts – As with batch gets, grouping puts into a single request reduces RPC calls.
2.4 Asynchronous batch puts – Enable setAutoFlush(false) and let the client buffer (default 2 MB) before flushing, increasing throughput at the risk of data loss on client failure.
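Sections 2.2–2.4 map onto the Java client roughly as follows. Note that modern clients replace setAutoFlush(false) with BufferedMutator; table, family, and row names are illustrative, and a running cluster is assumed:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteTuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            // 2.3: synchronous batch put -- one call, fewer RPCs.
            try (Table table = conn.getTable(TableName.valueOf("events"))) {
                List<Put> puts = new ArrayList<>();
                Put p = new Put(Bytes.toBytes("row1"));
                p.addColumn(Bytes.toBytes("info"), Bytes.toBytes("v"), Bytes.toBytes("1"));
                p.setDurability(Durability.ASYNC_WAL); // 2.2: relaxed WAL sync
                puts.add(p);
                table.put(puts);
            }

            // 2.4: asynchronous batch put via a client-side write buffer.
            BufferedMutatorParams params =
                new BufferedMutatorParams(TableName.valueOf("events"))
                    .writeBufferSize(4 * 1024 * 1024); // default is 2 MB
            try (BufferedMutator mutator = conn.getBufferedMutator(params)) {
                Put p2 = new Put(Bytes.toBytes("row2"));
                p2.addColumn(Bytes.toBytes("info"), Bytes.toBytes("v"), Bytes.toBytes("2"));
                mutator.mutate(p2); // buffered; flushed when the buffer fills
            } // close() flushes any remaining buffered mutations
        }
    }
}
```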
2.5 KeyValue size – Keep rowkeys short, ideally under 100 bytes (the hard limit is 32 KB, since the row length is stored as a short), and avoid oversized values; split large values across multiple columns, or store them in HDFS and keep only a URL reference in HBase.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies