Mastering HBase: Table Structure, API Usage, and Performance Tuning

This article explains HBase's column‑oriented architecture, key concepts such as Rowkey, ColumnFamily, and Region, provides Java API examples for table operations, and offers practical optimization techniques—including pre‑splitting, Rowkey design, caching, and compaction settings—to improve read/write performance.

Art of Distributed System Architecture Design

May 17, 2015

HBase Table Overview

HBase is an open‑source, distributed, column‑oriented database modeled on Google’s BigTable, used mainly for storing large volumes of sparse, semi‑structured or unstructured data. It relies on HDFS for storage and on ZooKeeper for coordination and failover, and it can serve as a source or sink for MapReduce jobs. Higher‑level tools such as Pig, Hive, and Sqoop provide query and import capabilities.

Each row is identified by a unique Rowkey, and rows are stored sorted by Rowkey in byte order, which determines access order. Three access patterns exist: single‑row lookup, row‑range scan, and full‑table scan. ColumnFamilies are declared in the table schema and group columns; each column is addressed as ColumnFamily:qualifier. A Cell stores its value as raw bytes, and a 64‑bit Timestamp distinguishes versions of the same cell, with newer versions sorted first.
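The logical layout described above can be pictured as nested sorted maps. The following is a minimal sketch in plain Java, not HBase code: the class ToyTable and its methods are hypothetical, and String keys stand in for the byte arrays HBase actually uses (for ASCII keys the ordering is the same).

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy in-memory model of the logical table layout; illustrative only. */
final class ToyTable {
    // rowkey -> "family:qualifier" -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    /** Stores one version of one cell. */
    void put(String rowkey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowkey, r -> new TreeMap<String, NavigableMap<Long, String>>())
            .computeIfAbsent(column, c -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    /** Single-row lookup: newest version of one column, or null if absent. */
    String getLatest(String rowkey, String column) {
        NavigableMap<String, NavigableMap<Long, String>> row = rows.get(rowkey);
        if (row == null || !row.containsKey(column)) {
            return null;
        }
        // Versions are sorted newest first, so the first entry is the latest.
        return row.get(column).firstEntry().getValue();
    }

    /** Row-range scan: rowkeys in [startRow, stopRow), in sorted order. */
    Iterable<String> scanKeys(String startRow, String stopRow) {
        return rows.subMap(startRow, true, stopRow, false).keySet();
    }
}
```

Because the outer map is sorted, a range scan is a contiguous slice of it, which is why Rowkey choice matters so much for read patterns.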

Physically, a table is split horizontally into Regions, each covering a contiguous range of Rowkeys. Each Region contains one or more Stores (one per ColumnFamily). A Store consists of an in‑memory MemStore and zero or more on‑disk HFiles (also called StoreFiles).
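Since every Region owns a contiguous Rowkey range, finding the Region for a row amounts to finding the greatest region start key that is less than or equal to the Rowkey. The sketch below shows that idea with a sorted map; ToyRegionLocator and the region names are hypothetical and do not reflect how the HBase client is actually implemented.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy region directory: maps a rowkey to the region whose key range contains it. */
final class ToyRegionLocator {
    // Region start key (inclusive) -> region name; the first region starts at "".
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    /** Returns the region whose start key is the greatest one <= rowkey. */
    String locate(String rowkey) {
        Map.Entry<String, String> entry = regionsByStartKey.floorEntry(rowkey);
        if (entry == null) {
            throw new IllegalStateException("no region covers " + rowkey);
        }
        return entry.getValue();
    }
}
```

With regions starting at "", "g", and "p", a row "apple" lands in the first region and "zebra" in the last; a region's end key is simply the next region's start key.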

HBase API Example

The HBase client library offers a rich set of Java APIs for creating tables, adding columns, inserting data, and querying. The following utility class demonstrates common operations such as checking table existence, creating tables with split keys, disabling/dropping tables, bulk loading, single puts, batch puts, and scanning.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

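public class HBaseUtil {
    private Configuration conf = null;
    // Administrative client used to create, disable, and drop tables.
    private HBaseAdmin admin = null;

    // Note: this example uses the older client API (HBaseAdmin and
    // HTableDescriptor constructed directly), as in the original article.
    protected HBaseUtil(Configuration conf) throws IOException {
        this.conf = conf;
        this.admin = new HBaseAdmin(conf);
    }

    // Returns true if the named table already exists.
    public boolean existsTable(String table) throws IOException {
        return admin.tableExists(table);
    }

    // Creates a table with the given column families. When splitKeys is
    // non-null, the table is pre-split into splitKeys.length + 1 regions.
    public void createTable(String table, byte[][] splitKeys, String... colfams) throws IOException {
        HTableDescriptor desc = new HTableDescriptor(table);
        for (String cf : colfams) {
            HColumnDescriptor coldef = new HColumnDescriptor(cf);
            desc.addFamily(coldef);
        }
        if (splitKeys != null) {
            admin.createTable(desc, splitKeys);
        } else {
            admin.createTable(desc);
        }
    }
    // ... (other methods omitted for brevity) ...
}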

Scanning data involves multiple layers: a RegionScanner aggregates one StoreScanner per ColumnFamily, and each StoreScanner combines a MemStoreScanner with several StoreFileScanners. Each of these sources is already sorted, so the scan seeks every source to the start KeyValue and then merges in‑memory and on‑disk data through a heap that always yields the smallest remaining KeyValue. Common Scan methods include addFamily, addColumn, setMaxVersions, setTimeRange, setFilter, setStartRow, setStopRow, setCaching, and setBatch.
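The heap-based merge can be illustrated with a plain k-way merge over sorted lists. This is a simplified sketch, not the HBase implementation: ToyMergeScan is a hypothetical class, plain strings stand in for KeyValues, and version ordering and delete handling are left out.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Toy k-way merge: each input list plays the role of one sorted scanner. */
final class ToyMergeScan {
    /** A source iterator together with its current head element. */
    private static final class Cursor {
        final Iterator<String> source;
        String head;

        Cursor(Iterator<String> source) {
            this.source = source;
            this.head = source.next();
        }
    }

    /** Merges already-sorted key lists into one sorted list. */
    static List<String> merge(List<List<String>> sortedSources) {
        // The heap orders cursors by their current head, smallest first.
        PriorityQueue<Cursor> heap =
                new PriorityQueue<Cursor>((a, b) -> a.head.compareTo(b.head));
        for (List<String> source : sortedSources) {
            if (!source.isEmpty()) {
                heap.add(new Cursor(source.iterator()));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();
            merged.add(smallest.head);
            // Advance the source that produced the element and re-insert it.
            if (smallest.source.hasNext()) {
                smallest.head = smallest.source.next();
                heap.add(smallest);
            }
        }
        return merged;
    }
}
```

Each step costs a heap operation proportional to the logarithm of the number of sources, which is one reason a Store with many small files scans more slowly than one that has been compacted.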

HBase Table Optimization

When concurrency or data volume grows, read/write performance can degrade. Recommended tuning steps:

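Pre‑splitting regions: A new table starts with a single region, so a bulk load initially lands on one RegionServer. Create the table with split keys so that empty regions exist before loading and no single region becomes hot.

Rowkey design: Rows are stored in lexicographic Rowkey order, so choose keys that place rows that are read together next to each other. Monotonically increasing keys (timestamps, sequence IDs) send every write to the last region; reverse, salt, or hash them to distribute load evenly.

Limit ColumnFamily count: Keep the number of families to 2‑3. In the HBase versions this article targets, flushes are triggered per region, so when one family's MemStore fills, the other families in that region are flushed too, which adds I/O and produces many small files.

Cache settings: Call HColumnDescriptor.setInMemory(true) on the families of hot tables so that their blocks get higher priority in the RegionServer block cache.

TTL configuration: Use HColumnDescriptor.setTimeToLive(int), with the value in seconds, to expire old data automatically; expired cells are no longer returned and are physically removed during compaction.

WAL control: Disable write‑ahead logging for non‑critical data via Put.setWriteToWAL(false) (later client versions use Put.setDurability(Durability.SKIP_WAL) instead) to improve write speed, at the risk of losing unflushed data if a RegionServer fails.

Batch writes: Accumulate multiple Put objects and submit them in a single call such as HTable.put(List<Put>) to reduce RPC overhead.

Scanner caching: Adjust hbase.client.scanner.caching, HTable.setScannerCaching(), or Scan.setCaching() (the latter has highest priority) to fetch more rows per RPC. Larger values cost more memory on both client and server.

RegionServer handler threads: Tune hbase.regionserver.handler.count based on workload: fewer threads for memory‑intensive large puts, more threads for high‑TPS scenarios with small requests.

Region size: hbase.hregion.max.filesize defaulted to 256 MB in early releases (newer releases default to 10 GB); on an old default, consider a larger value such as 2 GB. Larger regions split and compact less often, but each split or compaction takes longer and stalls requests for longer.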

Memory recommendations (based on practical testing): HDFS NameNode 16 GB, HDFS DataNode 2 GB, HBase Master 2 GB, HBase RegionServer 16 GB, ZooKeeper 4 GB.
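The pre‑splitting and Rowkey design items above work best together: if keys carry a bucket prefix, the split keys can be generated from the same bucket numbers. The sketch below assumes a two-digit decimal prefix and a hash-based bucket, which is one common salting scheme and not something the original article prescribes; RowkeySketch and its method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Illustrative rowkey helpers; the prefix format is an assumption of this sketch. */
final class RowkeySketch {
    /** Reverses a sequential key so the fastest-changing characters come first. */
    static String reverse(String key) {
        return new StringBuilder(key).reverse().toString();
    }

    /** Prefixes the key with a two-digit bucket number derived from its hash. */
    static String salt(String key, int buckets) {
        checkBuckets(buckets);
        int bucket = Math.floorMod(key.hashCode(), buckets);
        return String.format(Locale.ROOT, "%02d-%s", bucket, key);
    }

    /** Split keys "01", "02", ... so that each bucket maps to its own region. */
    static byte[][] splitKeys(int buckets) {
        checkBuckets(buckets);
        // n regions need n - 1 split points; bucket 00 falls before the first one.
        byte[][] keys = new byte[buckets - 1][];
        for (int i = 1; i < buckets; i++) {
            keys[i - 1] = String.format(Locale.ROOT, "%02d", i).getBytes(StandardCharsets.UTF_8);
        }
        return keys;
    }

    private static void checkBuckets(int buckets) {
        if (buckets < 2 || buckets > 100) {
            throw new IllegalArgumentException("buckets must be between 2 and 100");
        }
    }
}
```

The byte[][] returned by splitKeys has the shape expected by the splitKeys parameter of createTable in the utility class shown earlier. The trade-off of salting is that a range scan over the original key order must be issued once per bucket and merged on the client.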

Practical Case: Deleting Data

A project required immediate deletion of rows whose Rowkey consisted of a task ID plus a 16‑byte random suffix. Deleting them directly caused long latency and high disk I/O, leading to timeouts. A delete in HBase only writes a marker; the data is physically removed when a major compaction merges the store files and drops deleted and obsolete versions. Log analysis showed that the deletions were followed by major compactions, and while a prolonged major compaction ran, reads on the affected region stalled, causing query timeouts.
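The cost profile behind this case can be sketched with a toy model: a delete is a cheap marker write, while the physical cleanup happens only when the store is rewritten. ToyStore below is a hypothetical class for illustration and ignores timestamps, versions, and multiple store files.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy store showing delete markers versus physical removal. */
final class ToyStore {
    private final List<String> storedRows = new ArrayList<>(); // stands in for store file contents
    private final Set<String> deleteMarkers = new HashSet<>(); // tombstones written by delete()

    void put(String rowkey) {
        storedRows.add(rowkey);
    }

    /** Cheap: only records a marker, the row data stays on "disk". */
    void delete(String rowkey) {
        deleteMarkers.add(rowkey);
    }

    /** Reads hide rows that carry a delete marker. */
    boolean isVisible(String rowkey) {
        return storedRows.contains(rowkey) && !deleteMarkers.contains(rowkey);
    }

    int physicalRowCount() {
        return storedRows.size();
    }

    /** Expensive: rewrites all data, dropping marked rows; returns rows removed. */
    int majorCompact() {
        int before = storedRows.size();
        storedRows.removeIf(deleteMarkers::contains);
        deleteMarkers.clear();
        return before - storedRows.size();
    }
}
```

In this model the work done by majorCompact grows with the total data in the store, not with the number of deleted rows, which matches the observation that the compaction, rather than the delete call itself, produced the I/O spike.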

The CompactionChecker thread decides whether to run a major compaction based on the hbase.hregion.majorcompaction parameter, which is the interval between periodic major compactions. Setting this parameter to 0 disables the periodic major compaction. Instead, a custom schedule (e.g., a nightly cron job or a Quartz timer) can trigger compaction during off‑peak hours. The delete workflow was adjusted to remove only the task record immediately; the associated HBase rows are deleted later, during the scheduled compaction window, which keeps the compaction load out of peak hours.
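A nightly trigger of this kind can be sketched with the JDK scheduler instead of cron or Quartz. The class OffPeakSchedule is hypothetical, the Runnable passed in is a placeholder for whatever issues the deferred deletes and requests the major compaction, and the use of LocalDateTime means daylight-saving shifts are ignored.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Runs a maintenance task once a day at a fixed off-peak time. */
final class OffPeakSchedule {
    /** Time from now until the next occurrence of window (today if still ahead, else tomorrow). */
    static Duration untilNext(LocalDateTime now, LocalTime window) {
        LocalDateTime next = now.toLocalDate().atTime(window);
        if (!next.isAfter(now)) {
            next = next.plusDays(1);
        }
        return Duration.between(now, next);
    }

    /**
     * Schedules compactionTask daily at the given time. The task is supplied
     * by the caller; this sketch does not talk to HBase itself.
     */
    static ScheduledExecutorService start(Runnable compactionTask, LocalTime window) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        long initialDelayMs = untilNext(LocalDateTime.now(), window).toMillis();
        timer.scheduleAtFixedRate(
                compactionTask, initialDelayMs, TimeUnit.DAYS.toMillis(1), TimeUnit.MILLISECONDS);
        return timer;
    }
}
```

A caller might use OffPeakSchedule.start(task, LocalTime.of(2, 0)) to run at 02:00 each night, and should call shutdown() on the returned executor when the application stops. Note that scheduleAtFixedRate stops rescheduling if the task throws, so the task should catch and log its own exceptions.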

Conclusion

HBase differs significantly from traditional relational databases in data modeling, query capabilities, and performance characteristics. Understanding its internal structures—Rowkey, ColumnFamily, Region, Store, and MemStore—combined with careful API usage and targeted configuration tuning, enables developers to achieve reliable, high‑throughput data access. Real‑world case studies, such as deferred deletions and controlled compaction, illustrate how thoughtful design mitigates bottlenecks and ensures scalable operation.
