
Understanding HBase: Architecture, Data Model, and Read/Write Mechanics

This article provides a comprehensive overview of HBase, covering its column‑oriented design, core components such as HMaster, RegionServer and ZooKeeper, the data model with column families and row keys, and detailed step‑by‑step write and read processes for distributed storage.

JavaEdge

Overview

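HBase is an open‑source, distributed NoSQL database modeled after Google's Bigtable. It is usually described as column‑oriented; more precisely, it is a wide‑column store that groups columns into column families. It stores data on HDFS and uses ZooKeeper for cluster coordination, providing scalable, highly available key‑value storage.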

Architecture Components

HMaster

Manages RegionServers, performs load balancing and region assignment.

Handles namespace and table schema operations (create, alter, delete) and ACLs; table metadata is persisted on HDFS.

Coordinates region splits and failover via ZooKeeper.
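The assignment and failover duties above can be pictured with a small sketch. It assumes a simple least‑loaded placement rule, which is not HBase's actual balancer; the function names, server names, and region names are all hypothetical.

```python
def assign_regions(regions, servers):
    """Place each region on the server that currently holds the fewest regions."""
    assignment = {server: [] for server in servers}
    for region in regions:
        target = min(servers, key=lambda s: len(assignment[s]))
        assignment[target].append(region)
    return assignment


def reassign_after_failure(assignment, dead_server):
    """Move the dead server's regions onto the surviving servers."""
    orphaned = assignment.pop(dead_server)
    survivors = list(assignment)
    for region in orphaned:
        target = min(survivors, key=lambda s: len(assignment[s]))
        assignment[target].append(region)
    return assignment


# Hypothetical cluster: six regions spread over three RegionServers.
assignment = assign_regions(["r1", "r2", "r3", "r4", "r5", "r6"], ["rs1", "rs2", "rs3"])
before_failure = {server: list(regions) for server, regions in assignment.items()}

# rs3 fails; its regions are redistributed to rs1 and rs2.
after_failure = reassign_after_failure(assignment, "rs3")
```

In this toy run every server starts with two regions, and after rs3 fails the two survivors hold three each.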

RegionServer

Hosts one or more Regions; reads and writes data to HDFS.

Writes go to the Write‑Ahead Log (WAL) on HDFS first, then to an in‑memory MemStore.

When a MemStore reaches a configurable size (hbase.hregion.memstore.flush.size, 128 MiB by default in current releases; 64 MiB in early versions), it is flushed to a StoreFile (HFile) on HDFS.

Manages StoreFiles, compaction, and region splits.

ZooKeeper

Stores cluster coordination state, such as the location of the hbase:meta table, the active HMaster, and the set of live RegionServers.

Facilitates HMaster election and monitors RegionServer health.

HBase architecture diagram

Data Model

Data is organized in tables. Each row is identified by a unique RowKey. Columns are grouped into Column Families; each family is stored in its own set of HFiles. Rows are sorted by RowKey, and a table is split into Regions that are distributed across RegionServers.
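One way to picture this layout is as a nested map: RowKey, then column family, then column qualifier, then timestamp, then value. The sketch below is a minimal in‑memory model and not the HBase client API; the ToyTable class, its methods, and the sample rows are hypothetical.

```python
class ToyTable:
    """Toy logical model: row_key -> family -> qualifier -> {timestamp: value}."""

    def __init__(self, families):
        # Column families are declared up front; qualifiers are free-form.
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, family, qualifier, value, ts):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        versions = (
            self.rows.setdefault(row_key, {})
            .setdefault(family, {})
            .setdefault(qualifier, {})
        )
        versions[ts] = value

    def get(self, row_key, family, qualifier):
        """Return the value with the highest timestamp, or None if absent."""
        versions = self.rows.get(row_key, {}).get(family, {}).get(qualifier, {})
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self):
        """Yield (row_key, family, qualifier, latest value) in RowKey order."""
        for row_key in sorted(self.rows):
            for family in sorted(self.rows[row_key]):
                for qualifier in sorted(self.rows[row_key][family]):
                    versions = self.rows[row_key][family][qualifier]
                    yield row_key, family, qualifier, versions[max(versions)]


table = ToyTable(families=["info", "stats"])
table.put("user#002", "info", "name", "Bo", ts=1)
table.put("user#001", "info", "name", "Al", ts=1)
table.put("user#001", "info", "name", "Alice", ts=2)  # newer version of the same cell
table.put("user#001", "stats", "logins", 7, ts=1)

scanned = list(table.scan())
```

Rows come back in RowKey order regardless of insertion order, and the cell written twice returns its newer version.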

Column Family

Defines storage attributes such as in‑memory caching, compression, and encoding. Best practice: keep to three or fewer column families per table, since each family has its own MemStore and HFiles and every extra family adds flush and compaction overhead.

RowKey

Acts as the primary key. Access patterns include single‑row get, range scans, and full‑table scans.
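Because rows are kept sorted by RowKey, a range scan amounts to finding the start key and reading forward until the stop key. The sketch below illustrates that idea with Python's bisect module on a sorted list; the key format and helper names are hypothetical, and the start‑inclusive, stop‑exclusive range is an assumption of this sketch.

```python
from bisect import bisect_left

# Row keys kept in sorted (lexicographic) order.
row_keys = sorted([
    "order#2024-01-03",
    "order#2024-01-01",
    "order#2024-02-10",
    "user#0001",
    "order#2024-01-02",
])


def point_get(keys, row_key):
    """Single-row lookup by binary search; returns None when the key is absent."""
    i = bisect_left(keys, row_key)
    return keys[i] if i < len(keys) and keys[i] == row_key else None


def range_scan(keys, start_row, stop_row):
    """Return all keys in [start_row, stop_row)."""
    lo = bisect_left(keys, start_row)
    hi = bisect_left(keys, stop_row)
    return keys[lo:hi]


# All January orders: a prefix-style range over the sorted keys.
january = range_scan(row_keys, "order#2024-01", "order#2024-02")
```

This is also why RowKey design matters: data that is read together should share a key prefix so that it lands in one contiguous range.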

Region

A Region covers a contiguous RowKey range. A table starts with one Region; when a Region exceeds a size threshold (default ~10 GB) it splits into two, enabling horizontal scaling. Regions are the smallest unit of load balancing.
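The split behaviour can be sketched as follows. This is a toy model under simplifying assumptions: size is counted as key plus value length, the split point is the middle key, and the threshold is a few bytes rather than gigabytes. The class names are hypothetical.

```python
from bisect import bisect_right


class ToyRegion:
    def __init__(self, start_key, rows=None):
        self.start_key = start_key    # inclusive lower bound; "" is the table start
        self.rows = dict(rows or {})  # row_key -> value

    def size(self):
        return sum(len(k) + len(v) for k, v in self.rows.items())


class ToyRegionedTable:
    def __init__(self, split_threshold):
        self.split_threshold = split_threshold
        self.regions = [ToyRegion("")]  # a new table starts with one Region

    def locate(self, row_key):
        """Find the Region whose key range contains row_key."""
        starts = [region.start_key for region in self.regions]
        return self.regions[bisect_right(starts, row_key) - 1]

    def put(self, row_key, value):
        region = self.locate(row_key)
        region.rows[row_key] = value
        if region.size() > self.split_threshold and len(region.rows) > 1:
            self._split(region)

    def _split(self, region):
        """Split at the middle key; the upper half becomes a new Region."""
        keys = sorted(region.rows)
        mid_key = keys[len(keys) // 2]
        upper_rows = {k: region.rows.pop(k) for k in keys if k >= mid_key}
        upper = ToyRegion(mid_key, upper_rows)
        self.regions.insert(self.regions.index(region) + 1, upper)


table = ToyRegionedTable(split_threshold=40)
for i in range(8):
    table.put(f"row{i:02d}", "x" * 6)  # 11 bytes per row in this toy accounting

region_starts = [region.start_key for region in table.regions]
```

With eight rows and a 40‑byte threshold the table ends up as four Regions of two rows each, and locate still routes every key to exactly one Region.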

Timestamp

Each cell can have multiple versions, each identified by a timestamp. By default a read returns the version with the latest timestamp; older versions can be retrieved by specifying a timestamp.
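A cell's versions can be modeled as a map from timestamp to value. The sketch below shows a default read, a read at an explicit timestamp, and trimming to a maximum number of retained versions; the helper names and the millisecond timestamps are hypothetical, and the retention limit is an assumption of this sketch.

```python
def read_cell(versions, ts=None):
    """Return the value at timestamp ts, or the newest value when ts is None."""
    if not versions:
        return None
    if ts is None:
        ts = max(versions)
    return versions.get(ts)


def trim_versions(versions, max_versions):
    """Keep only the newest max_versions entries."""
    newest = sorted(versions, reverse=True)[:max_versions]
    return {ts: versions[ts] for ts in newest}


# One cell with three versions, keyed by timestamp in milliseconds.
cell = {
    1700000000000: "v1",
    1700000005000: "v2",
    1700000009000: "v3",
}

latest = read_cell(cell)
first = read_cell(cell, ts=1700000000000)
trimmed = trim_versions(cell, max_versions=2)
```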

Write Path

Client contacts ZooKeeper to locate the hbase:meta table.

Client reads hbase:meta to discover the RegionServer that hosts the target Region, and caches that location for later requests.

Client sends a write request to that RegionServer.

RegionServer appends the operation to the WAL (stored on HDFS) for durability.

The same operation is written to the in‑memory MemStore.

When the MemStore size exceeds the flush threshold (128 MiB by default in current releases; 64 MiB in early versions), it is flushed to a StoreFile (HFile) on HDFS.

WAL provides sequential, durable writes on HDFS (which is append‑only). MemStore keeps rows sorted by RowKey, enabling efficient bulk flushes to HFile. The HFile is the on‑disk representation of a StoreFile.
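The ordering of the steps above (log first, then memory, then flush) can be sketched as a toy RegionServer. This is a simplified model, not HBase code: the WAL is a Python list, sizes are counted as key plus value length, and clearing the log on flush stands in for WAL cleanup. All names are hypothetical.

```python
class ToyRegionServer:
    def __init__(self, flush_threshold_bytes):
        self.flush_threshold_bytes = flush_threshold_bytes
        self.wal = []          # append-only edit log (stands in for the WAL on HDFS)
        self.memstore = {}     # row_key -> value, sorted when flushed
        self.memstore_bytes = 0
        self.store_files = []  # each flush produces one immutable, sorted file

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # 1. log the edit for durability
        old = self.memstore.get(row_key)
        if old is not None:
            self.memstore_bytes -= len(row_key) + len(old)
        self.memstore[row_key] = value     # 2. apply it to the in-memory buffer
        self.memstore_bytes += len(row_key) + len(value)
        if self.memstore_bytes >= self.flush_threshold_bytes:
            self.flush()                   # 3. flush once the buffer is full

    def flush(self):
        if not self.memstore:
            return
        self.store_files.append(tuple(sorted(self.memstore.items())))
        self.memstore.clear()
        self.memstore_bytes = 0
        self.wal.clear()  # flushed edits no longer need replay in this toy model

    def crash_and_recover(self):
        """Lose the in-memory state, then rebuild it by replaying the WAL."""
        pending = list(self.wal)
        self.memstore.clear()
        self.memstore_bytes = 0
        self.wal.clear()
        for row_key, value in pending:
            self.put(row_key, value)


server = ToyRegionServer(flush_threshold_bytes=20)
for key, value in [("r3", "ccc"), ("r1", "aaa"), ("r2", "bbb"), ("r0", "zzzzz")]:
    server.put(key, value)     # the fourth put crosses the threshold and flushes

server.put("r9", "new")        # lives only in the WAL and MemStore
server.crash_and_recover()     # the unflushed edit survives via WAL replay
```

The flushed file holds its rows in RowKey order even though they arrived out of order, and the one unflushed edit is restored from the log after the simulated crash.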

Read Path

Client queries ZooKeeper for the meta table location.

Client reads hbase:meta (or its cached copy) to find the RegionServer hosting the row, then sends the read request to it.

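RegionServer checks the MemStore for recent writes and the BlockCache for recently read HFile blocks; Bloom filters let it skip StoreFiles that cannot contain the requested row.

If needed, the RegionServer scans the remaining StoreFile(s) and merges the results with the MemStore contents, so that the newest version of each requested cell is returned.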
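The read steps above can be sketched with a small Bloom filter per StoreFile. This is an illustrative model only: the filter size, hash scheme, and class names are hypothetical, and the block cache is left out. A Bloom filter can report false positives but never false negatives, so a negative answer is always safe to act on.

```python
import hashlib


class ToyBloomFilter:
    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a Python integer

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class ToyStoreFile:
    def __init__(self, cells):
        self.cells = dict(cells)  # row_key -> (timestamp, value)
        self.bloom = ToyBloomFilter()
        for row_key in self.cells:
            self.bloom.add(row_key)

    def get(self, row_key):
        return self.cells.get(row_key)


def read_row(row_key, memstore, store_files):
    """Gather candidates from the MemStore and StoreFiles; newest timestamp wins."""
    candidates = []
    if row_key in memstore:
        candidates.append(memstore[row_key])
    for store_file in store_files:
        if not store_file.bloom.might_contain(row_key):
            continue  # the filter says this file definitely lacks the row
        cell = store_file.get(row_key)
        if cell is not None:
            candidates.append(cell)
    if not candidates:
        return None
    return max(candidates, key=lambda cell: cell[0])[1]


memstore = {"row1": (30, "new")}
older = ToyStoreFile({"row1": (10, "old"), "row2": (10, "two")})
newer = ToyStoreFile({"row3": (20, "three")})
files = [older, newer]
```

Here read_row("row1", memstore, files) returns the MemStore value because it carries the highest timestamp, while "row2" is served from the older StoreFile.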

Column‑Oriented vs Row‑Oriented Storage (OLAP Context)

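Row‑oriented storage keeps all columns of a row together, so a query that needs only a few columns still reads every column of each row it touches. Column‑oriented storage keeps each column's values contiguously, allowing selective reads, fewer disk seeks, and better compression, which are advantages for analytical workloads. HBase sits between the two: it stores each column family separately, so a read touches only the families it asks for.

Example SQL query on a row‑oriented table, which reads every column of each scanned row even though only name and dept are needed:

SELECT name FROM emp WHERE dept = 'A';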
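The difference in I/O can be made concrete by counting how many values each layout has to read to answer that query. The sketch below uses a hypothetical three‑row emp table with four columns; the column set and the counting scheme are assumptions for illustration.

```python
# Hypothetical emp table.
employees = [
    {"id": 1, "name": "Ann", "dept": "A", "salary": 100},
    {"id": 2, "name": "Bob", "dept": "B", "salary": 120},
    {"id": 3, "name": "Cyd", "dept": "A", "salary": 90},
]

# Row-oriented layout: all columns of a row are stored together.
row_store = [tuple(e.values()) for e in employees]

# Column-oriented layout: each column's values are stored together.
column_store = {col: [e[col] for e in employees] for col in employees[0]}


def query_row_store(rows):
    """Names of employees in dept A, plus the number of values read."""
    values_read = 0
    names = []
    for _id, name, dept, _salary in rows:
        values_read += 4  # the whole row is read, including id and salary
        if dept == "A":
            names.append(name)
    return names, values_read


def query_column_store(cols):
    """Same query; only the dept column and the matching names are read."""
    values_read = len(cols["dept"])
    matches = [i for i, dept in enumerate(cols["dept"]) if dept == "A"]
    values_read += len(matches)
    return [cols["name"][i] for i in matches], values_read


row_result = query_row_store(row_store)
column_result = query_column_store(column_store)
```

Both layouts return the same names, but the row layout reads 12 values and the column layout reads 5; the gap widens as tables gain columns.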

Key Internal Structures

Store: One per column family; contains a MemStore and one or more StoreFiles (HFiles).

MemStore: In‑memory write buffer; flushed to a StoreFile when the size threshold is reached.

StoreFile / HFile: Immutable on‑disk file format; multiple StoreFiles may be compacted.

HLog (WAL): Write‑ahead log stored on HDFS; used for recovery after failures.
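Since StoreFiles are immutable and sorted, compaction is essentially a merge of sorted files into one. The sketch below merges files with heapq.merge and keeps only the newest version of each row; that retention rule is a simplifying assumption, and the function name and sample cells are hypothetical.

```python
import heapq


def compact(store_files):
    """Merge sorted StoreFiles into one, keeping the newest cell per row key.

    Each file is a list of (row_key, timestamp, value) sorted by row key.
    """
    # Order by row key, then newest timestamp first within a row.
    merged = heapq.merge(*store_files, key=lambda cell: (cell[0], -cell[1]))
    result = []
    for row_key, ts, value in merged:
        if result and result[-1][0] == row_key:
            continue  # an older version of a row that was already emitted
        result.append((row_key, ts, value))
    return result


older = [("r1", 1, "a"), ("r2", 1, "b"), ("r4", 1, "d")]
newer = [("r2", 5, "B"), ("r3", 5, "c")]

compacted = compact([older, newer])
```

The result is a single sorted file in which r2 keeps only its newer value, so later reads consult one file instead of two.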

ZooKeeper Role

ZooKeeper maintains cluster coordination state, stores the location of the hbase:meta table, elects the active HMaster, and notifies the master of RegionServer failures, ensuring high availability.
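Master election and failure detection both rest on the idea of an ephemeral node, which disappears when the session that created it ends. The sketch below is a toy in‑memory stand‑in and not the ZooKeeper client API; the class, the session names, and the node paths are illustrative assumptions.

```python
class ToyZooKeeper:
    def __init__(self):
        self.ephemeral = {}  # path -> id of the session that owns the node
        self.watchers = []   # callbacks invoked with the path of a deleted node

    def create_ephemeral(self, path, session):
        """Create the node if absent; return True if this session now owns it."""
        if path in self.ephemeral:
            return False
        self.ephemeral[path] = session
        return True

    def expire_session(self, session):
        """Delete every node the session owned, then notify watchers."""
        gone = [p for p, owner in self.ephemeral.items() if owner == session]
        for path in gone:
            del self.ephemeral[path]
        for path in gone:
            for callback in self.watchers:
                callback(path)


zk = ToyZooKeeper()
events = []
zk.watchers.append(events.append)

# Two masters race for the same node; only the first creation succeeds.
first_won = zk.create_ephemeral("/hbase/master", "master-1")
second_won = zk.create_ephemeral("/hbase/master", "master-2")

# A RegionServer registers itself, then its session expires.
zk.create_ephemeral("/hbase/rs/rs-1", "rs-1")
zk.expire_session("rs-1")

# The active master dies; the standby can now take over.
zk.expire_session("master-1")
standby_won = zk.create_ephemeral("/hbase/master", "master-2")
```

The watcher sees the RegionServer's node vanish, which is the signal to reassign its Regions, and the standby master succeeds only after the active master's node is gone.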
