HBase Architecture, Components, and Operations Overview

This article provides a comprehensive overview of Apache HBase’s architecture, detailing its core components such as RegionServer, HMaster, ZooKeeper, WAL, MemStore, and HFiles, and explains key processes including read/write paths, compaction, region splitting, load balancing, and recovery mechanisms.

Big Data Technology & Architecture

Aug 27, 2020

HBase Architecture Overview

Physically, HBase consists of three types of servers in a master‑slave mode: RegionServer, HBase HMaster, and ZooKeeper. RegionServers handle data read/write, HMaster manages region assignment and table creation/deletion, and ZooKeeper maintains cluster state and master election.

HDFS DataNodes store the data that RegionServers manage. RegionServers are co-located with DataNodes so that the data they serve is local. The NameNode maintains the metadata for the HDFS blocks that make up each file.

Regions

Tables are horizontally split into regions by row key range; each region contains all rows from its start key up to its end key. A RegionServer can serve up to about 1,000 regions.
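The mapping from a row key to its region can be illustrated with a sorted list of start keys. This is a minimal sketch, not HBase code: the region names and boundaries are hypothetical, and plain strings stand in for the byte arrays HBase compares.

```python
from bisect import bisect_right

# Hypothetical region boundaries: each region covers [start_key, next start_key).
# The empty string stands for "from the beginning of the table".
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_NAMES = ["region-1", "region-2", "region-3", "region-4"]


def find_region(row_key: str) -> str:
    """Return the name of the region whose key range contains row_key."""
    # bisect_right counts the start keys <= row_key; the last of those
    # start keys belongs to the owning region.
    index = bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_NAMES[index]


print(find_region("apple"))  # region-1
print(find_region("g"))      # region-2 (the start key is inclusive)
print(find_region("zebra"))  # region-4
```

Because the start keys are kept sorted, the lookup is a binary search rather than a scan over all regions.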

HBase HMaster

HMaster responsibilities include controlling RegionServer work (assigning regions at startup, rebalancing, monitoring via ZooKeeper) and managing tables (create, delete, update).

ZooKeeper

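ZooKeeper coordinates distributed state, monitors server liveness, provides notifications for failures, and conducts master election. The ensemble is normally run with an odd number of servers (for example three or five), because agreement requires a majority and adding one server to an odd-sized ensemble adds no failure tolerance.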
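The preference for an odd number of servers follows from majority arithmetic. The sketch below only does that arithmetic; the function names are made up for illustration.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


for n in (3, 4, 5, 6):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
# 6 4 2
```

Going from three servers to four, or from five to six, raises the quorum size without raising the number of failures the ensemble survives.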

Interaction Between Components

Each RegionServer creates an ephemeral node in ZooKeeper; HMaster watches these nodes to detect failures and to trigger recovery or re‑assignment. Active HMaster sends heartbeats; standby HMaster monitors the active one.

First Read/Write Operations

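On its first read or write, a client asks ZooKeeper which RegionServer hosts the META table, queries that server to find the RegionServer responsible for the target row key, and then sends the read or write to that RegionServer. The client caches both locations, so subsequent operations skip the lookup unless a cached location is stale, for example because the region has moved or its server has become unavailable.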

META Table

The META table stores region address information in a B‑tree‑like structure (key: region start key and ID; value: RegionServer).
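The lookup-then-cache behaviour described above can be sketched with a toy META table and client. Everything here is an assumption made for illustration: the class names, the host names, and the simplified entry layout (start key, end key, server) are not the real HBase client API.

```python
from bisect import bisect_right


class MetaTable:
    """Toy stand-in for META: region key ranges mapped to servers."""

    def __init__(self, regions):
        # regions: list of (start_key, end_key, server); end_key None = unbounded
        self.regions = sorted(regions, key=lambda r: r[0])
        self.lookups = 0  # counts how often META is consulted

    def locate(self, row_key):
        self.lookups += 1
        starts = [r[0] for r in self.regions]
        return self.regions[bisect_right(starts, row_key) - 1]


class Client:
    def __init__(self, meta):
        self.meta = meta
        self.cache = []  # region entries this client has already looked up

    def _cached(self, row_key):
        for start, end, server in self.cache:
            if start <= row_key and (end is None or row_key < end):
                return (start, end, server)
        return None

    def locate(self, row_key):
        entry = self._cached(row_key)
        if entry is None:  # cache miss: go to META once, then remember
            entry = self.meta.locate(row_key)
            self.cache.append(entry)
        return entry[2]

    def invalidate(self, row_key):
        """Drop a stale entry, e.g. after a request to a moved region fails."""
        stale = self._cached(row_key)
        if stale is not None:
            self.cache.remove(stale)


# Hypothetical host names.
meta = MetaTable([("", "m", "rs1.example:16020"), ("m", None, "rs2.example:16020")])
client = Client(meta)
client.locate("apple")
client.locate("banana")   # same region, served from the cache
print(meta.lookups)       # 1
client.invalidate("apple")
client.locate("apple")    # cache entry was dropped, so META is read again
print(meta.lookups)       # 2
```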

RegionServer Components

WAL (Write‑Ahead Log) for durability and recovery.

Block Cache for read caching.

MemStore for write buffering; one MemStore per column family.

HFiles stored on HDFS, containing sorted key‑value pairs.

Write Path

Step 1: Client PUT is written to WAL. Step 2: Data is stored in MemStore and the client receives acknowledgment. When MemStore reaches a threshold, its contents are flushed to a new HFile on HDFS.
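The two write steps and the flush can be modelled in a few lines. This is a simplified sketch under stated assumptions: the class and its fields are hypothetical, the flush threshold is a cell count rather than a size in bytes, and the flush runs inline instead of in the background.

```python
class RegionStore:
    """Toy model of one column family's write path (not HBase code)."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # append-only log; stands in for the WAL on HDFS
        self.memstore = {}   # in-memory write buffer: row key -> value
        self.hfiles = []     # each flush produces one immutable sorted file
        self.flush_threshold = flush_threshold  # hypothetical: number of cells

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # step 1: log the edit first
        self.memstore[row_key] = value     # step 2: buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()
        return "ack"

    def flush(self):
        # Write the buffered cells out in sorted key order, then start fresh.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}


store = RegionStore()
for key in ["c", "a", "b", "d"]:
    store.put(key, key.upper())

print(store.hfiles)    # [[('a', 'A'), ('b', 'B'), ('c', 'C')]]
print(store.memstore)  # {'d': 'D'}
print(len(store.wal))  # 4
```

Writing to the log before the in-memory buffer is what allows the buffered edits to be reconstructed if the server fails before a flush.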

HFile Structure and Indexing

HFiles contain multi‑level indexes (root, intermediate, leaf) and a meta block with bloom filters and timestamps, enabling efficient reads without scanning the entire file.
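A Bloom filter answers "might this file contain the row?" without reading the file, which is why it lets a read skip HFiles entirely. The sketch below is a generic Bloom filter, not the HFile implementation; the bit count, hash count, and use of SHA-256 are arbitrary choices for illustration.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive several bit positions per key by salting the hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # If any position is unset, the key was definitely never added.
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
for row in ("row-1", "row-2"):
    bloom.add(row)

print(bloom.might_contain("row-1"))    # True
print(bloom.might_contain("row-999"))  # very likely False; a False means "skip this file"
```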

Read Path and Read Amplification

Reads first check Block Cache, then MemStore, and finally HFiles (using indexes and bloom filters). Because data may reside in multiple HFiles, read amplification can occur.
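Read amplification can be made concrete by counting how many files a single lookup touches. The function below is a simplified model, assuming each store is a dict from row key to a (timestamp, value) pair; it omits the indexes and Bloom filters that would let a real read skip files.

```python
def read_cell(row_key, block_cache, memstore, hfiles):
    """Return (newest value or None, number of HFiles examined)."""
    candidates = []
    for source in (block_cache, memstore):
        if row_key in source:
            candidates.append(source[row_key])

    files_checked = 0
    for hfile in hfiles:
        files_checked += 1  # every file that may hold the row adds I/O
        if row_key in hfile:
            candidates.append(hfile[row_key])

    if not candidates:
        return None, files_checked
    # Several versions may exist across stores; the highest timestamp wins.
    _, value = max(candidates, key=lambda cell: cell[0])
    return value, files_checked


block_cache = {}
memstore = {"row1": (30, "v3")}
hfiles = [
    {"row1": (10, "v1")},
    {"row1": (20, "v2"), "row2": (15, "x")},
]

print(read_cell("row1", block_cache, memstore, hfiles))  # ('v3', 2)
print(read_cell("row2", block_cache, memstore, hfiles))  # ('x', 2)
print(read_cell("row9", block_cache, memstore, hfiles))  # (None, 2)
```

The count grows with the number of HFiles, which is the cost that compaction reduces.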

Compaction

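Minor compaction merges a few small HFiles into fewer, larger ones. Major compaction rewrites all HFiles of a column family within a region into a single file and discards deleted and expired cells. Because it rewrites every file, it generates heavy disk I/O and network traffic (write amplification) and can degrade performance while it runs, so it is usually scheduled for off-peak hours.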
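A major compaction is essentially a merge of sorted files that drops obsolete cells. The sketch below assumes each file is a list of (row_key, timestamp, value) tuples sorted by row key and then by descending timestamp, uses None as a hypothetical delete marker, and keeps only the newest version of each row; real version retention and expiry rules are not modelled.

```python
import heapq

TOMBSTONE = None  # hypothetical marker for a deleted cell


def major_compact(hfiles):
    """Merge sorted files into one, keeping the newest live cell per row."""
    # Order by row key, then newest timestamp first.
    merged = heapq.merge(*hfiles, key=lambda cell: (cell[0], -cell[1]))
    result = []
    last_key = object()  # sentinel that equals no row key
    for row_key, timestamp, value in merged:
        if row_key == last_key:
            continue  # an older version of a row already handled
        last_key = row_key
        if value is not TOMBSTONE:
            result.append((row_key, timestamp, value))
    return result


older = [("a", 1, "a1"), ("b", 1, "b1"), ("c", 1, "c1")]
newer = [("a", 2, "a2"), ("b", 2, TOMBSTONE)]

print(major_compact([older, newer]))
# [('a', 2, 'a2'), ('c', 1, 'c1')]
```

Row "b" disappears because its newest cell is a delete marker, and the older "a" cell is dropped in favour of the newer one.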

Region Splitting and Load Balancing

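When a region grows beyond the configured maximum size (about 1 GB in this description; the limit is configurable), it is split into two daughter regions, each holding roughly half of the original key range. Both daughters initially stay on the same RegionServer; HMaster may later move them to other RegionServers for load balancing. A moved region is served from remote HDFS blocks until a subsequent major compaction rewrites its data onto the new server's local DataNode.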
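The effect of a split on key ranges can be sketched as follows. This is an illustration only: the region is a plain dict with hypothetical fields, and the split point is simply the middle row key, which is an assumption rather than a description of how HBase chooses it.

```python
def split_region(region):
    """Split a region at its middle row key into two adjacent regions."""
    rows = region["rows"]  # assumed sorted
    split_key = rows[len(rows) // 2]
    left = {
        "start": region["start"],
        "end": split_key,  # end keys are exclusive
        "rows": [r for r in rows if r < split_key],
    }
    right = {
        "start": split_key,  # start keys are inclusive
        "end": region["end"],
        "rows": [r for r in rows if r >= split_key],
    }
    return left, right


parent = {"start": "", "end": None, "rows": ["a", "c", "f", "k", "p", "t"]}
left, right = split_region(parent)

print(left["end"], left["rows"])      # k ['a', 'c', 'f']
print(right["start"], right["rows"])  # k ['k', 'p', 't']
```

The two daughters together cover exactly the parent's key range, so a region lookup by row key still resolves to one region.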

Data Replication and Recovery

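HDFS replicates WAL and HFile blocks, three copies by default, for reliability. When a RegionServer fails, its ephemeral node in ZooKeeper expires and HMaster is notified. HMaster reassigns the failed server's regions to live RegionServers and splits its WAL by region; each new RegionServer then replays its share of the WAL to rebuild the MemStore edits that had not yet been flushed to HFiles.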
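Recovery from the WAL amounts to grouping the logged edits by region and re-applying them in order. The sketch below assumes each WAL entry is a (region, row_key, value) tuple and that the log contains only edits not yet flushed; both function names are hypothetical.

```python
from collections import defaultdict


def split_wal(wal_entries):
    """Group a failed server's WAL entries by region, preserving order."""
    per_region = defaultdict(list)
    for region, row_key, value in wal_entries:
        per_region[region].append((row_key, value))
    return dict(per_region)


def replay(edits):
    """Rebuild a MemStore by applying edits in their original order."""
    memstore = {}
    for row_key, value in edits:
        memstore[row_key] = value  # later edits overwrite earlier ones
    return memstore


wal = [
    ("r1", "a", "1"),
    ("r2", "x", "9"),
    ("r1", "a", "2"),
    ("r1", "b", "3"),
]

by_region = split_wal(wal)
print(by_region["r2"])          # [('x', '9')]
print(replay(by_region["r1"]))  # {'a': '2', 'b': '3'}
```

Preserving the original order within each region is what makes the rebuilt MemStore match the state before the failure.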

Advantages and Disadvantages of HBase

Strong consistency model.

Automatic scaling via region splitting.

Built‑in recovery using WAL.

Good integration with Hadoop/MapReduce.

Drawbacks: slow WAL replay, slow and complex crash recovery, and resource-intensive major compaction.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
