In‑Depth Overview of HBase Architecture
This article provides a comprehensive, illustrated explanation of Apache HBase's architecture, covering its master‑slave components, region management, Zookeeper coordination, data flow for reads and writes, storage structures, compaction processes, fault recovery, and the system's strengths and limitations within the Hadoop ecosystem.
HBase Architecture Components
Physically, HBase follows a master‑slave architecture composed of three server types: RegionServer (serves read/write requests directly from clients), HBase Master (assigns regions and performs DDL operations), and Zookeeper (a separate coordination service, not part of HDFS, that maintains live cluster state).
Underlying storage relies on Hadoop HDFS: DataNode stores the data managed by RegionServers, providing data locality; NameNode holds metadata for all HDFS blocks.
Regions
HBase tables are horizontally split into regions based on rowkey ranges. Each region has a start key and end key and is managed by a RegionServer; a typical RegionServer can handle about 1,000 regions.
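To make the rowkey-range idea concrete, here is a minimal sketch of how a rowkey maps to a region. The region names and split points are made up for illustration, and the lookup is a toy model, not HBase's client code.

```python
import bisect

# Hypothetical region table: (start_key, region_name), sorted by start key.
# The empty start key marks the first region; a region's end key is the
# start key of the next region.
REGIONS = [("", "region-a"), ("g", "region-b"), ("n", "region-c"), ("t", "region-d")]
START_KEYS = [start for start, _ in REGIONS]

def find_region(rowkey: str) -> str:
    """Return the region whose [start_key, end_key) range contains rowkey."""
    idx = bisect.bisect_right(START_KEYS, rowkey) - 1
    return REGIONS[idx][1]
```

Because regions are sorted and non-overlapping, a binary search over the start keys is enough to find the owner of any rowkey.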
HBase Master
Also called HMaster, it is responsible for region allocation, DDL operations, and monitoring all RegionServers via Zookeeper notifications. It provides APIs for creating, deleting, and updating tables.
Zookeeper
Zookeeper acts as the distributed coordination service, maintaining the health status of all servers and notifying participants of failures. Its consensus protocol needs a majority of servers to agree, so an ensemble normally runs on an odd number of machines, typically three or five.
How the Components Work Together
RegionServers and the active HMaster maintain sessions with Zookeeper. Zookeeper creates ephemeral nodes for each RegionServer; the master watches these nodes to detect available servers and failures. When a server crashes, its ephemeral node disappears, prompting the active master to reassign regions.
Commentary: Zookeeper is the communication bridge; all participants keep heartbeats with it and obtain cluster state information, a core concept in distributed system design.
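The failure-detection flow above can be sketched with a small in-memory model. `ToyCoordinator` and `ToyMaster` are hypothetical stand-ins for Zookeeper and the HMaster; the round-robin reassignment is an assumption made for the example, not HBase's actual balancing policy.

```python
class ToyCoordinator:
    """Stand-in for Zookeeper: tracks ephemeral nodes and notifies watchers."""
    def __init__(self):
        self.ephemeral = set()
        self.watchers = []

    def register(self, server):
        self.ephemeral.add(server)

    def watch(self, callback):
        self.watchers.append(callback)

    def session_expired(self, server):
        # The ephemeral node disappears with the session; watchers are notified.
        self.ephemeral.discard(server)
        for callback in self.watchers:
            callback(server)

class ToyMaster:
    def __init__(self, coordinator, assignments):
        self.coordinator = coordinator
        self.assignments = assignments  # region -> server
        coordinator.watch(self.on_server_lost)

    def on_server_lost(self, dead):
        live = sorted(self.coordinator.ephemeral)
        if not live:
            return  # nothing to reassign to
        orphaned = sorted(r for r, s in self.assignments.items() if s == dead)
        for i, region in enumerate(orphaned):
            self.assignments[region] = live[i % len(live)]  # assumed round-robin

zk = ToyCoordinator()
for name in ("rs1", "rs2", "rs3"):
    zk.register(name)
master = ToyMaster(zk, {"r1": "rs1", "r2": "rs2", "r3": "rs1", "r4": "rs3"})
zk.session_expired("rs1")  # simulate a RegionServer crash
```

After the simulated crash, the two regions that lived on `rs1` are spread across the surviving servers.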
First Read and Write Operations
The special Meta table stores the location of every region. When a client issues its first read/write, it obtains the RegionServer responsible for the Meta table from Zookeeper, then queries that server to locate the RegionServer handling the target rowkey, caching this information for subsequent requests.
Commentary: The client’s read/write consists of two steps—locating the region via the Meta table and then accessing the appropriate RegionServer.
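The value of the client-side cache is easiest to see by counting Meta lookups. The sketch below is a toy model: `ToyClient`, `meta_lookup`, and the two-region layout are all hypothetical, and an empty end key is assumed to mean "no upper bound".

```python
class ToyClient:
    def __init__(self, meta_lookup):
        self.meta_lookup = meta_lookup  # callable: rowkey -> (start, end, server)
        self.cache = []                 # cached (start, end, server) entries
        self.meta_queries = 0

    def locate(self, rowkey):
        # Serve from the cache when a cached region range covers the rowkey.
        for start, end, server in self.cache:
            if start <= rowkey and (end == "" or rowkey < end):
                return server
        # Cache miss: ask the Meta table once and remember the answer.
        self.meta_queries += 1
        entry = self.meta_lookup(rowkey)
        self.cache.append(entry)
        return entry[2]

def meta_lookup(rowkey):
    # Hypothetical Meta contents: two regions split at "m".
    return ("", "m", "rs1") if rowkey < "m" else ("m", "", "rs2")

client = ToyClient(meta_lookup)
first = client.locate("apple")    # goes to Meta
second = client.locate("banana")  # same region, served from the cache
third = client.locate("zebra")    # different region, goes to Meta again
```

Only two Meta queries are issued for three requests, because the second rowkey falls in a region the client has already located.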
HBase Meta Table
The Meta table is a special HBase table that holds a list of all regions. Its key consists of table name, region start key, and region ID; the value is the RegionServer address.
RegionServer Components
A RegionServer runs on an HDFS DataNode and comprises:
WAL (Write‑Ahead Log) – a log file on HDFS that persists data not yet flushed to HFiles and is replayed for crash recovery.
BlockCache – an LRU in‑memory cache for frequently accessed data.
MemStore – an in‑memory write buffer for each column family.
HFile – on‑disk storage of ordered KeyValue pairs.
Commentary: Understanding the RegionServer’s internal components is crucial for grasping HBase’s overall architecture.
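As an illustration of the LRU policy mentioned for the BlockCache, here is a minimal sketch. `ToyBlockCache` is a hypothetical class that counts capacity in blocks; the real cache is sized in bytes and is more elaborate.

```python
from collections import OrderedDict

class ToyBlockCache:
    def __init__(self, capacity):
        self.capacity = capacity       # maximum number of cached blocks
        self.blocks = OrderedDict()    # oldest entry first, newest last

    def get(self, block_id):
        if block_id not in self.blocks:
            return None
        self.blocks.move_to_end(block_id)    # mark as most recently used
        return self.blocks[block_id]

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used

cache = ToyBlockCache(capacity=2)
cache.put("b1", "x")
cache.put("b2", "y")
cache.get("b1")        # b1 becomes the most recently used block
cache.put("b3", "z")   # over capacity, so b2 is evicted
```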
HBase Write Path
When a client issues a Put, the data is first appended to the WAL, then added to the MemStore. The server acknowledges the write after these steps.
Commentary: The order (WAL → MemStore) is essential; reversing it would risk data loss on a crash.
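A small simulation shows why the ordering matters. `ToyRegionServer` is a hypothetical in-memory model: the list stands in for the WAL file on HDFS, and `crash` simply discards everything held in memory.

```python
class ToyRegionServer:
    def __init__(self):
        self.wal = []        # stand-in for the append-only log on HDFS
        self.memstore = {}   # (row, column) -> value, lost on a crash
        self.next_seq = 1

    def put(self, row, column, value):
        seq = self.next_seq
        self.next_seq += 1
        self.wal.append((seq, row, column, value))  # step 1: log the edit
        self.memstore[(row, column)] = value        # step 2: buffer in memory
        return seq                                  # step 3: acknowledge

    def crash(self):
        self.memstore = {}   # memory is gone; the WAL survives

    def recover(self):
        for _seq, row, column, value in self.wal:
            self.memstore[(row, column)] = value

rs = ToyRegionServer()
ack = rs.put("row1", "cf:a", "v1")
rs.put("row1", "cf:a", "v2")
rs.crash()
rs.recover()
```

Every acknowledged edit is already in the log before it reaches memory, so replaying the log restores the MemStore contents after the crash.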
HBase MemStore
MemStore caches updates in memory as ordered KeyValues, mirroring the on‑disk HFile format. Each column family has its own MemStore.
HBase Region Flush
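When a MemStore accumulates enough data, the ordered dataset is written to a new HFile on HDFS. Because each column family has its own MemStore, each flush produces a separate HFile per column family, so a column family accumulates several HFiles over time. The flush also records the maximum sequence number written, so the system knows which edits have already been persisted when it recovers.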
Commentary: Sequence numbers act as commit points, indicating which data has been persisted.
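A flush can be sketched as "sort, write, remember the highest sequence number". The data layout below (a dict of cells and a list of HFile dicts) is an assumption made for the example, not the on-disk format.

```python
def flush(memstore, hfiles):
    """Write the MemStore as a new sorted 'HFile' and return its max sequence number.

    memstore: dict mapping (row, column) -> (seq, value)
    hfiles:   list that receives one dict per flushed file
    """
    if not memstore:
        return None
    cells = sorted((key, seq, value) for key, (seq, value) in memstore.items())
    max_seq = max(seq for _key, seq, _value in cells)
    hfiles.append({"cells": cells, "max_seq": max_seq})
    memstore.clear()   # the buffered edits now live in the new file
    return max_seq

memstore = {
    ("row2", "cf:a"): (2, "v2"),
    ("row1", "cf:a"): (1, "v1"),
    ("row1", "cf:b"): (3, "v3"),
}
hfiles = []
flushed_seq = flush(memstore, hfiles)
```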
HBase HFile
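Data is stored in HFiles as sorted Key/Value pairs. When a MemStore fills, its entire contents are written to a new HFile in one sequential pass. Sequential writes are fast because they avoid disk seeks, and they suit HDFS, which supports appends but not random writes.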
HFile Index Structure
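HFiles use a multi‑level index that resembles a B+ tree, so a lookup can find a key without scanning the whole file. KeyValues are stored in sorted order in data blocks (64 KB by default); each block has its own leaf index; the last key of each block is recorded in an intermediate index; and a root index points to the intermediate index. The trailer, written at the end of the file, points to the meta blocks and also holds information such as Bloom filters and time‑range metadata.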
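The benefit of indexing blocks by key can be shown with a single-level toy index. The blocks and keys below are made up, and the index follows this article's description (last key of each block); a real HFile adds further index levels on top of the same idea.

```python
import bisect

# Hypothetical data blocks, each a sorted list of (rowkey, value) pairs.
BLOCKS = [
    [("a", 1), ("c", 2)],
    [("f", 3), ("h", 4)],
    [("m", 5), ("q", 6)],
]
# One index entry per block: that block's last key.
INDEX = [block[-1][0] for block in BLOCKS]   # ["c", "h", "q"]

def lookup(rowkey):
    # Find the first block whose last key is >= rowkey.
    i = bisect.bisect_left(INDEX, rowkey)
    if i == len(BLOCKS):
        return None                # rowkey is beyond the file's last key
    for key, value in BLOCKS[i]:   # scan only that one block
        if key == rowkey:
            return value
    return None
```

A lookup touches the small index and exactly one data block, regardless of how many blocks the file contains.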
HBase Read Merge
A read operation merges cells from three sources: BlockCache (LRU cache of recently read cells), MemStore (recent writes), and HFile (persisted data). If a cell is not found in the first two, the scanner uses the block index and Bloom filter to load the appropriate HFile.
This can cause read amplification when multiple HFiles must be consulted for the same row.
Commentary: Multiple HFiles for the same rowkey lead to read amplification; compaction mitigates this.
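Read amplification can be made visible by counting how many files a single read touches. This is a simplified sketch: `read_cell` and the dict-based stores are hypothetical, every HFile is checked (no Bloom filter is modeled), and the newest timestamp wins.

```python
def read_cell(key, memstore, hfiles):
    """Return (value, files_checked) for the newest version of key.

    memstore and each entry of hfiles map key -> (timestamp, value).
    """
    candidates = []
    if key in memstore:
        candidates.append(memstore[key])
    files_checked = 0
    for hfile in hfiles:
        files_checked += 1           # each extra HFile is one more lookup
        if key in hfile:
            candidates.append(hfile[key])
    if not candidates:
        return None, files_checked
    newest = max(candidates, key=lambda cell: cell[0])
    return newest[1], files_checked

hfiles = [{"r1": (1, "old")}, {"r1": (5, "new")}, {"r2": (3, "x")}]
from_files = read_cell("r1", {}, hfiles)
from_memstore = read_cell("r1", {"r1": (9, "latest")}, hfiles)
```

With three HFiles on disk, one logical read costs three file lookups, which is the overhead that compaction reduces.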
HBase Minor Compaction
Small HFiles are periodically merged into fewer larger files using a merge‑sort algorithm, reducing the total number of HFiles.
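Since every HFile is already sorted, merging them is the merge step of merge sort. The sketch below uses Python's `heapq.merge`; the tuple layout is an assumption for the example, and cells are ordered by ascending timestamp for simplicity.

```python
import heapq

def minor_compact(hfiles):
    """Merge several sorted lists of (rowkey, timestamp, value) into one sorted list."""
    return list(heapq.merge(*hfiles))

small_a = [("r1", 1, "a"), ("r3", 2, "c")]
small_b = [("r2", 3, "b"), ("r3", 5, "c2")]
merged = minor_compact([small_a, small_b])
```

Note that both versions of `r3` survive: this toy minor compaction only reduces the number of files and does not discard any cells.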
HBase Major Compaction
All HFiles under a column family are rewritten into a single large HFile, permanently removing deleted or expired cells and improving read performance. This process incurs significant I/O and network traffic, known as write amplification, and is usually scheduled during off‑peak hours.
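A major compaction differs from a minor one in that it rewrites everything and drops dead cells. The following sketch assumes a hypothetical `TOMBSTONE` marker for deletes and a single TTL value; it keeps only the newest version of each key.

```python
TOMBSTONE = object()   # hypothetical marker for a deleted cell

def major_compact(hfiles, now, ttl):
    """Rewrite all HFiles into one, dropping deleted and expired cells.

    hfiles: list of dicts mapping rowkey -> (timestamp, value)
    """
    newest = {}
    for hfile in hfiles:
        for key, (ts, value) in hfile.items():
            if key not in newest or ts > newest[key][0]:
                newest[key] = (ts, value)
    return {
        key: (ts, value)
        for key, (ts, value) in sorted(newest.items())
        if value is not TOMBSTONE and now - ts <= ttl
    }

files = [
    {"r1": (1, "a"), "r2": (2, "b")},
    {"r1": (5, TOMBSTONE), "r3": (90, "c")},
    {"r2": (95, "b2")},
]
compacted = major_compact(files, now=100, ttl=50)
```

Here `r1` disappears because its newest version is a delete marker, while `r2` and `r3` survive with their latest values.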
Region = Contiguous Keys
Each HBase table is horizontally split into regions, each covering a contiguous range of sorted rows bounded by a start key and an end key. The source material cites a default region size of 1 GB; the limit is configurable, and later HBase releases ship a larger default. A RegionServer can manage roughly 1,000 regions.
Region Splitting
When a region grows too large, it splits into two child regions, each holding half the data. The split is reported to the HMaster, which may move the new regions to other RegionServers for load balancing.
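A split can be sketched as choosing a middle rowkey and dividing the key range there. The region dict and `split_region` helper are hypothetical, and the midpoint is taken by row count for simplicity.

```python
def split_region(region):
    """Split a region at its middle rowkey into two child regions.

    region: dict with 'start', 'end', and a sorted list 'rows'.
    """
    rows = region["rows"]
    mid = rows[len(rows) // 2]   # split point becomes the boundary key
    left = {"start": region["start"], "end": mid,
            "rows": [r for r in rows if r < mid]}
    right = {"start": mid, "end": region["end"],
             "rows": [r for r in rows if r >= mid]}
    return left, right

parent = {"start": "a", "end": "z", "rows": ["b", "d", "f", "h"]}
left, right = split_region(parent)
```

The left child ends where the right child starts, so together they cover exactly the parent's key range with no gap or overlap.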
Read Load Balancing
After splitting, the HMaster may reassign the new regions to different RegionServers. Those servers then read the region's data from remote HDFS DataNodes until a major compaction rewrites the files onto storage local to the RegionServer.
Commentary: The migration here is logical—assigning a region to a different server—not physical data movement.
HDFS Data Replication
All reads and writes occur on HDFS DataNodes. HDFS automatically replicates WAL and HFile blocks (default three copies) to ensure data durability.
HBase Fault Recovery
If a RegionServer crashes, its regions become unavailable until the failure is detected via Zookeeper heartbeats. The HMaster then reassigns those regions to healthy servers. To recover unflushed data, the master splits the WAL into fragments, distributes them to the new servers, and replays them into MemStores.
WAL entries are ordered modifications (puts or deletes) written sequentially to the file tail. During recovery, the WAL is replayed: modifications are read, sorted, applied to MemStore, and eventually flushed to HFiles.
Commentary: WAL is the cornerstone of HBase reliability; during recovery, its fragments are replayed to rebuild MemStores on new RegionServers.
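The recovery steps can be sketched as two small functions: split the log by region, then replay only the edits that were not yet flushed. The tuple layout, region names, and the use of a per-region "last flushed sequence number" are assumptions for this example.

```python
from collections import defaultdict

def split_wal(wal):
    """Group WAL entries (seq, region, row, value) by region, preserving order."""
    by_region = defaultdict(list)
    for entry in wal:
        by_region[entry[1]].append(entry)
    return dict(by_region)

def replay(entries, flushed_max_seq):
    """Rebuild a MemStore from edits newer than the last flushed sequence number."""
    memstore = {}
    for seq, _region, row, value in sorted(entries):
        if seq > flushed_max_seq:   # older edits are already in an HFile
            memstore[row] = value
    return memstore

wal = [
    (1, "regA", "r1", "a"),
    (2, "regB", "r9", "x"),
    (3, "regA", "r1", "a2"),
    (4, "regA", "r2", "b"),
]
fragments = split_wal(wal)
recovered_a = replay(fragments["regA"], flushed_max_seq=1)
recovered_b = replay(fragments["regB"], flushed_max_seq=2)
```

Region `regA` regains the two edits made after its last flush, while `regB` needs nothing replayed because its only edit had already been persisted.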
Advantages of Apache HBase
Strong consistency – once a write returns, all readers see the same value.
Automatic scalability – regions split as data grows; data is stored on HDFS with built‑in replication.
Built‑in recovery – uses Write‑Ahead Log for crash recovery.
Integration with Hadoop – MapReduce jobs can process HBase data directly.
Disadvantages of Apache HBase
WAL replay can be slow.
Fault recovery may be time‑consuming.
Major compaction causes I/O spikes.
Source: https://segmentfault.com/a/1190000019959411