Understanding HBase’s Physical Architecture: Regions, Stores, and WAL
This article explains HBase’s internal architecture: the roles of the Client, ZooKeeper, the Master, and RegionServers; the physical storage layout; the StoreFile and HFile structures; and the Write-Ahead Log (WAL) mechanism that provides data durability and fault tolerance.
HRegionServer Workflow
HRegionServer opens regions and creates an HRegion instance for each one. For each column family (HColumnFamily) of the table, the HRegion creates a Store, which contains zero or more StoreFile objects (lightweight wrappers around HFile). Each Store is paired with a MemStore. A write goes to the HLog first and then to the MemStore. Because MemStore space is limited, a full MemStore is flushed to a new StoreFile, which is persisted as an HFile on HDFS.
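The write path described above can be sketched as a toy model. This is a minimal sketch, not HBase code: the class name `ToyStore`, the `flush_threshold` parameter (counted in entries rather than bytes), and the in-memory lists standing in for the HLog and for HFiles on HDFS are all assumptions made for illustration.

```python
class ToyStore:
    """Toy model of one Store: a log, a MemStore, and immutable StoreFiles."""

    def __init__(self, flush_threshold=3):
        self.wal = []          # stands in for the HLog (append-only)
        self.memstore = {}     # in-memory buffer of recent writes
        self.storefiles = []   # each flush adds one sorted, immutable file
        self.flush_threshold = flush_threshold  # hypothetical limit, in entries

    def put(self, rowkey, value):
        self.wal.append((rowkey, value))  # 1. log the edit first
        self.memstore[rowkey] = value     # 2. then apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the MemStore as a new sorted StoreFile and clear the buffer.
        if self.memstore:
            self.storefiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, rowkey):
        # Newest data wins: check the MemStore, then StoreFiles newest-first.
        if rowkey in self.memstore:
            return self.memstore[rowkey]
        for storefile in reversed(self.storefiles):
            for key, value in storefile:
                if key == rowkey:
                    return value
        return None
```

Reading newest-first is what lets a later write to the same rowkey shadow an older value that still sits in an earlier StoreFile.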
Client
The client is the entry point for accessing an HBase cluster. It uses HBase’s RPC mechanism to communicate with HMaster for administrative operations and with RegionServers for read/write operations. The client maintains caches (e.g., region location cache) to speed up access.
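The effect of the region location cache can be illustrated with a small sketch. `CachingLocator` and the `lookup_fn` callback are hypothetical names, not part of the HBase client API; the callback stands in for whatever remote lookup the real client performs.

```python
class CachingLocator:
    """Caches region locations so repeated lookups skip the remote call."""

    def __init__(self, lookup_fn):
        # Hypothetical remote lookup: (table, region_start_key) -> server name.
        self._lookup_fn = lookup_fn
        self._cache = {}
        self.remote_lookups = 0

    def locate(self, table, region_start_key):
        key = (table, region_start_key)
        if key not in self._cache:
            self.remote_lookups += 1
            self._cache[key] = self._lookup_fn(table, region_start_key)
        return self._cache[key]

    def invalidate(self, table, region_start_key):
        # Drop a stale entry, e.g. after the region has moved to another server.
        self._cache.pop((table, region_start_key), None)
```

Only the first request for a region pays for the remote lookup; later requests are served from memory until the entry is invalidated.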
ZooKeeper
ZooKeeper guarantees that only one Master is active at a time, stores the entry point for locating regions, monitors RegionServer status, and holds schema and table metadata. It also provides fault tolerance for the Master by electing a new Master if the current one fails.
Master
The Master handles table DDL (create, delete, alter), assigns new regions after splits, balances region load across RegionServers, and reassigns regions when a RegionServer goes down. If the Master fails, metadata cannot be modified, but data reads and writes continue.
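Reassignment after a RegionServer failure can be sketched as follows. This is an illustrative simplification, assuming the Master simply hands each orphaned region to the live server with the fewest regions; the function name and the dictionary layout are hypothetical, and the real assignment logic is more involved.

```python
def reassign_regions(assignments, dead_server):
    """Return a new assignment map with the dead server's regions spread
    over the surviving servers, always picking the least-loaded one."""
    survivors = {s: list(r) for s, r in assignments.items() if s != dead_server}
    if not survivors:
        raise ValueError("no live RegionServers left")
    for region in assignments.get(dead_server, []):
        # Ties are broken by server name so the result is deterministic.
        target = min(survivors, key=lambda s: (len(survivors[s]), s))
        survivors[target].append(region)
    return survivors
```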
RegionServer
RegionServers manage the I/O for their assigned regions, handle region splits when a region grows too large, and serve client read/write requests directly without involving the Master.
Physical Storage Model
Rows in a table are sorted by rowkey. Tables are split horizontally into multiple regions, each defined by a [startkey, endkey) range. Regions grow until a size threshold is reached, then split into two new regions. Regions are the smallest unit for distribution and load balancing, but they are not the smallest storage unit.
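The [startkey, endkey) layout and the split behavior can be sketched with a sorted list of start keys. This is a toy model under stated assumptions: `ToyTable` is a hypothetical name, and the threshold counts rows, whereas the text above describes a size threshold.

```python
import bisect

class ToyTable:
    """Regions as sorted start keys; region i covers [start_keys[i], start_keys[i+1])."""

    def __init__(self):
        self.start_keys = [""]  # one region covering the whole key space
        self.rows = [[]]        # sorted rowkeys held by each region

    def region_index(self, rowkey):
        # The owning region is the last one whose start key is <= rowkey.
        return bisect.bisect_right(self.start_keys, rowkey) - 1

    def insert(self, rowkey, max_rows=4):
        i = self.region_index(rowkey)
        bisect.insort(self.rows[i], rowkey)
        if len(self.rows[i]) > max_rows:  # hypothetical threshold, in rows
            self.split(i)

    def split(self, i):
        rows = self.rows[i]
        mid = len(rows) // 2
        # The middle rowkey becomes the start key of the new right-hand region.
        self.start_keys.insert(i + 1, rows[mid])
        self.rows[i:i + 1] = [rows[:mid], rows[mid:]]
```

Because rows are sorted by rowkey, locating the owning region is a binary search over the start keys.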
1. A region consists of one or more Stores; each Store holds a column family.
2. Each Store contains a MemStore and zero or more StoreFiles.
3. MemStore resides in memory; StoreFiles reside on HDFS.
Table and Region Internal Structure
1. A table is divided into multiple regions, each assigned to a specific RegionServer.
2. Within a region, data is further divided by column family into Stores (HStore).
3. Each Store’s data is persisted in several HFiles.
4. Regions grow as data is inserted and split when they exceed a size threshold.
5. As regions split, a RegionServer may end up managing an increasing number of regions.
6. HMaster balances load based on the number of regions per RegionServer.
7. Reads check the in-memory MemStore first, then the StoreFiles.
8. MemStore data is flushed to a new StoreFile when the MemStore fills up.
9. StoreFiles accumulate over time; RegionServers periodically merge (compact) many StoreFiles into fewer files to reduce the file count.
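The merging of StoreFiles in the last step can be sketched as a k-way merge of sorted files. This is a minimal sketch, assuming each StoreFile is a sorted list of (rowkey, value) pairs with unique rowkeys and that files are passed oldest first; delete markers and multiple versions are left out, and the function name is hypothetical.

```python
import heapq

def compact(storefiles):
    """Merge sorted StoreFiles (oldest first) into one sorted file,
    keeping only the newest value for each rowkey."""
    # Tag entries with the negated file position so that, for equal rowkeys,
    # the entry from the newer file sorts first.
    tagged = (
        [(key, -age, value) for key, value in storefile]
        for age, storefile in enumerate(storefiles)
    )
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if not merged or merged[-1][0] != key:
            merged.append((key, value))
    return merged
```

Since every input is already sorted, the merge streams through the files once instead of re-sorting all the data.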
StoreFile Structure
A StoreFile (HFile) consists of several sections:
Data Block: stores table data, optionally compressed.
Meta Block (optional): stores user-defined key/value pairs, optionally compressed.
File Info: metadata for the HFile, not compressed; users can add custom metadata here.
Data Block Index: index of the Data Blocks; each entry’s key is the key of the first record in that block.
Meta Block Index (optional): index of the Meta Blocks.
Trailer: fixed-length section at the end of the file containing the offsets of all other sections; it enables fast block lookup without scanning the whole file.
Data and Meta blocks can be compressed (for example with Gzip or LZO) to reduce network and disk I/O, at the cost of CPU time for compression and decompression. The Trailer and FileInfo sections are never compressed.
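Because each Data Block Index entry is keyed by the first record of its block, finding the block that may hold a rowkey is a binary search. The sketch below assumes the index has already been loaded into a sorted Python list; `find_block` is a hypothetical helper, not an HBase function.

```python
import bisect

def find_block(index_keys, rowkey):
    """index_keys holds the first key of each Data Block, in sorted order.
    Return the position of the only block that could contain rowkey,
    or None if rowkey sorts before the first block."""
    pos = bisect.bisect_right(index_keys, rowkey) - 1
    return pos if pos >= 0 else None
```

Only the selected block then needs to be read and, if compressed, decompressed.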
HFile Format Details
An HFile has variable length; only the Trailer and FileInfo sections have a fixed size. The Trailer holds pointers to the start of the other sections. It is written when the file is closed, after which the file is immutable. FileInfo records metadata such as AVG_KEY_LEN, AVG_VALUE_LEN, LAST_KEY, COMPARATOR, and MAX_SEQ_ID_KEY. The Data Block size can be set when the table is created (it is a column family attribute): larger blocks favor sequential scans, while smaller blocks favor random reads.
Each Data Block begins with a magic number followed by a sequence of KeyValue pairs. The magic number helps detect corruption.
KeyValue Format
Each KeyValue consists of:
KeyLength and ValueLength: fixed-size fields indicating the lengths of the key and value.
Key: composed of RowLength (fixed size), RowKey, ColumnFamilyLength, ColumnFamily, Qualifier, Timestamp, and KeyType (Put/Delete).
Value: raw binary data without additional structure.
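The layout above can be made concrete with a small encoder and decoder. This is a sketch under stated assumptions: the field widths (4-byte KeyLength and ValueLength, 2-byte RowLength, 1-byte ColumnFamilyLength, 8-byte Timestamp, 1-byte KeyType) and the type codes for Put and Delete follow the classic KeyValue layout as commonly documented, not anything stated in this article, and the function names are hypothetical.

```python
import struct

PUT, DELETE = 4, 8  # type codes assumed from HBase's KeyValue.Type

def encode_keyvalue(row, family, qualifier, timestamp, key_type, value):
    key = (
        struct.pack(">H", len(row)) + row          # RowLength (2 bytes) + RowKey
        + struct.pack(">B", len(family)) + family  # ColumnFamilyLength (1 byte) + ColumnFamily
        + qualifier                                # Qualifier (its length is implied)
        + struct.pack(">qB", timestamp, key_type)  # Timestamp (8 bytes) + KeyType (1 byte)
    )
    # KeyLength and ValueLength are fixed 4-byte fields in front of the key.
    return struct.pack(">ii", len(key), len(value)) + key + value

def decode_keyvalue(buf):
    key_len, value_len = struct.unpack_from(">ii", buf, 0)
    key = buf[8:8 + key_len]
    value = buf[8 + key_len:8 + key_len + value_len]
    (row_len,) = struct.unpack_from(">H", key, 0)
    row = key[2:2 + row_len]
    fam_len = key[2 + row_len]
    fam_start = 3 + row_len
    family = key[fam_start:fam_start + fam_len]
    qualifier = key[fam_start + fam_len:-9]  # everything before Timestamp + KeyType
    timestamp, key_type = struct.unpack(">qB", key[-9:])
    return row, family, qualifier, timestamp, key_type, value
```

The qualifier needs no length field of its own: once the key length, row, and family are known, it is whatever remains before the fixed 9-byte tail.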
ZooKeeper’s Role
1. HBase relies on ZooKeeper; by default HBase manages ZooKeeper start/stop.
2. Masters and RegionServers register themselves with ZooKeeper on startup.
3. ZooKeeper eliminates the Master as a single point of failure.
Locating a RegionServer
1. ZooKeeper provides the location of the -ROOT- table.
2. The -ROOT- table points to the .META. table (stored on a single region).
3. The .META. table contains the actual locations of user tables’ regions.
4. Finally, the client contacts the RegionServer that hosts the target region and accesses the user table data.
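The lookup chain above can be traced with plain dictionaries. This is an illustrative sketch only: the `zookeeper` and `catalog` dictionaries, their keys, and the function name are assumptions standing in for the real services and catalog tables.

```python
def locate_region_server(zookeeper, catalog, table, rowkey):
    """Follow ZooKeeper -> -ROOT- -> .META. -> user region.
    `zookeeper` and `catalog` are plain dicts standing in for the real services."""
    root_server = zookeeper["root-region-server"]    # step 1 (assumed key name)
    meta_server = catalog[root_server]["-ROOT-"]     # step 2: where .META. lives
    regions = catalog[meta_server][".META."][table]  # step 3: (start_key, server) pairs
    # Step 4: the owner is the last region whose start key is <= rowkey.
    owner = None
    for start_key, server in sorted(regions):
        if start_key <= rowkey:
            owner = server
    return owner
```

In practice the client would cache the result so that later requests for the same region skip the first three steps.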
HBase Fault Tolerance
Master fault tolerance: ZooKeeper elects a new Master if the current one fails. Reads continue, but region splits and load balancing pause.
RegionServer fault tolerance: RegionServers send heartbeats to ZooKeeper. If a heartbeat is missed, the Master reassigns the lost regions to other RegionServers, and the HLog of the failed server is replayed on the new servers.
ZooKeeper fault tolerance: ZooKeeper is typically deployed as an ensemble of 3 or 5 nodes to ensure reliability.
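The preference for 3 or 5 nodes follows from ZooKeeper's majority quorum, which a few lines of arithmetic make explicit. The function name below is hypothetical.

```python
def tolerated_failures(ensemble_size):
    """ZooKeeper stays available while a strict majority of nodes is up."""
    majority = ensemble_size // 2 + 1
    return ensemble_size - majority
```

A 3-node ensemble survives one failure and a 5-node ensemble survives two, while a 4-node ensemble still survives only one, which is why odd sizes are the usual choice.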
Write‑Ahead Log (WAL)
Each HRegionServer has an HLog object implementing WAL. When a client writes data, the write is first appended to the WAL; only after the WAL write succeeds does the client receive an acknowledgment. The WAL is periodically rolled, and old logs are deleted after their data has been flushed to StoreFiles.
If a RegionServer crashes, the Master detects it via ZooKeeper, reads the remaining WAL files, splits them by region, and assigns the regions to other RegionServers. During region loading, the new RegionServer replays the WAL entries into its MemStore and then flushes them to StoreFiles, restoring the lost data.
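The split-and-replay step can be sketched as two small functions. This is a simplified model, assuming each WAL entry is a `(seq_id, region, rowkey, value)` tuple with a unique, increasing sequence id, and that the highest sequence id already flushed to StoreFiles is known; the tuple layout and function names are hypothetical.

```python
from collections import defaultdict

def split_wal(wal_entries):
    """Group a crashed server's WAL entries by region, preserving log order."""
    per_region = defaultdict(list)
    for seq_id, region, rowkey, value in wal_entries:
        per_region[region].append((seq_id, rowkey, value))
    return dict(per_region)

def replay(edits, flushed_seq_id=0):
    """Rebuild a region's MemStore from the edits not yet flushed to StoreFiles."""
    memstore = {}
    for seq_id, rowkey, value in sorted(edits):
        if seq_id > flushed_seq_id:  # older edits are already in StoreFiles
            memstore[rowkey] = value
    return memstore
```

Replaying in sequence order ensures that the last write to a rowkey before the crash is the one that ends up in the rebuilt MemStore.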
Overall, HBase’s architecture combines the in-memory MemStore, on-disk StoreFiles, ZooKeeper coordination, and WAL durability to provide scalable, fault-tolerant storage for massive datasets.
ITPUB
