Mastering HBase: From Basics to Architecture and Cluster Design
This article introduces HBase and its origins in Google's Bigtable, covers core concepts such as RowKey, Column Family, and Versioning, and explains its logical and physical table views, storage mechanisms, and cluster architecture within the Hadoop ecosystem.
1. HBase Introduction
In November 2006 Google published the seminal Bigtable paper at OSDI. Shortly afterwards Powerset started HBase as an open-source implementation of that design; it became a sub-project of Hadoop in 2008 and graduated to a top-level Apache project in 2010. Many people know HBase only as a "NoSQL database", but more precisely it is a distributed, column-family-oriented storage system built on Hadoop.
HBase derives its name from "Hadoop Database" and is designed to store unstructured or semi‑structured data. It sits on top of HDFS, inheriting HDFS's reliability and scalability, while MapReduce, Pig, Hive, and Sqoop provide computation and data‑migration capabilities.
HBase is an open-source implementation of Google's Bigtable model. It shares Bigtable's sparse, column-family design and key-value characteristics, though the two differ in implementation details. Coordination is handled by ZooKeeper, analogous to Bigtable's use of Chubby.
2. Basic Concepts
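RowKey : The unique primary key for a row. It is an arbitrary byte array (the commonly cited upper limit is 64 KB, though practical keys are usually tens of bytes), and rows are stored sorted by the byte-wise lexicographic order of their keys. Because a scan reads a contiguous key range, a RowKey design that places related rows next to each other improves scan performance.
Column Family : A group of columns declared when the table is created. A table can define several families, but HBase works best with a small number, typically no more than two or three. All columns in a family are stored together in the same set of storage files.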
Column : Belongs to a column family; a family can contain millions of dynamic columns, enabling flexible schema evolution.
Version Number : Each cell value is versioned, defaulting to a timestamp in milliseconds. Users can set custom timestamps or limit the number of retained versions.
Cell : Identified uniquely by RowKey, column family, column qualifier, and version; stores raw bytes without type information.
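The concepts above can be sketched with a small in-memory model. This is only an illustration in plain Python, not the HBase client API: the class name ToyTable and the max_versions parameter are hypothetical. It shows two things: row keys compare as raw bytes (so "row10" sorts before "row2" unless numbers are zero-padded), and a cell is addressed by row, family, qualifier, and timestamp.

```python
# Toy model of HBase's cell coordinates; NOT the HBase client API.
from collections import defaultdict

# Row keys are compared as raw bytes, so "row10" sorts before "row2".
keys = [b"row2", b"row10", b"row1"]
sorted_keys = sorted(keys)  # Python compares bytes lexicographically

# Zero-padding the numeric part restores numeric order under byte comparison.
padded = sorted(b"row%04d" % n for n in (2, 10, 1))


class ToyTable:
    """Maps (row, family, qualifier) to timestamped versions, newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions  # hypothetical retention setting
        self.cells = defaultdict(list)

    def put(self, row, family, qualifier, value, ts):
        versions = self.cells[(row, family, qualifier)]
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)  # newest first
        del versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, row, family, qualifier, ts=None):
        """Return the newest value at or before ts (latest if ts is None)."""
        for version_ts, value in self.cells.get((row, family, qualifier), []):
            if ts is None or version_ts <= ts:
                return value
        return None
```

Values are plain bytes throughout, mirroring the point that a cell carries no type information; any encoding or decoding is the caller's job.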
3. Logical Table View
HBase tables can be visualized as a sparse two‑dimensional spreadsheet where many cells are empty and do not consume storage on disk.
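A rough way to picture this sparseness is a nested map that holds only the cells that were actually written. The sketch below uses made-up rows and column names purely for illustration and compares the size of the logical grid with the number of cells that would be stored.

```python
# Sparse storage sketch: only cells that were written are stored.
# Rows, columns, and values here are invented sample data.
rows = {
    b"user1": {b"info:name": b"Ann", b"info:email": b"ann@example.com"},
    b"user2": {b"info:name": b"Bob"},
    b"user3": {b"stats:logins": b"7"},
}

# The logical grid has one slot per (row, column) pair ...
all_columns = sorted({col for cells in rows.values() for col in cells})
logical_slots = len(rows) * len(all_columns)

# ... but only the populated cells take up space.
stored_cells = sum(len(cells) for cells in rows.values())


def read(row, column):
    """An absent cell is simply a missing key, not a stored null."""
    return rows.get(row, {}).get(column)
```

With three rows and three distinct columns the logical view has nine slots, while only four cells exist in the map.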
4. Physical Table View
The physical layout consists of several layers:
Table → Region (horizontal split)
Region split and distribution across RegionServers
Region storage structure
A Region contains one or more Stores. Each Store corresponds to one column family and consists of a MemStore (in memory) and zero or more StoreFiles (HFiles) persisted in HDFS. Data is first written to the MemStore; when the MemStore grows past a size threshold, its contents are flushed to a new StoreFile.
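The write path described here can be sketched as a toy Store. The class name ToyStore and the flush_threshold parameter are hypothetical, and the threshold counts entries for simplicity, whereas real HBase measures the MemStore's size in memory. The sketch also omits the write-ahead log, versions, and compaction.

```python
class ToyStore:
    """One column family in one Region: a MemStore plus immutable StoreFiles."""

    def __init__(self, flush_threshold=3):
        # Hypothetical threshold: number of entries, not bytes.
        self.flush_threshold = flush_threshold
        self.memstore = {}    # in-memory buffer of recent writes
        self.storefiles = []  # oldest first; each is a sorted, immutable tuple

    def put(self, row, value):
        self.memstore[row] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the MemStore out as a new sorted StoreFile and clear it."""
        if self.memstore:
            self.storefiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, row):
        """Check the MemStore first, then StoreFiles from newest to oldest."""
        if row in self.memstore:
            return self.memstore[row]
        for storefile in reversed(self.storefiles):
            for key, value in storefile:
                if key == row:
                    return value
        return None
```

Because StoreFiles are never modified after a flush, an update to an existing row lands in the MemStore or in a newer file, which is why reads consult the newest data first.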
5. Cluster Architecture
An HBase cluster typically comprises one active Master node (optionally with standby Masters for failover), multiple RegionServer nodes, and a ZooKeeper ensemble for coordination.
Client libraries : Provide language‑specific APIs and maintain a local cache of region locations for fast access.
Master : Assigns Regions to RegionServers, handles load balancing, and manages table metadata and schema operations such as creating, altering, and deleting tables. It does not sit on the data read/write path.
RegionServer : Hosts Regions, serves read/write requests, and splits oversized Regions during runtime.
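How a client might use its cached region locations can be sketched as a lookup over sorted region start keys. This is a simplified model, not the HBase client implementation: ToyRegionMap, the server names, and the split method are hypothetical, and in a real cluster the daughter regions of a split are placed by the Master rather than by the caller.

```python
import bisect


class ToyRegionMap:
    """Client-side cache mapping region start keys to RegionServer names."""

    def __init__(self):
        # The first region starts at the empty key and covers the whole table.
        self.start_keys = [b""]
        self.servers = ["rs1"]  # hypothetical server names

    def locate(self, row_key):
        """Find the region whose [start, next_start) range holds row_key."""
        index = bisect.bisect_right(self.start_keys, row_key) - 1
        return self.start_keys[index], self.servers[index]

    def split(self, split_key, new_server):
        """Split the region containing split_key at that key.

        Simplification: the upper half is assigned directly to new_server.
        """
        if split_key in self.start_keys:
            raise ValueError("a region already starts at this key")
        index = bisect.bisect_right(self.start_keys, split_key)
        self.start_keys.insert(index, split_key)
        self.servers.insert(index, new_server)
```

Since regions cover contiguous, sorted key ranges, a binary search over start keys is enough to route a request, and the client can contact the RegionServer directly without involving the Master.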
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
