Mastering HBase: From Basics to Architecture and Cluster Design

This article introduces HBase, its origins from Google Bigtable, core concepts such as RowKey, Column Family, and Versioning, and explains its logical and physical table views, storage mechanisms, and cluster architecture within the Hadoop ecosystem.

Alibaba Cloud Developer

Apr 19, 2019

1. HBase Introduction

Google published the Bigtable paper in 2006. Engineers at Powerset started HBase shortly afterward as an open‑source implementation of that design; it became a sub‑project of Hadoop and graduated to a top‑level Apache project in 2010. Many people think of HBase only as "a NoSQL database", but more precisely it is a distributed, column‑family‑oriented storage system built on Hadoop.

HBase derives its name from "Hadoop Database" and is designed to store unstructured or semi‑structured data. It sits on top of HDFS, inheriting HDFS's reliability and scalability, while MapReduce, Pig, Hive, and Sqoop provide computation and data‑migration capabilities.

HBase is an open‑source implementation of Google's Bigtable model. It shares Bigtable's sparse, column‑family design and key‑value characteristics, although the two differ in implementation details. Coordination is handled by ZooKeeper, which plays the role that Chubby plays for Bigtable.

2. Basic Concepts

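RowKey : The unique primary key for a row, stored as a byte array and sorted lexicographically by byte value. The 64 KB limit that is often quoted comes from the Bigtable paper; HBase stores the row length in a 2‑byte field, which caps keys at 32,767 bytes, and in practice keys are kept far shorter because every cell stores its full RowKey. Because rows are stored in key order, a RowKey design that places related rows next to each other turns common queries into short range scans.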

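Column Family : A named group of columns declared in the table schema, normally at table creation. The HBase reference guide recommends keeping the number of families low (two or three at most), because each family adds its own MemStore and store files in every Region. Within a Region, all columns of a family are stored together in the same set of store files.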

Column : Belongs to a column family; a family can contain millions of dynamic columns, enabling flexible schema evolution.

Version Number : Each cell can hold multiple versions of a value, identified by a timestamp that defaults to the write time in milliseconds. Users can supply custom timestamps and configure how many versions each column family retains.

Cell : Identified uniquely by RowKey, column family, column qualifier, and version; stores raw bytes without type information.
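The byte‑wise ordering of RowKeys described above is easy to get wrong when keys contain numbers. The following is a minimal sketch in plain Python (no HBase client involved); the `user` prefix and the `padded_key` helper are hypothetical names chosen for illustration.

```python
# Row keys are compared byte by byte, not numerically.
raw_keys = [b"user2", b"user10", b"user1"]
sorted_raw = sorted(raw_keys)
print(sorted_raw)  # [b'user1', b'user10', b'user2']


def padded_key(prefix: str, n: int, width: int = 6) -> bytes:
    """Hypothetical helper: zero-pad the number so byte order matches numeric order."""
    return f"{prefix}{n:0{width}d}".encode("utf-8")


sorted_padded = sorted(padded_key("user", n) for n in (2, 10, 1))
print(sorted_padded)  # [b'user000001', b'user000002', b'user000010']
```

With fixed‑width keys, a scan from `user000001` to `user000010` visits rows in numeric order, which is usually what a range query expects.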

3. Logical Table View

HBase tables can be visualized as a sparse two‑dimensional spreadsheet where many cells are empty and do not consume storage on disk.
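One way to picture this sparse, versioned view is a map from (RowKey, column family, column qualifier) to a small set of timestamped values. The sketch below is a toy in‑memory model, not the HBase client API; the class name `ToyTable` and its methods are assumptions made for illustration.

```python
import time
from collections import defaultdict


class ToyTable:
    """Toy model of the logical table view; only cells that were written exist."""

    def __init__(self, families, max_versions=3):
        self.families = set(families)      # families are fixed up front
        self.max_versions = max_versions   # assumed to be >= 1
        # (row, family, qualifier) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, row, family, qualifier, value, ts=None):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if ts is None:
            ts = int(time.time() * 1000)   # default version: write time in ms
        versions = self._cells[(row, family, qualifier)]
        versions[ts] = value
        # Drop the oldest versions beyond the retention limit.
        for old_ts in sorted(versions)[:-self.max_versions]:
            del versions[old_ts]

    def get(self, row, family, qualifier, versions=1):
        cell = self._cells.get((row, family, qualifier), {})
        newest_first = sorted(cell, reverse=True)[:versions]
        return [(ts, cell[ts]) for ts in newest_first]

    def stored_cells(self):
        return len(self._cells)


table = ToyTable(families=["info"], max_versions=2)
table.put(b"row1", "info", b"name", b"alice", ts=100)
table.put(b"row1", "info", b"name", b"alicia", ts=200)
table.put(b"row1", "info", b"name", b"ali", ts=300)
table.put(b"row2", "info", b"email", b"b@example.com", ts=100)

print(table.get(b"row1", "info", b"name", versions=5))  # [(300, b'ali'), (200, b'alicia')]
print(table.get(b"row2", "info", b"name"))              # [] (never written, nothing stored)
print(table.stored_cells())                             # 2
```

Qualifiers such as `name` and `email` appear only on the rows that wrote them, while the family `info` must exist beforehand, mirroring the split between fixed families and dynamic columns.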

4. Physical Table View

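The physical layout consists of several layers:

Table → Region : A table is split horizontally, by RowKey range, into Regions.

Region → RegionServer : Regions are split as they grow and are distributed across RegionServers.

Region → Store : Each Region has its own internal storage structure, described next.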

A Region contains one or more Stores, one per column family. Each Store consists of a MemStore (in memory) and zero or more StoreFiles (HFiles) persisted in HDFS. Data is first written to the MemStore; when the MemStore exceeds a size threshold, its contents are flushed to a new StoreFile.
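The write‑then‑flush behavior of a Store can be sketched as follows. This is a simplified model under stated assumptions: the threshold counts entries rather than bytes, store files are plain sorted tuples instead of HFiles, and `ToyStore` is a hypothetical name.

```python
class ToyStore:
    """Toy model of one Store: one column family within one Region."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # entry count (assumption; HBase uses size in bytes)
        self.memstore = {}                      # mutable, in memory
        self.store_files = []                   # immutable sorted files, oldest first

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the MemStore out as one sorted, immutable file and start fresh."""
        if self.memstore:
            self.store_files.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        # Newest data wins: check the MemStore first, then files from newest to oldest.
        if key in self.memstore:
            return self.memstore[key]
        for store_file in reversed(self.store_files):
            for k, v in store_file:  # linear scan keeps the sketch short
                if k == key:
                    return v
        return None


store = ToyStore(flush_threshold=2)
store.put(b"a", b"1")
store.put(b"b", b"2")   # threshold reached, MemStore flushed to the first file
store.put(b"a", b"3")   # newer value for "a" lives in the MemStore only

print(len(store.store_files))  # 1
print(store.get(b"a"))         # b'3'
print(store.get(b"b"))         # b'2'
```

Because flushed files are never modified, a read may have to consult the MemStore and several files, which is why the number of StoreFiles per Store affects read latency.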

5. Cluster Architecture

An HBase cluster typically comprises one active Master (optionally with standby Masters), multiple RegionServer nodes, and a ZooKeeper ensemble for coordination.

Client libraries : Provide language‑specific APIs and maintain a local cache of region locations for fast access.

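Master : Assigns Regions to RegionServers, handles load balancing, and manages table metadata and schema operations such as creating, altering, and deleting tables. Row‑level reads and writes do not pass through the Master.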

RegionServer : Hosts Regions, serves read/write requests, and splits oversized Regions during runtime.
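To make the roles above concrete, here is a rough sketch of how a request can be routed to a Region by RowKey range, and what a split does to that mapping. It is a toy directory, not HBase's actual metadata lookup; `ToyRegionMap`, the server names, and the choice to hand the upper half of a split Region to a different server are all assumptions made for illustration.

```python
import bisect


class ToyRegionMap:
    """Toy region directory: region i covers [start_keys[i], start_keys[i + 1])."""

    def __init__(self, server):
        self.start_keys = [b""]   # the first Region starts at the empty key
        self.servers = [server]

    def locate(self, row_key):
        """Return (region start key, server) for the Region holding row_key."""
        i = bisect.bisect_right(self.start_keys, row_key) - 1
        return self.start_keys[i], self.servers[i]

    def split(self, split_key, new_server):
        """Split the Region containing split_key; the upper half moves to new_server."""
        i = bisect.bisect_right(self.start_keys, split_key)
        if self.start_keys[i - 1] == split_key:
            raise ValueError("split key is already a region boundary")
        self.start_keys.insert(i, split_key)
        self.servers.insert(i, new_server)


regions = ToyRegionMap(server="rs1")
regions.split(b"m", new_server="rs2")
regions.split(b"t", new_server="rs3")

print(regions.locate(b"apple"))  # (b'', 'rs1')
print(regions.locate(b"m"))      # (b'm', 'rs2')
print(regions.locate(b"zebra"))  # (b't', 'rs3')
```

A client that keeps a copy of such a map locally can send most requests straight to the right RegionServer and only needs to refresh its copy when a Region has moved or split.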
