Understanding HBase RegionServer, HRegion, HStore, and Column Family Management

The article explains HBase's RegionServer management of regions and stores, detailing HStore composition, MemStore flushing, split conditions, column family sharing within regions, and the performance implications of multiple column families, recommending a single column family design for optimal I/O efficiency.

Big Data Technology & Architecture

An HRegionServer manages a set of HRegion objects, each corresponding to one region of a table. Each HRegion in turn consists of one or more HStore instances.

Each HStore corresponds to one column family of the table and acts as that family's dedicated storage unit. For this reason, columns with similar I/O characteristics are best placed in the same column family.
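The containment relationship described so far can be sketched as a small data model. This is a toy illustration only: the class names RegionServer, Region, and Store are hypothetical stand-ins, not HBase's Java classes, and the row-key ranges are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Toy stand-in for an HStore: one per column family per region."""
    family: str
    memstore: dict = field(default_factory=dict)     # buffered writes
    store_files: list = field(default_factory=list)  # flushed files


@dataclass
class Region:
    """Toy stand-in for an HRegion: one row-key range of one table."""
    table: str
    start_key: str
    end_key: str
    stores: dict = field(default_factory=dict)  # family name -> Store

    @classmethod
    def create(cls, table, start_key, end_key, families):
        region = cls(table, start_key, end_key)
        for family in families:
            region.stores[family] = Store(family)
        return region


@dataclass
class RegionServer:
    """Toy stand-in for an HRegionServer hosting several regions."""
    regions: list = field(default_factory=list)


server = RegionServer()
server.regions.append(Region.create("zz", "", "m", ["a", "b"]))
server.regions.append(Region.create("zz", "m", "", ["a", "b"]))

# Two regions, each holding one store per column family.
store_count = sum(len(region.stores) for region in server.regions)
print(store_count)  # 4
```

The point of the model is that the number of stores grows with regions times column families, so every extra column family adds a store to every region of the table.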

The HStore is the core of HBase storage and has two parts: the MemStore and the StoreFiles. The MemStore is a sorted in-memory buffer that receives incoming writes first. When it reaches its flush threshold (configured through hbase.hregion.memstore.flush.size), its contents are flushed to disk as a new StoreFile, which is implemented as an HFile.
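The write-then-flush cycle can be illustrated with a minimal sketch. ToyStore and its byte accounting are assumptions made for this example; the real MemStore keeps entries sorted in memory at all times and tracks size differently, whereas this toy simply sorts at flush time.

```python
class ToyStore:
    """Illustrative store: a write buffer plus a list of flushed files."""

    def __init__(self, flush_threshold_bytes):
        self.flush_threshold_bytes = flush_threshold_bytes
        self.memstore = {}        # row key -> value (sorted when flushed)
        self.memstore_bytes = 0   # rough size of the buffered data
        self.store_files = []     # each file is an immutable sorted tuple

    def put(self, row_key, value):
        old = self.memstore.get(row_key)
        if old is not None:
            # Overwriting a key replaces its old contribution to the size.
            self.memstore_bytes -= len(row_key) + len(old)
        self.memstore[row_key] = value
        self.memstore_bytes += len(row_key) + len(value)
        if self.memstore_bytes >= self.flush_threshold_bytes:
            self.flush()

    def flush(self):
        if not self.memstore:
            return
        # Write the buffer out as one sorted, immutable "file", then reset.
        self.store_files.append(tuple(sorted(self.memstore.items())))
        self.memstore = {}
        self.memstore_bytes = 0


store = ToyStore(flush_threshold_bytes=20)
for key in ["row3", "row1", "row2", "row4"]:
    store.put(key, "value")

print(len(store.store_files))                  # 1
print([k for k, _ in store.store_files[0]])    # ['row1', 'row2', 'row3']
print(list(store.memstore))                    # ['row4']
```

Each put adds 9 bytes here, so the third write crosses the 20-byte threshold and produces one sorted file, while the fourth write stays buffered.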

A region is split when the largest StoreFile among all stores in the region exceeds a configured threshold (hbase.hregion.max.filesize); the exact rule depends on the split policy in use.
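The split condition as stated above can be written as a short check. This is a sketch of the rule as the article describes it, not of any particular HBase split policy; should_split and the sample sizes are hypothetical.

```python
GIB = 1024 ** 3


def should_split(store_file_sizes_by_family, max_file_size_bytes):
    """Return True if the largest store file in the region exceeds the limit."""
    largest = max(
        (size
         for sizes in store_file_sizes_by_family.values()
         for size in sizes),
        default=0,  # a region with no store files never needs a split
    )
    return largest > max_file_size_bytes


# Family "a" holds a large file; family "b" holds only a few kilobytes.
region_files = {"a": [2 * GIB, 11 * GIB], "b": [4 * 1024]}

split_needed = should_split(region_files, max_file_size_bytes=10 * GIB)
print(split_needed)  # True
```

Note that the decision is made for the region as a whole: family "b" is split along with "a" even though its own files are nowhere near the threshold.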

At the file level, different column families are stored in separate files, but multiple column families can share the same region.

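For example, the following two paths belong to different column families, a and b, in the same region: /hbase/zz/3917ebd872c0adcb9d6c5a9cfd30b87f/a and /hbase/zz/3917ebd872c0adcb9d6c5a9cfd30b87f/b. Here zz is the table, 3917ebd872c0adcb9d6c5a9cfd30b87f is the region, and the last path component is the column family.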
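To make the directory layout concrete, the sketch below splits such paths into table, region, and column family, and groups the families by region. It assumes the /hbase/<table>/<region>/<family> layout shown in the example paths; other HBase versions may use a different directory structure, and parse_family_dir is a hypothetical helper.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def parse_family_dir(path, root="/hbase"):
    """Split <root>/<table>/<region>/<family> into its three components."""
    parts = PurePosixPath(path).relative_to(root).parts
    if len(parts) != 3:
        raise ValueError(
            f"expected <table>/<region>/<family> under {root}: {path}"
        )
    table, region, family = parts
    return table, region, family


paths = [
    "/hbase/zz/3917ebd872c0adcb9d6c5a9cfd30b87f/a",
    "/hbase/zz/3917ebd872c0adcb9d6c5a9cfd30b87f/b",
]

# Group column families by the (table, region) directory they live under.
families_by_region = defaultdict(list)
for path in paths:
    table, region, family = parse_family_dir(path)
    families_by_region[(table, region)].append(family)

print(dict(families_by_region))
# {('zz', '3917ebd872c0adcb9d6c5a9cfd30b87f'): ['a', 'b']}
```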

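Because column families share a region, their sizes can be badly mismatched: one column family may hold millions of rows while another holds only a few. Splits are decided for the region as a whole, so when the large family triggers a split, the small family's handful of rows is split as well and ends up scattered across many regions. This is a cardinality problem, and it degrades scan performance on the small family, since a scan may have to visit many regions to collect very little data.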
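A small simulation shows how this scattering arises. The numbers are invented for illustration, and splitting by a fixed row count is a simplification of how regions actually split: family a has a value in every row, family b in only four rows.

```python
def split_into_regions(sorted_row_keys, max_rows_per_region):
    """Cut a sorted key space into consecutive regions of bounded size."""
    return [
        sorted_row_keys[i:i + max_rows_per_region]
        for i in range(0, len(sorted_row_keys), max_rows_per_region)
    ]


# 10,000 rows carry data in family "a"; zero padding keeps the keys sorted.
rows_a = [f"row{i:05d}" for i in range(10_000)]

# Only four of those rows also carry data in family "b".
rows_b = {rows_a[i] for i in (500, 3500, 6500, 9500)}

# Family "a" drives the splits; family "b" is split along with it.
regions = split_into_regions(rows_a, max_rows_per_region=1_000)

regions_total = len(regions)
regions_with_b = sum(1 for region in regions if rows_b.intersection(region))

print(regions_total)   # 10
print(regions_with_b)  # 4
```

In this toy setup the four rows of family b end up in four different regions, and a full scan of family b still has to consult all ten regions because it cannot know in advance which of them hold b data.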

In addition, flushes are coupled across the column families of a region: flushing one column family can force its neighboring column families to flush as well, even when they hold little data, which increases I/O.
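The cost of this coupling can be seen in a toy simulation that compares a region-wide flush with a hypothetical flush of only the largest buffer. The simulate function, the threshold, and the write pattern are all assumptions for illustration; actual flush behavior depends on the HBase version and its configured flush policy.

```python
def simulate(writes, threshold, flush_whole_region):
    """Replay (family, size) writes and return the file sizes per family."""
    buffered = {"a": 0, "b": 0}
    files = {"a": [], "b": []}
    for family, size in writes:
        buffered[family] += size
        if sum(buffered.values()) >= threshold:
            if flush_whole_region:
                targets = list(buffered)            # coupled: flush everything
            else:
                targets = [max(buffered, key=buffered.get)]  # largest only
            for target in targets:
                if buffered[target] > 0:
                    files[target].append(buffered[target])
                    buffered[target] = 0
    return files


# Family "a" is write-heavy; family "b" receives one tiny write per cycle.
writes = ([("b", 1)] + [("a", 10)] * 10) * 5

coupled = simulate(writes, threshold=100, flush_whole_region=True)
independent = simulate(writes, threshold=100, flush_whole_region=False)

print(coupled["b"])      # [1, 1, 1, 1, 1]
print(independent["b"])  # []
```

With the coupled flush, the lightly written family b produces five one-byte files, one for every flush that family a triggers. In the hypothetical independent case it produces none and its data simply stays buffered.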

Therefore, it is generally recommended to avoid defining multiple column families in a table and to use a single column family where possible.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
