
Using HBase for HR Performance Data Preprocessing Platform: Architecture, Concepts, and Best Practices

This article introduces the HR performance data preprocessing platform’s requirements, explains why HBase was selected as the storage solution, details its core concepts, architecture, data write/read processes, best practices, limitations, and presents performance metrics demonstrating its suitability for large‑scale, high‑throughput workloads.

Architecture Digest

Jun 21, 2021

Background

The HR performance data preprocessing platform receives business volume data from all upstream systems. The data is massive and unstructured, individual records are updated frequently, and queries must be fast. Common storage options such as OSS, MySQL, and Elasticsearch either cannot meet the query and update performance requirements or are too costly, which led to the selection of HBase.

Current Situation

Considering the platform’s large, unstructured data and the need for high performance and open‑source stability, HBase was chosen as the storage solution.

HBase Applicable Scenarios

Object Storage: News articles, web pages, images, and even virus databases are stored in HBase.

Time‑Series Data: OpenTSDB on top of HBase satisfies time‑series requirements.

Recommendation Profiles: Large sparse matrices for user profiling, such as Ant Financial’s risk control, are built on HBase.

Spatio‑Temporal Data: Trajectory and weather grid data, e.g., Didi’s taxi trajectories, are stored in HBase.

Message/Order Systems: Many telecom and banking order‑query back‑ends rely on HBase.

Feeds Streams: Social‑feed‑like applications use HBase for storage.

HBase Basic Concepts

Namespace: Equivalent to a database name in MySQL.

Table Name: Equivalent to a table name in MySQL.

Column Family: A collection of columns; a column family can contain many columns.

Column Name: An individual column under a column family.

RowKey: The primary key for HBase; all operations are based on a unique RowKey.
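The concepts above can be pictured as nested maps. The following is a minimal in-memory sketch, not HBase client code; the table name `hr:performance`, the column families `base` and `metrics`, and the RowKeys are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Toy model of the logical layout:
# "namespace:table" -> RowKey -> "family:qualifier" -> value
tables = defaultdict(lambda: defaultdict(dict))

def put(table, row_key, family, qualifier, value):
    """Write a single cell; other columns in the row are untouched."""
    tables[table][row_key][f"{family}:{qualifier}"] = value

def get(table, row_key):
    """Return all cells of one row as a plain dict."""
    return dict(tables[table].get(row_key, {}))

put("hr:performance", "emp001", "base", "name", "Alice")
put("hr:performance", "emp001", "metrics", "orders", 120)
put("hr:performance", "emp002", "base", "name", "Bob")     # sparse row: no metrics yet
put("hr:performance", "emp001", "metrics", "orders", 135)  # single-column update

row = get("hr:performance", "emp001")
```

Rows in the same table do not need the same set of columns, which is why sparse or unstructured records fit this model.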

HBase Overall Architecture

HBase consists of three server types in a master‑slave mode:

Region Server: Handles read/write services; each Region Server contains up to 1,000 Regions, each covering a range of RowKeys.

HBase HMaster: Assigns Regions, performs DDL, monitors Region Servers, and handles cluster recovery and dynamic adjustments.

ZooKeeper: Maintains cluster state, coordinates the servers, and performs HMaster election.
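Because each Region covers a contiguous range of RowKeys, finding the Region for a key is a sorted-range lookup. Below is a small sketch of that idea; the start keys and server names (`rs-1` and so on) are made up, and a real cluster keeps this mapping in its meta table rather than in a Python list.

```python
import bisect

# Hypothetical layout: each Region covers [start_key, next_start_key).
# The first Region starts at the empty key; the last one is open-ended.
region_start_keys = ["", "4", "8", "c"]
region_servers = ["rs-1", "rs-2", "rs-1", "rs-3"]  # server hosting each Region

def locate(row_key):
    """Return (region_index, server) for the Region whose range contains row_key."""
    idx = bisect.bisect_right(region_start_keys, row_key) - 1
    return idx, region_servers[idx]
```

For example, `locate("3fa2")` falls in the first Region and `locate("ff")` in the last one.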

Cluster Coordination

Each Region Server keeps a session with ZooKeeper alive through heartbeats and registers an ephemeral node. If the heartbeats stop, the session expires and ZooKeeper removes the node. HMaster watches these nodes to track Region Server status and performs data recovery and failover when a server goes offline. HMaster instances also keep a heartbeat with ZooKeeper; if the active HMaster fails, an election selects a new active HMaster to ensure high availability.
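The heartbeat mechanism can be illustrated with a toy coordinator that drops a server once its heartbeats stop. This is a simplified simulation, not the ZooKeeper API: `ToyCoordinator`, `reassign`, the timeout value, and the round-robin reassignment are all assumptions made for the example, and real recovery also involves replaying the failed server's WAL.

```python
class ToyCoordinator:
    """Stand-in for ZooKeeper: tracks one node per server by last heartbeat time."""
    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_seen = {}   # server name -> time of last heartbeat

    def heartbeat(self, server, now):
        self.last_seen[server] = now

    def expire(self, now):
        """Remove nodes whose session timed out; return the servers that were lost."""
        lost = [s for s, t in self.last_seen.items() if now - t > self.session_timeout]
        for s in lost:
            del self.last_seen[s]
        return sorted(lost)

def reassign(assignments, lost, live):
    """What an active master might do: move Regions of lost servers to live ones."""
    live = sorted(live)
    moved = {}
    for i, (region, server) in enumerate(sorted(assignments.items())):
        moved[region] = live[i % len(live)] if server in lost else server
    return moved

zk = ToyCoordinator(session_timeout=30)
zk.heartbeat("rs-1", now=0)
zk.heartbeat("rs-2", now=0)
zk.heartbeat("rs-1", now=25)          # rs-2 stops sending heartbeats
lost = zk.expire(now=40)
new_assignments = reassign({"region-a": "rs-1", "region-b": "rs-2"}, lost, zk.last_seen.keys())
```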

Data Write Process

1. The HBase client sends a Put request to the Region Server, which first appends the data to the Write-Ahead Log (WAL) stored in HDFS.

2. After WAL persistence, the data is written to the MemStore cache of the target Region on the Region Server.

3. Once MemStore write succeeds, the client receives a success acknowledgment.

4. When MemStore reaches a threshold, it is flushed to an HFile on HDFS. Entries are written in sorted key order, together with a multi-level index and a trailer that points to it.
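The four steps above can be sketched with a toy Region. This is an in-memory illustration under simplifying assumptions: `ToyRegion` and its threshold of three entries are hypothetical, the "HFile" is just a sorted list, and the flush runs inline here, whereas a real Region Server flushes in the background.

```python
class ToyRegion:
    """Simplified write path: WAL append, MemStore insert, flush to a sorted 'HFile'."""
    def __init__(self, flush_threshold=3):
        self.wal = []          # append-only log (stored on HDFS in real HBase)
        self.memstore = {}     # in-memory buffer of recent writes
        self.hfiles = []       # immutable, sorted files produced by flushes
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        self.wal.append((row_key, value))      # durability first
        self.memstore[row_key] = value         # then the in-memory store
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                       # inline here; asynchronous in practice
        return True                            # acknowledgment to the client

    def flush(self):
        # Entries are written in RowKey order so the file can be indexed and searched.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

region = ToyRegion(flush_threshold=3)
for key in ["r3", "r1", "r2"]:
    region.put(key, key.upper())
region.put("r0", "R0")
```

After the third Put the MemStore is flushed as one sorted file, and the fourth Put starts a fresh MemStore.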

Data Read Process

1. The client asks ZooKeeper which Region Server hosts the MetaTable (hbase:meta) and queries it there, or uses its local cache if available.

2. Using the MetaTable, the client determines which Region Server holds the desired RowKey.

3. On that Region Server, the read first checks the BlockCache; if the data is missing, it looks in MemStore; if still missing, it reads from the HFiles.

4. While reading an HFile, the Bloom filter and the time range recorded with the file quickly determine whether the file can contain the RowKey at all, so files that cannot are skipped.

5. After reading, the multi‑level index is loaded into BlockCache to accelerate subsequent reads.
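The lookup order in steps 3 to 5 can be sketched as follows. This is a read-only toy under stated assumptions: `TinyBloom` and `read` are hypothetical helpers, the cache stores whole values rather than HFile blocks, and a real Region Server merges results from all three sources instead of stopping at the first hit.

```python
import hashlib

class TinyBloom:
    """Small Bloom filter: may report false positives, never false negatives."""
    def __init__(self, size=256, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

def read(row_key, block_cache, memstore, hfiles):
    """Return (value, source). hfiles is a list of (bloom, data) pairs, newest first."""
    if row_key in block_cache:
        return block_cache[row_key], "blockcache"
    if row_key in memstore:
        return memstore[row_key], "memstore"
    for bloom, data in hfiles:
        if not bloom.might_contain(row_key):
            continue                               # skip files that cannot hold the key
        if row_key in data:
            block_cache[row_key] = data[row_key]   # cache for later reads
            return data[row_key], "hfile"
    return None, "miss"

bloom, data = TinyBloom(), {"emp001": "A"}
bloom.add("emp001")
cache, memstore = {}, {"emp002": "B"}
first = read("emp001", cache, memstore, [(bloom, data)])
second = read("emp001", cache, memstore, [(bloom, data)])
```

The first read of `emp001` is served from the file and populates the cache; the second read is served from the cache.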

Best Practices

Key Advantages of HBase for the HR Platform

1. Distributed column‑oriented database that scales horizontally to handle massive data volumes.

2. Supports hundreds to thousands of columns per column family, solving unstructured data storage and single‑column updates.

3. Provides millisecond‑level random read/write, real‑time access, high availability, multi‑level caching, seamless service continuity, automatic master‑slave failover, and active‑active cross‑region deployment.

4. Built‑in compression algorithms reduce storage footprint.

5. Multi‑version support retains historical data.

6. Data TTL feature automatically deletes long‑unused data.
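Points 5 and 6 can be illustrated with a single cell that keeps several timestamped versions and hides expired ones. This is a conceptual sketch only: `VersionedCell`, the version limit, and the TTL value are assumptions for the example, and in HBase these settings are configured per column family.

```python
class VersionedCell:
    """One cell keeping up to max_versions timestamped values, with an optional TTL."""
    def __init__(self, max_versions=3, ttl=None):
        self.max_versions = max_versions
        self.ttl = ttl                 # seconds; None disables expiry
        self.versions = []             # list of (timestamp, value), newest first

    def put(self, value, ts):
        self.versions.append((ts, value))
        self.versions.sort(key=lambda v: v[0], reverse=True)
        del self.versions[self.max_versions:]      # drop the oldest extras

    def get(self, now, n=1):
        """Return up to n live values, newest first."""
        live = [(ts, v) for ts, v in self.versions
                if self.ttl is None or now - ts <= self.ttl]
        return [v for _, v in live[:n]]

cell = VersionedCell(max_versions=2, ttl=100)
cell.put("60", ts=0)
cell.put("75", ts=50)
cell.put("90", ts=120)     # the version written at ts=0 is dropped
recent = cell.get(now=130, n=2)
later = cell.get(now=200, n=2)
```

At `now=130` both retained versions are visible; at `now=200` the version written at `ts=50` has exceeded the TTL and only the newest remains.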

Drawbacks

1. RowKey design is critical; it must be unique and well‑hashed. Queries rely on RowKey, so secondary indexes (e.g., in Elasticsearch) are often needed.

2. Data can be located efficiently only through the RowKey; filtering on a column value without a RowKey or RowKey range requires scanning the table.

Precautions

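Data Hotspot Mitigation: In the platform's environment a table gets 10 Regions by default (stock Apache HBase starts a table with a single Region unless it is pre-split). Use pre-splitting (e.g., HexStringSplit) and design RowKeys whose prefixes are evenly distributed (e.g., Snowflake ID + MD5) so that load spreads across Regions.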
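One way to picture the RowKey advice is to prefix each ID with part of its MD5 hash and to pre-split the table on evenly spaced hex boundaries. The sketch below is an assumption-laden illustration: `salted_row_key` and `hex_split_points` are hypothetical helpers, the 4-character prefix is an arbitrary choice, and the split points only mimic the spirit of HexStringSplit rather than reproducing its exact output.

```python
import hashlib

def salted_row_key(business_id):
    """Prefix the ID with the first 4 hex chars of its MD5 so keys spread evenly."""
    prefix = hashlib.md5(business_id.encode()).hexdigest()[:4]
    return f"{prefix}_{business_id}"

def hex_split_points(num_regions, width=4):
    """Evenly spaced hex boundaries for num_regions Regions."""
    space = 16 ** width
    return [format(space * i // num_regions, f"0{width}x") for i in range(1, num_regions)]

splits = hex_split_points(4)
key = salted_row_key("1402345678901234567")   # a Snowflake-style ID, made up here
```

Because sequential IDs hash to unrelated prefixes, consecutive writes land in different Regions instead of piling onto the last one. The trade-off is that range scans over the original ID order are no longer possible.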

Batch Get Size: Keep batch query size below 100 rows; larger batches cause noticeable performance degradation.
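A simple way to respect this limit is to split a large key list into fixed-size chunks before querying. In the sketch below, `fetch` is a hypothetical callable standing in for whatever client call performs the batch Get, and the default of 100 follows the guideline above.

```python
def chunked(row_keys, batch_size=100):
    """Yield successive slices of at most batch_size RowKeys."""
    for start in range(0, len(row_keys), batch_size):
        yield row_keys[start:start + batch_size]

def batch_get(row_keys, fetch, batch_size=100):
    """Call fetch(batch) once per chunk and merge the results into one dict."""
    results = {}
    for batch in chunked(row_keys, batch_size):
        results.update(fetch(batch))
    return results

keys = [f"k{i}" for i in range(250)]
sizes = [len(batch) for batch in chunked(keys)]
merged = batch_get(keys, fetch=lambda batch: {k: k.upper() for k in batch})
```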

Quota and Rate Limiting: Instance quotas (e.g., 10,000 QPS) apply to the sum of reads and writes across all tables. Exceeding the quota triggers alerts and retries; request higher quotas or batch operations if needed.
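To stay under an instance quota, a client can throttle itself before the server does. The following token-bucket sketch is one possible approach, not something the platform is stated to use; `TokenBucket`, the rate of 10 operations per second, and the explicit `now` argument (used instead of a real clock to keep the example deterministic) are assumptions.

```python
class TokenBucket:
    """Client-side throttle: allows at most `rate` operations per second on average."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, 0.0

    def try_acquire(self, now, cost=1):
        # Refill according to the time elapsed since the previous call.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=10, capacity=10)
burst = sum(bucket.try_acquire(now=0.0) for _ in range(12))        # bucket starts full
after_half_second = sum(bucket.try_acquire(now=0.5) for _ in range(12))
```

Requests that fail to acquire a token can be queued or retried after a short delay. A batch operation can pass its row count as `cost` so that reads and writes draw from the same budget.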

Single Row Size Limit: Avoid rows larger than 400 KB (default max 512 KB) to prevent severe performance loss.

Connection Handling: Establish HBase connections at application startup rather than per request, as each connection incurs high latency.
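The pattern described here is to create the connection once and share it. Below is a minimal sketch of that pattern; `SharedConnection` and `fake_factory` are hypothetical, and in a real application the factory would wrap the HBase client's own connection setup.

```python
import threading

class SharedConnection:
    """Create the expensive connection once and hand the same object to every caller."""
    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get(cls, factory):
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:      # re-check inside the lock
                    cls._instance = factory()
        return cls._instance

calls = []

def fake_factory():
    """Stand-in for a slow connection setup; records how often it runs."""
    calls.append(1)
    return object()

conn_a = SharedConnection.get(fake_factory)   # e.g. at application startup
conn_b = SharedConnection.get(fake_factory)   # e.g. inside a request handler
```

The factory runs once, so request handlers never pay the connection setup latency.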

Performance Reference in the HR Platform

HBase Write TP99 (last 30 days)

HBase Write Average

HBase Query TP99

HBase Query Average
