Why HBase Is the Ideal Choice for Large‑Scale HR Data Preprocessing
This article explains how HBase’s distributed column‑oriented architecture, high‑performance read/write capabilities, and flexible schema make it a cost‑effective solution for handling massive, unstructured HR performance data, covering its core concepts, cluster operation, best practices, and performance metrics.
Background
The HR performance data preprocessing platform ingests massive volumes of unstructured upstream business data and must sustain high update and query performance. The conventional storage options each fall short: OSS is too slow for single-record updates, MySQL becomes hard to manage when every business type needs its own table, and Elasticsearch is costly and poorly suited to single-field updates.
Current Situation
Given the platform's requirements (large data volume, unstructured data, and high read/write performance) and the need for a stable open-source system, HBase was selected.
What Is HBase?
HBase is an open-source, distributed, column-oriented database built on top of Hadoop (HDFS) that offers Bigtable-like capabilities. It provides high reliability and scalability and runs on inexpensive commodity servers, which gives it excellent cost-performance for large-scale structured storage.
HBase Use Cases
Object Storage : News articles, web pages, images, and even virus databases are stored in HBase.
Time‑Series Data : OpenTSDB on HBase supports time‑series scenarios.
Recommendation Profiles : Sparse user-profile matrices are built on HBase, for example in Ant Financial's risk control system.
Spatio‑Temporal Data : Trajectory and weather grid data (e.g., Didi’s GPS traces) reside in HBase.
Message/Order Data : Telecom and banking order queries, as well as messaging synchronization, rely on HBase.
Feeds Stream : Social‑feed‑like applications use HBase for storage.
Basic Concepts
Namespace : Equivalent to a database name in MySQL.
Table Name : Equivalent to a MySQL table name.
Column Family : A group of columns that are stored together; a table can have several column families, although HBase works best with only a few.
Column Name : An individual column within a column family.
RowKey : The unique key for each row; all operations are based on the RowKey.
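The concepts above can be pictured as a nested map. The following is a minimal Python sketch of that logical model, not HBase client code; the table name `hr:performance`, the column families `info` and `score`, and the helpers `put` and `get` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Toy model of the HBase logical data model (illustration only, not the HBase API).
# "namespace:table" -> RowKey -> "family:qualifier" -> value
tables = defaultdict(lambda: defaultdict(dict))

def put(table, row_key, family, qualifier, value):
    """Store one cell; each row may carry a different set of columns."""
    tables[table][row_key][f"{family}:{qualifier}"] = value

def get(table, row_key):
    """Fetch all cells of a row by its RowKey (empty dict if the row is absent)."""
    return dict(tables[table].get(row_key, {}))

put("hr:performance", "emp001", "info", "name", "Alice")
put("hr:performance", "emp001", "score", "q1", "A")
put("hr:performance", "emp002", "info", "name", "Bob")  # this row has no score columns

print(get("hr:performance", "emp001"))
```

Rows in the same table do not need to share columns, which is what makes the model convenient for loosely structured data.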
HBase Architecture
HBase consists of three server types in a master‑slave model:
RegionServer : Handles data reads and writes. Each RegionServer hosts many regions (up to roughly 1,000), and each region covers a contiguous range of RowKeys.
HBase Master (HMaster) : Assigns regions, performs DDL, and monitors RegionServers.
ZooKeeper : Maintains cluster state, synchronizes data, and manages HMaster election.
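Because each region covers a contiguous RowKey range, finding the region for a key amounts to a search over sorted start keys. The sketch below illustrates that idea in plain Python; the start keys, the server names, and the function `locate_region` are assumptions made up for the example, not values from a real cluster.

```python
import bisect

# Hypothetical region layout: each region starts at a RowKey (inclusive)
# and ends where the next region starts. "" marks the start of the table.
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_SERVERS = ["rs1", "rs2", "rs1", "rs3"]  # server hosting each region

def locate_region(row_key):
    """Return (region index, RegionServer) for the region whose range covers row_key."""
    idx = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return idx, REGION_SERVERS[idx]

print(locate_region("alice"))
print(locate_region("nina"))
print(locate_region("zoe"))
```

One RegionServer can appear several times in the list because it hosts multiple regions.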
Cluster Coordination
Each RegionServer keeps a session with ZooKeeper alive through heartbeats and registers an ephemeral node. If the heartbeats stop, the session expires and ZooKeeper removes the node; HMaster notices the removal and triggers data recovery and failover for that server's regions. The active HMaster is tracked the same way: it holds its own ephemeral node, and when that node disappears a standby master is elected as the new active master, ensuring high availability.
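The failure-detection idea can be sketched with a toy coordinator that forgets any server whose heartbeat is overdue. This is only a simulation with explicit timestamps, not the ZooKeeper API; the class `ToyCoordinator`, its methods, and the timeout of 3 time units are hypothetical.

```python
class ToyCoordinator:
    """Toy stand-in for heartbeat-backed nodes (not the ZooKeeper API)."""

    def __init__(self, session_timeout):
        self.session_timeout = session_timeout
        self.last_heartbeat = {}  # server name -> time of last heartbeat

    def heartbeat(self, server, now):
        # A heartbeat creates the server's node or keeps it alive.
        self.last_heartbeat[server] = now

    def expire(self, now):
        """Remove nodes whose heartbeat is overdue; return the servers considered failed."""
        dead = [s for s, t in self.last_heartbeat.items()
                if now - t > self.session_timeout]
        for s in dead:
            del self.last_heartbeat[s]
        return sorted(dead)

zk = ToyCoordinator(session_timeout=3)
zk.heartbeat("rs1", now=0)
zk.heartbeat("rs2", now=0)
zk.heartbeat("rs1", now=4)   # rs2 has stopped sending heartbeats
failed = zk.expire(now=5)    # a master would now reassign the regions of these servers
print(failed)
```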
Data Write Process
The HBase client issues a Put request, which first writes to the Write‑Ahead Log (WAL) stored in HDFS.
The data is then written to the RegionServer’s MemStore (in‑memory cache).
After the MemStore write succeeds, the client receives a success acknowledgment.
When the MemStore reaches its size threshold, it is flushed to a new HFile on HDFS; the flush writes the keys in sorted order and builds a multi-level index and a trailer that points to it.
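The write steps above can be modeled in a few lines. The sketch below is a simplified in-memory simulation, not HBase internals: the class `ToyRegion`, the cell-count threshold, and the synchronous flush are assumptions made for illustration (a real MemStore is flushed by size, in the background).

```python
class ToyRegion:
    """Toy write path: WAL append, then MemStore, then flush to sorted immutable files."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # max number of cells held in memory
        self.wal = []        # stands in for the write-ahead log on HDFS
        self.memstore = {}   # in-memory buffer: RowKey -> value
        self.hfiles = []     # each flush produces one sorted list of (RowKey, value)

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # log the edit first so it survives a crash
        self.memstore[row_key] = value     # then apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                   # simplified: flush inline once the threshold is hit
        return True                        # acknowledgment returned to the client

    def flush(self):
        # Keys are written in sorted order, as in an HFile.
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}

region = ToyRegion(flush_threshold=3)
for key in ["row-c", "row-a", "row-b", "row-d"]:
    region.put(key, "v")

print(region.hfiles)
print(region.memstore)
```

Writes stay fast because they only append to the log and update memory; sorting happens once per flush rather than once per write.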
Data Read Process
The client asks ZooKeeper which RegionServer hosts the meta table, then queries the meta table to locate the region containing the desired RowKey.
Using the meta table information, the client identifies the RegionServer that serves that region and sends the read request to it.
The RegionServer first checks the BlockCache; if the data is not there, it looks in the MemStore; if it is still missing, it reads from the HFiles.
During HFile reads, Bloom filters and the timestamp metadata stored with each file let HBase quickly rule out files that cannot contain the RowKey.
Read data is cached in BlockCache for faster subsequent accesses.
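The lookup order described above can be sketched as follows. This is a simplified illustration rather than HBase's actual read path, which merges cells from all three sources; `TinyBloom`, `make_hfile`, and `read` are hypothetical helpers, and the Bloom filter parameters are arbitrary.

```python
import hashlib

class TinyBloom:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=64, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))

def make_hfile(cells):
    """Build a toy HFile: a Bloom filter over its keys plus the data itself."""
    bloom = TinyBloom()
    for key in cells:
        bloom.add(key)
    return bloom, dict(cells)

def read(row_key, block_cache, memstore, hfiles):
    """Return (value, source) following the lookup order described above."""
    if row_key in block_cache:
        return block_cache[row_key], "blockcache"
    if row_key in memstore:
        return memstore[row_key], "memstore"
    for bloom, data in reversed(hfiles):  # newest file first
        if not bloom.might_contain(row_key):
            continue  # key is definitely absent from this file, so skip it
        if row_key in data:
            block_cache[row_key] = data[row_key]  # cache for later reads
            return data[row_key], "hfile"
    return None, "miss"

hfiles = [make_hfile({"emp001": "A"}), make_hfile({"emp002": "B"})]
cache, memstore = {}, {"emp003": "C"}
first = read("emp001", cache, memstore, hfiles)
second = read("emp001", cache, memstore, hfiles)
print(first, second)
```

The second read of the same key is served from the cache, which is the effect the last step of the read process describes.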
Best Practices for the HR Platform
Key Features Aligned with Platform Needs
Horizontal scalability to handle large data volumes.
Column‑family storage supporting hundreds of columns, ideal for unstructured data and single‑column updates.
Millisecond‑level random read/write, multi‑level caching, automatic failover, and cross‑region replication.
Built‑in compression to reduce storage footprint.
Multi‑version support for historical data.
Data TTL (time‑to‑live) for automatic cleanup of stale records.
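Multi-version storage and TTL can be illustrated with a toy cell that keeps a bounded number of timestamped values. This is a sketch of the behavior only, not HBase code; the class `VersionedCell`, its parameters, and the integer timestamps are hypothetical.

```python
class VersionedCell:
    """Toy cell keeping several timestamped versions with a TTL (not HBase code)."""

    def __init__(self, max_versions, ttl_seconds):
        self.max_versions = max_versions
        self.ttl_seconds = ttl_seconds
        self.versions = []  # list of (timestamp, value), newest first

    def put(self, timestamp, value):
        self.versions.append((timestamp, value))
        self.versions.sort(reverse=True)
        del self.versions[self.max_versions:]  # keep only the newest versions

    def get(self, now, versions=1):
        """Return up to `versions` unexpired values, newest first."""
        live = [(ts, v) for ts, v in self.versions if now - ts < self.ttl_seconds]
        return [v for _, v in live[:versions]]

cell = VersionedCell(max_versions=2, ttl_seconds=100)
cell.put(10, "B")
cell.put(50, "A")
cell.put(90, "S")   # the oldest version (timestamp 10) is dropped

print(cell.get(now=100))
print(cell.get(now=100, versions=2))
print(cell.get(now=160, versions=2))  # the version written at 50 has expired
```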
Drawbacks
RowKey design is critical: keys must be unique and evenly distributed, and queries on anything other than the RowKey often require secondary indexes maintained in external systems.
Efficient lookups are only possible by RowKey; filtering on the value of an individual column without knowing the RowKey requires scanning the table.
Operational Considerations
Hotspot Mitigation : Pre‑split tables (e.g., HexStringSplit) and design RowKeys with salted hashes (e.g., Snowflake ID + MD5) to distribute load across regions.
Batch Get Size : Keep batch reads under 100 rows to avoid performance degradation.
Quota & Rate Limiting : Monitor QPS limits (e.g., 10,000 ops/sec) and request higher quotas or batch operations if needed.
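Row Size Limit : Keep rows under 400 KB. The 512 KB value limit cited for this platform is deployment-specific; stock Apache HBase defaults the client-side maximum cell size (hbase.client.keyvalue.maxsize) to 10 MB.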
Connection Management : Create HBase connections at application startup, not per request, due to high connection overhead.
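Two of the considerations above, salted RowKeys and bounded batch reads, can be sketched with the standard library alone. The helpers `salted_row_key` and `chunked`, the 4-character prefix, and the sample IDs are assumptions for illustration; the article does not show the platform's actual key format.

```python
import hashlib

def salted_row_key(record_id, prefix_len=4):
    """Prefix the ID with the first hex characters of its MD5 so keys spread across regions."""
    digest = hashlib.md5(str(record_id).encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}_{record_id}"

def chunked(items, size=100):
    """Split a list of RowKeys into batches of at most `size` keys for batch Gets."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Sequential IDs (as a Snowflake-style generator would produce) get scattered prefixes.
keys = [salted_row_key(1000000 + i) for i in range(250)]
batches = list(chunked(keys, size=100))

print(keys[0])
print([len(b) for b in batches])
```

Because the prefix is derived from the ID itself, a reader can recompute the full RowKey from the ID alone. Hex prefixes also line up naturally with tables pre-split on hex-string boundaries.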
Performance Reference in the HR Platform
[Charts not reproduced: HBase Write TP99 (last 30 days), HBase Write Average, HBase Query TP99, HBase Query Average]
