Practical Use of HBase in a Logistics HR Data Preprocessing Platform
This article details how the logistics HR data preprocessing platform processes around 20 million daily records by adopting HBase for high‑performance, scalable, column‑oriented storage, covering its architecture, read/write mechanisms, best practices, and performance considerations.
The logistics HR data preprocessing platform receives tens of thousands of employees' workload data daily, with a volume of about 20 million records per day and over 5 billion per month, requiring high‑performance queries and updates; the article explains how HBase is applied to meet these needs.
Background
Alternative solutions such as OSS, MySQL, Elasticsearch, and ClickHouse were evaluated, but OSS cannot meet high‑concurrency query performance, MySQL struggles with >1 billion rows, Elasticsearch lacks efficient single‑field updates and is costly, while ClickHouse is suited for OLAP.
Why HBase
HBase is an open‑source, distributed, column‑oriented database built on Hadoop, offering non‑structured data storage, high reliability, high performance, scalability, and low cost, making it suitable for the platform’s large, unstructured, high‑performance requirements.
Typical HBase Use Cases
Object storage for news, images, and virus databases.
Time‑series data via OpenTSDB on top of HBase.
Spatio‑temporal data for connected‑car applications.
Message/Order storage (e.g., Facebook stores billions of messages daily).
Feed streams such as WeChat Moments.
Basic Concepts
Namespace (similar to a MySQL database), table name, column family (a group of columns), column qualifier, and RowKey (the unique key for each row, stored as a key‑value pair).
HBase Architecture
HBase consists of three server types in a master‑slave model:
RegionServer – handles read/write operations; each server manages up to 1 000 regions.
HMaster – manages region assignment, DDL, and monitors RegionServers.
ZooKeeper – maintains cluster state, handles server health checks and HMaster election.
Cluster Cooperation
RegionServers send heartbeats to ZooKeeper; if a heartbeat is missed, ZooKeeper notifies HMaster, which triggers data recovery and failover. HMaster also maintains a heartbeat with ZooKeeper to ensure high availability.
Data Read Process
Client requests data and obtains MetaTable metadata from ZooKeeper.
Using MetaTable, the client discovers the RegionServer that holds the target RowKey.
Client first checks BlockCache, then MemStore, and finally reads from HFile if needed.
HFile reads use Bloom filters and time indexes to quickly locate the RowKey.
Multi‑level indexes are loaded into BlockCache to accelerate subsequent reads.
Data Write Process
Client issues a Put request; data is first written to the Write‑Ahead Log (WAL) on HDFS.
Data is then stored in the MemStore of the appropriate RegionServer.
After MemStore flushes (when reaching a threshold), data is sorted and written to an HFile on HDFS, with multi‑level indexes and a trailer.
Best Practices for This Platform
Horizontal scalability solves the >20 million daily record challenge.
Column families support thousands of columns, handling unstructured data and per‑column updates.
Millisecond‑level random read/write, real‑time access, high availability, multi‑level caching, automatic failover, and dual‑active deployment.
Built‑in compression reduces storage footprint.
Multi‑version support preserves historical changes.
TTL (time‑to‑live) automatically expires cold data.
Advantages & Disadvantages
Advantages: dynamic column addition, automatic region splitting, high concurrency support.
Disadvantages: only RowKey‑based queries, not suitable for wide scans, no transaction support.
Practical Tips
Prevent hotspotting by pre‑splitting tables (e.g., HexStringSplit) and designing RowKeys with unique, evenly distributed values (Snowflake ID + MD5).
Batch get size should stay below 100 KB for optimal performance.
Single row size should not exceed 400 KB.
Scan operations must specify startKey/endKey and limit to ~100 rows to avoid instability.
Reuse HBase connections; creating a new connection per request is expensive.
Performance reference charts (write TP99, update TP99, query TP99) illustrate the platform’s latency under real workloads.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
