HBase Availability and Latency Optimizations: Replication‑Based Multi‑Read and ZGC Adoption
To overcome HBase’s weak availability and GC‑induced latency spikes, the DiDi team introduced a replication‑based client multi‑read (hedged‑read) mechanism and migrated to the Z Garbage Collector, which together dramatically cut maximum and 99.9th‑percentile latencies while keeping services online during region disruptions.
HBase is a low‑cost, distributed LSM‑based database built on HDFS that supports millisecond‑level queries and PB‑scale storage, and is widely used across DiDi’s business lines. Despite its strengths, HBase suffers from two prominent drawbacks: relatively weak availability and severe latency spikes ("jitter").
In the second half of 2020 the HBase team shifted focus toward front‑end and near‑real‑time workloads. Two major pain points were identified: (1) weak availability caused by HBase’s choice of consistency over availability in the CAP theorem, leading to temporary region unavailability during region migration, split, merge, or RegionServer crashes; (2) noticeable latency spikes largely induced by Java GC and the shared HDFS infrastructure.
On the availability side, HBase by default serves each region from a single RegionServer with no region replicas, so any region‑level disruption causes a short‑term outage for the affected data, which is unacceptable for high‑throughput, low‑latency services. The community’s region replica feature mitigates the issue, but its reliability is still evolving and it does not satisfy multi‑datacenter disaster‑recovery requirements.
The jitter problem stems from HBase’s dependence on Java GC and HDFS; GC pauses manifest as latency spikes that degrade user experience.
To tackle these challenges, the team explored two directions: (1) a replication‑based client multi‑read mechanism, and (2) the adoption of the Z Garbage Collector (ZGC) in the HBase runtime.
The multi‑read solution leverages HBase’s asynchronous replication (WAL‑based master‑slave sync) and introduces a client‑side hedged‑read strategy similar to HDFS’s hedgedRead. Configuration parameters such as hbase.client.hedged.read, hbase.client.hedged.read.timeout, and hbase.zookeeper.quorum.hedged.read were added to enable and tune the feature. The design proceeds in three stages: (a) basic replication master‑slave, (b) replication + failover using ZooKeeper for automatic service switch, and (c) client‑side multi‑read that issues parallel reads to primary and secondary clusters, returning the first successful response.
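The client‑side behavior in stage (c) can be illustrated with a minimal Python sketch. It is not DiDi's implementation: it assumes the hedged‑read timeout acts as a head start for the primary before the same read is sent to the standby (as HDFS hedged reads do), and `hedged_read`, `slow_primary`, and `fast_standby` are hypothetical names standing in for real HBase client calls.

```python
import concurrent.futures as cf
import time


def hedged_read(read_primary, read_standby, hedge_delay_s):
    """Return (source, value) from the first cluster that answers successfully.

    Assumption: the primary gets a head start of hedge_delay_s seconds; the
    standby is queried only if the primary is slow or has already failed.
    """
    pool = cf.ThreadPoolExecutor(max_workers=2)
    try:
        sources = {pool.submit(read_primary): "primary"}
        done, _ = cf.wait(sources, timeout=hedge_delay_s)
        if not done or next(iter(done)).exception() is not None:
            # Primary is slow or failed: send the same read to the standby.
            sources[pool.submit(read_standby)] = "standby"
        last_error = None
        for fut in cf.as_completed(sources):
            if fut.exception() is None:
                return sources[fut], fut.result()
            last_error = fut.exception()
        raise last_error  # every cluster that was tried failed
    finally:
        # Do not block the caller on the slower request.
        pool.shutdown(wait=False)


# Hypothetical stand-ins for reads against two clusters.
def slow_primary():
    time.sleep(0.3)  # simulates a GC pause or a region in transition
    return "row-from-primary"


def fast_standby():
    return "row-from-standby"
```

With these stubs, `hedged_read(slow_primary, fast_standby, 0.02)` is answered by the standby while the primary is still stalled. A shorter delay hedges more aggressively at the cost of extra load on the standby cluster.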
Performance tests using YCSB (1 M rows) compared a control group (multi‑read disabled) with an experimental group (multi‑read enabled). Results showed that multi‑read significantly reduced the maximum latency and the 99.9th percentile (P999), effectively smoothing out spikes.
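Why reading from two clusters trims the tail can be seen in a toy simulation. The sketch below uses a synthetic latency model (an assumption, not DiDi's YCSB data): each read normally takes 1 to 3 ms, with a 1% chance of an extra 200 ms stall. It also ignores the hedge delay and treats the two clusters as independent. The hedged latency is modeled as the faster of two draws, and P999 is computed with the nearest‑rank method.

```python
import math
import random


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def synthetic_latency_ms(rng):
    # Assumed toy model: 1-3 ms normally, plus a 200 ms stall 1% of the time.
    base = rng.uniform(1.0, 3.0)
    return base + (200.0 if rng.random() < 0.01 else 0.0)


rng = random.Random(42)  # fixed seed so the run is repeatable
single = [synthetic_latency_ms(rng) for _ in range(100_000)]
standby = [synthetic_latency_ms(rng) for _ in range(100_000)]
# A hedged read finishes when the faster of the two clusters responds.
hedged = [min(a, b) for a, b in zip(single, standby)]

p999_single = percentile(single, 99.9)
p999_hedged = percentile(hedged, 99.9)
```

In this model a single‑cluster read stalls 1% of the time, so its P999 lands inside the stall, whereas a hedged read stalls only when both clusters stall at once, which is rare enough to fall outside the P999 cut. Real clusters share more failure modes than this model admits, so the measured benefit will be smaller than the idealized one.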
Multi‑read comes with trade‑offs. Because replication is asynchronous, reads are only eventually consistent: data returned by the standby cluster may lag behind the primary. For scans, hedging applies only to the first RPC; subsequent RPCs remain bound to the chosen cluster. The team also compared active‑active (multi‑read) versus active‑passive (replication + failover) architectures.
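The scan behavior can be sketched as follows. This is an illustrative Python outline, not HBase client code: `StickyScanner`, `open_hedged`, and `fetch_next` are hypothetical names, and the callbacks stand in for the real RPCs.

```python
class StickyScanner:
    """Hedges only the opening RPC of a scan, then stays on one cluster."""

    def __init__(self, open_hedged, fetch_next):
        # open_hedged() -> (cluster_name, first_batch): the only hedged call.
        # fetch_next(cluster_name) -> next batch, or an empty list when done.
        self._open_hedged = open_hedged
        self._fetch_next = fetch_next
        self.cluster = None

    def batches(self):
        # First RPC: whichever cluster wins the hedge becomes the scan's home.
        self.cluster, batch = self._open_hedged()
        while batch:
            yield batch
            # Later RPCs go only to the chosen cluster, because the scanner
            # state lives on that cluster's RegionServer.
            batch = self._fetch_next(self.cluster)
```

One consequence of this design is that a stall on the chosen cluster in the middle of a long scan is not hedged, so the latency benefit applies mainly to gets and short scans.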
Regarding GC latency, the team adopted ZGC because G1’s pause times could not meet the stringent latency requirements of DiDi’s front‑end services. Once ZGC became a production‑ready collector in JDK 15, the team selected AdoptOpenJDK 15, resolved the compilation issues that building HBase on it raised (missing classes, module exports, outdated dependencies, Maven plugin upgrades), and rebuilt HBase.
Benchmarks showed that, compared with G1, ZGC reduced scan‑stage P99 latency by about 20% and P999 latency by about 40%. Under heavy write pressure, G1 hit Full GC pauses longer than 40 s, which led to RegionServer crashes; under the same load ZGC kept 99.93% of pauses under 10 ms, preventing service outages.
In conclusion, while HBase remains a strong candidate for massive offline data processing, its inherent availability and jitter limitations have hindered front‑end adoption. The explored replication‑based multi‑read and ZGC integration substantially improve availability and latency, bringing HBase closer to meeting the demands of latency‑sensitive, online workloads.
