Enhancing HBase CAP Model and MTTR with Kafka‑Based IO Decoupling and Native AP Support
The article analyzes HBase's CP‑oriented CAP limitations, proposes native AP support via Replica, decouples WAL IO to Kafka, optimizes MTTR, introduces multi‑datacenter active/active disaster recovery, and redesigns client write paths and LogSplit processing for higher availability and throughput.
HBase primarily follows a CP model: it provides strong consistency for writes but suffers a long mean time to recovery (MTTR) when a RegionServer crashes, because node-failure detection is slow (it waits for the ZooKeeper session timeout) and replaying the write-ahead log (WAL) to rebuild LSM-tree state is costly.
To address AP‑oriented workloads that tolerate delayed visibility but demand high availability, HBase offers a native Replica feature that adds extra read‑only Region replicas; however, this adds IO overhead, only protects reads, and does not help cross‑datacenter disaster recovery.
The proposed solution decouples WAL‑related IO from the HBase cluster by offloading it to Kafka, reducing overall disk IO pressure and allowing both WAL writes and Replica synchronization to be handled by Kafka.
MTTR is further reduced by routing client writes through an SDK that first records WAL entries to Kafka. When a node fails, the client relies on Kafka's faster failure detection and ISR-based (in-sync replica) failover, which eliminates the long HBase log-replay phase.
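The log-first write path can be illustrated with a small in-memory sketch. Everything here is hypothetical: `DurableLog` stands in for a Kafka partition, `RegionStore` for a RegionServer's in-memory state, and `LogFirstWriter` for the SDK. The sketch also assumes a write counts as accepted once it is in the log, which the article does not spell out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for one Kafka partition: an append-only list of (row, value) pairs.
final class DurableLog {
    private final List<String[]> entries = new ArrayList<>();

    // Appends an entry and returns its offset (its index in the log).
    synchronized long append(String row, String value) {
        entries.add(new String[] {row, value});
        return entries.size() - 1;
    }

    // Returns a copy of all entries at or after the given offset.
    synchronized List<String[]> readFrom(long offset) {
        return new ArrayList<>(entries.subList((int) offset, entries.size()));
    }
}

// Hypothetical stand-in for the serving side of a region.
final class RegionStore {
    final Map<String, String> rows = new TreeMap<>();
    long appliedUpTo = -1;      // highest log offset already applied
    boolean available = true;   // flip to false to simulate a crashed RegionServer

    void apply(long offset, String row, String value) {
        if (!available) {
            throw new IllegalStateException("region unavailable");
        }
        rows.put(row, value);
        appliedUpTo = offset;
    }
}

// Hypothetical SDK: log first, then try to bring the region up to date.
final class LogFirstWriter {
    private final DurableLog log;
    private final RegionStore store;

    LogFirstWriter(DurableLog log, RegionStore store) {
        this.log = log;
        this.store = store;
    }

    // The entry is durable once append returns, even if the region is down.
    long write(String row, String value) {
        long offset = log.append(row, value);
        try {
            catchUp();
        } catch (IllegalStateException e) {
            // Region is down; the entry stays in the log for a later catch-up.
        }
        return offset;
    }

    // Applies, in offset order, every log entry the region has not seen yet.
    int catchUp() {
        int applied = 0;
        for (String[] e : log.readFrom(store.appliedUpTo + 1)) {
            store.apply(store.appliedUpTo + 1, e[0], e[1]);
            applied++;
        }
        return applied;
    }
}
```

In this sketch, recovery is just a `catchUp()` call from the last applied offset, which is the property the Kafka-backed design relies on to avoid a separate log-split phase.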
For multi‑datacenter disaster recovery, independent HBase clusters are deployed per site and synchronized via native Replication; by switching from Active/Standby to Active/Active mode and leveraging Kafka‑based WAL storage, cross‑site failover becomes more seamless.
Client dual‑write is enabled through a CompositeConnection that holds two separate HBase connections; write requests are sent to both clusters (invokeAll) while read requests use invokeAny, improving SLA for eventually consistent workloads.
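A minimal sketch of the dual-write idea follows, using the `invokeAll` and `invokeAny` methods of the standard `java.util.concurrent.ExecutorService`. The `ClusterClient` interface, the `InMemoryCluster` fake, and the `CompositeClient` class are hypothetical stand-ins, not the HBase client API or the article's actual CompositeConnection. Returning the number of clusters that accepted a write is also an assumption; the article does not say how partial write failures are handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a connection to one HBase cluster.
interface ClusterClient {
    void put(String row, String value);
    String get(String row);
}

// In-memory fake so the sketch runs without a real cluster.
final class InMemoryCluster implements ClusterClient {
    private final Map<String, String> rows = new ConcurrentHashMap<>();
    volatile boolean down; // set to true to simulate a site outage

    public void put(String row, String value) {
        if (down) throw new IllegalStateException("cluster down");
        rows.put(row, value);
    }

    public String get(String row) {
        if (down) throw new IllegalStateException("cluster down");
        return rows.get(row);
    }
}

final class CompositeClient implements AutoCloseable {
    private final List<ClusterClient> clusters;
    private final ExecutorService pool;

    CompositeClient(List<? extends ClusterClient> clusters) {
        this.clusters = List.copyOf(clusters);
        // Daemon threads so an unclosed client cannot keep the JVM alive.
        this.pool = Executors.newFixedThreadPool(this.clusters.size(), r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });
    }

    // Write path: send to every cluster, return how many accepted the write.
    int put(String row, String value) {
        List<Callable<Void>> tasks = new ArrayList<>();
        for (ClusterClient c : clusters) {
            tasks.add(() -> { c.put(row, value); return null; });
        }
        int succeeded = 0;
        try {
            for (Future<Void> f : pool.invokeAll(tasks)) {
                try {
                    f.get();
                    succeeded++;
                } catch (ExecutionException e) {
                    // This cluster rejected the write; the others may still succeed.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return succeeded;
    }

    // Read path: the first cluster to answer without an exception wins.
    String get(String row) {
        List<Callable<String>> tasks = new ArrayList<>();
        for (ClusterClient c : clusters) {
            tasks.add(() -> c.get(row));
        }
        try {
            return pool.invokeAny(tasks);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("all clusters failed", e);
        }
    }

    @Override
    public void close() {
        pool.shutdownNow();
    }
}
```

Because `invokeAny` returns whichever result arrives first, a read may come from a cluster that has not yet received the latest write. That is acceptable only for the eventually consistent workloads the article targets.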
Kafka-based log replay replaces the native LogSplit mechanism. Each WAL entry is identified by its Kafka offset rather than by a position in an HDFS WAL file, and the client propagates that offset to RegionServers through Mutation attributes. With the offset available on the server side, custom SplitLogManager, TaskFinisher, and TaskExecutor implementations can operate on Kafka partitions instead of HDFS files.
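The offset hand-off can be sketched as follows. HBase mutations do carry named byte-array attributes (`setAttribute` / `getAttribute`), but to stay self-contained this example uses a hypothetical `OffsetCarrier` class in their place. The attribute name `kafka.wal.position`, the 12-byte encoding, and the "replay from last flushed offset plus one" rule are all assumptions made for illustration, not details from the article.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// A Kafka (partition, offset) pair encoded as 4 + 8 bytes so it fits in a byte[] attribute.
final class WalPosition {
    final int partition;
    final long offset;

    WalPosition(int partition, long offset) {
        this.partition = partition;
        this.offset = offset;
    }

    byte[] toBytes() {
        return ByteBuffer.allocate(Integer.BYTES + Long.BYTES)
                .putInt(partition)
                .putLong(offset)
                .array();
    }

    static WalPosition fromBytes(byte[] raw) {
        ByteBuffer buf = ByteBuffer.wrap(raw);
        return new WalPosition(buf.getInt(), buf.getLong());
    }
}

// Hypothetical stand-in for a mutation's attribute map.
final class OffsetCarrier {
    private final Map<String, byte[]> attributes = new HashMap<>();

    void setAttribute(String name, byte[] value) {
        attributes.put(name, value);
    }

    byte[] getAttribute(String name) {
        return attributes.get(name);
    }
}

// Server-side bookkeeping: where should replay start for each Kafka partition?
final class ReplayPlanner {
    static final String ATTR = "kafka.wal.position"; // hypothetical attribute name

    private final Map<Integer, Long> lastFlushed = new HashMap<>();

    // Records that the edit carried by this mutation has been persisted.
    void markFlushed(OffsetCarrier mutation) {
        WalPosition p = WalPosition.fromBytes(mutation.getAttribute(ATTR));
        lastFlushed.merge(p.partition, p.offset, Long::max);
    }

    // Replay starts right after the highest flushed offset, or at 0 if nothing was flushed.
    long replayStart(int partition) {
        Long last = lastFlushed.get(partition);
        return last == null ? 0L : last + 1;
    }
}
```

With positions tracked per partition, a recovery task only needs a partition number and a start offset, which is what lets the split tasks work on Kafka partitions rather than HDFS files.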
Overall, the architecture redesign reduces IO consumption, shortens MTTR, supports AP scenarios, and provides flexible active/active multi‑datacenter resilience.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies