
High‑Availability Practices of Alibaba HBase: Large Clusters, MTTF/MTTR, Disaster Recovery, and Extreme Experience

This article reviews Alibaba HBase's evolution toward high availability, covering large‑cluster architecture, reliability metrics (MTTF/MTTR), disaster‑recovery strategies such as data replication and traffic switching, performance optimizations for extreme latency requirements, and lessons learned for building resilient distributed database services.

Big Data Technology & Architecture

Oct 21, 2019


Introduction

Since 2011 Alibaba has integrated HBase into its technology stack, scaling it to support core services such as Double‑Eleven, Alipay billing, and logistics. By 2018 a single cluster handled 2.4 trillion rows per day, prompting a shift toward public‑cloud, high‑availability solutions.

Large Clusters

Operating one cluster per business quickly becomes inefficient, because each cluster consumes dedicated ZooKeeper, Master, and NameNode nodes. Alibaba moved to a large‑cluster model with more than 700 nodes per cluster and introduced a "group" concept that isolates compute while sharing storage, so groups and tables can be scaled dynamically.

Shared HDFS storage introduces risks of its own: a single bad disk can affect many write pipelines. Alibaba mitigates this by monitoring disk health, shortening the time a failing disk can affect traffic, and using a write‑quorum strategy that tolerates partial replica failures.
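The write‑quorum idea can be illustrated with a small sketch. This is not Alibaba's or HDFS's implementation; `quorum_write`, the replica names, and the callable‑per‑replica interface are assumptions made for illustration. The simplified version waits for every attempt to finish before deciding, whereas a real pipeline would likely return as soon as enough acknowledgements arrive.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def quorum_write(replicas, payload, min_acks):
    """Write payload to every replica and report whether enough acknowledged.

    replicas maps a (hypothetical) replica name to a callable that raises
    on failure. Returns (succeeded, failed_replica_names).
    """
    acks = 0
    failed = []
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        # Submit all writes concurrently so one slow disk does not serialize the rest.
        futures = {pool.submit(write, payload): name
                   for name, write in replicas.items()}
        for future in as_completed(futures):
            if future.exception() is None:
                acks += 1
            else:
                failed.append(futures[future])
    # The write counts as durable once min_acks replicas have accepted it.
    return acks >= min_acks, failed
```

With three replicas and `min_acks=2`, one bad disk fails its own write but does not fail the request; the list of failed replicas can then feed the disk‑health monitoring.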

The number of client connections to ZooKeeper is limited per IP, and separating client ZooKeeper links from server links (HBASE‑20159) reduces heartbeat pressure on ZooKeeper.

MTTF & MTTR

Reliability is measured by Mean Time To Failure (MTTF) and Mean Time To Repair (MTTR). Failure sources include hardware faults, software bugs, operational errors, overload, and dependent component outages.
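The two metrics combine into the standard steady‑state availability formula, availability = MTTF / (MTTF + MTTR). The sketch below only works through that arithmetic; the function names and the sample numbers are illustrative and do not come from Alibaba's measurements.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mttf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the service is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def yearly_downtime_hours(mttf_hours, mttr_hours):
    """Expected downtime per year implied by the availability figure."""
    return (1.0 - availability(mttf_hours, mttr_hours)) * HOURS_PER_YEAR
```

For example, an MTTF of 999 hours with an MTTR of 1 hour gives an availability of 0.999, or about 8.76 hours of downtime per year. Halving MTTR roughly halves the downtime, which is why both failure prevention and faster recovery matter.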

Key incidents addressed:

Periodic Full GC causing process aborts – solved by BucketCache and SharedBucketCache.

HDFS write failures – improved exception handling and retry logic.

Large concurrent scans causing CPU/IO overload – introduced request monitoring, throttling, and automatic interruption of long‑running scans.

Slow region splits – implemented cascading splits to avoid repeated compactions.
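The scan protection listed above (throttling plus automatic interruption of long‑running scans) could look roughly like the following. This is a generic sketch, not HBase code: `TokenBucket`, `run_scan`, and `ScanAborted` are hypothetical names, and the token‑bucket algorithm is one common way to throttle, chosen here as an assumption.

```python
import time


class ScanAborted(Exception):
    """Raised when a scan exceeds its time budget."""


class TokenBucket:
    """Admits requests at a sustained rate with a bounded burst."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def try_acquire(self, n=1):
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False


def run_scan(rows, max_seconds, clock=time.monotonic):
    """Consume rows, aborting once the scan has run longer than max_seconds."""
    deadline = clock() + max_seconds
    result = []
    for row in rows:
        if clock() > deadline:
            raise ScanAborted("scan exceeded %.1f s" % max_seconds)
        result.append(row)
    return result
```

A server would call `try_acquire()` before admitting a scan and reject or queue the request when it returns `False`; the deadline check bounds the damage from scans that were admitted but turn out to be expensive.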

To accelerate recovery, Alibaba redesigned the failover path (MTTR2), removing the Split Log step and optimizing region assignment. Recovery is reported to be more than 200% faster than the open‑source baseline.

A client‑side Fast Fail mechanism (DeadServerDetective) quickly discards requests to failed servers, preserving thread pool resources.
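A minimal sketch of the fast‑fail idea follows. The article names the mechanism DeadServerDetective but gives no details, so the class `DeadServerTracker`, its thresholds, and the probe‑after‑timeout behavior are assumptions for illustration only.

```python
import time


class FastFailError(Exception):
    """Raised immediately instead of sending a request to a suspected-dead server."""


class DeadServerTracker:
    def __init__(self, failure_threshold=3, retry_after=5.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.retry_after = retry_after
        self.clock = clock
        self.failures = {}    # server -> consecutive failure count
        self.dead_until = {}  # server -> time after which a probe is allowed

    def check(self, server):
        """Call before sending; rejects at once while the server is marked dead."""
        until = self.dead_until.get(server)
        if until is not None and self.clock() < until:
            raise FastFailError(server)

    def record_success(self, server):
        self.failures.pop(server, None)
        self.dead_until.pop(server, None)

    def record_failure(self, server):
        count = self.failures.get(server, 0) + 1
        self.failures[server] = count
        if count >= self.failure_threshold:
            # Mark dead for a while; later requests fail fast instead of
            # occupying a client thread until the RPC timeout expires.
            self.dead_until[server] = self.clock() + self.retry_after
```

Failing in microseconds rather than after a full timeout is what keeps the client thread pool available for requests to healthy servers.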

Disaster Recovery

Data Replication

Alibaba adopts asynchronous replication for most workloads, enhancing monitoring, reducing resource contention with remote consumers, and decoupling the replication component into a dedicated service (BDS Replication). This solves hotspot‑induced latency and tight coupling with HDFS.

Complex multi‑region topologies are monitored for link delay, and duplicate transmissions in ring‑like topologies are eliminated.
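One common way to avoid duplicate transmissions in a ring is to carry, with each edit, the set of clusters it has already passed through. The simulation below illustrates that approach; it is an assumption and not necessarily how BDS Replication solves the problem, and `propagate` and the cluster names are hypothetical.

```python
from collections import deque


def propagate(peers, origin):
    """Simulate shipping one edit written at origin through a replication topology.

    peers maps each cluster to the clusters it replicates to.
    Returns (shipments, applied): the (src, dst) transfers performed and the
    set of clusters that ended up holding the edit.
    """
    shipments = []
    applied = {origin}
    queue = deque([(origin, frozenset([origin]))])
    while queue:
        src, visited = queue.popleft()
        for dst in peers[src]:
            if dst in visited:
                continue  # the edit already passed through dst: never send it back
            shipments.append((src, dst))
            if dst in applied:
                continue  # arrived via another path; dst drops the duplicate
            applied.add(dst)
            queue.append((dst, visited | {dst}))
    return shipments, applied
```

In a three‑cluster ring A → B → C → A, an edit written at A is shipped twice (A to B, B to C) and C does not send it back to A, so the loop terminates without redundant traffic.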

Traffic Switching

During a disaster, traffic is switched at the client side to the standby cluster via a high‑availability channel. Meta service overload is mitigated by caching and isolating Meta partitions.

Automatic switching evolved from manual, alert‑driven switching to a health‑score arbiter and finally to client‑side decision making based on failure rates.
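The last stage, client‑side switching based on failure rates, might be sketched as a sliding window over recent request outcomes. The class name, window size, threshold, and minimum sample count below are illustrative assumptions, not values from Alibaba's client.

```python
from collections import deque


class FailureRateSwitch:
    """Switches the active cluster to standby when recent failures exceed a threshold."""

    def __init__(self, window=100, threshold=0.5, min_samples=20):
        self.results = deque(maxlen=window)  # most recent outcomes, True = success
        self.threshold = threshold
        self.min_samples = min_samples
        self.active = "primary"

    def record(self, ok):
        """Record one request outcome and return the cluster to use next."""
        self.results.append(ok)
        if self.active == "primary" and len(self.results) >= self.min_samples:
            failure_rate = self.results.count(False) / len(self.results)
            if failure_rate >= self.threshold:
                self.active = "standby"
                self.results.clear()  # start a fresh window for the new target
        return self.active
```

The minimum sample count keeps a single failed request from triggering a switch. Switching back to the primary is deliberately left out of the sketch, since that decision usually depends on replication catch‑up as well as health.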

Extreme Experience

For latency‑sensitive risk‑control and recommendation scenarios, Alibaba HBase achieves sub‑15 ms P99 latency by optimizing write/read caches (CCSMAP, SharedBucketCache, IndexEncoding) and leveraging ZGC. DualService enables parallel reads from primary and standby, reducing jitter to near zero.
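The parallel‑read idea behind DualService can be illustrated with a small sketch that sends the same read to both clusters and returns whichever succeeds first. The function `dual_read` and its callable‑based interface are assumptions for illustration, not the DualService API.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def dual_read(primary, standby, key, pool):
    """Issue the read to both clusters and return the first successful result.

    primary and standby are callables taking a key; pool is a shared
    ThreadPoolExecutor. If both reads fail, the first error is raised.
    """
    pending = {pool.submit(primary, key), pool.submit(standby, key)}
    errors = []
    while pending:
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        for future in done:
            error = future.exception()
            if error is None:
                return future.result()  # the slower read is simply ignored
            errors.append(error)
    raise errors[0]
```

A pause on one cluster (a GC stall, for instance) is masked as long as the other answers in time. Because replication is asynchronous for most workloads, a read served by the standby may be slightly stale, so this pattern suits workloads that can tolerate that.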

Limitations of the primary‑standby model led to the development of the Lindorm engine, offering dual‑zone deployment, multi‑level consistency, and partition‑level failover.

Full‑link tracing (Trace) provides end‑to‑end request profiling for rapid issue diagnosis.

Conclusion

The article shares Alibaba HBase’s high‑availability practices, emphasizing user‑centric availability design, failure‑aware architecture, comprehensive monitoring, isolation‑based redundancy, fine‑grained resource control, self‑protection mechanisms, and traceability.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
