
How Alibaba Scales HBase for High Availability: 10‑Year Lessons from Production

This article reviews Alibaba's decade‑long evolution of HBase high‑availability, covering large‑cluster design, MTTF/MTTR metrics, disaster‑recovery strategies, traffic switching, and performance optimizations that together enable millions of requests per second with near‑zero downtime.

dbaplus Community

Dec 11, 2019

Preface

Alibaba has run HBase in its technology stack since 2011, serving most business units, including Taobao, Alipay, and Cainiao. By the 2018 Double‑11 event, the system was processing 2.4 trillion rows per day, a scale that makes robust high availability essential.

Large Clusters

Running a separate HBase cluster per business quickly becomes operationally costly and wastes resources. Starting in 2013, Alibaba moved to a large‑cluster model with over 700 nodes per cluster, introducing a group concept that shares storage while isolating compute.

Each group contains at least one server; a server belongs to only one group at a time but can be moved between groups for scaling. Tables are bound to a single group, allowing physical isolation of CPU and memory while sharing the underlying HDFS storage pool.
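The group rules above can be illustrated with a small bookkeeping model. This is a minimal sketch, not Alibaba's implementation: the class name GroupRegistry, its methods, and the server and table names are all hypothetical, and it only tracks membership rather than moving any real regions.

```python
class GroupRegistry:
    """Toy model of groups: each server and each table belongs to exactly one group."""

    def __init__(self):
        self.server_group = {}  # server name -> group name
        self.table_group = {}   # table name -> group name

    def servers_in(self, group):
        return sorted(s for s, g in self.server_group.items() if g == group)

    def add_server(self, server, group):
        # A server may belong to only one group at a time.
        if server in self.server_group:
            raise ValueError(f"{server} already belongs to {self.server_group[server]}")
        self.server_group[server] = group

    def move_server(self, server, target_group):
        # Scaling a group means moving servers, but a group must keep at least one.
        source = self.server_group[server]
        if len(self.servers_in(source)) == 1:
            raise ValueError(f"group {source} would be left without servers")
        self.server_group[server] = target_group

    def bind_table(self, table, group):
        if not self.servers_in(group):
            raise ValueError(f"group {group} has no servers")
        self.table_group[table] = group

    def candidate_servers(self, table):
        # Regions of a table may only be hosted by servers of the table's group.
        return self.servers_in(self.table_group[table])


# Hypothetical usage: shrink the "trade" group to grow the "logs" group.
registry = GroupRegistry()
for name in ("rs1", "rs2", "rs3"):
    registry.add_server(name, "trade")
registry.add_server("rs4", "logs")
registry.bind_table("orders", "trade")
registry.move_server("rs3", "logs")
```

After the move, the orders table can only be served by rs1 and rs2, so its CPU and memory stay isolated from the logs group even though both groups write to the same HDFS pool.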

Because all groups share one HDFS pool and HDFS replicates each block to three nodes, a single slow or bad disk can affect writes from any group. Alibaba mitigates this by monitoring for slow and bad disks to shorten the time they affect service, and by completing a write with two replicas when the third times out.
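The two-replica fallback can be pictured as a simple decision rule over acknowledgement latencies. The sketch below is an assumption-laden illustration: the function name, the latency list, and the timeout value are hypothetical, and real HDFS handles slow replicas inside its write pipeline rather than through a function like this.

```python
def acknowledged_replicas(ack_latencies_ms, timeout_ms, min_replicas=2):
    """Return the indexes of replicas that acknowledged within the timeout.

    The write succeeds as long as at least min_replicas acknowledged in time,
    so one slow third replica no longer blocks the writer.
    """
    acked = [i for i, latency in enumerate(ack_latencies_ms) if latency <= timeout_ms]
    if len(acked) < min_replicas:
        raise TimeoutError(
            f"only {len(acked)} of {len(ack_latencies_ms)} replicas acknowledged in time"
        )
    return acked


# Hypothetical latencies: the third replica sits on a slow disk.
healthy = acknowledged_replicas([4, 6, 900], timeout_ms=50)
```

The trade-off is temporarily reduced redundancy for that block, which is why the approach is paired with disk monitoring so the slow replica can be found and repaired.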

To protect ZooKeeper from excessive client connections, Alibaba limits the number of connections per IP and separates client links from server links to ZooKeeper (HBASE‑20159).
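A per-IP connection cap can be sketched in a few lines. ZooKeeper itself exposes such a cap through its maxClientCnxns server setting; the source does not say whether Alibaba relied on that setting or on its own mechanism, so the class below is only a hypothetical illustration of the idea.

```python
from collections import Counter


class PerIpConnectionLimiter:
    """Refuse new connections from an IP that already holds max_per_ip of them."""

    def __init__(self, max_per_ip):
        self.max_per_ip = max_per_ip
        self.open_connections = Counter()  # client IP -> open connection count

    def try_open(self, ip):
        if self.open_connections[ip] >= self.max_per_ip:
            return False  # a misbehaving client cannot exhaust the ensemble
        self.open_connections[ip] += 1
        return True

    def close(self, ip):
        if self.open_connections[ip] > 0:
            self.open_connections[ip] -= 1


# Hypothetical usage with a cap of two connections per client IP.
limiter = PerIpConnectionLimiter(max_per_ip=2)
attempts = [limiter.try_open("10.0.0.7") for _ in range(3)]
```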

MTTF & MTTR

Reliability is measured by Mean Time To Failure (MTTF) and Mean Time To Repair (MTTR): a longer MTTF means failures are rarer, and a shorter MTTR means each one is repaired faster. Failures come from five sources:

Hardware failures such as bad disks or network cards.

Software defects (bugs or performance bottlenecks).

Operational mistakes.

Service overload from spikes or large objects.

Dependency failures (e.g., HDFS or ZooKeeper).
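To see how the two metrics combine, the standard steady-state availability formula, MTTF / (MTTF + MTTR), is useful. The formula is textbook reliability arithmetic rather than something stated in the source, and the numbers below are hypothetical.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the service is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def yearly_downtime_minutes(mttf_hours, mttr_hours):
    """Expected downtime per 365-day year implied by the availability."""
    minutes_per_year = 365 * 24 * 60
    return (1 - availability(mttf_hours, mttr_hours)) * minutes_per_year


# Hypothetical cluster: one failure every 30 days, repaired in 30 or 5 minutes.
slow_repair = yearly_downtime_minutes(mttf_hours=720, mttr_hours=0.5)
fast_repair = yearly_downtime_minutes(mttf_hours=720, mttr_hours=5 / 60)
```

Shortening repair time cuts downtime almost proportionally, which is why the rest of the article invests as much in fast recovery and switching as in preventing failures.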

Key incidents:

Periodic Full GC (FGC) causing process aborts; solved by developing BucketCache and SharedBucketCache to reduce memory fragmentation.

HDFS write failures leading to abrupt process termination; addressed by improving exception handling, adding retries, and abandoning timed‑out third replicas.

Large concurrent scans causing CPU or I/O overload; mitigated with request monitoring, throttling, and automatic interruption of long‑running large requests.

Slow region splits for massive partitions; solved with a “cascading split” technique that bypasses compaction between split rounds.
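The automatic interruption of oversized requests mentioned in the scan incident can be sketched as a budget check inside the scan loop. This is a minimal sketch under stated assumptions: guarded_scan, ScanInterrupted, and both budget parameters are hypothetical names, and the source does not describe how Alibaba's server-side checks are actually implemented.

```python
import time


class ScanInterrupted(Exception):
    """Raised when a scan exceeds its row or time budget."""


def guarded_scan(rows, max_seconds, max_rows, clock=time.monotonic):
    """Yield rows until either budget is exhausted, then abort the scan."""
    started = clock()
    for count, row in enumerate(rows, start=1):
        if count > max_rows:
            raise ScanInterrupted(f"row budget of {max_rows} exceeded")
        if clock() - started > max_seconds:
            raise ScanInterrupted(f"time budget of {max_seconds}s exceeded")
        yield row


# Hypothetical usage: a scan over five rows with a budget of two rows.
collected = []
try:
    for row in guarded_scan(range(5), max_seconds=10, max_rows=2):
        collected.append(row)
except ScanInterrupted as error:
    interruption = str(error)
```

Interrupting the request on the server bounds the CPU and I/O that any single scan can consume, so one heavy caller cannot starve the other tenants of a shared cluster.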

Alibaba also built a “health diagnosis” system to pre‑emptively alert on abnormal metrics before they cause outages.

Disaster Recovery

Disaster recovery relies on isolation‑based redundancy at the resource, software, and operational layers. Alibaba has deployed intra‑city active‑standby and multi‑region active‑active setups, with 99% of clusters having at least one backup.

Data replication choices (synchronous vs. asynchronous, ordered vs. unordered) depend on business consistency requirements. Alibaba primarily uses asynchronous replication, which it strengthened with monitoring and remote consumption agents and eventually extracted into an independent service called BDS Replication.

Traffic Switching

During a disaster, traffic must be switched from the primary to the standby cluster. Alibaba modified the HBase client to perform internal failover, closing old connections and opening new ones to the backup cluster.

Meta service overload during massive switch‑overs was alleviated by redesigning the Meta table cache and isolating Meta partitions from data partitions.

Automatic switching is pursued via a health‑score arbitrator that triggers failover when a cluster’s score falls below a threshold, as well as client‑side logic that switches based on request failure rates.
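Both triggers can be sketched briefly: a score check for the arbitrator and a sliding window of request outcomes for the client. The names, window size, and thresholds below are hypothetical, and the sketch only swaps two labels instead of closing and reopening real connections.

```python
from collections import deque


def should_fail_over(health_score, threshold):
    """Arbitrator-style rule: switch when the cluster's score drops below the threshold."""
    return health_score < threshold


class FailoverClient:
    """Client-side rule: switch when the recent failure rate is too high."""

    def __init__(self, primary, standby, window=20, max_failure_rate=0.5):
        self.active, self.backup = primary, standby
        self.results = deque(maxlen=window)  # most recent request outcomes
        self.max_failure_rate = max_failure_rate

    def record(self, success):
        self.results.append(success)
        window_full = len(self.results) == self.results.maxlen
        failure_rate = self.results.count(False) / len(self.results)
        # Wait for a full window so a single early error does not trigger a switch.
        if window_full and failure_rate > self.max_failure_rate:
            self.switch()

    def switch(self):
        self.active, self.backup = self.backup, self.active
        self.results.clear()  # start measuring the new cluster from scratch


# Hypothetical usage: three failures in a window of four trigger the switch.
client = FailoverClient("primary", "standby", window=4, max_failure_rate=0.5)
for outcome in (True, False, False, False):
    client.record(outcome)
```

Clearing the window after a switch is a design choice in this sketch that avoids flapping back immediately on stale measurements.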

Ultimate Experience

For low‑latency risk‑control and recommendation workloads, Alibaba optimized the storage engine with CCSMAP, SharedBucketCache, IndexEncoding, lock‑free queues, coroutines, and ThreadLocal counters, achieving sub‑15 ms P999 latency on a single cluster.

DualService lets clients query the primary and standby clusters in parallel and use whichever response arrives first, which nearly eliminates latency spikes.
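The parallel-query idea can be sketched with a thread pool that returns the first successful answer. This follows the description above rather than DualService's real code: dual_get and the two stub functions are hypothetical, and a production client would also need timeouts and cancellation.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def dual_get(key, primary_get, standby_get):
    """Query both clusters at once and return the first successful response."""
    pool = ThreadPoolExecutor(max_workers=2)
    pending = {pool.submit(primary_get, key), pool.submit(standby_get, key)}
    last_error = None
    try:
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                if future.exception() is None:
                    return future.result()
                last_error = future.exception()
        raise last_error  # both clusters failed
    finally:
        pool.shutdown(wait=False)  # do not block on the slower cluster


# Hypothetical stubs: the primary is stalled, the standby answers immediately.
def slow_primary(key):
    time.sleep(0.2)
    return f"primary:{key}"


def fast_standby(key):
    return f"standby:{key}"


value = dual_get("user42", slow_primary, fast_standby)
```

Because replication to the standby is asynchronous, a response from the standby may be slightly older than the primary's data, so this pattern suits reads that tolerate eventual consistency.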

Limitations of the primary‑standby model include cluster‑level switch granularity, eventual‑consistency only, and coupling between replication and the storage engine. To overcome these, Alibaba developed the Lindorm engine with a dual‑Zone deployment, providing strong, session, and eventual consistency options and enabling partition‑level failover.

Conclusion

Alibaba’s HBase high‑availability journey emphasizes user‑centric design, failure‑aware engineering, comprehensive monitoring, isolation‑based redundancy, fine‑grained resource control, self‑protection mechanisms, and full‑stack tracing. These principles and concrete implementations can guide other teams building resilient distributed systems.
