Alibaba’s Secrets to Scaling HBase for PB‑Level Big Data
This article explains how Alibaba built, customized, and operated a large‑scale HBase platform, covering its architecture, high‑availability design, asynchronous and synchronous replication, multi‑link data flow, cost‑aware redundancy, cross‑cluster migration, performance optimizations, and future directions for the distributed NoSQL database.
Overview
HBase is an open‑source, non‑relational distributed database (NoSQL) modeled after Google’s Bigtable, offering high reliability, performance, and scalability on commodity servers. Originally a Hadoop sub‑project, it became an Apache top‑level project in 2010 and is now widely adopted by companies such as Facebook, Yahoo, and Alibaba.
HBase at Alibaba
Since 2011 Alibaba has used HBase as the core storage for Taobao, Tmall, Ant Financial, Cainiao, Alibaba Cloud, and other services, handling hundreds of GB/s of read/write traffic during peak events like Double‑11. The team built a one‑stop big‑data storage service covering software, solutions, stability, and development support.
High‑Availability Construction
Alibaba measures availability against SLA targets (e.g., 99.99% availability allows at most about 52.6 minutes of downtime per year). To reach that level, data is replicated across multiple data centers, which requires keeping the cross‑site copies consistent and designing the system to tolerate failures.
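The SLA figure above is plain arithmetic: the yearly downtime budget is (1 - availability) multiplied by the number of minutes in a year. A minimal sketch, assuming a 365-day year (leap years ignored); the function name is illustrative:

```python
# Minutes in a 365-day year; leap years are ignored for simplicity.
MINUTES_PER_YEAR = 365 * 24 * 60


def max_downtime_minutes(availability: float) -> float:
    """Yearly downtime budget, in minutes, allowed by an availability target such as 0.9999."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


for target in (0.999, 0.9999, 0.99999):
    print(f"{target:.3%} availability -> {max_downtime_minutes(target):.1f} min/year")
```

For the 99.99% target this prints 52.6 min/year, which matches the figure quoted above. Each extra "nine" shrinks the budget tenfold, which is why the higher targets push toward multi-data-center replication.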
Cluster Asynchronous Replication
From HBase 0.92 onward, the Replication feature asynchronously ships incremental data (write‑ahead log edits) from a primary cluster to a backup cluster, enabling disaster recovery. Alibaba improved sending efficiency on the source side and sink efficiency on the target side, and added hotspot assistance, online configuration, and multi‑link support.
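To make the asynchronous model concrete, here is a minimal in-memory sketch, not HBase's implementation: the primary acknowledges a write as soon as it applies it locally, and a separate shipping step later moves queued edits to the backup in batches. `Cluster`, `pending`, and `ship_batch` are hypothetical names, and batching is shown only as one common way to raise sender efficiency; the source does not say which techniques Alibaba used.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Cluster:
    name: str
    rows: dict = field(default_factory=dict)
    # Edits applied locally but not yet shipped to the peer (a stand-in for the WAL queue).
    pending: deque = field(default_factory=deque)

    def put(self, row: str, value: str) -> None:
        self.rows[row] = value             # the client is acknowledged after this local write
        self.pending.append((row, value))  # shipped to the backup later, asynchronously


def ship_batch(source: Cluster, sink: Cluster, max_batch: int = 100) -> int:
    """Move up to max_batch pending edits from source to sink, in order; return the count."""
    batch = []
    while source.pending and len(batch) < max_batch:
        batch.append(source.pending.popleft())
    for row, value in batch:               # the sink applies the whole batch in one call
        sink.rows[row] = value
    return len(batch)


primary, backup = Cluster("primary"), Cluster("backup")
for i in range(250):
    primary.put(f"row{i:03d}", f"v{i}")
print(len(backup.rows))                    # 0: the backup lags behind the primary
while ship_batch(primary, backup):         # keep shipping until the backlog is empty
    pass
print(backup.rows == primary.rows)         # True: the copies have converged
```

The gap between the first and second `print` is the replication lag that "eventual consistency" refers to: a failover during that window would lose the unshipped edits.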
Multi‑Link Data Flow
Multi‑link support lets a table replicate to one or more destination clusters. This enables flexible data routing, topology visualization, loop avoidance (an edit is never shipped back to a cluster that has already applied it), and link isolation, which prevents a single congested link from affecting the others.
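One common way to implement loop avoidance, similar in spirit to how HBase tags WAL edits with the IDs of clusters that have already consumed them, is to carry a set of visited cluster IDs with every shipped edit. The sketch below assumes a hypothetical `links` map from each cluster to its replication destinations; it illustrates loop avoidance only, not routing or link isolation.

```python
from collections import deque


def propagate(origin, links):
    """Return the order in which clusters receive an edit that originated at `origin`.

    `links` maps each cluster ID to the peers it replicates to. Each shipped edit
    carries the set of cluster IDs that have already applied it, and a cluster never
    forwards the edit to a peer in that set, so cyclic topologies terminate.
    """
    received = [origin]
    queue = deque([(origin, frozenset([origin]))])
    while queue:
        cluster, seen = queue.popleft()
        for peer in links.get(cluster, []):
            if peer in seen:
                continue  # loop avoidance: this peer has already applied the edit
            received.append(peer)
            queue.append((peer, seen | {peer}))
    return received


ring = {"A": ["B"], "B": ["C"], "C": ["A"]}
print(propagate("A", ring))  # ['A', 'B', 'C']: the edit never returns to A
```

Because the visited set travels along each path, a diamond-shaped topology (two routes to the same cluster) can still deliver an edit twice; this sketch does not deduplicate across parallel paths.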
Data Consistency
While most production systems use asynchronous replication (eventual consistency), Alibaba also provides strong‑consistency options: (1) a strong‑consistency switch that pauses writes on the primary until all outstanding data has been replicated, and (2) synchronous replication, in which a write succeeds only after both the primary and the backup have persisted the data.
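The three modes differ in when a write is acknowledged and when the backup is guaranteed to match the primary. The sketch below models both clusters as in-memory dicts; `DualClusterWriter` and its methods are hypothetical and ignore failures, which a real synchronous path must handle (for example, when the backup is unreachable).

```python
class DualClusterWriter:
    """Hypothetical writer contrasting the replication modes described above (in-memory only)."""

    def __init__(self):
        self.primary = {}
        self.backup = {}
        self.backlog = []           # edits on the primary that the backup has not applied yet
        self.writes_paused = False

    def _check_open(self):
        if self.writes_paused:
            raise RuntimeError("writes are paused by the strong-consistency switch")

    def write_async(self, row, value):
        # Eventual consistency: acknowledge after the primary write; replicate later.
        self._check_open()
        self.primary[row] = value
        self.backlog.append((row, value))

    def write_sync(self, row, value):
        # Synchronous replication: acknowledge only after both copies are written.
        self._check_open()
        self.replicate_pending()    # flush earlier async edits first so order is preserved
        self.primary[row] = value
        self.backup[row] = value

    def replicate_pending(self):
        # Stand-in for the asynchronous pipeline catching up on the backlog.
        for row, value in self.backlog:
            self.backup[row] = value
        self.backlog.clear()

    def strong_switch(self):
        # Strong-consistency switch: pause writes, drain the backlog, then confirm the copies match.
        self.writes_paused = True
        self.replicate_pending()
        return self.primary == self.backup
```

After `strong_switch()` returns `True`, the backup holds every acknowledged write, so it could be promoted without data loss; the price is that new writes are rejected while the switch is engaged.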
Redundancy and Cost
Redundant cross‑cluster copies improve availability but double storage costs. Alibaba explores reducing replica counts (e.g., from three to two) and leveraging cross‑cluster partition replication to maintain resilience while lowering expense.
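A quick capacity calculation shows why the replica count matters. This sketch assumes, as an interpretation not confirmed by the source, that "three to two" refers to the per-cluster storage (HDFS) replication factor and that each cluster holds a full copy of the data; `raw_storage_pb` is a hypothetical helper.

```python
def raw_storage_pb(logical_pb: float, clusters: int, replicas_per_cluster: int) -> float:
    """Raw capacity needed when each cluster keeps a full copy with N-way replication.

    Assumption: replicas_per_cluster is the per-cluster (e.g., HDFS) replication factor.
    """
    return logical_pb * clusters * replicas_per_cluster


layouts = {
    "single cluster, 3 replicas": (1, 3),
    "two clusters, 3 replicas each": (2, 3),
    "two clusters, 2 replicas each": (2, 2),
}
for label, (clusters, replicas) in layouts.items():
    print(f"{label}: {raw_storage_pb(1.0, clusters, replicas):.0f} PB raw per 1 PB of data")
```

Under these assumptions, dropping from three to two replicas in a two-cluster deployment cuts raw capacity from 6 PB to 4 PB per petabyte of data, while every row still exists in two sites.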
Cross‑Cluster Partition Replication
A job‑based system splits a table’s RowKey range into sub‑tasks dispatched by the master to region servers, enabling fast, fault‑tolerant, and resumable data migration between clusters.
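The job model can be sketched as splitting the key space into contiguous ranges and tracking which ranges have finished, so a rerun only retries what failed. `SubTask`, `split_key_range`, `run_job`, and the `copy_range` callback are hypothetical; a real job would also persist task state (for example, on the master) so it survives restarts. As in HBase, an empty key stands for the start or the end of the table.

```python
from dataclasses import dataclass


@dataclass
class SubTask:
    start: str          # inclusive start RowKey ("" = start of table)
    end: str            # exclusive end RowKey ("" = end of table)
    done: bool = False


def split_key_range(split_points):
    """Turn sorted split points into contiguous [start, end) sub-tasks covering the whole table."""
    bounds = [""] + list(split_points) + [""]
    return [SubTask(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]


def run_job(tasks, workers, copy_range):
    """Dispatch unfinished sub-tasks round-robin to workers; return True once all are done.

    A failed sub-task stays pending, so calling run_job again resumes only the remaining ranges.
    """
    pending = [t for t in tasks if not t.done]
    for i, task in enumerate(pending):
        worker = workers[i % len(workers)]
        try:
            copy_range(worker, task.start, task.end)
            task.done = True
        except OSError:
            pass  # leave the task pending for the next run
    return all(t.done for t in tasks)
```

Here `copy_range(worker, start, end)` stands for whatever copies one key range on one region server. Because progress is tracked per range, a failure costs only the affected range rather than the whole migration.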
Multi‑Cluster Active‑Active Service
Beyond traditional active‑standby, Alibaba implements client‑side dual‑cluster access, cross‑deployment, and load‑balancing to fully utilize both clusters and reduce latency spikes.
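Client-side dual-cluster access can be pictured as a client that spreads reads across both clusters and retries on the other one when a cluster fails. The sketch below uses round-robin as the load-balancing policy purely as an assumption (the source does not name Alibaba's policy) and covers reads only; `DualClusterClient` and the cluster callables are hypothetical.

```python
import itertools


class DualClusterClient:
    """Hypothetical client that spreads reads over clusters and fails over on error."""

    def __init__(self, clusters):
        if not clusters:
            raise ValueError("at least one cluster is required")
        self.clusters = list(clusters)  # callables: cluster(row) -> value
        self._next = itertools.cycle(range(len(self.clusters)))

    def get(self, row):
        first = next(self._next)        # round-robin choice for load balancing
        order = [first] + [i for i in range(len(self.clusters)) if i != first]
        last_error = None
        for i in order:
            try:
                return self.clusters[i](row)
            except ConnectionError as err:
                last_error = err        # this cluster failed; try the next one
        raise last_error
```

Routing every read to whichever cluster is healthy keeps both clusters busy in normal operation, and a single cluster outage turns into a retry rather than an error. Writes need more care, since concurrent writes to both clusters can conflict.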
Additional Performance Work
Key optimizations include an asynchronous API, a prefix BloomFilter for Scan operations, HLog compression that remains compatible with replication, and coprocessor‑based built‑in calculations (Count, Avg, Sum, etc.) that run on the region servers and return only aggregated results, which reduces I/O and improves throughput.
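A Bloom filter keyed on a fixed-length RowKey prefix lets a prefix Scan skip store files that cannot contain matching rows, because a Bloom filter never gives false negatives. The sketch below is illustrative only: the hashing scheme, sizes, and names (`PrefixBloomFilter`, `scan_prefix`) are assumptions, not HBase's on-disk format.

```python
import hashlib


class PrefixBloomFilter:
    """Bloom filter over fixed-length RowKey prefixes (illustrative, not HBase's format)."""

    def __init__(self, num_bits=1024, num_hashes=3, prefix_len=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.prefix_len = prefix_len
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, prefix: bytes):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + prefix).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add_row(self, row_key: bytes):
        for pos in self._positions(row_key[: self.prefix_len]):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain_prefix(self, prefix: bytes) -> bool:
        if len(prefix) < self.prefix_len:
            return True  # shorter than the indexed prefix: the filter cannot rule the file out
        key = prefix[: self.prefix_len]
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def scan_prefix(store_files, prefix: bytes):
    """Return (matching rows, number of store files skipped without reading them)."""
    matches, skipped = [], 0
    for bloom, rows in store_files:
        if not bloom.might_contain_prefix(prefix):
            skipped += 1  # no I/O needed for this file
            continue
        matches.extend(r for r in rows if r.startswith(prefix))
    return sorted(matches), skipped
```

Each store file is modeled as a `(bloom, rows)` pair. A false positive only costs an unnecessary read, while a "no" answer is always safe, which is what makes the skip correct.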
Future Development
Upcoming focus areas are GC pause reduction via custom memory maps, SQL‑style access with global secondary indexes, and containerized deployment (Docker) for agile operations.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
