JDHBase Multi‑Active Architecture and Replication Mechanisms

This article describes JDHBase’s large‑scale KV storage, its HBase‑based replication principle, the multi‑active cluster architecture with Fox Manager, client routing, automatic failover, dynamic replication tuning, serial replication guarantees, and future directions for improving cross‑region disaster recovery.

Big Data Technology & Architecture

Jun 22, 2020

JDHBase is JD.com’s online KV store. It serves more than 700 business services on more than 7,000 nodes with 90 PB of storage, and handles read/write requests on the order of trillions during major sales events.

The system relies on HBase’s LSM‑based architecture: each write is appended to the write‑ahead log (WAL) and applied to the in‑memory MemStore. Replication is asynchronous: ReplicationSource threads on the source cluster read WAL entries and send them via RPC to ReplicationSink threads on the standby cluster, which replay them locally.
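The WAL‑based flow above can be illustrated with a toy model. This is a minimal in‑process sketch, not HBase code: the `Cluster`, `WalEntry`, and `ReplicationSource` classes and the `batch_size` parameter are hypothetical names, and a direct method call stands in for the RPC to the sink.

```python
from dataclasses import dataclass, field


@dataclass
class WalEntry:
    seq_id: int
    row: str
    value: str


@dataclass
class Cluster:
    name: str
    wal: list = field(default_factory=list)       # append-only log of edits
    memstore: dict = field(default_factory=dict)  # in-memory view of the data

    def put(self, row, value):
        entry = WalEntry(len(self.wal) + 1, row, value)
        self.wal.append(entry)       # log the edit
        self.memstore[row] = value   # then update the in-memory store
        return entry

    def apply_batch(self, entries):
        """Sink role: replay shipped entries against local data."""
        for e in entries:
            self.memstore[e.row] = e.value


class ReplicationSource:
    """Tails the source WAL and ships not-yet-shipped entries in batches."""

    def __init__(self, source, sink, batch_size=2):
        self.source, self.sink, self.batch_size = source, sink, batch_size
        self.position = 0  # index of the next WAL entry to ship

    def ship_once(self):
        batch = self.source.wal[self.position:self.position + self.batch_size]
        if batch:
            self.sink.apply_batch(batch)  # stands in for the RPC call
            self.position += len(batch)
        return len(batch)


active, standby = Cluster("active"), Cluster("standby")
src = ReplicationSource(active, standby)
for i in range(3):
    active.put(f"row{i}", f"v{i}")

lag_before = len(active.wal) - src.position  # writes are visible on active only
while src.ship_once():                       # drain the backlog
    pass
```

Because shipping is decoupled from the write path, the standby lags until the source drains its backlog, which is why the design gives eventual rather than strong consistency.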

To achieve multi‑active availability across data centers, JDHBase introduces a three‑component client interaction model: Client, JDHBase Cluster, and Fox Manager. The client authenticates with Fox Manager, receives cluster connection info, and establishes an HConnection for data operations.
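The bootstrap sequence can be sketched as follows. The article does not show Fox Manager's actual interface, so `FoxManagerStub`, `ClusterInfo`, `connect`, and the token scheme are assumptions made for illustration; only the three configuration keys are standard HBase client properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterInfo:
    cluster_id: str
    zk_quorum: str        # ZooKeeper address the client should use
    rpc_timeout_ms: int
    max_retries: int


class FoxManagerStub:
    """In-memory stand-in for Fox Manager (hypothetical interface)."""

    def __init__(self, tokens, clusters, routing):
        self._tokens = tokens      # user -> secret token
        self._clusters = clusters  # cluster_id -> ClusterInfo
        self._routing = routing    # user -> cluster_id currently serving the user

    def authenticate(self, user, token):
        if self._tokens.get(user) != token:
            raise PermissionError(f"authentication failed for {user!r}")
        return self._clusters[self._routing[user]]

    def switch(self, user, cluster_id):
        if cluster_id not in self._clusters:
            raise KeyError(cluster_id)
        self._routing[user] = cluster_id


def connect(fox, user, token):
    """Return the settings a client would use to open its HBase connection."""
    info = fox.authenticate(user, token)
    return {
        "hbase.zookeeper.quorum": info.zk_quorum,
        "hbase.rpc.timeout": info.rpc_timeout_ms,
        "hbase.client.retries.number": info.max_retries,
    }


fox = FoxManagerStub(
    tokens={"orders": "s3cret"},
    clusters={
        "dc1": ClusterInfo("dc1", "zk-dc1:2181", 2000, 3),
        "dc2": ClusterInfo("dc2", "zk-dc2:2181", 2000, 3),
    },
    routing={"orders": "dc1"},
)
before = connect(fox, "orders", "s3cret")
fox.switch("orders", "dc2")              # operator or rule engine moves the user
after = connect(fox, "orders", "s3cret")
```

The point of the indirection is that the client never hard-codes a cluster address: changing the routing entry on the manager side is enough to send the next connection to the other data center.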

Fox Manager’s configuration center (Policy Server) maintains user and cluster metadata, offering optional rule‑engine plugins for dynamic configuration based on cluster state and business requirements.

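The JDHBase cluster consists of an Active Cluster, which handles normal traffic, and a Standby Cluster, which takes over during failures. Data is replicated asynchronously between them, so the two clusters are eventually consistent. Cluster IDs attached to replicated data keep an edit from being shipped back to a cluster that has already applied it, which prevents replication loops.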
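Loop prevention with cluster IDs can be illustrated with a small model. This is a sketch under simplifying assumptions: the `Edit` and `Cluster` classes are hypothetical, and shipping is done synchronously here although real replication is asynchronous.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    row: str
    value: str
    cluster_ids: frozenset = frozenset()  # clusters that have already applied this edit


class Cluster:
    def __init__(self, cluster_id):
        self.cluster_id = cluster_id
        self.data = {}
        self.peers = []
        self.shipped = 0  # number of edits this cluster sent to peers

    def client_put(self, row, value):
        self._apply(Edit(row, value))

    def _apply(self, edit):
        self.data[edit.row] = edit.value
        # Stamp the edit with our own ID before passing it on.
        marked = Edit(edit.row, edit.value, edit.cluster_ids | {self.cluster_id})
        for peer in self.peers:
            if peer.cluster_id in marked.cluster_ids:
                continue  # peer already applied it; shipping again would loop
            self.shipped += 1
            peer._apply(marked)


a, b = Cluster("A"), Cluster("B")
a.peers, b.peers = [b], [a]   # replication configured in both directions
a.client_put("r1", "v1")
```

Without the ID check, B would send the edit back to A, A back to B, and so on indefinitely; with it, the edit crosses the link exactly once.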

Cluster switching is transparent to the client: after obtaining the appropriate ZooKeeper address from Fox Manager, the client follows the standard HBase routing steps (discover META, locate region, interact with the region server). Fox Manager also supplies retry and timeout parameters, and metrics are collected to trigger automatic client‑side failover when availability drops.
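One way an availability metric can trigger a switch is a sliding window over recent request outcomes. The sketch below is illustrative only: the window size, the threshold, and the `AvailabilityWindow` and `FailoverClient` classes are assumptions, and the article does not specify whether this decision runs in the client or in Fox Manager.

```python
from collections import deque


class AvailabilityWindow:
    """Tracks the success ratio of the most recent `size` requests."""

    def __init__(self, size=10, threshold=0.8):
        self.results = deque(maxlen=size)
        self.threshold = threshold

    def record(self, ok):
        self.results.append(bool(ok))

    def availability(self):
        return sum(self.results) / len(self.results) if self.results else 1.0

    def should_fail_over(self):
        # Decide only on a full window, so one early error does not cause a switch.
        full = len(self.results) == self.results.maxlen
        return full and self.availability() < self.threshold


class FailoverClient:
    def __init__(self, active_zk, standby_zk, window):
        self.current, self.other = active_zk, standby_zk
        self.window = window

    def call(self, op):
        try:
            result = op(self.current)
            self.window.record(True)
            return result
        except ConnectionError:
            self.window.record(False)
            if self.window.should_fail_over():
                self.current, self.other = self.other, self.current
                self.window.results.clear()  # start measuring the new cluster afresh
            return None


def flaky(zk):
    """Simulated request: the active cluster is down, the standby is healthy."""
    if zk == "zk-active:2181":
        raise ConnectionError("region server unreachable")
    return f"ok via {zk}"


client = FailoverClient("zk-active:2181", "zk-standby:2181",
                        AvailabilityWindow(size=5, threshold=0.6))
results = [client.call(flaky) for _ in range(6)]
```

Requiring a full window before switching trades a slightly slower reaction for fewer spurious switches caused by isolated errors.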

Automatic failover is driven by the Policy Server, which stores switching strategies in MySQL, together with a Raft‑based Rule Engine that evaluates health metrics reported by HMaster. This allows recovery within seconds.
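Rule evaluation over reported health metrics might look like the following sketch. The metric fields, the two rules, and their thresholds are invented for illustration; the article does not list the real rules, and the Raft and MySQL layers are omitted.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HealthReport:
    cluster_id: str
    live_region_servers: int
    dead_region_servers: int
    seconds_since_report: float


@dataclass
class Rule:
    name: str
    violated: Callable[[HealthReport], bool]


def dead_ratio(r):
    total = r.live_region_servers + r.dead_region_servers
    return r.dead_region_servers / total if total else 1.0


# Hypothetical rules and thresholds.
RULES = [
    Rule("stale-heartbeat", lambda r: r.seconds_since_report > 10),
    Rule("too-many-dead-servers", lambda r: dead_ratio(r) > 0.3),
]


def evaluate(report, rules=RULES):
    """Return the names of all rules the report violates."""
    return [rule.name for rule in rules if rule.violated(report)]


def choose_cluster(active_report, standby_id, rules=RULES):
    """Route to the standby as soon as any rule fires for the active cluster."""
    return standby_id if evaluate(active_report, rules) else active_report.cluster_id


healthy = HealthReport("dc1", live_region_servers=10, dead_region_servers=0,
                       seconds_since_report=1.0)
degraded = HealthReport("dc1", live_region_servers=6, dead_region_servers=4,
                        seconds_since_report=1.0)
```

Keeping rules as data rather than hard-coded branches is what makes them pluggable: a new strategy is one more `Rule` entry.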

Dynamic replication parameters allow on‑the‑fly tuning of ReplicationSource threads to alleviate write‑hotspot backlogs without restarting RegionServers; the system can also automatically adjust these parameters based on observed queue lengths.
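Automatic adjustment from queue length can be sketched as a simple feedback rule. The function name `tune_threads`, the watermarks, the doubling and halving policy, and the throughput figures in the simulation are all assumptions chosen for illustration; the article does not give JDHBase's actual policy.

```python
def tune_threads(queue_len, current, *, min_threads=1, max_threads=8,
                 high=100, low=10):
    """Return the new thread count for a replication source.

    Doubles the count while the backlog is above `high`, halves it once the
    backlog falls below `low`, and otherwise leaves it unchanged.
    """
    if queue_len > high:
        return min(max_threads, current * 2)
    if queue_len < low:
        return max(min_threads, current // 2)
    return current


# Toy simulation: 60 new entries arrive per tick, each thread ships 50 per tick.
backlog, threads, history = 400, 1, []
for _ in range(6):
    threads = tune_threads(backlog, threads)
    backlog = max(0, backlog + 60 - 50 * threads)
    history.append((threads, backlog))
```

Using two separate watermarks rather than one avoids flapping between thread counts when the backlog hovers around a single threshold.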

To guarantee data order across regions, JDHBase implements Serial Replication (back‑ported from HBase v2.1) using Barriers and lastPushedSequenceId recorded in ZooKeeper, ensuring that a RegionServer only pushes data after the previous server has completed its replication for the same region.
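The push check can be sketched as follows, assuming a barrier is the sequence ID recorded each time the region is opened on a server and `last_pushed_seq_id` is the highest sequence ID already replicated for that region. The function `can_push` is a simplified illustration; it ignores region splits and merges, which the real mechanism also has to handle.

```python
import bisect


def can_push(seq_id, barriers, last_pushed_seq_id):
    """Decide whether an entry with `seq_id` may be replicated now.

    `barriers` is a sorted list of sequence IDs, one per region open.
    The entry belongs to the range that starts at the last barrier <= seq_id,
    and may be pushed only after every entry before that barrier was pushed.
    """
    idx = bisect.bisect_right(barriers, seq_id) - 1
    if idx <= 0:
        return True  # first range: no earlier server to wait for
    return last_pushed_seq_id >= barriers[idx] - 1


# The region opened on server 1 at seq 1, moved to server 2 at seq 100,
# and moved to server 3 at seq 250.
barriers = [1, 100, 250]

blocked = can_push(120, barriers, last_pushed_seq_id=80)   # server 1 not done yet
allowed = can_push(120, barriers, last_pushed_seq_id=99)   # server 1 finished
first_range = can_push(50, barriers, last_pushed_seq_id=0)
```

Within a single range, ordering follows from one server reading its own WAL in order, so the check only needs to guard the hand‑over between servers.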

In summary, JDHBase has evolved from having no disaster‑recovery measures to achieving a 99.98% SLA through multi‑active clusters, monitoring, alerting, automatic switching, and consistency mechanisms. Future work will focus on synchronous replication, reducing the dependence on ZooKeeper, client‑side auto‑switching, and minimizing data redundancy.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
