
JDHBase Multi‑Active Architecture and Replication Practices

This article describes JDHBase’s large‑scale KV storage deployment, its HBase‑based asynchronous replication mechanism, the multi‑active architecture with active‑standby clusters, client interaction via Fox Manager, automatic failover strategies, dynamic replication tuning, and serial replication techniques to ensure data consistency across data centers.

JD Retail Technology

JDHBase serves as JD.com’s online KV store, handling billions of read/write requests daily across more than 7,000 nodes and 90 PB of storage, supporting over 700 business services such as orders, recommendations, finance, and logistics.

To guarantee uninterrupted operation, JDHBase implements a geographically distributed multi‑active system that replicates data between active and standby clusters.

HBase uses a Log‑Structured Merge‑Tree (LSM) architecture: each write is first appended to the Write‑Ahead Log (WAL) on HDFS and then applied to the in‑memory MemStore, ensuring durability even after node failures.
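The write path above can be sketched as a minimal simulation; the class and method names here are illustrative, not HBase's actual API:

```python
# Minimal sketch of an LSM-style write path, mirroring HBase's
# WAL-then-MemStore ordering (illustrative names, not HBase internals).
class RegionWriter:
    def __init__(self):
        self.wal = []        # append-only log, stands in for the WAL on HDFS
        self.memstore = {}   # in-memory buffer, stands in for the MemStore

    def put(self, row, value, seq_id):
        # Durability first: the edit hits the WAL before it becomes
        # visible in the MemStore, so a crash can always be replayed.
        self.wal.append((seq_id, row, value))
        self.memstore[row] = (seq_id, value)

    def recover(self):
        # After a node failure, replaying the WAL rebuilds the MemStore.
        rebuilt = {}
        for seq_id, row, value in self.wal:
            rebuilt[row] = (seq_id, value)
        return rebuilt

w = RegionWriter()
w.put("order#1", "paid", 1)
w.put("order#2", "shipped", 2)
assert w.recover() == w.memstore   # WAL replay reproduces in-memory state
```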

Replication in HBase is WAL‑based: each RegionServer runs a ReplicationSource thread that reads WAL entries, filters them per configuration, and sends them via RPC to the backup cluster where a ReplicationSink thread converts them into put/delete operations.
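A hedged sketch of that WAL-shipping loop, with the source filtering and batching entries and the sink replaying them as mutations (all names here are stand-ins, not HBase internals):

```python
# Illustrative ReplicationSource/ReplicationSink pair: the source tails WAL
# entries, filters them per replication scope, and ships batches; the sink
# converts shipped entries back into put/delete operations.
def replication_source(wal_entries, is_replicated_table, ship_batch, batch_size=3):
    batch = []
    for entry in wal_entries:              # entry: (table, row, op)
        if not is_replicated_table(entry[0]):
            continue                       # per-table replication scope filter
        batch.append(entry)
        if len(batch) >= batch_size:
            ship_batch(batch)              # an RPC to the peer cluster in real HBase
            batch = []
    if batch:
        ship_batch(batch)

def replication_sink(store):
    # Replays shipped WAL entries as mutations on the standby side.
    def apply(batch):
        for table, row, op in batch:
            if op == "delete":
                store.pop((table, row), None)
            else:
                store[(table, row)] = op
    return apply

standby = {}
wal = [("orders", "r1", "put"), ("tmp", "r2", "put"), ("orders", "r1", "delete")]
replication_source(wal, lambda t: t == "orders", replication_sink(standby))
assert ("tmp", "r2") not in standby       # filtered out by scope
assert ("orders", "r1") not in standby    # put then delete replayed in order
```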

Because replication is asynchronous, the standby cluster typically lags the active cluster by only a few seconds.

The JDHBase system consists of three main components: the Client, the JDHBase cluster, and the Fox Manager configuration center.

When a client starts, it reports user information to Fox Manager, which authenticates the user and returns connection details. The client then creates an HConnection to interact with the designated cluster.

Fox Manager provides a Policy Server (stateless service nodes with optional MySQL or Zookeeper persistence), a Service Center UI for administrators, and a VIP Load Balancer that offers a unified access address.

The JDHBase cluster delivers high‑throughput OLTP capabilities and supports active‑standby replication. The active cluster handles normal traffic while asynchronously replicating data to the standby cluster; the standby cluster takes over when failures occur, also replicating back to the active side.
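In stock HBase, this kind of active‑standby pairing is configured through the shell; the peer id and ZooKeeper quorum below are placeholders for illustration:

```shell
# Illustrative hbase shell commands to point the active cluster at a peer
# and enable replication for a table (values are placeholders):
hbase> add_peer '1', CLUSTER_KEY => "standby-zk1,standby-zk2,standby-zk3:2181:/hbase"
hbase> enable_table_replication 'orders'
# A symmetric peer on the standby side, pointing back at the active
# cluster, allows replication in the reverse direction after a failover.
```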

Client data routing follows three steps: (1) obtain the Zookeeper address from Fox Manager, (2) query the META table to locate region information, and (3) interact with the target region server. JDHBase adds an extra step of contacting Fox Manager for authentication, cluster discovery, and client parameter retrieval.
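The routing flow above can be sketched as follows; every structure and function here is a stand‑in for illustration, not a real Fox Manager or HBase client API:

```python
# Sketch of the JDHBase client routing flow: authenticate with Fox Manager,
# resolve the region for a row via META, then target that region server.
def route_request(fox_manager, meta_table, row):
    # Extra JDHBase step: authentication, cluster discovery, client params.
    cluster = fox_manager["cluster_for_user"]
    zk_address = cluster["zk"]                       # step 1: ZK address
    # Steps 2-3: scan META-like entries [(start_key, end_key), server] to
    # find the region whose key range covers the row.
    for (start, end), region_server in meta_table:
        if start <= row and (end == "" or row < end):
            return zk_address, region_server         # client RPCs this server
    raise KeyError("no region for row")

fox = {"cluster_for_user": {"zk": "zk-active:2181"}}
meta = [(("", "m"), "rs1:16020"), (("m", ""), "rs2:16020")]
assert route_request(fox, meta, "order#42") == ("zk-active:2181", "rs2:16020")
```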

Automatic failover is achieved through a policy‑driven mechanism: a status‑checking plugin on the HMaster reports metrics to the Policy Server, whose Rule Engine (kept highly available via Raft) evaluates them and triggers a cluster switch within seconds.
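A minimal sketch of such a rule engine; the metric names and thresholds below are invented for illustration:

```python
# Illustrative policy-driven failover check: an HMaster-side plugin reports
# metrics; the rule engine compares them against thresholds and decides
# whether to request a cluster switch.
RULES = [
    ("rpc_p99_ms",         lambda v: v > 500),    # latency blown
    ("dead_regionservers", lambda v: v >= 3),     # too many RS down
    ("write_error_rate",   lambda v: v > 0.05),   # writes failing
]

def should_failover(metrics):
    # Any triggered rule requests a switch; a production engine would also
    # debounce and replicate the decision (e.g. via Raft) before acting.
    return any(check(metrics.get(name, 0)) for name, check in RULES)

assert should_failover({"rpc_p99_ms": 800}) is True
assert should_failover({"rpc_p99_ms": 120, "dead_regionservers": 1}) is False
```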

Dynamic replication parameters allow on‑the‑fly tuning of RegionServer settings to alleviate write‑heavy backlog without restarting nodes. Additionally, the system can automatically adjust replication speed based on observed backlog thresholds.
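The backlog‑driven adjustment can be sketched as a simple feedback rule; the thresholds and bounds here are illustrative, not JDHBase's actual values:

```python
# Sketch of backlog-driven replication throttling: the shipping batch size
# grows when the standby lags badly and shrinks when the backlog is small,
# analogous to tuning RegionServer replication parameters on the fly.
def tune_batch_size(current, backlog_entries, low=10_000, high=1_000_000,
                    min_size=64, max_size=4096):
    if backlog_entries > high:
        return min(current * 2, max_size)   # drain the backlog faster
    if backlog_entries < low:
        return max(current // 2, min_size)  # ease off, reduce peer load
    return current                          # backlog in the healthy band

assert tune_batch_size(512, backlog_entries=5_000_000) == 1024
assert tune_batch_size(512, backlog_entries=1_000) == 256
assert tune_batch_size(512, backlog_entries=50_000) == 512
```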

To guarantee ordering consistency, JDHBase implements serial replication using barriers and lastPushedSequenceId. When a region moves between servers, the new server waits until the previous server has pushed all prior WAL entries, ensuring that the standby cluster receives mutations in the same order as the primary.
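The barrier check can be sketched in a few lines; this is a simplified model of the ordering rule, not HBase's actual implementation:

```python
# Minimal model of serial replication ordering: after a region moves, the
# new server may ship entries at or beyond the barrier sequence id only
# once lastPushedSequenceId shows all pre-barrier entries were replicated.
def can_push(entry_seq_id, barrier_seq_id, last_pushed_seq_id):
    if entry_seq_id < barrier_seq_id:
        return True    # pre-move range, owned by the old server, push freely
    # Post-move entry: hold back until everything before the barrier is out.
    return last_pushed_seq_id >= barrier_seq_id - 1

assert can_push(entry_seq_id=7, barrier_seq_id=10, last_pushed_seq_id=5)
assert not can_push(entry_seq_id=12, barrier_seq_id=10, last_pushed_seq_id=5)
assert can_push(entry_seq_id=12, barrier_seq_id=10, last_pushed_seq_id=9)
```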

In summary, JDHBase has evolved its disaster‑recovery capabilities to achieve a 99.98% SLA, incorporating monitoring, alerting, automatic switching, and consistency guarantees. Future work will focus on synchronous replication, reducing Zookeeper dependence, client‑side automatic switching, and minimizing data redundancy.


Tags: High Availability, HBase, Replication, multi-active, distributed-storage, Cluster Management
Written by JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
