
JDHBase Multi‑Active Architecture and Asynchronous Replication Practices

This article describes JDHBase’s large‑scale KV storage architecture, its HBase‑based asynchronous replication mechanism, multi‑active cluster design, client‑side routing via Fox Manager, automatic failover strategies, dynamic replication tuning, and serial replication techniques to ensure data consistency across geographically distributed data centers.

Big Data Technology Architecture

JDHBase serves as JD.com’s online KV store, handling billions of read/write requests daily across more than 7,000 nodes and 90 PB of data, supporting over 700 business scenarios such as orders, recommendations, and logistics.

JDHBase relies on HBase’s LSM‑based storage; writes are first appended to the WAL for durability and then applied to the MemStore. Replication is WAL‑based: ReplicationSource threads on the primary cluster read and ship WAL entries, and ReplicationSink threads apply them on the standby cluster, typically achieving second‑level latency.
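The source/sink flow can be sketched as follows. This is an illustrative model only, not HBase's actual (Java) implementation; the class and method names are invented for the sketch. The key property it shows is that the source advances through the WAL only after the sink acknowledges a shipped batch, which is what bounds replication lag to the unshipped backlog.

```python
from collections import deque

class ReplicationSink:
    """Stand-in for the standby cluster: applies shipped WAL edits."""
    def __init__(self):
        self.store = {}

    def apply(self, batch):
        for key, value in batch:
            self.store[key] = value
        return True  # acknowledgement sent back to the source

class ReplicationSource:
    """Stand-in for the primary-side shipper: reads pending WAL entries
    and pushes them to the sink in batches, oldest first."""
    def __init__(self, wal_entries, sink, batch_size=3):
        self.wal = deque(wal_entries)
        self.sink = sink
        self.batch_size = batch_size

    def ship(self):
        """Ship one batch; return how many entries were acknowledged."""
        n = min(self.batch_size, len(self.wal))
        batch = [self.wal.popleft() for _ in range(n)]
        if batch and self.sink.apply(batch):
            return len(batch)
        return 0

sink = ReplicationSink()
source = ReplicationSource(
    [("k1", "v1"), ("k2", "v2"), ("k3", "v3"), ("k4", "v4")], sink)
while source.ship():
    pass  # drain the WAL backlog batch by batch
```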

The multi‑active architecture consists of three components: the Client, the JDHBase cluster, and the Fox Manager configuration center. The Client authenticates with Fox Manager, receives cluster connection info, and establishes an HConnection to interact with the appropriate cluster.

Fox Manager provides a Policy Server (stateless service with optional MySQL/Zookeeper persistence), a Service Center UI, and a VIP Load Balancer. The JDHBase cluster is split into an Active Cluster (normal operation) and a Standby Cluster (failover), with asynchronous replication ensuring eventual consistency.

Client‑side routing adds a step to query Fox Manager for user authentication, cluster information, and client parameters, allowing transparent switching between active and standby clusters without client code changes.
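A minimal sketch of that lookup step, under the assumption (not stated in the article) that Fox Manager hands back a per-tenant policy mapping to active and standby endpoints; all names here are hypothetical. The point is that the switch is a server-side policy flip, so the client code path is identical either way.

```python
class FoxManagerClient:
    """Hypothetical routing lookup: a tenant token resolves to the
    cluster endpoint the policy currently designates as serving."""
    def __init__(self, policies):
        # policies: {token: {"active": ..., "standby": ..., "use_active": bool}}
        self.policies = policies

    def resolve(self, token):
        policy = self.policies[token]
        return policy["active"] if policy["use_active"] else policy["standby"]

fox = FoxManagerClient({
    "order-svc": {"active": "zk-dc1:2181",
                  "standby": "zk-dc2:2181",
                  "use_active": True},
})
endpoint = fox.resolve("order-svc")  # client then opens its HConnection here
```

After a failover, Fox Manager flips `use_active` for the tenant and the same `resolve` call transparently returns the standby endpoint.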

To reduce manual failover latency, JDHBase implements policy‑driven automatic failover: HMaster reports health metrics to Fox Manager’s Policy Server, whose Rule Engine (Raft‑based for high availability) evaluates strategies and triggers cluster switches within seconds.
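The rule-evaluation core of such a policy engine might look like the following sketch. The metric names and thresholds are invented for illustration; the article does not specify the rule format. Requiring several consecutive breaching evaluations before triggering a switch is a common guard against flapping on a single noisy sample.

```python
class RuleEngine:
    """Hypothetical failover rules: trigger a cluster switch only after
    `patience` consecutive health reports breach some threshold."""
    def __init__(self, rules, patience=3):
        self.rules = rules          # [{"metric": name, "threshold": limit}]
        self.patience = patience
        self.breaches = 0

    def evaluate(self, metrics):
        """Feed one health report; return True if failover should fire."""
        breached = any(metrics.get(r["metric"], 0) > r["threshold"]
                       for r in self.rules)
        self.breaches = self.breaches + 1 if breached else 0
        return self.breaches >= self.patience

engine = RuleEngine([
    {"metric": "rpc_p99_ms", "threshold": 500},
    {"metric": "dead_regionservers", "threshold": 2},
])
# Three consecutive unhealthy reports trip the switch on the third.
results = [engine.evaluate({"rpc_p99_ms": 900}) for _ in range(3)]
```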

Dynamic replication parameters can be adjusted on the fly without restarting RegionServers; an automatic tuning module on the ReplicationSource side monitors the backlog and scales parameters between 1× and 2× of the baseline. Time‑based throttling is also applied to respect inter‑data‑center bandwidth limits.
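One plausible shape for that tuning policy is sketched below. The linear interpolation between 1× and 2×, the backlog thresholds, and the peak-hour window are all assumptions made for illustration; the article only states the 1×–2× range and the time-based throttle.

```python
def tuned_batch_size(baseline, backlog, low=1_000, high=10_000):
    """Scale a replication parameter linearly from 1x of baseline
    (backlog <= low) up to 2x (backlog >= high)."""
    if backlog <= low:
        return baseline
    if backlog >= high:
        return 2 * baseline
    return int(baseline * (1 + (backlog - low) / (high - low)))

def bandwidth_cap_mbps(hour, peak_cap=50, offpeak_cap=200, peak=(9, 21)):
    """Time-based throttle: a tighter inter-DC bandwidth cap during
    business hours, relaxed overnight when links are idle."""
    return peak_cap if peak[0] <= hour < peak[1] else offpeak_cap
```

For example, a baseline batch size of 64 stays at 64 while the backlog is small, reaches 96 at the midpoint of the interpolation range, and caps at 128 when the backlog is large.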

Serial (ordered) replication is introduced to avoid consistency anomalies caused by out‑of‑order WAL delivery during region moves. Barriers and lastPushedSequenceId stored in ZooKeeper ensure that a RegionServer only pushes data after the previous server has completed its replication for the preceding sequence range.
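The ordering check can be sketched as follows, assuming (as a simplification of the actual scheme) that each region move records a barrier sequence id, and an entry may be shipped only once everything before its range's opening barrier has been replicated. The function name and exact off-by-one convention are invented for this sketch.

```python
import bisect

def can_push(seq_id, barriers, last_pushed):
    """Decide whether a WAL entry with sequence id `seq_id` may be
    shipped. `barriers` is the sorted list of region-move barrier ids;
    `last_pushed` is the lastPushedSequenceId recorded (in ZooKeeper,
    per the article) for this region.

    The entry belongs to the range opened by the greatest barrier
    <= seq_id; it may be pushed only when every earlier entry
    (all ids below that barrier) has already been replicated."""
    i = bisect.bisect_right(barriers, seq_id) - 1
    if i < 0:
        return True  # before the first barrier: no predecessor to wait for
    return last_pushed >= barriers[i] - 1
```

In this model, a RegionServer that received a region at barrier 100 must hold entry 150 until the previous owner has pushed everything up through id 99, which is exactly the anomaly-prevention guarantee described above.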

In summary, JDHBase has evolved from having no disaster‑recovery measures to a mature system with a 99.98% availability SLA, featuring monitoring, alerting, automatic failover, and consistency guarantees. Future work will focus on synchronous replication, removing ZooKeeper dependencies, client‑side automatic switching, and reducing data redundancy.

Tags: multi-active, consistency, distributed-storage, failover, dynamic tuning, HBase replication, JDHBase
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies
