
How JD’s JIMDB Achieves Zero‑Downtime Scaling and Automatic Failover for Massive Caches

JIMDB is JD’s in‑house distributed cache platform that combines automatic fault detection, seamless online scaling, multi‑language support, and containerized deployment to replace traditional Memcached/Redis solutions, offering features such as one‑click cluster creation, elastic expansion, lossless scaling, and comprehensive monitoring for high‑traffic e‑commerce services.

dbaplus Community

Jan 15, 2017


Background of Caching in Large‑Scale Web Applications

Caching is ubiquitous in modern internet services, from database layers to front‑end pages, and even hardware components such as CPU caches. In e‑commerce, a single page may require dozens of dynamic services to personalize content, demanding millisecond‑level response times under thousands of requests per second.

Limitations of Early Memcached and Redis Deployments

Early on, teams ran a small number of Memcached instances with static hash‑based sharding and manual failover, so scaling meant taking the service down to redistribute data. When Redis became popular, teams migrated to it but kept the same static configuration pattern, which led to frequent outages during hardware failures and capacity expansions.

Why a New Cache Platform Was Needed

Frequent hardware failures, growing cluster size, and increasing client count highlighted the need for a platform that could automatically recover from faults, perform online scaling, and balance load without client‑side configuration changes.

JIMDB Overview

JIMDB is JD’s internally built distributed cache system. Its key characteristics include:

One‑click cluster creation.

Fully automatic elastic scaling.

Partial‑copy expansion to shorten scaling time.

Online seamless upgrades.

Automatic fault recovery.

Multi‑language SDKs and Redis‑compatible access points.

Containerized deployment with Docker image management.

Incremental data replication.

Architecture

Users create a cluster through a web console. An embedded SDK or a Redis‑compatible proxy then fetches topology information from the Config Server. For each request, the client applies the routing algorithm to the key to select the target node. After scaling or failover, the Config Server updates the topology and notifies clients.
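The client side of this flow can be sketched as a small topology cache. This is an illustration only: the `TopologyCache` class, its method names, and the monotonically increasing `version` number are assumptions made for the example and are not described in the article.

```python
class TopologyCache:
    """Hypothetical client-side view of the cluster topology."""

    def __init__(self):
        self.version = -1        # assumed: the Config Server numbers its updates
        self.slot_to_node = {}   # slot number -> node address

    def apply(self, version, slot_to_node):
        """Install a pushed topology only if it is newer than the cached one."""
        if version <= self.version:
            return False  # stale or duplicate notification, ignore it
        self.version = version
        self.slot_to_node = dict(slot_to_node)
        return True

    def node_for(self, slot):
        """Return the node that currently owns the given slot."""
        return self.slot_to_node[slot]
```

Ignoring stale versions is one simple way to keep clients consistent if notifications arrive out of order; how JIMDB actually orders its updates is not stated in the source.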

Core modules:

Server: KV service with primary‑replica model, asynchronous replication, and slot‑based partial data migration.

Config Server: Maintains cluster topology and pushes updates to clients.

Sentinel: Monitors instance health across racks to avoid single‑point failures.

Failover: Handles role switching and instance replacement.

Scaler: Splits shards based on memory or traffic thresholds.

Info Collector: Gathers monitoring data.

Resource Manager: Manages physical resources and container creation.
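As a rough illustration of the Scaler's role, the check below decides whether a shard should be split. The `ShardStats` fields and both threshold values are hypothetical; the article says only that splitting is driven by memory or traffic thresholds.

```python
from dataclasses import dataclass


@dataclass
class ShardStats:
    used_memory_bytes: int
    max_memory_bytes: int
    ops_per_sec: float


# Hypothetical limits chosen for the example, not taken from JIMDB.
MEMORY_RATIO_LIMIT = 0.8
OPS_LIMIT = 50_000.0


def needs_split(stats: ShardStats) -> bool:
    """Return True when either the memory or the traffic threshold is crossed."""
    if stats.max_memory_bytes <= 0:
        raise ValueError("max_memory_bytes must be positive")
    memory_ratio = stats.used_memory_bytes / stats.max_memory_bytes
    return memory_ratio >= MEMORY_RATIO_LIMIT or stats.ops_per_sec >= OPS_LIMIT
```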

First Release Goals

JIMDB was launched in early 2014 to solve three core problems:

Accurate fault detection and automatic failover.

Lossless scaling.

Integrated monitoring and alerting.

Clients use consistent hashing with a large slot space; each slot maps to a shard composed of one primary and one or more replicas. When a primary fails, a replica is promoted without client interruption.
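A minimal sketch of this slot‑to‑shard mapping follows. The slot count of 16384, the CRC32 hash, and the `Shard` class are assumptions for illustration; the article does not specify the slot‑space size or the hash function JIMDB uses.

```python
import zlib
from dataclasses import dataclass, field

SLOT_COUNT = 16384  # assumed; the article only says the slot space is "large"


@dataclass
class Shard:
    primary: str
    replicas: list = field(default_factory=list)
    slot_start: int = 0  # inclusive
    slot_end: int = 0    # inclusive

    def promote_replica(self) -> str:
        """Replace a failed primary with the first available replica."""
        if not self.replicas:
            raise RuntimeError("no replica available to promote")
        self.primary = self.replicas.pop(0)
        return self.primary


def slot_for(key: str) -> int:
    """Hash a key into the fixed slot space."""
    return zlib.crc32(key.encode("utf-8")) % SLOT_COUNT


def shard_for(key: str, shards: list) -> Shard:
    """Find the shard whose slot range contains the key's slot."""
    slot = slot_for(key)
    for shard in shards:
        if shard.slot_start <= slot <= shard.slot_end:
            return shard
    raise LookupError(f"slot {slot} is not assigned to any shard")
```

Because keys map to slots rather than directly to nodes, promoting a replica changes only the shard's primary address; the key‑to‑slot mapping stays the same.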

Fault Detection Mechanism

Instead of relying on ZooKeeper, JIMDB runs multiple probe agents in different racks. An instance is declared dead only when more than half of the probes report failure; short of that majority, it is still treated as alive. Once an instance is declared dead, the failover component updates the topology and notifies all clients.
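The majority rule can be expressed in a few lines. The function name and the shape of `probe_reports` are hypothetical; only the "more than half of the probes" condition comes from the text above.

```python
def is_dead(probe_reports: dict) -> bool:
    """Decide whether an instance should be declared dead.

    probe_reports maps a probe id to True if that probe saw the
    instance as healthy, and to False if it saw a failure.
    """
    if not probe_reports:
        return False  # no evidence at all: do not trigger failover
    failures = sum(1 for healthy in probe_reports.values() if not healthy)
    # Strictly more than half of the probes must report failure.
    return failures * 2 > len(probe_reports)
```

With three probes in three racks, a single probe that loses connectivity cannot trigger failover on its own; two failing reports are required.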

Lossless Scaling Strategy

During expansion, a shard’s slot range is split and part of the original shard’s data is migrated to a new instance. Because the server initially lacked slot awareness, JIMDB introduced a proxy layer: while data is copied, requests for the expanding shard are temporarily routed to the proxy, which forwards them to the appropriate backend. After migration, the new topology is pushed to clients.

Later, the server itself was enhanced with slot metadata, allowing the server to determine whether a key belongs to its own slots. This enabled slot‑level migration without full data scans, reduced network traffic, and eliminated the need for a proxy during scaling.
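The sketch below illustrates slot‑range splitting and slot‑level key selection. Splitting at the midpoint and keeping a per‑slot key index (`keys_by_slot`) are assumptions made for the example; the article does not say where JIMDB cuts a range or how the server stores its slot metadata.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotRange:
    start: int  # inclusive
    end: int    # inclusive

    def owns(self, slot: int) -> bool:
        """Slot-aware ownership check, as a server with slot metadata could do."""
        return self.start <= slot <= self.end


def split_range(r: SlotRange):
    """Split a range in two; the upper half moves to the new instance."""
    if r.start >= r.end:
        raise ValueError("a single-slot range cannot be split")
    mid = (r.start + r.end) // 2
    return SlotRange(r.start, mid), SlotRange(mid + 1, r.end)


def keys_to_migrate(keys_by_slot: dict, moving: SlotRange) -> list:
    """Collect keys slot by slot, without scanning keys in other slots."""
    out = []
    for slot in range(moving.start, moving.end + 1):
        out.extend(keys_by_slot.get(slot, []))
    return out
```

Only keys in the moving slots are touched, which is the property that lets slot‑level migration avoid a full data scan.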

Second‑Generation Enhancements

As the cluster grew, additional capabilities were added:

Automatic Elastic Scheduling: Monitors OPS, memory, and network metrics; triggers scaling when thresholds are crossed; and coordinates scaling operations to avoid resource contention on a single machine.

Docker‑Based Server Upgrades: Uses Docker images for versioned deployment, enabling zero‑downtime upgrades and fast instance replacement.

Resource Isolation: Divides clusters into partitions based on workload characteristics, limiting the number of instances per physical machine.

Large‑Key Handling: Detects and isolates keys with massive values or collections to prevent CPU and latency spikes.

Read Strategies: Supports master‑first, round‑robin, or random read policies to distribute load for hot keys.
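The three read policies named in the list above could be modeled roughly as follows. The `ReadRouter` class is hypothetical, and "master‑first" is interpreted here as always reading from the primary; any fallback to replicas when the primary is unavailable is omitted from this sketch.

```python
import itertools
import random


class ReadRouter:
    """Hypothetical selector for the node that serves a read."""

    POLICIES = ("master-first", "round-robin", "random")

    def __init__(self, primary, replicas, policy="master-first", rng=None):
        if policy not in self.POLICIES:
            raise ValueError(f"unknown read policy: {policy}")
        self.primary = primary
        self.nodes = [primary, *replicas]
        self.policy = policy
        self._cycle = itertools.cycle(self.nodes)
        self._rng = rng or random.Random()

    def pick(self):
        """Return the node to read from under the configured policy."""
        if self.policy == "master-first":
            return self.primary
        if self.policy == "round-robin":
            return next(self._cycle)
        return self._rng.choice(self.nodes)
```

Round‑robin and random spread reads for a hot key across the primary and its replicas; since replication is asynchronous, reads served by a replica may briefly return stale data.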

Future Work and Open Challenges

Planned improvements include richer data structures (e.g., table‑like types), versioned data structures, hashtag support for key distribution, more sophisticated monitoring dashboards, client‑side caching, asynchronous client writes, emergency migration of large keys, and range‑scan capabilities for ordered keys.

Practical Case Studies

Real‑world incidents illustrate common pitfalls:

Fetching an entire list with LRANGE key 0 -1 on a list that grew to hundreds of thousands of items caused severe latency; the remedy is to limit batch size or prune the list.

Storing multi‑megabyte values in a single key led to TP99 latency spikes; splitting the value resolves the issue.

Linear or exponential growth of write traffic on a single key overwhelms a shard; spreading the data across several keys, and therefore several shards, relieves the write bottleneck, while adding replicas helps only with read traffic.

Hot keys without local caching increase remote request latency; adding a client‑side cache improves performance.
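For the first case above, one way to limit batch size is to read the list in fixed‑size windows instead of issuing a single LRANGE key 0 -1. The helper below only computes the index pairs; its name and the choice of batch size are assumptions for illustration.

```python
def lrange_batches(length: int, batch_size: int):
    """Yield (start, stop) index pairs covering a list of the given length.

    Both indices are inclusive, matching the semantics of LRANGE.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, length, batch_size):
        yield start, min(start + batch_size, length) - 1
```

With a Redis‑compatible client such as redis-py, the length would come from `llen(key)` and each pair would be passed to `lrange(key, start, stop)`. Whether JIMDB's own SDKs expose the same calls is an assumption. Note that the list may change between batches, so this approach suits lists where a slightly inconsistent snapshot is acceptable.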

Overall, JIMDB demonstrates how a purpose‑built cache platform can provide automatic fault tolerance, lossless scaling, and operational visibility for high‑traffic e‑commerce workloads.


