JIMDB’s Big-Hot Key Solution: Optimizing Distributed Cache Performance

JIMDB, a high‑performance Redis‑based distributed cache, introduces the “Big‑Hot Key” concept to dynamically identify keys that strain CPU or bandwidth, and implements a multi‑layer active governance framework—including real‑time detection, server‑side caching, circuit‑breaker, and client‑side consistency—to dramatically reduce resource consumption and boost throughput.

JD Retail Technology

Chapter 01: Redefining Performance Bottlenecks – The "Big‑Hot Key" Paradigm

JIMDB is a Redis‑based distributed cache service with high performance, high availability, and online scaling capabilities, serving various business units within JD.com. This article focuses on the optimization and thinking around "Big‑Hot Keys" in JIMDB.

Challenge

High‑performance distributed cache systems such as JIMDB/Redis have long suffered from "Big Key" and "Hot Key" problems, which can exhaust single‑node resources, cause response latency, and even trigger avalanche failures, threatening online service stability. Traditional static‑threshold definitions and passive mitigation are insufficient for increasingly complex, high‑throughput scenarios.

Innovation

The JIMDB team proposes the "Big‑Hot Key" concept, shifting the definition from static key attributes to the actual resource impact on CPU and network bandwidth, enabling precise identification of operations that pressure the service.

Solution

JIMDB builds a complete, multi‑layer proactive governance framework. The core is a sophisticated server‑side identification engine that actively discovers risky keys in real time, complemented by a suite of automated and manual tools, including server‑side automatic caching, command request circuit‑breaker, blacklist mechanisms, and intelligent client‑side caching that ensures data consistency.

Results

Online tests show significant improvements. Using only the server‑side caching of serialized results, CPU usage of an instance drops from 100% overload to about 30%, while effective throughput (OPS) rises from ~6,700 to nearly 12,000, an ~80% performance gain and a substantial reduction in stability risk.

Significance

This shift represents a technical transformation from a passive, manually‑intervened response model to an embedded, proactive, highly automated governance model, greatly reducing operational burden and easing developers' mental load.

Chapter 02: Intelligent Perception Layer – JIMDB’s Multi‑Dimensional Identification Engine

After defining "Big‑Hot Key" based on resource impact, the next challenge is building an engine that can identify these risks in real time, accurately, and with low overhead. JIMDB’s server implements a multi‑layer detection strategy, not just simple threshold monitoring.

2.1 Three‑Tier Waterfall Detection Principle

The engine follows a "simple‑to‑complex, fast‑fail" principle, performing three checks sequentially when a command is executed. If any condition is met, the key is marked as a Big‑Hot Key and later checks are skipped:

Bandwidth bottleneck detection.

Collection size bottleneck detection.

CPU compute bottleneck detection.

This ordering ensures the cheapest checks run first, minimizing detection overhead.
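The waterfall can be pictured as a short‑circuiting chain of predicates. The function names, the `stats` fields, and the thresholds below are illustrative assumptions for the sketch, not JIMDB's actual API:

```python
# Minimal sketch of the three-tier waterfall check. Field names and
# thresholds are assumptions drawn from the article's description.

def bandwidth_check(stats):
    # Tier 1: cheapest check, a single ratio comparison.
    return stats["cmd_bytes_per_sec"] > 0.7 * stats["instance_out_bytes_per_sec"]

def size_check(stats):
    # Tier 2: static collection-size limits, independent of QPS.
    return stats["element_count"] > 100_000 or stats["response_bytes"] > 1_000_000

def cpu_model_check(stats):
    # Tier 3: most expensive check, against a model-predicted QPS ceiling.
    return stats["actual_qps"] > stats["predicted_max_qps"]

def is_big_hot_key(stats: dict) -> bool:
    """Run the three checks cheapest-first; stop at the first hit."""
    checks = [bandwidth_check, size_check, cpu_model_check]
    return any(check(stats) for check in checks)
```

`any` over a generator stops at the first check that fires, which mirrors the "fast‑fail" ordering: later, costlier checks never run for a key that an earlier tier already flagged.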

2.2 Strategy 1 – Bandwidth Saturation Early Warning

For commands with low compute complexity but large response data, network bandwidth becomes the bottleneck. The strategy compares the instantaneous bandwidth of the command with a configurable percentage (e.g., 70%) of the instance’s total output bandwidth.

When the bandwidth exceeds the threshold, the command is classified as a Big‑Hot Key, and a "Big‑Hot Key Index" approximates the per‑second output bytes, reflecting network pressure.
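As a rough sketch of that comparison (the parameter names and the bytes‑times‑QPS approximation are assumptions; the article does not publish the exact formula):

```python
def bandwidth_index(cmd_resp_bytes: int, cmd_qps: float,
                    instance_out_bw: float, ratio: float = 0.7):
    """Flag a command whose share of instance output bandwidth crosses a
    configurable ratio; the returned index approximates bytes per second.
    Returns None when the command is below the threshold."""
    cmd_bw = cmd_resp_bytes * cmd_qps          # instantaneous output of this command
    if cmd_bw > ratio * instance_out_bw:
        return cmd_bw                          # index ~= per-second output bytes
    return None
```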

Bandwidth calculation diagram

2.3 Strategy 2 – Static Risk Pre‑Check for Collection Size

For collection types (List, Hash, Set, ZSet), if the element count exceeds 100,000 or the single‑command response size exceeds 1 MB, the operation is directly flagged as a Big‑Hot Key, regardless of QPS.

Element count > 100 k.

Response size > 1 MB.

This pre‑emptive marking helps expose hidden "data bombs" before they cause overload.
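Because the two limits are fixed, the pre‑check reduces to a pair of comparisons; a minimal sketch, using the thresholds from the article:

```python
MAX_ELEMENTS = 100_000        # collection-size ceiling from the article
MAX_RESPONSE_BYTES = 1 << 20  # 1 MB single-response ceiling

def static_risk_check(element_count: int, response_bytes: int) -> bool:
    """Flag a collection operation regardless of its QPS."""
    return element_count > MAX_ELEMENTS or response_bytes > MAX_RESPONSE_BYTES
```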

2.4 Strategy 3 – Machine‑Learning‑Based CPU Load Prediction

For operations not caught by the first two strategies but still potentially CPU‑intensive, JIMDB uses a predictive model. It compares the actual QPS of a specific Key+Command+Params tuple with a predicted maximum QPS derived from a hybrid polynomial‑regression and linear‑interpolation model.

If actual QPS exceeds the predicted limit, the key is deemed a Big‑Hot Key. The resulting index grows quadratically with the overload ratio, providing a clear risk priority.
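The regression model itself is not published; the sketch below only shows the comparison step, and the squared overload ratio is one illustrative reading of "grows quadratically":

```python
def cpu_risk_index(actual_qps: float, predicted_max_qps: float):
    """Return a risk index that grows quadratically with the overload
    ratio, or None when the key is within its predicted capacity."""
    if actual_qps <= predicted_max_qps:
        return None
    overload = actual_qps / predicted_max_qps
    return overload ** 2   # quadratic growth gives a clear risk ordering
```

A key running at twice its predicted ceiling thus scores four times higher than one just at the limit, so the worst offenders sort to the top.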

CPU prediction model diagram

Chapter 03: Multi‑Layer Deep Defense – JIMDB’s Big‑Hot Key Governance Suite

Accurate identification is only the prerequisite; comprehensive governance is what actually keeps the system stable.

3.1 Server‑Side Automatic Mitigation

When a Big‑Hot Key is detected, the server automatically caches the serialized response for read‑intensive keys, allowing subsequent identical requests to be served from memory without re‑traversing the data structure, drastically reducing CPU load.
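A hypothetical sketch of the idea: cache the serialized reply keyed by the exact key, command, and arguments once a key is flagged, so repeat reads skip the expensive traverse‑and‑serialize path (class and method names are invented for illustration):

```python
class ResponseCache:
    """Cache serialized replies for flagged read-intensive keys."""

    def __init__(self):
        self._cache = {}

    def get_or_build(self, key, command, args, build_reply):
        cache_key = (key, command, args)
        reply = self._cache.get(cache_key)
        if reply is None:
            reply = build_reply()        # expensive: traverse + serialize
            self._cache[cache_key] = reply
        return reply

    def invalidate(self, key):
        """Drop cached replies when the key is written."""
        self._cache = {k: v for k, v in self._cache.items() if k[0] != key}
```

Invalidation on write is the crucial detail: the cache only serves requests that are byte‑for‑byte identical to one already answered, so correctness hinges on dropping entries the moment the underlying data changes.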

3.2 Automatic Circuit‑Breaker (disabled by default)

If enabled, the circuit‑breaker rejects all further requests for the offending key, returning an error such as "ERR request rejected by circuit breaker for risk command". This provides an immediate safeguard against destructive access patterns.
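A minimal sketch of such a per‑key breaker, assuming a trip‑and‑reject model (the class shape is illustrative; only the error string comes from the article):

```python
class CircuitBreaker:
    """Illustrative per-key breaker; disabled unless explicitly enabled."""

    def __init__(self, enabled: bool = False):
        self.enabled = enabled
        self.tripped = set()

    def trip(self, key):
        self.tripped.add(key)

    def check(self, key):
        """Raise for tripped keys when the breaker is enabled."""
        if self.enabled and key in self.tripped:
            raise RuntimeError(
                "ERR request rejected by circuit breaker for risk command")
```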

3.3 Smart Client Collaboration – SDK Caching with Version Verification

The SDK caches the value and a version number on the client side after the first request. Subsequent requests send only the key and cached version; the server replies with a short confirmation if the version matches, otherwise it returns the fresh value and new version. This guarantees strong consistency while cutting network traffic.
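The round trip can be sketched with a toy in‑process pair; the message shapes and the `NOT_MODIFIED` sentinel are assumptions, not the JIMDB wire format:

```python
class Server:
    """Stores (value, version) per key; versions bump on every write."""

    def __init__(self):
        self.store = {}

    def set(self, key, value):
        _, version = self.store.get(key, (None, 0))
        self.store[key] = (value, version + 1)

    def get(self, key, client_version=None):
        value, version = self.store[key]
        if client_version == version:
            return ("NOT_MODIFIED", version)   # short confirmation only
        return (value, version)                # full value + new version

class Client:
    """SDK side: caches value + version, revalidates on each read."""

    def __init__(self, server):
        self.server = server
        self.cache = {}

    def get(self, key):
        cached = self.cache.get(key)
        resp, version = self.server.get(key, cached[1] if cached else None)
        if resp == "NOT_MODIFIED":
            return cached[0]                   # serve from local cache
        self.cache[key] = (resp, version)
        return resp
```

Every read still touches the server, so the client can never serve stale data; what the scheme saves is the payload bytes, not the round trip itself.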

Client‑server version check flow

3.4 Manual Intervention Tools

When automation cannot cover a scenario, operators can manually add keys or commands to a blacklist, achieving flexible, controllable circuit‑breaking without data deletion.

3.5 High‑Availability Management Port

An independent management port runs on a separate thread, allowing administrators to diagnose, blacklist, or delete keys even if the main thread is blocked by a Big‑Hot Key operation.
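The key design point is thread isolation: admin commands are serviced off the main command loop, so they stay responsive even while that loop is stalled. A minimal sketch of the pattern (queue names and command strings are invented for illustration):

```python
import queue
import threading

admin_queue: "queue.Queue" = queue.Queue()
admin_results: "queue.Queue" = queue.Queue()

def admin_loop():
    """Runs on its own thread, independent of the main command loop."""
    while True:
        cmd = admin_queue.get()
        if cmd is None:                       # shutdown sentinel
            return
        admin_results.put(f"OK {cmd}")        # e.g. blacklist or delete a key

thread = threading.Thread(target=admin_loop, daemon=True)
thread.start()
admin_queue.put("blacklist hot_key")          # served even if main loop is busy
admin_queue.put(None)
thread.join()
```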

Chapter 04: Empirical Analysis – Quantitative Evaluation in Production

4.1 Test Methodology and Environment

Tests were conducted on physical machines (8 CPU cores, 16 GB RAM). Two versions were compared: a baseline without Big‑Hot Key features and an optimized version with active detection and server‑side caching enabled.

4.2 Scenario 1 – CPU‑Intensive Hotspot

Clients repeatedly executed LRANGE on a List of 2,000 elements (≈15 B each). The baseline saturated CPU at 100% and achieved ~6,700 OPS. The optimized version reduced CPU to ~30% and increased OPS to ~12,000, eliminating alerts.

4.3 Scenario 2 – Bandwidth‑Intensive Hotspot

When a command generated a massive response (≈228 MB/s outbound), the baseline hit both CPU and bandwidth limits. Enabling the circuit‑breaker immediately rejected the request, restoring normal bandwidth and CPU usage.

4.4 Performance Gains Summary

Overall, CPU usage dropped by ~70 percentage points (from 100% to ~30%), throughput increased by ~79% (from ~6,700 OPS to ~12,000 OPS), and system stability improved from high‑risk alerts to stable, alert‑free operation.

Chapter 05: Deployment Status, Upgrade Path, and Future Roadmap

5.1 Current Deployment Scale and Impact

Over 3,000 instances have been upgraded, each successfully identifying and mitigating Big‑Hot Keys, eliminating related port‑blocking alerts.

5.2 Upgrade Strategy

The server can be upgraded independently of clients, with a typical shard upgrade taking about ten minutes and having no impact on normal read/write traffic.

5.3 Future Evolution

JIMDB aims to make Big‑Hot Key governance a platform‑level, standardized, automated service, extending to broader risk management across the ecosystem.

Chapter 06: Technical Value and Industry Perspective

6.1 Comparison with Industry Hot‑Key Solutions

Common approaches include adding read replicas, deploying proxy layers, client‑side caching, or key sharding. These are external patches with higher cost or complexity. JIMDB’s native Big‑Hot Key solution is deeply integrated, shifting responsibility from developers to the platform and providing consistent, reliable behavior.

6.2 Value of Embedded Automated Governance

Embedding the solution reduces hardware and operational costs, lowers engineering effort, and improves system resilience by providing proactive detection, automatic mitigation, and unified handling of hotspot issues.

Chapter 07: Conclusion and Thoughts

JIMDB redefines Big‑Hot Keys, builds a systematic detection and mitigation framework, and has achieved large‑scale production deployment with measurable performance and stability improvements.

Key Takeaways:

Redefine hotspots based on real resource impact rather than static thresholds.

Implement a multi‑layer proactive governance system covering detection, caching, circuit‑breaking, and client‑side consistency.

Deploy at scale to achieve both stability and performance gains.

Join us! Position: Distributed KV Storage R&D Engineer, Location: Beijing Yizhuang, Contact: [email protected]

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Performance Optimization, Redis, Resource Management, JIMDB, Automatic Governance