JIMDB’s Big-Hot Key Solution: Optimizing Distributed Cache Performance
JIMDB, a high‑performance Redis‑based distributed cache, introduces the "Big‑Hot Key" concept to dynamically identify keys that strain CPU or network bandwidth. It implements a multi‑layer active governance framework (real‑time detection, server‑side caching, circuit breaking, and consistency‑preserving client‑side caching) that sharply reduces resource consumption and boosts throughput.
Chapter 01: Redefining Performance Bottlenecks – The "Big‑Hot Key" Paradigm
JIMDB is a Redis‑based distributed cache service with high performance, high availability, and online scaling capabilities, serving various business units within JD.com. This article focuses on the optimization and thinking around "Big‑Hot Keys" in JIMDB.
Challenge
High‑performance distributed cache systems such as JIMDB/Redis have long suffered from "Big Key" and "Hot Key" problems, which can exhaust single‑node resources, cause response latency, and even trigger avalanche failures, threatening online service stability. Traditional static‑threshold definitions and passive mitigation are insufficient for increasingly complex, high‑throughput scenarios.
Innovation
The JIMDB team proposes the "Big‑Hot Key" concept, shifting the definition from static key attributes to the actual resource impact on CPU and network bandwidth, enabling precise identification of operations that pressure the service.
Solution
JIMDB builds a complete, multi‑layer proactive governance framework. The core is a sophisticated server‑side identification engine that actively discovers risky keys in real time, complemented by a suite of automated and manual tools, including server‑side automatic caching, command request circuit‑breaker, blacklist mechanisms, and intelligent client‑side caching that ensures data consistency.
Results
Online tests show significant improvements. Using only the server‑side caching of serialized results, CPU usage of an instance drops from 100% overload to about 30%, while effective throughput (OPS) rises from ~6,700 to nearly 12,000, an ~80% performance gain and a substantial reduction in stability risk.
Significance
This shift represents a technical transformation from a passive, manually‑intervened response model to an embedded, proactive, highly automated governance model, greatly reducing operational burden and easing developers' mental load.
Chapter 02: Intelligent Perception Layer – JIMDB's Multi‑Dimensional Identification Engine
After defining "Big‑Hot Key" based on resource impact, the next challenge is building an engine that can identify these risks in real time, accurately, and with low overhead. JIMDB’s server implements a multi‑layer detection strategy, not just simple threshold monitoring.
2.1 Three‑Tier Waterfall Detection Principle
The engine follows a "cheapest check first, exit on match" principle, performing three checks in sequence when a command executes. If any condition is met, the key is marked as a Big‑Hot Key and the remaining checks are skipped:
Bandwidth bottleneck detection.
Collection size bottleneck detection.
CPU compute bottleneck detection.
This ordering ensures the cheapest checks run first, minimizing detection overhead.
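The ordering above can be sketched as a short‑circuit loop over ordered predicates. This is an illustrative sketch, not JIMDB's actual internals; the field names and the stand‑in checks are assumptions based on the three strategies described in this chapter:

```python
def detect_big_hot_key(cmd, checks):
    """Run detection checks cheapest-first; stop at the first hit.

    `checks` is an ordered list of predicates: bandwidth first, then
    collection size, then the CPU model (the most expensive check last).
    """
    for check in checks:
        if check(cmd):
            return True  # flagged as Big-Hot Key; remaining checks skipped
    return False

# Illustrative stand-ins for the three strategies (thresholds from the text).
checks = [
    lambda c: c["out_bytes_per_sec"] > 0.7 * c["instance_out_bandwidth"],  # bandwidth
    lambda c: c["elements"] > 100_000 or c["response_bytes"] > 1_048_576,  # size
    lambda c: c["qps"] > c["predicted_max_qps"],                           # CPU model
]
```

Because the loop returns on the first match, a command that trips the cheap bandwidth comparison never pays for the model‑based CPU check.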
2.2 Strategy 1 – Bandwidth Saturation Early Warning
For commands with low compute complexity but large response data, network bandwidth becomes the bottleneck. The strategy compares the instantaneous bandwidth of the command with a configurable percentage (e.g., 70%) of the instance’s total output bandwidth.
When the bandwidth exceeds the threshold, the command is classified as a Big‑Hot Key, and a "Big‑Hot Key Index" approximates the per‑second output bytes, reflecting network pressure.
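The rule reduces to a single comparison plus an index. A minimal sketch, assuming the 70% default and treating per‑second output bytes as the index (the function shape is illustrative):

```python
def bandwidth_strategy(cmd_bytes_per_sec, instance_out_bps, fraction=0.7):
    """Return (is_big_hot, index).

    Flags the command when its instantaneous output rate exceeds a
    configurable fraction of the instance's total outbound bandwidth;
    the index approximates the command's per-second output bytes.
    """
    if cmd_bytes_per_sec > fraction * instance_out_bps:
        return True, cmd_bytes_per_sec
    return False, 0
```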
2.3 Strategy 2 – Static Risk Pre‑Check for Collection Size
For collection types (List, Hash, Set, ZSet), an operation is flagged directly as a Big‑Hot Key, regardless of QPS, if either static threshold is exceeded:
Element count > 100,000.
Single‑command response size > 1 MB.
This pre‑emptive marking helps expose hidden "data bombs" before they cause overload.
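The two static thresholds translate directly into a pre‑check (limits as stated above; the function shape is an illustrative sketch):

```python
MAX_ELEMENTS = 100_000        # element-count limit for collection types
MAX_RESPONSE_BYTES = 1 << 20  # 1 MB single-command response limit

def static_size_risk(element_count, response_bytes):
    """Flag a collection operation as a Big-Hot Key regardless of its QPS."""
    return element_count > MAX_ELEMENTS or response_bytes > MAX_RESPONSE_BYTES
```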
2.4 Strategy 3 – Machine‑Learning‑Based CPU Load Prediction
For operations not caught by the first two strategies but still potentially CPU‑intensive, JIMDB uses a predictive model. It compares the actual QPS of a specific Key+Command+Params tuple with a predicted maximum QPS derived from a hybrid polynomial‑regression and linear‑interpolation model.
If actual QPS exceeds the predicted limit, the key is deemed a Big‑Hot Key. The resulting index grows quadratically with the overload ratio, providing a clear risk priority.
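The quadratic growth of the index can be sketched as below. The regression‑plus‑interpolation model is summarized as an opaque `predicted_max_qps` input, and the exact index formula is an assumption derived from the quadratic growth described above:

```python
def cpu_strategy(actual_qps, predicted_max_qps):
    """Return (is_big_hot, index); the index grows quadratically
    with the ratio of actual QPS to the model's predicted maximum."""
    if actual_qps <= predicted_max_qps:
        return False, 0.0
    ratio = actual_qps / predicted_max_qps
    return True, ratio ** 2  # 2x overload -> index 4, 3x -> index 9
```

Squaring the overload ratio spreads mild and severe overloads far apart, which makes the index usable as a risk‑priority signal.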
Chapter 03: Multi‑Layer Deep Defense – JIMDB's Big‑Hot Key Governance Suite
Accurate identification is the premise; comprehensive governance ensures system stability.
3.1 Server‑Side Automatic Mitigation
When a Big‑Hot Key is detected, the server automatically caches the serialized response for read‑intensive keys, allowing subsequent identical requests to be served from memory without re‑traversing the data structure, drastically reducing CPU load.
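The mitigation amounts to memoizing the serialized reply per (key, command, params) tuple, so repeated identical reads skip the data‑structure traversal entirely. A minimal sketch under that assumption; eviction policy and the exact invalidation trigger are omitted:

```python
class ResponseCache:
    """Memoize serialized replies for read commands on flagged keys."""

    def __init__(self):
        self._cache = {}

    def get_or_build(self, key, command, params, build_reply):
        tag = (key, command, params)
        if tag not in self._cache:
            # Traverse the data structure and serialize the reply only once.
            self._cache[tag] = build_reply()
        return self._cache[tag]

    def invalidate(self, key):
        # A write to the key must drop every cached reply for it.
        self._cache = {t: v for t, v in self._cache.items() if t[0] != key}
```

Every identical request after the first is a single dictionary lookup, which is why CPU usage drops even though the same volume of bytes still leaves the instance.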
3.2 Automatic Circuit‑Breaker (disabled by default)
If enabled, the circuit‑breaker rejects all further requests for the offending key, returning an error such as "ERR request rejected by circuit breaker for risk command". This provides an immediate safeguard against destructive access patterns.
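When the breaker is enabled, requests for the offending key are rejected up front with the error shown above. A hypothetical sketch of the control flow (the set of broken keys and the handler shape are assumptions):

```python
BREAKER_ERROR = "ERR request rejected by circuit breaker for risk command"

def handle_request(key, broken_keys, execute):
    """Reject requests for flagged keys outright; execute the rest normally."""
    if key in broken_keys:
        return BREAKER_ERROR  # fail fast, before touching the data structure
    return execute(key)
```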
3.3 Smart Client Collaboration – SDK Caching with Version Verification
The SDK caches the value and a version number on the client side after the first request. Subsequent requests send only the key and cached version; the server replies with a short confirmation if the version matches, otherwise it returns the fresh value and new version. This guarantees strong consistency while cutting network traffic.
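The exchange resembles a conditional GET: the client sends its cached version, and the server answers either with a short "version matches" confirmation or with the fresh value and its new version. A sketch under those assumptions; the wire format and function names are illustrative, not the SDK's actual API:

```python
def server_get(store, key, client_version=None):
    """Return ('match', None) if the client's version is current,
    else ('value', (value, version)) with the fresh data."""
    value, version = store[key]
    if client_version == version:
        return ("match", None)          # short confirmation, no payload
    return ("value", (value, version))  # fresh value plus new version

def client_get(store, key, cache):
    """SDK-side read with version verification against a local cache."""
    cached = cache.get(key)
    status, payload = server_get(store, key, cached[1] if cached else None)
    if status == "match":
        return cached[0]    # serve from the local cache, tiny round trip
    cache[key] = payload    # refresh the cached value and version
    return payload[0]
```

Because every read still round‑trips to the server for the version check, the client never serves stale data; only the payload bytes are saved.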
3.4 Manual Intervention Tools
When automation cannot cover a scenario, operators can manually add keys or commands to a blacklist, achieving flexible, controllable circuit‑breaking without data deletion.
3.5 High‑Availability Management Port
An independent management port runs on a separate thread, allowing administrators to diagnose, blacklist, or delete keys even if the main thread is blocked by a Big‑Hot Key operation.
Chapter 04: Empirical Analysis – Quantitative Evaluation in Production
4.1 Test Methodology and Environment
Tests were conducted on physical machines (8 CPU cores, 16 GB RAM). Two versions were compared: a baseline without Big‑Hot Key features and an optimized version with active detection and server‑side caching enabled.
4.2 Scenario 1 – CPU‑Intensive Hotspot
Clients repeatedly executed LRANGE on a List of 2,000 elements (≈15 B each). The baseline saturated CPU at 100 % and achieved ~6,700 OPS. The optimized version reduced CPU to ~30 % and increased OPS to ~12,000, eliminating alerts.
4.3 Scenario 2 – Bandwidth‑Intensive Hotspot
When a command generated a massive response (≈228 MB/s outbound), the baseline hit both CPU and bandwidth limits. Enabling the circuit‑breaker immediately rejected the request, restoring normal bandwidth and CPU usage.
4.4 Performance Gains Summary
Overall, CPU usage fell from 100 % to about 30 % (a 70‑percentage‑point drop), throughput rose ~79 % (from ~6,700 OPS to ~12,000 OPS), and the system moved from high‑risk alerts to stable, alert‑free operation.
Chapter 05: Deployment Status, Upgrade Path, and Future Roadmap
5.1 Current Deployment Scale and Impact
Over 3,000 instances have been upgraded, each successfully identifying and mitigating Big‑Hot Keys, eliminating related port‑blocking alerts.
5.2 Upgrade Strategy
The server can be upgraded independently of clients, with a typical shard upgrade taking about ten minutes and having no impact on normal read/write traffic.
5.3 Future Evolution
JIMDB aims to make Big‑Hot Key governance a platform‑level, standardized, automated service, extending to broader risk management across the ecosystem.
Chapter 06: Technical Value and Industry Perspective
6.1 Comparison with Industry Hot‑Key Solutions
Common approaches include adding read replicas, deploying proxy layers, client‑side caching, or key sharding. These are external patches with higher cost or complexity. JIMDB’s native Big‑Hot Key solution is deeply integrated, shifting responsibility from developers to the platform and providing consistent, reliable behavior.
6.2 Value of Embedded Automated Governance
Embedding the solution reduces hardware and operational costs, lowers engineering effort, and improves system resilience by providing proactive detection, automatic mitigation, and unified handling of hotspot issues.
Chapter 07: Conclusion and Thoughts
JIMDB redefines Big‑Hot Keys, builds a systematic detection and mitigation framework, and has achieved large‑scale production deployment with measurable performance and stability improvements.
Key Takeaways:
Redefine hotspots based on real resource impact rather than static thresholds.
Implement a multi‑layer proactive governance system covering detection, caching, circuit‑breaking, and client‑side consistency.
Deploy at scale to achieve both stability and performance gains.
Join us! Position: Distributed KV Storage R&D Engineer, Location: Beijing Yizhuang, Contact: [email protected]
JD Retail Technology
Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
