
Cache Big‑Key and Hot‑Key Issues: Case Study, Root‑Cause Analysis, and Mitigation Strategies

During a major promotional event, an oversized Redis cache entry, combined with cache-penetration bursts, saturated network bandwidth and caused a service outage. The team mitigated the incident with Protostuff serialization, gzip compression, request throttling, and enhanced monitoring, and now recommends design-time cache planning and stress testing to prevent future big-key failures.

JD Retail Technology

In modern software architecture, caching is a crucial technique for improving system performance and response speed. However, improper caching can cause serious production incidents, especially around large hot keys. This article examines a common yet often overlooked pair of problems, cache big keys and cache penetration, using a real-world incident to analyze root causes and propose fixes and preventive measures.

Case Description

During a major promotional event, a system created a large activity with many conditions and rewards, resulting in an excessively large cache entry. After the activity went live, the system began to generate various alerts: the core UMP monitoring availability dropped from 100% to 20%, Redis call frequency and query performance plummeted, and a cascade of failures affected multiple core interfaces, ultimately rendering the entire service unavailable.

Root Cause Analysis

The development team used Redis as the cache store, saving each activity as a key‑value pair. For particularly large activities, a local JVM cache (5‑minute TTL) was added as a first‑level cache before falling back to Redis. Despite this precaution, two cache traps emerged:

Cache penetration: when the local cache missed, a massive number of concurrent requests hit Redis simultaneously, overwhelming it.

Network bandwidth bottleneck: the hot key had grown to 1.5 MB. JD’s Redis limits single-shard bandwidth to 200 MB/s, which such a key saturates at roughly 133 reads per second (200 MB ÷ 1.5 MB). The combined effect of cache penetration and the large hot key quickly hit this bandwidth ceiling, causing Redis threads to block and triggering a cache avalanche.
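The 133-requests-per-second figure is back-of-the-envelope arithmetic; a small helper (names here are illustrative, not from the original system) makes the ceiling explicit for any key size:

```java
public class BandwidthMath {
    /**
     * How many full reads of a value of keySizeMb fit into a shard's
     * per-second bandwidth budget of shardBandwidthMb.
     */
    public static long maxRequestsPerSecond(double shardBandwidthMb, double keySizeMb) {
        return (long) Math.floor(shardBandwidthMb / keySizeMb);
    }

    public static void main(String[] args) {
        // 200 MB/s shard limit, 1.5 MB hot key -> about 133 reads/s before saturation.
        System.out.println(maxRequestsPerSecond(200, 1.5));
    }
}
```

For a value this large, even a modest burst of concurrent misses is enough to exhaust the shard, which is why the big key and the penetration burst were so destructive in combination.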

The original query method (pseudocode) is shown below:

// Original lookup: local JVM cache first, then an uncoordinated fallback to Redis.
ActivityCache present = activityLocalCache.getIfPresent(activityDetailCacheKey);
if (present != null) {
    // Return a defensive copy so callers cannot mutate the cached instance.
    return incentiveActivityPOConvert.copyActivityCache(present);
}
// On a local miss, every concurrent caller reaches this point and queries Redis
// at the same time: the cache-penetration path.
ActivityCache remoteCache = getCacheFromRedis(activityDetailCacheKey);
activityLocalCache.put(activityDetailCacheKey, remoteCache);
return remoteCache;

Solution

Big‑key mitigation: switch the cache object serialization from JSON to Protostuff, reducing the object size from 1.5 MB to 0.5 MB.
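The article’s later code references a ProtostuffUtil helper; a minimal sketch of such a helper is below, assuming the io.protostuff runtime artifacts (protostuff-core, protostuff-runtime) are on the classpath. ActivityDto is a hypothetical stand-in for the real cache object.

```java
import io.protostuff.LinkedBuffer;
import io.protostuff.ProtostuffIOUtil;
import io.protostuff.Schema;
import io.protostuff.runtime.RuntimeSchema;

public final class ProtostuffUtil {
    private ProtostuffUtil() {}

    public static <T> byte[] serialize(T obj, Class<T> clazz) {
        Schema<T> schema = RuntimeSchema.getSchema(clazz);
        LinkedBuffer buffer = LinkedBuffer.allocate(LinkedBuffer.DEFAULT_BUFFER_SIZE);
        try {
            // Compact binary encoding: no field names on the wire, unlike JSON.
            return ProtostuffIOUtil.toByteArray(obj, schema, buffer);
        } finally {
            buffer.clear(); // buffers are reusable; clear for the next call
        }
    }

    public static <T> T deserialize(byte[] data, Class<T> clazz) {
        Schema<T> schema = RuntimeSchema.getSchema(clazz);
        T message = schema.newMessage();
        ProtostuffIOUtil.mergeFrom(data, message, schema);
        return message;
    }
}

// Hypothetical minimal DTO standing in for the real ActivityCache.
class ActivityDto {
    String name;
    int rewardCount;
}
```

Because Protostuff encodes field tags rather than field names, the size win grows with the number of repeated nested structures, which is exactly the shape of a large activity with many conditions and rewards.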

Compression: apply gzip compression to cached objects (with a threshold) to shrink data size (e.g., 500 KB compressed to 17 KB).
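A threshold-gated gzip helper along these lines can be built on java.util.zip; the class name and the 1 KB threshold below are illustrative, not the team’s actual values.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipThresholdUtil {
    // Only compress payloads above this size; small values gain little and cost CPU.
    private static final int COMPRESS_THRESHOLD_BYTES = 1024;

    public static byte[] compressIfLarge(byte[] raw) throws IOException {
        if (raw.length < COMPRESS_THRESHOLD_BYTES) {
            return raw;
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(raw);
        }
        return bos.toByteArray();
    }

    public static byte[] decompress(byte[] data) throws IOException {
        // gzip streams start with magic bytes 0x1f 0x8b; anything else passes
        // through untouched (a simplification: raw payloads that happen to start
        // with those bytes would be misread).
        if (data.length < 2 || (data[0] & 0xff) != 0x1f || (data[1] & 0xff) != 0x8b) {
            return data;
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gzip.read(buf)) > 0) {
                bos.write(buf, 0, n);
            }
        }
        return bos.toByteArray();
    }
}
```

Serialized activity objects are highly repetitive (many similar conditions and rewards), which is why gzip achieves dramatic ratios on them, on the order of the 500 KB to 17 KB reduction cited above.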

Cache fallback optimization: after a local cache miss, add a thread lock when querying Redis to limit concurrent fallback requests.
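One way to sketch that per-key lock is a map of ReentrantLocks, so that on a local miss only the first thread reaches Redis while the rest wait and then reuse its result. The class and names below are illustrative, not the original implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

public class SingleFlightLoader<K, V> {
    private final Map<K, V> localCache = new ConcurrentHashMap<>();
    private final Map<K, ReentrantLock> keyLocks = new ConcurrentHashMap<>();
    private final Function<K, V> remoteLoader; // e.g. a call to Redis

    public SingleFlightLoader(Function<K, V> remoteLoader) {
        this.remoteLoader = remoteLoader;
    }

    public V get(K key) {
        V cached = localCache.get(key);
        if (cached != null) {
            return cached;
        }
        // One lock per key: misses on different keys do not block each other.
        ReentrantLock lock = keyLocks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();
        try {
            // Re-check after acquiring the lock: another thread may have loaded it.
            cached = localCache.get(key);
            if (cached != null) {
                return cached;
            }
            V loaded = remoteLoader.apply(key);
            if (loaded != null) {
                localCache.put(key, loaded);
            }
            return loaded;
        } finally {
            lock.unlock();
        }
    }
}
```

Caching libraries such as Caffeine provide the same single-flight behavior out of the box via get(key, loader), which is the shape the revised code below this section uses.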

Monitoring and Redis configuration tuning: regularly monitor Redis network usage and adjust rate‑limit settings to ensure stable operation.

After mitigation, the revised cache retrieval code is:

// After the fix: the local cache's get(key, loader) computes the value once per
// key and blocks concurrent callers (as Caffeine's Cache.get does), so a miss
// no longer sends a thundering herd to Redis.
ActivityCache present = activityLocalCache.get(activityDetailCacheKey,
        key -> getCacheFromRedis(key));
return present; // null only if the activity is absent from Redis as well

Additional binary‑cache retrieval example:

/**
 * Query the binary (compressed Protostuff) activity cache.
 *
 * @param activityDetailCacheBinKey Redis hash key of the binary cache entry
 * @return the deserialized activity, or null on a miss
 */
private ActivityCache getBinCacheFromJimdb(String activityDetailCacheBinKey) {
    // Fetch the serialized activity and its live stock counter in one round trip.
    // (The first field name is reconstructed; the original listing only showed "stock".)
    List<byte[]> activityByteList = slaveCluster.hMget(activityDetailCacheBinKey.getBytes(),
            "activity".getBytes(), "stock".getBytes());
    if (activityByteList.get(0) != null && activityByteList.get(0).length > 0) {
        // Reverse the write path: gunzip first, then Protostuff-deserialize.
        byte[] decompress = ByteCompressionUtil.decompress(activityByteList.get(0));
        ActivityCache activityCache = ProtostuffUtil.deserialize(decompress, ActivityCache.class);
        if (activityCache != null) {
            // Stock is stored as a plain numeric string alongside the binary blob.
            if (activityByteList.get(1) != null && activityByteList.get(1).length > 0) {
                activityCache.setAvailableStock(Integer.valueOf(new String(activityByteList.get(1))));
            }
            return activityCache;
        }
    }
    return null;
}

Preventive Measures

Design‑phase cache strategy: consider cache usage scenarios and data characteristics early to avoid blind adoption of large keys.

Stress testing and performance evaluation: simulate high concurrency and large data volumes before release to uncover potential issues.

Regular system optimization and upgrades: continuously refine the system architecture and adopt new tools to maintain performance and stability.

Conclusion

Cache big keys and hot keys are common pitfalls, and ignoring them can cause severe production incidents. By studying this case and applying the outlined solutions, developers can better recognize and handle such problems. The key to cache-driven performance is deliberate cache design, not indiscriminately caching all data.

Recommended Reading

👉 Full Process of Online Production Database Switch – JD Retail Tech Practice

👉 Billion‑Scale Order System Database Query Performance Optimization – JD Retail Tech Practice

Tags: backend, performance, cache, hot key, Redis, big key
Written by JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.