Cache Big‑Key and Hot‑Key Issues: Case Study, Root‑Cause Analysis, and Mitigation Strategies
This article examines a real‑world incident where oversized and frequently accessed Redis cache keys caused cache penetration and network bandwidth saturation during a high‑traffic promotion, analyzes the underlying reasons, and presents concrete solutions and preventive measures for backend systems.
In modern software architectures, caching is essential for performance, but misuse can lead to severe incidents, especially when dealing with large (big‑key) or frequently accessed (hot‑key) cache entries.
During a Double‑11 sales event, a system experienced a critical outage: a massive promotional activity generated an oversized cache entry, causing Redis call latency to spike, overall availability to drop from 100% to 20%, and a cascade of failures across core services.
The root causes were twofold. First, cache penetration: whenever the local JVM cache was empty, a flood of concurrent requests all fell through and queried Redis for the newly created key at the same time. Second, a network bandwidth bottleneck: each hot-key value was about 1.5 MB, so the stampede quickly saturated the per-shard bandwidth limit (200 Mbps, roughly 133 concurrent accesses), blocking Redis threads and triggering a cache avalanche.
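The penetration mechanism can be reproduced in miniature. The sketch below is illustrative only (class and key names are hypothetical, and a barrier stands in for real traffic timing): every thread that observes a local-cache miss before any thread has written the value back issues its own fetch against the backing store, so one cold key turns into N remote reads.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class StampedeDemo {
    static final Map<String, String> localCache = new ConcurrentHashMap<>();
    static final AtomicInteger redisReads = new AtomicInteger();

    // Stand-in for a Redis read; counts how often the remote store is hit.
    static String fetchFromRedis(String key) {
        redisReads.incrementAndGet();
        return "activity-payload";
    }

    public static void main(String[] args) throws Exception {
        int threads = 8;
        // The barrier forces every thread to observe the local-cache miss
        // before any thread populates the cache: the worst-case interleaving.
        CyclicBarrier allMissed = new CyclicBarrier(threads);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                if (localCache.get("hotKey") == null) {        // every thread misses
                    try {
                        allMissed.await();
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                    localCache.put("hotKey", fetchFromRedis("hotKey")); // every thread fetches
                }
                return localCache.get("hotKey");
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("redis reads for one key: " + redisReads.get()); // prints 8
    }
}
```

With a 1.5 MB value, each of those redundant reads costs the full payload in shard bandwidth, which is why the miss path, not the hit path, is what saturated the network.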
Original cache‑lookup pseudocode:

```java
// Check the local (JVM) cache first.
ActivityCache present = activityLocalCache.getIfPresent(activityDetailCacheKey);
if (present != null) {
    // Return a defensive copy so callers cannot mutate the cached instance.
    ActivityCache activityCache = incentiveActivityPOConvert.copyActivityCache(present);
    return activityCache;
}
// Local miss: there is no lock here, so every concurrent caller that
// misses falls through to Redis for the same key at the same time.
ActivityCache remoteCache = getCacheFromRedis(activityDetailCacheKey);
activityLocalCache.put(activityDetailCacheKey, remoteCache);
return remoteCache;
```

To address the issue, the team implemented several measures:
Big‑key governance: switched serialization from JSON to Protostuff, reducing object size from 1.5 MB to 0.5 MB.
Compression: applied gzip compression with a threshold, shrinking a 500 KB payload to 17 KB.
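`ByteCompressionUtil` is the team's internal helper and its implementation is not shown; the following is a stdlib sketch of the same idea under stated assumptions: the 1 KB threshold and the one-byte raw/gzip marker are choices made here for illustration, not details from the incident.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSketch {
    // Only compress payloads above this size; tiny values can grow under gzip.
    static final int THRESHOLD_BYTES = 1024;

    static byte[] compress(byte[] input) throws IOException {
        if (input.length < THRESHOLD_BYTES) {
            byte[] out = new byte[input.length + 1];
            out[0] = 0; // marker byte: stored raw
            System.arraycopy(input, 0, out, 1, input.length);
            return out;
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        bos.write(1); // marker byte: gzip-compressed
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(input);
        }
        return bos.toByteArray();
    }

    static byte[] decompress(byte[] packed) throws IOException {
        if (packed[0] == 0) {
            return Arrays.copyOfRange(packed, 1, packed.length); // was stored raw
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(
                new ByteArrayInputStream(packed, 1, packed.length - 1))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gzip.read(buf)) > 0) {
                bos.write(buf, 0, n);
            }
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Highly repetitive payloads (like serialized activity config) compress well.
        byte[] payload = new byte[500 * 1024];
        for (int i = 0; i < payload.length; i++) {
            payload[i] = (byte) ('a' + (i % 7));
        }
        byte[] packed = compress(payload);
        System.out.println("500 KB payload packed to " + packed.length + " bytes");
    }
}
```

The marker byte matters: without it, the reader cannot tell a small raw value from a compressed one, and a threshold-gated writer would corrupt reads of sub-threshold entries.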
Cache back‑source optimization: added a thread lock when the local cache missed, limiting concurrent Redis fetches.
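The back-source lock can be sketched with stdlib primitives (the class and key names below are hypothetical, not the team's code): `ConcurrentHashMap.computeIfAbsent` runs the loader at most once per absent key while other callers for that key block and reuse the result, which is the "limit concurrent Redis fetches" idea in miniature.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SingleFlightCache {
    static final ConcurrentHashMap<String, String> localCache = new ConcurrentHashMap<>();
    static final AtomicInteger redisReads = new AtomicInteger();

    // Stand-in for a Redis read; counts how often the remote store is hit.
    static String fetchFromRedis(String key) {
        redisReads.incrementAndGet();
        return "activity-payload";
    }

    // computeIfAbsent invokes the loader at most once per absent key;
    // concurrent callers for the same key wait for that one load instead
    // of stampeding the backing store.
    static String get(String key) {
        return localCache.computeIfAbsent(key, SingleFlightCache::fetchFromRedis);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 100; i++) {
            pool.submit(() -> get("hotKey"));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("redis reads for one key: " + redisReads.get()); // prints 1
    }
}
```

One caveat: `computeIfAbsent` cannot cache a `null` result, so a loader that may legitimately find nothing needs explicit per-key locks or a sentinel value instead.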
Monitoring and Redis configuration tuning: regularly observed network traffic and adjusted rate‑limit settings to keep Redis stable.
After remediation, the cache‑lookup logic became:

```java
// The cache loads through a mapping function on a miss, so only one
// thread per key performs the Redis fetch; concurrent callers for the
// same key wait for and reuse that result instead of stampeding Redis.
ActivityCache present = activityLocalCache.get(activityDetailCacheKey, key -> getCacheFromRedis(key));
if (present != null) {
    return present;
}
```

Additional binary‑cache handling code was introduced:
```java
/**
 * Query the binary (compressed Protostuff) cache.
 */
private ActivityCache getBinCacheFromJimdb(String activityDetailCacheBinKey) {
    // Read both hash fields in one round trip: the compressed activity
    // payload and the available stock. (The first field name was lost in
    // the source text; "activity" is assumed here.)
    List<byte[]> activityByteList = slaveCluster.hMget(activityDetailCacheBinKey.getBytes(),
            "activity".getBytes(), "stock".getBytes());
    if (activityByteList.get(0) != null && activityByteList.get(0).length > 0) {
        // Decompress first, then deserialize with Protostuff.
        byte[] decompress = ByteCompressionUtil.decompress(activityByteList.get(0));
        ActivityCache activityCache = ProtostuffUtil.deserialize(decompress, ActivityCache.class);
        if (activityCache != null) {
            // Stock is stored as a separate, uncompressed field so it can
            // be updated without rewriting the whole payload.
            if (activityByteList.get(1) != null && activityByteList.get(1).length > 0) {
                activityCache.setAvailableStock(Integer.valueOf(new String(activityByteList.get(1))));
            }
            return activityCache;
        }
    }
    return null;
}
```

Preventive measures were also defined: consider cache strategy during design, conduct thorough performance and stress testing, and regularly optimize and upgrade the system to incorporate newer technologies.
In conclusion, big‑key and hot‑key pitfalls can trigger serious production incidents; proper cache design, size control, and monitoring are vital to maintain system stability and performance.
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.