How a Massive Cache Key Crashed a Double‑11 System and How to Prevent It
During a Double-11 promotion, an oversized Redis cache key caused a cascade of failures: cache breakdown, network bandwidth saturation, and a full-blown cache avalanche. The team responded with big-key mitigation, compression, a lock-guarded cache back-source, and monitoring measures to safeguard future deployments.
Introduction
In modern software architecture, caching is essential for performance, but misusing it can cause severe incidents, especially with large or hot cache keys. This article examines a real incident involving big-key and cache-breakdown problems, analyzes the root causes, and offers solutions and preventive measures.
Case Description
During a Double-11 promotion, an activity was configured with so much data that it produced an extremely large cache entry. After launch, Redis call volume and query performance plummeted, UMP-monitored availability dropped from 100% to 20%, and the failure cascaded to multiple core interfaces, leaving the service unavailable.
Root Cause Analysis
The team used Redis to cache each activity. Anticipating big-key and hot-key issues, they placed a 5-minute local JVM cache in front of Redis, falling back to Redis only on a local miss (a sketch of this layering follows the list below). However, two traps emerged:
Cache breakdown: When the local cache was empty, a surge of requests hit Redis simultaneously, causing a cache miss storm.
Network bandwidth bottleneck: The hot key had grown to 1.5 MB. With a per-shard Redis bandwidth cap of roughly 200 MB/s, that allows only about 133 reads of the key per second (200 MB/s ÷ 1.5 MB ≈ 133). Combined with the breakdown, the cap was quickly exceeded, Redis requests began to block, and the result was a cache avalanche.
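The article does not show how the local cache layer was built. The following is a minimal sketch of the 5-minute local JVM cache described above, assuming Caffeine as the cache library; the field name activityLocalCache and the loader getCacheFromRedis mirror the snippets shown later in the article, and ActivityCache is the article's own domain type.

// Minimal sketch of the local JVM cache in front of Redis (library choice assumed: Caffeine).
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

import java.time.Duration;

public class ActivityCacheHolder {

    // Entries expire 5 minutes after being written, so a hot activity is served
    // from the JVM heap instead of hitting Redis on every request.
    private final Cache<String, ActivityCache> activityLocalCache = Caffeine.newBuilder()
            .expireAfterWrite(Duration.ofMinutes(5))
            .maximumSize(10_000)   // bound heap usage; the size limit is an assumption
            .build();

    public ActivityCache getActivity(String activityDetailCacheKey) {
        // On a local miss, the loader back-sources from Redis; Caffeine runs the
        // loader at most once per key for concurrent callers.
        return activityLocalCache.get(activityDetailCacheKey, key -> getCacheFromRedis(key));
    }

    private ActivityCache getCacheFromRedis(String key) {
        // Redis back-source; see the lock-guarded sketch later in the article.
        return null; // placeholder
    }
}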
Solution
The team applied four measures:
Big-key mitigation: Switched serialization from JSON to Protostuff, reducing the cached object from 1.5 MB to 0.5 MB.
Compression: Applied gzip compression to the serialized bytes, shrinking the 500 KB payload to 17 KB (a sketch of the serialize-then-compress helpers follows this list).
Cache back-source optimization: Added a lock around the Redis fetch when the local cache missed, limiting concurrent back-source requests (sketched after the cache-handling code below).
Redis monitoring and configuration: Monitored network usage regularly and tuned Redis rate-limit settings.
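The code below calls helper classes named ProtostuffUtil and ByteCompressionUtil, but the article does not show them. The following is a minimal sketch of what they might look like, assuming the standard Protostuff runtime API and java.util.zip GZIP streams; the method signatures are inferred from the call sites in the snippet below, and IOExceptions are wrapped as unchecked so the call sites compile unchanged.

// Sketch of the serialize-then-compress helpers (implementations assumed, names from the article).
import io.protostuff.LinkedBuffer;
import io.protostuff.ProtostuffIOUtil;
import io.protostuff.Schema;
import io.protostuff.runtime.RuntimeSchema;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class ProtostuffUtil {
    // Serialize an object to compact Protostuff bytes (measure 1: ~1.5 MB JSON -> ~0.5 MB).
    static <T> byte[] serialize(T obj, Class<T> clazz) {
        Schema<T> schema = RuntimeSchema.getSchema(clazz);
        LinkedBuffer buffer = LinkedBuffer.allocate(LinkedBuffer.DEFAULT_BUFFER_SIZE);
        try {
            return ProtostuffIOUtil.toByteArray(obj, schema, buffer);
        } finally {
            buffer.clear();
        }
    }

    // Deserialize Protostuff bytes back into the target type.
    static <T> T deserialize(byte[] data, Class<T> clazz) {
        Schema<T> schema = RuntimeSchema.getSchema(clazz);
        T obj = schema.newMessage();
        ProtostuffIOUtil.mergeFrom(data, obj, schema);
        return obj;
    }
}

class ByteCompressionUtil {
    // gzip-compress the serialized bytes before writing to Redis (measure 2: ~500 KB -> ~17 KB).
    static byte[] compress(byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Reverse of compress(); used on the read path in the snippet below.
    static byte[] decompress(byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gzip.read(buf)) > 0) {
                bos.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}

On the write path the activity is serialized with ProtostuffUtil.serialize and then compressed with ByteCompressionUtil.compress before being stored in the Redis hash; the read path shown further below reverses the two steps.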
After remediation, the cache retrieval logic became:
ActivityCache present = activityLocalCache.get(activityDetailCacheKey, key -> getCacheFromRedis(key));
if (present != null) {
    return present;
}
Additional binary-cache handling code:
/**
 * Look up the binary (compressed) activity cache in Redis.
 * @param activityDetailCacheBinKey key of the Redis hash holding the compressed activity
 * @return the deserialized ActivityCache, or null when the entry is absent
 */
private ActivityCache getBinCacheFromJimdb(String activityDetailCacheBinKey) {
    // Fetch the compressed activity payload and the stock counter in one hash read.
    // The first field name is not shown in the original article and is assumed here.
    List<byte[]> activityByteList = slaveCluster.hMget(activityDetailCacheBinKey.getBytes(),
            "activity".getBytes(), "stock".getBytes());
    if (activityByteList.get(0) != null && activityByteList.get(0).length > 0) {
        // gzip-decompress, then deserialize the Protostuff bytes back into the cache object.
        byte[] decompress = ByteCompressionUtil.decompress(activityByteList.get(0));
        ActivityCache activityCache = ProtostuffUtil.deserialize(decompress, ActivityCache.class);
        if (activityCache != null) {
            if (activityByteList.get(1) != null && activityByteList.get(1).length > 0) {
                activityCache.setAvailableStock(Integer.valueOf(new String(activityByteList.get(1))));
            }
            return activityCache;
        }
    }
    return null;
}
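The article states only that a thread lock guards the Redis back-source when the local cache misses; the locking code itself is not shown. Below is a minimal sketch of one way to implement it, using a single ReentrantLock with a re-check of the local cache after the lock is acquired. The helper loadActivityFromRedis and the lock granularity are assumptions; activityLocalCache and getCacheFromRedis mirror the names used above.

// Sketch of the lock-guarded back-source (measure 3); details assumed, not shown in the article.
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

import java.time.Duration;
import java.util.concurrent.locks.ReentrantLock;

public class ActivityBackSource {

    private final Cache<String, ActivityCache> activityLocalCache = Caffeine.newBuilder()
            .expireAfterWrite(Duration.ofMinutes(5))
            .build();

    private final ReentrantLock backSourceLock = new ReentrantLock();

    ActivityCache getCacheFromRedis(String key) {
        backSourceLock.lock();
        try {
            // Re-check the local cache: another thread may have already back-sourced
            // this key while the current thread was waiting for the lock.
            ActivityCache cached = activityLocalCache.getIfPresent(key);
            if (cached != null) {
                return cached;
            }
            // Only the lock holder reaches Redis, so an expired hot key no longer
            // triggers hundreds of concurrent large reads.
            ActivityCache fromRedis = loadActivityFromRedis(key);
            if (fromRedis != null) {
                activityLocalCache.put(key, fromRedis);
            }
            return fromRedis;
        } finally {
            backSourceLock.unlock();
        }
    }

    // Placeholder for the actual Redis read, e.g. getBinCacheFromJimdb(...) above.
    private ActivityCache loadActivityFromRedis(String key) {
        return null;
    }
}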
Prevention Measures
Design-stage cache strategy: Consider cache usage scenarios and data characteristics early, and avoid blindly caching large keys.
Stress testing and performance evaluation: Simulate high concurrency and large data volumes before release to surface potential issues.
Regular system optimization and upgrades: Continuously refine the architecture, adopt better tooling, and improve performance and stability.
Conclusion
Big‑key and hot‑key pitfalls are common in caching. Proper cache design, size control, compression, and monitoring are crucial to prevent severe online incidents. Use caching wisely to boost performance rather than indiscriminately storing all data.
JD Cloud Developers
JD Cloud Developers (JD Technology's developer community) is a JD Technology Group platform for technical sharing and communication among AI, cloud computing, IoT, and related developers. It publishes JD product and technology information, industry content, and tech event news, embracing technology and partnering with developers to envision the future.