How a Massive Cache Key Crashed a System and How to Prevent It
This article examines a real-world incident in which a massive cache key, combined with cache penetration during a high‑traffic promotion, overloaded Redis and caused a system outage. It then walks through the root‑cause analysis, the mitigations applied (serialization changes, compression, and a lock‑based fallback), and preventive best practices.
Introduction
In modern software architecture, caching is essential for performance, but misuse can cause serious incidents. This article explores the often‑overlooked problems of large cache keys and cache penetration, using a real case to analyze causes and propose solutions and preventive measures.
Case Description
During a major sales event, a system created an activity with many conditions and rewards, generating an extremely large cache entry. After launch, alerts surged, Redis call volume and query performance plummeted, and the overall service became unavailable.
Root Cause Analysis
The team used Redis for activity caching and added a 5‑minute local JVM cache as a safeguard. Two main traps emerged:
Cache Penetration: When the activity was first approved, the data existed only in Redis while every instance's local cache was still empty, so a flood of requests hit Redis simultaneously.
Network Bandwidth Bottleneck: The hot key had grown to 1.5 MB. Each Redis shard caps network throughput (≈200 MB/s here, the figure consistent with the observed limit), which allows only about 133 full fetches of a value that size per second. Combined with the penetration burst, the shard hit its limit, client threads blocked, and a cache avalanche followed.
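The bandwidth ceiling above is simple division: per‑shard throughput divided by value size bounds how many full fetches of the hot key a shard can serve per second. A minimal sketch of that arithmetic (the 200 MB/s and 1.5 MB figures are taken from the incident as reported; treat them as illustrative):

```java
public class HotKeyBandwidth {
    /** Upper bound on full-value fetches per second one shard can serve for one key. */
    static long maxFetchesPerSecond(long shardBandwidthBytesPerSec, long valueSizeBytes) {
        return shardBandwidthBytesPerSec / valueSizeBytes;
    }

    public static void main(String[] args) {
        // 1.5 MB JSON value on a ~200 MB/s shard: only ~133 fetches/s before saturation.
        long before = maxFetchesPerSecond(200_000_000L, 1_500_000L);
        // 17 KB value after Protostuff + gzip: orders of magnitude more headroom.
        long after = maxFetchesPerSecond(200_000_000L, 17_000L);
        System.out.println("before=" + before + " after=" + after);
    }
}
```

The same division explains why shrinking the value (rather than adding shards) was the highest-leverage fix.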
Solution
The team implemented the following measures:
Big‑Key Governance: Switched serialization from JSON to Protostuff, reducing the cached object from 1.5 MB to 0.5 MB.
Compression: Applied gzip compression to cached objects, shrinking the 500 KB payload to 17 KB.
Cache Fallback Optimization: Added a thread lock on local‑cache misses, limiting concurrent Redis fetches.
Monitoring & Redis Configuration Tuning: Regularly monitor Redis network usage and adjust rate‑limit settings to ensure stability.
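The compression step is plain gzip from `java.util.zip`; the article's `ByteCompressionUtil` is a project helper, but a minimal stand-in could look like the sketch below. A 500 KB → 17 KB reduction is plausible because serialized activity rules are highly repetitive:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ByteCompressionUtil {
    /** gzip-compress a byte array. */
    public static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(data);
        }
        return bos.toByteArray();
    }

    /** gunzip back to the original bytes. */
    public static byte[] decompress(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gzip.read(buf)) > 0) {
                bos.write(buf, 0, n);
            }
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Repetitive payloads (like serialized activity conditions) compress very well.
        byte[] original = "{\"reward\":\"coupon\",\"condition\":\"spend>100\"}"
                .repeat(10_000).getBytes();
        byte[] packed = compress(original);
        System.out.println(original.length + " -> " + packed.length);
        System.out.println(Arrays.equals(original, decompress(packed))); // lossless round-trip
    }
}
```

The trade-off is extra CPU on every read and write, which is usually cheap compared to saturating a shard's network link.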
Revised cache‑retrieval pseudo‑code:

ActivityCache present = activityLocalCache.getIfPresent(activityDetailCacheKey);
if (present != null) {
    ActivityCache activityCache = incentiveActivityPOConvert.copyActivityCache(present);
    return activityCache;
}
ActivityCache remoteCache = getCacheFromRedis(activityDetailCacheKey);
if (remoteCache != null) { // guard against caching nulls; local caches typically reject null values
    activityLocalCache.put(activityDetailCacheKey, remoteCache);
}
return remoteCache;

After governance, the simplified retrieval becomes:
// The loading get(key, mappingFunction) computes the value under a per-key lock
// (as in Caffeine/Guava caches), so concurrent misses trigger only one Redis fetch.
ActivityCache present = activityLocalCache.get(activityDetailCacheKey, key -> getCacheFromRedis(key));
if (present != null) {
    return present;
}

Additional binary‑cache handling method:
/**
 * Query binary cache
 */
private ActivityCache getBinCacheFromJimdb(String activityDetailCacheBinKey) {
    // Read two hash fields in one call: the compressed activity payload
    // (field name "activity" is assumed here) and the remaining stock counter.
    List<byte[]> activityByteList = slaveCluster.hMget(activityDetailCacheBinKey.getBytes(),
            "activity".getBytes(), "stock".getBytes());
    if (activityByteList.get(0) != null && activityByteList.get(0).length > 0) {
        // gunzip, then deserialize with Protostuff
        byte[] decompress = ByteCompressionUtil.decompress(activityByteList.get(0));
        ActivityCache activityCache = ProtostuffUtil.deserialize(decompress, ActivityCache.class);
        if (activityCache != null) {
            if (activityByteList.get(1) != null && activityByteList.get(1).length > 0) {
                activityCache.setAvailableStock(Integer.valueOf(new String(activityByteList.get(1))));
            }
            return activityCache;
        }
    }
    return null;
}

Preventive Measures
Design Cache Strategy Early: Consider cache usage scenarios and data characteristics during system design to avoid blind large‑key caching.
Stress Testing & Performance Evaluation: Conduct thorough load tests before release to simulate high concurrency and large data volumes.
Regular System Optimization: Periodically upgrade and refactor the system, adopting new tools and techniques to maintain performance and stability.
Conclusion
Large cache keys and hot keys are common pitfalls; neglecting them can trigger severe outages. By understanding the risks and applying the discussed mitigation and prevention strategies, developers can use caching effectively to boost system performance without compromising reliability.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.
