How a Massive Cache Key Crashed Our System—and the Fixes That Saved It
During a major promotion, one very large activity produced an oversized Redis cache key. The key caused cache penetration, saturated network bandwidth, and triggered a cascade of service failures. This write-up covers the root-cause analysis and the mitigation and prevention measures that followed.
Case Description
During a major promotion, the system created a very large activity whose cached data exceeded 1.5 MB. After launch, Redis call volume and latency spiked and UMP availability dropped from 100% to 20%. The failures then cascaded until the service was unavailable.
Root Cause Analysis
The team used Redis as the cache and added a JVM local cache with a 5-minute expiry in front of it. When the activity went live, however, the local cache was still empty, so a massive number of requests missed locally and hit Redis at the same time. This triggered a cache-penetration problem.
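Additionally, the hot key was large (≈1.5 MB), so reads of it saturated the network bandwidth of a single Redis shard. The shard's default limit is quoted as 200 Mbps, and the estimated ceiling of about 133 concurrent accesses comes from 200 / 1.5 ≈ 133. That estimate treats the limit and the key size as the same unit (200 MB/s); a limit of 200 megabits per second (25 MB/s) would allow only about 16 such reads per second. Past the ceiling, bandwidth throttling kicked in, threads blocked, and the result was a cache avalanche.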
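The bandwidth ceiling can be checked with simple arithmetic. The sketch below is illustrative only: the class and method names are hypothetical, and it shows both readings of the quoted limit (megabytes versus megabits per second), since the unit decides whether the ceiling is 133 or 16 reads per second.

```java
// Back-of-the-envelope ceiling on big-key reads per second for one Redis shard.
// All names here are hypothetical; the numbers come from the incident description.
final class ShardBandwidthEstimate {

    /** Whole reads per second that fit in the limit; both arguments are in megabytes. */
    static long maxReadsPerSecond(double limitMegabytesPerSecond, double valueMegabytes) {
        return (long) Math.floor(limitMegabytesPerSecond / valueMegabytes);
    }

    /** Converts megabits per second to megabytes per second (8 bits per byte). */
    static double megabitsToMegabytes(double megabitsPerSecond) {
        return megabitsPerSecond / 8.0;
    }

    public static void main(String[] args) {
        double valueMb = 1.5; // size of the hot key's value

        // Reading the limit as 200 MB/s reproduces the article's figure.
        System.out.println(maxReadsPerSecond(200.0, valueMb)); // prints 133

        // Reading it as 200 megabits per second gives a much lower ceiling.
        System.out.println(maxReadsPerSecond(megabitsToMegabytes(200.0), valueMb)); // prints 16
    }
}
```

Either way, the ceiling scales inversely with the value size, which is why shrinking the payload is the first mitigation.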
Solution
The team implemented four measures:
Big-key mitigation: Switched serialization from JSON to Protostuff, reducing the object size from 1.5 MB to 0.5 MB.
Compression: Applied gzip compression to payloads above a size threshold, shrinking a 500 KB payload to 17 KB.
Back-to-origin optimization: Added a lock on local-cache misses to limit the number of concurrent Redis fetches.
Redis monitoring and configuration: Began monitoring network usage regularly and adjusting rate-limit settings accordingly.
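The threshold-based compression measure might look roughly like the sketch below. It is not the team's implementation: the class name, the 10 KB threshold, and the one-byte format flag are all assumptions made for illustration, and it uses only the JDK's `java.util.zip` classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical codec: gzip only payloads at or above a size threshold.
final class ThresholdGzipCodec {
    // Assumed threshold; small payloads gain little from gzip and are stored as-is.
    static final int THRESHOLD_BYTES = 10 * 1024;
    private static final byte RAW = 0;
    private static final byte GZIP = 1;

    /** Prefixes a one-byte flag so the reader knows whether to decompress. */
    static byte[] encode(byte[] payload) {
        boolean compress = payload.length >= THRESHOLD_BYTES;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(compress ? GZIP : RAW);
        try {
            if (compress) {
                // Closing the gzip stream flushes the trailer into 'out'.
                try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
                    gzip.write(payload);
                }
            } else {
                out.write(payload);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Reverses encode(): strips the flag byte and decompresses if needed. */
    static byte[] decode(byte[] stored) {
        try {
            ByteArrayInputStream in = new ByteArrayInputStream(stored, 1, stored.length - 1);
            if (stored[0] == RAW) {
                return in.readAllBytes();
            }
            try (GZIPInputStream gzip = new GZIPInputStream(in)) {
                return gzip.readAllBytes();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The flag byte lets compressed and uncompressed values coexist under the same key scheme, so the threshold can be tuned later without invalidating entries already in Redis.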
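Updated cache-fetch code:
// Fast path: serve from the JVM local cache when the entry is present.
ActivityCache present = activityLocalCache.getIfPresent(activityDetailCacheKey);
if (present != null) {
    return present;
}
// Local miss: fall back to Redis, then repopulate the local cache.
ActivityCache remoteCache = getCacheFromRedis(activityDetailCacheKey);
if (remoteCache != null) {
    // Guard the put: Guava- and Caffeine-style caches reject null values.
    activityLocalCache.put(activityDetailCacheKey, remoteCache);
}
return remoteCache;
Further refactoring introduced binary cache handling with Protostuff deserialization (code omitted for brevity).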
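The cache-fetch excerpt does not show the lock itself, so here is a minimal sketch of how a lock on local-cache misses can limit concurrent Redis fetches. It is an assumption-laden illustration, not the team's code: `SingleFlightCache` and `remoteLoader` are hypothetical names, the loader stands in for `getCacheFromRedis`, and expiry is omitted. It relies on `ConcurrentHashMap.computeIfAbsent`, which runs the loader at most once per absent key while other callers for that key wait.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.IntStream;

// Hypothetical local cache that lets only one caller per key go back to the origin.
final class SingleFlightCache<V> {
    private final ConcurrentHashMap<String, V> local = new ConcurrentHashMap<>();
    private final Function<String, V> remoteLoader; // stands in for getCacheFromRedis

    SingleFlightCache(Function<String, V> remoteLoader) {
        this.remoteLoader = remoteLoader;
    }

    V get(String key) {
        V present = local.get(key); // fast path, no locking
        if (present != null) {
            return present;
        }
        // Miss: computeIfAbsent invokes the loader atomically for this key, so
        // concurrent callers block until the first load completes and then
        // reuse its result. A null result is returned but not cached.
        return local.computeIfAbsent(key, remoteLoader);
    }

    public static void main(String[] args) {
        AtomicInteger redisCalls = new AtomicInteger();
        SingleFlightCache<String> cache = new SingleFlightCache<>(key -> {
            redisCalls.incrementAndGet(); // count simulated Redis round trips
            return "payload-for-" + key;
        });
        // Many concurrent requests for the same cold key.
        IntStream.range(0, 1000).parallel().forEach(i -> cache.get("activity:1"));
        System.out.println(redisCalls.get()); // prints 1
    }
}
```

Because `computeIfAbsent` holds an internal lock while the loader runs, a slow Redis call can also delay unrelated keys that share the same map bin. If `activityLocalCache` is a Guava or Caffeine cache (the source does not say), their `get(key, loader)` methods offer similar load-once-per-key behavior.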
Prevention Measures
Design stage: evaluate cache strategy and avoid large keys.
Conduct pressure testing and performance profiling before release.
Periodically optimize and upgrade the system, adopting new tools to improve stability.
Conclusion
Big-key and hot-key issues are common pitfalls, and neglecting them can cause severe outages. Compact serialization, compression, lock-protected back-to-origin loading, and proactive monitoring are essential to keeping cache performance reliable.
dbaplus Community
