How a Massive Cache Key Crashed Our System—and the Fixes That Saved It

During a major promotion, one very large activity produced an oversized Redis cache value. Requests that missed the empty local cache piled onto Redis (cache penetration), the big hot key saturated network bandwidth, and service failures cascaded. This article walks through the root-cause analysis and the mitigation and prevention measures that followed.

dbaplus Community

Case Description

During a major promotion, a system created a very large activity whose cached data exceeded 1.5 MB. After the activity went live, Redis call volume and latency spiked and UMP availability dropped from 100% to 20%. The resulting cascade of failures made the service unavailable.

Root Cause Analysis

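The system used Redis as its cache, fronted by a JVM local cache with a 5-minute expiry. When the activity went live, however, the local cache was still empty, so the flood of concurrent requests all missed locally and went straight to Redis. This caused a cache-penetration problem: with no local copy of the hot key, every concurrent caller fell through to Redis at once (a pattern also known as a cache stampede).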

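Making matters worse, the hot key was about 1.5 MB, so a modest number of reads was enough to saturate the network bandwidth of the single Redis shard holding it. The shard's default bandwidth limit is 200 Mbps, which the team estimated at about 133 full reads of the key per second (200 / 1.5 ≈ 133). That division treats the limit as 200 MB/s; read literally as 200 megabits per second (25 MB/s), the shard could serve only about 17 such reads per second, so the headroom was even smaller. Once the limit was reached, traffic was throttled, application threads blocked waiting on Redis, and the failures snowballed into a cache avalanche.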
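Dividing a shard's bandwidth cap by the value size gives the number of full reads of that value the shard can serve per second. The back-of-the-envelope sketch below computes this under both readings of the 200 M limit (megabytes or megabits per second), using decimal units for simplicity. The class and method names are illustrative only, and the 17 KB size is the post-compression figure reported for this incident.

```java
/**
 * Back-of-the-envelope check: how many full reads of one value per second
 * a single Redis shard can serve before it hits its bandwidth cap.
 * Uses decimal units (1 MB = 1,000,000 bytes) for simplicity.
 */
public final class ShardBandwidth {
    public static final double MB = 1_000_000.0;
    public static final double KB = 1_000.0;

    /** Reads per second = bandwidth (bytes/s) / value size (bytes). */
    public static double maxReadsPerSecond(double bandwidthBytesPerSec, double valueBytes) {
        return bandwidthBytesPerSec / valueBytes;
    }

    public static void main(String[] args) {
        double capIfMegabytes = 200 * MB;    // limit read as 200 MB/s
        double capIfMegabits = 200 * MB / 8; // limit read as 200 Mbit/s = 25 MB/s

        double originalValue = 1.5 * MB;     // JSON value at launch
        double compressedValue = 17 * KB;    // after Protostuff + gzip

        System.out.printf("1.5 MB value, 200 MB/s  : %.0f reads/s%n",
                maxReadsPerSecond(capIfMegabytes, originalValue));
        System.out.printf("1.5 MB value, 200 Mbit/s: %.0f reads/s%n",
                maxReadsPerSecond(capIfMegabits, originalValue));
        System.out.printf("17 KB value, 200 Mbit/s : %.0f reads/s%n",
                maxReadsPerSecond(capIfMegabits, compressedValue));
    }
}
```

Running `main` prints roughly 133, 17 and 1,471 reads per second, which shows why shrinking the value, rather than only tuning the shard, restores most of the headroom.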

Solution

The team implemented four measures:

Big‑key mitigation: Switched serialization from JSON to Protostuff, reducing object size from 1.5 MB to 0.5 MB.

Compression: Applied gzip compression with a threshold, shrinking a 500 KB payload to 17 KB.

Back-to-origin optimization: Added a lock on local-cache misses so that concurrent requests do not all fetch the same key from Redis at once.

Redis monitoring & configuration: Regularly monitor network usage and adjust rate‑limit settings.
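The threshold-based gzip compression can be sketched with the JDK's `java.util.zip` classes. This is a minimal sketch, not the team's code: the 4 KB threshold and the one-byte flag header that records whether a payload was compressed are assumptions for illustration, since the article does not state the threshold it used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public final class ThresholdGzip {
    // Hypothetical threshold: values smaller than this are stored uncompressed.
    public static final int COMPRESS_THRESHOLD_BYTES = 4 * 1024;

    // One-byte header so the reader knows how to decode the payload.
    private static final byte RAW = 0;
    private static final byte GZIP = 1;

    /** Encodes a serialized value for Redis, gzipping it only above the threshold. */
    public static byte[] encode(byte[] value) {
        if (value.length < COMPRESS_THRESHOLD_BYTES) {
            return withHeader(RAW, value);
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed (and fully flushed) before we read the bytes.
        return withHeader(GZIP, bos.toByteArray());
    }

    /** Reverses encode(): strips the header and decompresses if needed. */
    public static byte[] decode(byte[] stored) {
        byte[] body = Arrays.copyOfRange(stored, 1, stored.length);
        if (stored[0] == RAW) {
            return body;
        }
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(body))) {
            return gz.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static byte[] withHeader(byte flag, byte[] body) {
        byte[] out = new byte[body.length + 1];
        out[0] = flag;
        System.arraycopy(body, 0, out, 1, body.length);
        return out;
    }
}
```

The explicit flag byte is one design choice among several; sniffing the gzip magic number would also work, but an explicit flag cannot be confused with uncompressed data that happens to start with the same bytes. The threshold avoids spending CPU on small values where gzip's fixed overhead outweighs the savings.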

Updated cache‑fetch code:

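// 1. Try the JVM local cache first (entries expire after 5 minutes).
ActivityCache present = activityLocalCache.getIfPresent(activityDetailCacheKey);
if (present != null) {
    return present;
}
// 2. Local miss: fall back to Redis.
ActivityCache remoteCache = getCacheFromRedis(activityDetailCacheKey);
// 3. Populate the local cache only with non-null values;
//    Guava and Caffeine caches both reject null values with a NullPointerException.
if (remoteCache != null) {
    activityLocalCache.put(activityDetailCacheKey, remoteCache);
}
return remoteCache;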
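The fetch path above checks the local cache and then Redis but does not show the lock from the back-to-origin measure. Below is a minimal sketch of one way to add it, assuming a per-key `ReentrantLock` with a double check after acquiring the lock. The `LockedLocalCache` class is hypothetical: a plain `ConcurrentHashMap` without expiry stands in for the real 5-minute local cache, and `remoteLoader` stands in for `getCacheFromRedis`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

/**
 * Lock-guarded back-to-origin loading: on a local miss, only one thread per key
 * calls the remote loader; the others wait and then read the freshly cached value.
 */
public final class LockedLocalCache<V> {
    // Stand-in for the real JVM local cache (no expiry in this sketch).
    private final ConcurrentHashMap<String, V> local = new ConcurrentHashMap<>();
    // One lock per key; entries are never removed in this sketch.
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();
    // Hypothetical stand-in for getCacheFromRedis.
    private final Function<String, V> remoteLoader;

    public LockedLocalCache(Function<String, V> remoteLoader) {
        this.remoteLoader = remoteLoader;
    }

    public V get(String key) {
        V present = local.get(key);
        if (present != null) {
            return present; // fast path: local hit, no locking
        }
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        lock.lock();
        try {
            // Double check: another thread may have loaded the value while we waited.
            present = local.get(key);
            if (present != null) {
                return present;
            }
            V remote = remoteLoader.apply(key); // only the lock holder reaches Redis
            if (remote != null) {
                local.put(key, remote);
            }
            return remote;
        } finally {
            lock.unlock();
        }
    }
}
```

Locking per key rather than globally means a cold start on one hot activity does not serialize lookups for unrelated keys. In a multi-instance deployment each JVM still makes its own Redis call, so this bounds concurrency per instance rather than cluster-wide.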

Further refactoring introduced binary cache handling with Protostuff deserialization (code omitted for brevity).
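Protostuff is a third-party library and the refactored code is omitted, so instead of guessing at it, the sketch below uses only `java.io.DataOutputStream` to illustrate why a schema-based binary encoding is smaller than JSON: the field order is fixed in code, so field names are not repeated in every object. The `Item` record and its fields are hypothetical, and names are assumed to contain no characters that need JSON escaping. Protostuff additionally uses variable-length integers, so real savings will differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

public final class EncodingSizeDemo {
    /** Hypothetical activity item; the real ActivityCache structure is not shown. */
    public record Item(long skuId, String name, int priceCents) {}

    /** JSON laid out as a typical serializer would: every object repeats its field names. */
    public static byte[] toJson(List<Item> items) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < items.size(); i++) {
            Item it = items.get(i);
            if (i > 0) {
                sb.append(',');
            }
            sb.append("{\"skuId\":").append(it.skuId())
              .append(",\"name\":\"").append(it.name()).append('"')
              .append(",\"priceCents\":").append(it.priceCents()).append('}');
        }
        return sb.append(']').toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Schema-based binary: a count, then fixed-order fields with no names. */
    public static byte[] toBinary(List<Item> items) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bos)) {
            out.writeInt(items.size());
            for (Item it : items) {
                out.writeLong(it.skuId());      // 8 bytes
                out.writeUTF(it.name());        // 2-byte length + UTF-8 bytes
                out.writeInt(it.priceCents());  // 4 bytes
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}
```

For 1,000 items named `sku-0` through `sku-999`, the binary form is 20,894 bytes, while the JSON form is several times larger because `"skuId"`, `"name"` and `"priceCents"` are written 1,000 times each.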

Prevention Measures

Design stage: evaluate the caching strategy up front and avoid creating large keys.

Before release: run pressure tests and performance profiling, including the case where the local cache is still empty.

Ongoing: periodically review and upgrade the system, adopting new tools where they improve stability.
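One way to enforce "avoid large keys" is a size check on every serialized value before it is written to Redis, so that an oversized activity fails loudly in testing instead of in production. Below is a minimal sketch. The `BigKeyGuard` name and the 10 KB limit are hypothetical; an appropriate limit depends on shard bandwidth and traffic.

```java
public final class BigKeyGuard {
    // Hypothetical limit; tune it to your shard bandwidth and expected read rate.
    public static final int MAX_VALUE_BYTES = 10 * 1024;

    /** Throws if a serialized value is too large to be cached safely as a single key. */
    public static void checkValueSize(String key, byte[] serializedValue) {
        if (serializedValue.length > MAX_VALUE_BYTES) {
            throw new IllegalArgumentException(String.format(
                    "value for key %s is %d bytes, above the %d-byte big-key limit",
                    key, serializedValue.length, MAX_VALUE_BYTES));
        }
    }
}
```

Calling the check right after serialization (and after compression, if used) measures the bytes that actually cross the network.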

Conclusion

Big-key and hot-key issues are common pitfalls, and neglecting them can cause severe outages. Compact serialization, compression, locking on local-cache misses to limit back-to-origin traffic, and proactive monitoring are all essential to keeping cache performance reliable.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
