How a Massive Cache Key Crashed a System and How to Prevent It

This article examines a real-world incident in which a massive cache key, combined with cache penetration during a high‑traffic promotion, overloaded Redis and caused a system outage. It then walks through the root‑cause analysis, the mitigation steps (serialization changes, compression, a lock‑based fallback), and preventive best practices.

JD Tech Talk

Introduction

In modern software architecture, caching is essential for performance, but misuse can cause serious incidents. This article explores the often‑overlooked problems of large cache keys and cache penetration, using a real case to analyze causes and propose solutions and preventive measures.

Case Description

During a major sales event, a system created an activity with many conditions and rewards, generating an extremely large cache entry. After launch, alerts surged, Redis throughput and query performance plummeted, and the overall service became unavailable.

Root Cause Analysis

The team used Redis for activity caching and added a 5‑minute local JVM cache as a safeguard. Two main traps emerged:

Cache Penetration: When the activity was first approved, the data existed only in Redis while the local cache was empty, causing a flood of requests to hit Redis simultaneously.

Network Bandwidth Bottleneck: The hot key reached 1.5 MB. A Redis shard's bandwidth is capped at roughly 200 Mbps, allowing only about 133 concurrent accesses to such a key. Combined with the penetration burst, the shard hit its limit, threads blocked, and a cache avalanche ensued.
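The back-of-the-envelope ceiling can be reproduced directly; the figures and units are taken exactly as the article quotes them (a 200 Mbps shard cap divided by a 1.5 MB key):

```java
/** Reproduces the article's hot-key access ceiling estimate (figures as quoted). */
public class HotKeyBandwidthMath {

    /** Accesses a single shard can serve for one key, per the article's figures. */
    public static double accessCeiling(double shardBandwidthCap, double keySize) {
        return shardBandwidthCap / keySize;
    }

    public static void main(String[] args) {
        // 200 / 1.5 ≈ 133 — the limit quoted for the 1.5 MB hot key.
        System.out.printf("ceiling ≈ %.0f accesses%n", accessCeiling(200, 1.5));
    }
}
```

Once that ceiling is exceeded, every additional request for the key queues behind the shard's bandwidth, which is why a single big key can stall an entire shard.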

Solution

The team implemented the following measures:

Big‑Key Governance: Switched serialization from JSON to Protostuff, reducing the object size from 1.5 MB to 0.5 MB.

Compression: Applied gzip compression to cached objects, shrinking a 500 KB payload to 17 KB.
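The gzip step can be sketched with the JDK's built-in `java.util.zip` streams. The class and method names below are illustrative (the team's own `ByteCompressionUtil` is not shown in the source); the dramatic ratio is plausible because serialized activity configs are highly repetitive:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical gzip codec for cached byte payloads (stand-in for ByteCompressionUtil). */
public class GzipCacheCodec {

    /** Compress a serialized cache payload before writing it to Redis. */
    public static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(data);
        }
        return bos.toByteArray();
    }

    /** Decompress a payload read back from Redis. */
    public static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = gzip.read(buffer)) > 0) {
                bos.write(buffer, 0, n);
            }
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Repetitive JSON-like payloads (typical of activity configs) compress very well.
        byte[] payload = "{\"reward\":\"coupon\",\"condition\":\"order>100\"}".repeat(10_000).getBytes();
        byte[] compressed = compress(payload);
        System.out.println("original=" + payload.length + " bytes, compressed=" + compressed.length + " bytes");
    }
}
```

Compression trades a little CPU on each read and write for far less network transfer per access, which is exactly the resource the shard was running out of.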

Cache Fallback Optimization: Added a thread lock on local‑cache misses, limiting concurrent Redis fetches.
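The miss-time lock can be sketched with the JDK alone: `ConcurrentHashMap.computeIfAbsent` runs the loading function under a per-key lock, so concurrent misses for the same key collapse into a single Redis fetch. Class and method names here are illustrative, not the team's actual code (which relied on a loading-cache API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/**
 * Sketch of a miss-time lock: when the local cache has no entry, only one
 * thread per key loads from Redis; concurrent callers wait for that result.
 */
public class SingleFlightLocalCache<K, V> {
    private final Map<K, V> local = new ConcurrentHashMap<>();
    private final Function<K, V> redisLoader;

    public SingleFlightLocalCache(Function<K, V> redisLoader) {
        this.redisLoader = redisLoader;
    }

    public V get(K key) {
        // computeIfAbsent invokes the loader at most once per key; other
        // threads requesting the same key block until that load completes.
        return local.computeIfAbsent(key, redisLoader);
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger redisCalls = new AtomicInteger();
        SingleFlightLocalCache<String, String> cache =
                new SingleFlightLocalCache<>(k -> { redisCalls.incrementAndGet(); return "activity-data"; });

        int threads = 50;
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(threads);
        for (int i = 0; i < threads; i++) {
            new Thread(() -> {
                try { start.await(); } catch (InterruptedException ignored) { }
                cache.get("activityDetailCacheKey");
                done.countDown();
            }).start();
        }
        start.countDown();
        done.await();
        System.out.println("redis fetches for 50 concurrent gets: " + redisCalls.get()); // prints 1
    }
}
```

This is the same guarantee the team obtained from their loading cache: on a cold local cache, Redis sees one fetch per key instead of one per waiting thread.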

Monitoring & Redis Configuration Tuning: Regularly monitor Redis network usage and adjust rate‑limit settings to ensure stability.

The cache‑retrieval pseudo‑code before governance:

ActivityCache present = activityLocalCache.getIfPresent(activityDetailCacheKey);
if (present != null) {
    // Return a defensive copy so callers cannot mutate the cached instance.
    return incentiveActivityPOConvert.copyActivityCache(present);
}
// On a local miss, every thread falls through to Redis at once — the penetration trap.
ActivityCache remoteCache = getCacheFromRedis(activityDetailCacheKey);
if (remoteCache != null) {
    // Guard against null: caching APIs such as Caffeine reject null values.
    activityLocalCache.put(activityDetailCacheKey, remoteCache);
}
return remoteCache;

After governance, the simplified retrieval becomes:

// get(key, loader) computes the value under a per-key lock: on a local-cache
// miss, only one thread calls Redis; other threads for the same key wait for its result.
ActivityCache present = activityLocalCache.get(activityDetailCacheKey, key -> getCacheFromRedis(key));
return present;

Additional binary‑cache handling method:

/**
 * Query the binary cache: a Protostuff-serialized, gzip-compressed activity
 * object plus its stock counter, stored as fields of one Redis hash.
 */
private ActivityCache getBinCacheFromJimdb(String activityDetailCacheBinKey) {
    // Fetch both hash fields in one HMGET round trip. The "activity" field
    // name is an assumed placeholder; the source listing does not name the
    // field holding the serialized object.
    List<byte[]> activityByteList = slaveCluster.hMget(
            activityDetailCacheBinKey.getBytes(), "activity".getBytes(), "stock".getBytes());
    if (activityByteList.get(0) != null && activityByteList.get(0).length > 0) {
        // Reverse the write path: gunzip first, then Protostuff-deserialize.
        byte[] decompress = ByteCompressionUtil.decompress(activityByteList.get(0));
        ActivityCache activityCache = ProtostuffUtil.deserialize(decompress, ActivityCache.class);
        if (activityCache != null) {
            if (activityByteList.get(1) != null && activityByteList.get(1).length > 0) {
                // Stock is stored as a plain numeric string so it can be updated independently.
                activityCache.setAvailableStock(Integer.valueOf(new String(activityByteList.get(1))));
            }
            return activityCache;
        }
    }
    return null;
}

Preventive Measures

Design Cache Strategy Early: Consider cache usage scenarios and data characteristics during system design to avoid blindly caching large keys.

Stress Testing & Performance Evaluation: Conduct thorough load tests before release to simulate high concurrency and large data volumes.

Regular System Optimization: Periodically upgrade and refactor the system, adopting new tools and techniques to maintain performance and stability.

Conclusion

Large cache keys and hot keys are common pitfalls; neglecting them can trigger severe outages. By understanding the risks and applying the discussed mitigation and prevention strategies, developers can use caching effectively to boost system performance without compromising reliability.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Performance Optimization, caching, cache penetration, Big Key
Written by

JD Tech Talk

Official JD Tech public account delivering best practices and technology innovation.
