
Cache Big‑Key and Hot‑Key Issues: Case Study, Root‑Cause Analysis, and Mitigation Strategies

This article examines a real‑world incident where oversized and frequently accessed Redis cache keys caused cache penetration and network bandwidth saturation during a high‑traffic promotion, analyzes the underlying reasons, and presents concrete solutions and preventive measures for backend systems.

JD Tech

In modern software architectures, caching is essential for performance, but misuse can lead to severe incidents, especially when dealing with large (big‑key) or frequently accessed (hot‑key) cache entries.

During a Double‑11 sales event, a system experienced a critical outage: a massive promotional activity generated an oversized cache entry, causing Redis call latency to spike, overall availability to drop from 100% to 20%, and a cascade of failures across core services.

The root causes were twofold. First, cache penetration: whenever the local JVM cache was empty, a flood of concurrent requests fell through and queried Redis simultaneously for the newly created key. Second, a network bandwidth bottleneck: each hot key was about 1.5 MB, so the per-shard bandwidth limit of 200 Mbps could sustain only roughly 133 concurrent accesses; beyond that, Redis threads blocked and the failures cascaded into a cache avalanche.

Original cache‑lookup pseudocode:

// Check the local (in-process) cache first.
ActivityCache present = activityLocalCache.getIfPresent(activityDetailCacheKey);
if (present != null) {
    ActivityCache activityCache = incentiveActivityPOConvert.copyActivityCache(present);
    return activityCache;
}
// Local miss: every concurrent caller falls through to Redis at once,
// so a newly created (or just-expired) key triggers a request stampede.
ActivityCache remoteCache = getCacheFromRedis(activityDetailCacheKey);
activityLocalCache.put(activityDetailCacheKey, remoteCache);
return remoteCache;
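
The weakness of this pattern can be demonstrated in isolation: with a plain map-backed local cache, every thread that misses goes to the backend independently. The following self-contained sketch (all names are illustrative, not the article's code) counts backend loads when many threads miss the same key at once:

```java
import java.util.Map;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class StampedeDemo {
    static final Map<String, String> localCache = new ConcurrentHashMap<>();
    static final AtomicInteger backendLoads = new AtomicInteger();

    // Stand-in for a Redis fetch; counts every round trip.
    static String fetchFromBackend(String key) {
        backendLoads.incrementAndGet();
        try { Thread.sleep(50); } catch (InterruptedException ignored) { }
        return "value-for-" + key;
    }

    // Same shape as the original lookup: check local cache, else fetch and put.
    static String lookup(String key) {
        String present = localCache.get(key);
        if (present != null) {
            return present;
        }
        String remote = fetchFromBackend(key);
        localCache.put(key, remote);
        return remote;
    }

    // Release N threads against the same cold key and report backend loads.
    public static int concurrentLoads(int threads) throws Exception {
        backendLoads.set(0);
        localCache.clear();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try { start.await(); lookup("hot-key"); } catch (Exception ignored) { }
                done.countDown();
            });
        }
        start.countDown(); // release all threads at once
        done.await();
        pool.shutdown();
        return backendLoads.get(); // typically close to `threads`, not 1
    }
}
```

Each missing thread performs its own backend fetch, which is exactly the stampede that hammered Redis when the promotional key first appeared.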

To address the issue, the team implemented several measures:

Big‑key governance: switched serialization from JSON to Protostuff, reducing object size from 1.5 MB to 0.5 MB.

Compression: applied gzip compression with a threshold, shrinking a 500 KB payload to 17 KB.
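
The article does not show the compression utility itself; a minimal sketch of threshold-gated gzip using only java.util.zip might look like the following (the class name mirrors the ByteCompressionUtil used later, but the threshold value and pass-through detection are illustrative assumptions):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class ByteCompressionUtil {
    // Illustrative threshold: tiny payloads gain little from gzip.
    private static final int COMPRESS_THRESHOLD_BYTES = 4 * 1024;

    public static byte[] compress(byte[] input) throws IOException {
        if (input.length < COMPRESS_THRESHOLD_BYTES) {
            return input; // below threshold: store as-is
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(input);
        }
        return bos.toByteArray();
    }

    public static byte[] decompress(byte[] input) throws IOException {
        // Simplified detection via the gzip magic header (0x1f 0x8b);
        // payloads stored uncompressed pass through unchanged.
        if (input.length < 2 || input[0] != (byte) 0x1f || input[1] != (byte) 0x8b) {
            return input;
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(input))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gzip.read(buf)) > 0) {
                bos.write(buf, 0, n);
            }
        }
        return bos.toByteArray();
    }
}
```

Gating on a threshold avoids paying CPU and header overhead for small values while still shrinking the large, repetitive activity payloads dramatically, consistent with the 500 KB to 17 KB reduction reported above.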

Cache back‑source optimization: added a thread lock when the local cache missed, limiting concurrent Redis fetches.
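
The article does not detail the locking scheme; the same request-collapsing idea can be sketched with standard-library primitives (class and method names here are illustrative, not the team's code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// A tiny loading cache: on a miss, only one caller per key runs the loader,
// and concurrent callers for the same key block on that single computation.
public class SingleFlightCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();

    public V get(K key, Function<K, V> loader) {
        // computeIfAbsent invokes the loader at most once per absent key,
        // serializing concurrent misses instead of stampeding the backend.
        return cache.computeIfAbsent(key, loader);
    }
}
```

Production caches such as Caffeine implement the same per-key load coalescing with eviction and expiry on top; the point of the sketch is only that N simultaneous misses produce one Redis fetch rather than N.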

Monitoring and Redis configuration tuning: regularly observed network traffic and adjusted rate‑limit settings to keep Redis stable.

After remediation, the cache‑lookup logic became:

// Loading get: on a local-cache miss the loader fetches from Redis,
// and concurrent misses for the same key are collapsed into one fetch.
ActivityCache present = activityLocalCache.get(activityDetailCacheKey, key -> getCacheFromRedis(key));
if (present != null) {
    return present;
}

Additional binary‑cache handling code was introduced:

/**
 * Look up the binary (serialized and compressed) cache entry.
 */
private ActivityCache getBinCacheFromJimdb(String activityDetailCacheBinKey) {
    // Read both hash fields in one round trip; index 0 holds the serialized
    // activity, index 1 the available stock. (The "activity" field name is
    // assumed; the original listing only showed the "stock" field.)
    List<byte[]> activityByteList = slaveCluster.hMget(
            activityDetailCacheBinKey.getBytes(), "activity".getBytes(), "stock".getBytes());
    if (activityByteList.get(0) != null && activityByteList.get(0).length > 0) {
        // Decompress, then deserialize the Protostuff payload.
        byte[] decompress = ByteCompressionUtil.decompress(activityByteList.get(0));
        ActivityCache activityCache = ProtostuffUtil.deserialize(decompress, ActivityCache.class);
        if (activityCache != null) {
            // Stock is stored separately as a plain numeric string so it can
            // be updated without rewriting the whole activity payload.
            if (activityByteList.get(1) != null && activityByteList.get(1).length > 0) {
                activityCache.setAvailableStock(Integer.valueOf(new String(activityByteList.get(1))));
            }
            return activityCache;
        }
    }
    return null;
}

Preventive measures were also defined: consider cache strategy during design, conduct thorough performance and stress testing, and regularly optimize and upgrade the system to incorporate newer technologies.

In conclusion, big‑key and hot‑key pitfalls can trigger serious production incidents; proper cache design, size control, and monitoring are vital to maintain system stability and performance.

Tags: Performance Optimization, Backend Development, Redis, Cache Penetration, Big Key, Hot Key
Written by JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.
