How a 1.5 MB Redis Key Crashed an Entire Site and How to Prevent It

In a real‑world incident at JD Tech, a single 1.5 MB Redis cache key caused a full‑site outage during a high‑traffic event. This article explains the two underlying traps — cache breakdown and bandwidth saturation — then details three emergency mitigations and the long‑term preventive practices that followed.

Cognitive Technology Team

1. Incident: Immediate Collapse Due to a Large Cache Key

During a Double‑11 promotion, a 1.5 MB serialized object was written to Redis as a single cache key. When the promotion went live, every service instance started cold: thousands of requests per second bypassed the still‑empty local caches and flooded the same Redis key, dropping core service availability from 100% to 20%.

2. Two Fatal Traps: Cache Breakdown and Bandwidth Saturation

Trap 1: Cache Breakdown

Cache breakdown occurs when a hot key expires or is not pre‑warmed, causing a massive burst of concurrent requests to hit the backing store—in this case Redis—turning it into a single‑point bottleneck.

Trap 2: Bandwidth Saturation

At 1.5 MB per read, the key quickly saturated the Redis shard’s network limit of 200 Mbps (≈25 MB/s). A simple calculation shows an effective throughput of only 16–17 requests per second (25 MB/s ÷ 1.5 MB), far below the actual QPS, leading to:

Redis shard network congestion

Request queuing and timeouts

Redis thread blocking

Whole Redis instance slowdown or denial of service

The localized failure quickly escalated into a full‑site cache avalanche, affecting multiple critical business flows.
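The bandwidth arithmetic above can be checked directly. A minimal back‑of‑envelope sketch (the class and method names are illustrative; the 200 Mbps figure is the shard limit quoted above):

```java
// Back-of-envelope check of the shard math above: a 200 Mbps link can
// serve only ~16 reads per second of a 1.5 MB value.
public class BandwidthMath {
    static long maxReadsPerSecond(long bandwidthBitsPerSec, long valueBytes) {
        long bytesPerSecond = bandwidthBitsPerSec / 8; // 200 Mbps ≈ 25 MB/s
        return bytesPerSecond / valueBytes;
    }

    public static void main(String[] args) {
        // The incident's 1.5 MB key on a 200 Mbps shard
        System.out.println(maxReadsPerSecond(200_000_000L, 1_500_000L)); // 16
    }
}
```

The same formula shows why shrinking the value helps so much: at 17 KB per read, the identical link sustains well over a thousand requests per second.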

3. Emergency Mitigation: Three Quick Fixes

1. Serialization Slimming: JSON → Protostuff

Switching from verbose JSON to Protostuff binary serialization reduced the payload from 1.5 MB to 500 KB, a 66% size reduction.

2. Enable Gzip Compression

Objects larger than a threshold (e.g., 100 KB) are automatically gzipped, shrinking the 500 KB payload to 17 KB and cutting network load by over 98%.

Compression is not a panacea, but for large keys it can be a lifesaver; set reasonable thresholds to avoid unnecessary CPU overhead.
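A minimal sketch of threshold‑based compression using only the JDK’s java.util.zip (the 100 KB threshold and the class/method names are illustrative, not the team’s actual code):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

// Compress cache values only when they exceed a threshold, so small
// values skip the CPU cost of gzip entirely.
public class CacheCompressor {
    static final int THRESHOLD_BYTES = 100 * 1024; // 100 KB; tune per workload

    static byte[] maybeCompress(byte[] value) {
        if (value.length < THRESHOLD_BYTES) {
            return value; // small values pass through untouched
        }
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(value);
            }
            return out.toByteArray();
        } catch (IOException e) { // in-memory streams cannot actually fail
            throw new UncheckedIOException(e);
        }
    }
}
```

A real implementation would also tag compressed values (for example with a magic prefix byte) so the read path knows whether to decompress.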

3. Local Cache Locking to Prevent Penetration

Using Guava Cache’s get(key, callable) ensures that only one thread per instance fetches the missing key from Redis while other threads wait for the result, preventing cache breakdown on that key.
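Guava itself is not shown here; the same single‑flight behavior can be sketched with only the JDK’s ConcurrentHashMap (an analogy to cache.get(key, callable), not the fix actually deployed):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// computeIfAbsent runs the loader at most once per key; concurrent
// callers for the same missing key block until the value appears
// instead of all stampeding Redis.
public class SingleFlightCache<K, V> {
    private final ConcurrentMap<K, V> local = new ConcurrentHashMap<>();

    public V get(K key, Supplier<V> loadFromRedis) {
        return local.computeIfAbsent(key, k -> loadFromRedis.get());
    }
}
```

Unlike this sketch, Guava’s CacheBuilder also provides the TTL and size‑based eviction a production local cache needs; only the locking idea is shown here.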

4. Lessons Learned: Proactive Prevention

Design Guidelines: Prohibit caching whole tables or complex aggregate objects; prefer field‑level splitting and on‑demand loading.

Release Process: Require large‑key + high‑concurrency stress testing for any cache‑related change.

Monitoring & Alerts: Detect single keys >100 KB, sudden QPS spikes, and shard bandwidth usage >80%.

Middleware Governance: Default to efficient serialization (Protostuff/Kryo) and automatic compression strategies.

The team also introduced a “cache health score” that incorporates key size, access frequency, and update rate into release review criteria.
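As one illustration of how such a score might combine those signals — the weights and normalization constants below are invented for the sketch; the article does not disclose the team’s actual formula:

```java
// Illustrative health score on a 0-100 scale: penalize large keys,
// very hot keys, and frequently rewritten keys. All weights and
// thresholds here are assumptions, not JD Tech's real criteria.
public class CacheHealthScore {
    static double score(long keyBytes, double readQps, double updatesPerMinute) {
        double sizePenalty  = Math.min(1.0, keyBytes / (100.0 * 1024)); // 100 KB alert line
        double heatPenalty  = Math.min(1.0, readQps / 10_000.0);
        double churnPenalty = Math.min(1.0, updatesPerMinute / 60.0);
        return 100.0 * (1.0 - 0.5 * sizePenalty - 0.3 * heatPenalty - 0.2 * churnPenalty);
    }
}
```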

5. Conclusion

Cache can dramatically boost performance, but without careful data‑structure design, capacity planning, and concurrency control it becomes a time bomb. No silver‑bullet solution exists; trade‑offs must match the specific scenario.

In the pursuit of extreme performance, we must respect every byte and treat each cache write with caution.
Extended thought: Does your system have “silent big keys”? Run MEMORY USAGE your_key in Redis to find out.
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
