How a 1.5 MB Redis Key Crashed an Entire Site and How to Prevent It
A real‑world incident at JD Tech shows how a single 1.5 MB Redis cache key can take down an entire site during a high‑traffic event. The article explains the underlying cache‑breakdown and bandwidth traps, then details three emergency mitigations and long‑term preventive practices.
1. Incident: Immediate Collapse Due to a Large Cache Key
During a Double‑11 promotion, a 1.5 MB serialized object was written to Redis as a single cache key. When the promotion went live, every service instance started cold: thousands of requests per second bypassed the local caches and flooded that one Redis key, dropping core‑service availability from 100% to 20%.
2. Two Fatal Traps: Cache Breakdown and Bandwidth Saturation
Trap 1: Cache Breakdown
Cache breakdown occurs when a hot key expires or is not pre‑warmed, causing a massive burst of concurrent requests to hit the backing store—in this case Redis—turning it into a single‑point bottleneck.
Trap 2: Bandwidth Saturation
The 1.5 MB key exceeded the Redis shard’s network limit of 200 Mbps (≈25 MB/s). Simple calculation shows an effective throughput of only 16‑17 requests per second (25 MB/s ÷ 1.5 MB), far below the actual QPS, leading to:
Redis shard network congestion
Request queuing and timeouts
Redis thread blocking
Whole Redis instance slowdown or denial of service
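The throughput ceiling follows directly from the numbers in the incident; a back‑of‑the‑envelope sketch:

```java
public class BandwidthMath {
    // Requests/s ceiling for one shard: bandwidth (MB/s) divided by value size (MB).
    static double maxQps(double bandwidthMBps, double valueMB) {
        return bandwidthMBps / valueMB;
    }

    public static void main(String[] args) {
        // 200 Mbps shard limit = 25 MB/s; the incident key was 1.5 MB.
        System.out.printf("Ceiling: ~%.1f requests/s%n", maxQps(200.0 / 8, 1.5));
    }
}
```

At roughly 17 requests per second per shard, even a modest promotion‑day traffic spike saturates the link long before Redis itself runs out of CPU.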
The localized failure quickly escalated into a full‑site cache avalanche, affecting multiple critical business flows.
3. Emergency Mitigation: Three Quick Fixes
1. Serialization Slimming: JSON → Protostuff
Switching from verbose JSON to Protostuff binary serialization reduced the payload from 1.5 MB to 500 KB, a 66% size reduction.
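Protostuff specifics aside, the size advantage of binary over text encoding is easy to demonstrate with JDK‑only tools. The `SkuPrice`‑style fields below are hypothetical, chosen purely for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class EncodingSize {
    // A text (JSON-like) encoding repeats field names in every record.
    static byte[] asJson(long skuId, int priceCents) {
        String json = "{\"skuId\":" + skuId + ",\"priceCents\":" + priceCents + "}";
        return json.getBytes(StandardCharsets.UTF_8);
    }

    // A binary encoding writes only the values: 8 + 4 = 12 bytes per record.
    static byte[] asBinary(long skuId, int priceCents) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        try {
            out.writeLong(skuId);
            out.writeInt(priceCents);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // in-memory stream: should never happen
        }
        return bytes.toByteArray();
    }
}
```

Libraries such as Protostuff achieve this kind of compaction for arbitrary POJOs (via runtime schemas) without hand‑written encoders, which is why the swap alone cut the payload by two thirds.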
2. Enable Gzip Compression
Objects larger than a threshold (e.g., 100 KB) are automatically gzipped, shrinking the 500 KB payload to 17 KB and cutting network load by over 98%.
Compression is not a panacea, but for large keys it can be a lifesaver; set reasonable thresholds to avoid unnecessary CPU overhead.
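A minimal sketch of threshold‑gated compression using only the JDK's `GZIPOutputStream` (the 100 KB threshold mirrors the example above; the class and method names are illustrative):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

public class CacheCompressor {
    // Only compress payloads above 100 KB so small values do not
    // pay CPU cost for negligible savings.
    static final int THRESHOLD_BYTES = 100 * 1024;

    static byte[] maybeCompress(byte[] payload) {
        if (payload.length < THRESHOLD_BYTES) {
            return payload; // small value: store as-is
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(payload);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // in-memory streams should not fail
        }
        return out.toByteArray();
    }
}
```

In practice the writer also needs to mark compressed values (e.g. a one‑byte prefix or a key naming convention) so readers know whether to gunzip.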
3. Local Cache Locking to Prevent Penetration
Using Guava Cache’s get(key, callable) ensures that only one thread loads a missing key from Redis while concurrent callers block and reuse its result, preventing cache breakdown on that key.
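The same single‑flight idea can be sketched with only the JDK: `ConcurrentHashMap.computeIfAbsent` also guarantees the loader runs at most once per missing key, with concurrent callers for that key blocking until the first load completes (Guava adds eviction and expiry on top of this):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class SingleFlightCache<K, V> {
    private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();

    // computeIfAbsent invokes the loader at most once per missing key;
    // other threads asking for the same key wait for that single load,
    // so a cold hot-key causes one backend fetch instead of thousands.
    public V get(K key, Function<? super K, ? extends V> loader) {
        return cache.computeIfAbsent(key, loader);
    }
}
```

Note the loader should stay short: `computeIfAbsent` holds a bin lock while it runs, which is exactly what serializes the stampede.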
4. Lessons Learned: Proactive Prevention
Design Guidelines: Prohibit caching whole tables or complex aggregate objects; prefer field‑level splitting and on‑demand loading.
Release Process: Require large‑key + high‑concurrency stress testing for any cache‑related change.
Monitoring & Alerts: Detect single keys >100 KB, sudden QPS spikes, and shard bandwidth usage >80%.
Middleware Governance: Default to efficient serialization (Protostuff/Kryo) and automatic compression strategies.
The team also introduced a “cache health score” that incorporates key size, access frequency, and update rate into release review criteria.
5. Conclusion
Cache can dramatically boost performance, but without careful data‑structure design, capacity planning, and concurrency control it becomes a time bomb. No silver‑bullet solution exists; trade‑offs must match the specific scenario.
In the pursuit of extreme performance, we must respect every byte and treat each cache write with caution.
Extended thought: Does your system have “silent big keys”? Run MEMORY USAGE your_key in Redis to find out.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Cognitive Technology Team