
Preventing Cache Avalanche, Penetration, and Breakdown in High‑Traffic Systems

This article analyzes a real‑world cloud office system failure caused by cache avalanche, explains the concepts of cache avalanche, penetration, and breakdown, and presents practical solutions such as clustering, rate limiting, random expiration, pre‑warming, Bloom filters, and distributed locking to ensure system stability under heavy load.

Architecture & Thinking

1 Real Case

After an optimization of the real‑time user information query feature was released in a cloud office system, the system crashed: pages simply would not load.

1.1 Background

The original IM feature displayed basic user information (username, nickname, gender, email, phone) when hovering over a user avatar. The data was fetched from Redis; if missing, about 20,000 user records were loaded into Redis in one batch because the user table was small.

The process is illustrated on the left side of the diagram.

Later the feature was expanded to also display education history, work experience, and medals. Because these are stored in separate tables, the query became a complex multi‑table join against large base tables, with correspondingly slow performance.

Loading all users' expanded information into Redis in one batch was no longer feasible due to memory and CPU pressure, so the developers changed the design to cache each user's comprehensive information under its own key, populated on demand, as shown on the right.

1.2 Problem Handling

Although the new approach seemed fine, the system hit a bottleneck and froze the next morning, with database memory and CPU usage spiking.

The immediate mitigation was to roll back to the previous version that only provided basic information, leaving other fields empty on the front end.

Analysis revealed a cache avalanche: at the peak hour (10 am), many users accessed the system for the first time, causing a massive number of cache misses that flooded the database with requests.

Additionally, the uniform 8‑hour expiration time increased the risk of simultaneous cache expiration.

Emergency measures included using Bloom filters, caching empty values, and randomizing cache expiration times to prevent both cache penetration and avalanche.

The final solution reverted to the original full‑company employee cache, optimized the SQL by removing unnecessary fields and joins, and used the slow query log to fine‑tune performance.

2 Cache Avalanche

2.1 Concept

A cache avalanche occurs when many keys share the same expiration time (or the cache layer itself goes down), causing a large fraction of the cache to vanish at once and generating a sudden surge of database requests.

2.2 Solution Analysis

2.2.1 Cache Cluster + Database Cluster

Design the system to anticipate high traffic by deploying high‑availability cache clusters (e.g., Redis master‑slave with Sentinel or Redis Cluster) and database clusters (e.g., primary‑replica or full DB clustering) to withstand the load after cache misses.

2.2.2 Appropriate Rate Limiting and Degradation

Use tools like Hystrix, Alibaba Sentinel, or Guava's RateLimiter to limit traffic and trigger fallback logic when request volume exceeds system capacity (e.g., 5,000 TPS).

Local caches can also act as a buffer when the Redis cluster is unavailable.
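The article names Java‑ecosystem tools; as a language‑neutral illustration of the underlying idea, here is a minimal token‑bucket limiter sketch (the `TokenBucket` class and its parameters are assumptions for illustration, not any of the libraries above):

```python
import time

class TokenBucket:
    """Hypothetical token-bucket rate limiter sketch.

    Tokens refill at a fixed rate; each request consumes one.
    When the bucket is empty, the caller should run fallback logic
    instead of forwarding the request to Redis or the database.
    """
    def __init__(self, rate, capacity):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # over capacity: degrade/fallback instead of querying

# matching the article's example capacity of ~5,000 TPS
limiter = TokenBucket(rate=5000, capacity=5000)
```

Production systems would use one of the battle‑tested libraries above rather than hand‑rolling this, but the accept/reject decision they make is essentially the same.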

2.2.3 Random Expiration Times

Introduce a random offset to each key's TTL so expirations are spread out (e.g., 6 hours + 0‑2 hours random for an original 8‑hour TTL).
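The 6‑hour‑plus‑jitter scheme from the article can be sketched in a few lines (the helper name and the `redis.setex` call in the comment are illustrative, not from the source):

```python
import random

BASE_TTL = 6 * 3600       # 6-hour base, per the article's example
JITTER_MAX = 2 * 3600     # plus a random 0-2 hours

def jittered_ttl():
    """Spread key expirations across a 6-8 hour window
    instead of letting every key expire at the fixed 8-hour mark."""
    return BASE_TTL + random.randint(0, JITTER_MAX)

# with a real client this would be something like:
#   redis.setex(key, jittered_ttl(), value)
```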

2.2.4 Cache Warm‑up

Pre‑populate caches before peak periods (e.g., start warming from 8 am to 10 am for a 10 am peak) to avoid a sudden load spike.
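A minimal warm‑up sketch, assuming a `load_user` loader and a dict standing in for the Redis client (both hypothetical): the point is to load in small batches over the warm‑up window so the database sees a smooth, controlled load rather than a 10 am miss storm.

```python
def warm_cache(user_ids, load_user, cache, batch_size=500):
    """Pre-populate the cache in batches before the traffic peak.

    load_user(uid) stands in for the DB query; cache stands in for
    the Redis client. Real code would pace batches across the
    8-10 am window and set a jittered TTL on each entry.
    """
    for i in range(0, len(user_ids), batch_size):
        for uid in user_ids[i:i + batch_size]:
            cache[uid] = load_user(uid)

cache = {}
warm_cache(list(range(10)), lambda uid: {"id": uid}, cache, batch_size=4)
```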

3 Cache Penetration

3.1 Concept

Cache penetration happens when requests query non‑existent keys, bypassing the cache and hitting the database directly, which can overwhelm the DB under high traffic.

3.2 Solution Analysis

3.2.1 Cache Null Values

Store a short‑lived null placeholder for missing keys so subsequent queries for the same key return quickly without hitting the database; keep the placeholder's TTL short so newly inserted rows are picked up promptly.
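A sketch of the pattern, using a sentinel string and a dict in place of Redis (the sentinel value, function names, and `db_lookup` callback are all assumptions for illustration):

```python
NULL_SENTINEL = "__NULL__"   # placeholder meaning "key known to be absent"

def get_user(key, cache, db_lookup):
    """Read-through lookup that caches negative results.

    With a real client, the sentinel would be written with a short TTL
    (e.g., redis.setex(key, 60, NULL_SENTINEL)) so genuine inserts
    become visible quickly.
    """
    hit = cache.get(key)
    if hit == NULL_SENTINEL:
        return None            # absence answered from cache, DB untouched
    if hit is not None:
        return hit
    row = db_lookup(key)       # cache miss: exactly one DB round trip
    cache[key] = row if row is not None else NULL_SENTINEL
    return row
```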

3.2.2 Bloom Filter

Deploy a Bloom filter to quickly test whether a key possibly exists before querying the cache; if the filter says the key is absent, return a default response without DB access.
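To make the filter's behavior concrete, here is a toy Bloom filter (sizes and the SHA‑256 hashing scheme are arbitrary choices for this sketch; production systems would use Redis modules such as RedisBloom or a library implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into a fixed bit array.

    might_contain may report false positives (harmless: one wasted
    cache/DB lookup) but never false negatives, so a "not present"
    answer can safely short-circuit the request.
    """
    def __init__(self, size=1024, hashes=3):
        self.size = size                      # number of bits
        self.hashes = hashes                  # probes per key
        self.bits = bytearray(size // 8)

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))
```

At startup the filter would be populated with all valid user IDs; any request whose key fails `might_contain` is rejected before touching the cache or database.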

3.2.3 Choosing Between the Two

For a massive number of keys with low request repetition, use a Bloom filter to filter out invalid keys. For a limited set of missing keys with high repetition, cache null values.

4 Cache Breakdown

4.1 Concept

Cache breakdown occurs when a popular key expires and a flood of requests simultaneously miss the cache, overwhelming the database.

4.2 Solutions

4.2.1 Distributed Lock

Before querying the DB for a missing key, acquire a short‑lived distributed lock (e.g., SETNX with an expiry). Only the first request rebuilds the cache; the others wait briefly and then read from the cache once it is repopulated.
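A sketch of the lock‑guarded rebuild, with a tiny in‑memory stand‑in for Redis (the `FakeRedis` class, function names, and `db_lookup` callback are illustrative; real code would use `SET key token NX PX ttl` and a Lua script for the owner‑checked release):

```python
import uuid

class FakeRedis:
    """In-memory stand-in exposing just the SETNX / owner-checked DEL
    semantics this pattern needs."""
    def __init__(self):
        self.store = {}

    def set_nx(self, key, value):
        if key in self.store:
            return False
        self.store[key] = value
        return True

    def delete_if_owner(self, key, value):
        # real Redis does this atomically via a Lua script
        if self.store.get(key) == value:
            del self.store[key]

def get_hot_key(key, cache, redis, db_lookup):
    """On a hot-key miss, only the lock winner queries the DB."""
    val = cache.get(key)
    if val is not None:
        return val
    token = str(uuid.uuid4())          # unique token so we only release our own lock
    if redis.set_nx(f"lock:{key}", token):
        try:
            val = db_lookup(key)       # single DB hit rebuilds the cache
            cache[key] = val
        finally:
            redis.delete_if_owner(f"lock:{key}", token)
        return val
    # losers re-read the cache (or back off and retry) instead of hitting the DB
    return cache.get(key)
```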

4.2.2 Empty Initial Value

Return a temporary empty or default value while the first request populates the cache, sacrificing accuracy for overall system stability.

5 Other Approaches

5.1 Cache Partitioning and Database Sharding

Apply divide‑and‑conquer: partition caches and shard databases to improve scalability as user volume and data size grow.

5.2 Degradation and Circuit Breaking

Implement fallback strategies and circuit breakers to prevent cascading failures during traffic spikes.

Tags: caching, backend performance, Cache Avalanche, Cache Breakdown, Cache Penetration
Written by Architecture & Thinking

🍭 Frontline tech director and chief architect at top-tier companies 🥝 Years of deep experience in internet, e‑commerce, social, and finance sectors 🌾 Committed to publishing high‑quality articles covering core technologies of leading internet firms, application architecture, and AI breakthroughs.
