Redis Cache Avalanche, Penetration, and Breakdown: The Three Must‑Know Issues for Interviews

This article explains the three classic Redis cache problems—avalanche, penetration, and breakdown—detailing their definitions, typical symptoms, step‑by‑step troubleshooting procedures, root‑cause analysis, and practical mitigation strategies such as random expiration, empty‑value caching, Bloom filters, distributed locks, and multi‑level cache architectures.

Ops Community

Cache Basics

Read flow: request → query Redis; on a hit, return the cached data; on a miss, query the database and write the result back to Redis. Write flow: write to the DB, then update or delete the Redis entry.
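The read and write flows above are commonly called the cache-aside pattern. A minimal sketch, using plain dicts as stand-ins for Redis and the database (the class and method names are hypothetical, not from any library):

```python
import json
from typing import Any, Dict, Optional


class CacheAside:
    """Cache-aside over two dict-like stores (stand-ins for Redis and a DB)."""

    def __init__(self, cache: Dict[str, str], db: Dict[str, Any]) -> None:
        self.cache = cache
        self.db = db

    def read(self, key: str) -> Optional[Any]:
        cached = self.cache.get(key)
        if cached is not None:               # hit: return the cached data
            return json.loads(cached)
        value = self.db.get(key)             # miss: query the database
        if value is not None:
            self.cache[key] = json.dumps(value)  # write back to the cache
        return value

    def write(self, key: str, value: Any) -> None:
        self.db[key] = value                 # write to the DB first
        self.cache.pop(key, None)            # then delete the cached copy
```

Deleting the cached entry on write, rather than updating it, lets the next read repopulate the cache from the database.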

Common eviction policies: volatile-lru, allkeys-lru, volatile-ttl, allkeys-random, etc. Production environments typically use volatile-lru or allkeys-lru to keep hot data while evicting less‑used keys.

Cache Avalanche

Definition

A cache avalanche occurs when a large number of keys expire simultaneously or the Redis service itself fails, causing a sudden surge of database queries that can overload the DB.

Typical Scenarios

Uniform expiration time (e.g., all product detail caches set to 2 h and expire at the same hour).

Redis node crash, memory OOM, or network partition.

Symptoms

# Example monitoring output
- Redis hit rate drops from 99% to <20%
- MySQL QPS spikes from 500 to >20000
- Response time rises from 50 ms to >5 s

Investigation Steps

Check Redis hit rate:

redis-cli info stats | grep -E "keyspace_hits|keyspace_misses"
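Both counters are cumulative since server start, so a ratio over a recent window is more telling than the lifetime ratio. A small sketch, assuming two snapshots of the stats section taken some seconds apart (the function name is hypothetical):

```python
from typing import Mapping, Optional


def hit_rate(before: Mapping[str, int], after: Mapping[str, int]) -> Optional[float]:
    """Hit rate between two snapshots of keyspace_hits / keyspace_misses."""
    hits = after["keyspace_hits"] - before["keyspace_hits"]
    misses = after["keyspace_misses"] - before["keyspace_misses"]
    total = hits + misses
    if total <= 0:
        return None  # no lookups in the window (or the counters were reset)
    return hits / total
```

With redis-py, each snapshot could come from redis_client.info("stats"), which returns a dict containing both fields.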

Inspect key expiration distribution:

redis-cli info keyspace
# Sample 20 random keys and print their TTLs (scan once, then sample)
redis-cli --scan --pattern "product:*" | shuf -n 20 | while read -r key; do
  ttl=$(redis-cli ttl "$key")
  echo "Key: $key, TTL: $ttl seconds"
done

Monitor MySQL pressure (QPS, processlist, slow queries).

Verify Redis service status (process, port, logs, Sentinel/Cluster info).

Root Causes

Uniform expiration ("whole‑hour" expiry).

Redis node failure (crash, OOM, network split).

Insufficient cache warm‑up after service restart.

Hot data set expiring together.

Mitigation Strategies

Randomize TTL

import random

def set_cache(key, value, base_expire=7200):
    # Add 0-600 s of random jitter so keys written together do not expire together
    expire = base_expire + random.randint(0, 600)
    redis_client.setex(key, expire, value)
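To see why jitter helps, the sketch below counts how many keys share the same expiry second when a batch is written at one instant (for example during warm-up). The function name and the batch size are illustrative assumptions:

```python
import random
from collections import Counter


def peak_expirations(n_keys: int, base_ttl: int, jitter: int, seed: int = 0) -> int:
    """Largest number of keys that expire in the same second."""
    rng = random.Random(seed)
    expiry = Counter(base_ttl + rng.randint(0, jitter) for _ in range(n_keys))
    return max(expiry.values())
```

Without jitter (jitter=0) every key in the batch expires in the same second; with 600 s of jitter the same batch is spread across roughly 600 distinct seconds, so the database sees a trickle of reloads instead of one spike.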

Hot keys never expire + async refresh

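def get_user_info(user_id):
    # Assumes redis_client was created with decode_responses=True (str values)
    cache_key = f"user:v1:{user_id}"
    cached = redis_client.get(cache_key)
    if cached == "NULL":  # cached empty result; json.loads("NULL") would raise
        return None
    if cached:
        return json.loads(cached)
    user = db.query("SELECT * FROM users WHERE id=%s", user_id)
    if not user:
        redis_client.setex(cache_key, 300, "NULL")  # cache empty result for 5 min
        return None
    redis_client.set(cache_key, json.dumps(user))  # no TTL; refresh happens out of band
    return user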

Multi‑level cache (Nginx local → Redis → DB)

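# Nginx local cache snippet
# http context: define the cache zone
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=local_cache:10m max_size=100m inactive=60s;
# server or location context: enable the zone and set key and validity
proxy_cache local_cache;
proxy_cache_key "$host$request_uri";
proxy_cache_valid 200 30s;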

Redis high‑availability (Sentinel or Cluster)

# /etc/redis/sentinel.conf (example)
port 26379
sentinel monitor mymaster 192.168.1.10 6379 2
sentinel down-after-milliseconds mymaster 30000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 180000

Circuit breaker & rate limiting

# Simple Python rate‑limit decorator (implementation omitted)
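The source omits the decorator. One possible shape is a token bucket, sketched here under the assumption that a per-process limit is sufficient (a multi-instance deployment would need a shared counter, for example in Redis). The names rate_limit and RateLimitExceeded are hypothetical:

```python
import functools
import threading
import time
from typing import Any, Callable


class RateLimitExceeded(Exception):
    """Raised when the wrapped function is called faster than allowed."""


def rate_limit(rate: float, burst: int,
               clock: Callable[[], float] = time.monotonic) -> Callable:
    """Token bucket: refill `rate` tokens per second, hold at most `burst`."""
    lock = threading.Lock()
    state = {"tokens": float(burst), "last": clock()}

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            with lock:
                now = clock()
                elapsed = now - state["last"]
                state["last"] = now
                # Refill according to elapsed time, capped at the burst size
                state["tokens"] = min(float(burst), state["tokens"] + elapsed * rate)
                if state["tokens"] < 1:
                    raise RateLimitExceeded(func.__name__)
                state["tokens"] -= 1
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Applied to the database fallback path, for example @rate_limit(rate=100, burst=200) on the function that queries MySQL, this caps how many cache misses can reach the database per second; callers would catch RateLimitExceeded and return a degraded response.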

Cache Penetration

Definition

Cache penetration happens when requests query data that exists in neither the cache nor the DB, so every such request falls through to the database.

Typical Scenarios

Malicious attacks that enumerate non‑existent IDs.

Business logic that allows invalid query parameters.

Crawlers probing missing pages or products.

Symptoms

# Monitoring output
- Cache hit rate for legitimate keys stays normal, while overall misses climb (lookups for non-existent keys always miss)
- MySQL QPS abnormally high
- Lots of "record not found" logs
- Requests contain abnormal parameters (e.g., id=-1, very long strings)

Investigation Steps

Enable the MySQL general log briefly and look for repeated lookups of IDs that do not exist (the application side then logs "record not found").

Analyze Nginx access logs for abnormal patterns (invalid IDs, unusual User‑Agent, IP distribution).

Check Redis for cached null values:

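# --pattern matches key names, not values, so sample keys and compare each value with the "NULL" marker
redis-cli --scan --pattern "user:*" | head -200 | while read -r k; do [ "$(redis-cli get "$k")" = "NULL" ] && echo "$k"; done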

Root Causes

Business design that permits queries for non‑existent data without caching empty results.

Deliberate malicious bulk requests.

Crawlers indexing non‑existent resources.

Stale cache after data deletion.

Mitigation Strategies

Cache empty results

def get_user_info(user_id):
    # Assumes redis_client uses decode_responses=True, so values compare equal to "NULL"
    cache_key = f"user:{user_id}"
    cached = redis_client.get(cache_key)
    if cached == "NULL":
        return None
    if cached:
        return json.loads(cached)
    user = db.query("SELECT * FROM users WHERE id=%s", user_id)
    if not user:
        redis_client.setex(cache_key, 300, "NULL")  # 5 min
        return None
    redis_client.setex(cache_key, 3600, json.dumps(user))
    return user

Bloom filter pre‑check

# Python example (requires pybloom-live)
from pybloom_live import BloomFilter
bf = BloomFilter(capacity=1_000_000, error_rate=0.01)
# The filter must be pre-loaded with every valid key (at startup and on each insert);
# otherwise every lookup is rejected as "definitely absent".

def get_with_filter(key, db_query_func):
    if key not in bf:
        return None  # definitely absent
    cached = redis_client.get(key)
    if cached:
        return json.loads(cached) if cached != "NULL" else None
    result = db_query_func(key)
    if result:
        redis_client.setex(key, 3600, json.dumps(result))
    else:
        # Bloom false positive: cache the empty result, but do not add the key to the filter
        redis_client.setex(key, 300, "NULL")
    return result

// Java example using Guava BloomFilter
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
BloomFilter<String> bloom = BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);
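For a sense of what capacity=1_000_000 and error_rate=0.01 cost in memory, the standard Bloom filter sizing formulas can be evaluated directly. This is a sketch of the textbook formulas; actual libraries may round or partition the bit array differently, and the function name is hypothetical:

```python
import math
from typing import Tuple


def bloom_parameters(capacity: int, error_rate: float) -> Tuple[int, int]:
    """Return (bits, hash_count) for a classic Bloom filter.

    bits   m = -n * ln(p) / (ln 2)^2
    hashes k = (m / n) * ln 2
    """
    bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / capacity * math.log(2)))
    return bits, hashes
```

For one million keys at a 1% false-positive rate this gives about 9.6 million bits (roughly 1.2 MB) and 7 hash functions, which is small compared with caching a "NULL" entry per invalid key.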

Parameter validation & rate limiting

# Simple Python rate‑limit decorator (implementation omitted)
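Parameter validation can reject requests such as id=-1 or very long strings before they touch the cache or the database. A minimal sketch, assuming IDs are positive integers of at most 18 digits (that format and the function name are assumptions for illustration):

```python
import re
from typing import Optional

# Assumed ID format: positive integer, no leading zero, at most 18 digits
_ID_PATTERN = re.compile(r"[1-9][0-9]{0,17}")


def parse_user_id(raw: object) -> Optional[int]:
    """Return the ID as an int, or None if `raw` is not a plausible ID."""
    if not isinstance(raw, str) or not _ID_PATTERN.fullmatch(raw):
        return None
    return int(raw)
```

A handler would call parse_user_id on the raw query parameter and return a 400 response when it yields None, so invalid IDs never produce a cache miss.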

IP blacklist (Nginx)

# Nginx geo block example
geo $bad_ip {
    default 0;
    192.168.1.100 1;
    10.0.0.0/8 0;
}
if ($bad_ip) { return 403; }

Cache Breakdown

Definition

Cache breakdown (or cache stampede) occurs when a single hot key expires or is missing, and a flood of concurrent requests miss the cache and query the database simultaneously.

Difference from Avalanche

Avalanche: many keys expire or Redis fails; the impact is system-wide.

Breakdown: one hot key expires; the impact is limited to that key's traffic.

Symptoms

# Monitoring output for a hot key
- QPS for the key jumps from 100 to 10 000
- Redis hit rate for that key drops sharply
- DB CPU spikes on the related table
- Overall response time degrades

Investigation Steps

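Identify the hot key: redis-cli --hotkeys (requires an LFU maxmemory-policy), a short MONITOR sample (expensive, so keep it brief in production), or application-side access counters. The slowlog shows slow commands rather than access frequency, so it is only an indirect hint.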

Check its TTL distribution:

redis-cli --scan --pattern "product:*" | head -20 | while read -r key; do
  ttl=$(redis-cli ttl "$key")
  echo "$key: $ttl seconds"
done

Confirm hotspot status (e.g., >1000 accesses in 10 s):

def is_hot_key(key):
    count_key = f"hot:count:{key}"
    cur = redis_client.incr(count_key)
    if cur == 1:
        redis_client.expire(count_key, 10)
    return cur > 1000

Root Causes

Hot key expiration (flash‑sale items, popular product details).

First access after service restart or Redis failover.

Data model that concentrates traffic on a single key.

Mitigation Strategies

Hot key never expires + async refresh

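import threading
import time

class HotCache:
    # Entries carry a logical expiry: a stale value is still served while a background thread refreshes it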
    def __init__(self):
        self.cache = {}
        self.lock = threading.Lock()
        self.refreshing = set()

    def get(self, key):
        if key in self.cache:
            value, expire = self.cache[key]
            if time.time() > expire - 60:  # refresh 60 s before expiry
                self._async_refresh(key)
            return value
        return None

    def _async_refresh(self, key):
        with self.lock:
            if key in self.refreshing:
                return
            self.refreshing.add(key)
        def refresh():
            try:
                data = db.query("SELECT * FROM products WHERE id=%s", key)
                with self.lock:
                    self.cache[key] = (data, time.time() + 3600)
            finally:
                with self.lock:
                    self.refreshing.discard(key)
        t = threading.Thread(target=refresh, daemon=True)
        t.start()

Distributed lock (SETNX / RedLock)

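import json
import time
import uuid

# Lua script: delete the lock only if it still holds our token
RELEASE_LOCK = "if redis.call('get', KEYS[1]) == ARGV[1] then return redis.call('del', KEYS[1]) else return 0 end"

def get_with_lock(key, db_query):
    cached = redis_client.get(f"cache:{key}")
    if cached:
        return json.loads(cached)
    token = uuid.uuid4().hex  # unique value identifying this lock holder
    # Acquire lock (SET NX with a 5 s expiry)
    if redis_client.set(f"lock:{key}", token, nx=True, ex=5):
        try:
            data = db_query(key)
            redis_client.setex(f"cache:{key}", 3600, json.dumps(data))
            return data
        finally:
            # Safe release via the Lua script
            redis_client.eval(RELEASE_LOCK, 1, f"lock:{key}", token)
    # Lock held by another request: wait briefly, then retry
    time.sleep(0.1)
    return get_with_lock(key, db_query)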

Redis SETNX mutex

def get_with_mutex(key, db_query_func):
    cache_key = f"cache:{key}"
    lock_key = f"lock:{key}"
    result = redis_client.get(cache_key)
    if result:
        return json.loads(result)
    lock_token = uuid.uuid4().hex
    lock_acquired = redis_client.set(lock_key, lock_token, nx=True, ex=10)
    if lock_acquired:
        try:
            # double-check cache after acquiring lock
            result = redis_client.get(cache_key)
            if result:
                return json.loads(result)
            data = db_query_func(key)
            redis_client.setex(cache_key, 3600, json.dumps(data))
            return data
        finally:
            # Release only our own lock. GET followed by DEL is not atomic; a Lua script is safer.
            # Assumes decode_responses=True so the stored token compares equal.
            if redis_client.get(lock_key) == lock_token:
                redis_client.delete(lock_key)
    else:
        # wait and retry a few times
        for _ in range(3):
            time.sleep(0.05)
            result = redis_client.get(cache_key)
            if result:
                return json.loads(result)
        return None
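Within a single process, the same idea (one loader per key, everyone else waits and re-checks) can be shown without Redis. This is a complementary in-process sketch, not a replacement for the Redis lock across instances; the class name is hypothetical:

```python
import threading
from typing import Any, Callable, Dict


class SingleFlightCache:
    """In-process cache where concurrent misses on one key trigger one load."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects the per-key lock table

    def _lock_for(self, key: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key: str, loader: Callable[[str], Any]) -> Any:
        if key in self._data:            # fast path: cache hit
            return self._data[key]
        with self._lock_for(key):        # only one thread per key proceeds
            if key in self._data:        # double-check after acquiring the lock
                return self._data[key]
            value = loader(key)          # the single database query
            self._data[key] = value
            return value
```

If twenty threads request the same missing key at once, the loader runs once and the other nineteen read the freshly cached value after the lock is released.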

Hot‑key pre‑warm (periodic preload)

def preload_hot_cache():
    hot_products = db.query("""
        SELECT * FROM products
        WHERE status='active' AND view_count>1000
        ORDER BY view_count DESC
        LIMIT 100
    """)
    for p in hot_products:
        key = f"product:{p['id']}"
        redis_client.setex(key, 7200, json.dumps(p))
    redis_client.setex("cache:preload:hot_products", 3600, str(time.time()))
    return len(hot_products)

Local LRU cache as fallback

import collections
import threading
import time

class LocalCache:
    def __init__(self, maxsize=10000, ttl=60):
        self.store = collections.OrderedDict()
        self.maxsize = maxsize
        self.ttl = ttl
        self.lock = threading.Lock()

    def get(self, key):
        with self.lock:
            if key in self.store:
                val, exp = self.store[key]
                if time.time() < exp:
                    self.store.move_to_end(key)
                    return val
                del self.store[key]
            return None

    def set(self, key, val):
        with self.lock:
            self.store[key] = (val, time.time() + self.ttl)
            self.store.move_to_end(key)
            while len(self.store) > self.maxsize:
                self.store.popitem(last=False)

local_cache = LocalCache()

def get_with_local_cache(key, db_query_func):
    val = local_cache.get(key)
    if val:
        return val
    cache_key = f"cache:{key}"
    cached = redis_client.get(cache_key)
    if cached:
        val = json.loads(cached)
        local_cache.set(key, val)
        return val
    val = db_query_func(key)
    if val:
        redis_client.setex(cache_key, 3600, json.dumps(val))
        local_cache.set(key, val)
    return val

Comparative Overview

Problem nature: avalanche (many keys expire together); penetration (queries for non-existent data); breakdown (a single hot key expires).

Impact scope: avalanche (massive request surge); penetration (continuous load from invalid requests); breakdown (burst on the hot key).

Database pressure: avalanche (instant overload); penetration (sustained overload); breakdown (instant overload for that key).

Typical scenarios: avalanche (whole-hour expiry, Redis failure); penetration (malicious attacks, crawlers); breakdown (flash-sale item expiry, hot-key miss).

Core solutions: avalanche (random TTL, HA, multi-level cache); penetration (null-value caching, Bloom filter, rate limiting); breakdown (distributed lock, hot key never expires, local cache).

Production Best Practices

Cache Architecture (L1 → L2 → DB)

             +--------------------------+
             |       User Request       |
             +------------+-------------+
                          |
                          v
             +--------------------------+
             | Nginx (rate-limit, auth) |
             +------------+-------------+
                          |
                          v
             +--------------------------+
             |  API Gateway (routing)   |
             +------------+-------------+
                          |
           +--------------+--------------+
           |              |              |
           v              v              v
      +----------+   +----------+   +----------+
      | Service1 |   | Service2 |   | Service3 |
      +----------+   +----------+   +----------+
           |              |              |
           +--------------+--------------+
                          |
                          v
             +--------------------------+
             |     Local Cache (L1)     |
             |    (Guava / Caffeine)    |
             +------------+-------------+
                          |
                          v
             +--------------------------+
             |     Redis Cache (L2)     |
             +------------+-------------+
                          |
                          v
             +--------------------------+
             |      MySQL Database      |
             +--------------------------+

Redis Configuration (/etc/redis/redis.conf)

# Memory limit
maxmemory 2gb
maxmemory-policy volatile-lru

# Persistence (RDB + AOF recommended)
save 900 1
save 300 10
save 60 10000
appendonly yes
appendfsync everysec

# Connection settings
timeout 300
tcp-keepalive 60
maxclients 10000

# Slow‑query log
slowlog-log-slower-than 10000
slowlog-max-len 128

Application‑level Cache (Spring Boot example)

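spring:
  redis:                 # Spring Boot 2.x; in Spring Boot 3.x these keys live under spring.data.redis
    host: 192.168.1.10
    port: 6379
    password: yourpassword
    timeout: 3000ms
    lettuce:
      pool:
        max-active: 200
        max-idle: 50
        min-idle: 10
        max-wait: 1000ms
  cache:
    type: redis
    redis:
      time-to-live: 300000       # 5 min, in milliseconds
      cache-null-values: false   # set to true if empty results should be cached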
Monitoring & Alerting (Prometheus rules)

groups:
- name: redis_alerts
  interval: 30s
  rules:
  - alert: RedisHighConnections
    expr: redis_connected_clients > 10000
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Redis connections too high"
      description: "Redis instance {{ $labels.instance }} has {{ $value }} connections (>10000)"
  - alert: LowCacheHitRate
    expr: (rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))) < 0.8
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Cache hit rate low"
      description: "Cache hit rate {{ $value | humanizePercentage }} below 80%"
  - alert: RedisMemoryUsage
    expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.85
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Redis memory usage high"
      description: "Redis instance {{ $labels.instance }} uses >85% of allocated memory"
  - alert: RedisReplicationDown
    expr: redis_master_link_status == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Redis replication down"
      description: "Redis instance {{ $labels.instance }} master‑slave replication broken"

Grafana Dashboard (key metrics)

Cache hit rate:

rate(redis_keyspace_hits_total[1m]) / (rate(redis_keyspace_hits_total[1m]) + rate(redis_keyspace_misses_total[1m])) * 100

Memory usage:

redis_memory_used_bytes / redis_memory_max_bytes * 100

Commands per second:

rate(redis_commands_total[1m])

Connected clients:

redis_connected_clients

Conclusion

Cache avalanche, penetration, and breakdown are the three most common Redis pitfalls. Understanding their distinct causes and symptoms enables targeted investigation:

Avalanche – prevent with random TTL, high‑availability, and multi‑level caching.

Penetration – mitigate by caching empty results, using Bloom filters, and applying request validation/rate limiting.

Breakdown – control concurrency with distributed locks, keep hot keys alive, and employ local fallbacks.

Adopt a layered cache architecture, instrument comprehensive metrics, and maintain runbooks for rapid diagnosis. By treating cache reliability as part of overall system availability, Redis can deliver its performance benefits without becoming a single point of failure.
