
Redis Cache Pitfalls: Penetration, Avalanche, Breakdown – Solutions & Real Cases

This article examines the three classic Redis caching problems—cache penetration, cache avalanche, and cache breakdown—illustrates real‑world incidents that caused system outages, and provides comprehensive mitigation techniques such as Bloom filters, null‑value caching, random expiration, multi‑level caches, logical expiration, and distributed locks, along with monitoring and disaster‑recovery practices.

MaGe Linux Operations

Redis Cache Penetration, Avalanche, Breakdown: Solutions and Practice

As a senior operations engineer, I have dealt with countless Redis‑related incidents in production. Today I will share three classic problems that often wake ops engineers in the middle of the night, along with their solutions.

Preface: Those Midnight Calls

At 3 AM the phone rings: "System down! Users cannot log in! Database CPU is at 100%!" Every ops engineer knows this scenario. In my seven years in operations, 80% of such incidents have involved one of the three classic Redis cache problems: cache penetration, cache avalanche, and cache breakdown.

1. Cache Penetration: Nightmare of Malicious Attacks

Problem Phenomenon

Clients repeatedly request data that exists in neither the cache nor the database. Every such request misses the cache and falls through to the database, causing a sudden surge in database load.

Real Case Review

An e‑commerce platform suffered a malicious attack where the attacker generated random product IDs and queried product information. Since these IDs do not exist, the Redis cache missed every time and each request hit MySQL, exhausting the database connection pool.

Monitoring data:

Database QPS: from the usual 500/s to 8,000/s

Cache hit rate: dropped from 95% to 10%

System response time: rose from 50 ms to 5,000 ms

Solution Details

Solution 1: Bloom Filter (★★★★★)

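The Bloom filter is the most elegant solution for cache penetration. Its core idea is "better to let a few nonexistent keys through than to reject a real one": a negative answer is always correct, so requests for IDs that were never added can be rejected before they reach the cache or the database, while a positive answer may occasionally be a false positive.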

Implementation Steps:

import json
import math

import mmh3
import redis
from bitarray import bitarray

class BloomFilter:
    def __init__(self, capacity=1000000, error_rate=0.001):
        """Initialize Bloom filter
        capacity: expected number of items
        error_rate: false positive rate
        """
        self.capacity = capacity
        self.error_rate = error_rate
        self.bit_num = self._get_bit_num()
        self.hash_num = self._get_hash_num()
        self.bit_array = bitarray(self.bit_num)
        self.bit_array.setall(0)
        self.redis_client = redis.Redis(host='localhost', port=6379, db=0)

    def _get_bit_num(self):
        """Calculate bit array size"""
        return int(-self.capacity * math.log(self.error_rate) / (math.log(2) ** 2))

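    def _get_hash_num(self):
        """Calculate number of hash functions (at least one)"""
        return max(1, round(self.bit_num * math.log(2) / self.capacity))

    def _hash(self, value):
        """Derive hash_num bit positions via double hashing"""
        data = str(value)  # mmh3 expects str or bytes, so integer IDs are converted
        h1 = mmh3.hash(data, 0, signed=False)
        h2 = mmh3.hash(data, 1, signed=False) | 1  # odd step, so it is never zero
        for i in range(self.hash_num):
            yield (h1 + i * h2) % self.bit_num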

    def add(self, value):
        """Add element"""
        for index in self._hash(value):
            self.bit_array[index] = 1

    def is_exist(self, value):
        """Check if element exists"""
        for index in self._hash(value):
            if not self.bit_array[index]:
                return False
        return True

# Business layer usage
def get_product_info(product_id):
    # First check Bloom filter
    if not bloom_filter.is_exist(product_id):
        return {"error": "Product does not exist"}
    # Query cache
    cache_key = f"product:{product_id}"
    cached_data = redis_client.get(cache_key)
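    if cached_data:
        product = json.loads(cached_data)
        # An empty object is the cached "not found" marker written below
        return product if product else {"error": "Product does not exist"}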
    # Query database
    product = database.query_product(product_id)
    if product:
        redis_client.setex(cache_key, 3600, json.dumps(product))
        return product
    else:
        # Cache empty value to prevent repeated queries
        redis_client.setex(cache_key, 300, json.dumps({}))
        return {"error": "Product does not exist"}

Ops Deployment Recommendations:

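Store the Bloom filter bits in Redis rather than in process memory so that all application instances share one filter; this also works with cluster deployments

Rebuild the Bloom filter periodically, since deleted items cannot be removed from it and the false‑positive rate grows as it fills

Monitor Bloom filter capacity usage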
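To make the capacity-monitoring advice concrete, here is a minimal sketch that applies the same sizing formulas as `_get_bit_num` and `_get_hash_num` above and estimates how the false-positive rate degrades once the filter holds more items than planned. The function names `bloom_size` and `expected_fp_rate` are illustrative and not part of any library.

```python
import math

def bloom_size(capacity: int, error_rate: float) -> tuple[int, int]:
    """Return (bits, hash_count) for the target capacity and false-positive rate."""
    bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / capacity * math.log(2)))
    return bits, hashes

def expected_fp_rate(bits: int, hashes: int, inserted: int) -> float:
    """Approximate false-positive rate after `inserted` items: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-hashes * inserted / bits)) ** hashes

bits, hashes = bloom_size(1_000_000, 0.001)
print(f"bits={bits} (~{bits / 8 / 1024 / 1024:.2f} MiB), hashes={hashes}")
for n in (500_000, 1_000_000, 2_000_000):
    print(f"{n:>9} items -> FP rate ~ {expected_fp_rate(bits, hashes, n):.4%}")
```

With the defaults from the article the filter needs roughly 1.7 MiB. At twice the planned item count the estimated false-positive rate climbs from about 0.1% to several percent, which is a reasonable signal to rebuild with a larger capacity.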

Solution 2: Null‑Value Cache

A simple but effective approach is to also cache keys whose query result is empty.

def query_with_null_cache(key):
    # 1. Query cache
    cached_data = redis_client.get(f"cache:{key}")
    if cached_data is not None:
        # json.loads turns the cached "null" marker back into None
        return json.loads(cached_data)
    # 2. Query database
    data = database.query(key)
    # 3. Cache result (including null)
    if data:
        redis_client.setex(f"cache:{key}", 3600, json.dumps(data))
    else:
        redis_client.setex(f"cache:{key}", 300, "null")
    return data

Notes:

Null‑value cache TTL should be shorter than normal data

Consider storage cost

Implement cleanup mechanisms to prevent garbage accumulation

2. Cache Avalanche: System Collapse Trigger

Problem Phenomenon

When a large number of cache entries expire at the same time, the requests they were absorbing hit the database all at once, causing excessive pressure and possibly a crash.

Bloody Lessons

During a promotion, a large batch of cache entries in a financial system expired at once; over 100,000 user queries hit the database simultaneously, bringing the transaction system down for 45 minutes and causing losses exceeding 5 million.

Solution

Solution 1: Random Expiration Time

import json
import logging
import random
import time

logger = logging.getLogger(__name__)

def set_cache_with_random_expire(key, data, base_expire=3600):
    """Set cache with random expiration to avoid avalanche
    base_expire: base expiration in seconds
    """
    random_factor = random.uniform(0.8, 1.2)
    expire_time = int(base_expire * random_factor)
    redis_client.setex(key, expire_time, json.dumps(data))
    logger.info(f"Cache set: {key}, expire: {expire_time}s")

def batch_warm_up_cache(data_list):
    """Batch warm‑up cache to avoid simultaneous expiration"""
    for data in data_list:
        key = f"product:{data['id']}"
        set_cache_with_random_expire(key, data, 3600)
        time.sleep(0.01)
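A quick simulation shows why the ±20% jitter matters. The sketch assumes 10,000 keys are all written at the same instant (for example during a warm-up) and compares how many of them expire within the same second with and without jitter. The key count and the fixed seed are arbitrary choices for the illustration.

```python
import random
from collections import Counter

def peak_expirations_per_second(ttls):
    """Largest number of keys that expire within the same second."""
    return max(Counter(ttls).values())

random.seed(42)  # fixed seed so the run is repeatable
key_count = 10_000
base_expire = 3600

fixed = [base_expire] * key_count
jittered = [int(base_expire * random.uniform(0.8, 1.2)) for _ in range(key_count)]

print("fixed TTL peak:   ", peak_expirations_per_second(fixed))
print("jittered TTL peak:", peak_expirations_per_second(jittered))
```

With a fixed TTL every key expires in the same second. With jitter the expirations are spread over a window of about 24 minutes (2,880 to 4,320 seconds), so the database sees a trickle of rebuilds instead of one spike.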

Solution 2: Multi‑Level Cache Architecture

import json
from collections import Counter

import memcache  # python-memcached client
import redis

class MultiLevelCache:
    def __init__(self):
        self.l1_cache = {}  # Local cache (unbounded here; cap size and TTL in production)
        self.l2_cache = redis.Redis()  # Redis cache
        self.l3_cache = memcache.Client(['127.0.0.1:11211'])  # Memcached cache
        self.metrics = Counter()  # Hit/miss counters; swap in your metrics client

    def get(self, key):
        # L1 hit
        if key in self.l1_cache:
            self.metrics['l1_hit'] += 1
            return self.l1_cache[key]
        # L2 hit: promote to L1
        l2_data = self.l2_cache.get(key)
        if l2_data:
            self.metrics['l2_hit'] += 1
            self.l1_cache[key] = json.loads(l2_data)
            return self.l1_cache[key]
        # L3 hit: promote to L1 and L2
        l3_data = self.l3_cache.get(key)
        if l3_data:
            self.metrics['l3_hit'] += 1
            self.l1_cache[key] = l3_data
            self.l2_cache.setex(key, 3600, json.dumps(l3_data))
            return l3_data
        # Miss
        self.metrics['cache_miss'] += 1
        return None

    def set(self, key, value, expire=3600):
        # Write to all layers
        self.l1_cache[key] = value
        self.l2_cache.setex(key, expire, json.dumps(value))
        self.l3_cache.set(key, value, time=expire)
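The `l1_cache` dict above never evicts or expires anything, so in a long-running process it grows without bound and can serve stale data indefinitely. One possible replacement is a small size-capped cache with a per-entry TTL. The class below is a sketch under that assumption, not the author's implementation; `LocalTTLCache` and its parameters are names chosen for this example, and it is not thread-safe as written.

```python
import time
from collections import OrderedDict

class LocalTTLCache:
    """Small in-process cache with a size cap (LRU eviction) and per-entry TTL."""

    def __init__(self, max_items=1024, ttl_seconds=30, clock=time.monotonic):
        self.max_items = max_items
        self.ttl_seconds = ttl_seconds
        self.clock = clock            # injectable so the behaviour can be tested
        self._items = OrderedDict()   # key -> (value, expires_at)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._items[key]      # expired: drop it and report a miss
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._items[key] = (value, self.clock() + self.ttl_seconds)
        self._items.move_to_end(key)
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)  # evict the least recently used entry

    def __len__(self):
        return len(self._items)
```

A short L1 TTL (seconds rather than hours) keeps the local copy close to what Redis holds while still absorbing most reads for hot keys.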

Solution 3: Mutex Lock Rebuild Cache

import json
import random
import threading
import time
from contextlib import contextmanager

class CacheRebuildManager:
    """In-process rebuild guard; it does not coordinate across processes or hosts"""

    def __init__(self):
        self.rebuilding_keys = set()
        self.lock = threading.Lock()

    @contextmanager
    def rebuild_lock(self, key):
        """Mutex lock for cache rebuild; yields True only to the single rebuilder"""
        # Hold the internal lock only while updating the set, not during the rebuild
        with self.lock:
            acquired = key not in self.rebuilding_keys
            if acquired:
                self.rebuilding_keys.add(key)
        try:
            yield acquired
        finally:
            if acquired:
                with self.lock:
                    self.rebuilding_keys.discard(key)

rebuild_manager = CacheRebuildManager()

def get_data_with_rebuild_protection(key):
    cached_data = redis_client.get(key)
    if cached_data:
        return json.loads(cached_data)
    with rebuild_manager.rebuild_lock(key) as should_rebuild:
        if should_rebuild:
            # This thread won the lock: rebuild the cache from the database
            data = database.query(key)
            if data:
                expire_time = random.randint(3600, 4320)  # 1-1.2 h
                redis_client.setex(key, expire_time, json.dumps(data))
            return data
        # Another thread is rebuilding: wait briefly, then re-read the cache
        time.sleep(0.1)
        cached_data = redis_client.get(key)
        return json.loads(cached_data) if cached_data else None
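The goal of the mutex approach is that a cold key triggers one database query no matter how many requests arrive together. The sketch below demonstrates that property with a per-key `threading.Lock` and a second cache check after the lock is acquired. This is a blocking variant rather than the `CacheRebuildManager` shown above, and the dict cache and `slow_db_query` are stand-ins invented for the example.

```python
import threading
import time
from collections import defaultdict

cache = {}                                 # stand-in for Redis
db_calls = 0
key_locks = defaultdict(threading.Lock)    # one lock per cache key
key_locks_guard = threading.Lock()         # protects the defaultdict itself

def slow_db_query(key):
    global db_calls
    db_calls += 1
    time.sleep(0.05)                       # simulate an expensive query
    return f"value-for-{key}"

def get_single_flight(key):
    value = cache.get(key)
    if value is not None:
        return value
    with key_locks_guard:
        lock = key_locks[key]
    with lock:
        # Double check: another thread may have rebuilt the entry while we waited
        value = cache.get(key)
        if value is None:
            value = slow_db_query(key)
            cache[key] = value
        return value

threads = [threading.Thread(target=get_single_flight, args=("hot",)) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(db_calls)  # 1: the other 19 threads waited and then read the cache
```

Like any in-process lock, this only deduplicates rebuilds within one application instance; with N instances the database can still see up to N concurrent queries per key.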

3. Cache Breakdown: Hot Data Trap

Problem Description

A hot key suddenly expires, causing a flood of requests to query the database simultaneously, leading to a spike in load.

Classic Case

A popular video platform's hot video cache expired, instantly generating over 5,000 concurrent database queries, exhausting the connection pool and rendering the video service unavailable.

Solution

Solution 1: Never Expire + Logical Expiration

import json
import time
from concurrent.futures import ThreadPoolExecutor

import redis

class LogicalExpireCache:
    def __init__(self):
        self.redis_client = redis.Redis()
        self.executor = ThreadPoolExecutor(max_workers=10)

    def set_with_logical_expire(self, key, data, expire_seconds):
        """Set cache with logical expiration"""
        cache_data = {'data': data, 'expire_time': time.time() + expire_seconds}
        self.redis_client.set(key, json.dumps(cache_data))

    def get_with_logical_expire(self, key):
        """Get cache and check logical expiration"""
        cached_json = self.redis_client.get(key)
        if not cached_json:
            return None
        cached_data = json.loads(cached_json)
        if time.time() < cached_data['expire_time']:
            return cached_data['data']
        else:
            self.executor.submit(self._refresh_cache_async, key)
            return cached_data['data']

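    def _refresh_cache_async(self, key):
        """Asynchronously refresh cache"""
        lock_key = f"lock:{key}"
        # Only one worker refreshes a key; the lock expires after 10 s as a safety net
        if self.redis_client.set(lock_key, "1", nx=True, ex=10):
            try:
                new_data = database.query(key)
                if new_data:
                    self.set_with_logical_expire(key, new_data, 3600)
            finally:
                # Release the lock even if the database query fails
                self.redis_client.delete(lock_key)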
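The behaviour of logical expiration is easier to see with the clock under control. The following sketch mirrors the structure of `LogicalExpireCache` but stores entries in a plain dict and takes the database loader and the clock as parameters. Both are assumptions made so the example runs without Redis; `LogicalExpireDemo` is not part of the original code.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class LogicalExpireDemo:
    """Logical expiration over a dict (stand-in for Redis keys stored without a TTL)."""

    def __init__(self, loader, clock=time.time):
        self.loader = loader          # hypothetical callable that reads the database
        self.clock = clock
        self.store = {}               # key -> {'data': ..., 'expire_time': ...}
        self.refreshing = set()       # keys with a refresh already in progress
        self.lock = threading.Lock()
        self.executor = ThreadPoolExecutor(max_workers=2)

    def set(self, key, data, expire_seconds):
        self.store[key] = {'data': data, 'expire_time': self.clock() + expire_seconds}

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        if self.clock() >= entry['expire_time']:
            with self.lock:
                start_refresh = key not in self.refreshing
                if start_refresh:
                    self.refreshing.add(key)
            if start_refresh:
                # Logically expired: refresh in the background, one task per key
                self.executor.submit(self._refresh, key)
        return entry['data']          # callers always get an answer immediately

    def _refresh(self, key):
        try:
            self.set(key, self.loader(key), 3600)
        finally:
            with self.lock:
                self.refreshing.discard(key)
```

A request that arrives after the logical expiry still returns the old value at once and only schedules the reload, so no caller ever waits on the database. The trade-off is a short window in which stale data is served.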

Solution 2: Distributed Lock + Double Check

import json
import logging
import time
import uuid

logger = logging.getLogger(__name__)

class DistributedLock:
    def __init__(self, redis_client, key, timeout=10):
        self.redis_client = redis_client
        self.key = f"lock:{key}"
        self.timeout = timeout
        self.identifier = str(uuid.uuid4())

    def __enter__(self):
        end_time = time.time() + self.timeout
        while time.time() < end_time:
            if self.redis_client.set(self.key, self.identifier, nx=True, ex=self.timeout):
                return self
            time.sleep(0.001)
        raise TimeoutError("Failed to acquire distributed lock")

    def __exit__(self, exc_type, exc_val, exc_tb):
        unlock_script = """
        if redis.call("get", KEYS[1]) == ARGV[1] then
            return redis.call("del", KEYS[1])
        else
            return 0
        end
        """
        self.redis_client.eval(unlock_script, 1, self.key, self.identifier)

def get_data_with_distributed_lock(key):
    cached_data = redis_client.get(key)
    if cached_data:
        return json.loads(cached_data)
    try:
        with DistributedLock(redis_client, key, timeout=5):
            cached_data = redis_client.get(key)
            if cached_data:
                return json.loads(cached_data)
            data = database.query(key)
            if data:
                redis_client.setex(key, 3600, json.dumps(data))
                return data
    except TimeoutError:
        logger.warning(f"Lock timeout, fallback to DB query: {key}")
        return database.query(key)
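The unlock script compares the stored value with the caller's identifier before deleting, and the reason becomes clear when a lock expires while its holder is still working. The sketch below models that sequence with a tiny in-memory class; `FakeRedis`, `set_nx`, `expire_now`, and `delete_if_equal` are invented for this illustration and are not redis-py methods.

```python
import uuid

class FakeRedis:
    """Minimal in-memory model of SET NX and a compare-and-delete script."""

    def __init__(self):
        self.data = {}

    def set_nx(self, key, value):
        if key in self.data:
            return False
        self.data[key] = value
        return True

    def expire_now(self, key):
        self.data.pop(key, None)      # simulates the lock TTL running out

    def delete_if_equal(self, key, expected):
        # Same check the Lua unlock script performs atomically on the server
        if self.data.get(key) == expected:
            del self.data[key]
            return 1
        return 0

r = FakeRedis()
client_a, client_b = str(uuid.uuid4()), str(uuid.uuid4())

took_a = r.set_nx("lock:hot", client_a)        # A takes the lock
r.expire_now("lock:hot")                       # A is slow; its lock expires
took_b = r.set_nx("lock:hot", client_b)        # B takes the lock
released_by_a = r.delete_if_equal("lock:hot", client_a)  # A finishes late
print(took_a, took_b, released_by_a, r.data["lock:hot"] == client_b)
```

Client A's late release returns 0 and leaves B's lock in place. With a plain `DEL`, A would have removed B's lock and let a third client in while B was still rebuilding the cache.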

4. Production Best Practices

Monitoring & Alert System

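class CacheMonitor:
    def __init__(self):
        self.redis_client = redis.Redis()
        self.metrics = {}

    def send_alert(self, title, message):
        """Placeholder: forward to your alerting channel"""
        logger.warning(f"ALERT {title}: {message}")

    def record_cache_hit_rate(self):
        """Monitor cache hit rate"""
        # Assumes the application publishes its hit rate under this key
        hit_rate = self.redis_client.get('cache_hit_rate')
        if hit_rate and float(hit_rate) < 0.8:
            self.send_alert("Low cache hit rate", f"Current hit rate: {float(hit_rate):.2%}")

    def monitor_redis_memory(self):
        """Monitor Redis memory usage"""
        info = self.redis_client.info('memory')
        max_memory = info.get('maxmemory', 0)
        if not max_memory:
            return  # maxmemory 0 means no limit, so there is no ratio to compute
        usage = info['used_memory'] / max_memory
        if usage > 0.85:
            self.send_alert("Redis memory high", f"Usage: {usage:.2%}")

    def check_slow_queries(self):
        """Check slow queries (durations are reported in microseconds)"""
        slow_logs = self.redis_client.slowlog_get(10)
        for log in slow_logs:
            if log['duration'] > 10000:
                self.send_alert("Slow query detected", f"Duration: {log['duration']}µs, Command: {log['command']}")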

def monitoring_task():
    monitor = CacheMonitor()
    while True:
        try:
            monitor.record_cache_hit_rate()
            monitor.monitor_redis_memory()
            monitor.check_slow_queries()
        except Exception as e:
            logger.error(f"Monitoring task error: {e}")
        time.sleep(60)
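If the application does not publish a `cache_hit_rate` key, the hit rate can also be derived from the `keyspace_hits` and `keyspace_misses` counters in `INFO stats`. Those counters are cumulative since server start, so a penetration attack barely moves the lifetime ratio. The sketch below computes the rate over the interval between two samples instead; the sample numbers are hypothetical.

```python
def hit_rate(stats_before: dict, stats_after: dict):
    """Hit rate over the interval between two `INFO stats` samples.

    keyspace_hits and keyspace_misses are cumulative counters, so the
    difference between two samples describes only the recent interval.
    Returns None when there were no lookups in the interval.
    """
    hits = stats_after["keyspace_hits"] - stats_before["keyspace_hits"]
    misses = stats_after["keyspace_misses"] - stats_before["keyspace_misses"]
    total = hits + misses
    return hits / total if total > 0 else None

# Hypothetical samples taken 60 seconds apart
before = {"keyspace_hits": 9_500_000, "keyspace_misses": 500_000}
after = {"keyspace_hits": 9_501_000, "keyspace_misses": 509_000}

lifetime = after["keyspace_hits"] / (after["keyspace_hits"] + after["keyspace_misses"])
print(f"lifetime:    {lifetime:.1%}")
print(f"last minute: {hit_rate(before, after):.1%}")
```

In this example the lifetime ratio still reads about 95% while the last minute is at 10%, which is the kind of drop an alert should be based on.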

Cache Warm‑up Strategy

import json
import random
from concurrent.futures import ThreadPoolExecutor

class CacheWarmUp:
    def __init__(self):
        self.redis_client = redis.Redis()
        self.thread_pool = ThreadPoolExecutor(max_workers=20)

    def warm_up_hot_data(self):
        """Warm up hot product data"""
        hot_products = database.query("SELECT id FROM products WHERE is_hot = 1")
        futures = []
        for product in hot_products:
            future = self.thread_pool.submit(self._warm_single_product, product['id'])
            futures.append(future)
        success = 0
        for future in futures:
            try:
                future.result(timeout=30)
                success += 1
            except Exception as e:
                logger.error(f"Warm‑up failed: {e}")
        logger.info(f"Cache warm‑up completed, success: {success}/{len(hot_products)}")

    def _warm_single_product(self, product_id):
        """Warm up a single product cache"""
        product_info = database.query_product(product_id)
        if product_info:
            cache_key = f"product:{product_id}"
            expire_time = random.randint(3600, 4320)
            self.redis_client.setex(cache_key, expire_time, json.dumps(product_info))

Disaster Recovery Plan

class CacheDisasterRecovery:
    def __init__(self):
        self.master_redis = redis.Redis(host='master-redis')
        self.slave_redis = redis.Redis(host='slave-redis')
        self.local_cache = {}

    def get_with_fallback(self, key):
        """Multi‑level fallback query"""
        try:
            data = self.master_redis.get(key)
            if data:
                return json.loads(data)
        except Exception as e:
            logger.warning(f"Master Redis error: {e}")
        try:
            data = self.slave_redis.get(key)
            if data:
                return json.loads(data)
        except Exception as e:
            logger.warning(f"Slave Redis error: {e}")
        if key in self.local_cache:
            item = self.local_cache[key]
            if time.time() < item['expire_time']:
                logger.info(f"Local cache hit: {key}")
                return item['data']
        try:
            data = database.query(key)
            if data:
                self.local_cache[key] = {'data': data, 'expire_time': time.time() + 300}
                return data
        except Exception as e:
            logger.error(f"DB query error: {e}")
        return None

5. Performance Optimization & Tuning

Redis Configuration Optimization

# redis.conf production recommended settings
maxmemory 8gb
maxmemory-policy allkeys-lru

# Persistence
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes

# Network
tcp-keepalive 300
timeout 0

# Slowlog
slowlog-log-slower-than 10000
slowlog-max-len 128

# Clients
maxclients 10000

Connection Pool Configuration

import redis

redis_pool = redis.ConnectionPool(
    host='localhost',
    port=6379,
    db=0,
    max_connections=100,
    retry_on_timeout=True,
    health_check_interval=30,
    socket_connect_timeout=5,
    socket_timeout=5,
)

redis_client = redis.Redis(connection_pool=redis_pool)

6. Fault Diagnosis Handbook

Common Issue Diagnosis

# 1. Check Redis memory usage
redis-cli info memory

# 2. Monitor slow queries
redis-cli slowlog get 10

# 3. View client connections
redis-cli info clients

# 4. Monitor keyspace hit rate
redis-cli info stats | grep keyspace

# 5. View key counts and TTL statistics per database
redis-cli info keyspace
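The keyspace section prints one line per database in the form `dbN:keys=...,expires=...,avg_ttl=...`. When diagnosing an avalanche it helps to know what share of keys carry a TTL at all. The parser below is a small sketch for that purpose; the sample text is hypothetical, and newer Redis versions may add further fields, which the parser keeps as long as their values are integers.

```python
def parse_keyspace(info_text: str) -> dict:
    """Parse the output of `redis-cli info keyspace` into {db: {field: int}}."""
    result = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blanks and the "# Keyspace" header
        db, _, fields = line.partition(":")
        result[db] = {
            name: int(value)
            for name, value in (pair.split("=", 1) for pair in fields.split(","))
        }
    return result

# Hypothetical sample output
sample = """# Keyspace
db0:keys=120000,expires=30000,avg_ttl=1800000
"""

for db, stats in parse_keyspace(sample).items():
    share = stats["expires"] / stats["keys"] if stats["keys"] else 0.0
    print(f"{db}: {stats['keys']} keys, {share:.0%} with a TTL, avg_ttl={stats['avg_ttl']} ms")
```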

Emergency Handling Script

#!/usr/bin/env python3
"""Redis emergency handling tool"""

import redis
import sys
import time

class RedisEmergencyKit:
    def __init__(self, host='localhost', port=6379):
        self.redis_client = redis.Redis(host=host, port=port)

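    def flush_expired_keys(self):
        """Force lazy deletion of keys that have already expired"""
        print("Starting cleanup of expired keys...")
        count = 0
        for key in self.redis_client.scan_iter(count=1000):
            # TTL returns -2 when the key no longer exists and -1 when it has
            # no expiry. Touching an expired key makes Redis delete it lazily,
            # so no explicit DELETE is needed here.
            if self.redis_client.ttl(key) == -2:
                count += 1
        print(f"Cleanup completed, {count} expired keys removed")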

    def analyze_big_keys(self, limit=10):
        """Analyze biggest keys"""
        print(f"Analyzing top {limit} biggest keys...")
        big_keys = []
        for key in self.redis_client.scan_iter():
            memory = self.redis_client.memory_usage(key)
            if memory:
                big_keys.append((key.decode(), memory))
        big_keys.sort(key=lambda x: x[1], reverse=True)
        for key, memory in big_keys[:limit]:
            print(f"{key}: {memory/1024:.2f} KB")

    def emergency_cache_clear(self, pattern):
        """Urgently clear cache matching pattern"""
        print(f"Clearing cache with pattern {pattern}...")
        count = 0
        for key in self.redis_client.scan_iter(match=pattern):
            self.redis_client.delete(key)
            count += 1
        print(f"Clear completed, deleted {count} keys")

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python emergency_kit.py <command>")
        print("Commands: flush_expired | analyze_big_keys | clear_pattern <pattern>")
        sys.exit(1)
    kit = RedisEmergencyKit()
    cmd = sys.argv[1]
    if cmd == "flush_expired":
        kit.flush_expired_keys()
    elif cmd == "analyze_big_keys":
        kit.analyze_big_keys()
    elif cmd == "clear_pattern" and len(sys.argv) > 2:
        kit.emergency_cache_clear(sys.argv[2])
    else:
        print("Unknown command")
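On a large instance, `analyze_big_keys` collects every key and its size in a list before sorting, which can itself consume a lot of memory on the machine running the script. One possible refinement, sketched here with a hypothetical helper name, is to keep only the top N while iterating by using `heapq.nlargest` from the standard library.

```python
import heapq

def top_n_by_size(key_sizes, limit=10):
    """Return the `limit` largest (key, size) pairs from any iterable."""
    return heapq.nlargest(limit, key_sizes, key=lambda pair: pair[1])

# Hypothetical (key, bytes) pairs, e.g. produced while iterating SCAN results
sample = (("k%d" % i, size) for i, size in enumerate([120, 98_000, 450, 7_200_000, 64]))
for key, size in top_n_by_size(sample, limit=2):
    print(f"{key}: {size / 1024:.2f} KB")
```

Because the input can be a generator, the pairs can be fed in directly from a scan loop and only `limit` entries are held at any time.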

Conclusion

Through this in‑depth analysis we learned the essence of the three classic Redis problems and their solutions:

Cache Penetration: use Bloom filters or null‑value caching to build the first line of defense.

Cache Avalanche: disperse risk with random expiration, multi‑level caches, or mutex locks.

Cache Breakdown: adopt logical expiration or distributed locks to protect hot data.

As operations engineers, we must not only master these solutions but also establish comprehensive monitoring, warm‑up mechanisms, and emergency plans. Remember, good ops is not about never having failures, but about rapid response and recovery.
