Redis Cache Pitfalls: Penetration, Avalanche, Breakdown – Solutions & Real Cases
This article examines the three classic Redis caching problems—cache penetration, cache avalanche, and cache breakdown—illustrates real‑world incidents that caused system outages, and provides comprehensive mitigation techniques such as Bloom filters, null‑value caching, random expiration, multi‑level caches, logical expiration, and distributed locks, along with monitoring and disaster‑recovery practices.
Redis Cache Penetration, Avalanche, Breakdown: Solutions and Practice
As a senior operations engineer, I have dealt with countless Redis‑related incidents in production. Today I share three classic problems that often wake ops engineers at midnight and their complete solutions.
Preface: Those Midnight Calls
At 3 AM the phone rings urgently: "System down! Users cannot log in! Database CPU spikes to 100%!" Such scenarios are familiar to every ops engineer. In my 7‑year ops career, 80% of these incidents have involved the three classic Redis cache issues: cache penetration, cache avalanche, and cache breakdown.
1. Cache Penetration: Nightmare of Malicious Attacks
Problem Phenomenon
Requests repeatedly ask for data that exists neither in the cache nor in the database; every such request misses the cache and hits the database directly, causing a sudden surge in database load.
Real Case Review
An e‑commerce platform suffered a malicious attack where the attacker generated random product IDs and queried product information. Since these IDs do not exist, the Redis cache missed every time and each request hit MySQL, exhausting the database connection pool.
Monitoring data:
Database QPS: from the usual 500/s to 8,000/s
Cache hit rate: dropped from 95% to 10%
System response time: rose from 50 ms to 5,000 ms
Solution Details
Solution 1: Bloom Filter (★★★★★)
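The Bloom filter is the most elegant solution for cache penetration. Its core property is asymmetric: a "not present" answer is always correct, while a "possibly present" answer is occasionally wrong (a false positive). Requests for IDs that were never added can therefore be rejected before they reach the cache or the database, at the cost of letting a small, configurable fraction through.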
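How large such a filter needs to be follows from two textbook formulas: for n expected items and a target false-positive rate p, the bit count is m = -n ln(p) / (ln 2)^2 and the number of hash functions is k = (m / n) ln 2. The sketch below simply evaluates them; `bloom_parameters` is a hypothetical helper name, not part of any library.

```python
import math


def bloom_parameters(capacity, error_rate):
    """Return (bit_count, hash_count) for an optimally sized Bloom filter.

    Textbook formulas: m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
    """
    if capacity <= 0 or not 0 < error_rate < 1:
        raise ValueError("capacity must be > 0 and 0 < error_rate < 1")
    bit_count = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    hash_count = max(1, round(bit_count / capacity * math.log(2)))
    return bit_count, hash_count


bits, hashes = bloom_parameters(1_000_000, 0.001)
print(f"{bits} bits (~{bits / 8 / 1024 / 1024:.2f} MiB), {hashes} hash functions")
```

For one million items at a 0.1% false-positive rate this comes to roughly 14.4 million bits (about 1.7 MiB) and 10 hash functions, which is why the filter is cheap enough to keep in front of every lookup.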
Implementation Steps:
import json
import math

import mmh3
import redis
from bitarray import bitarray


class BloomFilter:
    def __init__(self, capacity=1000000, error_rate=0.001):
        """Initialize Bloom filter
        capacity: expected number of items
        error_rate: false positive rate
        """
        self.capacity = capacity
        self.error_rate = error_rate
        self.bit_num = self._get_bit_num()
        self.hash_num = self._get_hash_num()
        self.bit_array = bitarray(self.bit_num)
        self.bit_array.setall(0)

    def _get_bit_num(self):
        """Calculate bit array size"""
        return int(-self.capacity * math.log(self.error_rate) / (math.log(2) ** 2))

    def _get_hash_num(self):
        """Calculate number of hash functions (at least one)"""
        return max(1, round(self.bit_num * math.log(2) / self.capacity))

    def _hash(self, value):
        """Derive hash_num bit positions by double hashing"""
        data = str(value)  # mmh3 expects str/bytes, so integer IDs are converted
        h1 = mmh3.hash(data, 0, signed=False)
        h2 = mmh3.hash(data, h1, signed=False) or 1  # avoid a zero step
        for i in range(self.hash_num):
            yield (h1 + i * h2) % self.bit_num

    def add(self, value):
        """Add element"""
        for index in self._hash(value):
            self.bit_array[index] = 1

    def is_exist(self, value):
        """Check if element exists (False is definite, True means "possibly")"""
        for index in self._hash(value):
            if not self.bit_array[index]:
                return False
        return True


# Business layer usage
redis_client = redis.Redis(host='localhost', port=6379, db=0)
bloom_filter = BloomFilter()  # must be pre-loaded with every valid product ID
# `database` stands for the application's data-access layer (not shown)


def get_product_info(product_id):
    # First check Bloom filter
    if not bloom_filter.is_exist(product_id):
        return {"error": "Product does not exist"}
    # Query cache
    cache_key = f"product:{product_id}"
    cached_data = redis_client.get(cache_key)
    if cached_data is not None:
        product = json.loads(cached_data)
        # An empty object is the cached "not found" marker
        return product or {"error": "Product does not exist"}
    # Query database
    product = database.query_product(product_id)
    if product:
        redis_client.setex(cache_key, 3600, json.dumps(product))
        return product
    # Cache empty value to prevent repeated queries
    redis_client.setex(cache_key, 300, json.dumps({}))
    return {"error": "Product does not exist"}
Ops Deployment Recommendations:
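Store the Bloom filter bits in Redis rather than in process memory (the sample above keeps them in a local bitarray), so that every application instance shares one filter; this also works with cluster deployments
Periodically rebuild the Bloom filter: a standard Bloom filter cannot delete items, so the false‑positive rate climbs as data is added and removed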
Monitor Bloom filter capacity usage
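To make the first recommendation concrete, here is a minimal sketch of a filter whose bits live in a single Redis string and are accessed with SETBIT and GETBIT. The class names `RedisBloomFilter` and `InMemoryBits` are hypothetical; the hashing uses SHA-256 from the standard library instead of mmh3 so the sketch has no third-party dependency, and the in-memory stand-in exists only so the example runs without a server. With redis-py, a `redis.Redis()` client can be passed in its place, since it provides `setbit(name, offset, value)` and `getbit(name, offset)`.

```python
import hashlib


class RedisBloomFilter:
    """Bloom filter whose bits are stored in one Redis string (SETBIT/GETBIT)."""

    def __init__(self, client, key, bit_count, hash_count):
        self.client = client
        self.key = key
        self.bit_count = bit_count
        self.hash_count = hash_count

    def _offsets(self, value):
        # Double hashing: derive k offsets from two 64-bit slices of one digest.
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # forced odd, so never zero
        return [(h1 + i * h2) % self.bit_count for i in range(self.hash_count)]

    def add(self, value):
        for offset in self._offsets(value):
            self.client.setbit(self.key, offset, 1)

    def might_contain(self, value):
        return all(self.client.getbit(self.key, o) for o in self._offsets(value))


class InMemoryBits:
    """Stand-in exposing only setbit/getbit, so the sketch runs without Redis."""

    def __init__(self):
        self._bits = {}

    def setbit(self, key, offset, value):
        previous = self._bits.get((key, offset), 0)
        self._bits[(key, offset)] = value
        return previous

    def getbit(self, key, offset):
        return self._bits.get((key, offset), 0)


bloom = RedisBloomFilter(InMemoryBits(), "bloom:products", bit_count=10_000, hash_count=7)
for product_id in range(100):
    bloom.add(product_id)
print(bloom.might_contain(42))  # True: added items are never reported missing
```

Each check costs k round trips in this form; against a real server it would be reasonable to batch them in a pipeline. A periodic rebuild can write into a fresh key and then switch readers over, so lookups never see a half-built filter.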
Solution 2: Null‑Value Cache
A simple but effective approach is to also cache keys whose query result is empty.
def query_with_null_cache(key):
    # 1. Query cache
    cached_data = redis_client.get(f"cache:{key}")
    if cached_data is not None:
        # A cached "null" decodes to None, so known-missing keys skip the database
        return json.loads(cached_data)
    # 2. Query database
    data = database.query(key)
    # 3. Cache result (including null)
    if data:
        redis_client.setex(f"cache:{key}", 3600, json.dumps(data))
    else:
        redis_client.setex(f"cache:{key}", 300, "null")
    return data
Notes:
Null‑value cache TTL should be shorter than normal data
Consider storage cost
Implement cleanup mechanisms to prevent garbage accumulation
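The storage-cost note deserves a number. When an attacker sends a fresh random key with every request, each miss creates a new null entry that lives for the full null TTL, so the count of live entries is bounded by miss rate times TTL. A back-of-the-envelope sketch; `null_cache_footprint` is a hypothetical helper, and the 100 bytes per entry is an assumed figure that should be replaced with a measurement from your own instance:

```python
def null_cache_footprint(miss_qps, null_ttl_seconds, bytes_per_entry):
    """Upper bound on live null entries and their memory when every miss is a new key."""
    live_entries = miss_qps * null_ttl_seconds
    return live_entries, live_entries * bytes_per_entry


entries, size = null_cache_footprint(miss_qps=8_000, null_ttl_seconds=300, bytes_per_entry=100)
print(f"{entries:,} null entries, ~{size / 1024 / 1024:.0f} MiB")
```

At the 8,000 requests per second seen in the incident above and a 300-second null TTL, that is up to 2.4 million placeholder keys. This is the main reason null-value caching suits a bounded set of missing keys, while random-key attacks are better stopped by a Bloom filter.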
2. Cache Avalanche: System Collapse Trigger
Problem Phenomenon
When a large number of cache entries expire at the same moment, the requests they were absorbing all fall through to the database at once, which can overload or crash it.
Bloody Lessons
During a promotion, a financial system had a large batch of cache entries expire together; over 100,000 user queries hit the database at once, bringing the transaction system down for 45 minutes and causing losses exceeding 5 million.
Solution
Solution 1: Random Expiration Time
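Adding random jitter to each TTL spreads expirations over a window instead of a single instant. The small simulation below makes the effect measurable without touching Redis. It is a sketch under simple assumptions: all keys are written at the same time, and `peak_expirations` is a hypothetical helper name.

```python
import random
from collections import Counter


def peak_expirations(key_count, base_ttl, jitter, seed=0):
    """Largest number of keys that expire within the same second.

    Every key is written at t=0 with TTL = base_ttl * uniform(1 - jitter, 1 + jitter).
    """
    rng = random.Random(seed)
    expiry_seconds = Counter(
        int(base_ttl * rng.uniform(1 - jitter, 1 + jitter)) for _ in range(key_count)
    )
    return max(expiry_seconds.values())


print("no jitter: ", peak_expirations(100_000, 3600, jitter=0.0))
print("20% jitter:", peak_expirations(100_000, 3600, jitter=0.2))
```

Without jitter all 100,000 keys expire in the same second. With a 20% jitter the expirations are spread over a 24-minute window, so the worst second carries only a small fraction of the keys.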
import json
import logging
import random
import time

logger = logging.getLogger(__name__)


def set_cache_with_random_expire(key, data, base_expire=3600):
    """Set cache with random expiration to avoid avalanche
    base_expire: base expiration in seconds
    """
    random_factor = random.uniform(0.8, 1.2)  # +/-20% jitter
    expire_time = int(base_expire * random_factor)
    redis_client.setex(key, expire_time, json.dumps(data))
    logger.info(f"Cache set: {key}, expire: {expire_time}s")


def batch_warm_up_cache(data_list):
    """Batch warm-up cache to avoid simultaneous expiration"""
    for data in data_list:
        key = f"product:{data['id']}"
        set_cache_with_random_expire(key, data, 3600)
        time.sleep(0.01)  # throttle writes so warm-up does not saturate Redis
Solution 2: Multi‑Level Cache Architecture
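The in-process layer of a multi-level cache needs its own expiry and size bound; a plain dictionary would serve stale values indefinitely and grow without limit. A minimal sketch of such a layer, assuming a hypothetical `LocalTTLCache` class with an injectable clock (which also makes it easy to test). It is not thread-safe as written; a lock around `get` and `set` would be needed in a multi-threaded server.

```python
import time
from collections import OrderedDict


class LocalTTLCache:
    """Small in-process cache with per-entry expiry and LRU eviction."""

    def __init__(self, max_entries=1024, ttl_seconds=30.0, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, default=None):
        item = self._entries.get(key)
        if item is None:
            return default
        expires_at, value = item
        if self.clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return default
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key, value):
        self._entries[key] = (self.clock() + self.ttl_seconds, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry

    def __len__(self):
        return len(self._entries)


local = LocalTTLCache(max_entries=2, ttl_seconds=30)
local.set("product:1", {"name": "demo"})
print(local.get("product:1"))
```

Keeping the local TTL much shorter than the Redis TTL limits how long different application instances can disagree about a value.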
import json
from collections import Counter

import memcache  # provided by the python-memcached package
import redis


class MultiLevelCache:
    def __init__(self):
        self.l1_cache = {}  # Local cache (unbounded here; bound and expire it in production)
        self.l2_cache = redis.Redis()  # Redis cache
        self.l3_cache = memcache.Client(['127.0.0.1:11211'])  # Memcached cache
        self.metrics = Counter()  # hit/miss counters

    def get(self, key):
        # L1 hit
        if key in self.l1_cache:
            self.metrics['l1_hit'] += 1
            return self.l1_cache[key]
        # L2 hit
        l2_data = self.l2_cache.get(key)
        if l2_data:
            self.metrics['l2_hit'] += 1
            self.l1_cache[key] = json.loads(l2_data)
            return self.l1_cache[key]
        # L3 hit: backfill the faster layers
        l3_data = self.l3_cache.get(key)
        if l3_data:
            self.metrics['l3_hit'] += 1
            self.l1_cache[key] = l3_data
            self.l2_cache.setex(key, 3600, json.dumps(l3_data))
            return l3_data
        # Miss
        self.metrics['cache_miss'] += 1
        return None

    def set(self, key, value, expire=3600):
        # Write to all layers
        self.l1_cache[key] = value
        self.l2_cache.setex(key, expire, json.dumps(value))
        self.l3_cache.set(key, value, time=expire)
Solution 3: Mutex Lock Rebuild Cache
import json
import random
import threading
import time
from contextlib import contextmanager


class CacheRebuildManager:
    def __init__(self):
        self.rebuilding_keys = set()
        self.lock = threading.Lock()

    @contextmanager
    def rebuild_lock(self, key):
        """Per-process mutex for cache rebuild; yields True to the single rebuilder"""
        with self.lock:  # held only while the key set is inspected
            acquired = key not in self.rebuilding_keys
            if acquired:
                self.rebuilding_keys.add(key)
        try:
            yield acquired
        finally:
            if acquired:
                with self.lock:
                    self.rebuilding_keys.discard(key)


rebuild_manager = CacheRebuildManager()


def get_data_with_rebuild_protection(key):
    cached_data = redis_client.get(key)
    if cached_data:
        return json.loads(cached_data)
    with rebuild_manager.rebuild_lock(key) as should_rebuild:
        if should_rebuild:
            data = database.query(key)
            if data:
                expire_time = random.randint(3600, 4320)  # 1-1.2 h
                redis_client.setex(key, expire_time, json.dumps(data))
            return data
    # Another thread is rebuilding: wait briefly, then re-read the cache
    time.sleep(0.1)
    cached_data = redis_client.get(key)
    if cached_data:
        return json.loads(cached_data)
    # Still missing: fall back to the database
    return database.query(key)
3. Cache Breakdown: Hot Data Trap
Problem Description
A hot key suddenly expires, causing a flood of requests to query the database simultaneously, leading to a spike in load.
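The failure mode is easy to reproduce inside a single process. The sketch below releases a group of threads against an empty cache at the same instant and counts how often the slow loader runs, first unprotected and then with a lock plus a second cache check. All names (`run_stampede`, `slow_loader`) are hypothetical, and a dictionary stands in for Redis.

```python
import threading
import time


def run_stampede(thread_count, use_lock):
    """Fire thread_count concurrent reads at an empty cache; return loader call count."""
    cache = {}
    loader_calls = []
    rebuild_lock = threading.Lock()
    start = threading.Barrier(thread_count)

    def slow_loader():
        loader_calls.append(1)  # list.append is atomic in CPython
        time.sleep(0.05)        # stands in for a slow database query
        return "video-metadata"

    def read():
        start.wait()  # release all threads at the same moment
        if "hot" in cache:
            return
        if not use_lock:
            cache["hot"] = slow_loader()
            return
        with rebuild_lock:
            if "hot" not in cache:  # double check after acquiring the lock
                cache["hot"] = slow_loader()

    threads = [threading.Thread(target=read) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(loader_calls)


print("unprotected loader calls:", run_stampede(20, use_lock=False))
print("locked loader calls:     ", run_stampede(20, use_lock=True))
```

With the lock and double check the loader runs exactly once no matter how many threads arrive; without it the count typically approaches the number of threads. The same pattern carries over to multiple processes once the in-process lock is replaced by a distributed one.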
Classic Case
A popular video platform's hot video cache expired, instantly generating over 5,000 concurrent database queries, exhausting the connection pool and rendering the video service unavailable.
Solution
Solution 1: Never Expire + Logical Expiration
import json
import time
from concurrent.futures import ThreadPoolExecutor

import redis


class LogicalExpireCache:
    def __init__(self):
        self.redis_client = redis.Redis()
        self.executor = ThreadPoolExecutor(max_workers=10)

    def set_with_logical_expire(self, key, data, expire_seconds):
        """Set cache with logical expiration (the Redis key itself has no TTL)"""
        cache_data = {'data': data, 'expire_time': time.time() + expire_seconds}
        self.redis_client.set(key, json.dumps(cache_data))

    def get_with_logical_expire(self, key):
        """Get cache and check logical expiration"""
        cached_json = self.redis_client.get(key)
        if not cached_json:
            return None
        cached_data = json.loads(cached_json)
        if time.time() < cached_data['expire_time']:
            return cached_data['data']
        # Logically expired: refresh in the background and serve the stale value
        self.executor.submit(self._refresh_cache_async, key)
        return cached_data['data']

    def _refresh_cache_async(self, key):
        """Asynchronously refresh cache"""
        lock_key = f"lock:{key}"
        # NX + EX: only one worker refreshes, and the lock self-expires after 10 s
        if self.redis_client.set(lock_key, "1", nx=True, ex=10):
            try:
                new_data = database.query(key)
                if new_data:
                    self.set_with_logical_expire(key, new_data, 3600)
            finally:
                self.redis_client.delete(lock_key)
Solution 2: Distributed Lock + Double Check
import json
import logging
import time
import uuid

logger = logging.getLogger(__name__)


class DistributedLock:
    def __init__(self, redis_client, key, timeout=10):
        self.redis_client = redis_client
        self.key = f"lock:{key}"
        self.timeout = timeout  # used both as max wait and as the lock TTL (seconds)
        self.identifier = str(uuid.uuid4())

    def __enter__(self):
        end_time = time.time() + self.timeout
        while time.time() < end_time:
            if self.redis_client.set(self.key, self.identifier, nx=True, ex=self.timeout):
                return self
            time.sleep(0.001)
        raise TimeoutError("Failed to acquire distributed lock")

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Delete the lock only if we still own it (atomic compare-and-delete)
        unlock_script = """
        if redis.call("get", KEYS[1]) == ARGV[1] then
            return redis.call("del", KEYS[1])
        else
            return 0
        end
        """
        self.redis_client.eval(unlock_script, 1, self.key, self.identifier)


def get_data_with_distributed_lock(key):
    cached_data = redis_client.get(key)
    if cached_data:
        return json.loads(cached_data)
    try:
        with DistributedLock(redis_client, key, timeout=5):
            # Double check: a previous lock holder may have rebuilt the cache
            cached_data = redis_client.get(key)
            if cached_data:
                return json.loads(cached_data)
            data = database.query(key)
            if data:
                redis_client.setex(key, 3600, json.dumps(data))
            return data
    except TimeoutError:
        logger.warning(f"Lock timeout, fallback to DB query: {key}")
        return database.query(key)
4. Production Best Practices
Monitoring & Alert System
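A hit-rate alert is only as good as the number behind it. Redis reports cumulative `keyspace_hits` and `keyspace_misses` counters in `INFO stats`; because they accumulate since startup, a recent collapse can hide behind a long healthy history. A minimal sketch, assuming a hypothetical helper `interval_hit_rate` that is fed two snapshots taken a polling interval apart (for example the dictionaries returned by redis-py's `info('stats')`):

```python
def interval_hit_rate(previous, current):
    """Hit rate between two INFO stats snapshots.

    Returns None when there was no traffic in the interval or when the
    counters went backwards (for example after a server restart).
    """
    hits = current["keyspace_hits"] - previous["keyspace_hits"]
    misses = current["keyspace_misses"] - previous["keyspace_misses"]
    total = hits + misses
    if hits < 0 or misses < 0 or total == 0:
        return None
    return hits / total


before = {"keyspace_hits": 9_500, "keyspace_misses": 500}
after = {"keyspace_hits": 9_600, "keyspace_misses": 1_400}
print(interval_hit_rate(before, after))  # 0.1
```

In this made-up example the lifetime ratio is still about 87% (9,600 hits out of 11,000 lookups), while the last interval shows only 10%, which is the figure an alert should fire on.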
import logging
import time

import redis

logger = logging.getLogger(__name__)


class CacheMonitor:
    def __init__(self):
        self.redis_client = redis.Redis()
        self.metrics = {}

    def send_alert(self, title, message):
        """Alert hook; replace with your paging or chat integration"""
        logger.warning(f"[ALERT] {title}: {message}")

    def record_cache_hit_rate(self):
        """Monitor cache hit rate"""
        # Assumes another job publishes the rate (0-1) under this key
        hit_rate = self.redis_client.get('cache_hit_rate')
        if hit_rate and float(hit_rate) < 0.8:
            self.send_alert("Low cache hit rate", f"Current hit rate: {float(hit_rate):.2%}")

    def monitor_redis_memory(self):
        """Monitor Redis memory usage"""
        info = self.redis_client.info('memory')
        if not info.get('maxmemory'):
            return  # maxmemory 0 means no limit, so there is no ratio to check
        usage = info['used_memory'] / info['maxmemory']
        if usage > 0.85:
            self.send_alert("Redis memory high", f"Usage: {usage:.2%}")

    def check_slow_queries(self):
        """Check slow queries"""
        slow_logs = self.redis_client.slowlog_get(10)
        for log in slow_logs:
            if log['duration'] > 10000:  # microseconds
                self.send_alert("Slow query detected", f"Duration: {log['duration']}µs, Command: {log['command']}")


def monitoring_task():
    monitor = CacheMonitor()
    while True:
        try:
            monitor.record_cache_hit_rate()
            monitor.monitor_redis_memory()
            monitor.check_slow_queries()
        except Exception as e:
            logger.error(f"Monitoring task error: {e}")
        time.sleep(60)
Cache Warm‑up Strategy
import json
import logging
import random
from concurrent.futures import ThreadPoolExecutor

import redis

logger = logging.getLogger(__name__)


class CacheWarmUp:
    def __init__(self):
        self.redis_client = redis.Redis()
        self.thread_pool = ThreadPoolExecutor(max_workers=20)

    def warm_up_hot_data(self):
        """Warm up hot product data"""
        hot_products = database.query("SELECT id FROM products WHERE is_hot = 1")
        futures = []
        for product in hot_products:
            future = self.thread_pool.submit(self._warm_single_product, product['id'])
            futures.append(future)
        success = 0
        for future in futures:
            try:
                future.result(timeout=30)
                success += 1
            except Exception as e:
                logger.error(f"Warm-up failed: {e}")
        logger.info(f"Cache warm-up completed, success: {success}/{len(hot_products)}")

    def _warm_single_product(self, product_id):
        """Warm up a single product cache"""
        product_info = database.query_product(product_id)
        if product_info:
            cache_key = f"product:{product_id}"
            expire_time = random.randint(3600, 4320)  # jittered TTL, 1-1.2 h
            self.redis_client.setex(cache_key, expire_time, json.dumps(product_info))
Disaster Recovery Plan
import json
import logging
import time

import redis

logger = logging.getLogger(__name__)


class CacheDisasterRecovery:
    def __init__(self):
        self.master_redis = redis.Redis(host='master-redis')
        self.slave_redis = redis.Redis(host='slave-redis')
        self.local_cache = {}

    def get_with_fallback(self, key):
        """Multi-level fallback query: master, replica, local cache, database"""
        try:
            data = self.master_redis.get(key)
            if data:
                return json.loads(data)
        except Exception as e:
            logger.warning(f"Master Redis error: {e}")
        try:
            data = self.slave_redis.get(key)
            if data:
                return json.loads(data)
        except Exception as e:
            logger.warning(f"Slave Redis error: {e}")
        if key in self.local_cache:
            item = self.local_cache[key]
            if time.time() < item['expire_time']:
                logger.info(f"Local cache hit: {key}")
                return item['data']
        try:
            data = database.query(key)
            if data:
                # Keep a short-lived local copy for the next outage-time read
                self.local_cache[key] = {'data': data, 'expire_time': time.time() + 300}
            return data
        except Exception as e:
            logger.error(f"DB query error: {e}")
            return None
5. Performance Optimization & Tuning
Redis Configuration Optimization
# redis.conf production recommended settings
maxmemory 8gb
maxmemory-policy allkeys-lru
# Persistence
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
# Network
tcp-keepalive 300
timeout 0
# Slowlog
slowlog-log-slower-than 10000
slowlog-max-len 128
# Clients
maxclients 10000
Connection Pool Configuration
import redis

redis_pool = redis.ConnectionPool(
    host='localhost',
    port=6379,
    db=0,
    max_connections=100,
    retry_on_timeout=True,
    health_check_interval=30,
    socket_connect_timeout=5,
    socket_timeout=5,
)
redis_client = redis.Redis(connection_pool=redis_pool)
6. Fault Diagnosis Handbook
Common Issue Diagnosis
# 1. Check Redis memory usage
redis-cli info memory
# 2. Monitor slow queries
redis-cli slowlog get 10
# 3. View client connections
redis-cli info clients
# 4. Monitor keyspace hit rate
redis-cli info stats | grep keyspace
# 5. View per-database key counts and how many keys carry a TTL
redis-cli info keyspace
Emergency Handling Script
#!/usr/bin/env python3
"""Redis emergency handling tool"""
import sys

import redis


class RedisEmergencyKit:
    def __init__(self, host='localhost', port=6379):
        self.redis_client = redis.Redis(host=host, port=port)

    def flush_expired_keys(self):
        """Clean up keys that are about to expire

        Redis removes expired keys on its own. TTL returns -2 for a missing
        key and -1 for a key without expiry, so only keys in their final
        second (TTL == 0) are deleted here.
        """
        print("Starting cleanup of expired keys...")
        count = 0
        for key in self.redis_client.scan_iter():
            if self.redis_client.ttl(key) == 0:
                self.redis_client.delete(key)
                count += 1
        print(f"Cleanup completed, deleted {count} expired keys")

    def analyze_big_keys(self, limit=10):
        """Analyze biggest keys"""
        print(f"Analyzing top {limit} biggest keys...")
        big_keys = []
        for key in self.redis_client.scan_iter():
            memory = self.redis_client.memory_usage(key)
            if memory:
                big_keys.append((key.decode(), memory))
        big_keys.sort(key=lambda x: x[1], reverse=True)
        for key, memory in big_keys[:limit]:
            print(f"{key}: {memory/1024:.2f} KB")

    def emergency_cache_clear(self, pattern):
        """Urgently clear cache matching pattern"""
        print(f"Clearing cache with pattern {pattern}...")
        count = 0
        # SCAN iterates incrementally, so this does not block the server like KEYS
        for key in self.redis_client.scan_iter(match=pattern):
            self.redis_client.delete(key)
            count += 1
        print(f"Clear completed, deleted {count} keys")


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python emergency_kit.py <command>")
        print("Commands: flush_expired | analyze_big_keys | clear_pattern <pattern>")
        sys.exit(1)
    kit = RedisEmergencyKit()
    cmd = sys.argv[1]
    if cmd == "flush_expired":
        kit.flush_expired_keys()
    elif cmd == "analyze_big_keys":
        kit.analyze_big_keys()
    elif cmd == "clear_pattern" and len(sys.argv) > 2:
        kit.emergency_cache_clear(sys.argv[2])
    else:
        print("Unknown command")
Conclusion
Through this in‑depth analysis we learned the essence of the three classic Redis problems and their solutions:
Cache Penetration: use Bloom filters or null‑value caching to build the first line of defense.
Cache Avalanche: disperse risk with random expiration, multi‑level caches, or mutex locks.
Cache Breakdown: adopt logical expiration or distributed locks to protect hot data.
As operations engineers, we must not only master these solutions but also establish comprehensive monitoring, warm‑up mechanisms, and emergency plans. Remember, good ops is not about never having failures, but about rapid response and recovery.
MaGe Linux Operations