How to Diagnose and Fix Sudden Redis Slowdowns: A Complete Five‑Step Guide
This article provides a systematic, step‑by‑step methodology for identifying the root causes of Redis performance degradation—including big keys, slow queries, expiration spikes, memory limits, fork latency, AOF flushing, memory fragmentation, swap usage, huge pages, and CPU binding—and offers immediate mitigation tactics as well as long‑term architectural solutions to restore and maintain high throughput.
Impact of Redis Slowdowns
Redis is a high‑performance in‑memory data store whose sub‑millisecond latency is critical for many services. When latency spikes, the entire application stack can suffer, leading to time‑outs, degraded user experience, and even service outages.
Five‑Step Root‑Cause Resolution Framework
Precise Diagnosis
Request tracing – instrument services with distributed tracing and separate the time spent in Redis calls from the rest of each request.
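Baseline latency test – on the Redis host, run redis-cli --intrinsic-latency 60 to measure the host's own latency floor for 60 seconds (this mode does not connect to Redis at all, so it must run on the server machine). Then run redis-cli -h 127.0.0.1 -p 6379 --latency-history -i 1 to sample round-trip latency to the instance once per second. Latency well above the intrinsic baseline points at Redis or its workload rather than at the host.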
Slow‑log analysis – read the slow log with redis-cli SLOWLOG GET (SLOWLOG GET 20 returns the latest 20 entries). Check the current threshold with redis-cli CONFIG GET slowlog-log-slower-than and change it with redis-cli CONFIG SET slowlog-log-slower-than 5000; the value is in microseconds, so 5000 means 5 ms.
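When the slow log is long, aggregating it by command makes the worst offenders obvious. The Python sketch below assumes the entries were already fetched and decoded into their first four fields (id, Unix timestamp, execution time in microseconds, argument list), which is what SLOWLOG GET returns for each entry; SlowlogEntry and summarize_slowlog are hypothetical names, not part of any Redis client.

```python
from collections import defaultdict
from typing import NamedTuple


class SlowlogEntry(NamedTuple):
    """First four fields of a SLOWLOG GET entry."""
    entry_id: int
    timestamp: int      # Unix time the command was logged
    duration_us: int    # execution time in microseconds
    args: list          # command name followed by its arguments


def summarize_slowlog(entries, threshold_us=5000):
    """Aggregate entries at or above threshold_us by command name.

    Returns (command, count, total_ms) tuples, largest total time first.
    """
    totals = defaultdict(lambda: [0, 0])  # command -> [count, total_us]
    for e in entries:
        if e.duration_us < threshold_us or not e.args:
            continue
        cmd = str(e.args[0]).upper()
        totals[cmd][0] += 1
        totals[cmd][1] += e.duration_us
    rows = [(cmd, count, total_us / 1000) for cmd, (count, total_us) in totals.items()]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows
```

Ranking by total rather than per-call time surfaces commands that are individually moderate but run constantly.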
Problem‑specific commands
High‑complexity commands: redis-cli SLOWLOG GET + redis-cli CONFIG GET slowlog-log-slower-than
Big‑key scanning: redis-cli --bigkeys and redis-cli MEMORY USAGE <key>
Key‑expiration spikes: redis-cli INFO STATS (watch expired_keys)
Persistence blocking: redis-cli INFO PERSISTENCE, plus latest_fork_usec, which is reported under redis-cli INFO STATS
Memory fragmentation: redis-cli INFO MEMORY
CPU binding issues: redis-cli INFO SERVER for process_id, then taskset -cp <pid> to show the current CPU affinity
Rapid 5‑Minute Mitigation
Eviction policy – CONFIG SET maxmemory-policy volatile-ttl evicts only keys that have a TTL, shortest remaining TTL first.
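Disable dangerous commands – rename-command cannot be set with CONFIG SET; add rename-command KEYS "" to redis.conf and restart, or on Redis 6.0+ revoke the command at runtime through ACLs (e.g., ACL SETUSER default -keys).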
Set the slow‑log threshold – CONFIG SET slowlog-log-slower-than 10000 (in microseconds, i.e. 10 ms, which is also the default).
Reduce active expiration effort – CONFIG SET active-expire-effort 1 (Redis 6.0+; the accepted range is 1 to 10, and 1, the default, is the least aggressive).
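Relax AOF fsync – CONFIG SET appendfsync no leaves flushing to the operating system; AOF stays enabled, but writes since the last OS flush can be lost on a crash.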
Disable Transparent Huge Pages (THP) – echo never > /sys/kernel/mm/transparent_hugepage/enabled.
Bind Redis to specific CPUs – e.g., taskset -c 0,2,4,6 ./redis-server or Redis 6.0 component‑level CPU lists (see step 5).
Graceful Downgrade (Feature Sacrifice)
Turn off non‑critical services (leaderboards, analytics) and read directly from the primary database.
Migrate infrequently accessed big keys to external stores such as MySQL or MongoDB.
Route read traffic to replicas, which are read‑only by default (replica-read-only yes; the older name slave-read-only is still accepted).
Apply rate‑limiting on low‑priority APIs via Nginx limit_req_zone.
Performance Boost (High‑Concurrency Mode)
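Pipeline batch processing – send 100+ commands per round‑trip, so N commands cost about N/100 network round‑trips instead of N; for small commands, round‑trips dominate latency, so this is where most of the saving comes from.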
Tune client connection pools (e.g., Java JedisPool maxTotal=5000 maxIdle=100).
Introduce a local cache (Guava, Caffeine) to shield Redis from cache‑penetration bursts.
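Deploy a proxy layer (e.g., Twemproxy) to spread load across instances; note that Twemproxy shards keys but does not route reads to replicas, so read/write separation needs a proxy that supports it.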
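A minimal sketch of client-side pipeline batching, assuming a hypothetical send_batch callable that writes one batch as a single pipeline and returns the replies in order. With redis-py, send_batch could wrap r.pipeline(transaction=False), queue each command with execute_command, and call execute().

```python
from typing import Callable, List, Sequence

Command = Sequence[str]


def pipelined(commands: Sequence[Command],
              send_batch: Callable[[Sequence[Command]], List],
              batch_size: int = 100) -> List:
    """Send commands in chunks; each send_batch call is one network round-trip."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    replies: List = []
    for start in range(0, len(commands), batch_size):
        # One pipeline flush per chunk: replies come back in command order.
        replies.extend(send_batch(commands[start:start + batch_size]))
    return replies
```

With the default batch size, 1,050 commands cost 11 round-trips instead of 1,050. Capping the batch size also keeps each reply buffer bounded on both client and server.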
Long‑Term Architectural Remedy
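Adopt Redis Cluster to spread the keyspace across shards (a single big key still lives in one hash slot, so split it first), and compress values on the client side (e.g., Protobuf serialization plus zstd).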
Implement multi‑level caching: local → Redis shard → persistent DB.
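Scatter key expirations with random offsets, e.g. redis-cli EXPIREAT key $(( $(date +%s) + 3600 + RANDOM % 300 )), where 3600 is an illustrative base TTL in seconds and RANDOM % 300 adds 0 to 299 s of jitter.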
Switch to mixed RDB+AOF persistence with incremental fsync (appendonly yes, aof-use-rdb-preamble yes, aof-rewrite-incremental-fsync yes).
Enable lazy‑free eviction (CONFIG SET lazyfree-lazy-eviction yes) to offload memory reclamation to background threads.
Permanently disable THP (add the echo command to /etc/rc.local).
Configure component‑level CPU binding in Redis 6.0:
server_cpulist 0-7:2
bio_cpulist 1,3
aof_rewrite_cpulist 8-11
bgsave_cpulist 1,10-11
Key Problem Areas and Detailed Remedies
Big‑Key Issues
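A common rule of thumb treats any string larger than 10 MB, or any collection with more than 10 k elements, as a big key. Such keys cause expensive memory allocation and release, uneven memory across cluster shards, bandwidth saturation, and replication delays. Detect them with redis-cli --bigkeys -i 0.01, which walks the entire keyspace with SCAN (sleeping 0.01 s every 100 SCAN calls to limit load) and reports the largest key per data type. Collections are ranked by element count rather than bytes, so confirm candidates with MEMORY USAGE. Mitigation strategies: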
Avoid storing large blobs in Redis; keep binary data in object storage.
Split large values across multiple keys (e.g., SET user:1000:profile:part1 …).
Use UNLINK (Redis 4.0+) to free memory asynchronously, or make DEL behave like UNLINK with CONFIG SET lazyfree-lazy-user-del yes (Redis 6.0+).
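The size thresholds stated above and the key-splitting idea can be sketched in Python. is_big_key, split_value, and join_parts are hypothetical helpers; the :partN naming simply extends the user:1000:profile:part1 example, and actually writing the parts (for instance with MSET) is left to the client.

```python
STRING_LIMIT_BYTES = 10 * 1024 * 1024  # strings above ~10 MB
COLLECTION_LIMIT = 10_000              # hash/list/set/zset above 10 k elements


def is_big_key(key_type: str, size: int) -> bool:
    """size is a byte length for strings and an element count for collections."""
    if key_type == "string":
        return size > STRING_LIMIT_BYTES
    return size > COLLECTION_LIMIT


def split_value(key: str, value: bytes, part_size: int) -> dict:
    """Split one large value into key:part1, key:part2, ... of at most part_size bytes."""
    if part_size < 1:
        raise ValueError("part_size must be >= 1")
    return {
        f"{key}:part{i + 1}": value[off:off + part_size]
        for i, off in enumerate(range(0, len(value), part_size))
    }


def join_parts(parts: dict, key: str) -> bytes:
    """Reassemble in numeric order (part10 follows part9, not part1)."""
    return b"".join(parts[f"{key}:part{i}"] for i in range(1, len(parts) + 1))
```

Reading a split value then costs one MGET over the part keys instead of one huge reply, and each part can be freed independently.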
Slow Query Accumulation
Commands with O(N) or higher complexity (e.g., KEYS *, HGETALL, SMEMBERS) can block the single Redis main thread. Replace them with cursor‑based iteration (SCAN, HSCAN, SSCAN) or bounded ranges such as ZRANGE key 0 99, and keep each result set small (N ≤ 300). Disable or rename dangerous commands in production.
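The SCAN family is cursor‑based: start at cursor 0, pass back whatever cursor the server returns, and stop when it returns 0. A minimal sketch of that loop, assuming a hypothetical scan(cursor, count) callable that returns (next_cursor, keys); with redis-py, r.scan(cursor=c, count=n) has that shape, and r.scan_iter already wraps the loop.

```python
from typing import Callable, Iterator, List, Tuple

ScanFn = Callable[[int, int], Tuple[int, List[str]]]


def scan_all(scan: ScanFn, count: int = 100) -> Iterator[str]:
    """Iterate the keyspace in small steps instead of one blocking KEYS call.

    COUNT is only a hint to the server, and a key may be returned more than
    once, so callers that need uniqueness should deduplicate.
    """
    cursor = 0
    while True:
        cursor, keys = scan(cursor, count)
        yield from keys
        if cursor == 0:  # the server signals a completed iteration with cursor 0
            break
```

Each call does a bounded amount of work, so other clients are served between steps.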
Mass Expiration
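Redis removes expired keys in two ways: passively, when a client accesses a key that has already expired, and actively, in a periodic cycle run by the main thread. As the Redis documentation describes it, the active cycle runs 10 times per second, tests 20 random keys that have a TTL, deletes the expired ones, and starts again if more than 25 % of them had expired. When many keys expire at the same moment, that loop keeps repeating and causes temporary latency spikes. Countermeasures: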
Scatter expirations with random offsets (see step 5).
Enable lazy freeing for expired keys (CONFIG SET lazyfree-lazy-expire yes) so their memory is released by a background thread.
Monitor expired_keys in INFO STATS; it is a cumulative counter, so a sudden jump in its per‑interval increase indicates expiration‑induced latency.
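The jittered expiry from step 5 can be expressed in Python as well. This is a sketch: the 3600 s base TTL and 300 s window are illustrative assumptions, and jittered_expireat is a hypothetical helper whose result would be passed to EXPIREAT (redis-py exposes it as r.expireat(name, when)).

```python
import random
import time


def jittered_expireat(base_ttl_s: int, max_jitter_s: int = 300, now=None, rng=random) -> int:
    """Absolute Unix expiry: now + base TTL + a random offset in [0, max_jitter_s)."""
    now = int(time.time()) if now is None else now
    # randrange(300) yields 0..299, the same range as RANDOM % 300 in the shell version.
    return now + base_ttl_s + rng.randrange(max_jitter_s)
```

Keys written in the same burst now expire over a 5‑minute window instead of in the same second, so each active‑expiry cycle finds fewer expired keys at once.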
Persistence Fork Overhead
Both RDB snapshots and AOF rewrites fork a child process. fork() copies the parent's page table: with 4 KB pages and 8‑byte entries, a 24 GB instance has about 6.3 million pages and therefore a page table of ~48 MB, and copying it can take seconds on virtualized hardware. The fork call runs in the main thread, blocking client requests until it returns. Diagnose with redis-cli INFO STATS | grep latest_fork_usec (value in µs). Mitigation:
Keep instance size ≤10 GB or split data across multiple instances.
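Prefer AOF‑only mode or schedule RDB snapshots during low‑traffic windows; AOF rewrites fork too, so keep auto-aof-rewrite-percentage and auto-aof-rewrite-min-size high enough that rewrites stay infrequent.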
Use SSDs to reduce child‑process write latency.
On Redis 6.0+, bind the RDB child process to a dedicated CPU list (bgsave_cpulist).
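The page‑table figure above follows from simple arithmetic. This sketch assumes 4 KB pages and 8‑byte last‑level page‑table entries (as on x86‑64) and ignores the much smaller upper‑level tables, so it is an estimate rather than what the kernel reports.

```python
PAGE_SIZE = 4 * 1024  # bytes per standard page (THP disabled)
PTE_SIZE = 8          # bytes per last-level page-table entry on x86-64


def page_table_bytes(resident_bytes: int, page_size: int = PAGE_SIZE) -> int:
    """Approximate size of the last-level page table that fork() has to copy."""
    pages = -(-resident_bytes // page_size)  # ceiling division
    return pages * PTE_SIZE
```

By the same estimate, the ≤10 GB guideline keeps the copied table around 20 MB instead of 48 MB.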
AOF Flush Latency
AOF has three fsync policies:
always – every write is fsynced; highest durability but highest latency.
everysec – background thread fsyncs once per second; moderate latency, up to 1 s data loss.
no – relies on OS periodic flush; lowest latency, possible multi‑second data loss.
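When disk I/O is saturated, the main thread can block even in everysec mode: a write(2) to the AOF file may stall while the background fsync of the same file is still in progress. Redis postpones the write for up to 2 seconds and then writes anyway, counting each such event in aof_delayed_fsync (INFO PERSISTENCE). A practical fix is to skip fsync while a BGSAVE or AOF rewrite is running: CONFIG SET no-appendfsync-on-rewrite yes. This reduces contention at the cost of a larger window of potential data loss, which is acceptable for pure‑cache workloads.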
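The 2‑second postponement can be modelled as a small state machine. This is a simplified sketch of the decision made before each AOF write in everysec mode, not Redis source code; AofFlushModel is a hypothetical name, and delayed_fsync stands in for the aof_delayed_fsync counter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AofFlushModel:
    """Simplified model of the everysec write-postponement rule (not Redis source)."""
    postponed_start: Optional[float] = None  # when the current postponement began
    delayed_fsync: int = 0                   # plays the role of aof_delayed_fsync

    def should_write_now(self, now: float, fsync_in_progress: bool) -> bool:
        if fsync_in_progress:
            if self.postponed_start is None:
                # First time the background fsync is found busy: buffer, retry later.
                self.postponed_start = now
                return False
            if now - self.postponed_start < 2:
                # Still inside the ~2 s grace period: keep buffering.
                return False
            # Grace period exhausted: write anyway; this write may now block.
            self.delayed_fsync += 1
        self.postponed_start = None
        return True
```

A steadily rising aof_delayed_fsync therefore means the disk cannot keep up with one fsync per second.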
Memory Limit Eviction
When maxmemory is reached, Redis evicts keys on the main thread according to the configured policy. The approximated LRU/LFU policies sample keys (maxmemory-samples, default 5) and compare per‑key access metadata, which under heavy eviction can add ~25 ms of latency per eviction cycle. For latency‑sensitive workloads, a random policy (allkeys-random or volatile-random) avoids that bookkeeping. Enable lazy eviction (CONFIG SET lazyfree-lazy-eviction yes) to make the actual memory release asynchronous.
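To show what the sampling policies do that a random policy skips, here is a simplified sketch of sampled LRU: pick a few keys at random and evict the least recently used among them. Real Redis also keeps a small pool of good candidates across eviction cycles, which this sketch omits; pick_victim and pick_victim_random are hypothetical names.

```python
import random
from typing import Dict


def pick_victim(last_access: Dict[str, float], samples: int = 5, rng=random) -> str:
    """Sampled LRU: examine `samples` random keys, evict the least recently used."""
    if not last_access:
        raise ValueError("nothing to evict")
    candidates = rng.sample(list(last_access), min(samples, len(last_access)))
    # The comparison below is the per-eviction bookkeeping a random policy avoids.
    return min(candidates, key=last_access.__getitem__)


def pick_victim_random(last_access: Dict[str, float], rng=random) -> str:
    """allkeys-random style: any key, no access metadata consulted."""
    if not last_access:
        raise ValueError("nothing to evict")
    return rng.choice(list(last_access))
```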
Transparent Huge Pages (THP)
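THP enlarges memory pages to 2 MB, which dramatically increases copy‑on‑write cost during fork: after fork, a write of even one byte makes the kernel copy a whole 2 MB page instead of a 4 KB one, 512 times as much data. Disable it with echo never > /sys/kernel/mm/transparent_hugepage/enabled and add the command to /etc/rc.local so the setting survives reboots.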
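A quick way to verify the setting is to read the sysfs file, where the active mode is the bracketed word (for example always madvise [never]). A minimal Python sketch; the helper names are hypothetical, and thp_is_disabled only works on a Linux host that exposes this file.

```python
def active_thp_mode(sysfs_text: str) -> str:
    """Return the bracketed (active) option, e.g. "always madvise [never]" -> "never"."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError(f"no active mode in {sysfs_text!r}")


def thp_is_disabled(path: str = "/sys/kernel/mm/transparent_hugepage/enabled") -> bool:
    """True when THP is set to never on this host."""
    with open(path) as f:
        return active_thp_mode(f.read()) == "never"
```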
Swap Usage
If the OS swaps Redis memory to disk, latency degrades by orders of magnitude. Detect swap per process with:
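cat /proc/$(pidof redis-server)/smaps | grep -E '^(Size|Swap):'
Key indicators: any Swap: value equal to or larger than the Size: of the same mapping (that region is entirely swapped out), or total swap > 100 MB. If several Redis instances run on the host, substitute the PID of the instance being examined. Mitigation: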
Provision sufficient RAM or disable swap (swapoff -a).
Isolate Redis on a dedicated host.
For a swapped instance, perform a controlled restart after freeing memory or promote a replica.
Memory Fragmentation
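The memory fragmentation ratio (mem_fragmentation_ratio in INFO MEMORY, roughly used_memory_rss / used_memory) above 1.5 means the RSS reported by the OS is well above the memory Redis has actually allocated for data. In Redis 4.0+ (built with the default jemalloc allocator), enable active defragmentation: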
CONFIG SET activedefrag yes
CONFIG SET active-defrag-threshold-lower 10
CONFIG SET active-defrag-threshold-upper 100
CONFIG SET active-defrag-cycle-min 1
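CONFIG SET active-defrag-cycle-max 25
With these settings, defragmentation starts once fragmentation exceeds 10 % (threshold-lower) and the wasted memory exceeds active-defrag-ignore-bytes (default 100mb), runs at maximum effort from 100 % (threshold-upper), and uses between 1 % and 25 % of CPU (cycle-min/max). Versions before 4.0 require a restart to reclaim fragmented memory.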
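The ratio can also be computed from raw INFO MEMORY output. A minimal sketch, assuming the INFO text format (# Section headers and field:value lines); parse_info, fragmentation_ratio, and classify_fragmentation are hypothetical helpers. The below‑1.0 branch follows the Redis documentation's note that RSS below used_memory means part of Redis memory has been swapped out.

```python
def parse_info(text: str) -> dict:
    """Parse INFO output into {field: value} strings, skipping section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def fragmentation_ratio(info: dict) -> float:
    """RSS reported by the OS divided by memory allocated by Redis."""
    return int(info["used_memory_rss"]) / int(info["used_memory"])


def classify_fragmentation(ratio: float) -> str:
    if ratio > 1.5:
        return "fragmented"      # candidate for active defrag or a restart
    if ratio < 1.0:
        return "possible swap"   # RSS below allocated memory
    return "ok"
```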
CPU Binding Pitfalls
Binding Redis to a single logical core can cause contention between the main thread and background tasks (RDB/AOF, lazy‑free). Redis 6.0 introduces component‑level CPU lists to separate workloads:
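# main thread and I/O threads: CPUs 0,2,4,6
server_cpulist 0-7:2
# background (bio) threads for lazy free, AOF fsync and file close: CPUs 1,3
bio_cpulist 1,3
# AOF rewrite child process: CPUs 8-11
aof_rewrite_cpulist 8-11
# BGSAVE child process: CPUs 1,10,11
bgsave_cpulist 1,10-11
redis.conf only accepts comments on their own lines, so keep them off the directive lines. Rather than pinning everything to one logical core, prefer logical cores that share a physical core (hyper‑thread siblings; check the topology with lscpu -e, because CPU numbering varies by machine), so the main thread and background tasks run on separate logical cores while still sharing that core's L1/L2 caches.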
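The cpulist values use the same syntax as taskset, where 0-7:2 means every second CPU from 0 to 7. A small sketch that expands such lists and reports CPUs shared between components; parse_cpulist and find_overlaps are hypothetical helpers.

```python
from typing import Dict, List, Tuple


def parse_cpulist(spec: str) -> List[int]:
    """Expand a cpulist such as "0-7:2" or "1,10-11" into sorted CPU ids."""
    cpus: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        span, _, step_s = part.partition(":")  # optional ":stride"
        step = int(step_s) if step_s else 1
        lo_s, dash, hi_s = span.partition("-")
        lo = int(lo_s)
        hi = int(hi_s) if dash else lo
        if step < 1 or hi < lo:
            raise ValueError(f"bad cpulist element: {part!r}")
        cpus.extend(range(lo, hi + 1, step))
    return sorted(set(cpus))


def find_overlaps(lists: Dict[str, str]) -> Dict[Tuple[str, str], List[int]]:
    """Report every pair of components whose CPU lists share CPUs."""
    names = list(lists)
    expanded = {name: set(parse_cpulist(spec)) for name, spec in lists.items()}
    out = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = sorted(expanded[a] & expanded[b])
            if shared:
                out[(a, b)] = shared
    return out
```

Applied to the configuration above, it flags CPU 1 (bio threads and BGSAVE) and CPUs 10 and 11 (AOF rewrite and BGSAVE). The second overlap is harmless in practice because Redis does not run a BGSAVE and an AOF rewrite at the same time.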
Monitoring & Automation
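Key Prometheus‑compatible metrics to watch (exact metric names depend on the exporter in use):
redis_memory_usage_bytes – flag keys > 10 MB.
redis_memory_fragmentation_ratio – alert when > 1.5.
Swap usage – read the process's VmSwap from /proc/<pid>/status or use host swap metrics; process_virtual_memory_bytes minus process_resident_memory_bytes is not a swap measure, because virtual size also counts address space that was reserved but never touched.
expired_keys – a sudden jump in its rate indicates mass expiration.
latest_fork_usec – fork latency; alert if > 100 000 µs (100 ms).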
Automate periodic --bigkeys scans, slow‑log length checks (SLOWLOG LEN), and scheduled restarts for fragmentation control.
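A minimal sketch of the threshold checks listed above, assuming INFO fields have already been parsed into a dict of strings (mem_fragmentation_ratio and latest_fork_usec as INFO reports them) and that expired_keys is sampled at a fixed interval. check_instance and expired_spike are hypothetical helpers, and the 5x spike factor is an arbitrary assumption to tune per workload.

```python
from typing import Dict, List


def check_instance(info: Dict[str, str], frag_limit: float = 1.5,
                   fork_limit_usec: int = 100_000) -> List[str]:
    """Return human-readable alerts for the fragmentation and fork thresholds."""
    alerts = []
    ratio = float(info.get("mem_fragmentation_ratio", "0"))
    if ratio > frag_limit:
        alerts.append(f"fragmentation ratio {ratio:.2f} > {frag_limit}")
    fork_usec = int(info.get("latest_fork_usec", "0"))
    if fork_usec > fork_limit_usec:
        alerts.append(f"latest fork took {fork_usec / 1000:.0f} ms")
    return alerts


def expired_spike(samples: List[int], factor: float = 5.0) -> bool:
    """samples: successive readings of the cumulative expired_keys counter.

    Flags the last interval if its increase exceeds `factor` times the
    average increase of the earlier intervals.
    """
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    if len(deltas) < 2:
        return False
    baseline = sum(deltas[:-1]) / len(deltas[:-1])
    return deltas[-1] > factor * max(baseline, 1)
```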
