Why Is My Redis Slowing Down? A Complete Troubleshooting Guide

This article provides a systematic, step‑by‑step methodology for diagnosing Redis latency spikes, covering baseline performance testing, slow‑log analysis, high‑complexity commands, big‑key handling, expiration patterns, memory limits, fork overhead, huge‑page settings, AOF configurations, CPU binding, swap usage, memory fragmentation, network saturation, and practical monitoring tips.

Cloud Native Technology Community

Verify Whether Redis Is Slow

First isolate the latency increase to the Redis service. Use distributed tracing in the business service to measure each hop. Then benchmark the Redis instance directly on its host to obtain a baseline latency.

$ redis-cli -h 127.0.0.1 -p 6379 --intrinsic-latency 60

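This test runs on the Redis host and measures the latency floor of the system itself (kernel scheduling, hypervisor), not of Redis. It reports the maximum latency observed during the 60‑second window (e.g., 72 µs). For a continuous view of round‑trip latency to the instance, reported in 1‑second windows, run: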

$ redis-cli -h 127.0.0.1 -p 6379 --latency-history -i 1

On a healthy instance the average latency is typically around 0.08–0.13 ms, although the exact figure depends on the hardware. Treat the instance as slow if its latency is more than twice the baseline measured on a comparable server.
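
The "twice the baseline" rule can be written down as a small check. The following is a minimal sketch, not a monitoring tool: the function name is_slow and the sample readings are hypothetical, and all latencies are assumed to be in milliseconds.

```python
from statistics import mean


def is_slow(samples_ms, baseline_ms, factor=2.0):
    """Return True if the average observed latency exceeds factor x baseline."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    return mean(samples_ms) > factor * baseline_ms


# Hypothetical readings, in milliseconds
baseline_ms = 0.10
print(is_slow([0.09, 0.11, 0.12], baseline_ms))  # close to the baseline
print(is_slow([0.35, 0.40, 0.29], baseline_ms))  # well above 2x the baseline
```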

Analyze Slow‑Log

Enable the slow‑log with a threshold that matches your latency tolerance (e.g., 5 ms) and keep enough entries for analysis:

# Record commands slower than 5 ms
CONFIG SET slowlog-log-slower-than 5000
# Keep the latest 500 entries
CONFIG SET slowlog-max-len 500

Query recent entries:

127.0.0.1:6379> SLOWLOG GET 5

Inspect the command name, arguments, and execution time (in microseconds) to spot expensive operations.
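
When the slow‑log holds hundreds of entries, grouping them by command name makes the worst offenders easier to see. The sketch below works on already fetched entries. The entry shape (a dict with a duration in microseconds and the full command line) is an assumption modelled on what client libraries commonly return, and summarize_slowlog is a hypothetical helper name.

```python
from collections import defaultdict


def summarize_slowlog(entries):
    """Group slow-log entries by command name.

    Each entry is assumed to be a dict with 'duration' (microseconds)
    and 'command' (the full command line as str or bytes).
    """
    stats = defaultdict(lambda: {"count": 0, "total_us": 0, "max_us": 0})
    for entry in entries:
        command = entry["command"]
        if isinstance(command, bytes):
            command = command.decode("utf-8", errors="replace")
        parts = command.split()
        name = parts[0].upper() if parts else "(empty)"
        s = stats[name]
        s["count"] += 1
        s["total_us"] += entry["duration"]
        s["max_us"] = max(s["max_us"], entry["duration"])
    # Worst offenders first, ranked by total time spent
    return sorted(stats.items(), key=lambda kv: kv[1]["total_us"], reverse=True)


# Hypothetical entries for illustration
sample = [
    {"duration": 8200, "command": b"SORT mylist"},
    {"duration": 6100, "command": "sunion s1 s2"},
    {"duration": 9900, "command": b"SORT otherlist"},
]
for name, s in summarize_slowlog(sample):
    print(name, s["count"], s["total_us"], s["max_us"])
```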

High‑Complexity Commands

Commands whose algorithmic complexity is O(N) or higher (e.g., SORT, SUNION, ZUNIONSTORE) or commands with a very large N can consume excessive CPU and increase network payload.

CPU usage rises because the server must process more elements.

Large result sets increase protocol overhead.

Mitigation:

Avoid such commands in hot paths; move aggregation to the client.

If they must be used, keep N small (recommended N ≤ 300) and request only the needed fields.
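
One way to keep N small is to read a large collection in fixed‑size batches rather than in a single call. The sketch below only computes the index pairs; batch_ranges is a hypothetical helper, and the pairs are shaped for range‑style commands whose stop index is inclusive, such as LRANGE.

```python
def batch_ranges(total, batch_size=300):
    """Yield inclusive (start, stop) index pairs covering `total` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        # stop is inclusive, so the last index of a batch is start + size - 1
        yield start, min(start + batch_size, total) - 1


# 700 elements read in batches of at most 300
print(list(batch_ranges(700)))
```

Each pair can then be passed to one range query, so no single command touches more than batch_size elements.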

Big‑Key Problem

A key whose value is very large (a “bigkey”) incurs high allocation cost on write and high deallocation cost on delete.

Detect bigkeys with the built‑in scanner:

$ redis-cli -h 127.0.0.1 -p 6379 --bigkeys -i 0.01

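The output lists the largest key found for each data type, along with key counts and average sizes per type. Note that "size" means bytes for strings but the number of elements for lists, hashes, sets, and sorted sets, so a collection with few but very large elements may not stand out.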

Remediation:

Split oversized values into multiple smaller keys.

For Redis ≥ 4.0 use UNLINK instead of DEL to free memory asynchronously.

For Redis ≥ 6.0 enable lazyfree-lazy-user-del yes to perform deletion in background threads.
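
Splitting an oversized hash usually means routing each field to one of several smaller keys. The following sketch shows one possible routing scheme, under the assumption that fields can be distributed by hash. The shard_key helper, the key naming pattern, and the shard count are all hypothetical.

```python
import zlib


def shard_key(base_key, field, shards=16):
    """Map a field of one large hash onto one of `shards` smaller hashes."""
    if shards <= 0:
        raise ValueError("shards must be positive")
    # crc32 is stable across processes, unlike Python's built-in hash()
    index = zlib.crc32(field.encode("utf-8")) % shards
    return f"{base_key}:{index}"


# The same field always maps to the same smaller key
print(shard_key("user:profile", "user-1001"))
print(shard_key("user:profile", "user-1001"))
```

Readers and writers must use the same function and the same shard count, otherwise fields become unreachable.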

Concentrated Expiration

When many keys expire at the same timestamp (e.g., via EXPIREAT ), Redis’s active expiration task runs on the main thread and can block client requests, causing latency spikes that do not appear in the slow‑log.

Mitigation:

Randomize expiration times so that deletions are spread over time.

For Redis ≥ 4.0 enable lazy‑free expiration with lazyfree-lazy-expire yes.
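
Randomizing expiration can be as simple as adding a random offset to the base TTL before setting it. This is a minimal sketch: jittered_ttl is a hypothetical helper, and the one‑hour base with a five‑minute spread is an arbitrary example rather than a recommendation.

```python
import random


def jittered_ttl(base_seconds, spread_seconds, rng=random):
    """Return base_seconds plus a random offset in [0, spread_seconds]."""
    if base_seconds <= 0 or spread_seconds < 0:
        raise ValueError("base must be positive and spread non-negative")
    return base_seconds + rng.randint(0, spread_seconds)


# Seeded generator so the illustration is repeatable
rng = random.Random(42)
ttls = [jittered_ttl(3600, 300, rng) for _ in range(5)]
print(ttls)
```

The resulting value would be passed to EXPIRE (or added to the timestamp used with EXPIREAT) so that keys written together no longer expire together.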

Memory Limit and Eviction Policies

If maxmemory is reached, Redis evicts keys before accepting new writes, adding latency. Common policies:

allkeys-lru – evict the least‑recently‑used key regardless of expiration.

volatile-lru – evict the least‑recently‑used key that has an expiration.

allkeys-random, volatile-random, volatile-ttl, noeviction, allkeys-lfu, volatile-lfu.

Select a policy that matches the workload; allkeys-lru and volatile-lru are the most frequently used.

Fork Overhead (RDB / AOF Rewrite)

Background persistence creates a child process via fork(). The parent must copy its page tables, which can take seconds for large instances and block the main thread.

Check the duration of the last fork in the INFO output:

# Duration of the most recent fork, in microseconds
latest_fork_usec:59477

If the value is large, consider reducing instance size, scheduling persistence during off‑peak hours, or disabling AOF rewrite on replicas.
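
For alerting, the fork duration can be pulled out of captured INFO text and converted to milliseconds. The sketch below assumes the plain key:value layout of INFO. The helper names and the 1000 ms threshold are hypothetical and should be adapted to your own latency budget.

```python
def parse_info(text):
    """Parse INFO-style 'key:value' lines into a dict, skipping section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields


def fork_time_ms(info_text):
    """Return the duration of the last fork in milliseconds."""
    return int(parse_info(info_text)["latest_fork_usec"]) / 1000.0


def fork_is_slow(info_text, threshold_ms=1000.0):
    """Flag forks slower than a (hypothetical) threshold."""
    return fork_time_ms(info_text) > threshold_ms


sample = "# Stats\r\nlatest_fork_usec:59477\r\n"
print(fork_time_ms(sample), fork_is_slow(sample))
```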

Transparent Huge Pages

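Linux transparent huge pages (THP) let the kernel back memory with 2 MiB pages instead of 4 KiB ones. This reduces the number of allocations, but each allocation takes longer. The cost is highest while a persistence child process is running: under copy‑on‑write, modifying even a few bytes forces the parent to copy a full 2 MiB page, which increases write latency.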

Check the setting:

$ cat /sys/kernel/mm/transparent_hugepage/enabled

If the output shows [always], disable it (as root; the setting reverts on reboot unless you also persist it in your boot configuration):

$ echo never > /sys/kernel/mm/transparent_hugepage/enabled
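
The file marks the active mode with square brackets, which makes it easy to check from a script across many hosts. A minimal sketch, assuming the file content has already been read into a string; active_thp_mode is a hypothetical helper name.

```python
import re


def active_thp_mode(contents):
    """Return the bracketed (active) mode from the THP 'enabled' file."""
    match = re.search(r"\[(\w+)\]", contents)
    if match is None:
        raise ValueError(f"unrecognized format: {contents!r}")
    return match.group(1)


print(active_thp_mode("[always] madvise never"))
print(active_thp_mode("always madvise [never]"))
```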

AOF Persistence Impact

Redis supports three appendfsync policies:

always – every write is flushed to disk; highest durability but highest latency.

no – writes stay in OS buffers; lowest latency but risk of data loss.

everysec – a background thread fsyncs once per second; a balance.

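An AOF rewrite generates heavy disk I/O, which can stall the fsync calls issued for the live AOF and, in turn, block the main thread. To avoid this, you can suspend fsync while a rewrite is in progress: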

# Disable AOF fsync during rewrite
no-appendfsync-on-rewrite yes

Be aware that this increases the risk of data loss if a crash occurs during the rewrite.

CPU Binding

Binding the Redis process to a single logical core can cause the forked persistence processes to compete for the same CPU, worsening latency.

For Redis 6.0+ you can bind different thread groups to separate core sets:

# Server and I/O threads on cores 0,2,4,6
server_cpulist 0-7:2
# Background I/O threads on cores 1,3
bio_cpulist 1,3
# AOF rewrite on cores 8‑11
aof_rewrite_cpulist 8-11
# RDB save on cores 1,10,11
bgsave_cpulist 1,10-11

Apply CPU binding only after understanding the server’s architecture and workload.
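
Before applying such a configuration it helps to expand each list and check which cores are shared between groups. The sketch below expands the list syntax as described by the comments above (comma‑separated entries, ranges, and an optional :stride). It is an illustrative parser written for this article, not Redis's own, and expand_cpulist is a hypothetical name.

```python
def expand_cpulist(spec):
    """Expand a CPU list such as '0-7:2' or '1,10-11' into sorted core ids."""
    cores = set()
    for part in spec.split(","):
        part = part.strip()
        rng, _, step = part.partition(":")
        first, _, last = rng.partition("-")
        start = int(first)
        stop = int(last) if last else start
        stride = int(step) if step else 1
        if stop < start or stride <= 0:
            raise ValueError(f"invalid range: {part!r}")
        cores.update(range(start, stop + 1, stride))
    return sorted(cores)


server = expand_cpulist("0-7:2")
bio = expand_cpulist("1,3")
bgsave = expand_cpulist("1,10-11")
print(server, bio, bgsave)
# Cores used by both the background threads and the RDB child
print(sorted(set(bio) & set(bgsave)))
```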

Swap Usage

If Redis starts swapping, latency can increase to hundreds of milliseconds because memory pages are read from disk.

Check swap usage for the Redis process:

# Find the Redis PID
ps aux | grep redis-server
# Inspect swap per memory region (replace $pid with the actual PID)
grep -E '^(Swap|Size):' /proc/$pid/smaps

If a significant amount of memory is swapped, add RAM or free memory and restart the instance (preferably after a master‑slave switchover).
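
Because smaps prints one Swap line per memory region, the per‑region values have to be added up to get a total. The following sketch sums them from captured text. The sample content is made up for illustration, and total_swap_kb is a hypothetical helper name.

```python
def total_swap_kb(smaps_text):
    """Sum every 'Swap:' line (in kB) from /proc/<pid>/smaps content."""
    total = 0
    for line in smaps_text.splitlines():
        # Matches 'Swap:' only, not related fields such as 'SwapPss:'
        if line.startswith("Swap:"):
            total += int(line.split()[1])
    return total


# Made-up excerpt covering three memory regions
sample = """Size:               1564 kB
Swap:                  0 kB
Size:                132 kB
Swap:                 12 kB
SwapPss:              12 kB
Size:              65536 kB
Swap:                462 kB
"""
print(total_swap_kb(sample))
```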

Memory Fragmentation

Fragmentation ratio = used_memory_rss / used_memory. A ratio > 1.5 indicates > 50 % fragmentation.
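
The ratio can be computed directly from the two fields in the memory section of INFO. This is a minimal sketch over captured INFO text. The sample values are made up, and fragmentation_ratio is a hypothetical helper name.

```python
def fragmentation_ratio(info_text):
    """Compute used_memory_rss / used_memory from INFO-style text."""
    fields = {}
    for line in info_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and not key.startswith("#"):
            fields[key] = value
    used = int(fields["used_memory"])
    rss = int(fields["used_memory_rss"])
    if used <= 0:
        raise ValueError("used_memory must be positive")
    return rss / used


# Made-up values: 1.6 MB resident for 1.0 MB of data
sample = "# Memory\nused_memory:1000000\nused_memory_rss:1600000\n"
ratio = fragmentation_ratio(sample)
print(ratio, ratio > 1.5)
```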

Mitigation:

Redis < 4.0 – restart the instance.

Redis ≥ 4.0 – enable active defragmentation:

activedefrag yes
active-defrag-ignore-bytes 100mb
active-defrag-threshold-lower 10
active-defrag-threshold-upper 100
active-defrag-cycle-min 1
active-defrag-cycle-max 25
active-defrag-max-scan-fields 1000

Test the impact before enabling in production because defragmentation consumes CPU.

Network Bandwidth Saturation

When a Redis instance consumes the entire network bandwidth of its host, TCP retransmissions and packet loss increase latency. Monitor network traffic and scale out or migrate heavy instances.
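
A rough utilization figure can be derived from two samples of a byte counter (for example, the network byte totals that INFO reports) and the nominal link speed. The sketch below is an approximation under those assumptions; link_utilization is a hypothetical helper, and the numbers are made up.

```python
def link_utilization(bytes_before, bytes_after, interval_s, link_mbps):
    """Fraction of link capacity used between two byte-counter samples."""
    if interval_s <= 0 or link_mbps <= 0:
        raise ValueError("interval and link speed must be positive")
    bits_per_second = (bytes_after - bytes_before) * 8 / interval_s
    return bits_per_second / (link_mbps * 1_000_000)


# Made-up example: 1.125 GB sent in 10 s over a 1000 Mbit/s link
usage = link_utilization(0, 1_125_000_000, 10, 1000)
print(usage)
```

Values that stay close to 1.0 suggest the link itself is the bottleneck rather than Redis.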

Practical Operational Tips

Prefer long‑lived connections; avoid frequent short connections that add TCP handshake overhead.

Collect comprehensive metrics via INFO (e.g., expired_keys, latency, CPU, memory, network) and set alerts.

Ensure monitoring tools themselves use persistent connections and reasonable polling intervals.

Run Redis on dedicated servers; avoid co‑locating unrelated workloads that compete for CPU, memory, or disk.

Conclusion

Redis latency is affected by command complexity, key size, expiration patterns, memory limits, persistence mechanisms, OS‑level features (transparent huge pages, swap, NUMA), CPU binding, and network conditions. The checks above cover these in turn: baseline latency testing, slow‑log analysis, big‑key detection, expiration randomization, an appropriate eviction policy, fork‑time monitoring, disabling huge pages, tuning AOF fsync, careful CPU binding, avoiding swap, and managing fragmentation and bandwidth. Working through them systematically lets you identify the root cause of a latency spike quickly and apply a targeted fix.
