How to Diagnose and Optimize Slow Redis Access: A Step‑by‑Step Guide

This article walks through a real‑world Redis latency incident, detailing systematic troubleshooting steps, key metrics to monitor, command‑line diagnostics, and practical optimizations such as scaling, pipelining, and hot‑key mitigation to restore service performance.

dbaplus Community

1. Basic Service Troubleshooting

When a latency alarm appears for a module that accesses Redis, the first step is to locate the slow link in the processing chain. In the example the bottleneck was the B stage of module A. Two parallel investigation paths are recommended:

Check the module itself

Check data volume

For the module‑centric check, examine the basic resource metrics:

CPU and memory usage of the service.

Node load: confirm the host itself is not overloaded. A healthy node is unlikely to be the reason a single service slows down.

Disk usage on the storage node.

If all indicators (CPU, memory, network I/O, disk I/O) are normal and no recent deployment occurred, move to the Redis layer.

2. Redis Service Troubleshooting

2.1 Network latency

Measure the round‑trip latency between the application server and the Redis node. Typical values are ~200 µs for TCP/IP on a 1 Gbps network and ~30 µs for Unix‑domain sockets. Consider OS scheduling, NUMA effects, and virtualization overhead.
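To make the effect of round‑trip latency concrete, here is a rough back‑of‑the‑envelope model. It is only a sketch: the function name, the 1 µs per‑command server cost, and the batch size are illustrative assumptions, not measurements from the incident.

```python
def estimated_total_us(n_commands, rtt_us, server_us_per_cmd=1.0, batch_size=1):
    """Rough end-to-end time (in microseconds) for n_commands sent in batches.

    Assumes each batch costs one network round trip and that the server
    spends a fixed, hypothetical server_us_per_cmd on every command.
    """
    if n_commands < 0 or batch_size < 1:
        raise ValueError("n_commands must be >= 0 and batch_size >= 1")
    round_trips = -(-n_commands // batch_size)  # ceiling division
    return round_trips * rtt_us + n_commands * server_us_per_cmd


# 1,000 commands over TCP at ~200 us per round trip, one command per trip
print(estimated_total_us(1000, rtt_us=200))                  # 201000.0
# The same commands sent 100 per round trip
print(estimated_total_us(1000, rtt_us=200, batch_size=100))  # 3000.0
```

Under these assumptions the network accounts for almost all of the elapsed time, which is why the number of round trips matters more than the per‑command cost.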

2.2 Redis internal health checks

# Intrinsic latency of the host (run on the Redis server itself; samples for 60 seconds)
redis-cli -h 127.0.0.1 -p 6379 --intrinsic-latency 60

# Latency history (min/max/avg round‑trip latency, reported per 1‑second window)
redis-cli -h 127.0.0.1 -p 6379 --latency-history -i 1

# Throughput statistics
redis-cli -h 127.0.0.1 -p 6379 info stats

Key metrics to review:

total_commands_processed
instantaneous_ops_per_sec
total_net_input_bytes / total_net_output_bytes
instantaneous_input_kbps / instantaneous_output_kbps

Even though Redis processes most commands in the sub‑microsecond range, network latency often dominates the end‑to‑end response time.
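The cumulative counters are most useful as deltas between two snapshots. The following sketch parses the text printed by `info stats` and derives average rates over an interval; the helper names and the idea of taking snapshots a fixed number of seconds apart are assumptions made for illustration.

```python
def parse_info(text):
    """Parse `redis-cli info` output ("key:value" lines) into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and section headers
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def rates_between(before, after, seconds):
    """Average ops/s and network KB/s between two parsed snapshots."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")

    def delta(name):
        return int(after[name]) - int(before[name])

    return {
        "ops_per_sec": delta("total_commands_processed") / seconds,
        "input_kb_per_sec": delta("total_net_input_bytes") / 1024 / seconds,
        "output_kb_per_sec": delta("total_net_output_bytes") / 1024 / seconds,
    }
```

Comparing these averages with instantaneous_ops_per_sec can help show whether a spike is sustained or momentary.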

2.3 Data‑storage checks

Inspect key‑space size:

redis-cli -h 127.0.0.1 -p 6379 info keyspace

Best practice is to keep a single instance under ~10 k keys. Detect oversized keys with:

redis-cli -h 127.0.0.1 -p 6379 --bigkeys -i 0.01

Evaluate memory fragmentation:

redis-cli -h 127.0.0.1 -p 6379 info memory
# fragmentation ratio = used_memory_rss / used_memory (also reported as mem_fragmentation_ratio)
# Values > 1.5 indicate excessive fragmentation
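As a small illustration of the ratio above, this sketch computes it from the text of `info memory`. The function name and the sample numbers are made up; only the field names used_memory and used_memory_rss come from Redis.

```python
def fragmentation_ratio(info_memory_text):
    """Return used_memory_rss / used_memory from `info memory` output."""
    used = rss = None
    for line in info_memory_text.splitlines():
        key, _, value = line.strip().partition(":")
        # Exact key match, so used_memory_human and similar fields are ignored.
        if key == "used_memory":
            used = int(value)
        elif key == "used_memory_rss":
            rss = int(value)
    if not used or rss is None:
        raise ValueError("used_memory / used_memory_rss not found")
    return rss / used


# Hypothetical sample, not taken from the incident
sample = "used_memory:1000000\nused_memory_human:976.56K\nused_memory_rss:1800000\n"
ratio = fragmentation_ratio(sample)
print(ratio, "fragmented" if ratio > 1.5 else "ok")
```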

2.4 Request‑side analysis

Client connections and blocked clients:

redis-cli -h 127.0.0.1 -p 6379 client list

In the case study the output showed ~430 connections and no blocked clients.

Slow‑command log (default threshold 10 ms, i.e. slowlog-log-slower-than = 10000 µs):

redis-cli -h 127.0.0.1 -p 6379 SLOWLOG GET

Cache‑miss rate can be read from info stats (keyspace_hits and keyspace_misses), but was not a factor in the case study.
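For reference, the miss rate is misses divided by total lookups. A minimal sketch, assuming the input is the raw text of `info stats` (the function name and sample values are hypothetical):

```python
def cache_miss_rate(info_stats_text):
    """Return keyspace_misses / (keyspace_hits + keyspace_misses), or 0.0 if idle."""
    hits = misses = 0
    for line in info_stats_text.splitlines():
        key, _, value = line.strip().partition(":")
        if key == "keyspace_hits":
            hits = int(value)
        elif key == "keyspace_misses":
            misses = int(value)
    lookups = hits + misses
    return misses / lookups if lookups else 0.0


# Hypothetical counters: 900 hits and 100 misses
print(cache_miss_rate("keyspace_hits:900\nkeyspace_misses:100\n"))  # 0.1
```

Both counters are cumulative since server start, so a recent change in miss rate is easier to see by comparing two snapshots.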

Hot‑key detection with redis-cli --hotkeys (Redis ≥ 4.0.3) requires the eviction policy (maxmemory-policy) to be set to allkeys-lfu or volatile-lfu. After enabling the policy, redis-cli --hotkeys or custom scripts can reveal keys with disproportionate access frequency. The investigation identified a hot key that caused high CPU usage while overall OPS remained modest.
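One possible shape for such a custom script is to count key frequency in a short capture of MONITOR output. This is only a sketch and not the method used in the case study: it assumes the second quoted token of each line is the key (untrue for multi‑key commands and for commands such as CONFIG GET), and MONITOR itself adds noticeable load, so captures should be brief.

```python
import re
from collections import Counter

# Matches double-quoted arguments as printed by MONITOR, allowing escaped quotes.
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')


def hot_keys(monitor_lines, top=3):
    """Return the `top` most frequently accessed keys as (key, count) pairs."""
    counts = Counter()
    for line in monitor_lines:
        args = QUOTED.findall(line)
        if len(args) >= 2:  # command plus at least one argument
            counts[args[1]] += 1
    return counts.most_common(top)


# Hypothetical capture in MONITOR's line format
capture = [
    '1700000000.000001 [0 10.0.0.5:50001] "GET" "user:42"',
    '1700000000.000102 [0 10.0.0.6:50002] "GET" "user:42"',
    '1700000000.000203 [0 10.0.0.5:50001] "SET" "cart:7" "x"',
    '1700000000.000304 [0 10.0.0.7:50003] "GET" "user:42"',
    '1700000000.000405 [0 10.0.0.7:50003] "PING"',
]
print(hot_keys(capture, top=2))  # [('user:42', 3), ('cart:7', 1)]
```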

2.5 Architectural insights

Redis uses a single‑threaded event loop because CPU is rarely the bottleneck; network I/O and memory bandwidth dominate. To exploit multiple cores, operators can:

Run multiple Redis instances (sharding).

Deploy a Redis cluster.

Enable multi‑threaded I/O (Redis 6.0+).

Use DPDK or kernel‑bypass networking for ultra‑low latency.

Reduce round‑trips with pipelines or Lua scripts.
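The last option, reducing round trips with pipelines, can be sketched as follows. The function takes an already constructed client and relies on the redis-py style interface (`pipeline(transaction=False)`, `set`, `execute`); the function name and the batch size of 500 are illustrative assumptions, not values from the incident.

```python
def write_in_batches(client, items, batch_size=500):
    """SET every (key, value) pair, sending one pipeline per batch.

    `client` is assumed to expose a redis-py style pipeline() method.
    Returns the number of round trips (pipeline flushes) performed.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    pipe = client.pipeline(transaction=False)  # plain batching, no MULTI/EXEC
    round_trips = 0
    pending = 0
    for key, value in items:
        pipe.set(key, value)  # buffered locally until execute()
        pending += 1
        if pending == batch_size:
            pipe.execute()  # one network round trip for the whole batch
            round_trips += 1
            pending = 0
    if pending:  # flush the final partial batch
        pipe.execute()
        round_trips += 1
    return round_trips


# Usage, assuming redis-py is installed and a server is reachable:
#   import redis
#   client = redis.Redis(host="127.0.0.1", port=6379)
#   write_in_batches(client, ((f"k{i}", i) for i in range(10_000)))
```

Bounding the batch size keeps any single flush from occupying the server's event loop for too long while still removing most round trips.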

3. Reproducing the Issue and Validation

Two validation approaches were used:

Local demo: Implement pipelines and measure context‑switch count with perf. Pipelines reduce the number of system calls and context switches.

Online simulation: Replay Kafka data to stress the service chain. Example commands:

# Produce 10 000 messages to a single partition
echo "test" | kaf produce kv__0.111 -n 10000 -b qapm-tencent-cp-kafka:9092
# Produce 10 000 messages to each of 9 partitions
for i in {0..8}; do echo "test" | kaf produce kv__0.111 -n 10000 -p $i -b qapm-tencent-cp-kafka:9092; done

Initial pressure was insufficient; deploying the producer on every Kafka broker finally generated enough load to observe CPU and memory spikes on the Redis side.

After confirming the hot key as the root cause, three mitigation strategies were applied:

Read‑write splitting across multiple Redis instances (classic master‑slave or cluster read‑only replicas).

Pipeline batch writes for a single‑instance deployment.

Introduce an additional caching layer (e.g., local in‑process cache) when pipelines alone cannot meet latency requirements.

Implementing the extra cache layer reduced both latency and CPU consumption, confirming the effectiveness of hot‑key mitigation.
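The extra cache layer can be as small as an in‑process dictionary with a short time‑to‑live in front of Redis. The sketch below is a generic illustration, not the implementation from the case study: the class name, the 1‑second TTL, and the injectable clock are assumptions, and the class is not thread‑safe.

```python
import time


class LocalTTLCache:
    """Tiny in-process cache that falls back to `loader` on a miss or expiry."""

    def __init__(self, loader, ttl_seconds=1.0, clock=time.monotonic):
        self._loader = loader    # e.g. a function that reads the key from Redis
        self._ttl = ttl_seconds  # how long a local copy may be served
        self._clock = clock      # injectable so tests can control time
        self._entries = {}       # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # fresh local copy, no network round trip
        value = self._loader(key)  # miss or expired: go to the backing store
        self._entries[key] = (now + self._ttl, value)
        return value


# Usage, assuming a redis-py client named `client` already exists:
#   cache = LocalTTLCache(loader=client.get, ttl_seconds=0.5)
#   value = cache.get("hot:key")
```

The TTL bounds how stale a hot key can become, so it should be chosen according to how much staleness the service can tolerate.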

4. Key Commands and Metrics Summary

# Intrinsic latency test
redis-cli --intrinsic-latency 60
# Latency history
redis-cli --latency-history -i 1
# Throughput and stats
redis-cli info stats
# Keyspace size
redis-cli info keyspace
# Big‑key scan
redis-cli --bigkeys -i 0.01
# Client list
redis-cli client list
# Slowlog
redis-cli SLOWLOG GET

Monitoring these indicators together provides a systematic method to isolate performance regressions in Redis‑backed services.
