
How to Diagnose and Fix Slow Redis Responses: A Step-by-Step Guide

This article walks through practical methods for troubleshooting slow-service alerts and diagnosing Redis performance bottlenecks, and shows how to reproduce issues with local demos and load simulations. It offers concrete metrics, command-line checks, and mitigation strategies such as scaling, rate limiting, and pipeline optimization.


01 First Key Point: Basic Service Troubleshooting Methods

When an alert appears at the end of the workday, the first step is to identify which link in the processing chain has become slow. In the example, module A’s B stage showed increased latency, prompting a reverse lookup to rule out inter‑module network or bandwidth issues.

Investigate module A itself.

Examine data volume problems.

1.1 Check basic resource data of module A

Memory normal.

CPU normal.

1.2 Check node load

Node load is normal; a node-level problem would rarely manifest in just a single service.

1.3 Check disk usage

Storage nodes are healthy.

1.4 Recent release?

No recent deployment, so the focus shifts to data volume.

2.1 Verify data volume increase

Report volume grew five‑fold; applying scaling, rate‑limiting, and service degradation resolved the issue.
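The rate-limiting part of that mitigation can be sketched as a token bucket. This is a minimal illustration under assumed names and parameters, not the service's actual limiter:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: refills `rate` tokens per
    second up to `capacity`; each allowed request consumes one token,
    so bursts up to `capacity` pass and sustained load is capped."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Requests that return False would be rejected or queued, which is how a five-fold traffic spike is kept from overwhelming downstream stages.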

02 Second Key Point: Redis Service Troubleshooting Basics

When Redis response times are slow, the investigation follows three layers: service‑level problems, data‑storage issues, and request‑side factors.

1. Confirm whether the issue is network latency on the Redis node

Network latency, packet loss, and OS scheduling can add hundreds of microseconds even on a 1 Gbit/s link.
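One rough way to put a number on client-to-node network latency is to time TCP handshakes and keep the minimum, which filters out client-side scheduling noise. A sketch (the host and port are placeholders, not values from the incident):

```python
import socket
import time

def tcp_connect_rtt(host, port, samples=5):
    """Time several TCP handshakes to the Redis node and return the
    best (minimum) round trip in milliseconds."""
    best = float("inf")
    for _ in range(samples):
        start = time.perf_counter()
        # A completed connect() means one full round trip happened.
        with socket.create_connection((host, port), timeout=1.0):
            pass
        best = min(best, (time.perf_counter() - start) * 1000)
    return best
```

If this probe stays in the sub-millisecond range while Redis responses are slow, the network is unlikely to be the culprit, consistent with how it was ruled out here.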

In this case, data‑volume reduction also reduced latency, so network problems were ruled out.

2. Test intrinsic latency and latency history

redis-cli -h 127.0.0.1 -p 6379 --intrinsic-latency 60
redis-cli -h 127.0.0.1 -p 6379 --latency-history -i 1

Both commands showed no abnormal delays.

3. Examine throughput and replication metrics

# Total commands processed since last restart
total_commands_processed:2255
# Instantaneous OPS
instantaneous_ops_per_sec:12
# Network I/O
total_net_input_bytes:34312
total_net_output_bytes:78215
# KB/s input/output
instantaneous_input_kbps:1.20
instantaneous_output_kbps:2.62

Metrics indicated normal throughput; replication was not in use.
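Since INFO returns plain colon-separated text, it is easy to feed into alerting. A minimal parsing sketch (the function name is an assumption, not part of any Redis client):

```python
def parse_info(raw):
    """Parse Redis INFO text into a dict, converting numeric fields so
    thresholds (e.g. on instantaneous_ops_per_sec) can be checked
    directly. Comment lines starting with '#' are section headers."""
    metrics = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        try:
            # Integers stay integers; values with a dot become floats.
            value = float(value) if "." in value else int(value)
        except ValueError:
            pass  # non-numeric fields (e.g. role:master) stay strings
        metrics[key] = value
    return metrics
```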

4. Check memory‑related indicators

Key metrics such as used_memory_rss_human, used_memory_peak_human, and mem_fragmentation_ratio were inspected. No excessive fragmentation or memory pressure was found.
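mem_fragmentation_ratio is roughly used_memory_rss divided by used_memory, and a common rule of thumb is that values well above 1.5 suggest fragmentation while values below 1.0 suggest the OS has swapped Redis memory out. A small sketch of that check (thresholds are the usual heuristics, not values from this incident):

```python
def classify_fragmentation(used_memory, used_memory_rss):
    """Classify mem_fragmentation_ratio (used_memory_rss / used_memory)
    using the common heuristics: ~1.0-1.5 healthy, >1.5 fragmented,
    <1.0 likely swapping."""
    ratio = used_memory_rss / used_memory
    if ratio < 1.0:
        return ratio, "possible swapping"
    if ratio > 1.5:
        return ratio, "high fragmentation"
    return ratio, "healthy"
```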

5. Investigate key‑space health

Key count (info keyspace) was within limits, and no big keys were present.
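For a quick summary, redis-cli --bigkeys does this scan for you; programmatically, the same idea is a cursor walk with SCAN (non-blocking, unlike KEYS *) plus MEMORY USAGE per key. A sketch assuming a redis-py-style client interface:

```python
def find_big_keys(client, threshold_bytes=1_000_000, count=100):
    """Walk the keyspace with SCAN and report keys whose MEMORY USAGE
    exceeds the threshold. `client` is assumed to expose redis-py
    style scan()/memory_usage() methods."""
    big, cursor = [], 0
    while True:
        # SCAN returns the next cursor and a page of keys;
        # cursor 0 marks the end of the iteration.
        cursor, keys = client.scan(cursor=cursor, count=count)
        for key in keys:
            size = client.memory_usage(key) or 0
            if size >= threshold_bytes:
                big.append((key, size))
        if cursor == 0:
            return big
```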

6. Look for hot‑key behavior

Hot‑key monitoring (available from Redis 5.0) revealed a few hot keys that caused CPU spikes without a corresponding increase in OPS.
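Server-side, redis-cli --hotkeys relies on an LFU maxmemory-policy being configured; an alternative that needs no server support is sampling accesses on the client side. A minimal sketch of that approach (the class is hypothetical, not part of any client library):

```python
from collections import Counter

class HotKeySampler:
    """Client-side hot-key detection: count key accesses in the
    application and periodically report the most frequent ones."""

    def __init__(self):
        self.counts = Counter()

    def record(self, key):
        self.counts[key] += 1

    def top(self, n=3):
        # Most frequently accessed keys first.
        return self.counts.most_common(n)
```

In production this would typically sample (e.g. one access in a hundred) to keep overhead negligible.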

7. Analyze CPU usage

Although Redis is single‑threaded, high CPU usage can stem from network I/O and hot‑key contention.

03 Third Key Point: Reproducing the Issue and Testing Basics

Two approaches are used: a local demo and an online simulation.

3.1 Local demo

Pipeline and Lua scripting were tested; perf showed fewer context switches when using pipeline.
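The pipeline advantage seen in the demo comes from batching: in RESP (Redis's wire protocol) a pipeline is just many commands concatenated into a single write, so N commands cost one network round trip instead of N. A sketch of the encoding, for illustration only (a real client library handles this):

```python
def encode_resp_command(*args):
    """Encode one command in RESP: an array of bulk strings, e.g.
    PING -> b'*1\\r\\n$4\\r\\nPING\\r\\n'."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = str(arg).encode()
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)

def pipeline_buffer(commands):
    """A pipeline is just concatenated commands sent in one write;
    fewer syscalls also means fewer context switches, matching the
    perf observation above."""
    return b"".join(encode_resp_command(*cmd) for cmd in commands)
```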

3.2 Online simulation

Kafka data were replayed to generate load on service Y, which forwards traffic to Redis. Various scaling attempts (single producer, multiple partitions, multiple producers) were made until CPU and memory pressure on Redis matched the observed hot‑key pattern.

cat xxx-test | kaf produce kv__0.111 -n 10000 -b qapm-tencent-cp-kafka:9092
for i in {0..8}; do cat xxx-test | kaf produce kv__0.111 -n 10000 -p ${i} -b qapm-tencent-cp-kafka:9092; done

After adding coroutine workers to service Y, Redis CPU rose, confirming the hot‑key impact.

Mitigation

For multi‑instance deployments, use read/write separation.

For single‑instance setups, enable pipeline batch writes.

If pipeline is insufficient, add an application‑level cache.

Applying the third solution reduced latency and CPU usage dramatically.
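The third mitigation, an application-level cache, can be sketched as a small in-process TTL cache in front of Redis: repeated reads of a hot key within the TTL window never reach Redis at all. This is a minimal single-threaded illustration with assumed names, not the service's actual implementation:

```python
import time

class LocalTTLCache:
    """Process-local cache in front of Redis: hot keys are served from
    memory for `ttl` seconds, absorbing hot-key read traffic."""

    def __init__(self, ttl=1.0):
        self.ttl = ttl
        self.store = {}  # key -> (value, fetch_time)

    def get(self, key, loader):
        entry = self.store.get(key)
        now = time.monotonic()
        if entry and now - entry[1] < self.ttl:
            return entry[0]           # fresh local hit, Redis untouched
        value = loader(key)           # miss/stale: fall back to Redis
        self.store[key] = (value, now)
        return value
```

The trade-off is staleness of up to `ttl` seconds, which is usually acceptable for hot-key read paths.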

Tags: monitoring, performance, operations, Redis, troubleshooting
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
