
How to Diagnose and Fix Intermittent Timeout & READONLY Errors in Redis Clusters

This guide explains why a production Redis cluster may show intermittent timeouts and READONLY errors, analyzes root causes such as network jitter, resource saturation, and costly commands, and provides step‑by‑step troubleshooting and configuration fixes to restore stability and performance.

Full-Stack DevOps & Kubernetes

Dec 25, 2024

Problem Description

In a production environment, a Redis cluster exhibited intermittent timeouts and high latency, and some applications reported the error "READONLY You can't write against a read only replica."

Problem Analysis

Observed Symptoms

Timeout waiting for connection

READONLY error

Master node CPU usage near 100%

Inter‑node communication latency spikes

Slow log contains a large number of entries

Initial Suspected Causes

Network jitter causing failover, sending requests to read‑only replicas

Improper client connection‑pool configuration or exhausted connections

Application misuse of high-cost commands such as KEYS and HGETALL

Insufficient node memory or hardware issues (e.g., disk performance)

Investigation Steps

(1) Examine Redis Logs

Frequent "LOADING Redis is loading the dataset in memory" messages indicate that nodes are restarting and reloading their data.

Intermittent FAILOVER entries show master‑slave role switches.

Numerous CLIENT_KILL logs reveal Redis actively terminating client connections.

(2) Check Network Conditions

Use ping and mtr to test latency between application servers and Redis nodes; observed large latency fluctuations.

Some nodes’ INFO replication output shows master_link_down_since_seconds > 0, indicating broken master‑slave links.
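The replication check above can be scripted. The following is a minimal sketch that parses raw INFO replication text and flags a replica whose link to its master is down. The helper names and the sample text (including the host address) are made up for illustration; only the INFO field names come from Redis itself.

```python
def parse_info(text):
    """Parse raw INFO output ("key:value" lines) into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def replica_link_problem(info):
    """Describe a broken replica-to-master link, or return None if all is well."""
    if info.get("role") != "slave":
        return None  # masters have no upstream link to check
    if info.get("master_link_status") == "up":
        return None
    down_for = info.get("master_link_down_since_seconds", "unknown")
    return f"link to master down for {down_for}s"


# Illustrative sample only; not captured from a real cluster.
SAMPLE = """# Replication
role:slave
master_host:10.0.0.5
master_port:6379
master_link_status:down
master_link_down_since_seconds:42
"""

print(replica_link_problem(parse_info(SAMPLE)))  # link to master down for 42s
```

Running this against every node on a schedule makes short link drops visible that a one-off manual check could miss.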

(3) Review Slow‑Query Log

High proportion of KEYS commands in the slow log.

HGETALL on certain hashes returned over 10,000 fields.
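To see which commands dominate the slow log, the entries can be grouped by command name. This is a rough sketch: the entry shape (a dict with "duration" in microseconds and a "command" string or bytes) is an assumption modeled on what a Python client such as redis-py returns from its slow-log call, and the sample entries are invented.

```python
from collections import Counter


def summarize_slowlog(entries):
    """Return (command, count, total_microseconds) tuples, most frequent first."""
    counts = Counter()
    total_us = Counter()
    for entry in entries:
        command = entry["command"]
        if isinstance(command, bytes):
            command = command.decode("utf-8", errors="replace")
        parts = command.split()
        # The first token is the command name; normalise its case.
        name = parts[0].upper() if parts else "(empty)"
        counts[name] += 1
        total_us[name] += entry["duration"]
    return [(name, n, total_us[name]) for name, n in counts.most_common()]


# Invented entries for illustration.
SAMPLE_ENTRIES = [
    {"id": 3, "duration": 48210, "command": b"KEYS session:*"},
    {"id": 2, "duration": 15400, "command": b"HGETALL user:profile"},
    {"id": 1, "duration": 51900, "command": b"keys cache:*"},
]

for name, n, us in summarize_slowlog(SAMPLE_ENTRIES):
    print(f"{name:<8} count={n} total={us / 1000:.1f} ms")
```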

(4) Inspect Node Resource Usage

Memory usage on several nodes approaches limits, triggering eviction.

Master CPU usage spikes to 95‑100% due to operations on large keys.
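The memory check can be expressed as a small helper over INFO fields. This is a sketch under assumptions: the function name and the 0.85 warning ratio are arbitrary choices for illustration, while used_memory, maxmemory, and evicted_keys are fields reported by INFO.

```python
def memory_pressure(info, warn_ratio=0.85):
    """Summarise memory pressure from a dict of INFO fields (string or int values)."""
    used = int(info["used_memory"])
    limit = int(info.get("maxmemory", 0))
    evicted = int(info.get("evicted_keys", 0))
    if limit == 0:
        # maxmemory 0 means no configured limit, so no ratio can be computed.
        return {"ratio": None, "evicted_keys": evicted, "warn": evicted > 0}
    ratio = used / limit
    warn = ratio >= warn_ratio or evicted > 0
    return {"ratio": ratio, "evicted_keys": evicted, "warn": warn}


# Illustrative values: 3.6 GiB used out of a 4 GiB limit, some evictions seen.
report = memory_pressure(
    {"used_memory": str(3865470566), "maxmemory": str(4 * 1024**3), "evicted_keys": "1250"}
)
print(report["warn"], round(report["ratio"], 2))
```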

Fixes and Optimizations

1. Network Optimization

Move the Redis cluster to a low‑latency, stable private network.

Set TCP keepalive so that dead peers are detected sooner and idle connections are less likely to be dropped silently when the network is unstable.

tcp-keepalive 60
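The setting above applies on the server side. If the client library does not expose its own keepalive option, the same idea can be applied to a client socket directly. The sketch below is an assumption about how one might do that with the Python standard library; the 60-second idle time mirrors the server value, and the interval and probe counts are illustrative.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=3):
    """Turn on TCP keepalive for sock and tune it where the platform allows."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained options are platform specific (present on Linux),
    # so each one is applied only if this Python build exposes it.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # failed probes before the connection is dropped
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


# Demonstration on an unconnected socket; no network traffic is generated.
demo = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    enable_keepalive(demo)
    print(demo.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
finally:
    demo.close()
```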

2. Client Optimization

Increase the maximum number of connections in the client pool to reduce timeout waiting for connection errors.

Use a client that supports Redis Sentinel or Cluster mode for automatic master‑slave detection.
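For clients that do not re-resolve the master on their own, a retry wrapper can illustrate the idea: when a write fails with READONLY, refresh the topology and try again. This is a hedged sketch, not any library's API. The names retry_on_readonly, refresh_topology, and looks_readonly are hypothetical, and the default error check assumes the exception text contains "READONLY"; a real client may strip that prefix or raise a dedicated exception type, in which case a matching predicate should be passed in.

```python
import time


def looks_readonly(exc):
    """Assumed check: the error text carries the READONLY prefix."""
    return "READONLY" in str(exc)


def retry_on_readonly(operation, refresh_topology, attempts=3, delay=0.05,
                      is_readonly=looks_readonly):
    """Call operation(); on a READONLY error, refresh topology and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Re-raise anything that is not READONLY, and give up on the last try.
            if not is_readonly(exc) or attempt == attempts:
                raise
            refresh_topology()
            time.sleep(delay * attempt)  # simple linear backoff


# Simulated write that hits a demoted master once, then succeeds.
state = {"calls": 0}

def flaky_write():
    state["calls"] += 1
    if state["calls"] == 1:
        raise RuntimeError("READONLY You can't write against a read only replica.")
    return "OK"

print(retry_on_readonly(flaky_write, refresh_topology=lambda: None, delay=0))
```

Keeping the attempt count small matters here: unbounded retries during a long failover would add to the connection pressure described above.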

3. Application‑Level Optimization

Replace KEYS with SCAN to fetch keys in batches and lower master load.

Instead of HGETALL, fetch only the required fields with HMGET.

Split large hashes into multiple smaller hashes to avoid big‑key slow queries.
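The three changes above can be sketched together. The code assumes a client object exposing scan(cursor=..., match=..., count=...) returning (next_cursor, keys) and hmget(key, fields) returning a list of values, which is how redis-py names these calls; the helper names, the CRC32-based bucketing scheme, and the bucket count of 64 are illustrative choices, not something the original setup prescribes.

```python
import zlib


def scan_keys(client, pattern, count=500):
    """Yield keys matching pattern in batches via SCAN instead of one KEYS call."""
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=pattern, count=count)
        # SCAN may return the same key more than once; callers that need
        # uniqueness should de-duplicate.
        yield from keys
        if cursor == 0:  # a zero cursor marks the end of the iteration
            break


def bucket_key(base, field, buckets=64):
    """Map a field to one of `buckets` smaller hashes, named '<base>:<index>'."""
    index = zlib.crc32(field.encode("utf-8")) % buckets
    return f"{base}:{index}"


def get_fields(client, base, fields, buckets=64):
    """Fetch only the requested fields from a bucketed hash using HMGET."""
    by_bucket = {}
    for field in fields:
        by_bucket.setdefault(bucket_key(base, field, buckets), []).append(field)
    result = {}
    for key, names in by_bucket.items():
        values = client.hmget(key, names)  # one small read per touched bucket
        result.update(zip(names, values))
    return result
```

Writes would use the same bucket_key function so that each field always lands in the same small hash. Against a cluster, the SCAN loop has to be run per node, since each node only iterates its own keys.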

4. Redis Configuration Tuning

Set an explicit slow-query threshold and alert when new entries appear in the slow log.

# threshold in microseconds (10000 = 10 ms)
slowlog-log-slower-than 10000
slowlog-max-len 128

Set an appropriate eviction policy, e.g., allkeys-lru.

maxmemory-policy allkeys-lru

Limit the client output buffer to control traffic spikes.

5. Master‑Slave Synchronization Fixes

Upgrade disk hardware (e.g., replace with SSD) to improve replication speed.

Increase repl-backlog-size to enlarge the replication backlog, so that a replica reconnecting after a brief disconnection can perform a partial resync instead of a full sync.

repl-backlog-size 128mb
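One common rule of thumb for sizing the backlog is write rate multiplied by the longest disconnection you want to survive, with some headroom. The sketch below estimates the write rate from two master_repl_offset samples taken from INFO replication. The function name, the 2x headroom, and the sample numbers are assumptions for illustration, not measurements from the cluster described here.

```python
def backlog_bytes(offset_start, offset_end, interval_s, outage_s, headroom=2.0):
    """Estimate repl-backlog-size in bytes from observed replication traffic.

    offset_start / offset_end: master_repl_offset sampled interval_s seconds apart.
    outage_s: longest replica disconnection that should still allow a partial resync.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    write_rate = (offset_end - offset_start) / interval_s  # bytes per second
    return int(write_rate * outage_s * headroom)


# Illustrative numbers: 600 MB of replication traffic in 5 minutes (2 MB/s),
# tolerating a 30-second disconnection with 2x headroom.
estimate = backlog_bytes(1_000_000_000, 1_600_000_000, interval_s=300, outage_s=30)
print(estimate)  # 120000000
```

With these sample numbers the estimate is about 114 MiB, which a 128mb backlog would cover. Sampling during peak write traffic gives a safer figure than sampling during quiet periods.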

Conclusion

Redis issues in production are typically a mix of network instability, resource constraints, and suboptimal command usage.

By applying proper configuration tweaks, adjusting application behavior, and monitoring resources, teams can efficiently diagnose and resolve middleware problems.
