How to Diagnose and Fix Intermittent Timeout & READONLY Errors in Redis Clusters
This guide explains why a production Redis cluster may show intermittent timeouts and READONLY errors, analyzes root causes such as network jitter, resource saturation, and costly commands, and provides step‑by‑step troubleshooting and configuration fixes to restore stability and performance.
Problem Description
In a production environment, a Redis cluster exhibited intermittent timeouts and high latency, and some applications reported the error READONLY You can't write against a read only replica.
Problem Analysis
Observed Symptoms
Timeout waiting for connection
READONLY error
Master node CPU usage near 100%
Inter‑node communication latency spikes
Slow-query log contains many entries
Initial Suspected Causes
Network jitter causing failover, sending requests to read‑only replicas
Improper client connection‑pool configuration or exhausted connections
Application misuse of high-cost commands such as KEYS and HGETALL
Insufficient node memory or hardware issues (e.g., disk performance)
Investigation Steps
(1) Examine Redis Logs
Frequent "LOADING Redis is loading the dataset in memory" messages indicate that nodes are restarting and reloading their data.
Intermittent FAILOVER entries show master‑slave role switches.
Numerous CLIENT_KILL logs reveal Redis actively terminating client connections.
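To get a quick sense of how often each of these events occurs, the log can be tallied by marker. The following is a minimal sketch that only counts substring matches; the marker strings are the ones named above, and real log lines may be worded differently depending on the Redis version.

```python
from collections import Counter

# Markers taken from the findings above; adjust them to match the actual log wording.
MARKERS = ("LOADING", "FAILOVER", "CLIENT_KILL")


def count_markers(lines, markers=MARKERS):
    """Count how many log lines contain each marker substring."""
    counts = Counter({marker: 0 for marker in markers})
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts
```

Any iterable of lines works, so an open log file can be passed directly, for example count_markers(open(path)) where path is the location of the node's log file.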
(2) Check Network Conditions
Use ping and mtr to test latency between application servers and Redis nodes; observed large latency fluctuations.
Some nodes’ INFO replication output shows master_link_down_since_seconds > 0, indicating broken master‑slave links.
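The replication check can be automated by parsing the text that INFO replication returns. This sketch assumes the raw INFO output is already available as a string (how it is fetched is left to the client in use) and relies on the role, master_link_status, and master_link_down_since_seconds fields.

```python
def parse_info(text):
    """Parse the 'key:value' lines of a Redis INFO section into a dict of strings."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and section headers such as "# Replication".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def replica_link_problem(fields):
    """Return a short description if a replica's link to its master is down, else None."""
    if fields.get("role") != "slave":
        return None  # masters have no upstream link to check
    if fields.get("master_link_status") == "up":
        return None
    down_for = fields.get("master_link_down_since_seconds", "unknown")
    return f"master link down for {down_for}s"
```

Running this against every node on a schedule makes it easier to correlate link drops with the latency spikes seen in ping and mtr.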
(3) Review Slow‑Query Log
A high proportion of the slow-log entries are KEYS commands.
HGETALL on certain hashes returned over 10,000 fields.
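A per-command summary makes this kind of finding easier to spot. The sketch below assumes each slow-log entry is a dict with a command (str or bytes) and a duration in microseconds, which is the shape redis-py's slowlog_get() is expected to return; other clients may need the keys adapted.

```python
from collections import defaultdict


def summarize_slowlog(entries):
    """Group slow-log entries by command name, sorted by total time spent."""
    stats = defaultdict(lambda: {"count": 0, "total_us": 0, "max_us": 0})
    for entry in entries:
        command = entry["command"]
        if isinstance(command, bytes):
            command = command.decode("utf-8", errors="replace")
        parts = command.split()
        name = parts[0].upper() if parts else "(empty)"
        bucket = stats[name]
        bucket["count"] += 1
        bucket["total_us"] += entry["duration"]
        bucket["max_us"] = max(bucket["max_us"], entry["duration"])
    # Most expensive commands first.
    return sorted(stats.items(), key=lambda item: item[1]["total_us"], reverse=True)
```

Sorting by total time rather than by count keeps a handful of very slow commands from being hidden behind many moderately slow ones.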
(4) Inspect Node Resource Usage
Memory usage on several nodes approaches limits, triggering eviction.
Master CPU usage spikes to 95‑100% due to operations on large keys.
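Memory headroom can be checked from the used_memory and maxmemory fields of INFO memory. This is a rough sketch: the 0.85 warning ratio is an arbitrary assumption, and info is assumed to be a dict of those fields, such as the one redis-py's info("memory") returns.

```python
def memory_pressure(info, warn_ratio=0.85):
    """Return (ratio, is_high); ratio is None when no maxmemory limit is set."""
    used = int(info["used_memory"])
    limit = int(info.get("maxmemory", 0))
    if limit <= 0:
        # maxmemory 0 means "no limit", so a ratio is not meaningful.
        return None, False
    ratio = used / limit
    return ratio, ratio >= warn_ratio
```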
Fixes and Optimizations
1. Network Optimization
Move the Redis cluster to a low‑latency, stable private network.
Enable TCP keepalive to reduce disconnections caused by network jitter.
tcp-keepalive 60
2. Client Optimization
Increase the maximum number of connections in the client pool to reduce timeout waiting for connection errors.
Use a client that supports Redis Sentinel or Cluster mode for automatic master‑slave detection.
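Cluster-aware and Sentinel-aware clients usually follow role changes on their own. Where an application wraps its Redis calls in a thin layer, a small retry helper can cover the window right after a failover. The sketch below is library-agnostic: operation and refresh_topology are hypothetical callables supplied by the application, and the error is recognized only by its READONLY message prefix.

```python
import time


def call_with_readonly_retry(operation, refresh_topology, retries=3, delay=0.2,
                             sleep=time.sleep):
    """Run operation(); on a READONLY error, refresh the topology and retry."""
    attempt = 0
    while True:
        try:
            return operation()
        except Exception as exc:
            # Re-raise anything that is not a READONLY reply, or when out of retries.
            if not str(exc).startswith("READONLY") or attempt >= retries:
                raise
            attempt += 1
            refresh_topology()      # e.g. re-resolve the current master
            sleep(delay * attempt)  # linear backoff between attempts
```

The retry count and delay are illustrative; they should stay well below the application's own request timeout.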
3. Application‑Level Optimization
Replace KEYS with SCAN to fetch keys in batches and lower master load.
Instead of calling HGETALL, fetch only the required fields with HMGET.
Split large hashes into multiple smaller hashes to avoid big‑key slow queries.
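Two of these changes can be sketched in a few lines. The first helper iterates keys with cursor-based SCAN; it assumes a client object exposing scan(cursor, match=..., count=...) that returns (next_cursor, keys), as redis-py does. The second maps a hash field to one of several smaller hashes; the "base:index" naming scheme and the bucket count are hypothetical choices, not something the original setup prescribes.

```python
import zlib


def scan_keys(client, pattern, batch=500):
    """Yield keys matching pattern using SCAN batches instead of a single KEYS call."""
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor, match=pattern, count=batch)
        yield from keys
        if cursor == 0:  # a returned cursor of 0 means the iteration is complete
            break


def bucket_key(base_key, field, buckets=64):
    """Pick the smaller hash that should hold this field (hypothetical naming scheme)."""
    index = zlib.crc32(field.encode("utf-8")) % buckets
    return f"{base_key}:{index}"
```

Note that SCAN only covers the node it is sent to, so in a cluster the loop has to be run against each master. The bucket helper must be used for both reads and writes so that a field always resolves to the same smaller hash.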
4. Redis Configuration Tuning
Set the slow-query threshold and enable alerts.
# threshold in microseconds (10000 = 10 ms)
slowlog-log-slower-than 10000
slowlog-max-len 128
Set an appropriate eviction policy, e.g., allkeys-lru.
maxmemory-policy allkeys-lru
Limit the client output buffer to control traffic spikes.
5. Master‑Slave Synchronization Fixes
Upgrade disk hardware (e.g., replace with SSD) to improve replication speed.
Increase repl-backlog-size to enlarge the replication buffer and reduce full‑sync occurrences.
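A rough way to choose a size is to multiply the master's write rate by the longest disconnection a replica should survive without a full resync. The sketch below assumes the write rate has been estimated separately (for example, from how fast master_repl_offset grows), and the safety factor of 2 is an arbitrary assumption.

```python
import math


def backlog_bytes(write_bytes_per_sec, max_disconnect_sec, safety_factor=2.0):
    """Estimate the replication backlog needed to cover a disconnection window."""
    return int(write_bytes_per_sec * max_disconnect_sec * safety_factor)


def to_mib(size_bytes):
    """Round a byte count up to whole mebibytes for use in the config file."""
    return math.ceil(size_bytes / (1024 * 1024))
```

As an illustration, a write rate of about 1 MB/s and a 60-second tolerance give backlog_bytes(1_000_000, 60) = 120,000,000 bytes, or 115 MiB after rounding up.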
repl-backlog-size 128mb
Conclusion
Redis issues in production are typically a mix of network instability, resource constraints, and suboptimal command usage.
By applying proper configuration tweaks, adjusting application behavior, and monitoring resources, teams can efficiently diagnose and resolve middleware problems.
Full-Stack DevOps & Kubernetes
Focused on sharing DevOps, Kubernetes, Linux, Docker, Istio, microservices, Spring Cloud, Python, Go, databases, Nginx, Tomcat, cloud computing, and related technologies.
