Why Redis Failed: Jedis Misconfigurations That Spark Service Avalanches
This article examines a Redis 3.x cluster failure caused by a master‑slave switch, detailing how improper Jedis timeout and retry settings triggered a service avalanche, and provides step‑by‑step analysis of the incident, code paths, and recommended configuration adjustments to prevent recurrence.
Background
Redis is the de‑facto remote‑cache solution for many Internet services, and Jedis is one of the most widely used Java clients. The author’s project runs Redis 3.x in cluster mode (multiple nodes with master‑slave pairs) and accesses it through Jedis.
A physical‑machine failure caused a node in the Redis cluster to perform a master‑slave switch. During the switch Jedis’ retry mechanism was triggered, eventually leading to a service‑wide avalanche.
Fault Record
Message‑queue backlog alert (timestamp 2022‑11‑29 23:50:21, queue size 159 412, threshold > 100 000).
System monitoring showed a sharp drop in request volume and average interface latency approaching 60 seconds.
Thread‑wait count surged dramatically during the incident.
Operations confirmed that a Redis master‑slave switch coincided with the outage.
Failure Analysis
Traffic Drop
NGINX logs contained many “connection timed out” errors, which caused NGINX to mark the backend as unavailable and emit “no live upstreams”. This prevented request forwarding and resulted in a steep traffic decline.
Latency Issue
Jedis threw “connect timed out” exceptions while acquiring connections. The default connection timeout is DEFAULT_TIMEOUT = 2000 ms. Each retry therefore adds roughly 2 seconds; with six retries, a single Redis call can take about 12 seconds. Because the business logic performs five Redis calls per request, the total latency can reach ~60 seconds, matching the observed average.
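The latency arithmetic above can be sketched as a quick back-of-the-envelope check (the 2000 ms timeout, six retries, and five calls per request are the figures from this incident):

```java
public class LatencyEstimate {
    public static void main(String[] args) {
        int connectionTimeoutMs = 2000; // Jedis DEFAULT_TIMEOUT
        int retries = 6;                // retries observed in the incident
        int redisCallsPerRequest = 5;   // business logic calls Redis five times per request

        // Worst case: every attempt burns the full connection timeout
        int perCallMs = retries * connectionTimeoutMs;       // 12000 ms per Redis call
        int perRequestMs = perCallMs * redisCallsPerRequest; // 60000 ms per request

        System.out.println(perCallMs + " ms per Redis call");
        System.out.println(perRequestMs + " ms per request");
    }
}
```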
Jedis Execution Flow
The high-level flow of a Jedis command (shown as a diagram in the original article) involves these key steps:
Obtain a connection for the slot calculated by JedisClusterCRC16.getSlot.
Execute the command via execute.
If connection acquisition or command execution fails, the exception triggers the retry logic.
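To make the slot-calculation step concrete, here is a minimal, self-contained sketch of what JedisClusterCRC16.getSlot computes (CRC16/XMODEM modulo 16384, with {hash-tag} handling). This re-implementation is for illustration only, not Jedis' actual code:

```java
import java.nio.charset.StandardCharsets;

public class SlotSketch {
    // CRC16/XMODEM: polynomial 0x1021, initial value 0
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    // Redis Cluster slot: CRC16 of the key (or of its {hash tag}) mod 16384
    static int getSlot(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) {               // only a non-empty tag counts
                key = key.substring(open + 1, close);
            }
        }
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % 16384;
    }

    public static void main(String[] args) {
        System.out.println(getSlot("foo")); // 12182, as in the Redis Cluster tutorial
        // Keys sharing a hash tag land on the same slot:
        System.out.println(getSlot("{user1}.name") == getSlot("{user1}.age")); // true
    }
}
```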
Source Code Highlights
<code>public class JedisCluster extends BinaryJedisCluster implements JedisCommands, MultiKeyJedisClusterCommands, JedisClusterScriptingCommands {
    @Override
    public String set(final String key, final String value, final String nxxx, final String expx, final long time) {
        // Each command is wrapped in a JedisClusterCommand, which carries the retry logic
        return new JedisClusterCommand<String>(connectionHandler, maxAttempts) {
            @Override
            public String execute(Jedis connection) {
                return connection.set(key, value, nxxx, expx, time);
            }
        }.run(key);
    }
}</code>

The runWithRetries method encapsulates the retry logic, repeatedly calling connectionHandler.getConnectionFromSlot until the maximum attempt count is reached.
<code>private T runWithRetries(byte[] key, int attempts, boolean tryRandomNode, boolean asking) {
    Jedis connection = null;
    try {
        // obtain a connection for the key's slot
        connection = connectionHandler.getConnectionFromSlot(JedisClusterCRC16.getSlot(key));
        return execute(connection);
    } catch (JedisConnectionException e) {
        // last attempt: refresh the slot cache and propagate the exception
        if (attempts <= 1) {
            connectionHandler.renewSlotCache();
            throw e;
        }
        // otherwise recurse with one fewer attempt remaining
        return runWithRetries(key, attempts - 1, tryRandomNode, asking);
    } finally {
        releaseConnection(connection);
    }
}</code>

Retry Mechanism
If an exception occurs while obtaining a connection or executing a command, Jedis retries up to maxAttempts times. In the incident the client performed six attempts, each lasting about 2 seconds, leading to the observed 12‑second per‑call delay.
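The attempt-counting behavior can be isolated into a small generic wrapper. This is a simplified sketch of the pattern runWithRetries implements, not Jedis' actual code; the Supplier-based shape is my own:

```java
import java.util.function.Supplier;

public class RetrySketch {
    static class ConnectionException extends RuntimeException {
        ConnectionException(String msg) { super(msg); }
    }

    // Try up to maxAttempts times in total; rethrow the last failure.
    static <T> T runWithRetries(Supplier<T> action, int attempts) {
        try {
            return action.get();
        } catch (ConnectionException e) {
            if (attempts <= 1) {
                // In Jedis this is also where the slot cache is refreshed
                throw e;
            }
            return runWithRetries(action, attempts - 1);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        try {
            runWithRetries(() -> {
                calls[0]++; // count how often the action is invoked
                throw new ConnectionException("connect timed out");
            }, 6);
        } catch (ConnectionException e) {
            System.out.println("gave up after " + calls[0] + " attempts"); // 6 attempts
        }
    }
}
```

If each attempt blocks for the full connection timeout, a wrapper like this multiplies that timeout by the attempt count, which is exactly how the 12‑second per‑call delay arises.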
Recommendations
Configure Jedis parameters according to the actual workload:
maxAttempts : maximum number of attempts per command (each failed attempt triggers a retry).
connectionTimeout : connection‑establishment timeout (e.g., 100 ms in production).
soTimeout : socket read timeout.
In the author’s production environment the settings are connectionTimeout = 100 ms, soTimeout = 100 ms, and maxAttempts = 2. With these values, even if a Redis node fails, a request typically completes within 1 second, preventing a cascading failure.
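As a sketch, constructing a JedisCluster with these production values might look like the following. The host and port are placeholders; the five‑argument constructor taking connectionTimeout, soTimeout, and maxAttempts exists in Jedis 2.x/3.x, but verify the signature against the version you run:

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

public class ClusterClientConfig {
    public static JedisCluster build() {
        Set<HostAndPort> nodes = new HashSet<>();
        nodes.add(new HostAndPort("redis-node-1", 6379)); // placeholder address

        GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();

        int connectionTimeout = 100; // ms: fail fast instead of waiting 2000 ms
        int soTimeout = 100;         // ms: bound each socket read as well
        int maxAttempts = 2;         // at most one retry per command

        return new JedisCluster(nodes, connectionTimeout, soTimeout, maxAttempts, poolConfig);
    }
}
```

With these values the worst case per request is roughly 5 calls × 2 attempts × 100 ms = 1000 ms, matching the “within 1 second” figure above.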
Conclusion
The analysis links a Redis master‑slave switch to a service avalanche caused by overly aggressive Jedis timeout and retry settings. Properly tuning connectionTimeout, soTimeout, and maxAttempts can dramatically reduce latency during node failures and keep the service stable.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.