
Why Redis Failed: Jedis Misconfigurations That Spark Service Avalanches

This article examines a Redis 3.x cluster failure caused by a master‑slave switch, detailing how improper Jedis timeout and retry settings triggered a service avalanche, and provides step‑by‑step analysis of the incident, code paths, and recommended configuration adjustments to prevent recurrence.


Background

Redis is the de‑facto remote‑cache solution for many Internet services, and Jedis is one of the most widely used Java clients. The author’s project runs Redis 3.x in cluster mode (multiple nodes with master‑slave pairs) and accesses it through Jedis.

A physical‑machine failure caused a node in the Redis cluster to perform a master‑slave switch. During the switch Jedis’ retry mechanism was triggered, eventually leading to a service‑wide avalanche.

Fault Record

Message‑queue backlog alert (timestamp 2022‑11‑29 23:50:21, queue size 159 412, threshold > 100 000).

System monitoring showed a sharp drop in request volume and average interface latency approaching 60 seconds.

Thread‑wait count surged dramatically during the incident.

Operations confirmed that a Redis master‑slave switch coincided with the outage.

Failure Analysis

Traffic Drop

NGINX logs contained many “connection timed out” errors, which caused NGINX to mark the backend as unavailable and emit “no live upstreams”. This prevented request forwarding and resulted in a steep traffic decline.

Latency Issue

Jedis threw <code>connect timed out</code> exceptions while acquiring connections. The default connection timeout is <code>DEFAULT_TIMEOUT = 2000</code> ms, so each failed attempt adds roughly 2 seconds; with six retries a single Redis call can take about 12 seconds.

Because the business logic performs five Redis calls per request, total latency can reach roughly 5 × 12 = 60 seconds, matching the observed average.
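The arithmetic behind these numbers can be checked directly. The sketch below is illustrative only (the class and method names are not Jedis APIs); it plugs in the default Jedis timeout and the attempt and call counts reported above:

```java
public class WorstCaseLatency {
    // Jedis DEFAULT_TIMEOUT, applied to each connect attempt (ms)
    static final int CONNECTION_TIMEOUT_MS = 2000;

    // Worst case: every attempt times out, so one Redis call costs
    // attempts * timeout.
    static int perCallMs(int attempts) {
        return attempts * CONNECTION_TIMEOUT_MS;
    }

    // A request multiplies the per-call cost by its number of Redis calls.
    static int perRequestMs(int attempts, int redisCallsPerRequest) {
        return perCallMs(attempts) * redisCallsPerRequest;
    }

    public static void main(String[] args) {
        System.out.println(perCallMs(6));        // 12000 ms per Redis call
        System.out.println(perRequestMs(6, 5));  // 60000 ms per request
    }
}
```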

Jedis Execution Flow

The following diagram shows the high‑level flow of a Jedis command. Key steps:

1. Compute the slot for the key with <code>JedisClusterCRC16.getSlot</code> and obtain a connection to the node serving that slot.

2. Execute the command via <code>execute</code>.

3. If connection acquisition or command execution fails, the resulting exception triggers the retry logic.
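Redis Cluster maps every key to one of 16 384 slots using CRC16 (the XMODEM variant) modulo 16384, which is what <code>JedisClusterCRC16.getSlot</code> computes. A self-contained sketch of that calculation (not the Jedis class itself; hash-tag handling is omitted):

```java
public class SlotSketch {
    // CRC16-CCITT (XMODEM): polynomial 0x1021, initial value 0x0000 —
    // the variant Redis Cluster uses for key hashing.
    static int crc16(byte[] bytes) {
        int crc = 0x0000;
        for (byte b : bytes) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    // Slot = CRC16(key) mod 16384
    static int getSlot(String key) {
        return crc16(key.getBytes()) % 16384;
    }

    public static void main(String[] args) {
        // Matches what redis-cli reports when setting "foo" in a cluster:
        // "Redirected to slot [12182]"
        System.out.println(getSlot("foo")); // 12182
    }
}
```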

Source Code Highlights

<code>public class JedisCluster extends BinaryJedisCluster implements JedisCommands, MultiKeyJedisClusterCommands, JedisClusterScriptingCommands {
    @Override
    public String set(final String key, final String value, final String nxxx, final String expx, final long time) {
        return new JedisClusterCommand<String>(connectionHandler, maxAttempts) {
            @Override
            public String execute(Jedis connection) {
                return connection.set(key, value, nxxx, expx, time);
            }
        }.run(key);
    }
}</code>

The <code>runWithRetries</code> method encapsulates the retry logic, repeatedly calling <code>connectionHandler.getConnectionFromSlot</code> until the maximum attempt count is reached.

<code>private T runWithRetries(byte[] key, int attempts, boolean tryRandomNode, boolean asking) {
    Jedis connection = null;
    try {
        // obtain a connection to the node serving this key's slot
        connection = connectionHandler.getConnectionFromSlot(JedisClusterCRC16.getSlot(key));
        return execute(connection);
    } catch (JedisConnectionException e) {
        if (attempts <= 1) {
            // attempts exhausted: refresh the slot-to-node mapping and give up
            connectionHandler.renewSlotCache();
            throw e;
        }
        // recurse with one fewer attempt remaining
        return runWithRetries(key, attempts - 1, tryRandomNode, asking);
    } finally {
        releaseConnection(connection);
    }
}</code>

Retry Mechanism

If an exception occurs while obtaining a connection or executing a command, Jedis retries up to <code>maxAttempts</code> times. In this incident the client performed six attempts, each lasting about 2 seconds, leading to the observed 12‑second delay per Redis call.
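Stripped of Jedis specifics, the control flow amounts to a bounded retry loop. This simplified, self-contained model (illustrative names, not the Jedis API) shows why a dead node costs roughly <code>maxAttempts × connectionTimeout</code> before the caller ever sees the exception:

```java
import java.util.function.Supplier;

public class RetryModel {
    // A stand-in for Jedis' recursive runWithRetries: try up to maxAttempts
    // times, rethrowing the last failure once attempts are exhausted.
    static <T> T runWithRetries(Supplier<T> call, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.get(); // obtain connection + execute command
            } catch (RuntimeException e) {
                last = e; // in Jedis: JedisConnectionException
            }
        }
        // all attempts exhausted; the caller has waited ~maxAttempts * timeout
        throw last;
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        try {
            runWithRetries(() -> {
                attempts[0]++;
                throw new RuntimeException("connect timed out"); // node is down
            }, 6);
        } catch (RuntimeException expected) {
            // fell through only after every attempt failed
        }
        System.out.println(attempts[0]); // 6
    }
}
```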

Recommendations

Configure Jedis parameters according to the actual workload:

maxAttempts : maximum number of retries.

connectionTimeout : connection timeout (e.g., 100 ms in production).

soTimeout : socket read timeout.

In the author’s production environment the settings are <code>connectionTimeout = 100</code> ms, <code>soTimeout = 100</code> ms, and <code>maxAttempts = 2</code>. With these values, even if a Redis node fails, a request typically completes within 1 second, preventing a cascading failure.

Conclusion

The analysis links a Redis master‑slave switch to a service avalanche caused by overly long Jedis timeouts combined with aggressive retries. Properly tuning <code>connectionTimeout</code>, <code>soTimeout</code>, and <code>maxAttempts</code> can dramatically reduce latency during node failures and keep the service stable.

Written by Efficient Ops, a public account maintained by Xiaotianguo and friends, regularly publishing original technical articles focused on operations transformation.
