Why Redis Failed: Jedis Misconfigurations That Spark Service Avalanches
This article examines a Redis 3.x cluster failure caused by a master‑slave switch, detailing how improper Jedis timeout and retry settings triggered a service avalanche, and provides step‑by‑step analysis of the incident, code paths, and recommended configuration adjustments to prevent recurrence.
Background
Redis is the de‑facto remote‑cache solution for many Internet services, and Jedis is one of the most widely used Java clients. The author’s project runs Redis 3.x in cluster mode (multiple nodes with master‑slave pairs) and accesses it through Jedis.
A physical‑machine failure caused a node in the Redis cluster to perform a master‑slave switch. During the switch Jedis’ retry mechanism was triggered, eventually leading to a service‑wide avalanche.
Fault Record
Message‑queue backlog alert (timestamp 2022‑11‑29 23:50:21, queue size 159 412, threshold > 100 000).
System monitoring showed a sharp drop in request volume and average interface latency approaching 60 seconds.
Thread‑wait count surged dramatically during the incident.
Operations confirmed that a Redis master‑slave switch coincided with the outage.
Failure Analysis
Traffic Drop
NGINX logs contained many “connection timed out” errors, which caused NGINX to mark the backend as unavailable and emit “no live upstreams”. This prevented request forwarding and resulted in a steep traffic decline.
Latency Issue
Jedis threw “connect timed out” exceptions while acquiring connections. The default connection timeout is DEFAULT_TIMEOUT = 2000 ms. Each retry therefore adds roughly 2 seconds; with six retries, a single Redis call can take about 12 seconds. Because the business logic performs five Redis calls per request, the total latency can reach ~60 seconds, matching the observed average.
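The latency arithmetic above can be sketched as a quick back-of-the-envelope check (the 2000 ms timeout, six retries, and five calls per request are the figures from this incident):

```java
public class LatencyEstimate {
    public static void main(String[] args) {
        int connectionTimeoutMs = 2000; // Jedis DEFAULT_TIMEOUT
        int retries = 6;                // retries observed in the incident
        int redisCallsPerRequest = 5;   // business logic calls Redis five times per request

        // Worst case: every attempt burns the full connection timeout
        int perCallMs = retries * connectionTimeoutMs;       // 12000 ms per Redis call
        int perRequestMs = perCallMs * redisCallsPerRequest; // 60000 ms per request

        System.out.println(perCallMs + " ms per Redis call");
        System.out.println(perRequestMs + " ms per request");
    }
}
```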
Jedis Execution Flow
The high-level flow of a Jedis command (shown as a diagram in the original article) involves these key steps:
Obtain a connection for the slot calculated by JedisClusterCRC16.getSlot.
Execute the command via execute.
If connection acquisition or command execution fails, the exception triggers the retry logic.
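To make the slot-calculation step concrete, here is a minimal, self-contained sketch of what JedisClusterCRC16.getSlot computes (CRC16/XMODEM modulo 16384, with {hash-tag} handling). This re-implementation is for illustration only, not Jedis' actual code:

```java
import java.nio.charset.StandardCharsets;

public class SlotSketch {
    // CRC16/XMODEM: polynomial 0x1021, initial value 0
    static int crc16(byte[] data) {
        int crc = 0;
        for (byte b : data) {
            crc ^= (b & 0xFF) << 8;
            for (int i = 0; i < 8; i++) {
                crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                crc &= 0xFFFF;
            }
        }
        return crc;
    }

    // Redis Cluster slot: CRC16 of the key (or of its {hash tag}) mod 16384
    static int getSlot(String key) {
        int open = key.indexOf('{');
        if (open >= 0) {
            int close = key.indexOf('}', open + 1);
            if (close > open + 1) {               // only a non-empty tag counts
                key = key.substring(open + 1, close);
            }
        }
        return crc16(key.getBytes(StandardCharsets.UTF_8)) % 16384;
    }

    public static void main(String[] args) {
        System.out.println(getSlot("foo")); // 12182, as in the Redis Cluster tutorial
        // Keys sharing a hash tag land on the same slot:
        System.out.println(getSlot("{user1}.name") == getSlot("{user1}.age")); // true
    }
}
```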
Source Code Highlights
<code>public class JedisCluster extends BinaryJedisCluster implements JedisCommands, MultiKeyJedisClusterCommands, JedisClusterScriptingCommands {
    @Override
    public String set(final String key, final String value, final String nxxx, final String expx, final long time) {
        // Each command is wrapped in a JedisClusterCommand, which carries the retry logic
        return new JedisClusterCommand<String>(connectionHandler, maxAttempts) {
            @Override
            public String execute(Jedis connection) {
                return connection.set(key, value, nxxx, expx, time);
            }
        }.run(key);
    }
}</code>

The runWithRetries method encapsulates the retry logic, repeatedly calling connectionHandler.getConnectionFromSlot until the maximum attempt count is reached.
<code>private T runWithRetries(byte[] key, int attempts, boolean tryRandomNode, boolean asking) {
    Jedis connection = null;
    try {
        // obtain a connection for the key's slot
        connection = connectionHandler.getConnectionFromSlot(JedisClusterCRC16.getSlot(key));
        return execute(connection);
    } catch (JedisConnectionException e) {
        // last attempt: refresh the slot cache and propagate the exception
        if (attempts <= 1) {
            connectionHandler.renewSlotCache();
            throw e;
        }
        // otherwise recurse with one fewer attempt remaining
        return runWithRetries(key, attempts - 1, tryRandomNode, asking);
    } finally {
        releaseConnection(connection);
    }
}</code>

Retry Mechanism
If an exception occurs while obtaining a connection or executing a command, Jedis retries up to maxAttempts times. In the incident the client performed six attempts, each lasting about 2 seconds, leading to the observed 12‑second per‑call delay.
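The attempt-counting behavior can be isolated into a small generic wrapper. This is a simplified sketch of the pattern runWithRetries implements, not Jedis' actual code; the Supplier-based shape is my own:

```java
import java.util.function.Supplier;

public class RetrySketch {
    static class ConnectionException extends RuntimeException {
        ConnectionException(String msg) { super(msg); }
    }

    // Try up to maxAttempts times in total; rethrow the last failure.
    static <T> T runWithRetries(Supplier<T> action, int attempts) {
        try {
            return action.get();
        } catch (ConnectionException e) {
            if (attempts <= 1) {
                // In Jedis this is also where the slot cache is refreshed
                throw e;
            }
            return runWithRetries(action, attempts - 1);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        try {
            runWithRetries(() -> {
                calls[0]++; // count how often the action is invoked
                throw new ConnectionException("connect timed out");
            }, 6);
        } catch (ConnectionException e) {
            System.out.println("gave up after " + calls[0] + " attempts"); // 6 attempts
        }
    }
}
```

If each attempt blocks for the full connection timeout, a wrapper like this multiplies that timeout by the attempt count, which is exactly how the 12‑second per‑call delay arises.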
Recommendations
Configure Jedis parameters according to the actual workload:
maxAttempts : maximum number of attempts per command (each failed attempt triggers a retry).
connectionTimeout : connection‑establishment timeout (e.g., 100 ms in production).
soTimeout : socket read timeout.
In the author’s production environment the settings are connectionTimeout = 100 ms, soTimeout = 100 ms, and maxAttempts = 2. With these values, even if a Redis node fails, a request typically completes within 1 second, preventing a cascading failure.
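As a sketch, constructing a JedisCluster with these production values might look like the following. The host and port are placeholders; the five‑argument constructor taking connectionTimeout, soTimeout, and maxAttempts exists in Jedis 2.x/3.x, but verify the signature against the version you run:

```java
import java.util.HashSet;
import java.util.Set;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;

public class ClusterClientConfig {
    public static JedisCluster build() {
        Set<HostAndPort> nodes = new HashSet<>();
        nodes.add(new HostAndPort("redis-node-1", 6379)); // placeholder address

        GenericObjectPoolConfig poolConfig = new GenericObjectPoolConfig();

        int connectionTimeout = 100; // ms: fail fast instead of waiting 2000 ms
        int soTimeout = 100;         // ms: bound each socket read as well
        int maxAttempts = 2;         // at most one retry per command

        return new JedisCluster(nodes, connectionTimeout, soTimeout, maxAttempts, poolConfig);
    }
}
```

With these values the worst case per request is roughly 5 calls × 2 attempts × 100 ms = 1000 ms, matching the “within 1 second” figure above.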
Conclusion
The analysis links a Redis master‑slave switch to a service avalanche caused by overly aggressive Jedis timeout and retry settings. Properly tuning connectionTimeout, soTimeout, and maxAttempts can dramatically reduce latency during node failures and keep the service stable.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.