
How a Redis Connection Saturation Triggered a Service Avalanche – A Detailed Investigation

An online education platform experienced a massive outage when Redis hit its maxclients limit, causing authentication, session, and cache services to fail, which cascaded into a business avalanche; the article walks through the connection mechanism, root‑cause analysis, rapid mitigation steps, and long‑term safeguards.

MaGe Linux Operations

Problem Background

At 10:00 AM on a weekday, an online education platform received a flood of alerts: page load failures, homework submission timeouts, and Java applications repeatedly throwing:

redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool

Redis monitoring showed connected_clients had reached the maxclients=10000 limit and rejected_connections was climbing. With Redis connections saturated, the authentication, session, and cache services became unavailable; as failing nodes were removed, traffic shifted to the remaining ones and triggered a business avalanche.

Redis Connection Mechanism Quick Overview

How Redis Handles Connections

Redis uses a single‑threaded event loop (network I/O can be multithreaded after version 6.0, but core processing remains single‑threaded). Each client occupies a file descriptor (fd) and events are processed via epoll or kqueue.

Key configuration:

maxclients 10000   # maximum connections (default 10000, Redis 2.4+)
timeout 0          # idle timeout (0 = never)
tcp-keepalive 300 # TCP keepalive interval

Impact of Reaching maxclients

When connected_clients equals maxclients, Redis stops accepting new connections and logs:

# New connections are refused: Redis replies with an error and closes them
-ERR max number of clients reached

All services that rely on Redis are affected:

New requests cannot obtain a Redis connection → business threads block.

Blocked threads retry, filling the thread pool → web container runs out of threads.

Health checks fail, causing nodes to be removed from the service registry.

Traffic shifts to remaining nodes, whose Redis connections also surge, creating a cascading failure.

Investigation Process

Directly Check Redis Connection Status

# redis-cli -h <redis_host> -p 6379 -a <password> INFO clients
connected_clients:10000
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
cluster_connections:0
maxclients:10000

Both connected_clients and maxclients are at 10000, confirming saturation.
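When this check is automated in a probe script, the INFO output is easy to parse. A minimal Python sketch (the helper name and sample values are our own, mirroring the output above):

```python
def parse_info(info_text: str) -> dict[str, str]:
    """Parse redis-cli INFO output (key:value lines, '#' section headers)."""
    out = {}
    for line in info_text.splitlines():
        if ":" in line and not line.startswith("#"):
            key, value = line.split(":", 1)
            out[key.strip()] = value.strip()
    return out

# Sample taken from the INFO clients output shown above
info = "connected_clients:10000\nblocked_clients:0\nmaxclients:10000\n"
stats = parse_info(info)
saturated = int(stats["connected_clients"]) >= int(stats["maxclients"])
print(saturated)  # True
```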

Examine Connection Distribution

# redis-cli CLIENT LIST
id=12345 addr=10.0.1.12:54321 fd=23 name= age=12345 idle=456 flags=N ...
id=12346 addr=10.0.1.13:54322 fd=24 name= age=12300 idle=12 flags=N ...

Important fields:

addr: client IP and port
age: connection lifetime (seconds)
idle: idle time (seconds)
flags: N = normal client, M = master link, S = replica link

Count connections per IP:

# redis-cli CLIENT LIST | awk '{print $2}' | cut -d= -f2 | cut -d: -f1 | sort | uniq -c | sort -rn | head -10
   3000 10.0.1.12
   2800 10.0.1.14
   1500 10.0.1.13
   1200 10.0.1.15
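The same per-IP tally can live in a monitoring script instead of a shell pipeline. A minimal Python sketch (the sample CLIENT LIST lines are illustrative):

```python
from collections import Counter

def count_clients_per_ip(client_list: str) -> list[tuple[str, int]]:
    """Count connections per source IP from CLIENT LIST output."""
    ips = []
    for line in client_list.strip().splitlines():
        for field in line.split():
            if field.startswith("addr="):
                # addr=10.0.1.12:54321 -> 10.0.1.12
                ips.append(field.split("=", 1)[1].rsplit(":", 1)[0])
    return Counter(ips).most_common()

sample = (
    "id=12345 addr=10.0.1.12:54321 fd=23 idle=456 flags=N\n"
    "id=12346 addr=10.0.1.12:54400 fd=24 idle=12 flags=N\n"
    "id=12347 addr=10.0.1.13:54322 fd=25 idle=30 flags=N\n"
)
print(count_clients_per_ip(sample))  # [('10.0.1.12', 2), ('10.0.1.13', 1)]
```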

Find Zombie Connections

# Find connections idle > 300 s
$ redis-cli CLIENT LIST | awk -F'[ =]' '{for(i=1;i<=NF;i++) if($i=="idle") print $(i+1), $0}' | awk '$1 > 300' | head -20
# Simpler way
$ redis-cli CLIENT LIST | grep -E "idle=[0-9]{4,}"

Many connections have idle values between 600 s and 3600 s, indicating long‑idle “zombie” connections. The timeout setting is 0 (never timeout).
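Filtering for long-idle clients is also easy to script. A Python sketch of the same idea (field names match CLIENT LIST output; the 300 s threshold mirrors the shell command above):

```python
def find_zombies(client_list: str, idle_threshold: int = 300) -> list[dict]:
    """Return CLIENT LIST entries whose idle time exceeds the threshold (seconds)."""
    zombies = []
    for line in client_list.strip().splitlines():
        fields = dict(f.split("=", 1) for f in line.split() if "=" in f)
        if int(fields.get("idle", "0")) > idle_threshold:
            zombies.append(fields)
    return zombies

sample = (
    "id=1 addr=10.0.1.12:54321 fd=23 age=12345 idle=456 flags=N\n"
    "id=2 addr=10.0.1.13:54322 fd=24 age=12300 idle=12 flags=N\n"
)
print([c["addr"] for c in find_zombies(sample)])  # ['10.0.1.12:54321']
```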

Verify System‑Level Limits

# Redis process fd limit
$ cat /proc/$(pidof redis-server)/limits | grep "Max open files"
Max open files            10024            10024            files
# System fd limit
$ cat /proc/sys/fs/file-max
100000
# Current fd usage
$ cat /proc/sys/fs/file-nr
30000   0   100000

The Redis process can open 10024 files, barely above the maxclients of 10000 (Redis also reserves some descriptors for listening sockets, log files, and AOF/RDB persistence, so there was essentially no headroom).
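The interplay between the fd limit and maxclients can be sketched as follows. The reserve of 32 descriptors matches the default in the Redis source, but treat the exact number as an implementation detail:

```python
def effective_maxclients(nofile_limit: int, requested: int,
                         reserved: int = 32) -> int:
    """Redis reserves some fds for internal use (sockets, AOF, logs);
    clients can only use what remains under the process fd limit."""
    return min(requested, nofile_limit - reserved)

# With a 10024 fd limit, a requested maxclients of 10000 cannot be honored
print(effective_maxclients(10024, 10000))   # 9992
print(effective_maxclients(100000, 10000))  # 10000
```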

Application‑Layer Investigation

Inspect the Jedis connection‑pool configuration:

<bean id="jedisPoolConfig" class="redis.clients.jedis.JedisPoolConfig">
    <property name="maxTotal" value="200"/>   <!-- maximum connections per node -->
    <property name="maxIdle" value="50"/>    <!-- maximum idle connections -->
    <property name="minIdle" value="10"/>    <!-- minimum idle connections -->
    <property name="maxWaitMillis" value="3000"/>
    <property name="testOnBorrow" value="true"/>
    <property name="testOnReturn" value="false"/>
    <property name="timeBetweenEvictionRunsMillis" value="30000"/>
</bean>

With 15 application nodes each allowed maxTotal=200 connections, configuration alone permitted up to 15 × 200 = 3000 connections; combined with poor connection reuse and leaked connections, the actual count far exceeded expectations.
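This kind of budget check is worth automating before deployments: make sure the fleet's worst-case pool size fits under the Redis connection limit. A sketch (the 80% safety factor is our assumption, chosen to match the alert threshold used later):

```python
def pool_budget_ok(app_nodes: int, max_total: int,
                   maxclients: int, safety: float = 0.8) -> bool:
    """True if the worst-case fleet pool size stays within the Redis budget."""
    return app_nodes * max_total <= maxclients * safety

print(pool_budget_ok(15, 200, 10000))  # 3000 <= 8000 -> True
print(pool_budget_ok(50, 200, 10000))  # 10000 > 8000 -> False
```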

Root‑Cause Analysis

Direct Cause – Connection Surge After Deployment

The new version deployed at midnight set minIdle=10, causing each node to pre‑create 10 connections. Rolling update meant old and new nodes co‑existed briefly, doubling connections. Some old nodes did not release connections gracefully, turning them into zombies.

Amplifying Factor – timeout=0

With timeout=0, idle connections are never closed. Zombie connections (idle > 600 s) occupied slots, so when the morning peak arrived new requests immediately hit the limit.

Avalanche Chain

Redis connections full (maxclients=10000)
 → New request cannot get a connection → JedisConnectionException
 → No circuit‑breaker → threads block on retries
 → Web container thread pool exhausted → health‑check timeout
 → Service registry removes node → traffic shifts
 → Remaining nodes’ Redis connections surge → also hit limit
 → Entire service becomes unavailable

Rapid “Stop‑Bleeding” Solutions

Solution A – Temporarily Increase maxclients (preferred)

# Temporary change (lost after restart)
$ redis-cli CONFIG SET maxclients 20000

When increasing maxclients, also raise the system file‑descriptor limits:

# Increase Redis process fd limit
$ prlimit --pid $(pidof redis-server) --nofile=30000
# Or adjust system‑wide limit
$ ulimit -n 65535

Persist the change in /etc/security/limits.conf:

root    soft    nofile  65535
root    hard    nofile  65535
redis   soft    nofile  65535
redis   hard    nofile  65535

Edit /etc/redis/redis.conf:

maxclients 20000

Solution B – Enable timeout to Clean Up Zombies

# Set idle timeout to 60 s (effective immediately)
$ redis-cli CONFIG SET timeout 60

Recommended timeout ranges:

Web apps + connection pool: timeout=60~300
Long‑living subscriptions/push: timeout=600~3600
Cache‑only usage: timeout=60

Solution C – Bulk Kill Abnormal Connections

# Kill all normal client connections (master/replica links are preserved)
$ redis-cli CLIENT KILL TYPE normal
# CLIENT KILL ADDR matches an exact ip:port; to kill every connection from
# one IP, iterate over the client IDs from CLIENT LIST
$ redis-cli CLIENT LIST | grep "addr=10.0.1.12:" | awk -F'[ =]' '{print $2}' | \
    xargs -n1 redis-cli CLIENT KILL ID
# By default SKIPME yes protects the connection issuing the command
$ redis-cli CLIENT KILL TYPE normal SKIPME no

Solution D – Restart Applications to Re‑initialize Pools

# Restart the application service (prefer rolling restart)
$ systemctl restart app-service

After restart, connection counts quickly return to normal as minIdle connections are rebuilt.

Long‑Term Governance

Connection‑Pool Best Practices

<bean id="jedisPoolConfig" class="redis.clients.jedis.JedisPoolConfig">
    <property name="maxTotal" value="50"/>   <!-- per node, 20~100 recommended -->
    <property name="maxIdle" value="20"/>
    <property name="minIdle" value="5"/>
    <property name="maxWaitMillis" value="2000"/>   <!-- 1–3 s before throwing exception -->
    <property name="testOnBorrow" value="false"/>   <!-- use periodic idle testing -->
    <property name="testWhileIdle" value="true"/>
    <property name="timeBetweenEvictionRunsMillis" value="60000"/>
    <property name="numTestsPerEvictionRun" value="-1"/>
    <property name="blockWhenExhausted" value="true"/>
</bean>

Key principles:

maxTotal should not be too large: 20–100 per node depending on concurrency; many idle connections waste resources.

Set maxWaitMillis: avoid indefinite thread blocking; 1000–3000 ms is typical.

Enable testWhileIdle: periodically verify idle connections are still alive.

Return connections properly: use try‑with‑resources (Jedis 3.x) or finally‑close (Jedis 2.x).

// Jedis 3.x recommended usage
try (Jedis jedis = jedisPool.getResource()) {
    jedis.set("key", "value");
}
// Jedis 2.x must close explicitly
Jedis jedis = null;
try {
    jedis = jedisPool.getResource();
    jedis.set("key", "value");
} finally {
    if (jedis != null) {
        jedis.close(); // returns to pool, not a real close
    }
}

Use a Connection Proxy Layer

If many clients connect directly to Redis, introduce a proxy such as Twemproxy or Predixy:

[App instances] x 50 → [Twemproxy / Predixy] → [Redis master‑slave]

Benefits:

Connection count reduces from M × N to M + N (M = app nodes, N = Redis nodes).

Proxy buffers bursty connection requests.

Supports read‑write splitting and failover.

Drawbacks:

Additional network hop adds 0.1–0.5 ms latency.

Proxy itself can become a bottleneck.
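The M × N vs. M + N arithmetic can be made concrete; the pool sizes in this sketch are illustrative assumptions:

```python
def direct_connections(app_nodes: int, redis_nodes: int, pool_size: int) -> int:
    """Without a proxy, every app node may pool connections to every Redis node."""
    return app_nodes * redis_nodes * pool_size

def proxied_connections(app_nodes: int, redis_nodes: int,
                        app_pool: int, proxy_pool: int) -> int:
    """With a proxy, apps connect only to it; the proxy keeps one
    small pool per Redis backend."""
    return app_nodes * app_pool + redis_nodes * proxy_pool

# 50 app nodes, 2 Redis nodes; pool sizes are assumptions for illustration
print(direct_connections(50, 2, 50))       # 5000
print(proxied_connections(50, 2, 10, 50))  # 600
```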

Connection Quotas and Firewall Protection

# Limit connections per application server (iptables)
$ iptables -A INPUT -p tcp --dport 6379 -m connlimit --connlimit-above 50 -j REJECT

Circuit‑Breaker and Retry Back‑off

// Resilience4j CircuitBreaker example
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50) // trigger at 50% failures
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .slidingWindowSize(10)
    .build();

// Simple exponential back‑off
int baseDelay = 100; // ms
for (int i = 0; i < maxRetries; i++) {
    try {
        return jedisPool.getResource();
    } catch (Exception e) {
        if (i == maxRetries - 1) throw e; // retries exhausted, propagate
        try {
            Thread.sleep(baseDelay * (long) Math.pow(2, i)); // 100, 200, 400 ms ...
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw e;
        }
    }
}

Monitoring and Alerting

Essential Redis connection metrics (exposed by redis_exporter for Prometheus):

redis_connected_clients
redis_config_maxclients
redis_rejected_connections_total

Suggested alert thresholds:

connected_clients / maxclients > 80% → Warning
connected_clients / maxclients > 90% → Critical
rejected_connections > 0 → Emergency
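The thresholds above map naturally onto a small severity function, e.g. inside an alerting rule or probe script. A Python sketch (function and level names are our own):

```python
def alert_level(connected: int, maxclients: int, rejected: int) -> str:
    """Map Redis connection metrics to an alert severity."""
    if rejected > 0:
        return "emergency"   # connections are already being refused
    ratio = connected / maxclients
    if ratio > 0.9:
        return "critical"
    if ratio > 0.8:
        return "warning"
    return "ok"

print(alert_level(8500, 10000, 0))   # warning
print(alert_level(9500, 10000, 0))   # critical
print(alert_level(10000, 10000, 3))  # emergency
```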

Production‑Environment Precautions

When using CONFIG SET maxclients, adjust all related parameters: Redis maxclients, system ulimit -n, kernel fs.file-max, and /etc/security/limits.conf must stay consistent.

Never modify timeout during peak hours: changing it from 0 to a small value (e.g., 30 s) can abruptly drop many connections, causing massive pool reconnections, CPU spikes, and network jitter.

Be aware of CLIENT KILL skipme behavior: the default skipme=yes prevents killing the connection that issued the command.

Cloud Redis services may bind maxclients to instance size: verify provider limits before requesting higher values.

Investigate connection‑pool leaks: if CLIENT LIST shows no obvious zombie connections, check that Jedis instances are not stored as member variables and that all exception paths close the resource.

Conclusion

The Redis connection‑saturation incident can be addressed in three layers:

First layer – Rapid stop‑bleeding:

CONFIG SET maxclients 20000   # raise limit
CONFIG SET timeout 60       # clean up zombies
CLIENT KILL TYPE normal     # bulk kill abnormal connections

Second layer – Root‑cause investigation:

Check connection distribution → Analyze idle times → Review pool config → Verify system limits

Third layer – Architectural governance:

Standardize connection pools → Implement circuit‑breaker → Use proxy aggregation → Set up monitoring & alerts → Prepare emergency runbooks

Key lessons learned: timeout=0 is high‑risk in production; set a reasonable idle timeout.

Configure minIdle and maxTotal based on actual concurrency, not arbitrarily.

Implement circuit‑breaker or degradation mechanisms to prevent single‑component failures from cascading.

Monitor connection counts during deployments and roll back or adjust immediately when anomalies appear.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Monitoring, Performance, Operations, Redis, Connection Pool, Jedis, Troubleshooting
Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
