How Redis Connection Saturation Triggered a Service Avalanche – A Detailed Investigation
An online education platform suffered a major outage when Redis hit its maxclients limit: authentication, session, and cache services failed, and the failure cascaded into a business avalanche. This article walks through the Redis connection mechanism, the root-cause analysis, rapid mitigation steps, and long-term safeguards.
Problem Background
At 10:00 AM on a weekday, an online education platform received alerts: page load failures, homework submission timeouts, and Java applications repeatedly throwing:
redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool
Redis monitoring showed connected_clients had reached the maxclients=10000 limit and rejected_connections had started to appear. With Redis connections saturated, authentication, session, and cache services became unavailable; traffic shifted to the remaining nodes and triggered a small-scale business avalanche.
Redis Connection Mechanism Quick Overview
How Redis Handles Connections
Redis uses a single-threaded event loop (network I/O can be multi-threaded since version 6.0, but command processing remains single-threaded). Each client connection occupies a file descriptor (fd), and events are multiplexed via epoll (Linux) or kqueue (BSD/macOS).
Key configuration:
maxclients 10000 # maximum connections (default 10000, Redis 2.4+)
timeout 0 # idle timeout (0 = never)
tcp-keepalive 300 # TCP keepalive interval (seconds)
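These values can be verified on a running instance, without a restart, via CONFIG GET:
$ redis-cli CONFIG GET maxclients
1) "maxclients"
2) "10000"
$ redis-cli CONFIG GET timeout
1) "timeout"
2) "0"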
Impact of Reaching maxclients
When connected_clients equals maxclients, Redis stops accepting new connections and logs:
# Redis 9001 refused connection (maxclients)
-ERR max number of clients reached
All services that rely on Redis are affected (a quick way to confirm the rejections at the server is shown after this list):
New requests cannot obtain a Redis connection → business threads block.
Blocked threads retry, filling the thread pool → web container runs out of threads.
Health checks fail, causing nodes to be removed from the service registry.
Traffic shifts to remaining nodes, whose Redis connections also surge, creating a cascading failure.
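Before digging deeper, it is worth confirming that Redis itself is rejecting clients rather than the network dropping them; the rejected_connections counter in INFO stats increments on every refused connection:
$ redis-cli INFO stats | grep -E "total_connections_received|rejected_connections"
A non-zero, climbing rejected_connections value pins the saturation on the Redis side.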
Investigation Process
Directly Check Redis Connection Status
# redis-cli -h <redis_host> -p 6379 -a <password> INFO clients
connected_clients:10000
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
cluster_connections:0
maxclients:10000
Both connected_clients and maxclients are at 10000, confirming saturation.
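One way to see whether the count is pinned at the ceiling or still climbing is a simple polling loop (substitute your own host and password):
$ watch -n 5 "redis-cli -h <redis_host> -a <password> INFO clients | grep connected_clients"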
Examine Connection Distribution
# redis-cli CLIENT LIST
id=12345 addr=10.0.1.12:54321 fd=23 name= age=12345 idle=456 flags=N ...
id=12346 addr=10.0.1.13:54322 fd=24 name= age=12300 idle=12 flags=N ...
Important fields:
addr: client IP and port
age: connection lifetime (seconds)
idle: idle time (seconds)
flags: N = normal client, M = master connection, S = replica (slave) connection
Count connections per IP:
# redis-cli CLIENT LIST | awk '{print $2}' | cut -d= -f2 | cut -d: -f1 | sort | uniq -c | sort -rn | head -10
3000 10.0.1.12
2800 10.0.1.14
1500 10.0.1.13
1200 10.0.1.15
Find Zombie Connections
# Find connections idle > 300 s
$ redis-cli CLIENT LIST | awk -F'[ =]' '{for(i=1;i<=NF;i++) if($i=="idle") print $(i+1), $0}' | awk '$1 > 300' | head -20
# Simpler way (matches idle values of 1000 s or more)
$ redis-cli CLIENT LIST | grep -E "idle=[0-9]{4,}"
Many connections had idle values between 600 s and 3600 s, indicating long-idle "zombie" connections, and the timeout setting was 0 (never time out).
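To gauge how widespread the zombies are, the idle values can be bucketed with a short pipeline; this is a quick sketch rather than a polished tool:
# Rough idle-time histogram from CLIENT LIST output
$ redis-cli CLIENT LIST | grep -oE "idle=[0-9]+" | cut -d= -f2 | awk '{if ($1<60) a++; else if ($1<600) b++; else c++} END {print "<60s:", a+0, "60-600s:", b+0, ">600s:", c+0}'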
Verify System‑Level Limits
# Redis process fd limit
$ cat /proc/$(pidof redis-server)/limits | grep "Max open files"
Max open files 10024 10024 files
# System fd limit
$ cat /proc/sys/fs/file-max
100000
# Current fd usage
$ cat /proc/sys/fs/file-nr
30000 0 100000
The Redis process can open 10024 files, which is close to the maxclients of 10000 (Redis also needs a few fds for listening sockets and AOF/RDB files).
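It is also worth checking how many descriptors the Redis process actually holds, since the gap to the limit is the real remaining headroom:
# Count fds currently open by the Redis process
$ ls /proc/$(pidof redis-server)/fd | wc -l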
Application‑Layer Investigation
Inspect the Jedis connection‑pool configuration:
<bean id="jedisPoolConfig" class="redis.clients.jedis.JedisPoolConfig">
<property name="maxTotal" value="200"/> <!-- maximum connections per node -->
<property name="maxIdle" value="50"/> <!-- maximum idle connections -->
<property name="minIdle" value="10"/> <!-- minimum idle connections -->
<property name="maxWaitMillis" value="3000"/>
<property name="testOnBorrow" value="true"/>
<property name="testOnReturn" value="false"/>
<property name="timeBetweenEvictionRunsMillis" value="30000"/>
</bean>
With 15 application nodes each allowed maxTotal=200 connections, the theoretical ceiling was already 15 × 200 = 3000 connections; insufficient connection reuse pushed the actual count far beyond expectations.
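A quick cross-check from any application node is to count its established connections to Redis; the ss filter below assumes the default port 6379:
$ ss -tan state established '( dport = :6379 )' | wc -l
Subtract one for the header row; the result should stay at or below the pool's maxTotal.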
Root‑Cause Analysis
Direct Cause – Connection Surge After Deployment
The new version deployed at midnight set minIdle=10, causing each node to pre‑create 10 connections. Rolling update meant old and new nodes co‑existed briefly, doubling connections. Some old nodes did not release connections gracefully, turning them into zombies.
Amplifying Factor – timeout=0
With timeout=0, idle connections are never closed. Zombie connections (idle > 600 s) occupied slots, so when the morning peak arrived new requests immediately hit the limit.
Avalanche Chain
Redis connections full (maxclients=10000)
→ New request cannot get a connection → JedisConnectionException
→ No circuit‑breaker → threads block on retries
→ Web container thread pool exhausted → health‑check timeout
→ Service registry removes node → traffic shifts
→ Remaining nodes’ Redis connections surge → also hit limit
→ Entire service becomes unavailable
Rapid "Stop‑Bleeding" Solutions
Solution A – Temporarily Increase maxclients (preferred)
# Temporary change (lost after restart)
$ redis-cli CONFIG SET maxclients 20000
When increasing maxclients, also raise the system file‑descriptor limits:
# Increase Redis process fd limit
$ prlimit --pid $(pidof redis-server) --nofile=30000
# Or raise the limit in the current shell (inherited only by processes started from it)
$ ulimit -n 65535
Persist the change in /etc/security/limits.conf:
root soft nofile 65535
root hard nofile 65535
redis soft nofile 65535
redis hard nofile 65535
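One caveat: if Redis is managed by systemd, limits.conf is not consulted for the service, so the fd limit must be set in the unit instead. A minimal drop-in sketch (the unit name varies by distro, e.g. redis or redis-server):
# /etc/systemd/system/redis.service.d/limits.conf
[Service]
LimitNOFILE=65535
# Apply: systemctl daemon-reload && systemctl restart redis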
Edit /etc/redis/redis.conf:
maxclients 20000
Solution B – Enable timeout to Clean Up Zombies
# Set idle timeout to 60 s (effective immediately)
$ redis-cli CONFIG SET timeout 60
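Like the maxclients change above, this is runtime-only; if the server was started from a config file, CONFIG REWRITE persists the current runtime settings back into it:
$ redis-cli CONFIG REWRITE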
Recommended timeout ranges:
Web apps + connection pool: timeout=60~300
Long-living subscriptions/push: timeout=600~3600
Cache-only usage: timeout=60
Solution C – Bulk Kill Abnormal Connections
# Kill all normal client connections (keep master‑slave links)
$ redis-cli CLIENT KILL TYPE normal
# Kill the connection from a specific address (ADDR matches an exact ip:port pair)
$ redis-cli CLIENT KILL addr 10.0.1.12:54321
# Disable skipme if the connection issuing the command should also be eligible
$ redis-cli CLIENT KILL addr 10.0.1.12:54321 skipme no
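If killing every normal client is too blunt, a more surgical sketch (assuming Redis 2.8.12+ for CLIENT KILL ID) is to kill only clients idle longer than 600 s:
$ redis-cli CLIENT LIST | awk '{id=""; idle=0; for(i=1;i<=NF;i++){split($i,kv,"="); if(kv[1]=="id") id=kv[2]; if(kv[1]=="idle") idle=kv[2]}; if (id != "" && idle+0 > 600) print id}' | while read id; do redis-cli CLIENT KILL ID "$id"; done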
Solution D – Restart Applications to Re‑initialize Pools
# Restart the application service (prefer rolling restart)
$ systemctl restart app-service
After restart, connection counts quickly return to normal as minIdle connections are rebuilt.
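A rolling restart keeps capacity up while the pools are rebuilt; a minimal sketch with hypothetical hostnames:
# Pause between nodes so each re-establishes its minIdle connections before the next drops out
$ for h in app-01 app-02 app-03; do ssh "$h" 'systemctl restart app-service'; sleep 30; done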
Long‑Term Governance
Connection‑Pool Best Practices
<bean id="jedisPoolConfig" class="redis.clients.jedis.JedisPoolConfig">
<property name="maxTotal" value="50"/> <!-- per node, 20~100 recommended -->
<property name="maxIdle" value="20"/>
<property name="minIdle" value="5"/>
<property name="maxWaitMillis" value="2000"/> <!-- 1–3 s before throwing exception -->
<property name="testOnBorrow" value="false"/> <!-- use periodic idle testing -->
<property name="testWhileIdle" value="true"/>
<property name="timeBetweenEvictionRunsMillis" value="60000"/>
<property name="numTestsPerEvictionRun" value="-1"/>
<property name="blockWhenExhausted" value="true"/>
</bean>
Key principles:
maxTotal should not be too large: 20–100 per node depending on concurrency; many idle connections waste resources.
Set maxWaitMillis: avoid indefinite thread blocking; 1000–3000 ms is typical.
Enable testWhileIdle: periodically verify idle connections are still alive.
Return connections properly: use try-with-resources (Jedis 3.x) or finally-close (Jedis 2.x), as shown below.
// Jedis 3.x recommended usage
try (Jedis jedis = jedisPool.getResource()) {
jedis.set("key", "value");
}
// Jedis 2.x must close explicitly
Jedis jedis = null;
try {
jedis = jedisPool.getResource();
jedis.set("key", "value");
} finally {
if (jedis != null) {
jedis.close(); // returns to pool, not a real close
}
}
Use a Connection Proxy Layer
If many clients connect directly to Redis, introduce a proxy such as Twemproxy or Predixy:
[App instances] x 50 → [Twemproxy / Predixy] → [Redis master‑slave]
Benefits:
Connection count drops from M × N to M + N (M = app nodes, N = Redis nodes); e.g., with the 50 app instances above and, say, 4 Redis nodes, 200 direct links become 54.
Proxy buffers bursty connection requests.
Supports read‑write splitting and failover.
Drawbacks:
Additional network hop adds 0.1–0.5 ms latency.
Proxy itself can become a bottleneck.
Connection Quotas and Firewall Protection
# Limit connections per application server (iptables)
$ iptables -A INPUT -p tcp --dport 6379 -m connlimit --connlimit-above 50 -j REJECT
Circuit‑Breaker and Retry Back‑off
// Resilience4j CircuitBreaker example
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                        // open the breaker at 50% failures
    .waitDurationInOpenState(Duration.ofSeconds(30)) // stay open 30 s before probing again
    .slidingWindowSize(10)
    .build();
CircuitBreaker breaker = CircuitBreaker.of("redis", config); // wrap Redis calls, e.g. breaker.executeSupplier(...)
// Simple exponential back-off when borrowing from the pool
int baseDelay = 100; // ms
for (int i = 0; i < maxRetries; i++) {
    try {
        return jedisPool.getResource();
    } catch (Exception e) {
        if (i == maxRetries - 1) throw e;    // give up after the last attempt
        Thread.sleep(baseDelay * (1L << i)); // 100, 200, 400 ms... (may throw InterruptedException)
    }
}
Monitoring and Alerting
Essential Redis connection metrics (exposed by redis_exporter for Prometheus):
redis_connected_clients
redis_config_maxclients
redis_rejected_connections_total
Suggested alert thresholds:
connected_clients / maxclients > 80% → Warning
connected_clients / maxclients > 90% → Critical
rejected_connections > 0 → Emergency
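To confirm the exporter is actually reporting these metrics, scrape it directly (assuming redis_exporter on its default port 9121):
$ curl -s localhost:9121/metrics | grep -E "redis_connected_clients|redis_config_maxclients|redis_rejected_connections_total"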
Production‑Environment Precautions
When using CONFIG SET maxclients, adjust all related parameters: Redis maxclients, system ulimit -n, kernel fs.file-max, and /etc/security/limits.conf must be consistent.
Never modify timeout during peak hours: changing it from 0 to a small value (e.g., 30 s) can abruptly drop many connections, causing massive pool reconnections, CPU spikes, and network jitter.
Be aware of CLIENT KILL skipme behavior: the default skipme=yes prevents killing the connection that issued the command.
Cloud Redis services may bind maxclients to instance size: verify provider limits before requesting higher values.
Investigate connection-pool leaks: if CLIENT LIST shows no obvious zombie connections, check that Jedis instances are not stored as member variables and that all exception paths close the resource.
Conclusion
The Redis connection‑saturation incident can be addressed in three layers:
First layer – Rapid stop‑bleeding:
CONFIG SET maxclients 20000 # raise limit
CONFIG SET timeout 60 # clean up zombies
CLIENT KILL TYPE normal # bulk kill abnormal connections
Second layer – Root‑cause investigation:
Check connection distribution → Analyze idle times → Review pool config → Verify system limits
Third layer – Architectural governance:
Standardize connection pools → Implement circuit‑breakers → Use proxy aggregation → Set up monitoring & alerts → Prepare emergency runbooks
Key lessons learned:
timeout=0 is high‑risk in production; set a reasonable idle timeout.
Configure minIdle and maxTotal based on actual concurrency, not arbitrarily.
Implement circuit‑breaker or degradation mechanisms to prevent single‑component failures from cascading.
Monitor connection counts during deployments and roll back or adjust immediately when anomalies appear.