Why Did Redis Connections Time Out? A Deep Dive into AOF, RDB, and Disk I/O

This article walks through a production Redis connection timeout incident, detailing how concurrent AOF persistence and RDB snapshots caused disk I/O blockage, the diagnostic steps taken, and the optimization measures implemented to eliminate the timeouts and improve performance.

IT Services Circle
Abstract: This post-mortem covers a Redis connection-timeout outage in production. Problem identification, root-cause analysis, and remediation showed that AOF persistence and RDB snapshots running concurrently saturated disk I/O. The write-up covers the full troubleshooting workflow, the technical analysis, and the tuning applied, for readers facing similar issues.

1. Problem Background and Symptoms

1.1 Issue Overview

One morning the monitoring system raised a flood of alerts: the timeout rate of interface responses had risen abnormally. Application logs showed many Redis connection-timeout exceptions, with the following typical stack trace:

redis.clients.jedis.exceptions.JedisConnectionException:
    java.net.SocketTimeoutException: Read timed out
    at redis.clients.jedis.util.RedisInputStream.ensureFill(RedisInputStream.java:205)
    at redis.clients.jedis.util.RedisInputStream.readByte(RedisInputStream.java:40)
    at redis.clients.jedis.Protocol.process(Protocol.java:151)
    at redis.clients.jedis.JedisFactory.validateObject(JedisFactory.java:214)

1.2 Preliminary Analysis

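The errors are concentrated in JedisFactory.validateObject, the method the connection pool calls to check a pooled connection. It sends a PING and waits for the reply, so a read timeout here means Redis did not answer within the client's socket timeout. Initial hypotheses included: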

Connection‑pool misconfiguration or resource exhaustion.

Unstable network to the Redis server.

Performance problems within Redis itself.

After confirming the pool configuration and network health, the investigation focused on Redis.
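To illustrate why a stalled server surfaces inside pool validation, here is a minimal sketch of a PING health check with a read timeout. It is not Jedis code: `validate_connection`, `fake_server`, and `simulate` are hypothetical names, and the "server" is a local stand-in that simply sleeps before answering, which is an assumption used to mimic a blocked Redis main thread.

```python
import socket
import threading
import time

PING = b"*1\r\n$4\r\nPING\r\n"   # RESP encoding of the PING command
PONG = b"+PONG\r\n"              # RESP simple-string reply


def validate_connection(sock: socket.socket, timeout_s: float) -> bool:
    """Send PING and report whether PONG arrives within timeout_s."""
    sock.settimeout(timeout_s)
    try:
        sock.sendall(PING)
        return sock.recv(64).startswith(b"+PONG")
    except socket.timeout:
        # The same situation a Java client reports as "Read timed out".
        return False


def fake_server(conn: socket.socket, stall_s: float) -> None:
    """Hypothetical stand-in for Redis: read one request, stall, reply."""
    try:
        conn.recv(64)
        time.sleep(stall_s)      # models a main thread blocked on disk I/O
        conn.sendall(PONG)
    except OSError:
        pass                     # the client already gave up and closed
    finally:
        conn.close()


def simulate(stall_s: float, timeout_s: float) -> bool:
    """Run one validation against a server that stalls for stall_s seconds."""
    client, server = socket.socketpair()
    worker = threading.Thread(target=fake_server, args=(server, stall_s))
    worker.start()
    try:
        return validate_connection(client, timeout_s)
    finally:
        client.close()
        worker.join()


if __name__ == "__main__":
    print(simulate(stall_s=0.0, timeout_s=0.5))   # responsive server
    print(simulate(stall_s=0.5, timeout_s=0.1))   # stalled server
```

The pool and the network can both be healthy while validation still fails, because the only thing being measured is how long the server takes to answer.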

2. Redis Service State Analysis

2.1 Log Analysis

Redis logs contain warnings such as:

[WARNING] Asynchronous AOF fsync is taking too long (disk is busy?).
Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.

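Log Interpretation:

With appendfsync everysec, fsync() runs in a background thread, and here that call was taking too long.

The disk was busy, so the fsync could not finish in time.

After postponing the write for up to two seconds, Redis writes the AOF buffer without waiting for the pending fsync. That write() can itself stall in the kernel while the fsync is still in flight, which blocks the main thread and slows every client.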
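The decision behind this warning can be modelled roughly as follows. This is a simplified sketch based on the behaviour described in the Redis latency documentation for appendfsync everysec (postpone the write for up to two seconds while a background fsync is pending, then write anyway). `AofState` and `flush_decision` are hypothetical names, not Redis internals.

```python
from dataclasses import dataclass
from typing import Optional

POSTPONE_LIMIT_S = 2  # longest time the write is held back for a pending fsync


@dataclass
class AofState:
    postponed_since: Optional[int] = None  # when postponing began, if at all
    delayed_fsync: int = 0                 # plays the role of aof_delayed_fsync


def flush_decision(state: AofState, now: int, fsync_in_progress: bool) -> str:
    """Return 'write', 'postpone', or 'write-delayed' for one flush attempt."""
    if not fsync_in_progress:
        state.postponed_since = None
        return "write"
    if state.postponed_since is None:
        state.postponed_since = now        # first attempt that hit a busy fsync
        return "postpone"
    if now - state.postponed_since < POSTPONE_LIMIT_S:
        return "postpone"                  # still within the grace period
    # Grace period exhausted: write anyway and count the event.
    state.delayed_fsync += 1
    state.postponed_since = None
    return "write-delayed"


if __name__ == "__main__":
    state = AofState()
    for second in (100, 101, 102):
        print(second, flush_decision(state, second, fsync_in_progress=True))
    print("delayed_fsync =", state.delayed_fsync)
```

In this model every "write-delayed" outcome corresponds to one occurrence of the warning in the log.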

2.2 Performance Metrics

Running redis-cli info persistence reveals key indicators:

# Persistence
loading:0
rdb_changes_since_last_save:156789
rdb_bgsave_in_progress:0
rdb_last_save_time:1698123456
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:45
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:120
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
aof_current_size:1429053125
aof_base_size:1234567890
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:20796

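Key observations:

aof_delayed_fsync: 20796. This counter increments each time Redis had to write the AOF buffer while a background fsync was still pending, so a value this high indicates severe I/O blockage.

aof_current_size: about 1.4 GB. A large AOF file puts further stress on the disk.

rdb_last_bgsave_time_sec: 45. Each RDB snapshot takes considerable time, during which it competes with AOF writes for the disk.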
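For repeated checks, the same fields can be pulled out of the INFO text programmatically. The sketch below parses an excerpt of the output shown in this section; `parse_info` is a hypothetical helper, and reading the output from a string rather than a live server is a simplification.

```python
def parse_info(text: str) -> dict:
    """Parse 'key:value' lines from INFO output, skipping '# Section' headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


# Excerpt of the INFO persistence output from this section.
SAMPLE = """\
# Persistence
rdb_last_bgsave_time_sec:45
aof_enabled:1
aof_current_size:1429053125
aof_base_size:1234567890
aof_delayed_fsync:20796
"""

info = parse_info(SAMPLE)
aof_size_gb = int(info["aof_current_size"]) / 1e9   # decimal gigabytes

print(f"delayed fsyncs: {info['aof_delayed_fsync']}")
print(f"AOF size: {aof_size_gb:.2f} GB")
print(f"last BGSAVE: {info['rdb_last_bgsave_time_sec']} s")
```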

3. Configuration Review and Problem Localization

3.1 Current Configuration

# AOF configuration
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

# RDB configuration
save 900 1
save 300 10
save 60 10000

3.2 Configuration Issues

Both AOF and RDB persistence are enabled, so snapshot writes and AOF writes compete for the same disk.

With no-appendfsync-on-rewrite no, Redis keeps calling fsync while a background save or AOF rewrite is running, further increasing I/O load.

The server uses a traditional HDD, whose random I/O performance is poor, amplifying the problem.
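The first two findings can be checked mechanically against a configuration file. The following is an illustrative sketch only: `parse_conf` and `persistence_risks` are hypothetical helpers, the parser handles just simple `directive args` lines, and the two rules encode this article's findings rather than any general Redis guidance.

```python
def parse_conf(text: str) -> dict:
    """Map each directive to the list of argument strings it appears with."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, args = line.partition(" ")
        conf.setdefault(name.lower(), []).append(args.strip())
    return conf


def persistence_risks(conf: dict) -> list:
    """Return human-readable findings for the combination seen in this incident."""
    findings = []
    aof_on = conf.get("appendonly", ["no"])[-1] == "yes"

    save_points = []
    for args in conf.get("save", []):
        if args.strip('"') == "":
            save_points = []          # save "" clears earlier save points
        else:
            save_points.append(args)

    if aof_on and save_points:
        findings.append("AOF and RDB save points are both enabled")
    if aof_on and conf.get("no-appendfsync-on-rewrite", ["no"])[-1] == "no":
        findings.append("fsync stays active during background save/rewrite")
    return findings


CURRENT_CONF = """\
appendonly yes
appendfsync everysec
no-appendfsync-on-rewrite no
save 900 1
save 300 10
save 60 10000
"""

for finding in persistence_risks(parse_conf(CURRENT_CONF)):
    print(finding)
```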

4. Root Cause Analysis

Concurrent execution of AOF fsync and RDB snapshot leads to disk I/O saturation → fsync delays → Redis main thread blocks → client requests time out.

5. Solution Design and Implementation

5.1 Optimization Strategies

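Simplify persistence: Disable RDB snapshots and rely solely on AOF (Redis 3.x does not support mixed persistence).

Adjust AOF sync policy: Skip fsync while an AOF rewrite is running, which reduces I/O contention. The trade-off is weaker durability during that window: the Redis documentation notes that, with default Linux settings, up to 30 seconds of log can be lost in the worst case.

Tune rewrite parameters: Raise the minimum AOF size that triggers a rewrite, to lower rewrite frequency.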

# Disable RDB snapshots
save ""

# Pause fsync during rewrite
no-appendfsync-on-rewrite yes

# Rewrite thresholds: percentage unchanged at 100, minimum size raised from 64mb to 128mb
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 128mb
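To see how the two rewrite thresholds interact, here is a worked sketch of the trigger condition using the sizes from the INFO output in section 2.2. The function names are hypothetical, and the integer arithmetic is an assumption about how the growth percentage is computed; the documented rule is that a rewrite starts once the AOF exceeds the minimum size and has grown by the configured percentage over its size after the last rewrite.

```python
def aof_growth_percent(current_size: int, base_size: int) -> int:
    """Growth of the AOF relative to its size after the last rewrite (integer %)."""
    base = base_size or 1                 # avoid division by zero on a fresh AOF
    return current_size * 100 // base - 100


def should_rewrite(current_size: int, base_size: int,
                   percentage: int, min_size: int) -> bool:
    """True when both the minimum-size and the growth condition are met."""
    if percentage == 0 or current_size <= min_size:
        return False
    return aof_growth_percent(current_size, base_size) >= percentage


MB = 1024 * 1024
current, base = 1429053125, 1234567890    # aof_current_size, aof_base_size

print(aof_growth_percent(current, base))
print(should_rewrite(current, base, 100, 64 * MB))
print(should_rewrite(current, base, 100, 128 * MB))
print(should_rewrite(2 * base, base, 100, 128 * MB))
```

Under these assumptions, a file already well over 1 GB is far above either minimum size, so the next rewrite is governed by the percentage (the file has to roughly double) rather than by the change from 64mb to 128mb.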

5.2 Implementation Steps

Back up the existing Redis configuration file.

Apply AOF parameter changes and monitor the effect.

Once AOF stability is confirmed, disable RDB snapshots.

Continuously monitor key metrics to verify improvement.
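The staged rollout above might be scripted along these lines. This is a sketch under assumptions the article does not confirm: that the changes are applied at runtime with CONFIG SET (rather than by editing the file and restarting), that the minimum size is given in bytes, and that `backup_config`, `STAGES`, and `cli_commands` are hypothetical names. The commands are only rendered as strings, not executed.

```python
import shutil
import time
from pathlib import Path


def backup_config(conf_path: str) -> Path:
    """Copy the config file next to itself with a timestamp suffix."""
    src = Path(conf_path)
    dst = src.with_name(f"{src.name}.bak-{time.strftime('%Y%m%d-%H%M%S')}")
    shutil.copy2(src, dst)                 # copy2 also keeps mtime and permissions
    return dst


# Stage 0: AOF parameters. Stage 1: disable RDB once AOF looks stable.
STAGES = [
    ["CONFIG SET no-appendfsync-on-rewrite yes",
     "CONFIG SET auto-aof-rewrite-min-size 134217728"],   # 128 MB in bytes
    ['CONFIG SET save ""'],
]


def cli_commands(stage: int) -> list:
    """Render one stage as redis-cli invocations (nothing is executed here)."""
    return [f"redis-cli {cmd}" for cmd in STAGES[stage]]


if __name__ == "__main__":
    for stage in range(len(STAGES)):
        print(f"stage {stage}:")
        for command in cli_commands(stage):
            print("  " + command)
```

Runtime changes made this way are lost on restart unless the configuration file is updated as well, which is one reason to keep the backup from the first step.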

5.3 Monitoring Indicators

# Key performance indicators
redis-cli info persistence | grep -E "(aof_delayed_fsync|rdb_bgsave_in_progress)"
redis-cli info stats | grep -E "(total_connections_received|rejected_connections)"
redis-cli info clients | grep connected_clients
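Because aof_delayed_fsync is a cumulative counter (it only goes back to zero on a restart or CONFIG RESETSTAT), the change between two samples is usually more informative than the absolute value. A minimal sketch follows; `counter_delta` and `per_minute` are hypothetical helpers, and the second sample value in the usage lines is made up for illustration.

```python
def counter_delta(previous: int, current: int) -> int:
    """Increase of a cumulative counter between two samples.

    A drop means the counter was reset in between, in which case the
    current value is taken as the increase since the reset.
    """
    return current if current < previous else current - previous


def per_minute(delta: int, interval_s: float) -> float:
    """Scale an increase observed over interval_s seconds to a per-minute rate."""
    return delta * 60.0 / interval_s


# Hypothetical samples taken 30 seconds apart.
delta = counter_delta(previous=20796, current=20810)
print(delta, per_minute(delta, interval_s=30))
```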

6. Validation and Ongoing Optimization

6.1 Effect Verification

After 24 hours with the new configuration, monitoring data showed dramatic improvements:

aof_delayed_fsync dropped from 20796 to 23 (99.9 % reduction).

Average response time fell from 150 ms to 45 ms (70 % reduction).

Connection‑timeout rate fell from 15.2 % to 0.1 % (99.3 % reduction).

CPU usage decreased from 85 % to 45 %.
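The quoted reductions can be re-derived from the before and after values. The short sketch below does that arithmetic; `reduction_percent` is a hypothetical helper, and the CPU figure it prints is computed here, not stated in the original measurements.

```python
def reduction_percent(before: float, after: float) -> float:
    """Relative reduction from before to after, in percent."""
    return (before - after) * 100.0 / before


results = {
    "aof_delayed_fsync": (20796, 23),
    "average response time (ms)": (150, 45),
    "connection-timeout rate (%)": (15.2, 0.1),
    "CPU usage (%)": (85, 45),
}

for name, (before, after) in results.items():
    print(f"{name}: {reduction_percent(before, after):.1f}% reduction")
```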

6.2 Long‑Term Monitoring Strategy

Daily checks of aof_delayed_fsync and disk I/O usage.

Alert when aof_delayed_fsync > 100, disk I/O > 80 %, or Redis connection‑timeout rate > 1 %.

Monthly review of AOF file growth and quarterly performance baseline testing.

Plan hardware upgrades (e.g., SSD) as needed.
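The alert rule above could be expressed as a small threshold check. In this sketch the metric names `disk_io_util_percent` and `timeout_rate_percent` are hypothetical labels for values collected elsewhere (for example from the operating system and the application), and the disk figure in the usage lines is invented for illustration.

```python
THRESHOLDS = {
    "aof_delayed_fsync": 100,      # count
    "disk_io_util_percent": 80,    # percent
    "timeout_rate_percent": 1,     # percent
}


def breached(metrics: dict) -> list:
    """Names of metrics whose value exceeds its threshold, in sorted order."""
    return sorted(name for name, limit in THRESHOLDS.items()
                  if metrics.get(name, 0) > limit)


sample = {
    "aof_delayed_fsync": 23,
    "disk_io_util_percent": 85,    # hypothetical reading
    "timeout_rate_percent": 0.1,
}
print(breached(sample))
```

Missing metrics are treated as zero here, which keeps the check simple but would hide a broken collector; a production version would likely report missing values separately.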

7. Conclusion

The successful resolution of the Redis connection‑timeout issue underscores the value of systematic troubleshooting and deep understanding of Redis persistence mechanisms. Performance problems in distributed systems often involve interactions across multiple layers; establishing comprehensive monitoring and mastering component internals are essential for rapid diagnosis and reliable operation.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
