Why Did Redis Connections Time Out? A Deep Dive into AOF, RDB, and Disk I/O
This article walks through a production Redis connection timeout incident, detailing how concurrent AOF persistence and RDB snapshots caused disk I/O blockage, the diagnostic steps taken, and the optimization measures implemented to eliminate the timeouts and improve performance.
Abstract: This article is a post‑mortem of a Redis connection‑timeout outage in production. Step‑by‑step problem identification and root‑cause analysis traced the timeouts to disk I/O blockage caused by concurrent AOF persistence and RDB snapshots. The write‑up covers the full troubleshooting workflow, the technical analysis, and the optimizations applied, as a reference for similar issues.
1. Problem Background and Symptoms
1.1 Issue Overview
One morning the monitoring system raised a flood of alerts for an abnormal rise in the timeout rate of API responses. Application logs showed many Redis connection‑timeout exceptions with the following typical stack trace:
redis.clients.jedis.exceptions.JedisConnectionException:
java.net.SocketTimeoutException: Read timed out
at redis.clients.jedis.util.RedisInputStream.ensureFill(RedisInputStream.java:205)
at redis.clients.jedis.util.RedisInputStream.readByte(RedisInputStream.java:40)
at redis.clients.jedis.Protocol.process(Protocol.java:151)
at redis.clients.jedis.JedisFactory.validateObject(JedisFactory.java:214)
1.2 Preliminary Analysis
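The stack trace ends in JedisFactory.validateObject, the method the connection pool calls to validate a pooled connection, so the read timeout occurs while the client waits for Redis to answer the validation check. Initial hypotheses included: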
Connection‑pool misconfiguration or resource exhaustion.
Unstable network to the Redis server.
Performance problems within Redis itself.
After confirming the pool configuration and network health, the investigation focused on Redis.
2. Redis Service State Analysis
2.1 Log Analysis
Redis logs contain warnings such as:
[WARNING] Asynchronous AOF fsync is taking too long (disk is busy?).
Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
Log Interpretation:
The fsync() call issued for AOF persistence is taking too long.
The disk is busy, so I/O requests are delayed.
Rather than keep waiting for the background fsync, Redis writes the AOF buffer anyway; that write can stall the main thread while the disk is saturated, which degrades overall performance.
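To see when the warning clusters, the log can be bucketed by minute. The following is a minimal sketch, not the tooling used in the incident: the timestamp layout, the regular expression, and the sample lines are assumptions made for illustration, and only the warning text itself comes from the log above.

```python
import re
from collections import Counter

# Substring of the warning quoted above; the rest of the line format is assumed.
WARNING_TEXT = "Asynchronous AOF fsync is taking too long"

# Assumed timestamp layout ("24 Oct 2023 08:15:32.123"); adjust for your logs.
MINUTE_RE = re.compile(r"(\d{1,2} \w{3} \d{4} \d{2}:\d{2}):\d{2}")


def count_fsync_warnings(lines):
    """Return a Counter mapping 'day month year hh:mm' to warning count."""
    per_minute = Counter()
    for line in lines:
        if WARNING_TEXT not in line:
            continue
        match = MINUTE_RE.search(line)
        # Lines without a recognizable timestamp are grouped under "unknown".
        per_minute[match.group(1) if match else "unknown"] += 1
    return per_minute


# Made-up log lines for illustration only.
SAMPLE_LOG = [
    "1234:M 24 Oct 2023 08:15:32.123 * Asynchronous AOF fsync is taking too long (disk is busy?).",
    "1234:M 24 Oct 2023 08:15:47.456 * Asynchronous AOF fsync is taking too long (disk is busy?).",
    "1234:M 24 Oct 2023 08:16:03.789 * Background saving started by pid 5678",
    "1234:M 24 Oct 2023 08:16:20.001 * Asynchronous AOF fsync is taking too long (disk is busy?).",
]

if __name__ == "__main__":
    for minute, count in sorted(count_fsync_warnings(SAMPLE_LOG).items()):
        print(minute, count)
```

Comparing these buckets with the times at which background saves or rewrites start is one way to check whether the warnings coincide with persistence activity.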
2.2 Performance Metrics
Running redis-cli info persistence reveals key indicators:
# Persistence
loading:0
rdb_changes_since_last_save:156789
rdb_bgsave_in_progress:0
rdb_last_save_time:1698123456
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:45
aof_enabled:1
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:120
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
aof_current_size:1429053125
aof_base_size:1234567890
aof_pending_rewrite:0
aof_buffer_length:0
aof_rewrite_buffer_length:0
aof_pending_bio_fsync:0
aof_delayed_fsync:20796
Key observations:
aof_delayed_fsync: 20796 – on 20,796 occasions a background fsync was still pending when Redis needed to write the AOF buffer, indicating severe I/O blockage.
aof_current_size: 1429053125 (about 1.4 GB) – a large AOF file puts further stress on the disk.
rdb_last_bgsave_time_sec: 45 – the last RDB snapshot took 45 seconds, a long window of heavy disk writes.
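The same reading of the INFO output can be automated. Below is a minimal sketch that parses the `key:value` lines and flags the three indicators discussed above. The function names and the two thresholds (`slow_bgsave_sec`, `large_aof_bytes`) are hypothetical choices for illustration, not values recommended by Redis; the sample text is a subset of the output shown earlier.

```python
def parse_info(text):
    """Parse 'key:value' lines from INFO output into a dict."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        # Keep integers as int (including negatives such as -1), the rest as str.
        fields[key] = int(value) if value.lstrip("-").isdigit() else value
    return fields


def persistence_findings(fields, slow_bgsave_sec=30, large_aof_bytes=1_000_000_000):
    """Return human-readable findings; both thresholds are illustrative."""
    findings = []
    delayed = fields.get("aof_delayed_fsync", 0)
    if delayed > 0:
        findings.append(f"aof_delayed_fsync={delayed}: fsync has been delayed")
    bgsave = fields.get("rdb_last_bgsave_time_sec", -1)
    if bgsave >= slow_bgsave_sec:
        findings.append(f"last RDB bgsave took {bgsave}s")
    size = fields.get("aof_current_size", 0)
    if size >= large_aof_bytes:
        findings.append(f"AOF file is {size / 1e9:.2f} GB")
    return findings


# Subset of the INFO output shown in this section.
SAMPLE_INFO = """\
# Persistence
rdb_last_bgsave_time_sec:45
aof_enabled:1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_current_size:1429053125
aof_base_size:1234567890
aof_delayed_fsync:20796
"""

if __name__ == "__main__":
    for finding in persistence_findings(parse_info(SAMPLE_INFO)):
        print(finding)
```

In practice the text would come from `redis-cli info persistence`; it is hard-coded here so the sketch runs without a server.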
3. Configuration Review and Problem Localization
3.1 Current Configuration
# AOF configuration
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
# RDB configuration
save 900 1
save 300 10
save 60 10000
3.2 Configuration Issues
Both AOF and RDB persistence run concurrently, causing competing writes to the disk.
With no-appendfsync-on-rewrite set to no, Redis keeps issuing fsync while a background rewrite or snapshot is writing heavily to disk, further increasing I/O load.
The server uses a traditional HDD, whose random I/O performance is poor, amplifying the problem.
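The first two issues can be detected mechanically from the configuration text. The sketch below is an illustrative checker under stated assumptions: the function names are hypothetical, it only understands the directives shown in section 3.1, and it treats a file with no `save` lines as having RDB disabled, although Redis applies built-in default save points in that case.

```python
def parse_conf(text):
    """Map each directive name to the list of argument strings it appears with."""
    directives = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, _, args = line.partition(" ")
        directives.setdefault(name.lower(), []).append(args.strip())
    return directives


def persistence_risks(text):
    """Return a list of risky persistence settings found in the config text."""
    d = parse_conf(text)
    aof_on = d.get("appendonly", ["no"])[-1] == "yes"

    # A 'save ""' line clears earlier save points; later lines add new ones.
    save_points = []
    for args in d.get("save", []):
        if args.strip("\"'") == "":
            save_points = []
        else:
            save_points.append(args)

    risks = []
    if aof_on and save_points:
        risks.append("AOF and RDB snapshots are both enabled")
    fsync_policy = d.get("appendfsync", ["everysec"])[-1]
    skip_fsync = d.get("no-appendfsync-on-rewrite", ["no"])[-1]
    if aof_on and fsync_policy in ("always", "everysec") and skip_fsync == "no":
        risks.append("fsync stays active during background rewrites and saves")
    return risks


# The configuration from section 3.1.
CURRENT_CONF = """\
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
save 900 1
save 300 10
save 60 10000
"""

if __name__ == "__main__":
    for risk in persistence_risks(CURRENT_CONF):
        print(risk)
```

The third issue, the HDD, is a hardware property and cannot be read from redis.conf, so the checker does not cover it.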
4. Root Cause Analysis
Concurrent execution of AOF fsync and RDB snapshot leads to disk I/O saturation → fsync delays → Redis main thread blocks → client requests time out.
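The step from "fsync delays" to "main thread blocks" can be illustrated with a simplified model of how an everysec flush attempt is handled. This is a sketch loosely modeled on the behavior described by the log message, not Redis's actual code: the two‑second postponement limit, the state fields, and the return labels are assumptions to verify against the source of the Redis version in use.

```python
from dataclasses import dataclass

# Assumed limit: how long a flush is postponed while a background fsync runs.
POSTPONE_LIMIT_SECONDS = 2


@dataclass
class AofState:
    postponed_since: int = 0  # 0 means no flush is currently postponed
    delayed_fsync: int = 0    # plays the role of the aof_delayed_fsync counter


def flush_decision(state, now, fsync_in_progress):
    """Decide what one flush attempt does at time 'now' (in seconds)."""
    if not fsync_in_progress:
        state.postponed_since = 0
        return "write"
    if state.postponed_since == 0:
        # First attempt that finds the disk busy: remember when waiting began.
        state.postponed_since = now
        return "postpone"
    if now - state.postponed_since < POSTPONE_LIMIT_SECONDS:
        return "postpone"
    # Waited long enough: write anyway. On a saturated disk this write can
    # block the main thread, which is when client requests start timing out.
    state.delayed_fsync += 1
    state.postponed_since = 0
    return "forced_write"


if __name__ == "__main__":
    state = AofState()
    for second in (100, 101, 102):
        print(second, flush_decision(state, second, fsync_in_progress=True))
    print("delayed_fsync =", state.delayed_fsync)
```

In this model every forced write increments the counter once, which is why a large aof_delayed_fsync value points to repeated stalls rather than a single slow fsync.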
5. Solution Design and Implementation
5.1 Optimization Strategies
Simplify persistence : Disable RDB snapshots and rely solely on AOF (Redis 3.x does not support mixed persistence).
Adjust AOF sync policy : Pause fsync during AOF rewrite to reduce I/O contention.
Tune rewrite parameters : Raise the minimum AOF size that triggers an automatic rewrite (the growth percentage stays at 100) to lower rewrite frequency.
# Disable RDB snapshots
save ""
# Pause fsync during rewrite
no-appendfsync-on-rewrite yes
# Increase rewrite thresholds
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 128mb
5.2 Implementation Steps
Back up the existing Redis configuration file.
Apply AOF parameter changes and monitor the effect.
Once AOF stability is confirmed, disable RDB snapshots.
Continuously monitor key metrics to verify improvement.
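Before applying the AOF parameter changes, it can help to estimate when the rewrite thresholds would actually trigger an automatic rewrite. The sketch below assumes the trigger condition works as commonly described: the AOF must exceed the minimum size and must have grown by at least the configured percentage relative to its size after the last rewrite. The function name is hypothetical and the exact arithmetic should be checked against the Redis version in use.

```python
def should_auto_rewrite(current_size, base_size, percentage, min_size):
    """Return True if an automatic AOF rewrite would be triggered.

    current_size -- aof_current_size in bytes
    base_size    -- aof_base_size in bytes (size after the last rewrite)
    percentage   -- auto-aof-rewrite-percentage (0 disables auto rewrite)
    min_size     -- auto-aof-rewrite-min-size in bytes
    """
    if percentage == 0 or current_size <= min_size:
        return False
    base = base_size if base_size else 1  # avoid dividing by zero
    growth = current_size * 100 // base - 100
    return growth >= percentage


MIN_SIZE_128MB = 128 * 1024 * 1024

if __name__ == "__main__":
    # Sizes taken from the INFO output in section 2.2: growth is only 15 %.
    print(should_auto_rewrite(1429053125, 1234567890, 100, MIN_SIZE_128MB))
    # The AOF has to double relative to its base size before a rewrite starts.
    print(should_auto_rewrite(2 * 1234567890, 1234567890, 100, MIN_SIZE_128MB))
```

With an AOF already well above 1 GB, the minimum size is exceeded under both the old and the new setting, so under this model the growth percentage is the value that decides how often rewrites run.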
5.3 Monitoring Indicators
# Key performance indicators
redis-cli info persistence | grep -E "(aof_delayed_fsync|rdb_bgsave_in_progress)"
redis-cli info stats | grep -E "(total_connections_received|rejected_connections)"
redis-cli info clients | grep connected_clients
6. Validation and Ongoing Optimization
6.1 Effect Verification
Twenty‑four hours after the configuration changes, monitoring data showed marked improvements:
aof_delayed_fsync dropped from 20796 to 23 (99.9 % reduction).
Average response time fell from 150 ms to 45 ms (70 % reduction).
Connection‑timeout rate fell from 15.2 % to 0.1 % (99.3 % reduction).
CPU usage decreased from 85 % to 45 %.
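The reduction percentages above follow directly from the before and after values. A short worked check, with a hypothetical helper name:

```python
def percent_reduction(before, after):
    """Relative reduction from 'before' to 'after', in percent, one decimal."""
    return round((before - after) / before * 100, 1)


# Before/after pairs quoted in this section.
RESULTS = {
    "aof_delayed_fsync": (20796, 23),
    "average response time (ms)": (150, 45),
    "connection-timeout rate (%)": (15.2, 0.1),
    "CPU usage (%)": (85, 45),
}

if __name__ == "__main__":
    for name, (before, after) in RESULTS.items():
        print(f"{name}: {percent_reduction(before, after)} % reduction")
```

Because aof_delayed_fsync is a counter that accumulates until the server restarts or its statistics are reset, a before/after comparison is only meaningful when both values cover comparable time windows.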
6.2 Long‑Term Monitoring Strategy
Daily checks of aof_delayed_fsync and disk I/O usage.
Alert when aof_delayed_fsync > 100, disk I/O utilization > 80 %, or the Redis connection‑timeout rate > 1 %.
Monthly review of AOF file growth and quarterly performance baseline testing.
Plan hardware upgrades (e.g., SSD) as needed.
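The alert rules above can be expressed as a small threshold check. This is a minimal sketch: the metric names are hypothetical labels, the thresholds are the ones listed in this section, and collecting the values (for example from `redis-cli info persistence`, a disk utilization tool, and application metrics) is left out.

```python
# Thresholds from section 6.2; the metric names are illustrative labels.
THRESHOLDS = {
    "aof_delayed_fsync": 100,
    "disk_io_util_percent": 80.0,
    "timeout_rate_percent": 1.0,
}


def check_alerts(metrics, thresholds=THRESHOLDS):
    """Return one message per metric that exceeds its threshold."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds {limit}")
    return alerts


if __name__ == "__main__":
    # The disk utilization figure is a made-up value for illustration.
    sample = {
        "aof_delayed_fsync": 20796,
        "disk_io_util_percent": 95.0,
        "timeout_rate_percent": 15.2,
    }
    for alert in check_alerts(sample):
        print(alert)
```

Since aof_delayed_fsync only ever increases between restarts, alerting on its growth between two samples rather than on its absolute value may be the more useful rule for a long‑running instance.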
7. Conclusion
The successful resolution of the Redis connection‑timeout issue underscores the value of systematic troubleshooting and deep understanding of Redis persistence mechanisms. Performance problems in distributed systems often involve interactions across multiple layers; establishing comprehensive monitoring and mastering component internals are essential for rapid diagnosis and reliable operation.
IT Services Circle
