Why Did Our Redis Freeze? Uncovering AOF Risks and Recovery Strategies
A recent hardware failure left a physical server's disk read‑only, causing Redis to hang; this article explains the AOF mechanism, its potential pitfalls, log strategies, and practical steps to prevent and mitigate such issues in production environments.
Cause
Recently a physical machine's hard disk went offline and became read‑only, which caused Redis to stall and the operating system to report errors.
Redis Application Error
io.lettuce.core.RedisCommandExecutionException: MISCONF Errors writing to the AOF file: Read-only file system
org.springframework.dao.QueryTimeoutException: Redis command timed out; nested exception is io.lettuce.core.RedisCommandTimeoutException: Command timed out
The read‑only file system prevented AOF writes. Although the AOF policy was set to everysec, which runs fsync in a background thread, the main thread was still blocked: even key reads timed out.
AOF Mechanism
AOF (Append‑Only File) is a write‑after log: Redis first executes a command in memory and only then appends it to the AOF file.
A traditional database redo log is a write‑ahead log that records the changed data before applying it, whereas AOF records, as text, every write command that Redis has executed.
AOF Log Content
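Each entry is stored in the Redis protocol format: a line such as *3 gives the number of arguments in the command, and each argument is preceded by $ and its length in bytes. For example, $3 followed by set denotes the three‑byte command name "set".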
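The entry layout described above can be reproduced with a few lines of Python. This is a minimal sketch of the encoding only; the function name encode_resp_command and the sample key and value are illustrative assumptions, not something taken from the incident.

```python
def encode_resp_command(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings, the text form kept in the AOF."""
    # "*N" announces how many arguments follow.
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode("utf-8")
        # "$L" announces the byte length of the next argument.
        parts.append(b"$" + str(len(data)).encode() + b"\r\n" + data + b"\r\n")
    return b"".join(parts)


entry = encode_resp_command("set", "testkey", "testvalue")
# Show the entry one protocol line per output line.
print(entry.decode().replace("\r\n", "\n"))
```

For this sample command the printed lines are *3, $3, set, $7, testkey, $9, testvalue.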
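Logging after execution has two benefits:
It avoids logging erroneous commands, because only commands that executed successfully are recorded.
It does not block the current write operation.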
Potential Risks of AOF
Data loss: If a crash occurs before the log is flushed to disk, the last command may be lost.
Main‑thread blocking: Although AOF avoids blocking the current command, the log is written by the main thread; heavy disk I/O can slow down subsequent operations.
Controlling when the AOF log is flushed mitigates these risks.
Log Strategies
1. Always – Synchronously write the log to disk after each command.
2. Everysec – Buffer log entries in memory and flush to disk every second.
3. No – Let the operating system decide when to flush.
Summary:
Choose No for highest performance.
Choose Always for maximum durability.
Choose Everysec for a balance, accepting the loss of roughly the last second of writes after a crash.
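The three strategies correspond to the values always, everysec, and no of the appendfsync setting (in redis.conf, a line such as appendfsync everysec). As a rough sketch, the policy can also be changed at runtime with CONFIG SET. The helper below is hypothetical and accepts any client object that exposes a config_set method, as redis-py's client does.

```python
VALID_POLICIES = ("always", "everysec", "no")


def set_fsync_policy(client, policy: str) -> str:
    """Apply an AOF flush policy through CONFIG SET and return the value applied.

    `client` is assumed to expose config_set(name, value), as redis-py's
    Redis class does; any object with that method works here.
    """
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown appendfsync policy: {policy!r}")
    client.config_set("appendfsync", policy)
    return policy


# Usage, assuming the redis-py package and a local server (both assumptions):
#   import redis
#   set_fsync_policy(redis.Redis(host="localhost", port=6379), "everysec")
```

Validating the value first keeps a typo from reaching a production instance. A runtime change is not written back to redis.conf unless CONFIG REWRITE is issued as well.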
Back to the Problem
Under our everysec policy, fsync runs in a background thread. Because the file system was read‑only, that thread hung. The main thread still writes the AOF buffer itself, and it has to wait when an earlier background fsync has not finished, so it ended up blocked and all Redis operations, including reads, stalled.
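The interaction between the two threads can be pictured with a small model. This is a simplified sketch, not Redis source code. It assumes the behavior described in Redis's latency documentation: while a background fsync is still running, the main thread postpones its AOF write for up to two seconds, then issues the write anyway, and that write can block. The function name and return labels are invented for illustration.

```python
from typing import Optional

# Assumption: the postponement limit described in Redis's latency documentation.
MAX_POSTPONE_SECONDS = 2.0


def aof_write_decision(fsync_in_progress: bool,
                       postponed_since: Optional[float],
                       now: float) -> str:
    """Return what the main thread does with its pending AOF buffer (simplified model)."""
    if not fsync_in_progress:
        return "write"            # no background fsync pending: the write proceeds
    if postponed_since is None:
        return "postpone"         # fsync is busy: start the postponement timer
    if now - postponed_since < MAX_POSTPONE_SECONDS:
        return "postpone"         # still within the grace period
    return "write_may_block"      # grace period over: write anyway, possibly blocking


print(aof_write_decision(False, None, 10.0))   # write
print(aof_write_decision(True, None, 10.0))    # postpone
print(aof_write_decision(True, 10.0, 11.0))    # postpone
print(aof_write_decision(True, 10.0, 12.5))    # write_may_block
```

On a healthy disk the last branch costs some latency. When the background fsync never returns, as on our failed disk, the main thread can stay stuck in that write.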
How to Improve?
For disk failures, enhance Sentinel checks to verify that the disk is writable, not just that the instance answers PING.
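A writability check of this kind might look like the sketch below. It shows two independent probes: one writes and fsyncs a temporary file in the Redis data directory, the other inspects fields from the persistence section of INFO (aof_enabled, aof_last_write_status, rdb_last_bgsave_status). The function names are hypothetical, and how the result is fed into Sentinel or another monitor is left open.

```python
import os
import tempfile


def disk_is_writable(directory: str) -> bool:
    """Try to create, write, and fsync a temporary file in `directory`."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory) as probe:
            probe.write(b"probe")
            probe.flush()
            os.fsync(probe.fileno())
        return True
    except OSError:
        # A read-only or missing file system surfaces here.
        return False


def persistence_is_healthy(info: dict) -> bool:
    """Check INFO persistence fields; `info` is a dict such as redis-py's info() returns."""
    if int(info.get("aof_enabled", 0)) == 1 and info.get("aof_last_write_status") != "ok":
        return False
    return info.get("rdb_last_bgsave_status", "ok") == "ok"


print(disk_is_writable(tempfile.gettempdir()))
print(persistence_is_healthy({"aof_enabled": 1, "aof_last_write_status": "err"}))
```

The disk probe can itself hang on a failing device, so in practice it should run under a timeout, for example in a separate process that the monitor can kill.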
To reduce I/O pressure:
Separate high‑IO applications from the Redis host.
Set no-appendfsync-on-rewrite to yes to skip fsync during AOF rewrite, accepting possible data loss.
Schedule backups and AOF writes per instance to spread I/O load.
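One simple way to spread the load is to give each instance on a host its own start offset inside a maintenance window. The sketch below only computes the offsets. The instance names are made up, and the scheduler that would issue BGSAVE or BGREWRITEAOF at those times is assumed to exist separately.

```python
from typing import Dict, List


def staggered_offsets(instances: List[str], window_seconds: int) -> Dict[str, int]:
    """Spread instances evenly across a window; returns start offsets in seconds."""
    if not instances:
        return {}
    step = window_seconds // len(instances)
    return {name: index * step for index, name in enumerate(instances)}


# Hypothetical instance names on one host, spread across a one-hour window.
print(staggered_offsets(["redis-6379", "redis-6380", "redis-6381"], 3600))
```

With three instances and a 3600-second window the offsets are 0, 1200, and 2400 seconds, so no two instances start a backup or rewrite at the same moment.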
These measures help prevent Redis from becoming unresponsive when the underlying storage encounters issues.
