
Redis Crash Interview: How to Recover a Failed Node and Estimate Data Loss

This article walks through a systematic emergency response for Redis outages, explains how Redis Cluster promotes a replica, quantifies the typical data‑loss window from hundreds of milliseconds to several seconds, and provides detailed persistence configurations (RDB, AOF, and hybrid) to minimise downtime and data loss.

Tech Freedom Circle

Interview Question: What to Do When Redis Crashes?

Redis is an in‑memory database; a crash can cause data loss. This guide explains a step‑by‑step emergency process, persistence mechanisms, cluster failover behavior and configuration tips to minimise downtime.

Quick “stop‑bleeding” process

Check service status (e.g., ps aux | grep redis or systemctl status redis).

Inspect the log (/var/log/redis/redis-server.log) for out‑of‑memory, background‑save failure, or timeout errors.

Restart the service (systemctl restart redis or redis-server /path/to/redis.conf).

If the restart fails, degrade at the application level: serve reads directly from the primary database, and send writes to the database plus a message queue so they can be replayed into Redis once it recovers.
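
The degradation step can be sketched in a few lines. This is a minimal Python sketch rather than a production client: read_with_fallback, cache_get, and db_get are hypothetical names introduced here, and the sketch assumes the cache client signals an outage with the built-in ConnectionError or TimeoutError (a real Redis client library typically raises its own exception types, which you would catch instead).

```python
from typing import Callable, Optional


def read_with_fallback(
    key: str,
    cache_get: Callable[[str], Optional[str]],
    db_get: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the value for `key`, bypassing the cache when it is unreachable."""
    try:
        cached = cache_get(key)
    except (ConnectionError, TimeoutError):
        # Redis is down or too slow: degrade to the database instead of failing.
        return db_get(key)
    if cached is not None:
        return cached
    # Ordinary cache miss: read from the database.
    return db_get(key)
```

During an outage every read lands on the database, so this path usually needs rate limiting or a circuit breaker in front of it.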

Data recovery after the service is up

Recovery method depends on the persistence mode that was enabled.

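AOF (Append‑Only File): Redis replays the write commands recorded in the AOF file. Ensure appendonly yes is set and appendfilename points at the right file. When AOF is enabled and an AOF exists, Redis loads it at startup in preference to dump.rdb.

RDB (Snapshot): Redis loads the latest snapshot (dump.rdb by default). Ensure the file sits in the directory defined by dir and has the name defined by dbfilename.

Backup restore: Stop Redis, replace the contents of the data directory with the most recent backup, then start Redis again. Replacing files under a running instance risks having them overwritten by the next save.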
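
Before restarting, it helps to confirm which file the node will actually load. The sketch below is a rough illustration under stated assumptions: parse_redis_conf and startup_data_file are hypothetical helpers, the parser only handles simple "directive value" lines, and the path logic assumes the pre‑7.0 layout in which the AOF is a single file inside dir.

```python
import posixpath


def parse_redis_conf(text: str) -> dict:
    """Parse simple `directive value` lines, skipping blanks and # comment lines."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        value = parts[1].strip().strip('"') if len(parts) > 1 else ""
        conf[parts[0].lower()] = value
    return conf


def startup_data_file(conf: dict) -> str:
    """Path of the file the node is expected to load at startup.

    Assumption: pre-7.0 layout, where the AOF is one file inside `dir`.
    """
    data_dir = conf.get("dir", "./")
    if conf.get("appendonly", "no") == "yes":
        # With AOF enabled, the AOF is loaded in preference to the RDB snapshot.
        return posixpath.join(data_dir, conf.get("appendfilename", "appendonly.aof"))
    return posixpath.join(data_dir, conf.get("dbfilename", "dump.rdb"))
```

A common pitfall this catches: restoring a dump.rdb backup into a node that has appendonly yes, where the snapshot may never be read.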

Cluster master‑node failure and data‑loss window

When a master in a Redis Cluster fails, a replica is promoted to master. Because replication is asynchronous, writes that the old master acknowledged but had not yet sent to the replica are lost. The data‑loss window is typically a few hundred milliseconds to a few seconds. Example: if replication lag is 500 ms, up to 500 ms of writes may be lost.

Two scenarios:

Normal asynchronous replication: a few hundred milliseconds to a few seconds of writes lost.

Extreme case (large replication lag or a configuration error): the loss can stretch to many seconds or even minutes.
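
To turn a lag figure into a number of writes, multiply it by the write rate. The sketch below assumes a steady write rate, which is a simplification, and the 2,000 writes per second used in the example is an illustrative figure, not a measurement.

```python
def estimate_lost_writes(writes_per_second: float, replication_lag_ms: float) -> float:
    """Upper bound on writes not yet replicated when the master fails."""
    return writes_per_second * replication_lag_ms / 1000.0


# 500 ms of lag at an assumed 2,000 writes/s
print(estimate_lost_writes(2000, 500))  # 1000.0
```

The same rate with 5 s of lag puts 10,000 writes at risk, which is why bounding the lag matters more than the average case suggests.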

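Configuration parameters that bound the window (named min-slaves-* before Redis 5.0, min-replicas-* since):

min-replicas-max-lag (default 10 s): the maximum time since a replica's last acknowledgement for that replica to still count as healthy.

min-replicas-to-write (default 0, which disables the check): the number of healthy replicas the master needs in order to accept writes. With fewer, the master rejects writes, so setting this to 1 or more is what actually bounds the loss window.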
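
The rule these two parameters express can be modelled as a small predicate. This is a simplified model for reasoning about the settings, not Redis source code: master_accepts_writes is a hypothetical name, and the parameter names follow the min-replicas-* spelling (the current names of the min-slaves-* directives).

```python
from typing import Sequence


def master_accepts_writes(
    replica_lags_s: Sequence[float],
    min_replicas_to_write: int,
    min_replicas_max_lag: float,
) -> bool:
    """True if enough replicas acknowledged recently enough for writes to proceed."""
    if min_replicas_to_write <= 0 or min_replicas_max_lag <= 0:
        # Setting either value to 0 disables the check entirely.
        return True
    healthy = sum(1 for lag in replica_lags_s if lag <= min_replicas_max_lag)
    return healthy >= min_replicas_to_write
```

For example, with one replica required and a 10 s limit, replicas lagging 12 s and 30 s leave the master refusing writes, which trades availability for a bounded loss window.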

Persistence configuration for each node

Every node in a Redis Cluster has its own redis.conf. The two main persistence options are:

RDB – periodic snapshots defined by save directives, e.g. save 60 1000 (take a snapshot if at least 1000 keys changed and 60 s have passed since the last save).

AOF – logs every write command. How often the log is flushed to disk is set with appendfsync (always, everysec, no).

Recommended production settings (excerpt):

# Enable both RDB and AOF
appendonly yes
appendfilename "appendonly-6379.aof"
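# everysec: fsync once per second, so roughly 1 s of writes is at risk
# (comments must sit on their own line; redis.conf has no inline comments)
appendfsync everysec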
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

# RDB snapshot schedule
save 3600 1
save 1800 10
save 60 10000
dbfilename dump-6379.rdb
dir /data/redis/6379
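
How the three save lines above interact is easier to see in code. The following is a simplified model, assuming a snapshot triggers when any single rule is satisfied. should_snapshot and SAVE_RULES are names introduced here, and the real server also checks other conditions, such as no background save already running.

```python
from typing import Iterable, Tuple

# (seconds, changes) pairs taken from the save lines in the excerpt
SAVE_RULES = [(3600, 1), (1800, 10), (60, 10000)]


def should_snapshot(
    rules: Iterable[Tuple[int, int]],
    seconds_since_last_save: float,
    changes_since_last_save: int,
) -> bool:
    """True if any rule has both its change count and its time threshold met."""
    return any(
        changes_since_last_save >= changes and seconds_since_last_save > seconds
        for seconds, changes in rules
    )
```

Under these rules, a node that receives only 5 writes after its last snapshot is not saved again until the 3600 s rule fires, so RDB on its own can lose close to an hour of low‑volume writes.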

Hybrid persistence (RDB + AOF)

Redis 4.0+ supports aof-use-rdb-preamble yes (the default since Redis 5.0). When an AOF rewrite runs, the new file starts with an RDB snapshot followed by incremental AOF commands. This combines the fast recovery of RDB with the durability of AOF.
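
One way to check whether a rewritten AOF actually carries the RDB preamble is to look at its first bytes: RDB data begins with the ASCII magic string REDIS, while a plain AOF begins with a RESP command, which starts with *. The sketch below assumes the single‑file AOF layout used before Redis 7.0 (from 7.0 the AOF is split into several files, so the check would apply to the base file instead), and the function names are hypothetical.

```python
RDB_MAGIC = b"REDIS"


def starts_with_rdb_magic(head: bytes) -> bool:
    """True if the given leading bytes begin with the RDB magic string."""
    return head.startswith(RDB_MAGIC)


def aof_has_rdb_preamble(path: str) -> bool:
    """Read the first bytes of an AOF file and test for the RDB preamble."""
    with open(path, "rb") as f:
        return starts_with_rdb_magic(f.read(len(RDB_MAGIC)))
```

A file that fails this check after a rewrite suggests the preamble option was off when the rewrite ran.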

Key operational recommendations

Never run a single‑node Redis in production; use Sentinel or Cluster for high availability.

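Set a memory limit (maxmemory) and an eviction policy (maxmemory-policy) that matches the workload: noeviction when the node holds data that must not be silently dropped, or an LRU/LFU policy such as allkeys-lru when it is a pure cache.

Monitor replication lag (the per‑replica offset and lag fields in the output of INFO replication) and configure min-replicas-max-lag, called min-slaves-max-lag before Redis 5.0, to bound data‑loss windows.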

Regularly back up the dir directory to an off‑site location.

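During an AOF rewrite, Redis forks a child process while the parent continues serving requests. In versions before 7.0, new writes are buffered in both the AOF buffer and the rewrite buffer so that the rewritten file ends up consistent; Redis 7.0's multi‑part AOF writes them to a new incremental file instead.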
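
The replication‑lag check recommended above can be automated by parsing the text returned by INFO replication on the master. The sketch below is a minimal parser: replica_lags is a hypothetical helper, and SAMPLE_INFO is a made‑up example that follows the field layout of that command's output, not data captured from a real node.

```python
import re

# Illustrative sample only; the address and offsets are invented.
SAMPLE_INFO = """# Replication
role:master
connected_slaves:1
slave0:ip=10.0.0.2,port=6379,state=online,offset=1180,lag=1
master_repl_offset:1240
"""


def replica_lags(info_text: str) -> dict:
    """Map each replica entry (slave0, slave1, ...) to its lag and offset gap."""
    fields = {}
    for line in info_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    master_offset = int(fields.get("master_repl_offset", 0))
    result = {}
    for key, value in fields.items():
        if re.fullmatch(r"slave\d+", key):
            attrs = dict(pair.split("=", 1) for pair in value.split(","))
            result[key] = {
                # Seconds since the replica last acknowledged the master.
                "lag_seconds": int(attrs["lag"]),
                # Bytes of the replication stream the replica has not confirmed.
                "offset_gap_bytes": master_offset - int(attrs["offset"]),
            }
    return result
```

An alert on lag_seconds approaching the configured maximum lag gives early warning before the master starts rejecting writes.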

Choosing a persistence strategy

Summary:

RDB – small files, fast restart, possible loss of minutes of data.

AOF – near‑real‑time durability (about 1 s of loss with appendfsync everysec), larger files, some write overhead from fsync.

Hybrid – combines fast restart and ≤1 s loss, recommended for production.

For most production workloads, enable both RDB and AOF, set aof-use-rdb-preamble yes and appendfsync everysec, and choose save intervals that match how much snapshot staleness you can tolerate.
