How to Diagnose and Prevent Redis Data Loss: Real-World Cases and Best Practices
This article examines why Redis data can disappear, walks through a real incident in which 90,000 keys vanished, and provides concrete monitoring, diagnostic, and preventive measures for avoiding data loss in both cache-only and storage-critical deployments.
Background
Redis is often used as a pure cache in front of a primary storage system such as MySQL or HBase; when a key is missing, the application falls back to the primary store and repopulates the cache. In that setup a small amount of data loss usually goes unnoticed, but as Redis is increasingly used for persistent storage, losing data can be catastrophic.
Incident Story: 90,000 Keys Disappeared
A developer (RD) noticed that over 90,000 keys with the prefix t_list vanished overnight while other keys remained. The DBA confirmed no maintenance operations had been performed and began investigating.
Key observations during troubleshooting:
Missing keys had no TTL; expired_keys metric was zero.
Memory usage (used_memory_pct) was well below 100%; evicted_keys remained zero.
Key count dropped sharply but never reached zero, ruling out FLUSHALL / FLUSHDB.
Command statistics showed a sudden surge in DEL operations starting at 22:01, reaching dozens per second.
Slowlog revealed a KEYS t_list* command executed at 22:01, matching the prefix of the missing keys and likely followed by bulk deletion.
The investigation concluded that a manual or programmatic bulk delete caused the loss.
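The elimination process above can be sketched as a comparison of two counter snapshots taken before and after the drop. This is a minimal illustration, not the DBA's actual tooling: the function name and the flattened snapshot keys (del_calls, flushall_calls, flushdb_calls) are assumptions made for the example, standing in for the calls values of cmdstat_del, cmdstat_flushall, and cmdstat_flushdb.

```python
from typing import Dict, List


def explain_key_drop(before: Dict[str, int], after: Dict[str, int]) -> List[str]:
    """List plausible causes of a key-count drop from two counter snapshots.

    The snapshot layout is hypothetical: expired_keys and evicted_keys mirror
    the INFO stats fields, while *_calls are flattened command counters.
    """
    def delta(name: str) -> int:
        # Counters only grow, so a positive delta means activity in the window.
        return after.get(name, 0) - before.get(name, 0)

    causes: List[str] = []
    if delta("expired_keys") > 0:
        causes.append("expiration")
    if delta("evicted_keys") > 0:
        causes.append("eviction")
    if delta("flushall_calls") > 0 or delta("flushdb_calls") > 0:
        causes.append("flush")
    if delta("del_calls") > 0:
        causes.append("explicit DEL")
    return causes or ["unknown"]
```

With snapshots shaped like the incident (no expirations, no evictions, no flushes, a jump in DEL calls), the function returns only "explicit DEL", which is the same conclusion reached by hand.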
Impact of Data Loss
When Redis stores critical data, loss is unacceptable, yet its persistence mechanisms (RDB snapshots and the AOF) do not provide the point-in-time recovery guarantees of a relational database. In cache-only scenarios, massive loss sends the missed reads straight to the primary store, which can overload it and trigger application-wide failures.
Common Causes of Redis Data Loss
Program bugs or human error (e.g., accidental DEL, FLUSHALL, FLUSHDB).
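Excessive client buffer memory that pushes total usage up to maxmemory and triggers LRU eviction.
Automatic restart of a failed primary node: if it comes back with an empty dataset, replicas perform a full resynchronization against it and their data is wiped.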
Network partitions that create short windows where writes are lost.
Inconsistent master‑slave replication resulting in data loss after failover.
Mass expiration and eviction of keys, which may be mistaken for loss.
Prevention and Monitoring Strategies
1. Guard Dangerous Commands
Rename or disable commands such as KEYS, FLUSHALL, FLUSHDB in production.
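One way to check that this guard is actually in place is to audit the configuration file for rename-command lines. The sketch below assumes the directive form `rename-command NAME NEWNAME`, where an empty string ("") as the new name disables the command; the function name and status labels are invented for illustration.

```python
from typing import Dict, Iterable

DANGEROUS = ("KEYS", "FLUSHALL", "FLUSHDB")


def audit_renames(conf_text: str, commands: Iterable[str] = DANGEROUS) -> Dict[str, str]:
    """Report each command as 'disabled', 'renamed', or 'exposed'."""
    status = {c.upper(): "exposed" for c in commands}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) == 3 and parts[0].lower() == "rename-command":
            name, new_name = parts[1].upper(), parts[2]
            if name in status:
                # An empty quoted string as the new name disables the command.
                status[name] = "disabled" if new_name in ('""', "''") else "renamed"
    return status
```

Any command still reported as "exposed" can be run by every client that connects, which is how a stray KEYS followed by bulk DEL becomes possible.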
2. Essential Monitoring Metrics
Current key count (DBSIZE or the keyspace section of INFO), with historical graphs to spot sudden drops.
Command execution counters: cmdstat_del, cmdstat_flushall, cmdstat_flushdb.
Memory usage (used_memory), client buffer sizes (client_longest_output_list, client_biggest_input_buf; Redis 5.0 and later report these as client_recent_max_output_buffer and client_recent_max_input_buffer), and LRU eviction count (evicted_keys).
Expired key count (expired_keys) to differentiate true loss from normal expiration.
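Spotting a sudden drop in the key-count history can be automated with a simple ratio check between consecutive samples. The sketch below is only an illustration: the function name, the (timestamp, key count) sample layout, and the 20% default threshold are assumptions, and the threshold would need tuning for workloads where large expirations are normal.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[str, int]  # (timestamp label, key count), a hypothetical layout


def sudden_drops(samples: Sequence[Sample], max_drop_ratio: float = 0.2) -> List[Tuple[str, int, int]]:
    """Return (timestamp, previous_count, current_count) for each sharp drop."""
    alerts: List[Tuple[str, int, int]] = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        # Compare each sample with the one before it; ignore empty baselines.
        if prev > 0 and (prev - cur) / prev > max_drop_ratio:
            alerts.append((ts, prev, cur))
    return alerts
```

An alert from this check is a prompt to look at expired_keys, evicted_keys, and the cmdstat counters for the same window, not a diagnosis in itself.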
3. Client Buffer Management
Allocate sufficient maxmemory and monitor client output buffers. Set reasonable limits with client-output-buffer-limit (e.g., 10 MB for normal clients, 1 GB for slaves), and raise the limit for the slave class when replication stalls because the replica connection keeps being dropped.
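Output buffer usage per connection is visible in CLIENT LIST, where the omem field reports output buffer memory and the S flag marks a replica connection. The sketch below flags clients above the example limits from the text; it assumes each client is a dict of string fields (the shape redis-py's client_list() returns), and the function and constant names are made up for illustration.

```python
from typing import Dict, List

NORMAL_LIMIT = 10 * 1024 * 1024      # 10 MB, the example limit for normal clients
REPLICA_LIMIT = 1024 * 1024 * 1024   # 1 GB, the example limit for slaves


def clients_over_limit(clients: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return clients whose output buffer memory exceeds the limit for their class."""
    flagged: List[Dict[str, str]] = []
    for client in clients:
        # The "S" flag in CLIENT LIST marks a replica connected to this instance.
        is_replica = "S" in client.get("flags", "")
        limit = REPLICA_LIMIT if is_replica else NORMAL_LIMIT
        if int(client.get("omem", 0)) > limit:
            flagged.append(client)
    return flagged
```

Running a check like this on a schedule gives early warning before buffer growth pushes memory usage toward maxmemory.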
4. Avoid Automatic Restarts of Primary Nodes
Disable aggressive auto‑restart policies for master instances; a rapid restart can trigger a full sync that overwrites data with an empty snapshot.
5. Network Partition Awareness
Implement short‑interval health checks and configure Sentinel or cluster failover thresholds to minimize the window where writes may be lost.
6. Replication Consistency Checks
Regularly verify master‑slave data parity and monitor failover events to ensure no data is dropped during role changes.
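A lightweight first check is to compare replication offsets reported by INFO replication on the master. The sketch below assumes the parsed form in which each slaveN entry is a dict containing an offset (as redis-py's info("replication") produces); the function name is hypothetical. Matching offsets only show that a replica has received the same replication stream, so this does not replace a key-level comparison.

```python
from typing import Any, Dict


def replica_offset_gaps(repl_info: Dict[str, Any]) -> Dict[str, int]:
    """Return, per replica, how many bytes it trails the master's replication offset."""
    master_offset = int(repl_info["master_repl_offset"])
    gaps: Dict[str, int] = {}
    for key, value in repl_info.items():
        # Replica entries are named slave0, slave1, ... and hold a dict of fields.
        if key.startswith("slave") and isinstance(value, dict):
            gaps[key] = master_offset - int(value["offset"])
    return gaps
```

A replica with a large or growing gap is a poor failover candidate, because promoting it would drop the writes it has not yet received.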
ITPUB
