How to Diagnose and Prevent Redis Data Loss in Production

This article examines common causes of Redis data loss, walks through a real‑world incident where 90,000 keys vanished, and provides concrete monitoring, configuration, and operational safeguards to detect and avoid such failures.

ITPUB

Redis is often used as a pure cache in front of a primary storage system such as MySQL or HBase: on a cache miss, the application reads from the backend and repopulates the cache. In that cache-only role a small amount of data loss usually goes unnoticed, but when Redis holds critical data, loss becomes catastrophic.

Incident Overview

In a production incident on December 3rd, 90,000 keys with the prefix t_list disappeared overnight. The developer (RD) noticed the loss and asked the DBA whether any cleanup operation had been run.

Investigation revealed:

No TTL was set for the affected keys, and expired_keys remained zero.

Memory usage (used_memory_pct) was below 100% and evicted_keys stayed at zero, ruling out memory-pressure eviction.

The key count dropped sharply but did not fall to zero, as it would after FLUSHALL / FLUSHDB, which ruled those commands out.

Command statistics showed a sudden surge in DEL operations starting at 22:01, at a rate of dozens per second.

Slowlog entries confirmed a bulk KEYS tlist* followed by mass deletions.

Further monitoring indicated that between 22:00 and 22:40 roughly 90,000 keys were deleted across three shards, which works out to about 12 DEL commands per second on each shard for 40 minutes.
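
A rate like this can be derived from two samples of the DEL counter in INFO commandstats. The following is a minimal sketch, assuming the INFO text has already been fetched (for example with redis-cli INFO commandstats). The function names and the sample counters are illustrative and are not taken from the incident.

```python
def del_calls(info_text):
    """Return the cumulative DEL call count from INFO commandstats text."""
    for line in info_text.splitlines():
        line = line.strip()
        if line.startswith("cmdstat_del:"):
            # Line format: cmdstat_del:calls=N,usec=N,usec_per_call=F[,...]
            fields = dict(kv.split("=", 1) for kv in line.split(":", 1)[1].split(","))
            return int(fields["calls"])
    return 0  # no DEL recorded since the last restart or CONFIG RESETSTAT


def del_rate(before_text, after_text, interval_seconds):
    """Average DEL commands per second between two samples."""
    return (del_calls(after_text) - del_calls(before_text)) / interval_seconds


# Hypothetical samples taken 60 seconds apart on one shard.
sample_1 = "# Commandstats\ncmdstat_del:calls=1000,usec=5000,usec_per_call=5.00\n"
sample_2 = "# Commandstats\ncmdstat_del:calls=1750,usec=8750,usec_per_call=5.00\n"
rate = del_rate(sample_1, sample_2, 60)

# Cross-check of the incident figures: 90,000 keys, 40 minutes, 3 shards.
per_shard_rate = 90000 / (40 * 60) / 3
print(rate, per_shard_rate)
```

Sampling the counter every minute and alerting on the difference is usually enough to catch a bulk deletion while it is still running.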

Impact of Data Loss

When Redis is used as a persistent store, loss is unacceptable, because its persistence mechanisms (RDB snapshots, AOF) cannot guarantee point-in-time recovery the way relational databases can. In cache-only use cases, massive loss can still overload the primary store and cause an application-wide "snowball" effect.

Common Causes of Redis Data Loss

Application bugs or human error (e.g., accidental DEL, FLUSHALL).

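Excessive client output buffer usage: the buffers count toward used memory, so they can push the instance over maxmemory and trigger key eviction (LRU or whichever policy is configured).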

Automatic restart of a master after failure: the restarted instance comes up empty or loads a stale snapshot, and the slaves then resynchronize from it and discard their newer data.

Network partitions that create short windows where writes are lost during failover.

Inconsistent master‑slave replication: replication is asynchronous, so a slave promoted during failover may be missing the master's most recent writes.

Mass expiration of keys during cleanup, which may be mistaken for loss.

Prevention and Monitoring Strategies

Protect Dangerous Commands

Rename or disable high‑risk commands such as KEYS, FLUSHALL, FLUSHDB in production.
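
In redis.conf, the rename-command directive renames a command, and renaming it to an empty string disables it. As a hedged illustration, the sketch below scans a configuration file's text and reports which high-risk commands are still unprotected. The function name and the sample configuration are hypothetical.

```python
import shlex

HIGH_RISK = ("KEYS", "FLUSHALL", "FLUSHDB")


def unprotected_commands(conf_text, commands=HIGH_RISK):
    """Return the commands that have no rename-command directive in conf_text."""
    renamed = set()
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = shlex.split(line)  # keeps "" as an empty argument
        if len(parts) == 3 and parts[0].lower() == "rename-command":
            renamed.add(parts[1].upper())
    return [c for c in commands if c not in renamed]


# Hypothetical production redis.conf fragment.
example_conf = '''
# dangerous commands
rename-command FLUSHALL ""
rename-command FLUSHDB ""
maxmemory 4gb
'''
still_exposed = unprotected_commands(example_conf)
print(still_exposed)
```

A check like this can run in a deployment pipeline so that an instance never reaches production with KEYS or the flush commands exposed under their default names.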

Key Monitoring Metrics

Track total key count (the DBSIZE command, or the per-database keys field in INFO keyspace) to spot sudden drops.

Monitor command statistics: cmdstat_del, cmdstat_flushall, cmdstat_flushdb.

Watch memory usage (used_memory, used_memory_pct) and LRU eviction count (evicted_keys).

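Observe client buffer metrics: client_longest_output_list, client_biggest_input_buf (reported as client_recent_max_output_buffer and client_recent_max_input_buffer in Redis 5.0 and later).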

Enable slowlog to capture bulk operations like KEYS followed by mass DEL.
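
The key-count check in the list above can be sketched as follows. This is a minimal example, assuming two snapshots of INFO keyspace text collected by an existing monitoring agent. The threshold value, function names, and sample data are assumptions for illustration.

```python
import re

# Matches lines such as: db0:keys=200000,expires=0,avg_ttl=0
DB_LINE = re.compile(r"^db\d+:keys=(\d+),")


def total_keys(info_text):
    """Sum the keys field over every dbN line of INFO keyspace text."""
    total = 0
    for line in info_text.splitlines():
        match = DB_LINE.match(line.strip())
        if match:
            total += int(match.group(1))
    return total


def key_drop_ratio(previous, current):
    """Fraction of keys lost between two samples (0.0 if the count grew)."""
    if previous == 0:
        return 0.0
    return max(0.0, (previous - current) / previous)


ALERT_THRESHOLD = 0.2  # hypothetical: alert when 20% of keys vanish between samples

before = "# Keyspace\ndb0:keys=200000,expires=0,avg_ttl=0\ndb1:keys=100000,expires=10,avg_ttl=5000\n"
after = "# Keyspace\ndb0:keys=110000,expires=0,avg_ttl=0\ndb1:keys=100000,expires=10,avg_ttl=5000\n"

drop = key_drop_ratio(total_keys(before), total_keys(after))
should_alert = drop >= ALERT_THRESHOLD
print(total_keys(before), total_keys(after), should_alert)
```

A relative threshold is used here rather than an absolute one so that the same rule can be applied to shards of different sizes.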

Client Buffer Management

Allocate sufficient maxmemory and reserve extra headroom for client buffers. Set reasonable output buffer limits (e.g., 10 MB for normal clients, 1 GB for slaves). Adjust slave client-output-buffer-limit when replication traffic spikes.
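
To audit the limits currently in force, the value returned by CONFIG GET client-output-buffer-limit can be parsed into one entry per client class. The sketch below assumes the value is a single string of groups in the form class, hard limit, soft limit, soft seconds, with sizes in bytes. The helper names are hypothetical, and the sample string is only an example.

```python
def parse_output_buffer_limits(value):
    """Parse 'class hard soft seconds ...' groups into {class: (hard, soft, seconds)}."""
    tokens = value.split()
    if len(tokens) % 4 != 0:
        raise ValueError("expected groups of: class hard soft seconds")
    limits = {}
    for i in range(0, len(tokens), 4):
        name, hard, soft, seconds = tokens[i:i + 4]
        limits[name] = (int(hard), int(soft), int(seconds))
    return limits


def unlimited_classes(limits):
    """Client classes whose hard limit is 0, meaning no limit is enforced."""
    return sorted(name for name, (hard, _, _) in limits.items() if hard == 0)


# Example value; newer Redis versions report the class as "replica" instead of "slave".
raw = "normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60"
limits = parse_output_buffer_limits(raw)
unbounded = unlimited_classes(limits)
print(unbounded)
```

In this example the normal class is unbounded, which is the situation the 10 MB suggestion above is meant to avoid.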

Master Restart Policies

Avoid automatic restart of Redis masters without proper persistence configuration. Use external watchdogs that verify a valid snapshot before relaunching, or disable auto‑restart for stateful services.
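
A watchdog's pre-restart check might look like the sketch below. It only confirms that the snapshot file exists, was written recently, and begins with the RDB header (the bytes REDIS followed by a four-digit version). It does not validate the file's contents, for which the redis-check-rdb tool shipped with Redis is the better choice. The function name, file names, and age limit are assumptions.

```python
import os
import tempfile
import time


def snapshot_looks_valid(path, max_age_seconds):
    """Cheap sanity check before relaunching a master from an RDB file."""
    try:
        stat = os.stat(path)
        with open(path, "rb") as f:
            header = f.read(9)
    except OSError:
        return False  # missing or unreadable file
    # RDB files start with b"REDIS" plus a 4-digit format version.
    is_rdb = len(header) == 9 and header[:5] == b"REDIS" and header[5:].isdigit()
    is_recent = (time.time() - stat.st_mtime) <= max_age_seconds
    return is_rdb and is_recent


# Demonstration with throwaway files (not real snapshots).
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "dump.rdb")
    with open(good, "wb") as f:
        f.write(b"REDIS0009" + b"\x00" * 16)
    bad = os.path.join(tmp, "truncated.rdb")
    with open(bad, "wb") as f:
        f.write(b"RED")
    good_ok = snapshot_looks_valid(good, max_age_seconds=3600)
    bad_ok = snapshot_looks_valid(bad, max_age_seconds=3600)
    missing_ok = snapshot_looks_valid(os.path.join(tmp, "absent.rdb"), 3600)

print(good_ok, bad_ok, missing_ok)
```

If the check fails, the safer action is to leave the master down and let a slave be promoted rather than restart an empty instance.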

Network Partition Handling

Configure Sentinel or cluster failover timeouts to minimize the window where writes may be lost. Monitor split‑brain scenarios and ensure logs are aggregated (e.g., via ELK) for rapid alerting.

Replication Consistency

Regularly verify master‑slave data consistency and test failover procedures to ensure no data loss occurs when a slave is promoted.
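
One inexpensive indicator is the replication offset gap reported by INFO replication on the master. The sketch below parses that text and returns how many bytes each slave is behind. Equal offsets show that a slave has received the same replication stream, not that every key is identical, so this complements a real data comparison rather than replacing it. The function name and the sample addresses are illustrative.

```python
import re

# Matches lines such as: slave0:ip=10.0.0.2,port=6379,state=online,offset=1000,lag=0
REPLICA_LINE = re.compile(r"^slave\d+:(.*)$")


def replica_lag_bytes(info_text):
    """Map 'ip:port' of each slave to master_repl_offset minus its acknowledged offset."""
    master_offset = None
    replicas = {}
    for line in info_text.splitlines():
        line = line.strip()
        if line.startswith("master_repl_offset:"):
            master_offset = int(line.split(":", 1)[1])
            continue
        match = REPLICA_LINE.match(line)
        if match:
            fields = dict(kv.split("=", 1) for kv in match.group(1).split(","))
            replicas[f'{fields["ip"]}:{fields["port"]}'] = int(fields["offset"])
    if master_offset is None:
        raise ValueError("master_repl_offset not found")
    return {addr: master_offset - offset for addr, offset in replicas.items()}


# Hypothetical INFO replication output from a master with two slaves.
sample = (
    "# Replication\n"
    "role:master\n"
    "connected_slaves:2\n"
    "slave0:ip=10.0.0.2,port=6379,state=online,offset=1000,lag=0\n"
    "slave1:ip=10.0.0.3,port=6379,state=online,offset=900,lag=1\n"
    "master_repl_offset:1000\n"
)
lag = replica_lag_bytes(sample)
print(lag)
```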

Expiration Cleanup Awareness

Distinguish genuine loss from normal expiration by correlating expired_keys growth with total key count reduction.
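
The correlation can be expressed as a small heuristic: compare the drop in key count with the growth of expired_keys and evicted_keys over the same interval. The sketch below is an assumption-laden illustration, not a definitive classifier, since concurrent writes and deletes also move the key count. The function name, tolerance, and sample numbers are hypothetical.

```python
def explain_key_drop(keys_before, keys_after, expired_delta, evicted_delta, tolerance=0.1):
    """Classify a key-count drop using counter deltas from the same interval.

    expired_delta / evicted_delta are the increases of the expired_keys and
    evicted_keys counters between the two samples.
    """
    dropped = keys_before - keys_after
    if dropped <= 0:
        return "no drop"
    accounted = expired_delta + evicted_delta
    # Allow some slack because keys are also written and deleted meanwhile.
    if accounted >= dropped * (1 - tolerance):
        return "expiration" if expired_delta >= evicted_delta else "eviction"
    return "unexplained"


# Counters flat while keys vanish, as in the incident: needs investigation.
incident_like = explain_key_drop(300000, 210000, expired_delta=0, evicted_delta=0)
# Drop almost fully matched by expired_keys growth: normal TTL cleanup.
ttl_cleanup = explain_key_drop(1000, 400, expired_delta=590, evicted_delta=0)
print(incident_like, ttl_cleanup)
```

An "unexplained" result is the signal to look at cmdstat_del and the slowlog, as in the investigation described earlier.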

Conclusion

Redis data loss can stem from operational mistakes, misconfiguration, or infrastructure failures. Fine‑grained monitoring, disciplined command control, proper memory planning, and cautious restart policies together form a robust defense against accidental data loss.
