
Why Did Redis Crash at 100% Memory? Deep Dive into Buffer Overflows and Mitigation

An incident where massive key traffic pushed Redis memory usage to 100% revealed that buffer memory, not the dataset itself, exhausted the instance, leading to timeouts and crashes; the analysis explains the root causes, shows detailed INFO MEMORY output, and provides practical mitigation guidelines.

ITPUB

The article analyzes a real‑world Redis outage in which a sudden surge of large‑key traffic caused the instance’s memory usage to reach 100%, making Redis unavailable despite the actual dataset occupying only a few megabytes.

Memory Usage Snapshot

Running INFO MEMORY on the affected instance produced the output below. used_memory (about 1,023 MB) is within 1 MB of the configured maxmemory (1 GB), while used_memory_dataset is only 23.9 MB. Nearly everything else is used_memory_overhead (about 1 GB), and mem_clients_normal shows that almost all of that overhead is buffer memory held by normal client connections.

# Memory
used_memory:1072693248
used_memory_human:1023.99M
used_memory_rss:1090519040
used_memory_rss_human:1.02G
used_memory_peak:1072693248
used_memory_peak_human:1023.99M
used_memory_peak_perc:100.00%
used_memory_overhead:1048576000
used_memory_startup:1024000
used_memory_dataset:23929848
used_memory_dataset_perc:2.23%
allocator_allocated:1072693248
allocator_active:1090519040
allocator_resident:1090519040
total_system_memory:16777216000
total_system_memory_human:16.00G
used_memory_lua:37888
used_memory_lua_human:37.89K
used_memory_scripts:1024000
used_memory_scripts_human:1.00M
maxmemory:1073741824
maxmemory_human:1.00G
maxmemory_policy:noeviction
allocator_frag_ratio:1.02
allocator_frag_bytes:17825792
allocator_rss_ratio:1.00
allocator_rss_bytes:0
rss_overhead_ratio:1.00
rss_overhead_bytes:0
mem_fragmentation_ratio:1.02
mem_fragmentation_bytes:17825792
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:1048576000
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0
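
As a rough illustration of how such a snapshot can be read programmatically, the sketch below parses INFO MEMORY text and reports selected fields as a share of maxmemory. The function names (parse_info, memory_breakdown) and the trimmed SNAPSHOT string are illustrative choices, not part of any Redis client; only the field names and values come from the output above.

```python
def parse_info(text: str) -> dict:
    """Parse `key:value` lines from INFO output, skipping blanks and '# Section' headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_breakdown(fields: dict) -> dict:
    """Return selected memory fields as a fraction of maxmemory."""
    maxmemory = int(fields["maxmemory"])
    if maxmemory == 0:
        raise ValueError("maxmemory is 0 (unlimited); shares are undefined")
    names = ("used_memory", "used_memory_dataset", "mem_clients_normal")
    return {name: int(fields[name]) / maxmemory for name in names}


# Trimmed copy of the snapshot above (values unchanged).
SNAPSHOT = """\
# Memory
used_memory:1072693248
used_memory_dataset:23929848
mem_clients_normal:1048576000
maxmemory:1073741824
"""

breakdown = memory_breakdown(parse_info(SNAPSHOT))
for name, share in breakdown.items():
    print(f"{name}: {share:.1%} of maxmemory")
```

With the snapshot values, client buffers account for roughly 98% of maxmemory and the dataset for roughly 2%.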

Why Buffers Consumed the Memory

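Redis allocates an output buffer for each client to hold reply data before it is sent over the network. The cap is set by client-output-buffer-limit, which takes a hard limit, a soft limit, and a soft-limit duration for each client class (normal, replica, pubsub). Open-source Redis ships with no limit for normal clients (0 0 0), 256 MB / 64 MB / 60 s for replicas, and 32 MB / 8 MB / 60 s for Pub/Sub clients; the source cites a 32 MB limit for normal clients in this environment. A buffer approaches its cap when large replies are generated faster than the client reads them, especially with big keys or bulk commands. In a clustered setup, up to 300 clients per instance can each consume the full limit, so the theoretical worst case is 300 × 32 MB, more than 9 GB of buffer memory.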
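
The worst-case figure can be checked with a few lines of arithmetic. This is a minimal sketch using the 300-client and 32 MB numbers quoted in the text and the 1 GB maxmemory from the snapshot; the function names are illustrative.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB


def worst_case_output_buffers(clients: int, hard_limit_bytes: int) -> int:
    """Upper bound on output-buffer memory if every client fills its buffer."""
    return clients * hard_limit_bytes


def clients_to_exhaust(maxmemory_bytes: int, hard_limit_bytes: int) -> int:
    """How many clients at the full limit are enough to consume maxmemory."""
    return math.ceil(maxmemory_bytes / hard_limit_bytes)


total = worst_case_output_buffers(clients=300, hard_limit_bytes=32 * MB)
print(f"worst case: {total / GB:.3f} GB")                              # 9.375 GB
print(f"clients to fill 1 GB: {clients_to_exhaust(1 * GB, 32 * MB)}")  # 32
```

On a 1 GB instance, 32 clients with full buffers are already enough to consume the whole quota, far fewer than the 300 connections that are allowed.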

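The input (query) buffer stores incoming command data. When the Redis main thread cannot process commands fast enough (for example, because each SET writes a multi‑megabyte value), unread commands accumulate and the buffer grows. Its ceiling is client-query-buffer-limit (1 GB by default in open-source Redis), and a client that exceeds it is disconnected; on a 1 GB instance, the combined query buffers can therefore exhaust maxmemory before any single client reaches that limit.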

Incident Timeline and Root Causes

1. Natural traffic growth increased outbound bandwidth to ~96 MB/s, pushing the output buffer toward its hard limit.

2. The output buffer overflow caused many client connections to be closed, forcing the application to fall back to database reads.

3. Fallback DB reads generated a burst of SET commands; each SET wrote a ~2 MB value, further saturating inbound bandwidth.

4. The Redis single‑threaded model could not keep up, leading to intermittent blocking and a rapid increase of the input buffer.

5. Both input and output buffers eventually consumed the entire memory quota, leaving almost no space for actual data.

6. Consequently, normal GET / SET operations failed, and the instance appeared completely unusable.
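
The speed of this failure follows from simple rate arithmetic: a buffer grows at the difference between the rate at which data arrives and the rate at which it is drained. The sketch below is a toy model, not a reconstruction of the incident; the 100 MB/s inflow and 80 MB/s drain rates are assumed purely for illustration, and seconds_until_full is a hypothetical helper.

```python
from typing import Optional

MB = 1024 * 1024


def seconds_until_full(free_bytes: int, inflow_bps: float, drain_bps: float) -> Optional[float]:
    """Time for a growing backlog to consume free_bytes.

    Returns None when the drain rate keeps up and the backlog never fills.
    """
    growth = inflow_bps - drain_bps
    if growth <= 0:
        return None
    return free_bytes / growth


# ASSUMED rates, for illustration only: data arrives at 100 MB/s while
# Redis drains 80 MB/s, so the backlog grows by 20 MB/s.
eta = seconds_until_full(free_bytes=1024 * MB, inflow_bps=100 * MB, drain_bps=80 * MB)
print(f"1 GB of headroom lasts about {eta:.1f} s")
```

Under these assumed rates a 1 GB quota is gone in under a minute, which is why buffer-driven exhaustion tends to look sudden on a dashboard.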

Why Eviction Did Not Help

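Eviction could not relieve this pressure. The source describes volatile‑lru as the default policy, but open-source Redis defaults to noeviction, which is also what the snapshot shows (maxmemory_policy:noeviction); under noeviction Redis evicts nothing and rejects writes once maxmemory is reached. Even volatile‑lru only removes keys that have an expiration time. Here the memory was held by client buffers rather than stored keys, so evicting the entire 23.9 MB dataset would have freed only about 2% of the quota, and the instance still ran out of memory.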
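
To make the point concrete, the sketch below estimates how much memory each policy family could reclaim at most. It is a simplified model, not how Redis implements eviction: reclaimable_bytes is a hypothetical helper, and the split of the dataset into keys with and without a TTL is assumed, since the source does not report it.

```python
def reclaimable_bytes(policy: str, keys) -> int:
    """Upper bound on bytes eviction could free.

    keys is a sequence of (size_bytes, has_ttl) pairs.
    """
    if policy == "noeviction":
        return 0
    if policy.startswith("volatile-"):
        # volatile-* policies only consider keys that carry an expiration.
        return sum(size for size, has_ttl in keys if has_ttl)
    if policy.startswith("allkeys-"):
        return sum(size for size, _ in keys)
    raise ValueError(f"unknown maxmemory-policy: {policy}")


MAXMEMORY = 1073741824  # from the snapshot

# ASSUMED split of the 23,929,848-byte dataset into TTL and non-TTL keys.
dataset = [(20_000_000, True), (3_929_848, False)]

for policy in ("noeviction", "volatile-lru", "allkeys-lru"):
    freed = reclaimable_bytes(policy, dataset)
    print(f"{policy}: at most {freed / MAXMEMORY:.1%} of maxmemory")
```

Even the most aggressive policy frees only the dataset, which is a small fraction of what the client buffers hold.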

Mitigation and Best‑Practice Recommendations

Configure client-output-buffer-limit according to workload; consider lowering the limits for normal clients or increasing them only for trusted connections.
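
The directive has the form client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds>. The sketch below builds such a line and validates its parts; output_buffer_directive is a hypothetical helper, and the 32 MB / 8 MB / 60 s values are example numbers to be tuned per workload, not a recommendation from the source.

```python
MB = 1024 * 1024

# Client classes accepted by the directive (older releases name the
# replica class "slave").
VALID_CLASSES = ("normal", "replica", "pubsub")


def output_buffer_directive(client_class: str, hard_bytes: int,
                            soft_bytes: int, soft_seconds: int) -> str:
    """Build a redis.conf line for client-output-buffer-limit."""
    if client_class not in VALID_CLASSES:
        raise ValueError(f"unknown client class: {client_class}")
    if min(hard_bytes, soft_bytes, soft_seconds) < 0:
        raise ValueError("limits must be non-negative")
    return (f"client-output-buffer-limit {client_class} "
            f"{hard_bytes} {soft_bytes} {soft_seconds}")


# Example values only: disconnect a normal client once its buffer reaches
# 32 MB, or once it stays above 8 MB for 60 consecutive seconds.
line = output_buffer_directive("normal", 32 * MB, 8 * MB, 60)
print(line)  # client-output-buffer-limit normal 33554432 8388608 60
```

The same three values can be applied at runtime with CONFIG SET client-output-buffer-limit "normal 33554432 8388608 60".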

Monitor used_memory_overhead and the ratio of mem_clients_normal to total memory; set alerts on sudden spikes.
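
One possible shape for such an alert is sketched below: it flags samples where client buffers exceed a fixed share of maxmemory, and samples where that share jumps sharply between two polls. The thresholds (50% and 20 percentage points) and the sample values are assumptions for illustration; buffer_alerts is a hypothetical helper.

```python
MB = 1024 * 1024
GB = 1024 * MB


def buffer_alerts(samples, max_ratio=0.5, max_jump=0.2):
    """Scan (mem_clients_normal, maxmemory) samples in time order.

    Returns (index, kind, ratio) tuples, where kind is "high" when the
    ratio exceeds max_ratio and "spike" when it rose by more than
    max_jump since the previous sample.
    """
    alerts = []
    previous = None
    for index, (clients_bytes, maxmemory) in enumerate(samples):
        if maxmemory <= 0:
            raise ValueError("maxmemory must be positive to compute a ratio")
        ratio = clients_bytes / maxmemory
        if ratio > max_ratio:
            alerts.append((index, "high", ratio))
        if previous is not None and ratio - previous > max_jump:
            alerts.append((index, "spike", ratio))
        previous = ratio
    return alerts


# ASSUMED polling history leading up to a state like the snapshot.
history = [(50 * MB, GB), (60 * MB, GB), (400 * MB, GB), (1000 * MB, GB)]
for index, kind, ratio in buffer_alerts(history):
    print(f"sample {index}: {kind} ({ratio:.1%} of maxmemory)")
```

The spike check fires one sample before the absolute threshold does, which is the early warning this recommendation is after.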

Avoid large string values (>10 KB); split big collections into multiple smaller keys, keeping the element count of each key below 1,000.
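
One common way to split a big collection is to shard it into buckets by hashing the field name. The sketch below is one such scheme under stated assumptions: the "<base>:<index>" key naming, the CRC32 hash, and the 50% target fill (headroom for uneven distribution) are illustrative choices, not something the source prescribes.

```python
import math
import zlib


def bucket_count(expected_fields: int, max_per_bucket: int = 1000,
                 fill: float = 0.5) -> int:
    """Number of buckets so each holds about fill * max_per_bucket fields."""
    return max(1, math.ceil(expected_fields / (max_per_bucket * fill)))


def bucket_key(base_key: str, field: str, buckets: int) -> str:
    """Map a field to one of `buckets` smaller keys, deterministically."""
    index = zlib.crc32(field.encode("utf-8")) % buckets
    return f"{base_key}:{index}"


buckets = bucket_count(expected_fields=5000)
print(buckets)                                   # 10
print(bucket_key("cart", "item-42", buckets))    # one of cart:0 .. cart:9
```

A write such as HSET cart field value then becomes HSET on bucket_key("cart", field, buckets), and reads of a single field use the same mapping.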

Serialize large objects with a compact serializer (e.g., Protostuff, Kryo, or FST) and compress them before storing.
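
The serializers named here are Java libraries. As a language-neutral illustration of the same idea, the sketch below swaps in Python's standard json and zlib modules; pack and unpack are hypothetical helper names, and the sample payload is made up.

```python
import json
import zlib


def pack(obj) -> bytes:
    """Serialize to compact JSON, then compress with zlib."""
    raw = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, level=6)


def unpack(blob: bytes):
    """Reverse of pack: decompress, then parse JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Made-up, highly repetitive payload to show the effect.
order = {"items": [{"sku": "A-100", "qty": 1}] * 500}
raw_size = len(json.dumps(order, separators=(",", ":")).encode("utf-8"))
blob = pack(order)
print(f"raw {raw_size} bytes, packed {len(blob)} bytes")
```

The packed bytes are what would be written with SET. Compression costs client CPU, so it pays off mainly for values that are both large and repetitive.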

Use SCAN instead of KEYS * for key enumeration; enable slow‑log and real‑time top‑key monitoring.
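
SCAN returns a cursor plus a small batch of keys, and the caller repeats the call until the cursor comes back as 0, so no single call blocks the server for long. The sketch below shows that loop against any client exposing a redis-py style scan(cursor, match, count) method. StubClient is a hypothetical in-memory stand-in so the example runs without a server; it does not reproduce real SCAN ordering.

```python
import fnmatch


def iter_keys(client, match="*", count=100):
    """Yield keys batch by batch using cursor-based SCAN."""
    cursor = 0
    while True:
        cursor, keys = client.scan(cursor=cursor, match=match, count=count)
        yield from keys
        if cursor == 0:  # a returned cursor of 0 means the scan is complete
            break


class StubClient:
    """Hypothetical stand-in for a Redis client, for demonstration only."""

    def __init__(self, keys):
        self._keys = sorted(keys)

    def scan(self, cursor=0, match=None, count=10):
        batch = self._keys[cursor:cursor + count]
        next_cursor = cursor + count
        if next_cursor >= len(self._keys):
            next_cursor = 0
        if match is not None:
            batch = [k for k in batch if fnmatch.fnmatchcase(k, match)]
        return next_cursor, batch


stub = StubClient([f"user:{i}" for i in range(25)] + [f"session:{i}" for i in range(5)])
user_keys = list(iter_keys(stub, match="user:*", count=10))
print(len(user_keys))  # 25
```

redis-py also ships a scan_iter helper that wraps the same loop; the explicit version is shown here to make the cursor handling visible.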

Prefer JedisPool or JedisCluster in Java clients; set generous connection timeouts and implement retry back‑off to avoid thundering‑herd retries.
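
The recommendation names Java clients, but retry back-off is language-neutral. The sketch below shows exponential back-off with full jitter in Python; the base delay, cap, attempt count, and helper names are assumptions chosen for illustration, and the sleep function is injectable so the logic can be exercised without waiting.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 0.05, cap: float = 2.0) -> float:
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(operation, max_attempts=5, base=0.05, cap=2.0,
                    sleep=time.sleep,
                    retry_on=(ConnectionError, TimeoutError)):
    """Run operation(), retrying on transient errors with jittered back-off."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(backoff_delay(attempt, base, cap))


# Demonstration with a made-up operation that fails twice, then succeeds.
calls = {"n": 0}


def flaky_get():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connection closed by server")
    return "value"


waits = []
result = call_with_retry(flaky_get, sleep=waits.append)
print(result, len(waits))  # value 2
```

Randomizing each delay spreads reconnecting clients out over time, so a mass disconnect like the one in this incident does not turn into a synchronized retry wave.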

Periodically run Redis diagnostics (slow‑log analysis, top‑key stats, cache analysis) and enable audit logs only when needed, as they can add 5‑15 % overhead.

Operational Checklist

Deploy Redis close to the application (same VPC) to minimize network latency.

Allocate a dedicated instance per business domain to avoid mixed workloads.

Set an appropriate eviction policy (e.g., volatile‑lru) together with key expirations, and avoid noeviction for cache use‑cases.

Enable monitoring of CPU, memory, and bandwidth; configure alerts for thresholds.

Regularly review and prune big keys using the CloudDBA cache analysis tool.

By understanding that the real bottleneck was buffer memory rather than dataset size, operators can adjust configuration, limit large keys, and monitor buffer usage to prevent similar outages.
