Why Did Redis Crash at 100% Memory? Uncovering Buffer Overflows and Best Practices
A detailed post‑mortem of a Redis outage shows how a traffic surge filled bandwidth, caused massive input and output buffers to consume almost all memory, and led to timeouts, while offering step‑by‑step analysis, memory diagnostics, and practical recommendations to prevent similar buffer‑overflow failures.
Incident Overview
During a traffic spike, a Redis instance reached 100% memory usage, causing timeouts and crashes. The contributing factors were requests against large keys, bandwidth saturation, and client input/output buffers exhausting the instance's memory.
Memory Analysis
INFO MEMORY output showed used_memory ≈1.02 GB, maxmemory 1 GB, and used_memory_overhead ≈1 GB, indicating most memory was consumed by client buffers rather than actual data (dataset only ~24 MB).
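The split between dataset and overhead can be checked directly from the INFO MEMORY text. The following is a minimal sketch, assuming the output has been captured as a string; the helper names and the sample numbers are illustrative (chosen to resemble the figures above), not the real output from the incident.

```python
def parse_info(text):
    """Parse the 'key:value' lines of a Redis INFO section into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def memory_breakdown(info):
    """Split used_memory into overhead and dataset, and compute the overhead share."""
    used = int(info["used_memory"])
    overhead = int(info["used_memory_overhead"])
    return {
        "used_bytes": used,
        "overhead_bytes": overhead,
        "dataset_bytes": used - overhead,
        "overhead_ratio": overhead / used,
    }


# Illustrative values only: about 1.02 GB used, about 24 MB of actual data.
SAMPLE_INFO = """\
# Memory
used_memory:1095216660
used_memory_overhead:1070050836
maxmemory:1073741824
"""

report = memory_breakdown(parse_info(SAMPLE_INFO))
print(f"dataset: {report['dataset_bytes'] / 2**20:.1f} MiB")
print(f"overhead share of used_memory: {report['overhead_ratio']:.1%}")
```

An overhead share close to 100% with a small dataset points at buffers and bookkeeping rather than stored keys, which is the pattern described in this incident.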
Buffer Overflow Mechanism
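Redis keeps an input (query) buffer and an output buffer for every client connection. The limit cited here, 32 MB hard and 8 MB soft sustained for 60 s, matches the open-source Redis default for Pub/Sub clients; normal clients default to no limit at all (0 0 0). Even if every connection is held to a 32 MB hard limit, a large number of clients (e.g., 300) can theoretically consume 300 × 32 MB ≈ 9.4 GB of buffer memory, far more than the instance can hold (maxmemory was 1 GB in the INFO MEMORY output).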
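The worst-case figure is plain multiplication of the client count by the per-client hard limit. A short sketch of that arithmetic, using the 300 clients and 32 MB from the text (the function name is hypothetical):

```python
MIB = 2**20
GIB = 2**30


def worst_case_output_buffers(clients, hard_limit_bytes):
    """Upper bound on output-buffer memory if every client hits its hard limit."""
    return clients * hard_limit_bytes


total = worst_case_output_buffers(clients=300, hard_limit_bytes=32 * MIB)
print(f"{total // MIB} MiB = {total / GIB:.1f} GiB")  # 9600 MiB = 9.4 GiB
```

This is an upper bound, not a prediction: it assumes every connection fills its buffer at the same moment, which is roughly what a bandwidth-saturated instance approaches.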
When a client's output buffer grew beyond its limit, Redis closed the connection. The failed reads fell through to the database layer, and SET traffic increased further, presumably as the application repopulated the cache. This feedback loop filled the input buffers and eventually halted command processing.
Root Causes
Natural traffic growth filled bandwidth, leading to high write volume.
Large keys and massive SET requests saturated buffers.
Redis’s single‑threaded model could not keep up, causing intermittent blocking.
Insufficient monitoring of memory, bandwidth, and client buffer usage.
Mitigation and Best Practices
Key recommendations include:
Deploy Redis close to the application (same VPC).
Use separate Redis instances per business.
Choose an eviction policy that fits the workload (volatile-lru is cited as the default here; open-source Redis defaults to noeviction).
Limit key size (<10 KB) and sub‑key count (<1 000).
Serialize values with readable formats.
Use connection pools (JedisPool/JedisCluster) and set generous timeouts.
Avoid range queries such as KEYS * (use the incremental SCAN instead) and heavy Lua scripts; prefer Redis modules.
Enable monitoring, slow‑log analysis, top‑key statistics, and audit logs when needed.
Configure client-output-buffer-limit according to workload.
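The key-size and sub-key guidelines in this list can be enforced on the application side before a write is sent. Below is a minimal sketch, assuming values are plain Python strings, bytes, or collections; the `violations` helper and both constants are hypothetical names that encode the thresholds given above.

```python
MAX_VALUE_BYTES = 10 * 1024  # "<10 KB" guideline for a single value
MAX_SUBKEYS = 1000           # "<1 000" guideline for hash fields, list items, etc.


def violations(key, value):
    """Return human-readable problems for a value about to be written to Redis."""
    problems = []
    if isinstance(value, (str, bytes)):
        # Measure the encoded size, since that is what goes over the wire.
        size = len(value.encode("utf-8")) if isinstance(value, str) else len(value)
        if size >= MAX_VALUE_BYTES:
            problems.append(f"{key}: value is {size} bytes (limit {MAX_VALUE_BYTES})")
    elif isinstance(value, (dict, list, set)):
        if len(value) >= MAX_SUBKEYS:
            problems.append(f"{key}: {len(value)} sub-keys (limit {MAX_SUBKEYS})")
    return problems


print(violations("user:1:profile", "x" * 20000))
print(violations("user:1:tags", {"a", "b", "c"}))
```

A check like this only logs or rejects oversized writes; it does not replace server-side settings such as client-output-buffer-limit.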
Operational Commands
Useful commands for incident response: CLIENT LIST, INFO MEMORY, MEMORY USAGE, and regular slow‑log checks.
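During an incident, the CLIENT LIST output can be ranked by buffer usage to find the connections holding the most memory. The sketch below parses captured output as text and sorts by `qbuf` (query buffer bytes) plus `omem` (output buffer bytes), both of which are standard CLIENT LIST fields. The sample lines are invented and trimmed to a few fields, and the helper names are hypothetical.

```python
def parse_client_list(text):
    """Turn CLIENT LIST output (one client per line, key=value tokens) into dicts."""
    clients = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = {}
        for token in line.split():
            key, _, value = token.partition("=")
            fields[key] = value  # an empty value such as "name=" stays ""
        clients.append(fields)
    return clients


def top_buffer_users(clients, n=5):
    """Return the n clients with the largest combined input and output buffers."""
    def buffered(client):
        return int(client.get("qbuf", 0)) + int(client.get("omem", 0))
    return sorted(clients, key=buffered, reverse=True)[:n]


# Made-up sample; real output carries more fields per line.
SAMPLE_CLIENTS = """\
id=11 addr=10.0.0.5:50112 name= age=120 idle=0 qbuf=26 omem=0 cmd=get
id=12 addr=10.0.0.6:50313 name=worker age=98 idle=1 qbuf=0 omem=33554432 cmd=get
id=13 addr=10.0.0.7:50417 name= age=75 idle=0 qbuf=1048576 omem=0 cmd=set
"""

for client in top_buffer_users(parse_client_list(SAMPLE_CLIENTS), n=2):
    print(client["id"], client["addr"], client["qbuf"], client["omem"])
```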
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
