How to Detect and Resolve Redis Big‑Key Issues That Cause Service Timeouts
This article walks through a real‑world incident where a Redis cache big‑key caused service timeouts, explains how to identify the problem using cluster metrics, outlines detection tools and commands, and provides practical steps to delete big keys and prevent future occurrences.
Problem Description
At 19:44 an alert indicated that an online API's tp99 latency had spiked to over 300 ms (normally 8 ms). Monitoring showed intermittent timeouts on random machines, and the cache-dependent service had its degradation switch turned off. By 20:13 the root cause was identified: a single Redis big key larger than 5 MB that was being requested continuously.
Observation (望)
High‑concurrency traffic stresses distributed caches like Redis. When a large key is repeatedly accessed, the affected shard’s outbound traffic surges while other shards remain normal. Redis Cluster distributes keys across 16,384 hash slots, so a big key resides on a single shard, making its traffic pattern a clear indicator.
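To see why a single key always lands on one shard, the slot mapping can be sketched in a few lines. The sketch below assumes the scheme described in the Redis Cluster specification (CRC16/XMODEM of the key, modulo 16384, with hash-tag handling); the function names crc16_xmodem and key_slot are illustrative and not part of any Redis client library.

```python
def crc16_xmodem(data: bytes) -> int:
    """CRC16-CCITT (XMODEM variant): polynomial 0x1021, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_slot(key: bytes) -> int:
    """Return the hash slot (0..16383) for a key, honoring {hash tags}."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' counts.
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key) % 16384


if __name__ == "__main__":
    # Hypothetical key name, used only for illustration.
    print(key_slot(b"user:profile:1001"))
```

On a live cluster, the CLUSTER KEYSLOT command returns the same slot number, and CLUSTER SLOTS (or CLUSTER SHARDS on newer versions) shows which node owns it.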
Investigation (闻)
Key symptoms include:
One shard receives modest inbound traffic but massive outbound traffic.
Only a specific shard experiences timeouts while others are normal.
Understanding Redis Cluster’s slot allocation helps quickly pinpoint the problematic shard.
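The symptoms above can be turned into a simple check over per-shard traffic metrics. This is a minimal sketch, assuming the monitoring system can export inbound and outbound bytes per second for each shard; the function name find_skewed_shards, the shard names, the sample numbers, and the factor of 5 are all hypothetical.

```python
from statistics import median


def find_skewed_shards(traffic, ratio_factor=5.0):
    """Flag shards whose outbound/inbound ratio is far above the median.

    traffic maps a shard name to (inbound_bytes_per_s, outbound_bytes_per_s).
    """
    ratios = {
        shard: out_bps / max(in_bps, 1)  # guard against division by zero
        for shard, (in_bps, out_bps) in traffic.items()
    }
    baseline = median(ratios.values())
    threshold = ratio_factor * max(baseline, 1e-9)
    return sorted(shard for shard, ratio in ratios.items() if ratio > threshold)


if __name__ == "__main__":
    # Made-up numbers: shard-3 has modest inbound but massive outbound traffic.
    sample = {
        "shard-1": (1_000, 8_000),
        "shard-2": (1_100, 9_000),
        "shard-3": (1_200, 900_000),
    }
    print(find_skewed_shards(sample))
```

Comparing against the median rather than a fixed byte count keeps the check usable when overall traffic rises and falls during the day.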
Detection Methods (问)
Several tools can scan for big keys:
redis-rdb-tools: Run bgsave on the instance, then analyze the generated dump.rdb with rdb -c memory dump.rdb to list large keys.
redis-cli --bigkeys: Shows the biggest keys for each data type (string, hash, list, set, zset).
Custom Python scripts that iterate over keys similarly to --bigkeys.
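A custom scan of this kind might look like the sketch below. It assumes a redis-py style client that exposes scan_iter() and memory_usage() (the latter relies on the MEMORY USAGE command, available in Redis 4.0 and later); the function name find_big_keys and the 1 MB threshold are illustrative choices, not values from the incident.

```python
def find_big_keys(client, threshold_bytes=1024 * 1024, scan_count=1000):
    """Return (key, size_in_bytes) pairs at or above the threshold, largest first.

    client is assumed to behave like redis.Redis: scan_iter() walks the
    keyspace incrementally with SCAN, so the server is not blocked the way
    it would be by KEYS *.
    """
    big_keys = []
    for key in client.scan_iter(count=scan_count):
        size = client.memory_usage(key)
        # memory_usage() returns None if the key expired between SCAN and the call.
        if size is not None and size >= threshold_bytes:
            big_keys.append((key, size))
    big_keys.sort(key=lambda item: item[1], reverse=True)
    return big_keys
```

With redis-py installed, the function could be called as find_big_keys(redis.Redis(host="localhost", port=6379)); the host and port here are placeholders. In a cluster, each node would need to be scanned, and running the scan against a replica keeps the extra load off the primary.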
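After locating the big key, delete it with the DEL command. Note that DEL on a very large collection runs synchronously and can itself block the server; on Redis 4.0 and later, UNLINK removes the key and reclaims its memory in a background thread.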
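Where a single synchronous delete of a large hash is a concern, one common approach is to drain the hash in small batches before removing the key. The sketch below assumes a redis-py style client with hscan(), hdel(), and delete(); the function name delete_big_hash and the batch size of 100 are hypothetical, and other collection types would need their own variant (for example SSCAN with SREM for sets).

```python
def delete_big_hash(client, key, batch_size=100):
    """Remove a large hash in batches, then delete the (now small) key.

    Returns the number of fields removed. client is assumed to behave like
    redis.Redis, where hscan() returns (next_cursor, {field: value}).
    """
    cursor = 0
    removed = 0
    while True:
        cursor, fields = client.hscan(key, cursor=cursor, count=batch_size)
        if fields:
            removed += client.hdel(key, *fields.keys())
        if cursor == 0:  # a zero cursor means the iteration is complete
            break
    client.delete(key)  # cheap once the hash has been emptied
    return removed
```

Each HDEL call touches only a bounded number of fields, so other clients get a chance to run between batches.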
Resolution (切)
To prevent recurrence:
Avoid using Redis as a primary store for complex data structures; split large objects at design time.
Introduce validation layers before caching to reject keys exceeding a size threshold and raise alerts.
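A validation layer of this kind can be a thin wrapper around the cache write. This is a minimal sketch, assuming a redis-py style client with set(); the names guarded_set and ValueTooLargeError, the logger name, and the 10 KB default threshold are hypothetical and would be chosen per service.

```python
import logging

logger = logging.getLogger("cache.guard")  # hypothetical logger name


class ValueTooLargeError(ValueError):
    """Raised when a value exceeds the configured cache size threshold."""


def guarded_set(client, key, value: bytes, max_bytes=10 * 1024, ttl_seconds=None):
    """Write value to the cache only if it is within the size threshold."""
    size = len(value)
    if size > max_bytes:
        # The warning is the hook for alerting; the write itself is refused.
        logger.warning("rejected big value: key=%s size=%d limit=%d",
                       key, size, max_bytes)
        raise ValueTooLargeError(
            f"value for {key!r} is {size} bytes, limit is {max_bytes}")
    return client.set(key, value, ex=ttl_seconds)
```

Raising an exception rather than silently truncating makes the oversized write visible to the caller, which can then split the object or skip caching it.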
Redis maintains several buffers: a per-client input (query) buffer, a per-client output buffer, and the replication and AOF buffers. Big-key responses can overflow the client's output buffer, which causes Redis to close the connection.
Output buffer overflow typically results from:
Large responses from big‑key requests.
Running the MONITOR command.
Improper buffer size settings.
Adjust the client-output-buffer-limit settings, for example:
client-output-buffer-limit normal 0 0 0 (no limit for normal clients).
client-output-buffer-limit pubsub 8mb 2mb 60 (close the connection immediately if the buffer exceeds 8 MB, or if it stays above 2 MB for 60 s continuously).
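The hard and soft limits interact in a way that is easy to misread, so a small model may help. The sketch below is a simplified illustration of the rule as documented for client-output-buffer-limit, not Redis's actual implementation; the class names OutputBufferLimit and ClientBufferTracker are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutputBufferLimit:
    hard_bytes: int    # 0 disables the hard limit
    soft_bytes: int    # 0 disables the soft limit
    soft_seconds: int  # how long the soft limit may be exceeded


class ClientBufferTracker:
    """Tracks one client's output buffer against its configured limit."""

    def __init__(self, limit: OutputBufferLimit):
        self.limit = limit
        self.soft_since: Optional[float] = None  # when the soft limit was first hit

    def should_close(self, buffered_bytes: int, now: float) -> bool:
        lim = self.limit
        # Hard limit: close as soon as it is reached.
        if lim.hard_bytes and buffered_bytes >= lim.hard_bytes:
            return True
        # Soft limit: close only if exceeded continuously for soft_seconds.
        if lim.soft_bytes and buffered_bytes >= lim.soft_bytes:
            if self.soft_since is None:
                self.soft_since = now
                return False
            return now - self.soft_since > lim.soft_seconds
        # Buffer dropped below the soft limit, so the timer resets.
        self.soft_since = None
        return False


if __name__ == "__main__":
    mb = 1024 * 1024
    pubsub = ClientBufferTracker(OutputBufferLimit(8 * mb, 2 * mb, 60))
    print(pubsub.should_close(3 * mb, now=0))   # above soft limit, timer starts
    print(pubsub.should_close(3 * mb, now=61))  # still above after 61 s
```

With normal 0 0 0, both limits are disabled in this model and should_close never returns True, which is why a slow consumer of big-key responses can grow its buffer without bound under that setting.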
Additional preventive steps:
Avoid storing big keys.
Do not use MONITOR in production.
Set reasonable client-output-buffer-limit values.
These measures address the ways in which large keys cause client-server connection interruptions and help keep Redis performance stable.
dbaplus Community
