
Investigation and Root Cause Analysis of a Redis Memory Leak in Production

An in-depth, timeline-driven investigation traced a production Redis memory leak to the getKeysInSlot function in a custom 3.2.8 build, which failed to free a temporary key array after traversing the radix tree and leaked hundreds of megabytes of SDS strings. The fix was a single added free call, and the incident highlighted the need for functional code reviews and early leak detection.

Didi Tech

Redis, a high‑performance in‑memory key‑value database, is widely used in performance‑critical systems and is a major memory consumer at Didi. This article presents a timeline‑driven investigation of a production Redis memory leak, describing Linux memory‑leak diagnosis methods and tools.

16:30 – Problem Exposure: After a scale-down operation, the system raised a 90% memory-usage alarm. Only about 10,000 keys existed, and no keys larger than 512 bytes were found.

16:40 – Leak Confirmation: Certain instances showed memory usage of 300–800 MB, while normal instances used about 10 MB. The affected version was 4ce35dea; a similar leak had been observed in an older version, 49bdcd0b, in September, which indicated a long-standing issue.
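The gap between expected and observed memory can be made explicit with a quick bound. The sketch below is illustrative only: it assumes the 512-byte ceiling from the 16:30 check applies to each stored entry, reads the used_memory field from INFO memory output, and flags an instance whose usage exceeds a generous multiple of that bound. The function names, the sample text, and the slack factor of 20 are hypothetical and do not come from the original investigation.

```python
def parse_info(text):
    """Parse the 'field:value' lines of Redis INFO output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Memory"
        name, sep, value = line.partition(":")
        if sep:
            fields[name] = value
    return fields


def expected_budget(key_count, max_entry_bytes=512):
    """Upper bound on payload size if every key is at the size ceiling."""
    return key_count * max_entry_bytes


def leak_suspect(info_text, key_count, slack=20.0):
    """True when used_memory exceeds `slack` times the payload bound.

    `slack` is a hypothetical allowance for allocator and bookkeeping overhead.
    """
    used = int(parse_info(info_text)["used_memory"])
    return used > slack * expected_budget(key_count)


# Sample INFO text invented for illustration (800 MiB vs. 10 MiB).
leaking_instance = "# Memory\r\nused_memory:838860800\r\n"
normal_instance = "# Memory\r\nused_memory:10485760\r\n"
print(expected_budget(10_000))                   # payload bound in bytes
print(leak_suspect(leaking_instance, 10_000))
print(leak_suspect(normal_instance, 10_000))
```

With about 10,000 keys the payload bound is roughly 5 MB, so an instance holding hundreds of megabytes cannot be explained by the data it reports.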

17:30 – Checking Community Version: The commit history of the 3.2.8 community branch was examined. Only one commit related to a memory leak was found: Memory leak in clusterRedirectBlockedClientIfNeeded.

18:10 – Organizing Monitoring and Logs: Monitoring data showed that the leak had started two months earlier, was not continuous, and was triggered by a specific event. The leak was large (≈800 MB on the primary instance vs. 10 MB on normal instances). Relevant screenshots of monitoring graphs and logs are included in the original article.

18:00 – Dumping Memory: A full memory dump of the leaking instance was taken with GDB. The dump contained about 6.47 million keys that did not belong to the node, while the DB reported only about 16,000 keys, which pointed to a problem in slot migration.
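One way to check whether a dumped key belongs to a node is to recompute its hash slot. The sketch below follows the slot rule in the Redis Cluster specification (CRC16 of the key, or of its hash tag, modulo 16384) and relies on Python's binascii.crc_hqx, which uses the same CRC-CCITT polynomial and is called here with a zero initial value. The foreign_keys helper and its owned_ranges argument are hypothetical; the article does not describe how the team actually matched keys to slots.

```python
import binascii

SLOT_COUNT = 16384


def key_slot(key: bytes) -> int:
    """Hash slot of a key, honoring the {hash tag} rule of Redis Cluster."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Only a non-empty tag between the first '{' and the next '}' is used.
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return binascii.crc_hqx(key, 0) % SLOT_COUNT


def foreign_keys(keys, owned_ranges):
    """Return the keys whose slot falls outside every (low, high) range owned."""
    def owned(slot):
        return any(low <= slot <= high for low, high in owned_ranges)
    return [k for k in keys if not owned(key_slot(k))]


# Hypothetical node owning slots 0..5460.
dumped = [b"foo", b"hello", b"{user1000}.following"]
print([key_slot(k) for k in dumped])
print(foreign_keys(dumped, [(0, 5460)]))
```

Running a check like this over the strings recovered from a dump gives a count of keys that the node should no longer hold.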

18:30 – First Code Diff: The custom 3.2.8 version introduced two major changes: (1) the slot key-set storage was switched from a skip list to a radix-tree (rax) structure, back-ported from the unstable branch, and (2) multi-active support was added.

20:30 – Tool-Based Diagnosis: Several tools were tried. Memory Doctor, a Redis 4 memory-diagnosis command, is not implemented in the 3.x series. jemalloc profiling (jeprof) required recompilation with --enable-prof, which was not feasible. perf captured brk system calls but showed no anomalies. valgrind was kept as a last resort.

22:00 – Team Communication: The team suspected a leak in the rax implementation, or one triggered by a combination of multi-active and failover actions.

Next Day – Hexdump Analysis: A hexdump of the memory dump showed that the leaked memory consisted of SDS (simple dynamic string) structures, each about 80 bytes long, stored as sdshdr8. The pattern of continuous “OO TT SS” characters matched the layout of SDS headers.
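To illustrate what recognizing SDS headers in a dump involves, here is a minimal sketch of a scanner for sdshdr8 strings. It assumes the layout used since Redis 3.2: a one-byte len, a one-byte alloc, a flags byte whose low three bits hold the type (1 for sdshdr8), then the buffer and a terminating NUL. Requiring the unused flag bits to be zero and the payload to be printable ASCII are heuristics added here to cut false positives, and pack_sdshdr8 and scan_sdshdr8 are hypothetical helpers, not tools from the investigation.

```python
SDS_TYPE_8 = 1  # type value stored in the low 3 bits of the flags byte


def pack_sdshdr8(payload: bytes, alloc=None) -> bytes:
    """Build the in-memory image of an sdshdr8 string (test data only)."""
    alloc = len(payload) if alloc is None else alloc
    if not len(payload) <= alloc <= 255:
        raise ValueError("sdshdr8 needs len <= alloc <= 255")
    padding = b"\x00" * (alloc - len(payload) + 1)  # unused space + NUL
    return bytes([len(payload), alloc, SDS_TYPE_8]) + payload + padding


def scan_sdshdr8(blob: bytes, min_len=1):
    """Return (offset, payload) for each plausible sdshdr8 string in blob."""
    found = []
    i = 0
    while i + 3 < len(blob):
        length, alloc, flags = blob[i], blob[i + 1], blob[i + 2]
        end = i + 3 + length
        plausible = (
            flags == SDS_TYPE_8               # assumes unused flag bits are 0
            and min_len <= length <= alloc
            and end < len(blob)
            and blob[end] == 0                # NUL terminator after the payload
            and all(0x20 <= b <= 0x7E for b in blob[i + 3:end])
        )
        if plausible:
            found.append((i, blob[i + 3:end]))
            i += 3 + alloc + 1                # jump past header, buffer, NUL
        else:
            i += 1
    return found


# Synthetic "dump": filler bytes around two SDS strings.
dump = b"\xff" * 5 + pack_sdshdr8(b"user:1001") + b"\xff" * 4 + pack_sdshdr8(b"order:42", alloc=16)
for offset, payload in scan_sdshdr8(dump):
    print(offset, payload)
```

As a rough consistency check, about 6.47 million leaked strings (the "647 w" figure from the dump) at roughly 80 bytes each come to around 500 MB, which sits inside the 300–800 MB range seen on the affected instances.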

Root Cause Identification: The function getKeysInSlot traverses the rax tree, collects key strings, and returns them as an array of pointers. After the keys are sent to the client, the array is not freed, so every call leaks memory and the total grows large. In the original 3.2.8 code, which used a skip list, each node already held an obj pointer, so no copy (and thus no extra free) was required.
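The ownership mistake can be modeled in a few lines. The sketch below is a toy model, not the Redis code: TrackingAllocator stands in for malloc and free by counting live allocations, and the get_keys_in_slot, reply_leaky, and reply_fixed functions are hypothetical Python stand-ins for a collector that copies keys and for callers that do or do not release those copies.

```python
class TrackingAllocator:
    """Stand-in for malloc/free that records which allocations are still live."""

    def __init__(self):
        self.live = {}
        self._next_handle = 1

    def alloc(self, data: bytes) -> int:
        handle = self._next_handle
        self._next_handle += 1
        self.live[handle] = bytes(data)
        return handle

    def free(self, handle: int) -> None:
        del self.live[handle]

    def live_bytes(self) -> int:
        return sum(len(v) for v in self.live.values())


def get_keys_in_slot(heap, slot_keys, count):
    """Copy up to `count` keys out of the index; the caller owns the copies."""
    return [heap.alloc(key) for key in slot_keys[:count]]


def reply_leaky(heap, slot_keys, count):
    handles = get_keys_in_slot(heap, slot_keys, count)
    return [heap.live[h] for h in handles]   # copies are never released


def reply_fixed(heap, slot_keys, count):
    handles = get_keys_in_slot(heap, slot_keys, count)
    try:
        return [heap.live[h] for h in handles]
    finally:
        for h in handles:                    # release every copy after replying
            heap.free(h)


keys = [b"k1", b"k2", b"k3"]
leaky_heap, fixed_heap = TrackingAllocator(), TrackingAllocator()
for _ in range(100):
    reply_leaky(leaky_heap, keys, 3)
    reply_fixed(fixed_heap, keys, 3)
print(leaky_heap.live_bytes(), fixed_heap.live_bytes())
```

In this model the leaky path grows with every request while the fixed path returns to zero, which matches the non-continuous, event-triggered growth seen in monitoring.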

Fix: The fix required adding a single line of code to free the returned key strings after they are sent to the client. A screenshot of the code change is provided in the source.

Post-mortem Thoughts: Code review should be performed from a functional perspective, not only on the diff lines. Catching memory-leak risks during design and review is far cheaper than debugging them in production. Dynamic analysis tools (valgrind, sanitizers) and production tools (memleak, perf) should be integrated into testing pipelines.
