Analyzing and Resolving an R2M Cache Usage Alert Before the 618 Promotion
This article walks through a real‑world R2M (Redis‑like) cache alert, detailing the email notification, large‑key analysis, code inspection, root‑cause identification, and both immediate and long‑term solutions that reduced cache usage by over 97% and prevented future incidents.
1 Problem Investigation
1.1 Email Alert
During the on-call shift just before the 618 promotion, an email alert reported that R2M cluster usage had reached 85% and required urgent handling.
R2M is deployed in cluster mode with three masters and three slaves; each master holds 10,800 MB, for a total configured capacity of 32,400 MB. The most heavily used node had already reached 9,087 MB.
1.2 Code Analysis
Large-key analysis showed that the offending keys fall into two patterns: xxx_data and xxx_interfacecode_01. The following snippets show where these keys are generated:
// Sample data for a batch task is cached under "<taskNo>_data"
String dataKey = task.getTaskNo() + "_data";
cacheClusterClient.setex(dataKey.getBytes(), EXPIRATION, DataUtil.objectToByte(paramList));

// Each result shard is cached under "<taskNo>_<interfaceCode>_<shardIndex>"
String key = task.getTaskNo() + "_" + item.getInterfaceCode() + "_" + partCount;
cacheClusterClient.setex(key.getBytes(), EXPIRATION, DataUtil.objectToByte(dataList));

After locating the code, the business flow was examined (see the diagram in the original article).
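The key construction itself is plain string concatenation, which a standalone sketch makes easy to verify; the task number and interface code values below are made up for illustration:

```java
public class CacheKeys {
    // Key for the cached sample data of a batch task: "<taskNo>_data"
    static String dataKey(String taskNo) {
        return taskNo + "_data";
    }

    // Key for one result shard: "<taskNo>_<interfaceCode>_<shardIndex>"
    static String shardKey(String taskNo, String interfaceCode, int partCount) {
        return taskNo + "_" + interfaceCode + "_" + partCount;
    }

    public static void main(String[] args) {
        System.out.println(dataKey("T20230618001"));                      // T20230618001_data
        System.out.println(shardKey("T20230618001", "interfacecode", 1)); // T20230618001_interfacecode_1
    }
}
```

These are exactly the shapes that showed up in the large-key scan: one xxx_data key per task, plus one shard key per task, interface code, and shard index.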
1.3 Alert Causes
The high usage can be divided into direct and root causes:
1.3.1 Direct Cause
Users created a large number of batch tasks over the preceding three days, sharply increasing the volume of sample and result data stored in the cache.
1.3.2 Root Cause
Samples are written to the cache only to be read back by the same flow moments later, which provides no real benefit and only consumes memory.
Results are stored in shards; this is useful to avoid large JVM memory consumption during parallel processing, but the shards still occupy cache space.
After a batch finishes, intermediate data is not actively deleted; it relies on TTL expiration, causing data to linger in the cache for an extended period.
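The sharding idea in the second point can be sketched in isolation: split a result list into fixed-size parts so that no single cached value (or JVM buffer) has to hold the entire result set. The shard size and key names below are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class ResultSharding {
    // Split results into fixed-size shards; each shard becomes one cache value.
    static <T> List<List<T>> shard(List<T> results, int shardSize) {
        List<List<T>> shards = new ArrayList<>();
        for (int i = 0; i < results.size(); i += shardSize) {
            shards.add(new ArrayList<>(
                    results.subList(i, Math.min(i + shardSize, results.size()))));
        }
        return shards;
    }

    public static void main(String[] args) {
        List<Integer> results = new ArrayList<>();
        for (int i = 0; i < 10; i++) results.add(i);
        List<List<Integer>> shards = shard(results, 4);
        // Each shard would be written under "<taskNo>_<interfaceCode>_<index>"
        for (int i = 0; i < shards.size(); i++) {
            System.out.println("task1_code1_" + (i + 1) + " -> " + shards.get(i));
        }
    }
}
```

This keeps peak JVM memory bounded during parallel processing, but every shard still occupies cache space until it expires or is deleted, which is exactly the problem described above.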
2 Problem Resolution
2.1 Direct Cause
2.1.1 Analysis
Because a change freeze was in effect ahead of the 618 promotion, no code fix could be deployed immediately; the first mitigation was therefore to stop users from creating new batch tasks.
Monitoring showed that although the TTL for keys was set to one day, the cache usage kept rising for three days, suggesting that expired keys were not being physically removed.
Redis documentation explains that expired keys are not necessarily reclaimed immediately: when a large number of keys expire around the same time, the incremental background expiration process can take a noticeable amount of time to physically delete them.
2.1.2 Solution
Redis removes expired keys along two paths: (1) passive (lazy) expiration, where a key is checked when it is accessed and deleted if its TTL has elapsed; (2) active expiration, where a background cycle periodically scans keys with TTLs and evicts the expired ones.
Since access-triggered deletion only reclaims keys that happen to be read again, the team chose instead to increase the frequency of the background active-expiration scan.
After contacting the R2M operations team, it turned out that a configuration parameter controlling the active-expiration speed had been set to 10, six times slower than the default; it was raised to 80 after the 618 incident.
Post‑adjustment monitoring showed a sharp decline in cache usage, confirming that the direct cause was resolved.
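The two expiration paths can be illustrated with a toy in-memory cache driven by a manual clock. This is a simplified model for building intuition, not R2M's or Redis's actual implementation:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class ExpiringCache {
    static class Entry {
        final String value;
        final long expiresAt;
        Entry(String value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<String, Entry> store = new HashMap<>();
    long now = 0; // manual clock keeps the example deterministic

    void setex(String key, long ttl, String value) {
        store.put(key, new Entry(value, now + ttl));
    }

    // Path 1: passive (lazy) expiration -- checked only when the key is accessed
    String get(String key) {
        Entry e = store.get(key);
        if (e == null) return null;
        if (e.expiresAt <= now) {
            store.remove(key);
            return null;
        }
        return e.value;
    }

    // Path 2: active expiration -- a background cycle sweeps expired keys
    void activeExpireCycle() {
        Iterator<Map.Entry<String, Entry>> it = store.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue().expiresAt <= now) {
                it.remove();
            }
        }
    }

    int size() { return store.size(); }

    public static void main(String[] args) {
        ExpiringCache cache = new ExpiringCache();
        cache.setex("a", 10, "1");
        cache.setex("b", 10, "2");
        cache.now = 20;                   // both keys are logically expired...
        System.out.println(cache.size()); // ...but still occupy memory: prints 2
        cache.get("a");                   // lazy path removes "a" on access
        cache.activeExpireCycle();        // active path sweeps the rest
        System.out.println(cache.size()); // prints 0
    }
}
```

Note how both keys stay in the map after their TTLs elapse until one of the two paths runs; if the active cycle runs slowly and the keys are never read again, expired data lingers in memory, which is the effect the monitoring showed.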
2.2 Root Cause
Even with the parameter change, a massive one‑day batch could still cause high cache usage. To address the fundamental issue, the batch workflow was redesigned:
Do not store samples in the cache; pass them directly as method parameters.
Store result shards in OSS (object storage) instead of Redis, since OSS is cheap and latency is not critical for offline jobs.
After the batch completes, actively delete the OSS result shards and set a 7‑day automatic expiration to avoid orphaned data.
The revised flow diagram (original image) reflects these changes.
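The redesigned flow can be sketched with a hypothetical ObjectStore interface standing in for the real OSS SDK. All names and the shard size are illustrative, and in practice the 7-day fallback expiration would be an OSS lifecycle rule rather than application code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchFlow {
    // Stand-in for the OSS client; the real SDK would be used in production.
    interface ObjectStore {
        void put(String key, byte[] data);
        void delete(String key);
    }

    static class InMemoryStore implements ObjectStore {
        final Map<String, byte[]> objects = new HashMap<>();
        public void put(String key, byte[] data) { objects.put(key, data); }
        public void delete(String key) { objects.remove(key); }
    }

    // Samples arrive as a plain method parameter -- no cache round trip.
    static List<String> runBatch(String taskNo, List<String> samples, ObjectStore oss) {
        List<String> shardKeys = new ArrayList<>();
        int shardSize = 2;
        for (int i = 0; i < samples.size(); i += shardSize) {
            String key = taskNo + "/result_part_" + (i / shardSize);
            List<String> shard = samples.subList(i, Math.min(i + shardSize, samples.size()));
            oss.put(key, String.join(",", shard).getBytes()); // shards go to OSS, not Redis
            shardKeys.add(key);
        }
        return shardKeys;
    }

    // After the batch completes, delete the shards instead of waiting for expiration.
    static void cleanup(List<String> shardKeys, ObjectStore oss) {
        shardKeys.forEach(oss::delete);
    }

    public static void main(String[] args) {
        InMemoryStore oss = new InMemoryStore();
        List<String> keys = runBatch("T001", List.of("s1", "s2", "s3"), oss);
        System.out.println(oss.objects.size()); // 2 shards written
        cleanup(keys, oss);
        System.out.println(oss.objects.size()); // 0 after active cleanup
    }
}
```

For simplicity the sketch writes the samples straight through as "results"; the real job would process each shard before storing it. The structure is what matters: no cache round trip for samples, shards in cheap object storage, and active deletion on completion.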
2.3 Optimization Effect
After deployment, cache usage dropped dramatically, achieving an optimization rate of approximately 97.96% ((8.35 – 0.17) / 8.35).
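The quoted rate follows directly from the before-and-after usage figures:

```java
public class OptimizationRate {
    // Fractional reduction from the "before" figure to the "after" figure
    static double rate(double before, double after) {
        return (before - after) / before;
    }

    public static void main(String[] args) {
        // Before/after usage figures from the monitoring above
        System.out.printf("%.2f%%%n", rate(8.35, 0.17) * 100); // 97.96%
    }
}
```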
3 Summary
The case study demonstrates how a systematic investigation—from email alert to large‑key scanning, code review, and understanding Redis expiration mechanics—can uncover both surface and deep causes of cache pressure, and how targeted configuration tweaks and architectural redesign can virtually eliminate the issue.
3.1 Use the Right Middleware for the Right Job
Different middleware excels at different tasks: Redis is ideal for small, hot data that needs fast access, while OSS (or comparable object storage) suits large volumes of data where latency is not critical.
3.2 Learning Technical Details Pays Off
Applying recent Redis knowledge directly to a production incident bridged theory and practice, reinforcing the value of staying up‑to‑date with underlying system behaviors.