Analyzing and Resolving an R2M Cache Usage Alert Before the 618 Promotion
This article walks through a real‑world R2M (Redis‑like) cache alert, detailing the email notification, large‑key analysis, code inspection, root‑cause identification, and both immediate and long‑term solutions that reduced cache usage by over 97% and prevented future incidents.
1 Problem Investigation
1.1 Email Alert
During the on-call shift just before the 618 promotion, an email alert reported that R2M cluster usage had reached 85% and required urgent handling.
R2M is deployed in cluster mode with three masters and three slaves; each master holds 10,800 MB, for a total configured capacity of 32,400 MB. The most heavily used node had already reached 9,087 MB.
1.2 Code Analysis
Large-key analysis showed that the offending keys fall into two patterns: xxx_data and xxx_interfacecode_01. The following snippets show where these keys are generated:
// Sample data for a batch task is cached under "<taskNo>_data"
String dataKey = task.getTaskNo() + "_data";
cacheClusterClient.setex(dataKey.getBytes(), EXPIRATION, DataUtil.objectToByte(paramList));

// Each result shard is cached under "<taskNo>_<interfaceCode>_<shardIndex>"
String key = task.getTaskNo() + "_" + item.getInterfaceCode() + "_" + partCount;
cacheClusterClient.setex(key.getBytes(), EXPIRATION, DataUtil.objectToByte(dataList));

After locating the code, the business flow was examined (see the diagram in the original article).
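The key construction itself is plain string concatenation, which a standalone sketch makes easy to verify; the task number and interface code values below are made up for illustration:

```java
public class CacheKeys {
    // Key for the cached sample data of a batch task: "<taskNo>_data"
    static String dataKey(String taskNo) {
        return taskNo + "_data";
    }

    // Key for one result shard: "<taskNo>_<interfaceCode>_<shardIndex>"
    static String shardKey(String taskNo, String interfaceCode, int partCount) {
        return taskNo + "_" + interfaceCode + "_" + partCount;
    }

    public static void main(String[] args) {
        System.out.println(dataKey("T20230618001"));                      // T20230618001_data
        System.out.println(shardKey("T20230618001", "interfacecode", 1)); // T20230618001_interfacecode_1
    }
}
```

These are exactly the shapes that showed up in the large-key scan: one xxx_data key per task, plus one shard key per task, interface code, and shard index.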
1.3 Alert Causes
The high usage can be divided into direct and root causes:
1.3.1 Direct Cause
Users created a large number of batch tasks over the preceding three days, sharply increasing the volume of sample and result data stored in the cache.
1.3.2 Root Cause
Samples are written to the cache only to be read back by the same flow moments later, which provides no real benefit and only consumes memory.
Results are stored in shards; this is useful to avoid large JVM memory consumption during parallel processing, but the shards still occupy cache space.
After a batch finishes, intermediate data is not actively deleted; it relies on TTL expiration, causing data to linger in the cache for an extended period.
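The sharding idea in the second point can be sketched in isolation: split a result list into fixed-size parts so that no single cached value (or JVM buffer) has to hold the entire result set. The shard size and key names below are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class ResultSharding {
    // Split results into fixed-size shards; each shard becomes one cache value.
    static <T> List<List<T>> shard(List<T> results, int shardSize) {
        List<List<T>> shards = new ArrayList<>();
        for (int i = 0; i < results.size(); i += shardSize) {
            shards.add(new ArrayList<>(
                    results.subList(i, Math.min(i + shardSize, results.size()))));
        }
        return shards;
    }

    public static void main(String[] args) {
        List<Integer> results = new ArrayList<>();
        for (int i = 0; i < 10; i++) results.add(i);
        List<List<Integer>> shards = shard(results, 4);
        // Each shard would be written under "<taskNo>_<interfaceCode>_<index>"
        for (int i = 0; i < shards.size(); i++) {
            System.out.println("task1_code1_" + (i + 1) + " -> " + shards.get(i));
        }
    }
}
```

This keeps peak JVM memory bounded during parallel processing, but every shard still occupies cache space until it expires or is deleted, which is exactly the problem described above.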
2 Problem Resolution
2.1 Direct Cause
2.1.1 Analysis
Because a change freeze was in effect ahead of the 618 promotion, no code fix could be deployed immediately; the first mitigation was therefore to stop users from creating new batch tasks.
Monitoring showed that although the TTL for keys was set to one day, the cache usage kept rising for three days, suggesting that expired keys were not being physically removed.
Redis documentation explains that expired keys are not necessarily reclaimed immediately: when a large number of keys expire around the same time, the incremental background expiration process can take a noticeable amount of time to physically delete them.
2.1.2 Solution
Redis removes expired keys along two paths: (1) passive (lazy) expiration, where a key is checked when it is accessed and deleted if its TTL has elapsed; (2) active expiration, where a background cycle periodically scans keys with TTLs and evicts the expired ones.
Since access-triggered deletion only reclaims keys that happen to be read again, the team chose instead to increase the frequency of the background active-expiration scan.
After contacting the R2M operations team, it turned out that a configuration parameter controlling the active-expiration speed had been set to 10, six times slower than the default; it was raised to 80 after the 618 incident.
Post‑adjustment monitoring showed a sharp decline in cache usage, confirming that the direct cause was resolved.
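The two expiration paths can be illustrated with a toy in-memory cache driven by a manual clock. This is a simplified model for building intuition, not R2M's or Redis's actual implementation:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class ExpiringCache {
    static class Entry {
        final String value;
        final long expiresAt;
        Entry(String value, long expiresAt) { this.value = value; this.expiresAt = expiresAt; }
    }

    private final Map<String, Entry> store = new HashMap<>();
    long now = 0; // manual clock keeps the example deterministic

    void setex(String key, long ttl, String value) {
        store.put(key, new Entry(value, now + ttl));
    }

    // Path 1: passive (lazy) expiration -- checked only when the key is accessed
    String get(String key) {
        Entry e = store.get(key);
        if (e == null) return null;
        if (e.expiresAt <= now) {
            store.remove(key);
            return null;
        }
        return e.value;
    }

    // Path 2: active expiration -- a background cycle sweeps expired keys
    void activeExpireCycle() {
        Iterator<Map.Entry<String, Entry>> it = store.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue().expiresAt <= now) {
                it.remove();
            }
        }
    }

    int size() { return store.size(); }

    public static void main(String[] args) {
        ExpiringCache cache = new ExpiringCache();
        cache.setex("a", 10, "1");
        cache.setex("b", 10, "2");
        cache.now = 20;                   // both keys are logically expired...
        System.out.println(cache.size()); // ...but still occupy memory: prints 2
        cache.get("a");                   // lazy path removes "a" on access
        cache.activeExpireCycle();        // active path sweeps the rest
        System.out.println(cache.size()); // prints 0
    }
}
```

Note how both keys stay in the map after their TTLs elapse until one of the two paths runs; if the active cycle runs slowly and the keys are never read again, expired data lingers in memory, which is the effect the monitoring showed.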
2.2 Root Cause
Even with the parameter change, a massive one‑day batch could still cause high cache usage. To address the fundamental issue, the batch workflow was redesigned:
Do not store samples in the cache; pass them directly as method parameters.
Store result shards in OSS (object storage) instead of Redis, since OSS is cheap and latency is not critical for offline jobs.
After the batch completes, actively delete the OSS result shards and set a 7‑day automatic expiration to avoid orphaned data.
The revised flow diagram (original image) reflects these changes.
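The redesigned flow can be sketched with a hypothetical ObjectStore interface standing in for the real OSS SDK. All names and the shard size are illustrative, and in practice the 7-day fallback expiration would be an OSS lifecycle rule rather than application code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BatchFlow {
    // Stand-in for the OSS client; the real SDK would be used in production.
    interface ObjectStore {
        void put(String key, byte[] data);
        void delete(String key);
    }

    static class InMemoryStore implements ObjectStore {
        final Map<String, byte[]> objects = new HashMap<>();
        public void put(String key, byte[] data) { objects.put(key, data); }
        public void delete(String key) { objects.remove(key); }
    }

    // Samples arrive as a plain method parameter -- no cache round trip.
    static List<String> runBatch(String taskNo, List<String> samples, ObjectStore oss) {
        List<String> shardKeys = new ArrayList<>();
        int shardSize = 2;
        for (int i = 0; i < samples.size(); i += shardSize) {
            String key = taskNo + "/result_part_" + (i / shardSize);
            List<String> shard = samples.subList(i, Math.min(i + shardSize, samples.size()));
            oss.put(key, String.join(",", shard).getBytes()); // shards go to OSS, not Redis
            shardKeys.add(key);
        }
        return shardKeys;
    }

    // After the batch completes, delete the shards instead of waiting for expiration.
    static void cleanup(List<String> shardKeys, ObjectStore oss) {
        shardKeys.forEach(oss::delete);
    }

    public static void main(String[] args) {
        InMemoryStore oss = new InMemoryStore();
        List<String> keys = runBatch("T001", List.of("s1", "s2", "s3"), oss);
        System.out.println(oss.objects.size()); // 2 shards written
        cleanup(keys, oss);
        System.out.println(oss.objects.size()); // 0 after active cleanup
    }
}
```

For simplicity the sketch writes the samples straight through as "results"; the real job would process each shard before storing it. The structure is what matters: no cache round trip for samples, shards in cheap object storage, and active deletion on completion.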
2.3 Optimization Effect
After deployment, cache usage dropped dramatically, achieving an optimization rate of approximately 97.96% ((8.35 – 0.17) / 8.35).
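The quoted rate follows directly from the before-and-after usage figures:

```java
public class OptimizationRate {
    // Fractional reduction from the "before" figure to the "after" figure
    static double rate(double before, double after) {
        return (before - after) / before;
    }

    public static void main(String[] args) {
        // Before/after usage figures from the monitoring above
        System.out.printf("%.2f%%%n", rate(8.35, 0.17) * 100); // 97.96%
    }
}
```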
3 Summary
The case study demonstrates how a systematic investigation—from email alert to large‑key scanning, code review, and understanding Redis expiration mechanics—can uncover both surface and deep causes of cache pressure, and how targeted configuration tweaks and architectural redesign can virtually eliminate the issue.
3.1 Use the Right Middleware for the Right Job
Different middleware excels at different tasks: Redis is ideal for small, hot data that needs fast access, while OSS (or comparable object storage) suits large volumes of data where latency is not critical.
3.2 Learning Technical Details Pays Off
Applying recent Redis knowledge directly to a production incident bridged theory and practice, reinforcing the value of staying up‑to‑date with underlying system behaviors.