Why Did Our Redis‑Driven Service OOM? A Deep Dive into JVM Memory and GC
The article walks through a real‑world OOM incident in a high‑traffic hotel information service, detailing the root‑cause analysis of memory exhaustion, JVM heap configuration, GC behavior, heap‑dump inspection, and the concrete optimizations applied to prevent similar failures.
Problem Overview
During a load‑test that simulated three historical MySQL/Redis timeout scenarios, the hotel basic‑info service experienced a 2.5× traffic spike with 30 ms Redis timeouts. The JVM memory usage surged to 96 % and the container was OOM‑Killed, causing an automatic restart.
Investigation Steps
Identify OOM type via error logs.
Analyze GC logs.
Perform heap dump analysis with MAT.
Trace code paths that generate the load.
JVM Memory Layout
JVM Memory
├── Heap
│ ├── Young Generation
│ │ ├── Eden
│ │ └── Survivor (From/To)
│ └── Old Generation
├── Non‑Heap
│ ├── Metaspace
│ ├── VM Stack
│ ├── Native Method Stack
│ └── Program Counter
└── Direct Memory (outside heap)

Default ratios since JDK 8 are -XX:NewRatio=2 (young : old = 1 : 2) and -XX:SurvivorRatio=8 (Eden : each Survivor space = 8 : 1).
JVM Configuration Used in the Test
-Xms6144M -Xmx6144M
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:+HeapDumpOnOutOfMemoryError
-XX:MetaspaceSize=512m
-XX:MaxMetaspaceSize=512m
-XX:MaxDirectMemorySize=1024M

JDK version: 11. No explicit -XX:NewRatio or -XX:SurvivorRatio was set; note that with G1 the young-generation size is adapted dynamically (between -XX:G1NewSizePercent and -XX:G1MaxNewSizePercent, default 5 % and 60 % of the heap) rather than fixed by NewRatio.
GC Analysis
The application used G1GC. Under high load, young GCs emptied the Eden regions (617 → 0) while the old-region count grew (1885 → 1905), indicating continuous promotion of objects. Mixed GCs later reclaimed some old regions, but the Old Generation remained the dominant memory consumer.
GC logs showed many Full GC cycles (114) with negligible heap size reduction, confirming that the Old Generation was exhausted and Full GC could not free memory.
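To obtain GC logs with this level of per-region detail on JDK 11, the pre-9 `-XX:+PrintGCDetails` flags are replaced by unified logging. A plausible configuration (file name and rotation sizes are illustrative choices, not from the incident):

```
-Xlog:gc*:file=gc.log:time,uptime,level,tags:filecount=5,filesize=20m
```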
Heap Dump Inspection
MAT analysis revealed that java.util.concurrent.ScheduledThreadPoolExecutor and its unbounded DelayedWorkQueue occupied ~2.5 GB, the largest memory consumer. The queue kept growing because Redis timeouts caused cache‑update tasks to pile up, creating a memory leak.
Root‑Cause and Recommendations
Unbounded thread‑pool queue caused unlimited object retention during Redis timeouts.
Old Generation space was insufficient for the promoted objects.
Suggested mitigations:
Replace the unbounded queue with a bounded one (e.g., capacity 1024) and drop excess tasks.
Introduce a Redis degradation switch to disable cache updates when Redis is unavailable.
Monitor Old Region growth and tune G1GC parameters.
Follow‑Up Optimizations
The team implemented a bounded queue and added a Sentinel‑controlled Redis switch. Both changes have been deployed to production.
Knowledge Sharing
These findings were propagated across business domains to enforce bounded‑queue coding standards and Redis degradation design guidelines, reducing the risk of OOM under high‑concurrency timeout scenarios.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.