Why Did Our Redis‑Driven Service OOM? A Deep Dive into JVM Memory and GC
The article walks through a real‑world OOM incident in a high‑traffic hotel information service, detailing the root‑cause analysis of memory exhaustion, JVM heap configuration, GC behavior, heap‑dump inspection, and the concrete optimizations applied to prevent similar failures.
Problem Overview
During a load‑test that simulated three historical MySQL/Redis timeout scenarios, the hotel basic‑info service experienced a 2.5× traffic spike with 30 ms Redis timeouts. The JVM memory usage surged to 96 % and the container was OOM‑Killed, causing an automatic restart.
Investigation Steps
Identify OOM type via error logs.
Analyze GC logs.
Perform heap dump analysis with MAT.
Trace code paths that generate the load.
JVM Memory Layout
JVM Memory
├── Heap
│ ├── Young Generation
│ │ ├── Eden
│ │ └── Survivor (From/To)
│ └── Old Generation
├── Non‑Heap
│ ├── Metaspace
│ ├── VM Stack
│ ├── Native Method Stack
│ └── Program Counter
└── Direct Memory (outside heap)

Default ratios since JDK 8 are -XX:NewRatio=2 (young : old = 1 : 2) and -XX:SurvivorRatio=8 (Eden : each Survivor space = 8 : 1).
JVM Configuration Used in the Test
-Xms6144M -Xmx6144M
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:+HeapDumpOnOutOfMemoryError
-XX:MetaspaceSize=512m
-XX:MaxMetaspaceSize=512m
-XX:MaxDirectMemorySize=1024M

JDK version: 11. No explicit -XX:NewRatio or -XX:SurvivorRatio was set; note that with G1 the young-generation size is adapted dynamically (between -XX:G1NewSizePercent and -XX:G1MaxNewSizePercent, default 5 % and 60 % of the heap) rather than fixed by NewRatio.
GC Analysis
The application used G1GC. Under high load, young GCs emptied the Eden regions (617 → 0) while the old-region count grew (1885 → 1905), indicating continuous promotion of objects. Mixed GCs later reclaimed some old regions, but the Old Generation remained the dominant memory consumer.
GC logs showed many Full GC cycles (114) with negligible heap size reduction, confirming that the Old Generation was exhausted and Full GC could not free memory.
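To obtain GC logs with this level of per-region detail on JDK 11, the pre-9 `-XX:+PrintGCDetails` flags are replaced by unified logging. A plausible configuration (file name and rotation sizes are illustrative choices, not from the incident):

```
-Xlog:gc*:file=gc.log:time,uptime,level,tags:filecount=5,filesize=20m
```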
Heap Dump Inspection
MAT analysis revealed that java.util.concurrent.ScheduledThreadPoolExecutor and its unbounded DelayedWorkQueue occupied ~2.5 GB, the largest memory consumer. The queue kept growing because Redis timeouts caused cache‑update tasks to pile up, creating a memory leak.
Root‑Cause and Recommendations
Unbounded thread‑pool queue caused unlimited object retention during Redis timeouts.
Old Generation space was insufficient for the promoted objects.
Suggested mitigations:
Replace the unbounded queue with a bounded one (e.g., capacity 1024) and drop excess tasks.
Introduce a Redis degradation switch to disable cache updates when Redis is unavailable.
Monitor Old Region growth and tune G1GC parameters.
Follow‑Up Optimizations
The team implemented a bounded queue and added a Sentinel‑controlled Redis switch. Both changes have been deployed to production.
Knowledge Sharing
These findings were propagated across business domains to enforce bounded‑queue coding standards and Redis degradation design guidelines, reducing the risk of OOM under high‑concurrency timeout scenarios.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.