How I Resolved a 13‑Hour OOM Nightmare in a Spring Boot Service
The article walks through a 13‑hour out‑of‑memory incident on a Spring Boot 2.7 service running in Kubernetes, detailing how to preserve the crash dump, interpret GC logs, use MAT and Arthas to pinpoint a static HashMap leak, and apply both temporary and permanent fixes while hardening the system for future safety.
Incident Overview
At 2:00 a.m. an OOM kill hit a Spring Boot 2.7 service (JDK 11, G1GC) deployed on Kubernetes with a 2 GB memory limit. The author spent 13 hours (until 3:00 p.m.) reproducing the whole troubleshooting process.
Step 1 – Preserve the Crash Dump Before Restart
The first instinct is to restart the pod, but that destroys the OOM snapshot. The correct order is:
Dump the live heap before K8s kills the pod:
# In the pod, run
kubectl exec -it order-service-7d9f8b-xk2p9 -- \
jmap -dump:live,format=b,file=/tmp/heap-$(date +%Y%m%d%H%M).hprof 1
# Copy the dump out (the file disappears after restart)
kubectl cp order-service-7d9f8b-xk2p9:/tmp/heap-202405091402.hprof ./heap.hprof
Inspect the Kubernetes event log to confirm the OOM reason:
# kubectl describe pod order-service-7d9f8b-xk2p9
# Look for:
# Reason: OOMKilled
# Exit Code: 137 (128+9, SIGKILL)
Read the GC log (if enabled) for the last minutes before the kill:
# kubectl logs order-service-7d9f8b-xk2p9 --previous | grep -E "GC|OutOfMemory" | tail -50
Step 2 – Understand JVM Memory Layout
Memory is divided into:
Young Generation (Eden, S0, S1) – short‑lived objects.
Old Generation – long‑lived objects; the typical OOM hotspot.
Metaspace – class metadata.
Code Cache, Thread Stacks, Direct Memory – non-heap regions.
The current OOM error was Java heap space, pointing to the Old Generation.
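These regions can be inspected from inside the JVM itself. As a minimal, stdlib-only sketch (pool names vary by collector, so treat the G1 names as illustrative), the `MemoryPoolMXBean` API enumerates exactly the heap and non-heap pools described above:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class MemoryRegions {
    public static void main(String[] args) {
        // Enumerate the JVM's memory pools. With G1 the heap pools are
        // "G1 Eden Space", "G1 Survivor Space", and "G1 Old Gen"; non-heap
        // pools include Metaspace and the code-cache segments.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String kind = (pool.getType() == MemoryType.HEAP) ? "heap" : "non-heap";
            System.out.printf("%-35s %-8s used=%,d bytes%n",
                    pool.getName(), kind, pool.getUsage().getUsed());
        }
    }
}
```

Watching the Old Gen pool's `used` value climb toward `max` over days is the programmatic equivalent of the jstat output shown later.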
Step 3 – Analyze GC Logs to Distinguish Leak vs. Insufficient Memory
Full GC events showed almost no heap reduction:
[2024-05-09T01:47:33.241+0800] GC(1823) Pause Full (Allocation Failure) 1948M->1951M (2048M) 8.234s
[2024-05-09T01:47:41.488+0800] GC(1824) Pause Full (Allocation Failure) 1951M->1952M (2048M) 9.102s
[2024-05-09T01:47:50.602+0800] GC(1825) Pause Full (Allocation Failure) 1952M->1952M (2048M) 11.847s
[2024-05-09T01:47:50.602+0800] OutOfMemoryError: Java heap space
Because heap occupancy stayed around 1950 MB of the 2048 MB heap after each Full GC, the author concluded this was a memory leak rather than a simple capacity shortfall: a healthy Full GC should reclaim a substantial fraction of the heap.
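This "post-GC heap barely drops" heuristic can be automated over the GC log. A small sketch (the regex and the 90% threshold are assumptions, not part of the original tooling) that flags a leak when every Full GC leaves the heap near capacity:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FullGcTrend {
    // Matches the "before->after(capacity)" part of a G1 Full GC record,
    // e.g. "1948M->1951M(2048M)". The exact log format is illustrative.
    private static final Pattern HEAP = Pattern.compile("(\\d+)M->(\\d+)M\\s*\\((\\d+)M\\)");

    /** True if the heap after every Full GC stays above the given fraction of capacity. */
    static boolean looksLikeLeak(List<String> fullGcLines, double threshold) {
        for (String line : fullGcLines) {
            Matcher m = HEAP.matcher(line);
            if (!m.find()) continue;
            long after = Long.parseLong(m.group(2));
            long capacity = Long.parseLong(m.group(3));
            if ((double) after / capacity < threshold) {
                return false; // at least one Full GC freed real memory
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "GC(1823) Pause Full (Allocation Failure) 1948M->1951M(2048M) 8.234s",
                "GC(1824) Pause Full (Allocation Failure) 1951M->1952M(2048M) 9.102s",
                "GC(1825) Pause Full (Allocation Failure) 1952M->1952M(2048M) 11.847s");
        System.out.println(looksLikeLeak(lines, 0.90)); // prints "true"
    }
}
```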
Step 4 – Use Eclipse Memory Analyzer (MAT) to Find the Leak
After opening the 1.8 GB .hprof file in MAT, the Leak Suspects Report highlighted a static java.util.HashMap retaining 1.6 GB (81 % of the heap):
Problem 1:
One instance of "java.util.HashMap" loaded by "jdk.internal.loader.ClassLoaders$AppClassLoader" occupies 1,638,940,672 (81.07%) bytes.
→ com.example.order.cache.LocalCacheManager
→ orderCacheMap (static field)
Using the Dominator Tree and sorting by retained heap confirmed the culprit:
Class Name Shallow Heap Retained Heap
java.lang.Thread @ main 48 B 1,721 MB
└─ com.example.order.cache.LocalCacheManager 32 B 1,638 MB
└─ orderCacheMap: java.util.HashMap 64 B 1,638 MB
├─ [entry] orderId=10000001 … 2.1 KB
├─ [entry] orderId=10000002 … 2.1 KB
└─ … (total 820,000 entries)
MAT’s OQL view further verified the entry count:
SELECT * FROM java.util.HashMap$Node n
(On JDK 11, HashMap entries are java.util.HashMap$Node, not the pre-JDK 8 HashMap$Entry. MAT’s OQL has no count(*) or ORDER BY; the row count of the result verifies the entry total, and sorting the result view by retained heap surfaces the largest entries.)
Step 5 – Locate the Root Cause in Source Code
The offending class was a newly added local‑cache component:
@Component
public class LocalCacheManager {

    // Problem: static HashMap is never cleared
    private static final Map<String, OrderDTO> orderCacheMap = new HashMap<>();

    @Autowired
    private OrderRepository orderRepository;

    public OrderDTO getOrder(String orderId) {
        if (orderCacheMap.containsKey(orderId)) {
            return orderCacheMap.get(orderId);
        }
        OrderDTO order = orderRepository.findById(orderId);
        orderCacheMap.put(orderId, order); // ← only inserts, never evicts
        return order;
    }
}
Four facts were identified:
The map is static final, living as long as the JVM.
Every new order adds an entry.
No eviction logic exists.
Order IDs grew unbounded, reaching 820 k entries (≈1.6 GB) after three weeks.
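For contrast, the missing eviction can be illustrated with a zero-dependency bounded cache built on LinkedHashMap. This is a hypothetical sketch, not the article’s actual fix (which uses Caffeine, shown in Step 7); it only demonstrates that a size bound makes unbounded growth impossible:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedOrderCache<K, V> {
    private final Map<K, V> cache;

    public BoundedOrderCache(int maxEntries) {
        // accessOrder=true gives LRU ordering; removeEldestEntry evicts the
        // least recently used entry once the map exceeds maxEntries.
        this.cache = Collections.synchronizedMap(
            new LinkedHashMap<K, V>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > maxEntries;
                }
            });
    }

    public V get(K key) { return cache.get(key); }
    public void put(K key, V value) { cache.put(key, value); }
    public int size() { return cache.size(); }

    public static void main(String[] args) {
        BoundedOrderCache<String, String> c = new BoundedOrderCache<>(2);
        c.put("10000001", "order-1");
        c.put("10000002", "order-2");
        c.put("10000003", "order-3"); // evicts the least recently used entry
        System.out.println(c.size()); // prints "2"
    }
}
```

Had orderCacheMap been bounded this way, the worst case would have been a fixed number of entries rather than 820 k.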
Step 6 – Auxiliary Diagnostic Tools
Arthas (no restart needed) was used to read the map size and watch the getter:
# Download Arthas
curl -O https://arthas.aliyun.com/arthas-boot.jar
java -jar arthas-boot.jar
# Check size
ognl '@com.example.order.cache.LocalCacheManager@orderCacheMap.size()'   # → 820143
# Watch method calls
watch com.example.order.cache.LocalCacheManager getOrder "{params,returnObj}" -x 2
jstat gave real-time GC statistics, confirming Old Gen saturation and frequent Full GC:
# Print GC stats every second
jstat -gcutil $(pgrep -f order-service) 1000
# Example output (relevant columns):
# O (Old Gen) = 98.71% → almost full
# FGC = 127 → Full GC count
# FGCT = 1842 s → total Full GC time
VisualVM / JConsole were configured via JMX port-forwarding for graphical monitoring.
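The same counters jstat reports are also exposed in-process. A minimal sketch using the standard GarbageCollectorMXBean API (bean names are G1-specific and illustrative) that mirrors FGC/FGCT and computes the GC overhead ratio:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverhead {
    public static void main(String[] args) {
        long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
        long totalGcMs = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // With G1 the beans are "G1 Young Generation" and "G1 Old Generation";
            // a rising count/time on the old-generation bean mirrors jstat's FGC/FGCT.
            System.out.printf("%-25s count=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            totalGcMs += gc.getCollectionTime();
        }
        System.out.printf("GC overhead: %.2f%% of uptime%n", 100.0 * totalGcMs / uptimeMs);
    }
}
```

Exporting this ratio as a metric is effectively what the Prometheus alert in Step 8 checks.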
Step 7 – Fixes
Temporary Fix (Arthas Hot‑Patch)
# Clear the map without restarting
ognl '@com.example.order.cache.LocalCacheManager@orderCacheMap.clear()'
# Verify
ognl '@com.example.order.cache.LocalCacheManager@orderCacheMap.size()'   # → 0
# Observe memory recovery (jmap -heap was removed in JDK 9+; use jhsdb instead)
watch -n 2 'jhsdb jmap --heap --pid $(pgrep -f order-service) 2>/dev/null | grep used'
Permanent Fix – Replace the Naked HashMap
Option 1: Use Caffeine with size limit and TTL:
@Component
public class LocalCacheManager {

    @Autowired
    private OrderRepository orderRepository;

    private final Cache<String, OrderDTO> orderCache = Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(5, TimeUnit.MINUTES)
            .recordStats()
            .build();

    public OrderDTO getOrder(String orderId) {
        return orderCache.get(orderId, id -> orderRepository.findById(id));
    }

    public CacheStats stats() { return orderCache.stats(); }
}
Dependency to add:
<dependency>
    <groupId>com.github.ben-manes.caffeine</groupId>
    <artifactId>caffeine</artifactId>
    <version>3.1.8</version>
</dependency>
Option 2: Spring Cache abstraction backed by Caffeine (more declarative):
@Configuration
@EnableCaching
public class CacheConfig {

    @Bean
    public CacheManager cacheManager() {
        CaffeineCacheManager manager = new CaffeineCacheManager("orders");
        manager.setCaffeine(Caffeine.newBuilder()
                .maximumSize(10_000)
                .expireAfterWrite(5, TimeUnit.MINUTES)
                .recordStats());
        return manager;
    }
}

@Service
public class OrderService {

    @Autowired
    private OrderRepository orderRepository;

    @Cacheable(value = "orders", key = "#orderId")
    public OrderDTO getOrder(String orderId) {
        return orderRepository.findById(orderId);
    }

    @CacheEvict(value = "orders", key = "#orderId")
    public void updateOrder(String orderId, OrderDTO dto) {
        orderRepository.save(dto);
    }
}
Option 3: Switch to Redis for distributed caching when data volume or multi-instance sharing is required.
Step 8 – Post‑mortem Hardening
Configuration additions to guarantee future dump and observability:
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/data/logs/heap-dump/
-Xlog:gc*:file=/data/logs/gc.log:time,uptime:filecount=5,filesize=20m
-XX:+UseG1GC
Monitoring alerts (Prometheus + Grafana) to catch early signs:
# Old Gen usage > 80%
- alert: JvmOldGenHigh
  expr: jvm_memory_used_bytes{area="heap",id="G1 Old Gen"} / jvm_memory_max_bytes{area="heap",id="G1 Old Gen"} > 0.8
  for: 5m
  annotations:
    summary: "Old generation usage exceeds 80%, possible memory leak"

# Full GC frequency > 0.5 per minute (rate() is per second, so scale by 60)
- alert: JvmFullGCFrequent
  expr: rate(jvm_gc_pause_seconds_count{action="end of major GC"}[5m]) * 60 > 0.5
  for: 3m
  annotations:
    summary: "Full GC frequency too high (>30 per hour)"

# GC time ratio > 10% (rate() over pause seconds already yields a ratio)
- alert: JvmGCTimeHigh
  expr: rate(jvm_gc_pause_seconds_sum[5m]) > 0.1
  for: 5m
  annotations:
    summary: "GC time >10% of wall-clock time, service latency may suffer"
The code-review checklist was updated to forbid naked HashMap / ConcurrentHashMap as caches and to require size limits and TTLs.
Takeaways
Never restart immediately; preserve the heap dump.
Enable persistent GC logs; they are essential for leak detection.
Static collections without eviction are the most common source of Java memory leaks.
MAT’s Leak Suspects Report quickly points to the offender; Dominator Tree and OQL give the exact path.
Arthas OGNL can read live state and apply hot‑patches, buying time for a proper code fix.
After applying the new cache implementation and monitoring rules, the service has not experienced another OOM in six months.
Java Web Project
Focused on Java backend technologies, trending internet tech, and the latest industry developments. The platform serves over 200,000 Java developers, inviting you to learn and exchange ideas together. Check the menu for Java learning resources.