How I Resolved a 13‑Hour OOM Nightmare in a Spring Boot Service

The article walks through a 13‑hour out‑of‑memory incident on a Spring Boot 2.7 service running in Kubernetes, detailing how to preserve the crash dump, interpret GC logs, use MAT and Arthas to pinpoint a static HashMap leak, and apply both temporary and permanent fixes while hardening the system for future safety.

Incident Overview

At 2:00 AM, an OOM kill hit a Spring Boot 2.7 service (JDK 11, G1GC) deployed on Kubernetes with a 2 GB memory limit. The author spent the next 13 hours (until 3:00 PM) resolving it, and this article reproduces the whole troubleshooting process.

Step 1 – Preserve the Crash Dump Before Restart

The first instinct is to restart the pod, but that destroys the OOM snapshot. The correct order is:

Dump the live heap before K8s kills the pod:

# Run jmap against PID 1 inside the pod
kubectl exec -it order-service-7d9f8b-xk2p9 -- \
  jmap -dump:live,format=b,file=/tmp/heap-$(date +%Y%m%d%H%M).hprof 1

# Copy the dump out (the file disappears after restart)
kubectl cp order-service-7d9f8b-xk2p9:/tmp/heap-202405091402.hprof ./heap.hprof

Inspect the Kubernetes event log to confirm the OOM reason:

# kubectl describe pod order-service-7d9f8b-xk2p9
# Look for:
#   Reason: OOMKilled
#   Exit Code: 137 (128+9, SIGKILL)

Read the GC log (if enabled) covering the last few minutes before the crash:

# kubectl logs order-service-7d9f8b-xk2p9 --previous | grep -E "GC|OutOfMemory" | tail -50
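
If exec access to the pod is locked down, a heap dump can also be triggered from inside the JVM using the JDK's HotSpotDiagnosticMXBean. This is a minimal sketch, not part of the original service; wire it to whatever internal trigger (admin endpoint, shutdown hook) fits your setup:

import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    // Writes an .hprof file; if live = true only reachable objects are dumped.
    public static void dump(String path, boolean live) throws IOException {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, live);   // e.g. "/tmp/heap-manual.hprof"; the file must not already exist
    }
}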

Step 2 – Understand JVM Memory Layout

Memory is divided into:

Young Generation (Eden, S0, S1) – short‑lived objects.

Old Generation – long‑lived objects; the typical OOM hotspot.

Metaspace – class metadata.

Code Cache, thread stacks, Direct Memory – non‑heap regions.

The current OOM error was Java heap space, pointing to the Old Generation.
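
To see these regions on a live JVM, the standard MemoryPoolMXBean API lists every pool with its current usage. A small illustrative snippet (not from the incident service):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class MemoryRegions {
    public static void main(String[] args) {
        // Under G1 this prints pools such as "G1 Eden Space", "G1 Survivor Space",
        // "G1 Old Gen", "Metaspace" and the CodeHeap segments.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage();
            System.out.printf("%-25s type=%-8s used=%,d max=%,d%n",
                pool.getName(), pool.getType(), u.getUsed(), u.getMax());   // max is -1 when undefined
        }
    }
}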

Step 3 – Analyze GC Logs to Distinguish Leak vs. Insufficient Memory

Full GC events showed almost no heap reduction:

[2024-05-09T01:47:33.241+0800] GC(1823) Pause Full (Allocation Failure) 1948M->1951M (2048M) 8.234s
[2024-05-09T01:47:41.488+0800] GC(1824) Pause Full (Allocation Failure) 1951M->1952M (2048M) 9.102s
[2024-05-09T01:47:50.602+0800] GC(1825) Pause Full (Allocation Failure) 1952M->1952M (2048M) 11.847s
[2024-05-09T01:47:50.602+0800] OutOfMemoryError: Java heap space

Because the heap size stayed around 1950 MB after each Full GC, the author concluded a memory leak rather than a simple capacity issue.
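
The same check ("does a Full GC actually reclaim anything?") can be scripted. A rough sketch that scans a unified GC log for Pause Full lines and flags the leak pattern; the 90% threshold is arbitrary and only for illustration:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FullGcCheck {
    // Matches lines like "Pause Full (Allocation Failure) 1948M->1951M (2048M) 8.234s"
    private static final Pattern FULL_GC =
        Pattern.compile("Pause Full.*?(\\d+)M->(\\d+)M\\s*\\((\\d+)M\\)");

    public static void main(String[] args) throws Exception {
        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            Matcher m = FULL_GC.matcher(line);
            if (!m.find()) continue;
            long after = Long.parseLong(m.group(2));   // heap occupancy after the Full GC
            long total = Long.parseLong(m.group(3));   // total heap capacity
            // Still >90% full after a Full GC: the live set itself is huge,
            // which points to a leak rather than an undersized heap.
            if (after * 100 / total > 90) {
                System.out.println("Leak-suspect Full GC: " + line.trim());
            }
        }
    }
}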

Step 4 – Use Eclipse Memory Analyzer (MAT) to Find the Leak

After opening the 1.8 GB .hprof file in MAT, the Leak Suspects Report highlighted a static java.util.HashMap retaining 1.6 GB (81 % of the heap):

Problem 1:
  One instance of "java.util.HashMap" loaded by "jdk.internal.loader.ClassLoaders$AppClassLoader" occupies 1,638,940,672 (81.07%) bytes.
  → com.example.order.cache.LocalCacheManager
  → orderCacheMap (static field)

Using the Dominator Tree and sorting by retained heap confirmed the culprit:

Class Name                     Shallow Heap   Retained Heap
java.lang.Thread @ main               48 B      1,721 MB
  └─ com.example.order.cache.LocalCacheManager   32 B      1,638 MB
      └─ orderCacheMap: java.util.HashMap        64 B      1,638 MB
          ├─ [entry] orderId=10000001 … 2.1 KB
          ├─ [entry] orderId=10000002 … 2.1 KB
          └─ … (total 820,000 entries)

MAT’s class histogram confirms the exact entry count, and an OQL query can list the entries with their retained sizes (on JDK 8+ the HashMap entries are java.util.HashMap$Node objects):

SELECT e, e.@retainedHeapSize FROM java.util.HashMap$Node e

Step 5 – Locate the Root Cause in Source Code

The offending class was a newly added local‑cache component:

@Component
public class LocalCacheManager {
    // Problem: static HashMap never cleared
    private static final Map<String, OrderDTO> orderCacheMap = new HashMap<>();

    @Autowired
    private OrderRepository orderRepository;

    public OrderDTO getOrder(String orderId) {
        if (orderCacheMap.containsKey(orderId)) {
            return orderCacheMap.get(orderId);
        }
        OrderDTO order = orderRepository.findById(orderId);
        orderCacheMap.put(orderId, order); // ← only inserts, never evicts
        return order;
    }
}

Four facts were identified:

The map is static final, living as long as the JVM.

Every new order adds an entry.

No eviction logic exists.

The set of order IDs grew without bound, reaching about 820 k entries after three weeks; at roughly 2 KB of retained heap per entry, that is ≈1.6 GB, matching the MAT figures.

Step 6 – Auxiliary Diagnostic Tools

Arthas (no restart needed) was used to read the map size and watch the getter:

# Download Arthas
curl -O https://arthas.aliyun.com/arthas-boot.jar
java -jar arthas-boot.jar

# Check size
ognl '@com.example.order.cache.LocalCacheManager@orderCacheMap.size()'   # → 820143

# Watch method calls
watch com.example.order.cache.LocalCacheManager getOrder "{params,returnObj}" -x 2

jstat gave real‑time GC statistics, confirming Old Gen saturation and frequent Full GC:

# Print GC stats every second
jstat -gcutil $(pgrep -f order-service) 1000
# Example output (relevant columns):
#   O (Old Gen) = 98.71% → almost full
#   FGC = 127 → Full GC count
#   FGCT = 1842 s → total Full GC time

VisualVM / JConsole were configured via JMX port‑forwarding for graphical monitoring.

Step 7 – Fixes

Temporary Fix (Arthas Hot‑Patch)

# Clear the map without restarting
ognl '@com.example.order.cache.LocalCacheManager@orderCacheMap.clear()'
# Verify
ognl '@com.example.order.cache.LocalCacheManager@orderCacheMap.size()'   # → 0
# Observe memory recovery (jmap -heap was removed in JDK 9+; use jcmd on JDK 11)
watch -n 2 'jcmd $(pgrep -f order-service) GC.heap_info 2>/dev/null | grep used'

Permanent Fix – Replace the Bare HashMap

Option 1: Use Caffeine with size limit and TTL:

@Component
public class LocalCacheManager {
    private final OrderRepository orderRepository;

    private final Cache<String, OrderDTO> orderCache = Caffeine.newBuilder()
        .maximumSize(10_000)                    // hard cap on entry count
        .expireAfterWrite(5, TimeUnit.MINUTES)  // TTL so stale orders fall out
        .recordStats()
        .build();

    public LocalCacheManager(OrderRepository orderRepository) {
        this.orderRepository = orderRepository;
    }

    public OrderDTO getOrder(String orderId) {
        // Loads on a miss; Caffeine enforces the size limit and expiry
        return orderCache.get(orderId, id -> orderRepository.findById(id));
    }

    public CacheStats stats() { return orderCache.stats(); }
}

Dependency to add:

<dependency>
  <groupId>com.github.ben-manes.caffeine</groupId>
  <artifactId>caffeine</artifactId>
  <version>3.1.8</version>
</dependency>

Option 2: Spring Cache abstraction backed by Caffeine (more declarative):

@Configuration
@EnableCaching
public class CacheConfig {
    @Bean
    public CacheManager cacheManager() {
        CaffeineCacheManager manager = new CaffeineCacheManager("orders");
        manager.setCaffeine(Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(5, TimeUnit.MINUTES)
            .recordStats());
        return manager;
    }
}

@Service
public class OrderService {
    private final OrderRepository orderRepository;

    public OrderService(OrderRepository orderRepository) {
        this.orderRepository = orderRepository;
    }

    @Cacheable(value = "orders", key = "#orderId")
    public OrderDTO getOrder(String orderId) {
        return orderRepository.findById(orderId);
    }

    // Evict on update so the next read repopulates the cache with fresh data
    @CacheEvict(value = "orders", key = "#orderId")
    public void updateOrder(String orderId, OrderDTO dto) {
        orderRepository.save(dto);
    }
}

Option 3: Switch to Redis for distributed caching when the data volume is large or the cache must be shared across multiple instances.
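
A minimal sketch of Option 3 with Spring's Redis cache support, assuming spring-boot-starter-data-redis is on the classpath (cache configuration and TTL are illustrative):

import java.time.Duration;

import org.springframework.cache.CacheManager;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.cache.RedisCacheConfiguration;
import org.springframework.data.redis.cache.RedisCacheManager;
import org.springframework.data.redis.connection.RedisConnectionFactory;

@Configuration
@EnableCaching
public class RedisCacheConfig {
    @Bean
    public CacheManager cacheManager(RedisConnectionFactory connectionFactory) {
        // Every entry gets a TTL, and the cached data lives in Redis,
        // off the JVM heap and shared by all service instances.
        RedisCacheConfiguration config = RedisCacheConfiguration.defaultCacheConfig()
            .entryTtl(Duration.ofMinutes(5))
            .disableCachingNullValues();
        return RedisCacheManager.builder(connectionFactory)
            .cacheDefaults(config)
            .build();
    }
}

The size bound then moves from the application to Redis itself, which caps memory via its maxmemory setting and eviction policy.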

Step 8 – Post‑mortem Hardening

JVM flags added so that a heap dump and GC logs are guaranteed to be available next time:

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/data/logs/heap-dump/
-Xlog:gc*:file=/data/logs/gc.log:time,uptime:filecount=5,filesize=20m
-XX:+UseG1GC

Monitoring alerts (Prometheus + Grafana) to catch early signs:

# Old Gen usage > 80%
alert: JvmOldGenHigh
expr: jvm_memory_used_bytes{area="heap",id="G1 Old Gen"} / jvm_memory_max_bytes{area="heap",id="G1 Old Gen"} > 0.8
for: 5m
summary: "Old generation usage exceeds 80%, possible memory leak"

# Full GC frequency > 0.5 per minute
alert: JvmFullGCFrequent
expr: rate(jvm_gc_pause_seconds_count{action="end of major GC"}[5m]) * 60 > 0.5
for: 3m
summary: "Full GC frequency too high (>0.5 per minute, ~30 per hour)"

# GC time ratio > 10%
alert: JvmGCTimeHigh
expr: rate(jvm_gc_pause_seconds_sum[5m]) > 0.1
for: 5m
summary: "GC time >10% of wall-clock time, service latency may suffer"

The code‑review checklist was updated to forbid bare HashMap / ConcurrentHashMap instances used as caches and to require size limits and TTLs.

Takeaways

Never restart immediately; preserve the heap dump.

Enable persistent GC logs; they are essential for leak detection.

Static collections without eviction are among the most common sources of Java memory leaks.

MAT’s Leak Suspects Report quickly points to the offender; Dominator Tree and OQL give the exact path.

Arthas OGNL can read live state and apply hot‑patches, buying time for a proper code fix.

After applying the new cache implementation and monitoring rules, the service has not experienced another OOM in six months.
