
Why Did Our Java Service Crash with OOM? A Deep Dive into Root Causes and Fixes

An online service suffered severe latency from long GAP times and repeated OutOfMemoryErrors. By analyzing monitoring data, JVM heap dumps, and SQL queries, the team traced the problem to an oversized userId array that produced a count query holding more than 1 GB of memory, then added request-size limits and JVM flags to prevent a recurrence.

Su San Talks Tech

Mar 10, 2023


Symptoms

Endpoints of the online service became extremely slow. Monitoring showed a large GAP time (the delay between a request reaching the service and its processing starting) even though the actual processing time was short, and many requests were affected.

Root Cause Analysis

Monitoring indicated that requests reached the service but waited about 3 seconds before processing. CPU spikes and frequent, long GC events coincided with the slow periods, and the pod was eventually killed due to a full heap.
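To make the "frequent, long GC" signal concrete, here is a minimal sketch of how cumulative GC activity can be sampled from inside the JVM with the standard `java.lang.management` API. The class name `GcSnapshot` and the one-second sampling interval are assumptions for illustration; the article does not say which monitoring tool the team used.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcSnapshot {

    /** Sums cumulative GC time in milliseconds across all collectors. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime(); // -1 when the collector does not report it
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    /** Sums the cumulative number of collections across all collectors. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long timeBefore = totalGcTimeMillis();
        long countBefore = totalGcCount();
        Thread.sleep(1000); // hypothetical sampling interval
        System.out.println("GC runs in interval: " + (totalGcCount() - countBefore));
        System.out.println("GC time in interval: " + (totalGcTimeMillis() - timeBefore) + " ms");
    }
}
```

If the GC time per interval approaches the length of the interval itself, the JVM is spending most of its time collecting, which would be consistent with requests waiting seconds before they are processed.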

Logs showed an OOM error, but the stack trace did not reveal the root cause:

system error: org.springframework.web.util.NestedServletException: Handler dispatch failed; nested exception is java.lang.OutOfMemoryError: Java heap space
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1055)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:943)
    ...

A large batch job was running at the time, but its code showed no obvious issue.

The following JVM parameters were added to capture a heap dump, but the container killed the pod before the dump could be saved:

-XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=/logs/oom_dump/xxx.log -XX:HeapDumpPath=/logs/oom_dump/xxx.hprof

Further investigation revealed two OOM events. So that the dump files would survive the pod being killed, an EFS volume was mounted to store them.

Analyzing the 4.8 GB heap dump with jvisualvm identified the offending thread and a massive count SQL query that allocated over 1 GB of memory.

The dump contained a 1.07 GB byte array and a 1.03 GB char array, both generated by the count statement.

The userId array passed to the service was 64 MB, originating from an external system that mistakenly sent all user IDs in a single request.

Solution

The upstream system was fixed to limit the number of userId values sent. Additionally, the service added its own guard to restrict the size of incoming userId collections.
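A service-side guard of this kind can be very small. The sketch below is one possible shape, not the team's actual code: the class name `UserIdGuard`, the method `requireWithinLimit`, and the cap of 1,000 are all hypothetical, since the article does not state the limit that was chosen.

```java
import java.util.Collection;

final class UserIdGuard {

    // Hypothetical cap; the article does not say which limit the team picked.
    static final int MAX_USER_IDS = 1_000;

    /**
     * Returns the collection unchanged when it is within the limit,
     * otherwise rejects the request before any SQL is built.
     */
    static <T> Collection<T> requireWithinLimit(Collection<T> userIds, int maxSize) {
        if (userIds == null) {
            throw new IllegalArgumentException("userIds must not be null");
        }
        if (userIds.size() > maxSize) {
            throw new IllegalArgumentException(
                "userIds has " + userIds.size() + " entries; limit is " + maxSize);
        }
        return userIds;
    }
}
```

Calling `UserIdGuard.requireWithinLimit(request.getUserIds(), UserIdGuard.MAX_USER_IDS)` at the top of the handler (where `request.getUserIds()` stands in for whatever accessor the real request type has) fails fast with a clear message instead of letting an oversized list reach the query layer.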

Additional Note

A similar OOM incident occurred later, triggered by full‑table queries without WHERE clauses. Heap dumps (up to 12 GB) revealed huge String objects. The root cause was a TiDB query that loaded the entire user table into memory.

Slow‑query logs from TiDB confirmed the problematic query.

Summary

When facing OOM issues without obvious code bugs, the following JVM options are valuable, especially in containerized environments:

-XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=/logs/oom_dump/xxx.log -XX:HeapDumpPath=/logs/oom_dump/xxx.hprof
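Because a missing flag is usually discovered only after the next crash, it can help to verify at startup that the options actually took effect. The following is a minimal sketch that assumes a HotSpot-based JVM, where `com.sun.management.HotSpotDiagnosticMXBean` is available; the class name `OomFlagCheck` is hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

final class OomFlagCheck {

    /** Reads the current value of a HotSpot VM option as a string. */
    static String vmOption(String name) {
        HotSpotDiagnosticMXBean bean =
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Throws IllegalArgumentException if the option does not exist on this JVM.
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        System.out.println("HeapDumpOnOutOfMemoryError = " + vmOption("HeapDumpOnOutOfMemoryError"));
        System.out.println("HeapDumpPath = " + vmOption("HeapDumpPath"));
    }
}
```

Logging these two values when the service boots makes it easy to confirm, from the pod's logs alone, that heap dumps are enabled and pointed at the mounted volume.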

Additionally, enable the JVM to exit on OOM so that Kubernetes can quickly restart a fresh instance:

-XX:+ExitOnOutOfMemoryError

For SQL statements lacking a WHERE clause, enforce a sensible LIMIT to prevent full‑table scans from exhausting memory.
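One way to apply the LIMIT rule is a small safety net in the data-access layer. The sketch below is a deliberately naive, string-based illustration: the class `LimitGuard`, the method `withDefaultLimit`, and the keyword matching are all assumptions, and a production version would more likely rely on a SQL parser or the ORM's own hooks, since keyword matching can be fooled by string literals or subqueries.

```java
import java.util.Locale;
import java.util.regex.Pattern;

final class LimitGuard {

    private static final Pattern HAS_WHERE = Pattern.compile("\\bwhere\\b");
    private static final Pattern HAS_LIMIT = Pattern.compile("\\blimit\\b");

    /**
     * Appends "LIMIT maxRows" to a SELECT that has neither a WHERE nor a LIMIT keyword.
     * Any other statement is returned unchanged.
     */
    static String withDefaultLimit(String sql, int maxRows) {
        String trimmed = sql.trim();
        if (trimmed.endsWith(";")) {
            // Drop a trailing semicolon so the LIMIT clause can be appended.
            trimmed = trimmed.substring(0, trimmed.length() - 1).trim();
        }
        String lower = trimmed.toLowerCase(Locale.ROOT);
        boolean unbounded = lower.startsWith("select")
            && !HAS_WHERE.matcher(lower).find()
            && !HAS_LIMIT.matcher(lower).find();
        return unbounded ? trimmed + " LIMIT " + maxRows : sql;
    }
}
```

With this sketch, `withDefaultLimit("SELECT * FROM user", 1000)` yields `SELECT * FROM user LIMIT 1000`, while a statement that already has a WHERE or LIMIT clause passes through untouched.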



Written by

Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
