Why Did Our Java Service Crash with OOM? A Deep Dive into Root Causes and Fixes
An online service suffered severe latency caused by large GAP times (requests waiting before processing began) and repeatedly hit OutOfMemoryError. By analyzing monitoring data, JVM heap dumps, and SQL queries, the team traced the problem to an oversized userId array that made a count query allocate over 1 GB of memory, then added request limits and JVM flags to prevent recurrence.
Phenomenon
Online service endpoints became extremely slow; monitoring showed a large GAP time even though the actual request processing time was short, and many such requests occurred.
Root Cause Analysis
Monitoring indicated that requests reached the service but waited about 3 seconds before processing. CPU spikes and frequent, long GC events coincided with the slow periods, and the pod was eventually killed due to a full heap.
Logs showed an OOM error, but the stack trace did not reveal the root cause:
system error: org.springframework.web.util.NestedServletException: Handler dispatch failed; nested exception is java.lang.OutOfMemoryError: Java heap space
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:1055)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:943)
...
A large batch job was running at the time, but its code showed no obvious issue.
Even after adding JVM parameters for heap dumps, the container killed the pod before the dump could be saved.
-XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=/logs/oom_dump/xxx.log -XX:HeapDumpPath=/logs/oom_dump/xxx.hprof
Further investigation revealed two OOM events; an EFS volume was mounted to capture dump files.
Analyzing the 4.8 GB heap dump with jvisualvm identified the offending thread and a massive count SQL query that allocated over 1 GB of memory.
The query operated on a byte array of 1.07 GB and a char array of 1.03 GB, both generated by a count statement.
The userId array passed to the service was 64 MB, originating from an external system that mistakenly sent all user IDs in a single request.
Solution
The upstream system was fixed to limit the number of userId values sent. Additionally, the service added its own guard to restrict the size of incoming userId collections.
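A service-side guard of this kind could look like the sketch below. It is only an illustration under stated assumptions: the class name `UserIdGuard`, the `MAX_USER_IDS` value, and the use of `List<Long>` for the IDs are hypothetical and do not come from the team's actual code.

```java
import java.util.List;

/** Hypothetical guard; the name and the limit are assumptions for illustration. */
final class UserIdGuard {
    // Assumed ceiling on userIds per request; tune it to what the query can handle.
    static final int MAX_USER_IDS = 1_000;

    private UserIdGuard() {}

    /** Rejects a null or oversized collection before any SQL is built from it. */
    static void requireWithinLimit(List<Long> userIds) {
        if (userIds == null) {
            throw new IllegalArgumentException("userIds must not be null");
        }
        if (userIds.size() > MAX_USER_IDS) {
            throw new IllegalArgumentException(
                "too many userIds: " + userIds.size() + " > " + MAX_USER_IDS);
        }
    }
}
```

Calling `UserIdGuard.requireWithinLimit(userIds)` at the start of the endpoint fails the request early, so an oversized payload is rejected before it can be expanded into a huge SQL statement.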
Additional Note
A similar OOM incident occurred later, triggered by full‑table queries without WHERE clauses. Heap dumps (up to 12 GB) revealed huge String objects. The root cause was a TiDB query that loaded the entire user table into memory.
Slow‑query logs from TiDB confirmed the problematic query.
Summary
When facing OOM issues without obvious code bugs, the following JVM options are valuable, especially in containerized environments:
-XX:+HeapDumpOnOutOfMemoryError -XX:ErrorFile=/logs/oom_dump/xxx.log -XX:HeapDumpPath=/logs/oom_dump/xxx.hprof
Additionally, enable the JVM to exit on OOM so that Kubernetes can quickly restart a fresh instance:
-XX:+ExitOnOutOfMemoryError
For SQL statements lacking a WHERE clause, enforce a sensible LIMIT to prevent full‑table scans from exhausting memory.
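One way to picture the LIMIT rule is the deliberately naive sketch below. It assumes the SQL text is available as a string before execution; the class `QueryLimitGuard` and the method `capUnfiltered` are hypothetical names, and the keyword matching is a plain regex check rather than a real SQL parser, so a keyword inside a string literal or subquery would fool it.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; a naive text check, not a SQL parser. */
final class QueryLimitGuard {
    private static final Pattern HAS_WHERE =
        Pattern.compile("\\bWHERE\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern HAS_LIMIT =
        Pattern.compile("\\bLIMIT\\b", Pattern.CASE_INSENSITIVE);

    private QueryLimitGuard() {}

    /**
     * Appends "LIMIT maxRows" to a SELECT that has neither WHERE nor LIMIT.
     * Other statements are returned unchanged apart from trimming and
     * dropping a trailing semicolon.
     */
    static String capUnfiltered(String sql, int maxRows) {
        if (maxRows <= 0) {
            throw new IllegalArgumentException("maxRows must be positive");
        }
        String trimmed = sql.trim();
        if (trimmed.endsWith(";")) {
            trimmed = trimmed.substring(0, trimmed.length() - 1).trim();
        }
        boolean isSelect = trimmed.regionMatches(true, 0, "SELECT", 0, 6);
        boolean filtered = HAS_WHERE.matcher(trimmed).find();
        boolean limited = HAS_LIMIT.matcher(trimmed).find();
        if (!isSelect || filtered || limited) {
            return trimmed;
        }
        return trimmed + " LIMIT " + maxRows;
    }
}
```

In a real service the cap is probably better enforced closer to the driver, for example with JDBC's `Statement.setMaxRows(int)`, or in the data-access layer's query builder; the string version here only shows the rule itself.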
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
