Why Did My Container’s CPU Spike? A Step‑by‑Step JVM GC Debugging Walkthrough
When a production container suddenly hit 90% CPU, the author traced the issue to abnormal JVM garbage‑collection activity, captured the problematic GC graphs, and used Linux tools, jstack, and code inspection to pinpoint a faulty Excel export loop that caused excessive GC and full‑GC cycles.
Preface
The author received an alarm on a Friday afternoon: a container’s CPU usage jumped above 90%. Monitoring showed a pod with 61 Young GC events and one Full GC within two hours, a rare and severe situation that prompted a deep dive.
Scenario
A normal GC curve (illustrated in the first image) shows very few collections under typical load.
Problematic JVM Monitoring Curve
The second image reveals a burst of GC activity, including a Full GC, indicating a serious issue.
Only one pod exhibited this abnormal behavior, so the investigation focused on that pod.
Detailed Analysis
Enter the pod and run top to view process resource usage. In the screenshot, overall load looks low, yet the Java process (PID 1, as is typical for the main process in a container) consumes over 130% CPU. Since top counts each busy core as 100%, a figure above 100% on a multi-core node simply means the process is saturating more than one core.
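For readers who want the same number without leaving the JVM: a minimal sketch, assuming a standard HotSpot JDK where the com.sun.management extension of OperatingSystemMXBean is available (this is an illustration, not the author's tooling). Note the scale differs from top: this value is normalized across all cores, while top counts each busy core as 100%.

```java
import java.lang.management.ManagementFactory;

// Hedged sketch: sampling the JVM's own process CPU load from inside the
// container, roughly mirroring what top's %CPU column shows for PID 1.
public class ProcessCpu {
    public static void main(String[] args) throws InterruptedException {
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean)
                        ManagementFactory.getOperatingSystemMXBean();
        for (int i = 0; i < 5; i++) {
            // getProcessCpuLoad() returns 0.0-1.0 normalized over all cores
            // (or a negative value before the first sample is available),
            // unlike top, where >100% means more than one core is busy.
            System.out.printf("process cpu load: %.1f%%%n",
                    os.getProcessCpuLoad() * 100);
            Thread.sleep(1000);
        }
    }
}
```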
Run top -H -p <pid> to list threads and identify the one with the highest CPU usage. The thread IDs (TIDs) are shown in the next screenshot.
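If top -H is unavailable in a stripped-down image, per-thread CPU time can be approximated from inside the JVM instead. A minimal sketch using the standard ThreadMXBean API; note it reports JVM thread IDs, which are not the OS TIDs that top -H shows.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class HotThreads {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            System.err.println("Thread CPU time not supported on this JVM");
            return;
        }
        // Print total CPU time consumed so far by each live thread.
        // These IDs are JVM thread IDs, not the native TIDs from top -H.
        for (long id : mx.getAllThreadIds()) {
            ThreadInfo info = mx.getThreadInfo(id);
            long cpuNanos = mx.getThreadCpuTime(id);
            if (info != null && cpuNanos > 0) {
                System.out.printf("%-40s cpu=%d ms%n",
                        info.getThreadName(), cpuNanos / 1_000_000);
            }
        }
    }
}
```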
Convert the problematic TID (e.g., 746) to hexadecimal with printf "%x\n" 746, which prints 2ea. This matters because jstack labels each thread with its native thread ID (nid) in hex, so 2ea is the value to search for.
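The same conversion works anywhere; as a quick sanity check in Java:

```java
public class HexTid {
    public static void main(String[] args) {
        int tid = 746;  // decimal TID as reported by top -H
        // jstack will show this thread as nid=0x2ea
        System.out.println(Integer.toHexString(tid)); // prints: 2ea
    }
}
```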
Run jstack <pid> | grep 2ea > gc.stack to extract the offending thread's entry (hex ID 0x2ea) into a file. jstack captures a snapshot of all Java threads, and each thread header carries its native ID as nid=0x…, so filtering on the hex value isolates the problematic thread. (Note that a bare grep keeps only the matching header line; to retain the frames beneath it, either save the full dump and search it, or use grep -A with enough context lines.)
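When the jstack binary itself is missing from a slim container image, a thread dump can also be taken in-process. A hedged alternative sketch, not the author's method, using the standard Thread.getAllStackTraces() API (it reports JVM thread IDs, not the native nid values jstack prints):

```java
import java.util.Map;

// Hedged sketch: an in-process approximation of a jstack snapshot.
public class DumpThreads {
    public static void main(String[] args) {
        Map<Thread, StackTraceElement[]> dump = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> e : dump.entrySet()) {
            Thread t = e.getKey();
            // getId() is the JVM thread ID, not the OS-level nid.
            System.out.printf("\"%s\" id=%d state=%s%n",
                    t.getName(), t.getId(), t.getState());
            for (StackTraceElement frame : e.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```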
Because the container’s file system is limited, the author copied gc.stack to a jump‑host, served it with a throwaway HTTP server (python -m SimpleHTTPServer 8080 on Python 2; the Python 3 equivalent is python3 -m http.server 8080), and downloaded it locally with curl -O http://<ip>:8080/gcInfo.stack. (Capital -O saves the file under its remote name; lowercase -o expects an explicit output path, so the command as originally written would not work.) After retrieving the stack locally, the author searched for the hex ID (2ea) and located the corresponding stack frames.
The stack pointed to an asynchronous Excel export routine that called a common list‑query API. The API returns at most 200 records per page, but the export attempted to process tens of thousands of rows, causing massive object allocation.
The export method used nested loops and repeatedly created new ArrayList instances that stayed reachable until the method completed, triggering frequent Young GC and eventually a Full GC that affected other pods on the same node. A sketch of the pattern follows.
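The article does not quote the offending code, but the pattern it describes looks roughly like the sketch below; listQuery, Row, and the page counts are hypothetical stand-ins, not the author's identifiers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reconstruction of the problematic pattern, for illustration
// only; names and the listQuery API are assumptions, not the real code.
public class FaultyExport {

    record Row(String name, String value) {}

    // Stand-in for the shared list-query API, capped at 200 records per page.
    static List<Row> listQuery(int page, int pageSize) {
        List<Row> batch = new ArrayList<>(pageSize);
        for (int i = 0; i < pageSize; i++) {
            batch.add(new Row("row-" + page + "-" + i, "value"));
        }
        return batch;
    }

    public static void main(String[] args) {
        int totalPages = 500; // ~100,000 rows in total
        List<List<String>> allRows = new ArrayList<>();
        for (int page = 1; page <= totalPages; page++) {   // outer loop: pages
            for (Row r : listQuery(page, 200)) {           // inner loop: records
                List<String> cells = new ArrayList<>();    // fresh list per row...
                cells.add(r.name());
                cells.add(r.value());
                allRows.add(cells);                        // ...kept reachable to the end
            }
        }
        // Every row stays live until the workbook is written, so the heap
        // fills up, Young GC runs back to back, and a Full GC eventually hits.
        System.out.println("rows buffered in memory: " + allRows.size());
    }
}
```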
The code was fixed by limiting the batch size and reworking the list handling so rows were no longer accumulated for the entire export; after the change was hot‑deployed, the CPU spike disappeared. A sketch of that shape of fix follows.
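The patch itself is not shown in the article; below is a hedged sketch of the usual shape of such a fix, processing each 200-row page and releasing it before fetching the next so the live set stays bounded. The listQuery stand-in and the streaming-writer callback are assumptions, not the author's actual change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hedged sketch of the remedy: hand each page to a streaming writer and let
// the batch become garbage before the next page is fetched.
public class BatchedExport {

    static final int PAGE_SIZE = 200; // the API's documented per-page cap

    static List<String> listQuery(int page, int pageSize) {
        List<String> batch = new ArrayList<>(pageSize);
        for (int i = 0; i < pageSize; i++) batch.add("row-" + page + "-" + i);
        return batch;
    }

    static void export(int totalPages, Consumer<List<String>> rowWriter) {
        for (int page = 1; page <= totalPages; page++) {
            List<String> batch = listQuery(page, PAGE_SIZE);
            rowWriter.accept(batch);
            // batch is unreachable past this point, so Young GC reclaims it
            // cheaply; the live set stays O(PAGE_SIZE), not O(total rows).
        }
    }

    public static void main(String[] args) {
        export(500, batch -> {
            // In a real service this would append rows to a streaming Excel
            // writer (e.g. POI's SXSSFWorkbook) instead of printing.
            System.out.println("wrote " + batch.size() + " rows");
        });
    }
}
```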
Conclusion
When facing production incidents, first ensure service availability, then iteratively analyze logs, metrics, and thread dumps to isolate the root cause; tools like top, jstack, and curl are invaluable for such investigations.