Why Did My Container’s CPU Spike? A Step‑by‑Step JVM GC Debugging Walkthrough
When a production container suddenly hit 90% CPU, the author traced the issue to abnormal JVM garbage‑collection activity, captured the problematic GC graphs, and used Linux tools, jstack, and code inspection to pinpoint a faulty Excel export loop that caused excessive GC and full‑GC cycles.
Preface
The author received an alarm on a Friday afternoon: a container’s CPU usage jumped above 90%. Monitoring showed a pod with 61 Young GC events and one Full GC within two hours, a rare and severe situation that prompted a deep dive.
Scenario
A normal GC curve (illustrated in the first image) shows very few collections under typical load.
Problematic JVM Monitoring Curve
The second image reveals a burst of GC activity, including a Full GC, indicating a serious issue.
Only one pod exhibited this abnormal behavior, so the investigation focused on that pod.
Detailed Analysis
Enter the pod and run top to view process resource usage. In the screenshot, overall load looks low, yet the Java process (PID 1, as is typical for the main process in a container) consumes over 130% CPU. Since top counts each busy core as 100%, a figure above 100% on a multi-core node simply means the process is saturating more than one core.
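For readers who want the same number without leaving the JVM: a minimal sketch, assuming a standard HotSpot JDK where the com.sun.management extension of OperatingSystemMXBean is available (this is an illustration, not the author's tooling). Note the scale differs from top: this value is normalized across all cores, while top counts each busy core as 100%.

```java
import java.lang.management.ManagementFactory;

// Hedged sketch: sampling the JVM's own process CPU load from inside the
// container, roughly mirroring what top's %CPU column shows for PID 1.
public class ProcessCpu {
    public static void main(String[] args) throws InterruptedException {
        com.sun.management.OperatingSystemMXBean os =
                (com.sun.management.OperatingSystemMXBean)
                        ManagementFactory.getOperatingSystemMXBean();
        for (int i = 0; i < 5; i++) {
            // getProcessCpuLoad() returns 0.0-1.0 normalized over all cores
            // (or a negative value before the first sample is available),
            // unlike top, where >100% means more than one core is busy.
            System.out.printf("process cpu load: %.1f%%%n",
                    os.getProcessCpuLoad() * 100);
            Thread.sleep(1000);
        }
    }
}
```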
Run top -H -p <pid> to list threads and identify the one with the highest CPU usage. The thread IDs (TIDs) are shown in the next screenshot.
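If top -H is unavailable in a stripped-down image, per-thread CPU time can be approximated from inside the JVM instead. A minimal sketch using the standard ThreadMXBean API; note it reports JVM thread IDs, which are not the OS TIDs that top -H shows.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class HotThreads {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            System.err.println("Thread CPU time not supported on this JVM");
            return;
        }
        // Print total CPU time consumed so far by each live thread.
        // These IDs are JVM thread IDs, not the native TIDs from top -H.
        for (long id : mx.getAllThreadIds()) {
            ThreadInfo info = mx.getThreadInfo(id);
            long cpuNanos = mx.getThreadCpuTime(id);
            if (info != null && cpuNanos > 0) {
                System.out.printf("%-40s cpu=%d ms%n",
                        info.getThreadName(), cpuNanos / 1_000_000);
            }
        }
    }
}
```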
Convert the problematic TID (e.g., 746) to hexadecimal with printf "%x\n" 746, which prints 2ea. This matters because jstack labels each thread with its native thread ID (nid) in hex, so 2ea is the value to search for.
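The same conversion works anywhere; as a quick sanity check in Java:

```java
public class HexTid {
    public static void main(String[] args) {
        int tid = 746;  // decimal TID as reported by top -H
        // jstack will show this thread as nid=0x2ea
        System.out.println(Integer.toHexString(tid)); // prints: 2ea
    }
}
```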
Run jstack <pid> | grep 2ea > gc.stack to extract the offending thread's entry (hex ID 0x2ea) into a file. jstack captures a snapshot of all Java threads, and each thread header carries its native ID as nid=0x…, so filtering on the hex value isolates the problematic thread. (Note that a bare grep keeps only the matching header line; to retain the frames beneath it, either save the full dump and search it, or use grep -A with enough context lines.)
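When the jstack binary itself is missing from a slim container image, a thread dump can also be taken in-process. A hedged alternative sketch, not the author's method, using the standard Thread.getAllStackTraces() API (it reports JVM thread IDs, not the native nid values jstack prints):

```java
import java.util.Map;

// Hedged sketch: an in-process approximation of a jstack snapshot.
public class DumpThreads {
    public static void main(String[] args) {
        Map<Thread, StackTraceElement[]> dump = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> e : dump.entrySet()) {
            Thread t = e.getKey();
            // getId() is the JVM thread ID, not the OS-level nid.
            System.out.printf("\"%s\" id=%d state=%s%n",
                    t.getName(), t.getId(), t.getState());
            for (StackTraceElement frame : e.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```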
Because the container’s file system is limited, the author copied gc.stack to a jump‑host, served it with a throwaway HTTP server (python -m SimpleHTTPServer 8080 on Python 2; the Python 3 equivalent is python3 -m http.server 8080), and downloaded it locally with curl -O http://<ip>:8080/gcInfo.stack. (Capital -O saves the file under its remote name; lowercase -o expects an explicit output path, so the command as originally written would not work.) After retrieving the stack locally, the author searched for the hex ID (2ea) and located the corresponding stack frames.
The stack pointed to an asynchronous Excel export routine that called a common list‑query API. The API returns at most 200 records per page, but the export attempted to process tens of thousands of rows, causing massive object allocation.
The export method used nested loops and repeatedly created new ArrayList instances that stayed reachable until the method completed, triggering frequent Young GC and eventually a Full GC that affected other pods on the same node. A sketch of the pattern follows.
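The article does not quote the offending code, but the pattern it describes looks roughly like the sketch below; listQuery, Row, and the page counts are hypothetical stand-ins, not the author's identifiers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reconstruction of the problematic pattern, for illustration
// only; names and the listQuery API are assumptions, not the real code.
public class FaultyExport {

    record Row(String name, String value) {}

    // Stand-in for the shared list-query API, capped at 200 records per page.
    static List<Row> listQuery(int page, int pageSize) {
        List<Row> batch = new ArrayList<>(pageSize);
        for (int i = 0; i < pageSize; i++) {
            batch.add(new Row("row-" + page + "-" + i, "value"));
        }
        return batch;
    }

    public static void main(String[] args) {
        int totalPages = 500; // ~100,000 rows in total
        List<List<String>> allRows = new ArrayList<>();
        for (int page = 1; page <= totalPages; page++) {   // outer loop: pages
            for (Row r : listQuery(page, 200)) {           // inner loop: records
                List<String> cells = new ArrayList<>();    // fresh list per row...
                cells.add(r.name());
                cells.add(r.value());
                allRows.add(cells);                        // ...kept reachable to the end
            }
        }
        // Every row stays live until the workbook is written, so the heap
        // fills up, Young GC runs back to back, and a Full GC eventually hits.
        System.out.println("rows buffered in memory: " + allRows.size());
    }
}
```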
The code was fixed by limiting the batch size and reworking the list handling so rows were no longer accumulated for the entire export; after the change was hot‑deployed, the CPU spike disappeared. A sketch of that shape of fix follows.
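The patch itself is not shown in the article; below is a hedged sketch of the usual shape of such a fix, processing each 200-row page and releasing it before fetching the next so the live set stays bounded. The listQuery stand-in and the streaming-writer callback are assumptions, not the author's actual change.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hedged sketch of the remedy: hand each page to a streaming writer and let
// the batch become garbage before the next page is fetched.
public class BatchedExport {

    static final int PAGE_SIZE = 200; // the API's documented per-page cap

    static List<String> listQuery(int page, int pageSize) {
        List<String> batch = new ArrayList<>(pageSize);
        for (int i = 0; i < pageSize; i++) batch.add("row-" + page + "-" + i);
        return batch;
    }

    static void export(int totalPages, Consumer<List<String>> rowWriter) {
        for (int page = 1; page <= totalPages; page++) {
            List<String> batch = listQuery(page, PAGE_SIZE);
            rowWriter.accept(batch);
            // batch is unreachable past this point, so Young GC reclaims it
            // cheaply; the live set stays O(PAGE_SIZE), not O(total rows).
        }
    }

    public static void main(String[] args) {
        export(500, batch -> {
            // In a real service this would append rows to a streaming Excel
            // writer (e.g. POI's SXSSFWorkbook) instead of printing.
            System.out.println("wrote " + batch.size() + " rows");
        });
    }
}
```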
Conclusion
When facing production incidents, first ensure service availability, then iteratively analyze logs, metrics, and thread dumps to isolate the root cause; tools like top, jstack, and curl are invaluable for such investigations.