Analyzing and Resolving Unexpected JVM GC Spikes in Production

This article recounts a production incident in which a container's CPU usage spiked because of abnormal JVM garbage collection, walks through the diagnosis step by step using top, jstack, and thread analysis, and traces the root cause to an Excel export routine.

Top Architect

Preface

On a Friday, while the author was writing documentation, an alarm fired: CPU usage had exceeded 90%. JVM monitoring showed that one pod had performed 61 Young GCs and one Full GC within two hours, a rare and severe pattern.

Scenario

The original article contrasts a normal GC curve with the problematic one (both shown as charts in the source), highlighting the abnormal frequency of GC events.

Detailed Analysis

The problem was isolated to a single pod among many. The troubleshooting steps were:

Enter the pod and run top to view process resource usage.

Identify the Java process (PID 1) as the one consuming the most CPU (130%; top reports usage per core, so values above 100% are possible on a multi-core host).

Run top -H -p <pid> to list threads and find the thread ID (TID) with the highest CPU.

Convert the TID (e.g., 746) to hexadecimal using printf "%x\n" 746, which prints 2ea.

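Dump the thread stacks to a file with jstack <pid> > gc.stack. Piping through a plain grep, as in jstack <pid> | grep 2ea > gc.stack, would keep only that thread's header line and drop its stack frames.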

Download the generated gc.stack file via a temporary Python HTTP server and curl for local analysis.

Search the stack file for the hexadecimal thread ID (2ea), which jstack typically prints as nid=0x2ea, to locate the offending code.
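
The lookup in the steps above can be sketched as two small shell helpers. This is a minimal illustration rather than the author's exact commands: the function names tid_to_hex and extract_thread are made up for this example, and the code assumes a jstack dump in which each thread header contains nid=0x<hex> and thread stanzas are separated by blank lines.

```shell
#!/usr/bin/env bash
# Sketch of the thread-to-stack lookup; function names are illustrative.

# Convert a decimal thread ID from `top -H` into the hex form used in nid=0x...
tid_to_hex() {
  printf '%x\n' "$1"
}

# Print the full stanza (header line plus stack frames) for one native
# thread ID from a saved jstack dump. Assumes blank lines separate stanzas.
extract_thread() {
  local dump_file="$1" hex_tid="$2"
  # The trailing space in the pattern avoids matching longer IDs (e.g. 0x2eab).
  awk -v pat="nid=0x${hex_tid} " '
    index($0, pat)   { printing = 1 }  # start at the matching thread header
    printing && /^$/ { exit }          # stop at the blank line ending the stanza
    printing         { print }
  ' "$dump_file"
}

# Typical use inside the pod (PID 1 and TID 746 are the values from the incident):
#   jstack 1 > gc.stack
#   extract_thread gc.stack "$(tid_to_hex 746)"
#
# To copy the dump out, as the article describes, one option is:
#   python3 -m http.server 8000                 # in the pod, from the dump's directory
#   curl -O http://<pod-ip>:8000/gc.stack       # on the local machine
```

Extracting the whole stanza rather than a single matching line keeps the stack frames, which are what actually point at the offending code.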

The investigation traced the problem to an asynchronous Excel export feature. It reused a common list-query interface capped at 200 items per page, while an export needed tens of thousands of records, so each export required many paged queries. The nested loops and repeated List allocations generated heavy GC activity, which eventually affected other pods as well.
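
To get a rough sense of scale, the number of paged queries for one export can be estimated with ceiling division. The record count of 50000 below is a hypothetical figure chosen only for illustration (the article says "tens of thousands"); the page size of 200 is the limit from the incident. The helper name pages_needed is likewise invented for this sketch.

```shell
#!/usr/bin/env bash
# Number of page requests needed to read `total` records at `page_size`
# records per page (integer ceiling division).
pages_needed() {
  local total="$1" page_size="$2"
  echo $(( (total + page_size - 1) / page_size ))
}

# 50000 is a hypothetical export size; 200 is the page limit from the incident.
pages_needed 50000 200   # prints 250
```

Under that assumption, a single export issues 250 list queries, each producing a fresh batch of objects for the collector to reclaim.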

The fix involved refactoring the export logic to avoid excessive object creation and redeploying the corrected code, which resolved the CPU and GC spikes.
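
One way to check that GC frequency is back to normal after a redeploy is to sample the JVM's cumulative GC counters with jstat, a standard JDK tool. The article does not say this tool was used, so treat the following as a sketch under that assumption; gc_counts is a hypothetical helper that reads jstat -gcutil output and reports the YGC and FGC columns, looked up by header name because the column set differs between JDK versions.

```shell
#!/usr/bin/env bash
# Sketch: pull the cumulative Young GC and Full GC counts out of
# `jstat -gcutil` output read from standard input.
gc_counts() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }  # map header name -> column index
    NF      { ygc = $(col["YGC"]); fgc = $(col["FGC"]) }     # keep the latest sample
    END     { print "YGC=" ygc, "FGC=" fgc }
  '
}

# Typical use inside the pod, assuming the JDK tools are on the PATH
# and the JVM is PID 1:
#   jstat -gcutil 1 | gc_counts
```

Running it twice some minutes apart and comparing the counts gives a quick GC rate, which can be set against the 61 Young GCs in two hours seen during the incident.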

Conclusion

When a production problem occurs, first make sure the service stays available, then work methodically through the limited information at hand, layer by layer, to pinpoint the root cause. Familiarity with tools such as Arthas can simplify troubleshooting further.


