
How I Traced a Sudden CPU Spike to JVM GC Issues in a Container

After an alarm warned that a production container’s CPU usage had surged past 90%, I dug into the JVM metrics, found excessive young and full GC activity in a single pod, and worked through the troubleshooting steps (top, thread analysis, jstack, and a code fix) that resolved the issue.


Background

On Friday, while I was writing documentation, an alarm fired indicating that a container’s CPU usage had exceeded 90%. Monitoring showed a single pod generating 61 young GCs and one full GC within two hours, a rare and serious situation.

Normal JVM monitoring curve: typical metrics show minimal GC activity.

Problematic JVM monitoring curve: the affected pod shows frequent GC events, including a full GC.

Detailed Analysis

Enter the pod and run top to view Linux process resource usage.

Identify the Java process (PID 1 inside the container) as the main CPU consumer, at around 130% because the host has multiple cores.

Run top -H -p <pid> to find the thread ID (tid) with the highest CPU usage.

Convert the tid (e.g., 746) to hexadecimal with printf "%x\n" 746, because jstack reports native thread IDs (nid) in hexadecimal.
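
For readers who prefer to double-check the conversion from the Java side, the shell command above is equivalent to the following minimal snippet (the class name TidToHex is just for illustration):

```java
// Decimal-to-hex conversion, equivalent to `printf "%x\n" 746` in the shell.
public class TidToHex {
    public static void main(String[] args) {
        int tid = 746;                              // thread ID reported by `top -H`
        String hex = Integer.toHexString(tid);      // -> "2ea"
        System.out.println("nid=0x" + hex);         // jstack prints the native thread ID as nid=0x2ea
    }
}
```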

Execute jstack <pid> | grep <hex_tid> > gc.stack to extract the stack trace of the offending thread.

Download the gc.stack file to a local machine for easier inspection, for example by starting a temporary Python HTTP server inside the pod and fetching the file with curl.

Search the stack trace for the thread whose nid matches the hex tid (e.g., 0x2ea), read its frames to find the busy method, and locate the corresponding implementation in the source code.

Discover that the Excel export feature reuses a common list-query API that paginates at 200 records per batch, while an export may request tens of thousands of records; the export therefore loops over hundreds of paginated queries with nested per-row processing and creates a large number of short-lived objects.
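
To make the pattern concrete, here is a minimal sketch of the kind of export loop described above. All names (ListQueryService, queryPage, ExportRow, and so on) are hypothetical; the article does not show the actual code, only that a 200-records-per-page list API was reused for a bulk export:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the problematic export path: the Excel export reuses
// the shared list-query API (200 records per page), so a large export issues
// hundreds of paginated queries and keeps every converted row alive until the
// file is written -- an allocation pattern that drives frequent young GCs.
public class ExcelExportBefore {

    static final int PAGE_SIZE = 200;

    List<ExportRow> loadAllRows(ListQueryService service, QueryFilter filter, int totalRecords) {
        List<ExportRow> allRows = new ArrayList<>();
        int pages = (totalRecords + PAGE_SIZE - 1) / PAGE_SIZE;  // tens of thousands of records -> hundreds of pages
        for (int page = 1; page <= pages; page++) {
            List<ExportRow> batch = service.queryPage(filter, page, PAGE_SIZE);
            for (ExportRow row : batch) {
                allRows.add(convertForExcel(row));               // per-row object creation inside a nested loop
            }
        }
        return allRows;                                          // whole result set held in memory at once
    }

    ExportRow convertForExcel(ExportRow row) {
        return row;                                              // placeholder for the real row-to-cell mapping
    }
}

// Hypothetical collaborators, included only so the sketch compiles.
interface ListQueryService {
    List<ExportRow> queryPage(QueryFilter filter, int page, int pageSize);
}
class QueryFilter { }
class ExportRow { }
```

With tens of thousands of records, this shape issues hundreds of queries and keeps every converted row alive until the file is written, which matches the frequent young GCs (and eventual full GC) observed on the pod.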

Fix the code so that the export no longer goes through the shared list-query path, optimize the export logic, and redeploy; the GC spikes disappear.
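
The article does not include the fixed code, but one common way to remove this pattern is to give the export its own batched path and write each batch out before fetching the next, so no batch outlives a single loop iteration. A sketch under that assumption, again with hypothetical names:

```java
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of a fix: the export gets a dedicated batched query and
// streams each batch straight to the output writer instead of accumulating
// everything in one list, so each batch becomes garbage as soon as it is written.
public class ExcelExportAfter {

    static final int BATCH_SIZE = 2000;   // assumption: a larger, export-specific batch size

    void export(ExportQueryService service, QueryFilter filter, Consumer<List<ExportRow>> sheetWriter) {
        long lastId = 0;                   // keyset-style pagination instead of page numbers (assumption)
        while (true) {
            List<ExportRow> batch = service.queryAfterId(filter, lastId, BATCH_SIZE);
            if (batch.isEmpty()) {
                break;
            }
            sheetWriter.accept(batch);                     // write the batch out immediately
            lastId = batch.get(batch.size() - 1).getId();  // advance the cursor; the batch can now be collected
        }
    }
}

// Hypothetical collaborators, redeclared here so the sketch is self-contained.
interface ExportQueryService {
    List<ExportRow> queryAfterId(QueryFilter filter, long lastId, int batchSize);
}
class QueryFilter { }
class ExportRow {
    private long id;
    long getId() { return id; }
}
```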

Conclusion

The incident was frightening at first, but systematic troubleshooting—checking service availability, isolating the affected pod, and analyzing JVM metrics—allowed a quick resolution. The experience reinforced the importance of staying calm, investigating thoroughly, and promptly fixing performance‑critical code.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Java · JVM · Kubernetes · GC · Performance debugging · CPU spike
Written by

Java Backend Technology

Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!
