
Diagnosing High CPU and Frequent GC in a Java Container: A Step‑by‑Step Analysis

When a production container's CPU suddenly climbed above 90% amid excessive JVM garbage collection, the author entered the pod, used top and top -H to locate the offending thread, extracted its stack with jstack, downloaded the dump through a simple HTTP server, and traced the problem to an Excel export routine that allocated objects on a massive scale. Fixing the code restored stability.

Architecture Digest

On a Friday, while the author was writing documentation, an alert reported that a production container's CPU usage had spiked above 90%. JVM monitoring showed 61 Young GC events and one Full GC within two hours, a rare and serious pattern for this service that prompted an immediate investigation.

On a healthy container, the JVM monitoring curves show only infrequent GC. The problematic container instead showed a flood of GC activity, which stood out clearly in the monitoring charts included with the original article.
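If no dashboard is at hand, the same GC counters can usually be read inside the pod with the JDK's jstat tool. The following is a minimal sketch, not the author's method: it assumes jstat is available in the container image, and gc_counts is a hypothetical helper name. It looks the YGC and FGC columns up by header name because the set of columns differs between JDK versions.

```shell
#!/bin/sh
# Hypothetical helper (not from the original article).
# Reads `jstat -gcutil` output on stdin and prints "YGC FGC"
# (cumulative Young GC and Full GC counts) from the last sample.
gc_counts() {
    awk 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }  # map header names to column numbers
         NF > 0  { ygc = $(col["YGC"]); fgc = $(col["FGC"]) }      # remember the latest sample
         END     { print ygc, fgc }'
}

# Example against a live JVM (assumes PID 1, as in the article):
#   jstat -gcutil 1 1000 5 | gc_counts
```

Running it twice a few minutes apart and comparing the counts gives a rough GC rate without any external monitoring.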

The investigation began by entering the affected pod and running top to view per-process resource usage. The Java process (PID 1) was consuming 130% CPU. Because top reports usage relative to a single core, that means more than one full core on the multi‑core node, and it confirmed that the Java application was the culprit.

Next, top -H -p <pid> was used to list the process's threads and identify the one with the highest CPU usage. Its thread ID (tid), 746, was converted to hexadecimal with printf "%x\n" 746, which prints 2ea. jstack reports native thread IDs in hexadecimal, so this thread appears in the stack dump as nid=0x2ea.
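The thread hunt can also be scripted. This is a minimal sketch under stated assumptions: the container ships a procps-style ps that supports -L, and busiest_tid and tid_to_nid are hypothetical helper names that do not appear in the original article.

```shell
#!/bin/sh
# Hypothetical helpers (not from the original article).

# Read "TID %CPU" pairs on stdin and print the TID with the highest %CPU.
busiest_tid() {
    awk 'NR == 1 || $2 + 0 > max { max = $2 + 0; tid = $1 }
         END { if (NR > 0) print tid }'
}

# Convert a decimal thread ID to the hexadecimal form used in jstack's nid= field.
tid_to_nid() {
    printf '0x%x\n' "$1"
}

# Example against a live JVM (assumes procps ps and PID 1, as in the article):
#   ps -L -o tid=,pcpu= -p 1 | busiest_tid
#   tid_to_nid 746
```

Keeping the two steps as separate functions makes it easy to feed busiest_tid from a different source, such as saved top -H output reduced to two columns.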

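The thread's stack was captured with jstack <pid> | grep 2ea > gc.stack. A plain grep keeps only the matching header line, so a context option such as grep -A is needed to keep the stack frames beneath it. Because the file was large, a simple Python HTTP server was started in the directory holding it (python -m SimpleHTTPServer 8080 on Python 2, python3 -m http.server 8080 on Python 3), and the file was downloaded with curl -o gc.stack http://<ip>:8080/gc.stack.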
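One way to pull the complete stack of a single thread out of a saved dump is to treat each thread entry as one record. The sketch below assumes the full jstack output was first saved to a file, and thread_block is a hypothetical helper name. It relies on jstack separating thread entries with blank lines.

```shell
#!/bin/sh
# Hypothetical helper (not from the original article).
# Print the whole stack block for one nid from a saved thread dump.
#   $1 = nid as jstack prints it, e.g. 0x2ea
#   $2 = path to the dump file
thread_block() {
    # RS="" puts awk in paragraph mode: each blank-line-separated thread
    # entry becomes one record. The trailing space in the pattern keeps
    # 0x2ea from also matching longer ids such as 0x2eab.
    awk -v nid="nid=$1 " 'BEGIN { RS = "" } index($0, nid) { print }' "$2"
}

# Example (PID 1 and file names are illustrative):
#   jstack 1 > full.stack
#   thread_block 0x2ea full.stack > gc.stack
```

Matching on the literal nid= field rather than on the bare hex digits also avoids false hits on memory addresses elsewhere in the dump.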

Analysis of the downloaded stack traced the problem to an asynchronous Excel export feature. The export reused a common list‑query API that returns at most 200 items per page, while a single user's export needed tens of thousands of records. Fetching that volume page by page created a very large number of List objects, which drove the repeated Young GCs and eventually a Full GC that affected other pods.
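A quick calculation shows why a 200-item page limit hurts at this scale. The record count below is an illustrative assumption (the article only says "tens of thousands"), and pages_needed is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper (not from the original article).
# Number of page requests needed to fetch all records: integer ceiling
# of total / page_size.
#   $1 = total records, $2 = page size
pages_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# 50000 records is an assumed figure for illustration only.
pages_needed 50000 200   # prints 250
```

Each of those requests returns a fresh list, so one export of this assumed size allocates at least 250 short-lived lists before any Excel rows are written, and several concurrent exports multiply that load.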

After fixing the export logic to avoid the excessive list creation and redeploying the service, the CPU usage returned to normal and the GC storms ceased. The author concludes with advice to stay calm during production incidents, verify service availability first, and use tools like Arthas for deeper troubleshooting.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
