Comprehensive Guide to Java Runtime Error Diagnosis: CPU, Memory, Disk, GC, and Network Troubleshooting

This article provides a systematic, step‑by‑step guide for diagnosing Java production incidents by examining CPU, disk, memory, GC, and network layers, illustrating the use of tools such as jstack, jmap, jstat, vmstat, iostat, netstat, ss, and tcpdump with concrete command examples and visual aids.

Architecture Digest

Sep 19, 2020

Production incidents usually involve CPU, disk, memory, or network problems, and most span more than one layer, so it is worth checking all four systematically. Tools such as jstack, jmap, jstat, vmstat, pidstat, iostat, netstat, ss, and tcpdump cover most of what you need.

CPU

CPU anomalies are usually the easiest to locate. Common causes are infinite loops in business logic, frequent GC, and excessive context switching. The most frequent culprit is business-logic or framework code, and jstack shows the stack of the offending thread.

Analyzing CPU Issues with jstack

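First, find the process ID (PID) with ps or top. Then identify the threads that consume the most CPU:

top -H -p <pid>

Convert the ID of the busiest thread to hexadecimal, because jstack reports native thread IDs in hex (the nid field):

printf '%x\n' <tid>

Search the jstack output for that hexadecimal ID:

jstack <pid> | grep '<nid>' -C5 --color

The thread that is burning CPU is normally in the RUNNABLE state, so read its stack trace first. For the process as a whole, also pay attention to threads in WAITING, TIMED_WAITING, and BLOCKED states; an unusually large number of them points to lock contention or a stalled dependency. A quick overview of thread states can be obtained from a saved dump (jstack <pid> > jstack.log):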

cat jstack.log | grep "java.lang.Thread.State" | sort -nr | uniq -c
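
The lookup above can be scripted. The following is a minimal sketch rather than the article's own tooling: the function names tid_to_nid, count_thread_states, and stack_for_tid are hypothetical, and it assumes jstack prints native thread IDs in hexadecimal as nid=0x..., which is why the thread ID is converted first.

```shell
# tid_to_nid: convert a decimal thread ID (as shown by top -H) into the
# 0x-prefixed hexadecimal form that jstack prints after "nid=".
tid_to_nid() {
    printf '0x%x\n' "$1"
}

# count_thread_states: read a jstack dump on stdin and print one
# "<count> <STATE>" line per thread state, most common first.
count_thread_states() {
    awk '/java.lang.Thread.State:/ { print $2 }' \
        | sort | uniq -c | sort -rn \
        | awk '{ print $1, $2 }'
}

# stack_for_tid (hypothetical helper): print the first lines of one
# thread's stack. Usage: stack_for_tid <pid> <tid>
stack_for_tid() {
    jstack "$1" | grep -A 20 "nid=$(tid_to_nid "$2") "
}
```

For example, count_thread_states < jstack.log summarizes a saved dump, and stack_for_tid <pid> <tid> prints the stack of the thread that top -H flagged.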

Frequent GC

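Check GC frequency with:

jstat -gc <pid> 1000

The trailing 1000 is the sampling interval in milliseconds. The output shows the capacity and usage of each generation together with GC counts and cumulative times: YGC/YGCT for young collections, FGC/FGCT for full collections, and GCT for the total. If the counts climb quickly between samples, GC is too frequent; look for a memory leak or adjust the heap parameters.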
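
To turn the jstat counters into something easier to judge, the mean young-GC pause can be derived as YGCT divided by YGC. The sketch below is illustrative only: gc_summary is a hypothetical name, and it assumes the jstat -gc header contains the columns YGC, YGCT, FGC, and FGCT, locating them by name because the column positions differ between JDK versions.

```shell
# gc_summary: read `jstat -gc` output on stdin and, for the last sample,
# print the young/full GC counts and the mean young-GC pause in ms.
# Usage: jstat -gc <pid> 1000 5 | gc_summary
gc_summary() {
    awk '
        # Header line: remember the position of every column name.
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
        # Data lines: keep the values of the most recent sample.
        NF > 0 {
            ygc  = $(col["YGC"]);  ygct = $(col["YGCT"])
            fgc  = $(col["FGC"]);  fgct = $(col["FGCT"])
        }
        END {
            avg = (ygc > 0) ? ygct * 1000 / ygc : 0
            printf "YGC=%d avg_young_ms=%.1f FGC=%d FGCT=%.3f\n", ygc, avg, fgc, fgct
        }'
}
```

The times reported by jstat are cumulative seconds since JVM start, so the result is a lifetime average, not the duration of the latest collection.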

Context Switches

Use vmstat and watch the cs column (context switches per second) for the whole system. To monitor a specific process, run:

pidstat -w -p <pid>

The cswch/s and nvcswch/s columns report voluntary and involuntary context switches per second.
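
Where pidstat is not installed, Linux exposes the same kind of counters in /proc/<pid>/status. This is a small sketch under that assumption; ctxt_switches is a hypothetical helper name, and the values it prints are cumulative since the process started, not per-second rates.

```shell
# ctxt_switches: read the contents of /proc/<pid>/status on stdin and
# print the kernel's cumulative context-switch counters.
# Usage: ctxt_switches < /proc/<pid>/status
ctxt_switches() {
    awk -F':[ \t]*' '
        $1 == "voluntary_ctxt_switches"    { v = $2 }
        $1 == "nonvoluntary_ctxt_switches" { n = $2 }
        END { printf "voluntary=%d involuntary=%d\n", v, n }'
}
```

Running it twice a few seconds apart and comparing the numbers gives a rough rate.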

Disk

Start with a free-space check:

df -hl

Performance problems are diagnosed with:

iostat -d -k -x

Key columns include %util (the percentage of time the device was busy) and rrqm/s and wrqm/s (read and write requests merged per second). Identify the offending process with iotop, or, if you only have a thread ID, map it to its process with:

readlink -f /proc/*/task/<tid>/../..

Then inspect the process's I/O counters with cat /proc/<pid>/io and its open files with lsof -p <pid>.
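
As an illustration of reading the iostat output, the sketch below lists devices whose %util is at or above a threshold. busy_disks and the default threshold of 80 are assumptions made for this example; the %util column is located by its header name because the column layout varies between sysstat versions.

```shell
# busy_disks: read `iostat -d -k -x` output on stdin and print the name and
# %util of every device at or above the given threshold (default 80).
# Usage: iostat -d -k -x | busy_disks 90
busy_disks() {
    awk -v limit="${1:-80}" '
        # Header row: find which column holds %util.
        $1 ~ /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") u = i; next }
        # Device rows: report the busy ones.
        u && NF >= u && $u + 0 >= limit { print $1, $u }'
}
```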

Memory

Memory troubleshooting begins with free. Most problems involve memory managed by the JVM and surface as OutOfMemoryError (OOM) or StackOverflowError.

OOM

Typical messages:

"Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread" – the OS could not create another native thread, either because there is not enough native memory for thread stacks or because an OS thread limit was reached; consider reducing -Xss or raising the OS limits.

"Exception in thread "main" java.lang.OutOfMemoryError: Java heap space" – heap usage reached -Xmx; look for leaks with jstack and jmap first, and increase -Xmx only if the usage is legitimate.

"Exception in thread "main" java.lang.OutOfMemoryError: Metaspace" – the metaspace limit was reached; adjust -XX:MaxMetaspaceSize (on JDK 7 and earlier the equivalent area is PermGen, sized with -XX:MaxPermSize).

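Because each message calls for a different remedy, it helps to know which kind actually appears in the logs. The following sketch counts them; oom_kinds is a hypothetical helper, and it assumes the application log contains the standard java.lang.OutOfMemoryError: <reason> text.

```shell
# oom_kinds: read an application log on stdin and print
# "<count> <reason>" for each distinct OutOfMemoryError reason,
# most frequent first.
# Usage: oom_kinds < application.log
oom_kinds() {
    grep -o 'java\.lang\.OutOfMemoryError: .*' \
        | sed 's/^java\.lang\.OutOfMemoryError: //' \
        | sort | uniq -c | sort -rn \
        | awk '{ n = $1; $1 = ""; sub(/^ /, ""); print n, $0 }'
}
```
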
StackOverflow

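Thrown as java.lang.StackOverflowError when a thread's stack grows beyond the size set by -Xss. Reduce the recursion depth first; increase -Xss cautiously, because the setting applies to every thread.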

Analyzing Heap Dumps

Export a heap dump with:

jmap -dump:format=b,file=heap.hprof <pid>

Open the dump in Eclipse MAT and use Leak Suspects or Top Consumers to locate leaks; the thread overview helps with thread-related problems.

Off‑Heap Memory

Off‑heap leaks often stem from NIO buffers. Monitor with pmap -x <pid> | sort -rn -k3 | head -30. If growth persists, capture a native memory dump via:

gdb --batch --pid <pid> -ex "dump memory filename.dump <addr> <addr+size>"

Analyze the dump with hexdump -C filename.dump | less. To see where the JVM itself allocates native memory, enable Native Memory Tracking (NMT) at startup with -XX:NativeMemoryTracking=summary or -XX:NativeMemoryTracking=detail. Establish a baseline:

jcmd <pid> VM.native_memory baseline

After memory has grown, compare against it:

jcmd <pid> VM.native_memory detail.diff
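
The pmap listing used earlier in this section can also be filtered for large resident mappings, which are the usual candidates for a gdb memory dump. This is a sketch only: large_regions and its 64 MB default are assumptions for illustration, and it expects the Address, Kbytes, RSS column order that pmap -x prints.

```shell
# large_regions: read `pmap -x <pid>` output on stdin and print the
# address, size and RSS (both in KB) of every mapping whose RSS is at
# least min_kb (default 65536 KB, i.e. 64 MB).
# Usage: pmap -x <pid> | large_regions 10240
large_regions() {
    awk -v min_kb="${1:-65536}" '
        # Only rows that start with a hex address and have a numeric RSS.
        $1 ~ /^[0-9a-f]+$/ && $3 ~ /^[0-9]+$/ && $3 + 0 >= min_kb { print $1, $2, $3 }'
}
```

The printed address and size give the start address and the length needed to compute the end address for the gdb dump memory command.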

GC Issues

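GC problems can cause CPU spikes and latency. Enable detailed GC logging with the following flags (these are JDK 8 flags; JDK 9 and later use unified logging such as -Xlog:gc* instead):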

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps

For G1, monitor Young GC frequency and duration. If Young GC is too frequent, consider adjusting -Xmn or -XX:SurvivorRatio. Long Young GC pauses require analysis of Root Scanning, Object Copy, and Ref Proc phases.

Full GC Triggers

Concurrent phase failure – increase heap size or -XX:ConcGCThreads.

Promotion failure – adjust -XX:G1ReservePercent or -XX:InitiatingHeapOccupancyPercent.

Large object allocation failure – increase -XX:G1HeapRegionSize.

Explicit System.gc() calls – avoid.

Dump heap before/after Full GC with:

jinfo -flag +HeapDumpBeforeFullGC <pid>

jinfo -flag +HeapDumpAfterFullGC <pid>

Network

Network problems are complex and often the hardest to diagnose. Key topics include timeouts, TCP queue overflows, RST packets, TIME_WAIT, and CLOSE_WAIT states.

Timeouts

Distinguish between connection timeout, read/write timeout, and pool‑related timeouts. Keep client‑side timeouts shorter than server‑side values.

TCP Queue Overflow

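The kernel keeps two queues per listening socket: the SYN queue (half-open connections) and the accept queue (fully established connections waiting for accept()). When the accept queue is full, the kernel either drops the client's final ACK or, if net.ipv4.tcp_abort_on_overflow is 1, answers with an RST. Monitor with:

netstat -s | egrep "listen|LISTEN"

which shows cumulative overflow and drop counts, and

ss -lnt

which, for listening sockets, shows the current accept-queue length in Recv-Q and its maximum in Send-Q.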
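
A quick way to spot a saturated accept queue is to compare the two queue columns of ss -lnt for each listener. In the sketch below, full_accept_queues is a hypothetical name, and the code assumes that for LISTEN sockets the second column (Recv-Q) is the current accept-queue length and the third (Send-Q) is the configured backlog.

```shell
# full_accept_queues: read `ss -lnt` output on stdin and print listeners
# whose accept queue has reached its backlog.
# Usage: ss -lnt | full_accept_queues
full_accept_queues() {
    awk '$1 == "LISTEN" && $3 + 0 > 0 && $2 + 0 >= $3 + 0 {
             print $4, "queued=" $2, "backlog=" $3
         }'
}
```

No output means no listener was full at the moment of sampling; the overflow counters from netstat -s still show whether it happened earlier.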

RST Packets

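An RST signals that a connection was reset abnormally. Common causes are a port that nobody is listening on, a deliberate reset (for example via SO_LINGER with a zero timeout), and packets that arrive after the connection has been closed. Capture traffic with:

tcpdump -i en0 tcp -w capture.cap

(replace en0 with your interface name) and analyze the capture in Wireshark.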

TIME_WAIT and CLOSE_WAIT

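TIME_WAIT exists so that delayed packets from a closed connection are not mistaken for part of a new one and so that the final ACK can be resent if it is lost. If excessive TIME_WAIT counts are exhausting ports, they can be mitigated through kernel parameters:

net.ipv4.tcp_tw_reuse = 1

net.ipv4.tcp_tw_recycle = 1

Use tcp_tw_recycle with caution: it is known to break clients behind NAT and was removed in Linux 4.12. tcp_tw_reuse applies only to outgoing connections and depends on TCP timestamps being enabled.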

CLOSE_WAIT usually indicates that the application has not closed its socket after the peer closed the connection, often because the owning thread is blocked. Use jstack to locate the stuck threads.
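
To see how many connections sit in each state, the output of ss can be tallied. The helper name tcp_state_counts is an assumption for this sketch; note that ss spells the states with hyphens (TIME-WAIT, CLOSE-WAIT), whereas netstat uses underscores.

```shell
# tcp_state_counts: read `ss -ant` output on stdin and print
# "<count> <state>" for each TCP state, highest count first.
# Usage: ss -ant | tcp_state_counts
tcp_state_counts() {
    awk 'NR > 1 && NF > 0 { c[$1]++ }
         END { for (s in c) print c[s], s }' | sort -rn
}
```

A CLOSE-WAIT count that keeps growing between runs is the signal to take a jstack dump and look for blocked threads.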

Overall, systematic use of the above commands and careful analysis of logs and dumps enable effective diagnosis of Java runtime errors across CPU, memory, disk, GC, and network layers.
