Comprehensive Guide to Java Runtime Error Diagnosis: CPU, Memory, Disk, GC, and Network Troubleshooting
This article provides a systematic, step‑by‑step guide for diagnosing Java production incidents by examining CPU, disk, memory, GC, and network layers, illustrating the use of tools such as jstack, jmap, jstat, vmstat, iostat, netstat, ss, and tcpdump with concrete command examples and visual aids.
Online incidents typically involve CPU, disk, memory, and network problems, and most issues span multiple layers; therefore, a systematic check of all four aspects is recommended. Tools like jstack, jmap, jstat, vmstat, pidstat, iostat, netstat, ss, and tcpdump are essential for comprehensive troubleshooting.
CPU
CPU anomalies are often easier to locate. Common causes include business‑logic infinite loops, frequent GC, and excessive context switches. The most frequent culprit is business‑logic or framework code, which can be examined using jstack.
Analyzing CPU Issues with jstack
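First, find the process ID (PID) with ps or top. Then identify the threads using the most CPU:
    top -H -p <pid>
Convert the ID of the busiest thread to hexadecimal, because jstack reports native thread IDs in hex (nid=0x...):
    printf '%x\n' <tid>
Search the jstack output for that hexadecimal ID:
    jstack <pid> | grep '<nid>' -C5 --color
The matching stack trace shows what the high-CPU thread is executing. When reading the rest of the dump, focus on threads in the WAITING and TIMED_WAITING states. After saving a dump with jstack <pid> > jstack.log, a quick overview of thread states can be obtained with:
    grep "java.lang.Thread.State" jstack.log | sort | uniq -c
Frequent GC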
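Check GC frequency with:
    jstat -gc <pid> 1000
The trailing 1000 is the sampling interval in milliseconds. The output shows generation sizes and usage along with GC counts and accumulated times (YGC/YGCT for young collections, FGC/FGCT for full collections, GCT for the total). If GC is too frequent, investigate memory leaks or adjust heap parameters.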
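When attaching jstat to the process is inconvenient, similar counters can be read from inside the JVM. The following is a minimal sketch using the standard java.lang.management API; the class name GcRateProbe and its method names are hypothetical, and the numbers are summed across all collectors rather than split into young and full collections as jstat does.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: reads the JVM's own GC counters, roughly the in-process
// counterpart of the count and time columns that jstat prints.
final class GcRateProbe {

    /** Total collections so far, summed over every collector the JVM reports. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 means "undefined" for this collector
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Approximate accumulated GC time in milliseconds, summed over every collector. */
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 means "undefined" for this collector
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long intervalMs = 1000; // same cadence as `jstat -gc <pid> 1000`
        long lastCount = totalCollections();
        long lastMillis = totalCollectionMillis();
        for (int i = 0; i < 3; i++) {
            Thread.sleep(intervalMs);
            long count = totalCollections();
            long millis = totalCollectionMillis();
            System.out.printf("collections +%d, gc time +%d ms in the last %d ms%n",
                    count - lastCount, millis - lastMillis, intervalMs);
            lastCount = count;
            lastMillis = millis;
        }
    }
}
```

A steadily large delta per interval is the in-process equivalent of YGC or FGC climbing quickly in the jstat output.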
Context Switches
Use vmstat to view the cs column (context switches per second, system-wide). For a specific process, run:
    pidstat -w -p <pid>
The cswch/s and nvcswch/s columns show voluntary and involuntary context switches per second.
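Returning to the jstack steps earlier in this section, the two mechanical parts (converting a thread ID from top -H into the hexadecimal nid form, and tallying thread states in a saved dump) can be sketched in Java as follows. JstackHelper and its method names are hypothetical, and the parsing assumes the usual HotSpot dump layout, where each thread has a "java.lang.Thread.State: ..." line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading a saved jstack dump.
final class JstackHelper {

    private static final Pattern STATE =
            Pattern.compile("java\\.lang\\.Thread\\.State: (\\w+)");

    /** Formats a decimal thread ID from `top -H` the way jstack prints it after "nid=". */
    static String toNid(long tid) {
        return "0x" + Long.toHexString(tid);
    }

    /** Counts threads per state in the text of a jstack dump, sorted by state name. */
    static Map<String, Integer> countStates(String dump) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = STATE.matcher(dump);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: JstackHelper <tid> <jstack.log>");
            return;
        }
        System.out.println("search the dump for nid=" + toNid(Long.parseLong(args[0])));
        String dump = new String(Files.readAllBytes(Paths.get(args[1])), StandardCharsets.UTF_8);
        for (Map.Entry<String, Integer> e : countStates(dump).entrySet()) {
            System.out.println(e.getValue() + " " + e.getKey());
        }
    }
}
```

The tally mirrors the grep, sort, and uniq -c pipeline; an unusually large WAITING or TIMED_WAITING count is the cue to read those stacks more closely.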
Disk
Disk troubleshooting starts with a space check:
    df -hl
Performance problems are diagnosed with:
    iostat -d -k -x
Key columns include %util (the percentage of time the device was busy) and rrqm/s and wrqm/s (read and write requests merged per second). Identify the offending process with iotop. iotop reports thread IDs, which can be mapped back to a PID with:
    readlink -f /proc/*/task/<tid>/../..
Then inspect the process's I/O counters and open files:
    cat /proc/<pid>/io
    lsof -p <pid>
Memory
Memory troubleshooting begins with free, which shows overall system memory. Most problems are heap-related and typically surface as OutOfMemoryError (OOM) or StackOverflowError.
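free only shows the system-wide picture. For a view from inside the JVM, the sketch below reads heap usage and the individual memory pools through the standard java.lang.management API. MemoryPoolReport and heapUsedFraction are hypothetical names, and pool names such as "Metaspace" vary by JVM and collector, so the output should be read as illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper: reports how close the heap is to its ceiling and lists each pool.
final class MemoryPoolReport {

    /** Fraction of the maximum heap currently in use, between 0 and 1. */
    static double heapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the JVM reports no maximum
        if (max <= 0) {
            max = heap.getCommitted(); // fall back to what is currently committed
        }
        return (double) heap.getUsed() / max;
    }

    public static void main(String[] args) {
        System.out.printf("heap used: %.1f%% of max%n", heapUsedFraction() * 100);
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            String max = usage.getMax() < 0 ? "unlimited" : (usage.getMax() / 1024) + " KB";
            System.out.printf("%-32s used=%d KB max=%s%n",
                    pool.getName(), usage.getUsed() / 1024, max);
        }
    }
}
```

A heap fraction that stays near 1.0 after collections is the precursor of a "Java heap space" error, while a growing non-heap pool points toward class metadata instead.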
Heap OOM
Typical messages:
Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread – the JVM could not create another native thread, usually because native memory for thread stacks or an OS thread limit is exhausted; consider reducing -Xss or raising the OS limits.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space – heap usage reached -Xmx; look for leaks with jstack and jmap first, then increase -Xmx if the demand is legitimate.
Exception in thread "main" java.lang.OutOfMemoryError: Metaspace – the metaspace limit was reached; adjust -XX:MaxMetaspaceSize (or -XX:MaxPermSize for the permanent generation on JDKs before 1.8).
StackOverflow
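A java.lang.StackOverflowError indicates that a thread's stack exceeded the size set by -Xss. Reduce recursion depth first; increase -Xss cautiously, since it sets the default stack size for every new thread.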
Analyzing Heap Dumps
Export a heap dump with:
    jmap -dump:format=b,file=heap.hprof <pid>
Open the dump in Eclipse MAT, use the Leak Suspects or Top Consumers reports to locate leaks, and examine the thread overview for concurrency issues.
Off‑Heap Memory
Off-heap leaks often stem from NIO direct buffers. Monitor the largest mappings of the process, sorted by resident size, with:
    pmap -x <pid> | sort -rn -k3 | head -30
If growth persists, dump a suspicious address range with gdb:
    gdb --batch --pid <pid> -ex "dump memory filename.dump <addr> <addr+size>"
and inspect it with:
    hexdump -C filename.dump | less
For a JVM-level view, enable Native Memory Tracking (NMT) with one of these JVM flags:
    -XX:NativeMemoryTracking=summary
    -XX:NativeMemoryTracking=detail
Establish a baseline:
    jcmd <pid> VM.native_memory baseline
Later, compare against it with:
    jcmd <pid> VM.native_memory detail.diff
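Since NIO buffers are the usual suspect, it can help to watch the JVM's own accounting of direct buffers alongside pmap and NMT. The sketch below uses the standard BufferPoolMXBean; DirectBufferProbe is a hypothetical name, and the pool name "direct" is an assumption that holds on HotSpot-based JVMs but is not guaranteed elsewhere.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

// Hypothetical helper: reads how much memory NIO direct buffers currently hold.
final class DirectBufferProbe {

    /** Bytes held by the "direct" buffer pool, or -1 if the JVM reports no such pool. */
    static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) { // assumed HotSpot pool name
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(8 * 1024 * 1024); // 8 MB off-heap
        long after = directBytesUsed();
        System.out.printf("direct buffers: %d bytes before, %d bytes after allocating %d%n",
                before, after, buffer.capacity());
    }
}
```

If this figure keeps growing while the heap stays flat, the leak is likely in code that allocates direct buffers; if it stays flat while pmap shows growth, the cause lies elsewhere in native memory.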
GC Issues
GC problems can cause CPU spikes and latency. Enable detailed GC logging with:
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
These flags apply to JDK 8 and earlier; JDK 9 and later use unified logging (-Xlog:gc*) instead.
For G1, monitor Young GC frequency and duration. If Young GC is too frequent, consider adjusting -Xmn or -XX:SurvivorRatio. Long Young GC pauses call for a closer look at the Root Scanning, Object Copy, and Ref Proc phases in the log.
Full GC Triggers
Concurrent phase failure – increase heap size or -XX:ConcGCThreads.
Promotion failure – adjust -XX:G1ReservePercent or -XX:InitiatingHeapOccupancyPercent.
Large object allocation failure – increase -XX:G1HeapRegionSize.
Explicit System.gc() calls – avoid.
To capture heap dumps before and after a Full GC, enable these flags on the running JVM:
    jinfo -flag +HeapDumpBeforeFullGC <pid>
    jinfo -flag +HeapDumpAfterFullGC <pid>
Network
Network problems are complex and often the hardest to diagnose. Key topics include timeouts, TCP queue overflows, RST packets, TIME_WAIT, and CLOSE_WAIT states.
Timeouts
Distinguish between connection timeout, read/write timeout, and pool‑related timeouts. Keep client‑side timeouts shorter than server‑side values.
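The difference between a connection timeout and a read timeout is easiest to see in code. The sketch below uses only java.net; SocketTimeoutDemo and probe are hypothetical names. It connects to a loopback listener that never calls accept() and never replies: the connect still succeeds because the kernel completes the handshake, and only the read timeout fires.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo of the two timeouts a plain socket client can set.
final class SocketTimeoutDemo {

    /**
     * Connects to a silent local listener and attempts one read.
     * Returns "read timed out" if the read timeout fired, otherwise "read returned".
     */
    static String probe(int connectTimeoutMs, int readTimeoutMs) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket()) {
            // Connection timeout: bounds only the TCP handshake.
            client.connect(new InetSocketAddress(loopback, silentServer.getLocalPort()),
                    connectTimeoutMs);
            // Read timeout: bounds each blocking read once the connection exists.
            client.setSoTimeout(readTimeoutMs);
            try {
                client.getInputStream().read();
                return "read returned";
            } catch (SocketTimeoutException e) {
                return "read timed out";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(probe(1000, 200));
    }
}
```

Pool-related timeouts (waiting to borrow a connection) are configured in the pool library rather than on the socket, so they are not shown here.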
TCP Queue Overflow
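Two queues exist: the SYN queue (half-open connections) and the accept queue (fully established connections waiting for accept()). When the accept queue is full, the kernel either drops the client's final ACK or, if tcp_abort_on_overflow is set to 1, replies with an RST. Check the overflow counters with:
    netstat -s | egrep "listen|LISTEN"
and the queue lengths with:
    ss -lnt
For sockets in the LISTEN state, Recv-Q is the current accept queue length and Send-Q is its maximum.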
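Scanning ss -lnt output by eye gets tedious on hosts with many listeners. As a rough sketch, the check can be automated by comparing the second and third columns (Recv-Q and Send-Q) on each LISTEN line. AcceptQueueCheck is a hypothetical name, the column layout is assumed to match common ss versions, and the sample text in main is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: flags listeners whose accept queue has reached its limit.
final class AcceptQueueCheck {

    /**
     * Parses `ss -lnt` output and returns the local address of every LISTEN socket
     * whose Recv-Q (current accept queue) is at or above Send-Q (its maximum).
     */
    static List<String> fullQueues(String ssOutput) {
        List<String> full = new ArrayList<>();
        for (String line : ssOutput.split("\\r?\\n")) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 4 || !"LISTEN".equals(fields[0])) {
                continue; // skips the header row and blank lines
            }
            long recvQ = Long.parseLong(fields[1]);
            long sendQ = Long.parseLong(fields[2]);
            if (sendQ > 0 && recvQ >= sendQ) {
                full.add(fields[3]);
            }
        }
        return full;
    }

    public static void main(String[] args) {
        // Made-up sample in the usual `ss -lnt` layout.
        String sample =
                "State  Recv-Q Send-Q Local Address:Port Peer Address:Port\n"
              + "LISTEN 0      128    0.0.0.0:22         0.0.0.0:*\n"
              + "LISTEN 129    128    0.0.0.0:8080       0.0.0.0:*\n";
        System.out.println("listeners with a full accept queue: " + fullQueues(sample));
    }
}
```

A listener that shows up here repeatedly, together with rising overflow counters from netstat -s, suggests the application is not calling accept() fast enough or the backlog is too small.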
RST Packets
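RST indicates an abnormal connection reset. Common causes: the destination port is not listening, one side deliberately resets the connection (for example via SO_LINGER), or packets arrive after the connection has already been closed. Capture traffic with:
    tcpdump -i en0 tcp -w capture.cap
(replace en0 with the relevant network interface) and analyze the capture in Wireshark.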
TIME_WAIT and CLOSE_WAIT
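TIME_WAIT helps recycle sockets safely by keeping late packets from a closed connection away from a new one; excessive counts can be mitigated by enabling reuse in sysctl:
    net.ipv4.tcp_tw_reuse = 1
    net.ipv4.tcp_tw_recycle = 1
Note that tcp_tw_recycle is known to break clients behind NAT and was removed in Linux 4.12, so prefer tcp_tw_reuse on current kernels.
CLOSE_WAIT usually indicates application code failing to close sockets properly, often because the thread that should close them is blocked. Use jstack to locate stuck threads.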
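On the application side, the usual guard against lingering CLOSE_WAIT sockets is to make closing unconditional. The sketch below shows one way to do that with try-with-resources: even when the handler fails, every socket is closed before control leaves the block. CloseOnExit is a hypothetical name and the failure is simulated.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo: sockets opened in try-with-resources are closed on every exit path.
final class CloseOnExit {

    /** Returns true if the accepted socket was closed even though handling failed. */
    static boolean closedAfterFailure() {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        Socket observed = null;
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket accepted = server.accept()) {
            observed = accepted;
            throw new IOException("simulated handler failure");
        } catch (IOException e) {
            // By the time this runs, all three sockets have already been closed.
        }
        return observed != null && observed.isClosed();
    }

    public static void main(String[] args) {
        System.out.println("closed after failure: " + closedAfterFailure());
    }
}
```

This does not help when the owning thread is blocked before it ever reaches the end of the block, which is why the jstack check for stuck threads remains the first diagnostic step.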
Overall, systematic use of the above commands and careful analysis of logs and dumps enable effective diagnosis of Java runtime errors across CPU, memory, disk, GC, and network layers.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
