Comprehensive Guide to Java Runtime Error Diagnosis: CPU, Memory, Disk, GC, and Network Troubleshooting
This article presents a systematic approach to diagnosing and resolving Java runtime problems by examining CPU usage, disk I/O, memory consumption, garbage‑collection behavior, and network anomalies, offering practical commands, analysis techniques, and visual aids to pinpoint root causes in production environments.
Author: fredalxin
Source: https://fredal.xin/java-error-check
CPU
When troubleshooting CPU issues, start by locating the problematic process with ps, then identify high-usage threads using top -H -p <pid>. Convert the busy thread's ID to hexadecimal with printf '%x\n' <tid> to obtain the NID, and search the jstack output for that NID (it appears in the thread header as nid=0x<hex>).
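The thread-ID-to-NID step can be sketched in Java as follows. This is a minimal illustration, assuming the dump prints the NID in hexadecimal as described above; the class name NidLookup and the sample thread lines are hypothetical, not taken from a real dump.

```java
import java.util.List;
import java.util.stream.Collectors;

final class NidLookup {
    /** Formats a decimal OS thread ID as a hexadecimal NID, e.g. 12345 becomes "0x3039". */
    static String toNid(long tid) {
        return "0x" + Long.toHexString(tid);
    }

    /** Returns the lines of a jstack dump whose nid field matches the given thread ID. */
    static List<String> findThreadHeaders(List<String> jstackLines, long tid) {
        // The trailing space prevents 0x3039 from also matching 0x30390.
        String needle = "nid=" + toNid(tid) + " ";
        return jstackLines.stream()
                .filter(line -> line.contains(needle))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical sample lines, shaped like jstack thread headers.
        List<String> dump = List.of(
                "\"worker-1\" #12 prio=5 os_prio=0 tid=0x00007f0c1c0a1000 nid=0x3039 runnable",
                "   java.lang.Thread.State: RUNNABLE",
                "\"worker-2\" #13 prio=5 os_prio=0 tid=0x00007f0c1c0a3000 nid=0x303a waiting on condition");
        System.out.println(toNid(12345));
        findThreadHeaders(dump, 12345).forEach(System.out::println);
    }
}
```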
Typical CPU problems include infinite loops, frequent GC, and excessive context switches. jstack also reveals threads in WAITING or TIMED_WAITING states; an unusually large number of them is worth investigating. For an overview of thread states, run
cat jstack.log | grep "java.lang.Thread.State" | sort -nr | uniq -c
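The same per-state count can be produced in Java, which is handy when the dump is already being processed by a tool. This is a sketch only; the class name ThreadStateCounter is hypothetical, and it expects the path of a saved dump such as jstack.log as its first argument.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ThreadStateCounter {
    private static final String MARKER = "java.lang.Thread.State: ";

    /** Counts how many threads are in each state, keyed by state name. */
    static Map<String, Integer> countStates(List<String> jstackLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : jstackLines) {
            int idx = line.indexOf(MARKER);
            if (idx < 0) {
                continue; // not a state line
            }
            // Keep only the state name, dropping details such as "(parking)".
            String state = line.substring(idx + MARKER.length()).trim().split("\\s+")[0];
            counts.merge(state, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        countStates(lines).forEach((state, n) -> System.out.println(n + " " + state));
    }
}
```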
Frequent GC can be inspected with jstat -gc <pid> 1000, which reports survivor, Eden, old‑gen, and metaspace usage as well as GC counts and timings.
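When attaching jstat is not convenient, similar GC counts and cumulative timings can be read from inside the JVM through the standard java.lang.management API. The sketch below is an in-process alternative, not a replacement for jstat; the class name GcStats is hypothetical, and the collector names it prints depend on the GC in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcStats {
    /** Returns one line per collector with its collection count and cumulative time. */
    static String snapshot() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(String.format("%s: count=%d, timeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(snapshot());
    }
}
```

Calling snapshot() at a fixed interval and comparing consecutive values gives the GC rate, in the same spirit as the 1000 ms sampling interval passed to jstat.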
Context-switch problems are examined system-wide with vmstat, whose cs column reports context switches per second, and per process with pidstat -w -p <pid>, where cswch/s counts voluntary switches and nvcswch/s counts involuntary ones.
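On Linux, per-process context-switch counters are also exposed in /proc/<pid>/status as voluntary_ctxt_switches and nonvoluntary_ctxt_switches. The following sketch samples them one second apart; it is Linux-only, the class name CtxSwitches is hypothetical, and it defaults to the current process when no PID is given.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

final class CtxSwitches {
    /** Returns {voluntary, involuntary} parsed from the lines of /proc/<pid>/status, or -1 if absent. */
    static long[] parse(List<String> statusLines) {
        long voluntary = -1;
        long involuntary = -1;
        for (String line : statusLines) {
            if (line.startsWith("voluntary_ctxt_switches:")) {
                voluntary = valueOf(line);
            } else if (line.startsWith("nonvoluntary_ctxt_switches:")) {
                involuntary = valueOf(line);
            }
        }
        return new long[] {voluntary, involuntary};
    }

    private static long valueOf(String line) {
        return Long.parseLong(line.substring(line.indexOf(':') + 1).trim());
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        String pid = args.length > 0 ? args[0] : "self";
        long[] before = parse(Files.readAllLines(Paths.get("/proc", pid, "status")));
        Thread.sleep(1000);
        long[] after = parse(Files.readAllLines(Paths.get("/proc", pid, "status")));
        System.out.printf("voluntary/s=%d involuntary/s=%d%n",
                after[0] - before[0], after[1] - before[1]);
    }
}
```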
Disk
Check disk space with df -hl. For performance issues, use iostat -d -k -x to view utilization, read/write rates, and the %util column to identify saturated disks.
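A disk-space check can also be done from Java with the standard java.nio.file.FileStore API, for example inside a health endpoint. This is a rough sketch; the class name DiskSpace is hypothetical, and the percentage is computed from total and usable bytes, so it may differ slightly from the Use% column that df prints.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Paths;

final class DiskSpace {
    /** Percentage of the store that is not available to this JVM. */
    static double usedPercent(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            return 0.0;
        }
        return 100.0 * (totalBytes - usableBytes) / totalBytes;
    }

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "/";
        FileStore store = Files.getFileStore(Paths.get(path));
        System.out.printf("%s: %.1f%% used%n", store,
                usedPercent(store.getTotalSpace(), store.getUsableSpace()));
    }
}
```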
Identify the process responsible for heavy I/O with iotop. Convert a thread ID to a PID via readlink -f /proc/*/task/<tid>/../.., then inspect its I/O statistics with cat /proc/<pid>/io or list open files with lsof -p <pid>.
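The counters in /proc/<pid>/io are plain "name: value" lines, so they are easy to read programmatically. The sketch below parses them into a map; it is Linux-only, the class name ProcIo is hypothetical, and reading another process's file may require matching privileges.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ProcIo {
    /** Parses "name: value" lines such as "read_bytes: 4096" into a map. */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // skip anything that is not a counter line
            }
            out.put(line.substring(0, colon).trim(),
                    Long.parseLong(line.substring(colon + 1).trim()));
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        String pid = args.length > 0 ? args[0] : "self";
        Map<String, Long> io = parse(Files.readAllLines(Paths.get("/proc", pid, "io")));
        // read_bytes and write_bytes count bytes that actually reached the storage layer.
        System.out.println("read_bytes=" + io.get("read_bytes")
                + " write_bytes=" + io.get("write_bytes"));
    }
}
```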
Memory
Start with free to get an overview of RAM usage. Common OOM errors include java.lang.OutOfMemoryError: unable to create new native thread (insufficient native memory for thread stacks, or an OS thread limit reached) and java.lang.OutOfMemoryError: Java heap space (the heap has reached the -Xmx limit). Metaspace exhaustion is indicated by java.lang.OutOfMemoryError: Metaspace; its limit is set with -XX:MaxMetaspaceSize.
A java.lang.StackOverflowError is thrown when a thread's stack needs more space than -Xss allows; raise -Xss or fix the runaway recursion.
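To get a feel for how stack size limits recursion depth, the following sketch recurses until the stack is exhausted and reports the depth reached. It is illustrative only: the class name StackDepthProbe is hypothetical, the numbers vary by JVM and platform, and the stack size passed to the Thread constructor is only a hint that the JVM may ignore.

```java
final class StackDepthProbe {
    private static int depth;

    private static void recurse() {
        depth++;
        recurse(); // unbounded recursion on purpose
    }

    /** Recurses until StackOverflowError and returns the depth reached. Not thread-safe. */
    static int measureDepth() {
        depth = 0;
        try {
            recurse();
        } catch (StackOverflowError expected) {
            // The stack has unwound by the time we get here.
        }
        return depth;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("default stack: depth " + measureDepth());

        // Same probe on a thread that requests a 16 MiB stack.
        int[] result = new int[1];
        Thread big = new Thread(null, () -> result[0] = measureDepth(), "probe", 16L * 1024 * 1024);
        big.start();
        big.join();
        System.out.println("16 MiB stack: depth " + result[0]);
    }
}
```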
Use jmap -dump:format=b,file=heap.hprof <pid> to generate a heap dump, then analyze it with Eclipse MAT (Memory Analyzer Tool) to locate leak suspects or top consumers.
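A heap dump can also be triggered from inside a HotSpot JVM through com.sun.management.HotSpotDiagnosticMXBean, which is useful when jmap cannot attach. This is a sketch under that HotSpot assumption; the class name HeapDumper and the timestamped file name are hypothetical choices, and the target file must not already exist.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Path;
import java.nio.file.Paths;

final class HeapDumper {
    /** Writes an .hprof file into dir and returns its path; liveOnly dumps only reachable objects. */
    static Path dump(Path dir, boolean liveOnly) {
        Path file = dir.resolve("heap-" + System.nanoTime() + ".hprof");
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            bean.dumpHeap(file.toString(), liveOnly);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return file;
    }

    public static void main(String[] args) {
        Path file = dump(Paths.get(System.getProperty("java.io.tmpdir")), true);
        System.out.println("Heap dump written to " + file
                + " (" + file.toFile().length() + " bytes)");
    }
}
```

The resulting file can be opened in MAT in the same way as a dump produced by jmap.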
For off-heap memory problems (e.g., DirectByteBuffer leaks), monitor resident memory growth with pmap -x <pid> | sort -rn -k3 | head -30, and capture a suspicious region's raw contents with gdb --batch --pid <pid> -ex "dump memory dump.bin <start> <end>", where <start> and <end> are the region's start and end addresses. Enable Native Memory Tracking at JVM startup with -XX:NativeMemoryTracking=summary (or =detail), then run jcmd <pid> VM.native_memory summary to view the breakdown.
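Direct buffer usage in particular can be watched from inside the JVM via the standard BufferPoolMXBean, whose "direct" pool tracks memory held by direct ByteBuffers. A minimal sketch follows; the class name DirectBufferUsage is hypothetical, and the 16 MiB allocation exists only to make the change visible.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

final class DirectBufferUsage {
    /** Returns the bytes currently held by the "direct" buffer pool, or -1 if the pool is not found. */
    static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024 * 1024);
        long after = directBytesUsed();
        System.out.println("direct pool: " + before + " -> " + after + " bytes");
        // Referencing the buffer here keeps it reachable until after the second reading.
        System.out.println("allocated capacity: " + buffer.capacity());
    }
}
```

A value that keeps climbing across samples while the heap stays flat points toward a direct buffer leak rather than a heap leak.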
GC Issues
Enable detailed GC logging with -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps (these are JDK 8 flags; JDK 9 and later use unified logging, e.g. -Xlog:gc*). In the G1 logs, check Young GC frequency (tunable via -Xmn and -XX:SurvivorRatio) and duration (the Root Scanning, Object Copy, and Ref Proc phases). Full GC triggers include concurrent phase failures, promotion failures, large object allocation failures, and explicit System.gc() calls; mitigate by increasing heap size, reserving memory (-XX:G1ReservePercent), or tuning the number of concurrent GC threads (-XX:ConcGCThreads).
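To dump the heap before and after a Full GC, set the output location with -XX:HeapDumpPath=/path/dump.hprof and enable the dumps at runtime with jinfo -flag +HeapDumpBeforeFullGC <pid> and jinfo -flag +HeapDumpAfterFullGC <pid>, then compare the two dumps to find the objects that survive collection.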
Network
Network problems are diverse; common categories include timeouts (connection vs. read/write), TCP queue overflows, and RST packets. Use netstat -s | egrep "listen|LISTEN" to view SYN and accept queue overflows, and ss -lnt to inspect listening sockets.
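The difference between a connection timeout and a read timeout is easiest to see in code. The sketch below sets both explicitly on a java.net.Socket and demonstrates a read timeout against a local server that accepts the TCP handshake but never sends data. The class name TimeoutDemo, the result strings, and the timeout values are illustrative assumptions.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class TimeoutDemo {
    /** Connects and reads one byte, reporting which stage (if any) timed out. */
    static String probe(InetSocketAddress target, int connectTimeoutMs, int readTimeoutMs) {
        try (Socket socket = new Socket()) {
            try {
                socket.connect(target, connectTimeoutMs); // bounds the TCP handshake
            } catch (SocketTimeoutException e) {
                return "connect-timeout";
            }
            socket.setSoTimeout(readTimeoutMs); // bounds each blocking read
            try {
                return socket.getInputStream().read() < 0 ? "closed-by-peer" : "data";
            } catch (SocketTimeoutException e) {
                return "read-timeout";
            }
        } catch (IOException e) {
            return "io-error: " + e.getMessage();
        }
    }

    /** Probes a loopback server that completes the handshake in the kernel but never writes. */
    static String demoAgainstSilentServer() {
        try (ServerSocket silent = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            InetSocketAddress target =
                    new InetSocketAddress(silent.getInetAddress(), silent.getLocalPort());
            return probe(target, 1000, 200);
        } catch (IOException e) {
            return "io-error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(demoAgainstSilentServer());
    }
}
```

The connection succeeds because the kernel completes the handshake and parks it in the accept queue even though the server never calls accept(), which is the same queue whose overflows netstat -s reports.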
To reuse TIME_WAIT sockets, set net.ipv4.tcp_tw_reuse=1. Avoid net.ipv4.tcp_tw_recycle=1: it breaks connections from clients behind NAT and was removed in Linux 4.12. For queue sizes, tune net.core.somaxconn (accept queue) and net.ipv4.tcp_max_syn_backlog (SYN queue).
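RST packets indicate abnormal connection termination; capture traffic with tcpdump -i eth0 tcp -w capture.cap and analyze it in Wireshark. The TIME_WAIT state exists so that delayed packets from an old connection are not mistaken for a new one and so that the close completes reliably; an excessive number of TIME_WAIT sockets can be mitigated with net.ipv4.tcp_tw_reuse=1.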
CLOSE_WAIT often results from applications failing to close sockets properly; investigate with jstack to find threads stuck in blocking calls (e.g., CountDownLatch.await()).
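The usual fix is to make sure every socket is closed on all code paths, which try-with-resources guarantees. The sketch below reads until the peer closes and then closes its own side, so the socket leaves CLOSE_WAIT promptly. The class name CloseWaitDemo and the loopback demo are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class CloseWaitDemo {
    /** Reads until the peer closes, then closes our side too; returns the number of bytes read. */
    static int drainAndClose(Socket socket) throws IOException {
        try (Socket s = socket; InputStream in = s.getInputStream()) {
            byte[] buf = new byte[1024];
            int total = 0;
            int n;
            while ((n = in.read(buf)) != -1) { // -1 means the peer has sent its FIN
                total += n;
            }
            return total;
        } // close() runs here even if read() throws, so the socket cannot linger in CLOSE_WAIT
    }

    /** Loopback demo: the client sends 3 bytes and closes first; the server drains and closes. */
    static int demo() {
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            try (Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
                client.getOutputStream().write(new byte[] {1, 2, 3});
            }
            return drainAndClose(server.accept());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes read before peer close: " + demo());
    }
}
```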
Overall, systematic use of OS utilities ( top, ps, vmstat, iostat, netstat, ss), Java tools ( jstack, jmap, jstat, jcmd), and memory analyzers (MAT) enables effective root‑cause analysis of Java production incidents.