Comprehensive Guide to Diagnosing Java Production Issues: CPU, Disk, Memory, GC, and Network
This article provides a step‑by‑step troubleshooting guide for Java production incidents, covering CPU, disk, memory, GC, and network problems with practical commands, analysis techniques, and tools such as jstack, jmap, iostat, netstat, and native memory tracking.
Failures in production Java services often span several layers (CPU, disk, memory, network), so it pays to inspect each one systematically.
CPU
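Identify the busy process with ps or top, then list its threads with top -H -p <pid>. Convert the hot thread's ID (the thread ID, not the process ID) to hexadecimal (printf '%x\n' <tid>) and locate the thread with the matching nid in the jstack output (jstack <pid> | grep 'nid=0x<hex>' -C5 --color). To see whether many threads are stuck in WAITING or TIMED_WAITING, summarize the thread states in a saved dump with
grep "java.lang.Thread.State" jstack.log | sort | uniq -c | sort -nr.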
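The lookup above can be wrapped in a few shell helpers. This is a minimal sketch: the function names are made up for illustration, and it assumes the dump was saved to a file and prints thread IDs as nid=0x<hex>, as described above.

```shell
# Convert a decimal thread ID (as shown by top -H) to the nid=0x... form.
tid_to_nid() {
  printf 'nid=0x%x\n' "$1"
}

# Print the header and the following 20 lines for one thread in a saved dump.
# $1 = dump file, $2 = decimal thread ID
hot_thread_stack() {
  # -w stops nid=0xff from also matching nid=0xff1
  grep -w -F -A 20 "$(tid_to_nid "$2")" "$1"
}

# Count threads per state in a saved dump, most common state first.
thread_state_summary() {
  grep 'java.lang.Thread.State' "$1" | sed 's/^[[:space:]]*//' | sort | uniq -c | sort -nr
}
```

Typical use would be jstack <pid> > jstack.log, followed by hot_thread_stack jstack.log <tid> and thread_state_summary jstack.log.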
Frequent GC
Check GC frequency with jstat -gc <pid> 1000 (one sample per second); watch the young and full GC counts and cumulative times (YGC/YGCT, FGC/FGCT, GCT) to decide whether GC tuning is needed.
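Because jstat reports cumulative counters, the useful number is the difference between samples. The sketch below uses a hypothetical helper, gc_delta, which reads jstat output on stdin and finds the YGC and FGC columns by header name, since the column layout differs between JDK versions.

```shell
# Reads `jstat -gc` (or `jstat -gcutil`) output on stdin and prints how many
# young and full collections happened between the first and last sample.
gc_delta() {
  awk '
    NR == 1 {                                  # header row: map column name to index
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("YGC" in col) || !("FGC" in col)) { bad = 1; exit }
      next
    }
    NF == 0 { next }                           # skip blank lines
    !seen { y0 = $(col["YGC"]); f0 = $(col["FGC"]); seen = 1 }
    { y1 = $(col["YGC"]); f1 = $(col["FGC"]); n++ }
    END {
      if (bad || !seen) exit 1
      printf "samples=%d young_gcs=%d full_gcs=%d\n", n, y1 - y0, f1 - f0
    }
  '
}
```

For example, jstat -gc <pid> 1000 10 | gc_delta summarizes ten one-second samples. A steadily rising full_gcs value over such a short window is usually the signal worth investigating first.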
Context Switches
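Use vmstat 1 and read the cs column (system-wide context switches per second), or watch a single process with pidstat -w -p <pid> 1, which reports voluntary (cswch/s) and involuntary (nvcswch/s) switches.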
Disk
Check filesystem space with df -hl and disk performance with iostat -d -k -x. Identify the responsible thread using iotop, which lists thread IDs, then map the thread ID back to its process with readlink -f /proc/*/task/<tid>/../.. . Inspect the process's I/O counters with cat /proc/<pid>/io and its open files with lsof -p <pid>.
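As an alternative to the readlink lookup, the owning process of a thread can be read from the Tgid field of the thread's status file on Linux. This is a small sketch: tid_to_pid is a made-up helper name, and the optional second argument exists only so the function can be pointed at a different proc directory when testing.

```shell
# Print the PID (thread group ID) that owns a Linux thread ID.
# $1 = thread ID
# $2 = proc mount point (illustrative override, defaults to /proc)
tid_to_pid() {
  awk '/^Tgid:/ { print $2 }' "${2:-/proc}/$1/status"
}
```

With the PID in hand, cat /proc/<pid>/io and lsof -p <pid> show what the process is reading and writing.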
Memory
Start with free to view overall memory usage. Common failures are OutOfMemoryError (OOM) variants and StackOverflowError:
Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread – the JVM could not obtain another OS thread, often because of a thread‑pool leak; fix the leak first, then consider reducing -Xss or raising OS limits (for example the per-user limit shown by ulimit -u).
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space – indicates heap exhaustion; look for leaks with jstack / jmap before raising -Xmx.
Caused by: java.lang.OutOfMemoryError: Metaspace – Metaspace overflow; tune -XX:MaxMetaspaceSize.
Exception in thread "main" java.lang.StackOverflowError – the thread stack was exhausted, usually by runaway or very deep recursion; check the code before raising -Xss.
Generate heap dumps with jmap -dump:format=b,file=heap.hprof <pid> and analyze them in MAT (Memory Analyzer Tool). For a quick look without a dump, jmap -histo:live <pid> prints a class histogram of live objects (it triggers a full GC first). Enable native memory tracking at JVM startup with -XX:NativeMemoryTracking=summary (or =detail), record a baseline with jcmd <pid> VM.native_memory baseline, and compare later with jcmd <pid> VM.native_memory detail.diff.
Off‑Heap Memory
Detect off‑heap growth using pmap -x <pid> | sort -rn -k3 | head -30, which lists the mappings with the largest resident size (RSS, the third column). For suspicious regions, dump memory with
gdb --batch --pid <pid> -ex "dump memory dump.bin <addr> <addr+size>" and inspect it via hexdump -C dump.bin. If the growth comes from direct (NIO) buffers, cap it with -XX:MaxDirectMemorySize.
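pmap -x prints each mapping's start address in hex without a 0x prefix and its size in KiB, while gdb's dump memory expects a start and an end address. The helper below (gdb_dump_range is a hypothetical name) is a minimal sketch of that arithmetic.

```shell
# Print "START END" in 0x-prefixed hex for gdb's `dump memory`.
# $1 = mapping start address in hex, without 0x (first pmap -x column)
# $2 = mapping size in KiB (second pmap -x column)
gdb_dump_range() {
  printf '0x%x 0x%x\n' "$((0x$1))" "$((0x$1 + $2 * 1024))"
}
```

The printed pair can be pasted in place of <addr> <addr+size> in the gdb command above.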
GC Issues
Enable detailed GC logging with
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps on JDK 8; JDK 9 and later replaced these flags with unified logging (for example -Xlog:gc*). Analyze Young GC frequency (adjust -Xmn, -XX:SurvivorRatio) and Full GC triggers (e.g., concurrent mode failures, promotion failures, large object allocation failures, explicit System.gc()). To compare heap contents around a Full GC, enable dumps at runtime with jinfo -flag +HeapDumpBeforeFullGC <pid> and jinfo -flag +HeapDumpAfterFullGC <pid>.
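Once a GC log exists, a quick tally of why Full GCs happened helps pick the right trigger to chase. The sketch below assumes log lines contain either "Full GC (<cause>)" (the JDK 8 style) or "Pause Full (<cause>)" (the unified logging style); exact formats vary by collector and JDK version, and full_gc_causes is a made-up helper name.

```shell
# Tally Full GC causes from one or more GC log files (or stdin).
full_gc_causes() {
  awk '
    {
      start = 0
      if ((p = index($0, "Full GC (")) > 0)         start = p + length("Full GC (")
      else if ((p = index($0, "Pause Full (")) > 0) start = p + length("Pause Full (")
      if (!start) next
      # Walk to the matching ")" so causes such as System.gc() stay intact.
      depth = 1; cause = ""
      for (i = start; i <= length($0); i++) {
        c = substr($0, i, 1)
        if (c == "(") depth++
        else if (c == ")" && --depth == 0) break
        cause = cause c
      }
      count[cause]++
    }
    END { for (k in count) printf "%d %s\n", count[k], k }
  ' "$@" | sort -rn
}
```

Running full_gc_causes gc.log prints one line per cause with its count, most frequent first.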
Network
Network problems are complex; common categories include timeouts, TCP queue overflow, RST packets, TIME_WAIT, and CLOSE_WAIT.
Timeouts
Distinguish between connection timeout, read/write timeout, and pool‑related timeouts; ensure client timeout < server timeout.
TCP Queue Overflow
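Monitor the SYN and accept queues with netstat -s | egrep "listen|LISTEN" and ss -lnt; for listening sockets, ss shows the current accept‑queue length in Recv-Q and the configured backlog in Send-Q. Adjust the kernel parameters net.core.somaxconn and net.ipv4.tcp_max_syn_backlog, and the servlet container settings (acceptCount for Tomcat, acceptQueueSize for Jetty).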
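A quick way to spot a saturated listener is to compare the two queue columns of ss -lnt. The sketch below is a heuristic, not a definitive check: accept_queue_pressure is a hypothetical helper, and it assumes the usual column order State, Recv-Q, Send-Q, Local Address:Port.

```shell
# Reads `ss -lnt` output (stdin or files) and prints listening sockets whose
# accept queue (Recv-Q) has reached the configured backlog (Send-Q).
accept_queue_pressure() {
  awk '
    $1 == "LISTEN" && $3 > 0 && $2 >= $3 {
      printf "%s accept_queue=%d backlog=%d\n", $4, $2, $3
    }
  ' "$@"
}
```

For example, ss -lnt | accept_queue_pressure prints nothing on a healthy host. Any output names a listener whose application is not calling accept() fast enough or whose backlog is too small.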
RST Packets
RST indicates abnormal connection termination; capture traffic with tcpdump -i en0 tcp -w capture.cap (replace en0 with the host's interface name, e.g. eth0) and analyze the capture in Wireshark.
TIME_WAIT & CLOSE_WAIT
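Check counts via
netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}' or ss -ant. To reduce TIME_WAIT buildup on a host that opens many outbound connections, set net.ipv4.tcp_tw_reuse=1; avoid net.ipv4.tcp_tw_recycle=1, which breaks clients behind NAT and was removed in Linux 4.12. CLOSE_WAIT means the peer has closed the connection but the local application has not yet called close(); investigate lingering CLOSE_WAIT sockets with thread dumps to find the code path that is stuck or leaking the connection.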
Overall, systematic use of the above commands and analysis tools helps quickly locate and resolve production‑level Java issues.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
