How to Diagnose and Fix Java Service Failures: CPU, Memory, Disk, GC & Network
This guide walks through a systematic approach to troubleshooting Java service outages, covering CPU, disk, memory, GC, and network problems, and demonstrates how to use tools such as ps, top, jstack, iostat, jstat, netstat, ss, and various dump commands to pinpoint and resolve the root causes.
CPU
Start by finding the Java process ID with ps, then run top -H -p <pid> to identify the threads using the most CPU. Convert the busy thread's ID to hexadecimal with printf '%x\n' <tid>; the result is the nid that appears in the thread dump, so jstack <pid> | grep '<nid>' -C5 locates the corresponding stack trace. Beyond the single hot thread, look at how many threads sit in the WAITING and TIMED_WAITING states in the jstack output, and use cat jstack.log | grep "java.lang.Thread.State" | sort -nr | uniq -c to get an overall view of thread states.
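The lookup described above can be wrapped in a few small shell functions. This is a minimal sketch, not a definitive tool: the function names are made up for illustration, and it assumes jstack prints the nid in hexadecimal (as JDK 8 does; other JDK versions may format it differently).

```shell
#!/usr/bin/env bash
# Hypothetical helper names; only printf, jstack, grep, awk, sort and uniq are real tools.

# Convert a decimal thread ID (as shown by `top -H`) to the hex nid form.
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# Print the stack of one thread from a live JVM: stack_for_tid <pid> <tid>
# Assumes jstack prints "nid=0x..." followed by a space.
stack_for_tid() {
  local pid="$1" tid="$2"
  jstack "$pid" | grep -A 20 "nid=$(tid_to_nid "$tid") "
}

# Count threads per state in a saved dump, most common state first:
# thread_state_summary jstack.log
thread_state_summary() {
  grep 'java.lang.Thread.State' "$1" | awk '{ print $2 }' | sort | uniq -c | sort -nr
}
```

Extracting only the state name with awk groups lines such as "WAITING (parking)" and "WAITING (on object monitor)" under a single WAITING count, which makes the summary easier to scan.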
ps -ef | grep java
top -H -p 12345
printf '%x\n' 12346   # 12346: a busy thread ID from top -H; prints 303a
jstack 12345 | grep '0x303a' -C5
Disk
Check disk space with df -hl. For performance issues, run iostat -d -k -x to see per-device statistics such as %util (device utilization) and rrqm/s and wrqm/s (merged read and write requests per second). Identify the process responsible for the I/O with iotop, and list the files it has open with lsof -p <pid>. If you only have a thread ID, map it to its process ID with readlink -f /proc/*/task/<tid>/../.., then inspect the process's I/O counters via cat /proc/<pid>/io.
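The counters in /proc/<pid>/io are cumulative, so a rate only appears when you sample twice. The sketch below shows one way to do that; the function names are hypothetical, and it assumes a Linux /proc filesystem and a whole-second interval.

```shell
#!/usr/bin/env bash
# Hypothetical helper names, for illustration only.

# Read one counter (for example read_bytes or write_bytes) from a file
# in /proc/<pid>/io format: io_counter <file> <counter>
io_counter() {
  awk -v key="$2:" '$1 == key { print $2 }' "$1"
}

# Sample a process twice and print bytes written to storage per second:
# write_rate <pid> [interval_seconds]
write_rate() {
  local pid="$1" interval="${2:-5}" before after
  before=$(io_counter "/proc/$pid/io" write_bytes)
  sleep "$interval"
  after=$(io_counter "/proc/$pid/io" write_bytes)
  echo $(( (after - before) / interval ))
}
```

For example, write_rate 12345 10 would report the average write rate of process 12345 over ten seconds.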
df -hl
iostat -d -k -x
iotop -o
lsof -p 12345
readlink -f /proc/*/task/12345/../..
cat /proc/12345/io
Memory
Use free to get an overview of memory usage. Common OOM scenarios include native thread exhaustion (java.lang.OutOfMemoryError: unable to create new native thread), Java heap exhaustion (Java heap space), and Metaspace overflow (Metaspace). For native thread OOM, use jstack or jmap to see how many threads exist and what they are doing, and adjust -Xss if needed (a smaller stack size lets the process create more threads). For heap OOM, run jstat -gc <pid> 1000 to monitor generation statistics, and consider increasing -Xmx. Use jmap -dump:format=b,file=heap.hprof <pid> to generate a heap dump, then analyze it with MAT (Eclipse Memory Analyzer), focusing on Leak Suspects or Top Consumers. Enable automatic heap dumps with -XX:+HeapDumpOnOutOfMemoryError and specify the dump path via -XX:HeapDumpPath.
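When watching jstat -gc for heap exhaustion, the number that usually matters is how full the old generation is after each sample. A small filter can compute it from the OC (old capacity) and OU (old used) columns. This is a sketch with a hypothetical function name; it locates the columns by header so it does not depend on their position.

```shell
#!/usr/bin/env bash
# Hypothetical helper name, for illustration only.

# Read `jstat -gc` output on stdin and print old-generation occupancy
# (OU / OC) as a percentage, one line per sample.
old_gen_pct() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names to positions
    col["OC"] && $(col["OC"]) > 0 { printf "%.1f\n", 100 * $(col["OU"]) / $(col["OC"]) }
  '
}
```

A typical invocation would be jstat -gc 12345 1000 5 | old_gen_pct. A value that keeps climbing across Full GCs and never drops is a common sign of a leak worth confirming with a heap dump.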
free -m
jstat -gc 12345 1000
jmap -dump:format=b,file=heap.hprof 12345
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/dumps
GC Issues
Enable detailed GC logging with -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps. These are JDK 8 flags; JDK 9 and later replaced them with unified logging (-Xlog:gc*). Analyze Young GC frequency and duration using jstat; if Young GC is too frequent, consider tuning -Xmn and -XX:SurvivorRatio. For long GC pauses, examine G1 log phases such as Root Scanning, Object Copy, and Ref Proc to identify bottlenecks. If Full GC occurs often, check for concurrent phase failures, promotion failures, or large object allocation failures, and adjust parameters like -XX:G1ReservePercent, -XX:InitiatingHeapOccupancyPercent, or -XX:G1HeapRegionSize. Use jinfo -flag +HeapDumpBeforeFullGC <pid> and jinfo -flag +HeapDumpAfterFullGC <pid> to capture dumps before and after Full GC.
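Young GC frequency and duration can be read off consecutive jstat -gc samples, because YGC (young collection count) and YGCT (total young collection time in seconds) are cumulative. The filter below is a rough sketch with a hypothetical function name; it prints, for each sample after the first, how many young collections happened since the previous sample and their mean pause in milliseconds.

```shell
#!/usr/bin/env bash
# Hypothetical helper name, for illustration only.

# Read `jstat -gc` samples on stdin; for every sample after the first, print
# "<young GCs since previous sample> <mean pause in ms>".
young_gc_deltas() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names to positions
    seen {
      n = $(col["YGC"]) - ygc
      t = $(col["YGCT"]) - ygct
      printf "%d %.1f\n", n, (n > 0 ? 1000 * t / n : 0)
    }
    { ygc = $(col["YGC"]); ygct = $(col["YGCT"]); seen = 1 }
  '
}
```

With jstat -gc 12345 1000 10 | young_gc_deltas, each output line covers one second, so the first column reads directly as young collections per second.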
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
jstat -gc 12345 1000
-XX:G1ReservePercent=20
-XX:InitiatingHeapOccupancyPercent=45
jinfo -flag +HeapDumpBeforeFullGC 12345
jinfo -flag +HeapDumpAfterFullGC 12345
Network
Network problems often manifest as timeouts, TCP queue overflows, or RST packets. Distinguish between connection timeout and read/write timeout, and keep the client timeout lower than the server timeout. Use netstat -s | egrep "listen|LISTEN" to see queue overflow counts, and ss -lnt to view listen sockets and their backlog sizes. Adjust kernel parameters such as net.ipv4.tcp_tw_reuse, net.ipv4.tcp_tw_recycle, and net.ipv4.tcp_max_syn_backlog to mitigate TIME_WAIT and SYN queue issues. Note that net.ipv4.tcp_tw_recycle breaks clients behind NAT and was removed in Linux 4.12, so it is only an option on older kernels. Capture RST traffic with tcpdump -i <iface> tcp -w dump.cap and analyze the capture with Wireshark.
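For sockets in the LISTEN state, ss -lnt reports the current accept queue length in the Recv-Q column and the configured backlog in Send-Q, so a full queue shows up as Recv-Q reaching Send-Q. The sketch below flags such listeners; the function name is hypothetical, and it assumes the default ss column order (State, Recv-Q, Send-Q, Local Address:Port).

```shell
#!/usr/bin/env bash
# Hypothetical helper name, for illustration only.

# Read `ss -lnt` output on stdin and print listeners whose accept queue
# (Recv-Q) has reached the backlog limit (Send-Q).
full_accept_queues() {
  awk 'NR > 1 && $3 > 0 && $2 >= $3 { print $4, "queue=" $2 "/" $3 }'
}
```

Run it as ss -lnt | full_accept_queues. Any output suggests the application is accepting connections more slowly than they arrive, or that the backlog is too small for the load.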
netstat -s | egrep "listen|LISTEN"
ss -lnt
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_tw_recycle=1   # kernels before 4.12 only; unsafe for clients behind NAT
sysctl -w net.ipv4.tcp_max_syn_backlog=1024
tcpdump -i eth0 tcp -w dump.cap
TIME_WAIT and CLOSE_WAIT
TIME_WAIT ensures that a connection terminates cleanly and that delayed packets from an old connection are not mistaken for data on a new one. Excessive TIME_WAIT can be reduced by enabling net.ipv4.tcp_tw_reuse and, on kernels older than 4.12 only, net.ipv4.tcp_tw_recycle (which is unsafe for clients behind NAT). CLOSE_WAIT usually indicates that the application failed to close its socket after receiving the peer's FIN; investigate with jstack to find threads blocked on I/O or synchronization primitives.
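Before tuning anything, it helps to know how many connections are actually in each state. The following sketch tallies the State column of ss -ant; the function name is hypothetical, and it relies on ss spelling the states with hyphens (TIME-WAIT, CLOSE-WAIT).

```shell
#!/usr/bin/env bash
# Hypothetical helper name, for illustration only.

# Read `ss -ant` output on stdin and print "<count> <state>",
# most common state first.
tcp_state_counts() {
  awk 'NR > 1 { count[$1]++ } END { for (s in count) print count[s], s }' | sort -nr
}
```

Run it as ss -ant | tcp_state_counts. A CLOSE-WAIT count that grows steadily points at the application rather than the kernel, since only the application can close those sockets.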
net.ipv4.tcp_tw_reuse = 1
# kernels before 4.12 only; unsafe for clients behind NAT
net.ipv4.tcp_tw_recycle = 1
jstack 12345 | grep -i "countdownlatch.await"
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
