
How to Diagnose and Fix Java Service Failures: CPU, Memory, Disk, GC & Network

This guide walks through a systematic approach to troubleshooting Java service outages, covering CPU, disk, memory, GC, and network problems, and demonstrates how to use tools such as ps, top, jstack, iostat, jstat, netstat, ss, and various dump commands to pinpoint and resolve the root causes.

Programmer DD

CPU

Start with ps to find the Java process ID, then run top -H -p <pid> to see which threads in that process are using the most CPU. Convert the busy thread's ID (not the process ID) to hexadecimal with printf '%x\n' <tid>; the result is the nid that appears in the thread dump, so jstack <pid> | grep 'nid=0x<hex>' -C5 prints the matching stack trace. Beyond the hot thread itself, look at how many threads sit in the WAITING, TIMED_WAITING, and BLOCKED states. To get an overall view of thread states from a saved dump, use

grep "java.lang.Thread.State" jstack.log | sort | uniq -c | sort -nr

which lists each state with the number of threads in it, most frequent first.
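The thread-state count can also be wrapped in a small helper that reads a thread dump from standard input. This is a minimal sketch: the function name thread_state_summary is made up for illustration, and it assumes each state line has the form "java.lang.Thread.State: <STATE> ...", as in typical jstack output.

```shell
# Count threads per state from a jstack dump on stdin, most frequent first.
# thread_state_summary is a hypothetical helper name.
thread_state_summary() {
  # The second whitespace-separated field of a state line is the state name.
  awk '/java.lang.Thread.State/ { print $2 }' | sort | uniq -c | sort -nr
}

# Example usage (12345 is a placeholder PID):
#   jstack 12345 | thread_state_summary
#   thread_state_summary < jstack.log
```

Keeping only the state name drops the parenthesized detail such as "(parking)", so all WAITING threads are grouped together.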

ps -ef | grep java
top -H -p 12345
printf '%x\n' 12358
jstack 12345 | grep 'nid=0x3046' -C5
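The thread-ID-to-stack lookup is easy to get wrong by hand, so it may help to script it. The sketch below assumes jstack prints the native thread ID in hexadecimal as nid=0x..., which is the format the grep in this section relies on. The function names tid_to_nid and thread_stack are hypothetical.

```shell
# Convert a decimal thread ID (as shown by top -H) to the nid form used in the dump.
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# Print the stack of one thread. Arguments: <pid> <tid>.
# Assumes the jstack header line contains "nid=0x<hex> " followed by the thread status.
thread_stack() {
  jstack "$1" | grep -A 20 "nid=$(tid_to_nid "$2") "
}

# Example usage (placeholder IDs):
#   thread_stack 12345 12358
```

The trailing space in the grep pattern keeps nid=0x30 from also matching nid=0x3046.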

Disk

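Check disk space with df -hl. For performance issues, run iostat -d -k -x and look at %util (how busy the device is) together with rrqm/s and wrqm/s (read and write requests merged per second). Identify which thread is generating the I/O with iotop, which reports thread IDs (TID) by default. Convert a thread ID to its process ID with readlink -f /proc/*/task/<tid>/../.., then inspect that process's I/O counters with cat /proc/<pid>/io and list the files it has open with lsof -p <pid>.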

df -hl
iostat -d -k -x
iotop -o
readlink -f /proc/*/task/12358/../..
cat /proc/12345/io
lsof -p 12345
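The raw counters in /proc/<pid>/io are in bytes and are cumulative since the process started. As a rough sketch, a helper can reduce them to megabytes read from and written to storage. The name io_summary is hypothetical, and the helper assumes the usual "read_bytes: N" and "write_bytes: N" lines.

```shell
# Summarize a /proc/<pid>/io-formatted stream on stdin as whole megabytes.
# io_summary is a hypothetical helper name.
io_summary() {
  awk -F': ' '
    $1 == "read_bytes"  { r = $2 }
    $1 == "write_bytes" { w = $2 }
    END { printf "read_mb=%d write_mb=%d\n", r / 1048576, w / 1048576 }
  '
}

# Example usage (placeholder PID; reading another user's process needs privileges):
#   io_summary < /proc/12345/io
```

Running it twice a few seconds apart and comparing the numbers gives a crude per-process throughput estimate.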

Memory

Use free to get an overview of memory usage. Common OOM scenarios are native thread exhaustion (java.lang.OutOfMemoryError: unable to create new native thread), Java heap exhaustion (Java heap space), and Metaspace exhaustion (Metaspace). For native thread OOM, use jstack or jmap to check whether the application is creating too many threads, and reduce the per-thread stack size with -Xss if needed. For heap OOM, run jstat -gc <pid> 1000 to sample generation statistics every 1000 ms; if the heap is simply too small rather than leaking, consider increasing -Xmx. Use jmap -dump:format=b,file=heap.hprof <pid> to generate a heap dump, then analyze it with MAT (Eclipse Memory Analyzer), focusing on the Leak Suspects and Top Consumers reports. Enable automatic heap dumps with -XX:+HeapDumpOnOutOfMemoryError and set the dump location with -XX:HeapDumpPath.

free -m
jstat -gc 12345 1000
jmap -dump:format=b,file=heap.hprof 12345
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/dumps
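When watching jstat -gc output, the old generation trend is usually the most telling signal for a heap leak. The sketch below turns the OC and OU columns (old generation capacity and usage, in KB) into a percentage per sample. It looks columns up by header name because the column set differs between JDK versions. The name old_gen_usage is hypothetical.

```shell
# Print old-generation occupancy (percent) for each sample of jstat -gc output on stdin.
# old_gen_usage is a hypothetical helper name.
old_gen_usage() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header name -> column index
    NF > 0 && $(col["OC"]) > 0 { printf "%.1f\n", 100 * $(col["OU"]) / $(col["OC"]) }
  '
}

# Example usage (placeholder PID): ten samples, one per second.
#   jstat -gc 12345 1000 10 | old_gen_usage
```

A value that keeps climbing across Full GCs, rather than dropping back after each one, is a hint to take a heap dump and look for a leak.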

GC Issues

Enable detailed GC logging with

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps

These are the JDK 8 flags; JDK 9 and later replaced them with unified logging (-Xlog:gc*). Analyze Young GC frequency and duration using jstat; if Young GC is too frequent, consider tuning -Xmn and -XX:SurvivorRatio. For long GC pauses, examine G1 log phases such as Root Scanning, Object Copy, and Ref Proc to identify the bottleneck. If Full GC occurs often, check for concurrent mode failure, promotion (evacuation) failure, or humongous object allocation failure, and adjust parameters like -XX:G1ReservePercent, -XX:InitiatingHeapOccupancyPercent, or -XX:G1HeapRegionSize. Use jinfo -flag +HeapDumpBeforeFullGC <pid> and jinfo -flag +HeapDumpAfterFullGC <pid> to capture heap dumps before and after a Full GC.
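Because the logging flags depend on the JDK version, a startup script may want to pick them based on the major version. The following is a sketch under stated assumptions: gc_log_opts is a hypothetical helper, the caller passes the major version explicitly, and the default log file name gc.log is only an example.

```shell
# Print GC logging options for a given JDK major version.
# Arguments: <major-version> [log-file]. gc_log_opts is a hypothetical helper name.
gc_log_opts() {
  file="${2:-gc.log}"
  if [ "$1" -le 8 ]; then
    # Legacy flags, as used in this article, plus a log file target.
    echo "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -Xloggc:${file}"
  else
    # Unified logging, available from JDK 9 onward.
    echo "-Xlog:gc*:file=${file}:time,uptime,level,tags"
  fi
}

# Example usage (app.jar is a placeholder):
#   java $(gc_log_opts 17 /tmp/gc.log) -jar app.jar
```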

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
jstat -gc 12345 1000
-XX:G1ReservePercent=20
-XX:InitiatingHeapOccupancyPercent=45
jinfo -flag +HeapDumpBeforeFullGC 12345
jinfo -flag +HeapDumpAfterFullGC 12345
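To judge Young GC duration from jstat, one option is to divide the cumulative Young GC time (YGCT, in seconds) by the Young GC count (YGC). The sketch below does that for the last sample of jstat -gc output. The name young_gc_avg_ms is hypothetical, and the result is an average since JVM start, so it can hide a recent regression.

```shell
# Print the average Young GC pause in milliseconds from jstat -gc output on stdin.
# young_gc_avg_ms is a hypothetical helper name; it uses the last sample it sees.
young_gc_avg_ms() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header name -> column index
    NF > 0  { ygc = $(col["YGC"]); ygct = $(col["YGCT"]) }
    END     { if (ygc > 0) printf "%.1f\n", 1000 * ygct / ygc; else print "n/a" }
  '
}

# Example usage (placeholder PID): take five samples, report on the last one.
#   jstat -gc 12345 1000 5 | young_gc_avg_ms
```

Comparing the YGC count between two samples taken a known interval apart gives the Young GC frequency in the same way.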

Network

Network problems often manifest as timeouts, TCP queue overflows, or RST packets. Distinguish between connection timeouts and read/write timeouts, and keep the client timeout lower than the server timeout. Use netstat -s | egrep "listen|LISTEN" to see how often the listen queues have overflowed, and ss -lnt to view listening sockets; for a listening socket, Recv-Q is the current accept queue length and Send-Q is the configured backlog. Adjust kernel parameters such as net.ipv4.tcp_tw_reuse and net.ipv4.tcp_max_syn_backlog to mitigate TIME_WAIT and SYN queue issues. net.ipv4.tcp_tw_recycle is often listed alongside them, but it breaks clients behind NAT and was removed in Linux 4.12, so avoid it. To investigate RST packets, capture traffic with tcpdump -i <iface> tcp -w dump.cap and analyze the capture in Wireshark.

netstat -s | egrep "listen|LISTEN"
ss -lnt
sysctl -w net.ipv4.tcp_tw_reuse=1
# sysctl -w net.ipv4.tcp_tw_recycle=1   (removed in Linux 4.12; unsafe for clients behind NAT)
sysctl -w net.ipv4.tcp_max_syn_backlog=1024
tcpdump -i eth0 tcp -w dump.cap
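Reading ss -lnt by eye gets tedious on hosts with many listeners. As a sketch, the helper below flags listening sockets whose accept queue has reached the backlog. The name full_accept_queues is hypothetical, and the column positions (State, Recv-Q, Send-Q, Local Address:Port) are assumed to match the usual ss layout.

```shell
# From ss -lnt output on stdin, print listeners whose accept queue is at or over the backlog.
# Output format: <local address:port> <Recv-Q>/<Send-Q>. full_accept_queues is a hypothetical name.
full_accept_queues() {
  awk '$1 == "LISTEN" && $3 + 0 > 0 && $2 + 0 >= $3 + 0 { print $4, $2 "/" $3 }'
}

# Example usage:
#   ss -lnt | full_accept_queues
#
# To capture only RST packets instead of all TCP traffic (eth0 is a placeholder):
#   tcpdump -i eth0 'tcp[tcpflags] & tcp-rst != 0' -w rst.cap
```

An empty result only means no queue is full at this instant; the overflow counters from netstat -s show whether it has happened in the past.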

TIME_WAIT and CLOSE_WAIT

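TIME_WAIT ensures proper connection termination and keeps stray packets from an old connection out of a new one. Excessive TIME_WAIT on the side that initiates connections can be reduced by enabling net.ipv4.tcp_tw_reuse; avoid net.ipv4.tcp_tw_recycle, which breaks clients behind NAT and was removed in Linux 4.12. CLOSE_WAIT usually indicates that the application failed to close its socket after receiving the peer's FIN; investigate with jstack to find threads blocked on I/O or on synchronization primitives such as CountDownLatch.await.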

net.ipv4.tcp_tw_reuse = 1
# net.ipv4.tcp_tw_recycle = 1   (removed in Linux 4.12; unsafe for clients behind NAT)
jstack 12345 | grep -i "countdownlatch.await"
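Before tuning anything, it helps to know how many connections are actually in each state. The sketch below counts connections per TCP state from ss -tan output. The name tcp_state_counts is hypothetical, and it assumes the state is the first column after a single header line (ss prints states as TIME-WAIT and CLOSE-WAIT, with hyphens).

```shell
# Count connections per TCP state from ss -tan output on stdin, most frequent first.
# tcp_state_counts is a hypothetical helper name.
tcp_state_counts() {
  awk 'NR > 1 { count[$1]++ } END { for (s in count) print count[s], s }' | sort -nr
}

# Example usage:
#   ss -tan | tcp_state_counts
```

A large and growing CLOSE-WAIT count points at the application, while a large TIME-WAIT count is mostly a concern when the host runs out of local ports.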
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
