How to Diagnose and Fix Java CPU, Memory, Disk, and Network Issues Quickly
This guide walks through systematic troubleshooting of Java applications by checking CPU, disk, memory, and network layers, using tools like jstack, jmap, vmstat, iostat, and tcpdump to pinpoint and resolve performance and stability problems.
Online incidents usually involve CPU, disk, memory, and network problems, and most issues span multiple layers, so a systematic check of these four aspects is recommended.
Tools like jstack and jmap are not limited to a single aspect; typically you start with `df`, `free`, and `top`, then use `jstack` and `jmap` as needed.
CPU
First check CPU-related problems, which are usually easier to locate. Causes include business logic loops, frequent GC, and excessive context switches. The most common cause is business or framework logic, which can be analyzed with jstack.
Analyzing CPU issues with jstack
Find the process ID with `ps` (or `top`, which also shows high CPU usage directly). Then list the busiest threads:

`top -H -p <pid>`

Convert the hot thread's ID to hexadecimal:

`printf '%x\n' <tid>`

and search for that hex value in the jstack output:

`jstack <pid> | grep 'nid=0x<hex-tid>' -C5 --color`

Also watch for threads stuck in WAITING or TIMED_WAITING states.
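As a worked example of the thread-ID-to-hex step (the thread ID 12345 here is just a placeholder):

```shell
# Suppose top -H -p <pid> showed a hot thread with TID 12345.
# jstack prints thread IDs in hex in its nid= field, so convert first:
tid=12345
nid=$(printf '%x' "$tid")
echo "$nid"    # -> 3039
# Then locate that thread's stack in the jstack output:
# jstack <pid> | grep "nid=0x$nid" -C5 --color
```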
For an overview of all thread states:

`cat jstack.log | grep "java.lang.Thread.State" | sort | uniq -c | sort -nr`

Frequent GC
Use `jstat -gc <pid> 1000` to monitor GC activity every second. The S0C/S1C, S0U/S1U, EC/EU, OC/OU, and MC/MU columns show the capacity and usage of the Survivor spaces, Eden, the old generation, and metaspace; YGC/YGCT, FGC/FGCT, and GCT show GC counts and cumulative times. If GC runs too frequently, investigate further.
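A quick way to turn those columns into a number is to compute the average young-GC pause as YGCT/YGC. The sketch below assumes JDK 8's `jstat -gc` column layout (YGC is the 13th field, YGCT the 14th); the sample line is fabricated for an offline demo:

```shell
# Live usage would be:
#   jstat -gc <pid> 1000 | tail -1 | awk '{printf "%.1f ms/YGC\n", $14/$13*1000}'
# Offline demo with a captured-style sample line (200 young GCs, 1.5 s total):
sample='5120.0 5120.0 0.0 1024.0 40960.0 20480.0 102400.0 51200.0 56320.0 54000.0 6144.0 5800.0 200 1.500 4 0.800 2.300'
echo "$sample" | awk '{printf "%.1f ms/YGC\n", $14/$13*1000}'   # -> 7.5 ms/YGC
```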
Context switches
Check context switches with `vmstat`; the `cs` column shows switches per second. To monitor a specific process, use `pidstat -w -p <pid> 1`, where the `cswch/s` and `nvcswch/s` columns report voluntary and involuntary switches.
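The same per-process counters are also exposed directly in `/proc`, which is handy when pidstat is not installed (a Linux-only sketch, demonstrated on the current shell):

```shell
# Cumulative context-switch counters for a given PID (here: this shell itself).
pid=$$
grep ctxt_switches "/proc/$pid/status"
# voluntary_ctxt_switches grows when the task blocks (I/O, locks);
# nonvoluntary_ctxt_switches grows when the scheduler preempts it.
```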
Disk
Check disk space with `df -hl`. For performance issues, use `iostat -d -k -x`: the `%util` column shows device utilization, while `r/s`/`w/s` and `rkB/s`/`wkB/s` show read/write request and throughput rates, helping locate the problematic disk.
Identify the process performing heavy I/O with `iotop`. Convert the thread ID (tid) it reports to the owning process directory via:

`readlink -f /proc/*/task/<tid>/../..`

Then inspect the process's I/O counters with `cat /proc/<pid>/io` and its open files with `lsof -p <pid>`.
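The `readlink` trick can be demonstrated on the current shell, since a process's main thread has a TID equal to its PID:

```shell
# /proc/<pid>/task/<tid>/../.. resolves back to /proc/<pid> for whichever
# process owns that thread; the glob finds the owner for us.
tid=$$
proc_dir=$(readlink -f /proc/*/task/"$tid"/../..)
echo "$proc_dir"                  # -> /proc/<pid> owning that thread
head -2 "$proc_dir/io"            # rchar/wchar: bytes read/written so far
```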
Memory
Memory issues are more varied. Start with `free` to view overall usage.
Heap memory
Common OOM errors include:
- `java.lang.OutOfMemoryError: unable to create new native thread` – insufficient native memory for thread stacks; check thread pool usage and consider reducing `-Xss` or raising OS limits.
- `java.lang.OutOfMemoryError: Java heap space` – the heap reached `-Xmx`; look for leaks with jmap before simply increasing the heap.
- `java.lang.OutOfMemoryError: Metaspace` – metaspace reached its limit; adjust `-XX:MaxMetaspaceSize`.
- `java.lang.StackOverflowError` – a thread's stack exceeded `-Xss`; rule out unbounded recursion in the code before raising `-Xss`.
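For the "unable to create new native thread" case, a few quick OS-side checks narrow things down (a Linux sketch, demonstrated on the current shell):

```shell
# Quick checks when "unable to create new native thread" appears:
ulimit -u                           # per-user process/thread limit
cat /proc/sys/kernel/threads-max    # system-wide thread cap
ls /proc/$$/task | wc -l            # live thread count of a process (this shell)
# If the JVM's thread count is near either limit, fix the thread pool
# (or raise the limit) rather than touching heap settings.
```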
Using JMAP to locate leaks
Export a heap dump with:

`jmap -dump:format=b,file=heap.bin <pid>`

Analyze the dump with Eclipse MAT, focusing on the "Leak Suspects" or "Top Consumers" reports.
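A manual dump can miss the moment of failure. With HotSpot you can also arm an automatic dump at OOM time (standard HotSpot flags; `app.jar` is a placeholder for your application):

```shell
# Write a heap dump automatically the instant an OutOfMemoryError is thrown,
# then analyze /tmp/heap.hprof in MAT just like a manual jmap dump.
java -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/tmp/heap.hprof \
     -jar app.jar
```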
GC and threads
Monitor GC frequency with `jstat`. Excessive young GC may indicate an Eden space that is too small; adjust `-Xmn` or `-XX:SurvivorRatio`. Long GC pauses can be diagnosed from the GC logs (e.g., the G1 phases Root Scanning, Object Copy, and Ref Proc).
Full GC often signals problems such as concurrent-marking failures, promotion failures, or failed large-object allocations. Reduce explicit `System.gc()` calls, and consider capturing heap dumps around full GC with `-XX:+HeapDumpBeforeFullGC` and `-XX:+HeapDumpAfterFullGC`.
Network
Network issues are complex. Timeouts are divided into connection timeout, read/write timeout, and others (e.g., connectionAcquireTimeout, idleConnectionTimeout). Keep client timeouts shorter than server timeouts.
TCP queue overflow
Overflow can occur in either the SYN queue or the accept queue, leading to dropped connections or RST packets. Monitor with:

`netstat -s | egrep "listen|LISTEN"` and `ss -lnt`

Adjust queue sizes via the application's `backlog` (`acceptCount` in Tomcat) and the kernel parameters `somaxconn` and `tcp_max_syn_backlog`.
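A sketch of reading those signals on Linux (the live `ss`/`netstat` commands are shown as comments since they need a running listener to be interesting):

```shell
# For LISTEN sockets, ss's Send-Q column is the effective backlog limit and
# Recv-Q is the current accept-queue depth:
#   ss -lnt
# Cumulative overflow/drop counters since boot:
#   netstat -s | egrep -i 'listen'
# Kernel limits backing the backlog (raise via sysctl if overflows occur):
cat /proc/sys/net/core/somaxconn
cat /proc/sys/net/ipv4/tcp_max_syn_backlog
```

Note that the effective accept-queue size is the smaller of the application's `backlog` and `somaxconn`, so raising only one of them may have no effect.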
RST packets
RST indicates an abrupt connection reset, often caused by connecting to a closed port, forced termination, or a peer that has lost its TCP connection state. Capture the traffic with `tcpdump` and analyze it in Wireshark.
TIME_WAIT and CLOSE_WAIT
TIME_WAIT exists so that delayed packets from a closed connection are handled safely; excessive counts can be mitigated by enabling `net.ipv4.tcp_tw_reuse=1` or by adjusting `tcp_max_tw_buckets`. Avoid `net.ipv4.tcp_tw_recycle=1`: it breaks clients behind NAT and was removed in Linux 4.12.
CLOSE_WAIT usually results from applications not closing sockets properly; investigate with jstack to find blocked threads.
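To see at a glance whether TIME_WAIT or CLOSE_WAIT is piling up, a small pipeline over `ss` output works; here it is wrapped in a hypothetical helper (`count_states` is not a standard tool) with an offline demo on fabricated input:

```shell
# Summarize TCP connection states; feed it `ss -ant` (or `netstat -ant`) output.
count_states() {
  # Skip the header line, take the state column, count occurrences.
  awk 'NR > 1 { print $1 }' | sort | uniq -c | sort -rn
}
# Live usage:  ss -ant | count_states
# Offline demo with ss-style output:
printf 'State\nESTAB\nESTAB\nCLOSE-WAIT\nTIME-WAIT\nTIME-WAIT\nTIME-WAIT\n' | count_states
```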
Source: https://fredal.xin/java-error-check
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles on operations.