How to Diagnose and Resolve Common Java Server Performance Issues
This guide walks through systematic troubleshooting of Java server problems—including CPU spikes, memory leaks, disk bottlenecks, GC pauses, and network anomalies—by using tools such as jstack, jmap, jstat, vmstat, iostat, netstat, and ss to pinpoint root causes and apply targeted fixes.
Overview
Online incidents often involve CPU, disk, memory, and network problems; most issues span multiple layers, so a systematic four‑step investigation (CPU → Disk → Memory → Network) is recommended.
CPU
Start by checking CPU usage, since CPU anomalies are usually the easiest to locate. Common causes include busy loops in business logic, frequent GC, and excessive context switches; the most frequent culprit is problematic business or framework logic, which can be examined with jstack.
Using jstack to analyze CPU problems
Find the process PID with ps (or use top to see which process consumes the most CPU). Then run:

top -H -p pid

to identify high-CPU threads. Convert the thread ID to hexadecimal:
printf '%x\n' pid

The resulting nid is used to search the jstack output:
jstack pid | grep 'nid' -C5 --color

Pay particular attention to threads in WAITING or TIMED_WAITING states; BLOCKED threads are less common but worth noting. For a quick overview of thread states, run:

grep "java.lang.Thread.State" jstack.log | sort | uniq -c | sort -nr

Frequent GC
Use

jstat -gc pid 1000

to monitor GC generation changes (sampling interval 1000 ms). Columns S0C/S1C, S0U/S1U, EC/EU, OC/OU, and MC/MU represent the Survivor, Eden, Old, and Metaspace capacities and usage. YGC/YGCT, FGC/FGCT, and GCT show Young GC and Full GC counts and cumulative times. If GC appears too frequent, investigate further with a heap dump analysis.
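As a quick sanity check on the column layout, the Eden figures can be pulled out of a captured jstat line with awk. The sample values below are fabricated for illustration and assume the JDK 8 column order (S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT):

```shell
# Fabricated `jstat -gc <pid> 1000` data line (values in KB); JDK 8 column order assumed.
gcline='1024.0 1024.0 0.0 512.0 8192.0 4096.0 20480.0 10240.0 4480.0 4352.0 512.0 448.0 120 1.532 3 0.721 2.253'

# EC (Eden capacity) is field 5, EU (Eden used) is field 6:
echo "$gcline" | awk '{ printf "Eden utilisation: %.0f%%\n", $6 / $5 * 100 }'
```

If Eden utilisation climbs back toward 100% within a second or two of each Young GC, the allocation rate (or a too-small young generation) is the first thing to investigate.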
Context Switches
Inspect context switches with vmstat; the cs column shows the number of switches. To monitor a specific PID, use:

pidstat -w pid

The cswch/s and nvcswch/s columns indicate voluntary and involuntary context switches.
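To see whether switch rates are trending up, the cs column of a vmstat run can be averaged with awk. The two data lines below are fabricated for illustration; in vmstat's default layout cs is the 12th field:

```shell
# Fabricated two-sample `vmstat 1` capture; cs (context switches/s) is field 12.
vm='procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812345  20480 409600    0    0     5    10  200 1500  3  1 95  1  0
 2  0      0 812300  20480 409610    0    0     0     8  220 1800  4  2 93  1  0'

# Skip the two header lines and average the context-switch column:
echo "$vm" | awk 'NR > 2 { sum += $12; n++ } END { print sum / n }'
```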
Disk
Disk issues are also fundamental. Check disk space with:

df -hl

Performance problems can be diagnosed with:
iostat -d -k -x

The %util column shows how busy each device is, while rkB/s and wkB/s give the read/write throughput (rrqm/s and wrqm/s count merged read/write requests, not speeds), helping locate the problematic disk. Identify the responsible process with
iotop, or by converting a thread ID to its PID via

readlink -f /proc/*/task/tid/../..

then inspect its I/O with:

cat /proc/pid/io

List open files with lsof -p pid.
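The counters in /proc/<pid>/io are cumulative bytes; the two that matter for disk debugging are read_bytes and write_bytes, which reflect what actually hit the block layer (rchar/wchar include cached I/O). The file contents below are fabricated for illustration:

```shell
# Fabricated /proc/<pid>/io contents; counters are cumulative bytes.
io='rchar: 3002112
wchar: 1024000
syscr: 120
syscw: 85
read_bytes: 409600
write_bytes: 819200
cancelled_write_bytes: 0'

# read_bytes/write_bytes are what actually reached the block device:
echo "$io" | awk '/^(read_bytes|write_bytes)/ { print $1, $2 }'
```

Sampling this twice and diffing the counters gives a per-process I/O rate.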
Memory
Memory issues are more complex and include OOM, GC problems, and off-heap memory. Start with free to view overall memory status.
Heap Memory OOM
Typical OOM messages:
Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread – insufficient native memory for thread stacks; check thread pools with jstack/jmap, or raise the OS limits.

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space – the heap reached the -Xmx limit; look for leaks with jstack/jmap, then consider increasing -Xmx.

Exception in thread "main" java.lang.OutOfMemoryError: Metaspace – metaspace reached -XX:MaxMetaspaceSize; adjust it (use -XX:MaxPermSize on pre-Java-8 JVMs).
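For the "unable to create new native thread" case, a few Linux-side limits are worth checking before touching the JVM; this is a sketch of where to look, not an exhaustive list:

```shell
# OS-level limits that commonly cap thread creation on Linux
# (also check `ulimit -u`, the per-user process/thread limit):
cat /proc/sys/kernel/threads-max   # system-wide thread limit
cat /proc/sys/kernel/pid_max       # PID space, shared by threads
cat /proc/sys/vm/max_map_count     # each thread stack consumes mmap entries
```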
Stack Overflow
Indicates a thread's stack usage exceeded -Xss. Investigate the offending (often recursive) code path, or increase -Xss.
Using JMAP to locate memory leaks
Export a heap dump:

jmap -dump:format=b,file=filename pid

Analyze the dump with MAT (Memory Analyzer Tool), focusing on "Leak Suspects" or "Top Consumers".
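Before a full MAT session, jmap -histo:live pid gives a quick class-level ranking (note it triggers a full GC). The excerpt below is fabricated for illustration, showing how to pull the heaviest class out of a captured histogram:

```shell
# Fabricated `jmap -histo:live <pid>` excerpt: instances and shallow bytes per class.
histo=' num     #instances         #bytes  class name
   1:        250000       12000000  [C
   2:        240000        5760000  java.lang.String
   3:         10000        4880000  [B'

# Rank classes by shallow size (field 3 = bytes, field 4 = class name):
echo "$histo" | awk 'NR > 1 { print $4, $3 }' | sort -k2,2 -rn | head -n 1
```

A char array ([C) or byte array ([B) at the top usually points back to whatever Strings or buffers retain it, which is what MAT's dominator tree resolves.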
GC Issues
GC problems can cause both CPU load and memory pressure. Enable detailed GC logging with

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps

Analyze Young GC frequency with jstat; if it is too frequent, consider increasing -Xmn or tuning -XX:SurvivorRatio. For long GC pauses, examine G1 log phases such as Root Scanning, Object Copy, and Ref Proc.
Full GC triggers include concurrent-mark failure, promotion failure, large-object allocation failure, or an explicit System.gc() call. To capture the heap around a Full GC, set -XX:HeapDumpPath and toggle the dump flags at runtime with jinfo (or take dumps directly with jmap):

jinfo -flag +HeapDumpBeforeFullGC pid
jinfo -flag +HeapDumpAfterFullGC pid

Network
Network problems are complex and often the hardest to diagnose.
Timeouts
Distinguish between connection timeout and read/write timeout. Keep client timeout smaller than server timeout to avoid hanging connections.
TCP Queue Overflow
Two queues exist: the SYN (half-open) queue and the accept (fully established) queue. If the accept queue is full at the third step of the handshake, the server either drops the ACK or sends an RST, depending on tcp_abort_on_overflow. Monitor overflows with:

netstat -s | egrep "listen|LISTEN"

Check queue sizes with ss -lnt; the queues themselves are tuned via net.core.somaxconn and net.ipv4.tcp_max_syn_backlog. For heavy TIME_WAIT pressure, net.ipv4.tcp_tw_reuse, net.ipv4.tcp_tw_recycle, or tcp_max_tw_buckets can be adjusted (note that tcp_tw_recycle misbehaves behind NAT and was removed in Linux 4.12):

# enable reuse of TIME-WAIT sockets for new outgoing connections
net.ipv4.tcp_tw_reuse = 1
# fast recycle of TIME-WAIT sockets (unsafe with NAT; removed in Linux 4.12)
net.ipv4.tcp_tw_recycle = 1

RST Packets
RST indicates an abnormal connection reset, often caused by sending data to a closed socket or by queue overflows. Use tcpdump and Wireshark to capture and analyze RST packets:

tcpdump -i en0 tcp -w capture.cap

TIME_WAIT and CLOSE_WAIT
TIME_WAIT ensures delayed packets are handled and prevents premature RSTs; excessive TIME_WAIT can be mitigated by enabling reuse/recycle as above. CLOSE_WAIT usually results from applications not closing sockets properly; investigate with jstack to find threads stuck in I/O or waiting on latches.
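A per-state connection count is the usual first step for both symptoms. The netstat lines below are fabricated for illustration; the last field of each line is the TCP state:

```shell
# Fabricated `netstat -ant` lines; the last field is the TCP state.
conns='tcp 0 0 10.0.0.5:8080 10.0.0.9:53210 TIME_WAIT
tcp 0 0 10.0.0.5:8080 10.0.0.7:49112 CLOSE_WAIT
tcp 0 0 10.0.0.5:8080 10.0.0.8:50011 TIME_WAIT
tcp 0 0 10.0.0.5:8080 10.0.0.6:51234 ESTABLISHED'

# Count connections per state, busiest state first:
echo "$conns" | awk '{ print $NF }' | sort | uniq -c | sort -rn
```

A large CLOSE_WAIT count pinned to one local port is a strong hint that the service on that port is leaking sockets.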
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and will accompany you throughout your operations career.