How to Diagnose and Resolve Common Java Server Performance Issues
This guide walks through systematic troubleshooting of Java server problems—including CPU spikes, memory leaks, disk bottlenecks, GC pauses, and network anomalies—by using tools such as jstack, jmap, jstat, vmstat, iostat, netstat, and ss to pinpoint root causes and apply targeted fixes.
Overview
Online incidents often involve CPU, disk, memory, and network problems; most issues span multiple layers, so a systematic four‑step investigation (CPU → Disk → Memory → Network) is recommended.
CPU
Start by checking CPU usage. CPU anomalies are usually easier to locate. Common causes include business‑logic loops, frequent GC, and excessive context switches; the most frequent cause is problematic business or framework logic, which can be examined with jstack.
Using jstack to analyze CPU problems
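Find the process PID with ps (or use top to see which process consumes the most CPU). Then run top -H -p pid to identify the high-CPU threads, and convert the thread ID (tid) of the busiest thread to hexadecimal:
printf '%x\n' tid
The result is the nid that appears in the jstack output. Search for it with:
jstack pid | grep 'nid' -C5 --color
A thread that is burning CPU will normally be RUNNABLE, so read its stack trace first. In the rest of the dump, pay attention to large numbers of threads in WAITING, TIMED_WAITING, or BLOCKED states, which point to lock contention or stalled dependencies. For a quick overview of thread states, save the dump to jstack.log and run:
grep "java.lang.Thread.State" jstack.log | sort | uniq -c | sort -nr
Frequent GC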
Use jstat -gc pid 1000 to watch how each generation changes (1000 is the sampling interval in milliseconds). S0C/S1C, S0U/S1U, EC/EU, OC/OU, and MC/MU are the capacity (C) and usage (U) of the two Survivor spaces, Eden, the Old generation, and Metaspace. YGC/YGCT and FGC/FGCT are the counts and cumulative times of Young and Full GCs, and GCT is the total GC time. If GC looks too frequent, investigate further with heap dump analysis.
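As a rough way to put a number on "too frequent", the sketch below reads jstat -gc output on standard input and prints how many Young and Full GCs happened between consecutive samples. The function name gc_deltas is made up for this example, and it assumes the header row contains YGC and FGC columns; it looks them up by name because the full column set differs between JDK versions.

```shell
# gc_deltas: read `jstat -gc` output on stdin and print, for each sample
# after the first, how many young and full GCs occurred since the
# previous sample. Columns are located by header name (YGC, FGC).
gc_deltas() {
  awk '
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        if ($i == "YGC") y = i
        if ($i == "FGC") f = i
      }
      next
    }
    y && f {
      if (seen) printf "young=+%d full=+%d\n", $y - py, $f - pf
      py = $y; pf = $f; seen = 1
    }
  '
}

# Usage (pid is a placeholder for the Java process ID):
#   jstat -gc pid 1000 | gc_deltas
```

With a 1000 ms interval, each output line is the number of collections in roughly one second, which is easier to compare against a baseline than the raw cumulative counters.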
Context Switches
Inspect context switches with vmstat; the cs column shows the number of context switches per second. To monitor a specific process, use pidstat -w -p pid: the cswch/s and nvcswch/s columns show voluntary and involuntary switches per second.
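When pidstat is not installed, the same counters can be read from procfs on Linux. The following is a minimal sketch, assuming the status file contains the voluntary_ctxt_switches and nonvoluntary_ctxt_switches fields; the function name ctxt_switches is hypothetical.

```shell
# ctxt_switches: print the cumulative voluntary and involuntary context
# switch counters found in a /proc/<pid>/status-style file (Linux only).
ctxt_switches() {
  awk -F: '
    { gsub(/[ \t]/, "", $2) }                      # strip padding around the value
    $1 == "voluntary_ctxt_switches"    { v = $2 }
    $1 == "nonvoluntary_ctxt_switches" { n = $2 }
    END { printf "voluntary=%s involuntary=%s\n", v, n }
  ' "$1"
}

# Usage (pid and tid are placeholders):
#   ctxt_switches /proc/pid/status
#   ctxt_switches /proc/pid/task/tid/status
```

The counters are cumulative and kept per task, so for a multi-threaded JVM the per-thread files under /proc/pid/task/ are usually more telling than the process-level file. Sample twice and subtract to get a rate.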
Disk
Disk issues are also fundamental. Check disk space with:
df -hl
Performance problems can be diagnosed with:
iostat -d -k -x
The %util column shows the percentage of time the device was busy servicing I/O requests; rkB/s and wkB/s show read and write throughput, and rrqm/s and wrqm/s show how many read and write requests were merged per second. Together they help locate the problematic disk. Identify the responsible process with iotop, or convert a thread ID to a PID with readlink -f /proc/*/task/tid/../.., then inspect that process's I/O with:
cat /proc/pid/io
List its open files with lsof -p pid.
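The /proc/pid/io file mixes several counters, and the two that reflect real disk traffic are read_bytes and write_bytes. The sketch below pulls just those two out; io_bytes is a made-up helper name, and the file layout (one "name: value" pair per line) is assumed to match what Linux exposes.

```shell
# io_bytes: print the read_bytes and write_bytes counters from a
# /proc/<pid>/io-style file, i.e. bytes actually fetched from or sent
# to the storage layer (Linux only).
io_bytes() {
  awk -F: '
    { gsub(/[ \t]/, "", $2) }            # strip padding around the value
    $1 == "read_bytes"  { r = $2 }
    $1 == "write_bytes" { w = $2 }
    END { printf "read_bytes=%s write_bytes=%s\n", r, w }
  ' "$1"
}

# Usage (pid is a placeholder):
#   io_bytes /proc/pid/io; sleep 5; io_bytes /proc/pid/io
```

Both counters are cumulative since process start, so two readings a few seconds apart show whether the process is currently generating I/O.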
Memory
Memory issues are more complex and include OOM, GC problems, and off‑heap memory. Start with free to view overall memory status.
Heap Memory OOM
Typical OOM messages:
Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread. The JVM could not get a new native thread from the OS, usually because there is not enough native memory for thread stacks or an OS thread limit has been reached; check thread pools with jstack / jmap before raising OS limits.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. The heap has reached the -Xmx limit; look for leaks with jstack / jmap first, then consider increasing -Xmx.
Exception in thread "main" java.lang.OutOfMemoryError: Metaspace. Metaspace has reached -XX:MaxMetaspaceSize; raise that limit if the usage is legitimate. On Java 7 and earlier the equivalent area is PermGen, sized with -XX:MaxPermSize.
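To see which of these errors a service is actually hitting, and how often, the messages can be tallied from the application log. This is only a sketch: oom_summary is a hypothetical helper, and it assumes the log contains the error text on a single line, as in the messages above.

```shell
# oom_summary: count each distinct java.lang.OutOfMemoryError message
# in a log file and list the most frequent first.
oom_summary() {
  grep -o 'java\.lang\.OutOfMemoryError: .*' "$1" | sort | uniq -c | sort -nr
}

# Usage (the log path is a placeholder):
#   oom_summary /path/to/app.log
```

Because the pattern keeps everything after the error class, any trailing text on the same log line becomes part of the grouping key.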
Stack Overflow
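Thrown as java.lang.StackOverflowError when a thread needs more stack than -Xss allows. Investigate the code path first (runaway recursion is the usual cause); if the depth is legitimate, increase -Xss, keeping in mind that larger stacks cost more memory per thread.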
Using JMAP to locate memory leaks
Export a heap dump:
jmap -dump:format=b,file=filename pid
Analyze the dump with MAT (Memory Analyzer Tool), focusing on “Leak Suspects” or “Top Consumers”.
GC Issues
GC problems can cause CPU load and memory pressure. On JDK 8, enable detailed GC logging with -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps (JDK 9 and later replace these flags with unified logging, -Xlog:gc*). Analyze Young GC frequency with jstat; if it is too high, consider increasing -Xmn or adjusting -XX:SurvivorRatio. For long GC pauses, examine the phases reported in the G1 log, such as Root Scanning, Object Copy, and Ref Proc, to see which one dominates.
Full GC triggers include concurrent marking failure, promotion failure, large object allocation failure, and explicit System.gc() calls. To capture the heap around a Full GC, set the dump location with -XX:HeapDumpPath and use jinfo to switch on the HeapDumpBeforeFullGC and HeapDumpAfterFullGC flags in the running JVM:
jinfo -flag +HeapDumpBeforeFullGC pid
jinfo -flag +HeapDumpAfterFullGC pid
Network
Network problems are complex and often the hardest to diagnose.
Timeouts
Distinguish between connection timeout and read/write timeout. Keep client timeout smaller than server timeout to avoid hanging connections.
TCP Queue Overflow
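The kernel keeps two queues per listening socket: the SYN queue (half-open connections) and the accept queue (fully established connections waiting for accept()). If the accept queue is full when the final ACK of the three-way handshake arrives, the server either drops the ACK (tcp_abort_on_overflow = 0) or replies with an RST (tcp_abort_on_overflow = 1). Monitor overflow with: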
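netstat -s | egrep "listen|LISTEN"
Check queue sizes with ss -lnt: for a listening socket, Recv-Q is the current accept-queue length and Send-Q is its maximum. The accept queue is capped at min(backlog, net.core.somaxconn) and the SYN queue is governed by net.ipv4.tcp_max_syn_backlog, so raise those if overflows persist. The parameters net.ipv4.tcp_tw_reuse, net.ipv4.tcp_tw_recycle, and net.ipv4.tcp_max_tw_buckets control TIME_WAIT sockets rather than queue sizes:
# enable reuse of TIME-WAIT sockets
net.ipv4.tcp_tw_reuse = 1
# enable fast recycle of TIME-WAIT sockets (removed in Linux 4.12; unsafe behind NAT on older kernels)
net.ipv4.tcp_tw_recycle = 1
RST Packets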
RST indicates an abnormal connection reset, often caused by sending data to a closed socket or by queue overflows. Use tcpdump and Wireshark to capture and analyze RST packets.
tcpdump -i en0 tcp -w capture.cap
TIME_WAIT and CLOSE_WAIT
TIME_WAIT ensures delayed packets are handled and prevents premature RSTs; excessive TIME_WAIT can be mitigated by enabling reuse/recycle as above. CLOSE_WAIT usually results from applications not closing sockets properly; investigate with jstack to find threads stuck in I/O or waiting on latches.
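Before tuning anything, it helps to know how many sockets are actually in each state. The sketch below tallies the first column of ss -ant output; tcp_state_counts is a made-up name, and it assumes the first line of input is the ss header row.

```shell
# tcp_state_counts: read `ss -ant` output on stdin and print the number
# of sockets in each TCP state, most common first.
tcp_state_counts() {
  awk 'NR > 1 { count[$1]++ } END { for (s in count) print count[s], s }' | sort -nr
}

# Usage:
#   ss -ant | tcp_state_counts
```

Note that ss abbreviates state names, so TIME_WAIT and CLOSE_WAIT appear as TIME-WAIT and CLOSE-WAIT. A CLOSE-WAIT count that keeps growing is the signal to go looking for unclosed sockets with jstack.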
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.