Comprehensive Guide to Troubleshooting CPU, Disk, Memory, GC, and Network Issues in Java Applications
This article provides a step‑by‑step methodology for diagnosing and resolving common online failures in Java services, covering CPU bottlenecks, disk I/O problems, memory leaks, garbage‑collection inefficiencies, and network anomalies such as timeouts, TCP queue overflows, and RST packets.
Online service failures usually involve CPU, disk, memory, or network problems, and most incidents span more than one of these layers. A systematic four‑step inspection in the order CPU → Disk → Memory → Network is therefore recommended, using tools such as jstack, jmap, jstat, top, vmstat, iostat, netstat, and ss.
CPU: Get the Java process ID with ps, then run top -H -p <pid> to find the threads consuming the most CPU. Convert the hot thread's ID to hexadecimal (printf '%x\n' <tid>), since jstack reports it as the nid field, and locate its stack with jstack <pid> | grep '<nid>' -C5 --color. Beyond the single hot thread, pay attention to threads in WAITING, TIMED_WAITING, and BLOCKED states; cat jstack.log | grep "java.lang.Thread.State" | sort | uniq -c | sort -nr prints a count per state, and an unusually large count helps spot problematic threads.
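The CPU steps above can be sketched as a few small shell helpers. This is a minimal sketch, not a definitive tool: the function names (tid_to_nid, thread_state_counts, stack_for_tid) are hypothetical, and it assumes a saved dump in which jstack prints nid in hexadecimal, as the article describes.

```shell
# Convert a decimal thread ID (as shown by `top -H`) to the hex nid form.
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# Count threads per state in a saved jstack dump, most frequent state first.
# Assumes lines of the form "   java.lang.Thread.State: TIMED_WAITING (sleeping)".
thread_state_counts() {
  grep 'java.lang.Thread.State' "$1" \
    | awk '{ print $2 }' \
    | sort | uniq -c | sort -nr
}

# Print the thread header and the next 10 lines of stack for one thread ID.
# The trailing space in the pattern avoids matching a longer nid by prefix.
stack_for_tid() {
  local dump="$1" tid="$2"
  grep -A 10 "nid=$(tid_to_nid "$tid") " "$dump"
}
```

A typical sequence would be jstack <pid> > jstack.log, then thread_state_counts jstack.log for the overview and stack_for_tid jstack.log <tid> for the hot thread found in top.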
Disk: Check filesystem space with df -hl and monitor I/O performance with iostat -d -k -x. The %util column shows how busy each device is, and iotop shows which task is responsible. Because iotop reports thread IDs, convert the thread ID to a PID with readlink -f /proc/*/task/<tid>/../.., then inspect the process's I/O via cat /proc/<pid>/io or its open files via lsof -p <pid>.
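As a rough illustration of the disk checks, the helpers below filter saved command output rather than running the tools themselves. The names busy_disks and proc_io_bytes are hypothetical, the 80% default threshold is an arbitrary assumption, and the sketch assumes %util is the last column of the extended iostat report.

```shell
# Read `iostat -d -k -x` output on stdin and print devices whose %util
# (assumed to be the last column) exceeds the given threshold.
busy_disks() {
  local threshold="${1:-80}"
  awk -v t="$threshold" '
    /^Device/ { in_table = 1; next }        # header row; data rows follow
    NF == 0   { in_table = 0; next }        # blank line ends the table
    in_table && $NF + 0 > t { print $1, $NF }
  '
}

# Print the read_bytes and write_bytes counters from a file in
# /proc/<pid>/io format ("read_bytes: 12345").
proc_io_bytes() {
  awk -F': ' '$1 == "read_bytes" || $1 == "write_bytes" { print $1, $2 }' "$1"
}
```

For example, iostat -d -k -x | busy_disks 90 lists only devices above 90% utilization, and proc_io_bytes /proc/<pid>/io shows how much the suspect process has actually read from and written to storage.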
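Memory: Start with free to view overall usage, then differentiate between heap OOM (java.lang.OutOfMemoryError: Java heap space), native thread OOM (java.lang.OutOfMemoryError: unable to create new native thread), and metaspace OOM (java.lang.OutOfMemoryError: Metaspace). To locate a heap leak, get a class histogram with jmap -histo:live <pid>, or take a full dump with jmap -dump:format=b,file=heap.hprof <pid> and open it in Eclipse MAT. Enable -XX:+HeapDumpOnOutOfMemoryError so that a dump is written automatically when the heap is exhausted. For off‑heap leaks, start the JVM with -XX:NativeMemoryTracking=summary (or detail) and read the JVM's own accounting with jcmd <pid> VM.native_memory summary; list the largest mappings with pmap -x <pid> | sort -rn -k3 | head -30; and, for a suspicious region, dump its contents for inspection with gdb --batch --pid <pid> -ex "dump memory dump.bin 0x<addr> 0x<addr+size>".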
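The 0x<addr+size> argument in the gdb command has to be computed from the start address and size that pmap reports. The helper below is a minimal sketch of that arithmetic; the name gdb_dump_cmd is hypothetical, and it assumes the size is given in kilobytes, as in the Kbytes column of pmap -x. It only prints the command, so nothing is attached to a process until the output is reviewed and run by hand.

```shell
# Build the gdb command that dumps one memory region.
#   $1 = PID, $2 = region start address in hex (with or without 0x),
#   $3 = region size in KB (assumed to come from the Kbytes column of pmap -x)
gdb_dump_cmd() {
  local pid="$1" start="${2#0x}" kb="$3"
  local end
  # End address = start + size in bytes, printed back as hex.
  end=$(printf '%x' "$(( 0x$start + kb * 1024 ))")
  printf 'gdb --batch --pid %s -ex "dump memory dump.bin 0x%s 0x%s"\n' \
    "$pid" "$start" "$end"
}
```

Running strings on the resulting dump.bin is often enough to recognize what kind of data fills the region.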
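GC: On JDK 8, enable detailed GC logging with -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps; JDK 9 and later replace these flags with unified logging (for example -Xlog:gc*). Use jstat -gc <pid> 1000 to sample the collector every second and compare how fast the young (YGC) and full (FGC) collection counts grow. If young collections are too frequent, the young generation may be too small; consider adjusting -Xmn and -XX:SurvivorRatio. For G1, -XX:G1ReservePercent and -XX:InitiatingHeapOccupancyPercent help mitigate frequent or long‑running collections, while a fixed -Xmn is generally avoided because it overrides G1's adaptive young‑generation sizing. To dump the heap before or after a Full GC, run jinfo -flag +HeapDumpBeforeFullGC <pid> and jinfo -flag +HeapDumpAfterFullGC <pid>.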
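To turn a run of jstat samples into a frequency, it helps to compare the first and last rows. The sketch below does that for saved jstat -gc output; the name gc_deltas is hypothetical, and it assumes a single header row containing YGC and FGC columns, which it locates by name because the column layout differs between JDK versions.

```shell
# Read `jstat -gc <pid> <interval>` output on stdin and print how many
# young and full collections happened between the first and last sample.
gc_deltas() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names to column numbers
    !seen   { y0 = $(col["YGC"]); f0 = $(col["FGC"]); seen = 1 }
            { y1 = $(col["YGC"]); f1 = $(col["FGC"]) }
    END     { if (seen) printf "young=%d full=%d\n", y1 - y0, f1 - f0 }
  '
}
```

For example, jstat -gc <pid> 1000 60 | gc_deltas summarizes one minute of activity; a full count that keeps climbing over such a window is the signal to look at the GC log or take the heap dumps described above.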
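Network: Diagnose timeouts, TCP queue overflows, and RST packets. Use netstat -s | egrep "listen|LISTEN" to view the overflow counters and ss -lnt to check the accept queue: for a listening socket, Recv-Q is the current queue length and Send-Q is the configured backlog. To cope with large numbers of TIME_WAIT sockets, net.ipv4.tcp_tw_reuse=1 lets new outbound connections reuse them. Avoid net.ipv4.tcp_tw_recycle=1: it drops connections from clients behind NAT and was removed in Linux 4.12. Monitor connection states with netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}' or ss -ant. Capture RST traffic with tcpdump -i en0 tcp -w capture.cap (replace en0 with the local interface name) and analyze it in Wireshark.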
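The queue and state checks can be sketched as two filters over saved ss output. The names full_accept_queues and conn_state_counts are hypothetical. The first relies on the ss -lnt convention that, for a listening socket, Recv-Q is the current accept-queue length and Send-Q is the backlog limit; both assume the default column order, with the state in the first column.

```shell
# Read `ss -lnt` output on stdin and print listening sockets whose accept
# queue (Recv-Q, column 2) has reached the backlog limit (Send-Q, column 3).
full_accept_queues() {
  awk '$1 == "LISTEN" && $3 > 0 && $2 >= $3 { print $4, $2 "/" $3 }'
}

# Read `ss -ant` output on stdin and print a count per TCP state,
# skipping the header row and sorting by state name.
conn_state_counts() {
  awk 'NR > 1 { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}
```

For example, ss -lnt | full_accept_queues prints nothing on a healthy host, while ss -ant | conn_state_counts makes a pile-up of TIME-WAIT or CLOSE-WAIT sockets visible at a glance.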
By following these systematic checks and leveraging the listed commands, developers can quickly pinpoint the root cause of performance degradations and restore service stability.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
