How to Diagnose and Resolve Java CPU, Memory, Disk, and Network Issues in Production
This guide walks through a systematic four‑step approach—CPU, disk, memory, and network—to pinpoint Java service failures using tools like jstack, jmap, top, vmstat, iostat, jstat, netstat, ss, and tcpdump, covering OOM, GC, off‑heap, and TCP state problems.
Overview
Production incidents in Java services usually trace back to CPU, disk, memory, or network problems, and often to several of them at once. Inspecting the four areas in a fixed order (CPU → Disk → Memory → Network) with the standard diagnostic tools (jstack, jmap, top, vmstat, iostat, jstat, netstat, ss, tcpdump, etc.) helps pinpoint the root cause.
CPU Diagnosis
Start by locating high‑CPU threads. Use ps to get the PID, then top -H -p <pid> to list threads by CPU usage. Convert the busiest thread's ID to hexadecimal, which is the form jstack prints in its nid field, with
printf '%x\n' <tid>
and search the stack trace:
jstack <pid> | grep '<nid>' -C5 --color
Also look at the overall distribution of thread states: an unusually large number of threads in WAITING or TIMED_WAITING usually indicates a problem. Frequent GC or excessive context switches can also manifest as CPU spikes.
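The lookup described above can also be scripted against a saved thread dump. The following is a minimal sketch, not a definitive tool: the helper names (tid_to_nid, find_thread, count_states) and the sample dump are illustrative assumptions, and it assumes the usual jstack layout in which thread blocks are separated by blank lines.

```python
import re
from collections import Counter

# Hypothetical two-thread dump in the usual jstack layout (illustration only).
SAMPLE_DUMP = (
    '"main" #1 prio=5 os_prio=0 tid=0x00007f01 nid=0x1a2b runnable [0x01]\n'
    "   java.lang.Thread.State: RUNNABLE\n"
    "\tat Foo.bar(Foo.java:1)\n"
    "\n"
    '"worker" #2 prio=5 os_prio=0 tid=0x00007f02 nid=0x1a2c waiting on condition [0x02]\n'
    "   java.lang.Thread.State: TIMED_WAITING (sleeping)\n"
    "\tat java.lang.Thread.sleep(Native Method)\n"
)

def tid_to_nid(tid: int) -> str:
    """Format a decimal thread ID from top -H the way jstack prints nid."""
    return hex(tid)

def find_thread(dump: str, tid: int) -> str:
    """Return the stack block whose nid matches tid, or '' if none does."""
    marker = f"nid={tid_to_nid(tid)} "  # trailing space avoids prefix matches
    for block in dump.split("\n\n"):
        if marker in block:
            return block.strip()
    return ""

def count_states(dump: str) -> Counter:
    """Count threads per java.lang.Thread.State value."""
    return Counter(re.findall(r"java\.lang\.Thread\.State: (\w+)", dump))

if __name__ == "__main__":
    print(find_thread(SAMPLE_DUMP, 6699))
    print(count_states(SAMPLE_DUMP))
```

In practice the dump text would come from jstack <pid> redirected to a file, and the thread ID from the top -H output.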
Frequent GC
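Run jstat -gc <pid> 1000 to sample generation‑level GC activity every 1000 ms. The columns report capacity and usage (in KB) for the two survivor spaces (S0C/S0U, S1C/S1U), Eden (EC/EU), the old generation (OC/OU), and Metaspace (MC/MU), plus the count and cumulative time of young GCs (YGC/YGCT) and full GCs (FGC/FGCT). If GC is too frequent, investigate heap usage or adjust GC parameters.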
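To turn a few jstat samples into numbers that are easy to compare, the output can be parsed by column name. This is a rough sketch under stated assumptions: the sample text is made up, the function names are hypothetical, and the parser relies only on the header row, so it should tolerate JDK versions that add or drop columns.

```python
# Hypothetical jstat -gc output: header plus two samples taken one interval apart.
SAMPLE_JSTAT = """
S0C    S1C    S0U    S1U    EC     EU     OC      OU      MC     MU     YGC YGCT  FGC FGCT  GCT
1024.0 1024.0 0.0    512.0  8192.0 4096.0 20480.0 10240.0 4864.0 2500.0 10  0.050 1   0.020 0.070
1024.0 1024.0 256.0  0.0    8192.0 1024.0 20480.0 15360.0 4864.0 2500.0 14  0.070 2   0.045 0.115
"""

def parse_jstat_gc(output: str) -> list[dict[str, float]]:
    """Map each sample row to {column name: value}; '-' is treated as 0."""
    rows = [line.split() for line in output.strip().splitlines() if line.strip()]
    header, samples = rows[0], rows[1:]
    return [
        {col: (0.0 if val == "-" else float(val)) for col, val in zip(header, row)}
        for row in samples
    ]

def summarize(samples: list[dict[str, float]]) -> dict[str, float]:
    """Compare the first and last sample of the observation window."""
    first, last = samples[0], samples[-1]
    return {
        "old_used_pct": round(100 * last["OU"] / last["OC"], 1),
        "young_gcs": int(last["YGC"] - first["YGC"]),
        "full_gcs": int(last["FGC"] - first["FGC"]),
    }

if __name__ == "__main__":
    print(summarize(parse_jstat_gc(SAMPLE_JSTAT)))
```

A rising old_used_pct together with a growing full_gcs count over a short window is the pattern that points toward heap pressure rather than a one-off collection.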
Context Switches
Use vmstat to view the cs column (context switches per second). For a specific process, pidstat -w -p <pid> shows voluntary (cswch/s) and involuntary (nvcswch/s) switches.
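When pidstat is not installed, the same counters can be read from /proc/<pid>/status, which exposes cumulative voluntary_ctxt_switches and nonvoluntary_ctxt_switches fields on Linux. The sketch below is illustrative only: the function names and the sample text are assumptions, and a per-second rate has to be derived from two readings because the kernel reports totals.

```python
# Hypothetical excerpt of /proc/<pid>/status (illustration only).
SAMPLE_STATUS = (
    "Name:\tjava\n"
    "Threads:\t42\n"
    "voluntary_ctxt_switches:\t150\n"
    "nonvoluntary_ctxt_switches:\t7\n"
)

WANTED = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")

def parse_ctxt_switches(status_text: str) -> dict[str, int]:
    """Extract the cumulative context-switch counters from status text."""
    counters = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in WANTED:
            counters[key] = int(value)  # int() ignores surrounding whitespace
    return counters

def switch_rate(before: dict[str, int], after: dict[str, int],
                interval_secs: float) -> dict[str, float]:
    """Per-second rates between two readings taken interval_secs apart."""
    return {key: (after[key] - before[key]) / interval_secs for key in after}

if __name__ == "__main__":
    print(parse_ctxt_switches(SAMPLE_STATUS))
```

A high involuntary rate suggests threads are being preempted (CPU contention), while a high voluntary rate suggests threads are frequently blocking on I/O or locks.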
Disk Diagnosis
Check filesystem space with df -hl. For performance, run iostat -d -k -x and examine the %util, rrqm/s, and wrqm/s columns to identify saturated disks. Identify the responsible process with iotop or by mapping a thread ID to a PID via readlink -f /proc/*/task/<tid>/../.., then inspect I/O stats with cat /proc/<pid>/io and lsof -p <pid>.
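The %util check can be automated by parsing iostat's extended output by header name. This is a sketch under assumptions: the sample report is invented, the function name is hypothetical, and the 80% threshold is an arbitrary starting point rather than a recommended value.

```python
# Hypothetical iostat -d -k -x report (columns trimmed for illustration).
SAMPLE_IOSTAT = """
Linux 5.4.0 (host)  01/01/2024  _x86_64_  (4 CPU)

Device   r/s    w/s     rkB/s  wkB/s     rrqm/s  wrqm/s  %util
sda      1.00   250.00  8.00   51200.00  0.00    30.00   97.50
sdb      0.50   2.00    4.00   16.00     0.00    0.10    1.20
"""

def saturated_devices(iostat_output: str, threshold: float = 80.0) -> dict[str, float]:
    """Return {device: %util} for devices at or above the threshold."""
    header = None
    busy = {}
    for line in iostat_output.splitlines():
        fields = line.split()
        if not fields:
            header = None  # a blank line ends the current device table
            continue
        if fields[0].startswith("Device") and "%util" in fields:
            header = fields
            continue
        if header and len(fields) == len(header):
            util = float(fields[header.index("%util")])
            if util >= threshold:
                busy[fields[0]] = util
    return busy

if __name__ == "__main__":
    print(saturated_devices(SAMPLE_IOSTAT))
```

Because the column position is looked up from the header, the same function should work across sysstat versions that order or name the other columns differently.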
Memory Diagnosis
Start with free -h to see overall memory usage. Common JVM memory problems include OutOfMemoryError (OOM) and StackOverflowError.
Out‑Of‑Memory (OOM)
Native thread creation failure: java.lang.OutOfMemoryError: unable to create new native thread. Reduce thread stack size with -Xss or raise OS limits in /etc/security/limits.conf.
Java heap space: java.lang.OutOfMemoryError: Java heap space. Look for memory leaks with jstack / jmap, then increase -Xmx if necessary.
Metaspace exhaustion: java.lang.OutOfMemoryError: Metaspace. Raise -XX:MaxMetaspaceSize. On JVMs older than Java 8 the equivalent error is java.lang.OutOfMemoryError: PermGen space, tuned with -XX:MaxPermSize.
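As a triage aid, the error messages above can be mapped to the first thing to check. The table and function below are a hypothetical sketch: the hint wording simply restates the guidance in this section, and the function name is an assumption, not part of any tool.

```python
from typing import Optional

# (substring of the OutOfMemoryError message, first thing to check)
OOM_HINTS = [
    ("unable to create new native thread",
     "reduce -Xss or raise OS limits in /etc/security/limits.conf"),
    ("Java heap space",
     "look for leaks with jstack / jmap, then consider raising -Xmx"),
    ("Metaspace",
     "raise -XX:MaxMetaspaceSize"),
    ("PermGen space",
     "pre-Java 8 JVM: raise -XX:MaxPermSize"),
]

def classify_oom(log_line: str) -> Optional[str]:
    """Return the first matching hint for a log line, or None if no match."""
    if "OutOfMemoryError" not in log_line:
        return None
    for needle, hint in OOM_HINTS:
        if needle in log_line:
            return hint
    return None

if __name__ == "__main__":
    line = 'Exception in thread "main" java.lang.OutOfMemoryError: Java heap space'
    print(classify_oom(line))
```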
StackOverflowError
Occurs when a thread’s stack exceeds -Xss. Reduce recursion depth or increase -Xss cautiously.
Heap Dump Analysis
Generate a heap dump with jmap -dump:format=b,file=heap.hprof <pid> and analyze it using Eclipse Memory Analyzer (MAT). Look at “Leak Suspects”, “Top Consumers”, or “Thread Overview”. Enable automatic dumps on OOM with -XX:+HeapDumpOnOutOfMemoryError.
Off‑Heap Memory
Off‑heap leaks (e.g., DirectByteBuffer) appear as OutOfMemoryError: Direct buffer memory, or as Netty's OutOfDirectMemoryError when Netty manages the buffers. Inspect native memory with pmap -x <pid> and dump a suspicious region with
gdb --batch --pid <pid> -ex "dump memory dump.bin <addr> <addr+size>"
To track native allocations, start the JVM with -XX:NativeMemoryTracking=summary (or detail), then run jcmd <pid> VM.native_memory summary (or detail). Cap direct buffer usage with -XX:MaxDirectMemorySize as needed.
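The Native Memory Tracking summary is easier to compare across snapshots once it is reduced to committed size per category. The sketch below is an assumption-heavy illustration: the sample text imitates the KB-scaled summary layout, the exact format can differ between JDK versions and scale settings, and the function names are hypothetical.

```python
import re

# Hypothetical excerpt of jcmd <pid> VM.native_memory summary output.
SAMPLE_NMT = """
Total: reserved=1470626KB, committed=195446KB
-                 Java Heap (reserved=1048576KB, committed=65536KB)
                            (mmap: reserved=1048576KB, committed=65536KB)

-                     Class (reserved=1056899KB, committed=4995KB)
-                  Internal (reserved=9000KB, committed=9000KB)
"""

# Category lines start with '-' followed by the category name.
CATEGORY = re.compile(
    r"^-\s+(.+?) \(reserved=(\d+)KB, committed=(\d+)KB\)", re.MULTILINE
)

def committed_by_category(nmt_summary: str) -> dict[str, int]:
    """Return {category: committed KB} for every category line found."""
    return {name: int(committed)
            for name, _reserved, committed in CATEGORY.findall(nmt_summary)}

def largest_committed(nmt_summary: str) -> list[tuple[str, int]]:
    """Categories sorted by committed memory, largest first."""
    items = committed_by_category(nmt_summary).items()
    return sorted(items, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    for name, kb in largest_committed(SAMPLE_NMT):
        print(f"{name}: {kb} KB")
```

Comparing two such snapshots taken some time apart shows which category is growing, which narrows down where an off-heap leak lives.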
Garbage‑Collection Issues
Enable GC logging on Java 8 and earlier with
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps. On Java 9 and later these flags were replaced by unified logging, for example -Xlog:gc*. Analyze Young GC frequency, duration, and Full GC triggers. For G1, consider tuning -XX:G1ReservePercent, -XX:InitiatingHeapOccupancyPercent, and -XX:ConcGCThreads. Use jinfo -flag +HeapDumpBeforeFullGC <pid> and jinfo -flag +HeapDumpAfterFullGC <pid> to compare pre‑ and post‑GC heap states.
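Young GC frequency and pause duration can be pulled out of a Java 8 style log with a small parser. This is a sketch, not a general GC log analyzer: it assumes lines produced with -XX:+PrintGCDetails that end in a [Times: ... real=N secs] section, the sample lines imitate Parallel collector output, and the function name is hypothetical.

```python
import re

# Hypothetical Java 8 GC log lines (Parallel collector layout, illustration only).
SAMPLE_GC_LOG = """
2024-01-01T10:00:00.100+0000: 1.000: [GC (Allocation Failure) [PSYoungGen: 65536K->10720K(76288K)] 65536K->10728K(251392K), 0.0123456 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
2024-01-01T10:00:05.100+0000: 6.000: [GC (Allocation Failure) [PSYoungGen: 76256K->10740K(76288K)] 76264K->20000K(251392K), 0.0300000 secs] [Times: user=0.05 sys=0.00, real=0.03 secs]
2024-01-01T10:00:09.100+0000: 10.000: [Full GC (Ergonomics) [PSYoungGen: 10740K->0K(76288K)] [ParOldGen: 170000K->90000K(175104K)] 180740K->90000K(251392K), [Metaspace: 3000K->3000K(1056768K)], 0.4500000 secs] [Times: user=0.80 sys=0.01, real=0.45 secs]
"""

# Capture the event kind and the wall-clock ("real") pause from each line.
PAUSE = re.compile(r"\[(Full GC|GC)\b.*real=(\d+\.\d+) secs\]")

def summarize_gc_log(text: str) -> dict[str, dict[str, float]]:
    """Count events and aggregate wall-clock pauses per GC kind."""
    pauses = {"GC": [], "Full GC": []}
    for line in text.splitlines():
        match = PAUSE.search(line)
        if match:
            pauses[match.group(1)].append(float(match.group(2)))
    return {
        kind: {
            "count": len(values),
            "total_secs": round(sum(values), 3),
            "max_secs": max(values, default=0.0),
        }
        for kind, values in pauses.items()
    }

if __name__ == "__main__":
    print(summarize_gc_log(SAMPLE_GC_LOG))
```

Unified logging output from -Xlog:gc* uses a different line format and would need a different pattern.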
Network Diagnosis
Network problems are often the most elusive. Distinguish between connection timeout, read/write timeout, and other timeout categories. Keep client‑side timeouts shorter than server‑side values.
TCP Queue Overflow
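Monitor SYN and accept queues with netstat -s | egrep "listen|LISTEN" and ss -lnt. For a listening socket, ss reports the current accept‑queue length in Recv-Q and the queue's maximum in Send-Q. To mitigate overflow, raise net.ipv4.tcp_max_syn_backlog and net.core.somaxconn together with the backlog the application passes to listen(). tcp_tw_reuse and tcp_tw_recycle are sometimes adjusted as well, but tcp_tw_recycle breaks clients behind NAT and was removed in Linux 4.12.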
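The accept-queue check can be scripted by reading ss -lnt output, where Recv-Q on a LISTEN socket is the current accept-queue length and Send-Q is its configured maximum. The following is a minimal sketch: the sample output and the function name are illustrative assumptions.

```python
# Hypothetical ss -lnt output (illustration only).
SAMPLE_SS = """
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
LISTEN  0       128     0.0.0.0:22          0.0.0.0:*
LISTEN  129     128     *:8080              *:*
"""

def full_accept_queues(ss_output: str) -> list[tuple[str, int, int]]:
    """Return (local address, Recv-Q, Send-Q) for listeners whose queue is full."""
    flagged = []
    for line in ss_output.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "LISTEN":
            recv_q, send_q = int(fields[1]), int(fields[2])
            if recv_q >= send_q:
                flagged.append((fields[3], recv_q, send_q))
    return flagged

if __name__ == "__main__":
    print(full_accept_queues(SAMPLE_SS))
```

A listener that shows up here repeatedly is not calling accept() fast enough, so raising kernel limits alone may only delay the overflow.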
RST Packets
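RST indicates abrupt connection termination. Common causes: connecting to a port with no listener, closing a socket with SO_LINGER set to a zero timeout (which sends RST instead of FIN), or peer‑side crashes. Capture TCP traffic with tcpdump -i eth0 tcp -w capture.cap and inspect it in Wireshark, filtering on tcp.flags.reset == 1 to isolate the RST packets.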
TIME_WAIT and CLOSE_WAIT
Use
netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
or ss -ant to count sockets in these states. Excessive TIME_WAIT can be reduced by enabling net.ipv4.tcp_tw_reuse=1. Avoid net.ipv4.tcp_tw_recycle=1: it drops connections from clients behind NAT and was removed in Linux 4.12. CLOSE_WAIT usually stems from applications that never close sockets; investigate with jstack to find blocked threads.
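The awk one-liner above groups TCP sockets by their last column. An equivalent sketch in Python, useful when the counts feed an alerting script, is shown here; the sample output and function name are illustrative assumptions.

```python
from collections import Counter

# Hypothetical netstat -n output (illustration only).
SAMPLE_NETSTAT = """
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address   Foreign Address   State
tcp        0      0 10.0.0.5:8080   10.0.0.9:51000    ESTABLISHED
tcp        0      0 10.0.0.5:8080   10.0.0.9:51001    TIME_WAIT
tcp        0      0 10.0.0.5:8080   10.0.0.9:51002    TIME_WAIT
tcp6       0      0 ::1:8080        ::1:40000         CLOSE_WAIT
udp        0      0 0.0.0.0:68      0.0.0.0:*
"""

def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP sockets per state, like awk '/^tcp/ {++S[$NF]}'."""
    return Counter(
        line.split()[-1]
        for line in netstat_output.splitlines()
        if line.startswith("tcp")
    )

if __name__ == "__main__":
    for state, count in count_tcp_states(SAMPLE_NETSTAT).most_common():
        print(state, count)
```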
By following this structured inspection (CPU → Disk → Memory → Network) and using the commands above, engineers can quickly locate and remediate the root cause of most production Java incidents.
dbaplus Community
