How to Diagnose and Resolve Java CPU, Memory, Disk, and Network Issues in Production

This guide walks through a systematic four‑step approach—CPU, disk, memory, and network—to pinpoint Java service failures using tools like jstack, jmap, top, vmstat, iostat, jstat, netstat, ss, and tcpdump, covering OOM, GC, off‑heap, and TCP state problems.

dbaplus Community

Mar 17, 2021

Overview

Online incidents in Java services usually involve CPU, disk, memory, or network problems, often simultaneously. A systematic four‑step inspection—CPU → Disk → Memory → Network—combined with diagnostic tools (jstack, jmap, top, vmstat, iostat, jstat, netstat, ss, tcpdump, etc.) helps pinpoint the root cause.

CPU Diagnosis

Start by locating high‑CPU threads. Use ps to get the PID, then top -H -p <pid> to list threads by CPU usage. Convert the busiest thread's ID from decimal to hexadecimal, because jstack reports it in hex as the nid field:

printf '%x\n' <tid>

and search the stack trace:

jstack <pid> | grep '<nid>' -C5 --color
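The conversion and lookup can be wrapped in two small shell helpers. This is a minimal sketch: the function names tid_to_nid and thread_stack are made up for illustration, and it assumes jstack is on the PATH and prints nid in hex, as described above.

```shell
# Convert a decimal thread ID (as shown by `top -H`) to the hex form
# that jstack prints after "nid=".
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# Print the stack of one thread. $1 = JVM PID, $2 = decimal thread ID.
# -w stops nid=0x30 from also matching nid=0x3039.
thread_stack() {
  jstack "$1" | grep -w -A 20 "nid=$(tid_to_nid "$2")"
}
```

For example, thread_stack <pid> <tid> prints the matching thread header plus the 20 lines that follow it.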

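A thread that is burning CPU is normally RUNNABLE, so read its stack first. Then check the overall distribution of thread states: an unusually large number of WAITING, TIMED_WAITING, or BLOCKED threads usually indicates a problem. Frequent GC or excessive context switching can also show up as CPU spikes.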
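To get a quick overview of thread states, the jstack output can be tallied with awk. The sketch below assumes the usual "java.lang.Thread.State: <STATE>" lines in the dump; count_thread_states is a hypothetical helper name.

```shell
# Summarise thread states from a jstack dump read on stdin.
# Prints "<STATE> <count>" per line, sorted by state name.
count_thread_states() {
  awk '/java\.lang\.Thread\.State:/ { count[$2]++ }
       END { for (s in count) print s, count[s] }' | sort
}

# Typical use (not run here):
#   jstack <pid> | count_thread_states
```

Comparing two or three summaries taken a few seconds apart helps tell a steady state from a growing backlog.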

Frequent GC

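Run jstat -gc <pid> 1000 to sample GC activity per generation every 1000 ms. Look at S0C/S0U, S1C/S1U, EC/EU, OC/OU, and MC/MU (capacity and usage of the two survivor spaces, Eden, the old generation, and Metaspace, in KB), plus YGC/YGCT (young GC count and total time) and FGC/FGCT (full GC count and total time). If GC is too frequent, investigate heap usage or adjust GC parameters.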
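The raw jstat columns are easier to read once reduced to a few derived numbers. The following sketch (summarize_jstat_gc is a hypothetical name) looks columns up by header name, since their positions differ between JDK versions, and reports old-generation occupancy and the average young GC pause.

```shell
# Read `jstat -gc` output on stdin and print, for each sample line:
# old-gen usage in %, young GC count, full GC count, average young GC pause in ms.
summarize_jstat_gc() {
  awk 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
       {
         ou = $(col["OU"]); oc = $(col["OC"])
         ygc = $(col["YGC"]); ygct = $(col["YGCT"]); fgc = $(col["FGC"])
         old_pct = (oc > 0) ? 100 * ou / oc : 0
         avg_ms  = (ygc > 0) ? 1000 * ygct / ygc : 0
         printf "old=%.1f%% ygc=%d fgc=%d avg_ygc=%.1fms\n", old_pct, ygc, fgc, avg_ms
       }'
}

# Typical use (not run here):
#   jstat -gc <pid> 1000 5 | summarize_jstat_gc
```

A steadily rising old= value together with a climbing fgc= count is the pattern to look for when a leak is suspected.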

Context Switches

Use vmstat to view the cs column (context switches per second, system‑wide). For a specific process, pidstat -w -p <pid> shows voluntary (cswch/s) and involuntary (nvcswch/s) switches.
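When pidstat is not installed, Linux exposes the same counters as cumulative totals in /proc. The sketch below parses them from a status file; parse_ctxt_switches is a hypothetical helper, and it assumes a Linux /proc layout.

```shell
# Extract the cumulative context-switch counters from the contents of a
# Linux /proc status file read on stdin.
# Prints "voluntary=<n> involuntary=<n>".
parse_ctxt_switches() {
  awk '/^voluntary_ctxt_switches:/    { v = $2 }
       /^nonvoluntary_ctxt_switches:/ { n = $2 }
       END { printf "voluntary=%d involuntary=%d\n", v, n }'
}

# Typical use (not run here):
#   parse_ctxt_switches < /proc/<pid>/status
#   parse_ctxt_switches < /proc/<pid>/task/<tid>/status   # a single thread
```

Because the values are totals since the task started, take two readings a known interval apart and subtract them to get a rate.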

Disk Diagnosis

Check filesystem space with df -hl. For performance, run iostat -d -k -x and examine the %util, rrqm/s, and wrqm/s columns to identify saturated disks. Identify the responsible process with iotop or by mapping a thread ID to a PID via readlink -f /proc/*/task/<tid>/../.., then inspect I/O stats with cat /proc/<pid>/io and lsof -p <pid>.
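The space check is easy to automate. This sketch filters df output for nearly full filesystems; full_filesystems and the 90% default are illustrative choices, and it assumes mount points without spaces.

```shell
# Read `df -P` output on stdin and print mount points whose usage is at or
# above the given percentage (default 90). -P selects the POSIX output
# format, which keeps each filesystem on one line.
full_filesystems() {
  awk -v limit="${1:-90}" 'NR > 1 {
         pct = $5; sub(/%$/, "", pct)
         if (pct + 0 >= limit + 0) print $6, $5
       }'
}

# Typical use (not run here):
#   df -P | full_filesystems 85
```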

Memory Diagnosis

Start with free -h to see overall memory usage. Common heap‑related problems include OOM and StackOverflow.

Out‑Of‑Memory (OOM)

Native thread creation failure: java.lang.OutOfMemoryError: unable to create new native thread. Reduce thread stack size with -Xss or raise OS limits in /etc/security/limits.conf.

Java heap space: java.lang.OutOfMemoryError: Java heap space. Look for memory leaks with jstack / jmap, then increase -Xmx if necessary.

Metaspace exhaustion: java.lang.OutOfMemoryError: Metaspace. Adjust -XX:MaxMetaspaceSize. Before Java 8 the equivalent area is PermGen, sized with -XX:MaxPermSize.

StackOverflowError

Occurs when a thread’s stack exceeds -Xss. Reduce recursion depth or increase -Xss cautiously.

Heap Dump Analysis

Generate a heap dump with jmap -dump:format=b,file=heap.hprof <pid> and analyze it using Eclipse Memory Analyzer (MAT). Look at “Leak Suspects”, “Top Consumers”, or “Thread Overview”. Enable automatic dumps on OOM with -XX:+HeapDumpOnOutOfMemoryError.
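A startup script might set the dump options as follows. This is a sketch only: the directory is a hypothetical path, and JAVA_OPTS is assumed to be the variable your launcher passes to the java command.

```shell
# Hypothetical directory; it needs enough free space for a dump roughly
# the size of the heap.
DUMP_DIR="/var/log/myapp/dumps"

# Write a heap dump automatically when the JVM throws OutOfMemoryError.
JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${DUMP_DIR}"

# On-demand alternatives (not run here):
#   jmap -dump:live,format=b,file="${DUMP_DIR}/heap.hprof" <pid>   # live objects only, forces a full GC
#   jmap -histo:live <pid> | head -n 20                            # class histogram, no dump file
```

Taking a dump pauses the application, so on a latency-sensitive service it is usually worth removing the instance from traffic first.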

Off‑Heap Memory

Off‑heap leaks (e.g., DirectByteBuffer) appear as OutOfDirectMemoryError or OutOfMemoryError: Direct buffer memory. Inspect native memory with pmap -x <pid>, and dump a suspicious address range with

gdb --batch --pid <pid> -ex "dump memory dump.bin <addr> <addr+size>"

To track native allocations, start the JVM with -XX:NativeMemoryTracking=summary (or detail) and then run jcmd <pid> VM.native_memory summary (or detail). Cap direct buffers with -XX:MaxDirectMemorySize as needed.
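To find candidate address ranges in the pmap output, it helps to sort the mappings by resident size. A minimal sketch, assuming the usual pmap -x column layout (address, Kbytes, RSS, Dirty, mode, mapping); top_rss_mappings is a hypothetical name.

```shell
# Read `pmap -x <pid>` output on stdin and print the N mappings with the
# largest resident size (RSS, third column, in KB). Only lines that start
# with a hex address are kept, which drops the header and total lines.
top_rss_mappings() {
  grep -E '^[0-9a-f]+ ' | sort -k3,3nr | head -n "${1:-10}"
}

# Typical use (not run here):
#   pmap -x <pid> | top_rss_mappings 30
```

The address and size of a large anonymous mapping from this list can then be used as the range for the gdb memory dump.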

Garbage‑Collection Issues

Enable GC logging with

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps

These flags apply to Java 8 and earlier; Java 9 and later use unified logging (-Xlog:gc*) instead. Analyze Young GC frequency, duration, and Full GC triggers. For G1, consider tuning -XX:G1ReservePercent, -XX:InitiatingHeapOccupancyPercent, and -XX:ConcGCThreads. Use jinfo -flag +HeapDumpBeforeFullGC <pid> and jinfo -flag +HeapDumpAfterFullGC <pid> to compare pre‑ and post‑GC heap states.
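Because the logging flags depend on the JDK version, a launcher script can choose them at startup. The sketch below is one way to do that; gc_log_opts is a hypothetical helper, and the log file name is up to you.

```shell
# Print GC logging flags for a Java major version ($1) and a log file ($2).
# JDK 9+ replaced the PrintGC* flags with unified logging (-Xlog).
gc_log_opts() {
  if [ "$1" -ge 9 ]; then
    printf '%s\n' "-Xlog:gc*:file=$2:time,uptime,level,tags"
  else
    printf '%s\n' "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -Xloggc:$2"
  fi
}

# Typical use (not run here):
#   JAVA_OPTS="$JAVA_OPTS $(gc_log_opts 17 /var/log/myapp/gc.log)"
```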

Network Diagnosis

Network problems are often the most elusive. Distinguish between connection timeout, read/write timeout, and other timeout categories. Keep client‑side timeouts shorter than server‑side values.

TCP Queue Overflow

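Monitor SYN and accept queues with netstat -s | egrep "listen|LISTEN" and ss -lnt. To mitigate overflow, raise net.ipv4.tcp_max_syn_backlog (SYN queue) and net.core.somaxconn (accept queue), along with the backlog the application passes to listen(). The tcp_tw_reuse and tcp_tw_recycle settings affect TIME_WAIT handling rather than queue sizes.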
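On Linux, ss -lnt reports the current accept-queue length in Recv-Q and its configured maximum in Send-Q for sockets in LISTEN state. The sketch below uses that to flag listeners whose queue is full; full_accept_queues is a hypothetical helper name.

```shell
# Read `ss -lnt` output on stdin and print listeners whose accept queue
# (Recv-Q) has reached its limit (Send-Q).
full_accept_queues() {
  awk '$1 == "LISTEN" && $3 + 0 > 0 && $2 + 0 >= $3 + 0 {
         print $4, "queued=" $2, "max=" $3
       }'
}

# Typical use (not run here):
#   ss -lnt | full_accept_queues
```

A queue that is full only for an instant may not show up in a single sample, so the overflow counters from netstat -s remain the more reliable signal.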

RST Packets

RST indicates abrupt connection termination. Common causes: non‑existent port, intentional FIN replacement via SO_LINGER, or peer‑side crashes. Capture RST traffic with tcpdump -i eth0 tcp -w capture.cap and inspect in Wireshark.
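Capturing all TCP traffic can produce a large file on a busy host. A capture filter that keeps only segments with the RST flag set narrows it down. The sketch below builds such a filter; rst_filter is a hypothetical helper, and it assumes IPv4 traffic.

```shell
# Build a tcpdump filter that matches only TCP segments with the RST flag
# set, optionally restricted to one port ($1).
rst_filter() {
  if [ -n "${1:-}" ]; then
    printf '%s\n' "tcp port $1 and (tcp[tcpflags] & tcp-rst != 0)"
  else
    printf '%s\n' "tcp[tcpflags] & tcp-rst != 0"
  fi
}

# Typical use (not run here; needs root):
#   tcpdump -i eth0 -w rst.cap "$(rst_filter 8080)"
```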

TIME_WAIT and CLOSE_WAIT

Use

netstat -n | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'

or ss -ant to count sockets in these states. Reduce excessive TIME_WAIT by enabling net.ipv4.tcp_tw_reuse=1. Avoid net.ipv4.tcp_tw_recycle=1: it breaks clients behind NAT and was removed in Linux 4.12. CLOSE_WAIT usually stems from applications that never close sockets; investigate with jstack to find blocked threads.
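The same per-state count can be taken from ss, whose state name is the first column rather than the last. A minimal sketch, with count_tcp_states as a hypothetical helper name:

```shell
# Read `ss -ant` output on stdin and print "<STATE> <count>" per line,
# sorted by state name. The header line is skipped.
count_tcp_states() {
  awk 'NR > 1 { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# Typical use (not run here):
#   ss -ant | count_tcp_states
#   ss -tan state close-wait     # list only the CLOSE-WAIT sockets
```

Note that ss writes the states with hyphens and abbreviations (TIME-WAIT, CLOSE-WAIT, ESTAB), unlike the underscore names printed by netstat.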

By following this structured inspection—CPU → Disk → Memory → Network—and leveraging the listed commands, engineers can quickly locate and remediate the root cause of most production‑grade Java incidents.
