
How to Diagnose and Fix Java CPU, Memory, Disk, and Network Issues Quickly

This guide walks through systematic troubleshooting of Java applications by checking CPU, disk, memory, and network layers, using tools like jstack, jmap, vmstat, iostat, and tcpdump to pinpoint and resolve performance and stability problems.


Online incidents usually involve CPU, memory, disk, or network problems, and most issues span more than one of these layers, so a systematic check of all four is recommended.

Tools like jstack and jmap are not tied to a single layer; in practice you start with df, free, and top, then reach for jstack and jmap as needed.

CPU

First check CPU-related problems, which are usually easier to locate. Causes include business logic loops, frequent GC, and excessive context switches. The most common cause is business or framework logic, which can be analyzed with jstack.

Analyzing CPU issues with jstack

Find the process ID with

ps

(or

top

to see which process is consuming CPU). Then run:

top -H -p <pid>

to list that process's threads by CPU usage. Convert the busiest thread ID to hexadecimal:

printf '%x\n' <tid>

and search for that value in the jstack output (thread IDs appear in the nid field):

jstack <pid> | grep 'nid=0x<hex-tid>' -C5 --color

A thread that is actually burning CPU will normally be RUNNABLE; large numbers of threads stuck in WAITING or TIMED_WAITING point to lock contention or blocking rather than CPU load. You can get an overview of thread states with:

grep "java.lang.Thread.State" jstack.log | sort | uniq -c | sort -nr
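The steps above can be strung together into a small helper. This is a sketch, not a standard tool: `to_hex` and `busy_java_threads` are names invented here, and the `NR>7` header skip assumes the usual `top -b` batch output layout.

```shell
#!/bin/sh
# Sketch: pid -> hottest threads -> matching jstack frames.
# busy_java_threads is a hypothetical helper name; jstack must be on PATH.

to_hex() {
    # Convert a decimal thread id to the hex form used in jstack's nid= field.
    printf '%x' "$1"
}

busy_java_threads() {
    pid="$1"
    # -H shows individual threads; -b -n1 takes a single batch sample.
    # NR>7 skips top's summary header (layout may vary across versions).
    top -H -b -n1 -p "$pid" | awk 'NR>7 {print $1}' | head -3 |
    while read -r tid; do
        echo "=== thread $tid (nid=0x$(to_hex "$tid")) ==="
        jstack "$pid" | grep "nid=0x$(to_hex "$tid")" -A5
    done
}

# Example usage (only meaningful if a Java process is running):
#   busy_java_threads "$(pgrep -f java | head -1)"
```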

Frequent GC

Use

jstat -gc <pid> 1000

to monitor the generations every second. Columns S0C/S1C, S0U/S1U, EC/EU, OC/OU, and MC/MU show the capacity and usage of the Survivor spaces, Eden, the old generation, and metaspace. YGC/YGCT, FGC/FGCT, and GCT show GC counts and accumulated times. If GC is too frequent, investigate further.
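As a quick worked example, the Eden columns can be turned into a utilization percentage. The numbers below are fabricated for illustration; in practice you would pipe live `jstat -gc <pid> 1000` output into the same awk program (EC is the 5th field, EU the 6th).

```shell
# Compute Eden utilization from one line of `jstat -gc` output.
# The sample line is fabricated; field order follows jstat -gc:
# S0C S1C S0U S1U EC EU OC OU MC MU ...
sample='1024.0 1024.0 0.0 512.0 8192.0 6144.0 20480.0 10240.0 4864.0 4608.0'
eden_pct=$(echo "$sample" | awk '{printf "%d", $6 / $5 * 100}')
echo "Eden utilization: ${eden_pct}%"
```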

Context switches

Check context switches with

vmstat

. The

cs

column shows the number of switches per second. To monitor a specific process, use:

pidstat -w -p <pid> 1

where

cswch/s

and

nvcswch/s

report voluntary and involuntary context switches per second.
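A minimal sketch of reading the cs column follows. The vmstat sample embedded below is fabricated, and the field index 12 assumes the classic vmstat column layout (r b swpd free buff cache si so bi bo in cs ...).

```shell
# Extract the context-switch (cs) column, the 12th field, from a
# captured vmstat sample. The sample output is fabricated; on a live
# system use: vmstat 1 2 | tail -1 | awk '{print $12}'
sample_vmstat() {
cat <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812345  10240 204800    0    0     1     2  350 4200  5  3 91  1  0
EOF
}
cs=$(sample_vmstat | tail -1 | awk '{print $12}')
echo "context switches/s: $cs"
```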

Disk

Check disk space with

df -hl

. For performance issues, use

iostat -d -k -x

. The

%util

column shows device utilization; rkB/s and wkB/s show read/write throughput, while

rrqm/s

and

wrqm/s

show how many read/write requests are merged per second. Together these help locate the problematic disk.

Identify the thread performing I/O with

iotop

. Convert a thread ID (tid) to its owning PID via:

readlink -f /proc/*/task/<tid>/../..

Then inspect the process's I/O counters:

cat /proc/<pid>/io

and its open files with

lsof -p <pid>

.
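The readlink trick can be sanity-checked without iotop by using the current shell, whose PID doubles as its main thread ID on Linux:

```shell
# Demonstrate the tid -> pid conversion on the current shell.
# On Linux, a process's main thread has tid == pid, so $$ works as a tid.
tid=$$
# /proc/<pid>/task/<tid>/../.. resolves to /proc/<pid>
owner=$(readlink -f "/proc/$tid/task/$tid/../..")
pid=$(basename "$owner")
echo "tid $tid belongs to pid $pid"
```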

Memory

Memory issues are more varied. Start with

free

to view overall usage.

Heap memory

Common OOM errors include:

java.lang.OutOfMemoryError: unable to create new native thread – insufficient native memory for thread stacks; check thread-pool sizing and consider reducing

-Xss

or raising the OS thread limits.

java.lang.OutOfMemoryError: Java heap space – the heap has reached

-Xmx

; look for leaks with jmap and MAT before simply increasing the heap.

java.lang.OutOfMemoryError: Metaspace – metaspace has reached its limit; adjust

-XX:MaxMetaspaceSize

after checking for classloader leaks.

java.lang.StackOverflowError – a thread's stack exceeded

-Xss

; fix the runaway recursion or deep call chain first, and only raise

-Xss

if the depth is legitimate.
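For heap-space errors in particular, it is worth arming the JVM ahead of time so the evidence is captured automatically. The flags below are standard HotSpot options, but the sizes and the dump path are illustrative choices, not recommendations:

```shell
# Illustrative HotSpot launch flags (sizes and paths are examples only):
java -Xms4g -Xmx4g -Xss512k \
     -XX:MaxMetaspaceSize=256m \
     -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/var/log/app/heap.hprof \
     -jar app.jar
```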

Using JMAP to locate leaks

Export a heap dump with:

jmap -dump:format=b,file=heap.bin <pid>

Analyze the dump with Eclipse MAT, focusing on the "Leak Suspects" and "Top Consumers" reports.
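Before loading a multi-gigabyte dump into MAT, a class histogram often gives a fast first answer. The histogram lines below are fabricated; against a live process you would use `jmap -histo:live <pid> | head -20`.

```shell
# Quick triage: sum the bytes of the top classes in `jmap -histo` output.
# The sample is fabricated for illustration.
sample_histo() {
cat <<'EOF'
 num     #instances         #bytes  class name
----------------------------------------------
   1:        500000       24000000  [B
   2:        400000       16000000  java.lang.String
   3:        100000        8800000  java.util.HashMap$Node
EOF
}
# Match only the numbered rows ("1:", "2:", ...) and sum column 3.
top_bytes=$(sample_histo | awk '$1 ~ /^[0-9]+:$/ {sum += $3} END {print sum}')
echo "bytes in top classes: $top_bytes"
```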

GC and threads

Monitor GC frequency with

jstat

. Excessive young GC may indicate Eden is too small; adjust

-Xmn

or

-XX:SurvivorRatio

. Long GC pauses can be diagnosed by examining the GC logs (e.g., G1 phases such as Root Scanning, Object Copy, and Ref Proc).

Frequent full GC often signals problems such as concurrent-marking failures, promotion failures, or large-object allocation failures. Remove explicit

System.gc()

calls where possible, and consider capturing heap dumps around full GC with

-XX:+HeapDumpBeforeFullGC

and

-XX:+HeapDumpAfterFullGC

.
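GC logs are enabled with `-Xlog:gc*` on JDK 9+ (or `-XX:+PrintGCDetails` on JDK 8). As a sketch, pause times can be pulled out of the unified-format lines like this; the log line below is fabricated:

```shell
# Extract the pause duration from a unified-logging G1 line.
# The sample line is fabricated for illustration.
sample_gc_log() {
cat <<'EOF'
[12.345s][info][gc] GC(42) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(1024M) 15.234ms
EOF
}
# Grab the token ending in "ms", then strip the unit characters.
pause_ms=$(sample_gc_log | grep -o '[0-9.]*ms' | tr -d 'ms')
echo "pause: ${pause_ms} ms"
```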

Network

Network issues are complex. Timeouts are divided into connection timeout, read/write timeout, and others (e.g., connectionAcquireTimeout, idleConnectionTimeout). Keep client timeouts shorter than server timeouts.

TCP queue overflow

Overflow can occur in the SYN queue or the accept queue, leading to RST packets or dropped connections. Monitor with:

netstat -s | egrep "listen|LISTEN"

and

ss -lnt

Adjust application queue sizes via

backlog

(or

acceptCount

in Tomcat) and the kernel parameters

somaxconn

and

tcp_max_syn_backlog

.
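In `ss -lnt` output for listening sockets, Recv-Q is the current accept-queue depth and Send-Q is its limit, so a quick check for saturated listeners looks like this (the sample output is fabricated):

```shell
# Flag listeners whose accept queue is at or over its limit.
# For LISTEN sockets, ss reports queue depth in Recv-Q ($2) and the
# queue limit in Send-Q ($3). Sample output is fabricated.
sample_ss() {
cat <<'EOF'
State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port
LISTEN  129     128     0.0.0.0:8080        0.0.0.0:*
LISTEN  0       128     0.0.0.0:22          0.0.0.0:*
EOF
}
full=$(sample_ss | awk 'NR>1 && $2 >= $3 {print $4}')
echo "saturated listeners: $full"
```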

RST packets

RST indicates an abrupt connection reset, often caused by connecting to a closed port, forced termination, or a peer that no longer holds the TCP connection state. Capture the packets with

tcpdump

and analyze the trace in Wireshark.

TIME_WAIT and CLOSE_WAIT

TIME_WAIT exists so that delayed packets from an old connection are not misread by a new one; excessive counts can be mitigated by enabling

net.ipv4.tcp_tw_reuse=1

or by adjusting

tcp_max_tw_buckets

. Avoid

net.ipv4.tcp_tw_recycle

: it breaks clients behind NAT and was removed from Linux in kernel 4.12.

CLOSE_WAIT usually results from applications not closing sockets properly; investigate with jstack to find blocked threads.
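A quick way to see how many connections sit in each state is to count the first column of `ss -ant`. The helper below is fed a fabricated sample, but works the same on live output:

```shell
# Count TCP connections per state; on a live system, pipe real output:
#   ss -ant | count_states
count_states() {
    awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
}
# Fabricated sample standing in for `ss -ant` output.
sample_conns() {
cat <<'EOF'
State      Recv-Q Send-Q Local Address:Port Peer Address:Port
TIME-WAIT  0      0      10.0.0.1:80        10.0.0.2:50001
TIME-WAIT  0      0      10.0.0.1:80        10.0.0.2:50002
CLOSE-WAIT 0      0      10.0.0.1:80        10.0.0.3:50003
EOF
}
sample_conns | count_states
```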

Source: https://fredal.xin/java-error-check

Tags: Java, performance, network, troubleshooting, CPU, memory, GC
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.