How to Diagnose and Fix Java CPU, Memory, Disk, and Network Issues Quickly
This guide walks through systematic troubleshooting of Java applications by checking CPU, disk, memory, and network layers, using tools like jstack, jmap, vmstat, iostat, and tcpdump to pinpoint and resolve performance and stability problems.
Online incidents usually involve CPU, disk, memory, and network problems, and most issues span multiple layers, so a systematic check of these four aspects is recommended.
Tools like jstack and jmap are not limited to a single aspect; typically you start with `df`, `free`, and `top`, then use `jstack` and `jmap` as needed.
CPU
First check CPU-related problems, which are usually easier to locate. Causes include business logic loops, frequent GC, and excessive context switches. The most common cause is business or framework logic, which can be analyzed with jstack.
Analyzing CPU issues with jstack
Find the process ID with `ps` (or `top`, which also shows high CPU usage directly). Then list the busiest threads:

`top -H -p <pid>`

Convert the hot thread's ID to hexadecimal:

`printf '%x\n' <tid>`

and search for that hex value in the jstack output:

`jstack <pid> | grep 'nid=0x<hex-tid>' -C5 --color`

Also watch for threads stuck in WAITING or TIMED_WAITING states.
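As a worked example of the thread-ID-to-hex step (the thread ID 12345 here is just a placeholder):

```shell
# Suppose top -H -p <pid> showed a hot thread with TID 12345.
# jstack prints thread IDs in hex in its nid= field, so convert first:
tid=12345
nid=$(printf '%x' "$tid")
echo "$nid"    # -> 3039
# Then locate that thread's stack in the jstack output:
# jstack <pid> | grep "nid=0x$nid" -C5 --color
```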
For an overview of all thread states:

`cat jstack.log | grep "java.lang.Thread.State" | sort | uniq -c | sort -nr`

Frequent GC
Use `jstat -gc <pid> 1000` to monitor GC activity every second. The S0C/S1C, S0U/S1U, EC/EU, OC/OU, and MC/MU columns show the capacity and usage of the Survivor spaces, Eden, the old generation, and metaspace; YGC/YGCT, FGC/FGCT, and GCT show GC counts and cumulative times. If GC runs too frequently, investigate further.
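A quick way to turn those columns into a number is to compute the average young-GC pause as YGCT/YGC. The sketch below assumes JDK 8's `jstat -gc` column layout (YGC is the 13th field, YGCT the 14th); the sample line is fabricated for an offline demo:

```shell
# Live usage would be:
#   jstat -gc <pid> 1000 | tail -1 | awk '{printf "%.1f ms/YGC\n", $14/$13*1000}'
# Offline demo with a captured-style sample line (200 young GCs, 1.5 s total):
sample='5120.0 5120.0 0.0 1024.0 40960.0 20480.0 102400.0 51200.0 56320.0 54000.0 6144.0 5800.0 200 1.500 4 0.800 2.300'
echo "$sample" | awk '{printf "%.1f ms/YGC\n", $14/$13*1000}'   # -> 7.5 ms/YGC
```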
Context switches
Check context switches with `vmstat`; the `cs` column shows switches per second. To monitor a specific process, use `pidstat -w -p <pid> 1`, where the `cswch/s` and `nvcswch/s` columns report voluntary and involuntary switches.
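The same per-process counters are also exposed directly in `/proc`, which is handy when pidstat is not installed (a Linux-only sketch, demonstrated on the current shell):

```shell
# Cumulative context-switch counters for a given PID (here: this shell itself).
pid=$$
grep ctxt_switches "/proc/$pid/status"
# voluntary_ctxt_switches grows when the task blocks (I/O, locks);
# nonvoluntary_ctxt_switches grows when the scheduler preempts it.
```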
Disk
Check disk space with `df -hl`. For performance issues, use `iostat -d -k -x`: the `%util` column shows device utilization, while `r/s`/`w/s` and `rkB/s`/`wkB/s` show read/write request and throughput rates, helping locate the problematic disk.
Identify the process performing heavy I/O with `iotop`. Convert the thread ID (tid) it reports to the owning process directory via:

`readlink -f /proc/*/task/<tid>/../..`

Then inspect the process's I/O counters with `cat /proc/<pid>/io` and its open files with `lsof -p <pid>`.
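The `readlink` trick can be demonstrated on the current shell, since a process's main thread has a TID equal to its PID:

```shell
# /proc/<pid>/task/<tid>/../.. resolves back to /proc/<pid> for whichever
# process owns that thread; the glob finds the owner for us.
tid=$$
proc_dir=$(readlink -f /proc/*/task/"$tid"/../..)
echo "$proc_dir"                  # -> /proc/<pid> owning that thread
head -2 "$proc_dir/io"            # rchar/wchar: bytes read/written so far
```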
Memory
Memory issues are more varied. Start with `free` to view overall usage.
Heap memory
Common OOM errors include:
- `java.lang.OutOfMemoryError: unable to create new native thread` – insufficient native memory for thread stacks; check thread pool usage and consider reducing `-Xss` or raising OS limits.
- `java.lang.OutOfMemoryError: Java heap space` – the heap reached `-Xmx`; look for leaks with jmap before simply increasing the heap.
- `java.lang.OutOfMemoryError: Metaspace` – metaspace reached its limit; adjust `-XX:MaxMetaspaceSize`.
- `java.lang.StackOverflowError` – a thread's stack exceeded `-Xss`; rule out unbounded recursion in the code before raising `-Xss`.
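For the "unable to create new native thread" case, a few quick OS-side checks narrow things down (a Linux sketch, demonstrated on the current shell):

```shell
# Quick checks when "unable to create new native thread" appears:
ulimit -u                           # per-user process/thread limit
cat /proc/sys/kernel/threads-max    # system-wide thread cap
ls /proc/$$/task | wc -l            # live thread count of a process (this shell)
# If the JVM's thread count is near either limit, fix the thread pool
# (or raise the limit) rather than touching heap settings.
```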
Using JMAP to locate leaks
Export a heap dump with:

`jmap -dump:format=b,file=heap.bin <pid>`

Analyze the dump with Eclipse MAT, focusing on the "Leak Suspects" or "Top Consumers" reports.
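A manual dump can miss the moment of failure. With HotSpot you can also arm an automatic dump at OOM time (standard HotSpot flags; `app.jar` is a placeholder for your application):

```shell
# Write a heap dump automatically the instant an OutOfMemoryError is thrown,
# then analyze /tmp/heap.hprof in MAT just like a manual jmap dump.
java -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/tmp/heap.hprof \
     -jar app.jar
```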
GC and threads
Monitor GC frequency with `jstat`. Excessive young GC may indicate an Eden space that is too small; adjust `-Xmn` or `-XX:SurvivorRatio`. Long GC pauses can be diagnosed from the GC logs (e.g., the G1 phases Root Scanning, Object Copy, and Ref Proc).
Full GC often signals problems such as concurrent-marking failures, promotion failures, or failed large-object allocations. Reduce explicit `System.gc()` calls, and consider capturing heap dumps around full GC with `-XX:+HeapDumpBeforeFullGC` and `-XX:+HeapDumpAfterFullGC`.
Network
Network issues are complex. Timeouts are divided into connection timeout, read/write timeout, and others (e.g., connectionAcquireTimeout, idleConnectionTimeout). Keep client timeouts shorter than server timeouts.
TCP queue overflow
Overflow can occur in either the SYN queue or the accept queue, leading to dropped connections or RST packets. Monitor with:

`netstat -s | egrep "listen|LISTEN"` and `ss -lnt`

Adjust queue sizes via the application's `backlog` (`acceptCount` in Tomcat) and the kernel parameters `somaxconn` and `tcp_max_syn_backlog`.
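A sketch of reading those signals on Linux (the live `ss`/`netstat` commands are shown as comments since they need a running listener to be interesting):

```shell
# For LISTEN sockets, ss's Send-Q column is the effective backlog limit and
# Recv-Q is the current accept-queue depth:
#   ss -lnt
# Cumulative overflow/drop counters since boot:
#   netstat -s | egrep -i 'listen'
# Kernel limits backing the backlog (raise via sysctl if overflows occur):
cat /proc/sys/net/core/somaxconn
cat /proc/sys/net/ipv4/tcp_max_syn_backlog
```

Note that the effective accept-queue size is the smaller of the application's `backlog` and `somaxconn`, so raising only one of them may have no effect.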
RST packets
RST indicates an abrupt connection reset, often caused by connecting to a closed port, forced termination, or a peer that has lost its TCP connection state. Capture the traffic with `tcpdump` and analyze it in Wireshark.
TIME_WAIT and CLOSE_WAIT
TIME_WAIT exists so that delayed packets from a closed connection are handled safely; excessive counts can be mitigated by enabling `net.ipv4.tcp_tw_reuse=1` or by adjusting `tcp_max_tw_buckets`. Avoid `net.ipv4.tcp_tw_recycle=1`: it breaks clients behind NAT and was removed in Linux 4.12.
CLOSE_WAIT usually results from applications not closing sockets properly; investigate with jstack to find blocked threads.
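To see at a glance whether TIME_WAIT or CLOSE_WAIT is piling up, a small pipeline over `ss` output works; here it is wrapped in a hypothetical helper (`count_states` is not a standard tool) with an offline demo on fabricated input:

```shell
# Summarize TCP connection states; feed it `ss -ant` (or `netstat -ant`) output.
count_states() {
  # Skip the header line, take the state column, count occurrences.
  awk 'NR > 1 { print $1 }' | sort | uniq -c | sort -rn
}
# Live usage:  ss -ant | count_states
# Offline demo with ss-style output:
printf 'State\nESTAB\nESTAB\nCLOSE-WAIT\nTIME-WAIT\nTIME-WAIT\nTIME-WAIT\n' | count_states
```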
Source: https://fredal.xin/java-error-check
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles on operations.