Diagnosing Common Java Server Issues: CPU, Memory, Disk & Network

This guide walks through systematic troubleshooting of Java server problems, including CPU spikes, memory leaks, disk I/O bottlenecks, and network timeouts. It uses native Linux tools and JVM utilities such as ps, top, jstack, jstat, iostat, vmstat, and netstat to pinpoint root causes and apply targeted fixes.

Efficient Ops

Apr 27, 2021

Production faults usually involve CPU, disk, memory, or network problems, and most incidents span more than one of these layers, so check all four systematically. Tools like df, free, top, jstack, and jmap are useful for initial diagnosis.

CPU

Typical CPU issues stem from business logic errors (e.g., infinite loops), frequent GC, or excessive context switches. Use ps to locate the process ID, then run top -H -p PID to find threads with high CPU usage.

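Convert the busy thread's ID (the PID column in top -H output is actually a thread ID) to hexadecimal to obtain the nid that jstack prints:

printf '%x\n' TID

Then search the thread dump for that nid:

jstack PID | grep 'NID' -C5 --color

A thread that is burning CPU normally shows up as RUNNABLE. When analyzing the whole dump, also pay attention to threads in the WAITING, TIMED_WAITING, and BLOCKED states, since a large number of them points to contention or a stalled dependency; you can get an overview with: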

grep "java.lang.Thread.State" jstack.log | sort | uniq -c | sort -nr
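The lookup and the state overview can be wrapped in small helpers. This is a minimal sketch that assumes the thread dump was saved beforehand (for example with jstack PID > jstack.log); the function names tid_to_nid, thread_stack, and thread_states are illustrative and not part of any JDK tool.

```shell
# Convert a decimal thread ID (as shown by `top -H`) to the hex nid format jstack prints.
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# Print the dump entry for one thread.
# $1 = path to a saved jstack dump, $2 = decimal thread ID
thread_stack() {
  nid=$(tid_to_nid "$2")
  # -w keeps 0x3039 from also matching 0x30391; -A 10 prints the stack frames that follow
  grep -w -A 10 "nid=${nid}" "$1"
}

# Count threads per state in a saved dump, most frequent first.
# $1 = path to a saved jstack dump
thread_states() {
  grep 'java.lang.Thread.State' "$1" | awk '{print $2}' | sort | uniq -c | sort -rn
}
```

For example, thread_stack jstack.log 12345 searches for nid=0x3039, and thread_states jstack.log lists one line per state with its thread count.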

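Frequent GC can be examined with jstat -gc PID 1000, which prints a sample every 1000 ms. The columns report capacity and usage in KB for each region: S0C/S0U and S1C/S1U (survivor spaces), EC/EU (Eden), OC/OU (old generation), and MC/MU (metaspace), followed by YGC/YGCT (young GC count and total time), FGC/FGCT (full GC count and total time), and GCT (total GC time).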
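Reading these columns by eye gets tedious over many samples. The sketch below is one possible way to summarize jstat -gc output: it looks columns up by header name rather than by position, because the exact column set differs between JDK versions. The function name gc_summary is hypothetical.

```shell
# Read `jstat -gc` output on stdin and print, per sample, the old-generation
# utilization, the full-GC count, and the average full-GC time.
gc_summary() {
  awk '
    # First line is the header: remember the position of every column name.
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    {
      ou = $(col["OU"]); oc = $(col["OC"])
      fgc = $(col["FGC"]); fgct = $(col["FGCT"])
      pct = (oc > 0) ? 100 * ou / oc : 0
      avg = (fgc > 0) ? fgct / fgc : 0
      printf "old=%.1f%% fgc=%d avg_fgc=%.3fs\n", pct, fgc, avg
    }'
}
```

A typical call would be jstat -gc PID 1000 | gc_summary. An old-generation percentage that stays high after full GCs, together with a rising full-GC count, is a reason to look for a leak.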

Context Switches

Use vmstat to view the cs column (context switches per second). For a specific process, pidstat -w -p PID shows voluntary and involuntary switches per second (cswch/s and nvcswch/s).
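When pidstat is not installed, the kernel's own counters can be read directly. This is a rough sketch, assuming a Linux /proc filesystem; the function name ctxt_switches is hypothetical. Unlike pidstat's per-second rates, these counters are cumulative, so compare two readings taken a few seconds apart. They are also kept per task, so for a multithreaded JVM the per-thread files under /proc/PID/task/TID/status are usually more telling than /proc/PID/status.

```shell
# Print the cumulative context-switch counters from a /proc status file.
# $1 = path to the status file, e.g. /proc/PID/status or /proc/PID/task/TID/status
ctxt_switches() {
  awk '
    /^voluntary_ctxt_switches:/    { v = $2 }
    /^nonvoluntary_ctxt_switches:/ { n = $2 }
    END { print "voluntary=" v, "involuntary=" n }' "$1"
}
```

A fast-growing involuntary count suggests threads are being preempted (CPU contention), while a fast-growing voluntary count suggests threads are frequently blocking on locks or I/O.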

Disk

Check disk space with df -hl. For performance analysis, use iostat -d -k -x and examine %util (the share of time the device was busy) together with rrqm/s and wrqm/s (merged read and write requests per second) to locate the problematic disk.
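On hosts with many devices, the %util check can be filtered automatically. A minimal sketch, assuming the extended iostat format with a header line that starts with Device; the function name busy_disks and the default threshold of 80 are arbitrary choices for illustration.

```shell
# Read `iostat -d -k -x` output on stdin and print devices whose %util exceeds a limit.
# $1 = threshold in percent (defaults to 80)
busy_disks() {
  awk -v limit="${1:-80}" '
    # Locate the %util column from the header, since its position varies by sysstat version.
    /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") u = i; next }
    # Skip blank lines and anything before the first header.
    u && NF >= u && $u + 0 > limit { print $1, $u }'
}
```

For example, iostat -d -k -x 1 3 | busy_disks 90 prints only the devices that were more than 90% busy in a sample.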

Identify the responsible process with iotop. It reports thread IDs (tid); convert a tid to its PID using:

readlink -f /proc/*/task/TID/../..

Then inspect the process's I/O details:

cat /proc/PID/io

List its open files with lsof -p PID.

Memory

Start with free to view overall memory usage. Common memory problems include OOM (OutOfMemoryError) and StackOverflowError.

OOM Types

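Unable to create new native thread: the JVM could not obtain another OS thread, usually because thread-pool misuse lets threads pile up or because an OS limit or native memory is exhausted. Check the code first, inspect the process with jstack or jmap, then consider reducing -Xss or raising OS limits.

Java heap space: heap usage reached the -Xmx limit; look for leaks with jstack and jmap before increasing -Xmx.

Metaspace: metaspace usage reached -XX:MaxMetaspaceSize; adjust it with -XX:MaxMetaspaceSize (or -XX:MaxPermSize on JDK 7 and earlier, where the region is PermGen).

StackOverflowError: a thread's stack grew beyond the -Xss size, most often through runaway recursion; fix the code first and adjust -Xss cautiously.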
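To see which of these errors an application is actually hitting, the messages can be counted in its log. A minimal sketch, assuming the application log contains the standard java.lang.OutOfMemoryError: lines; the function name oom_types is hypothetical.

```shell
# Count OutOfMemoryError messages by type, most frequent first.
# $1 = path to an application log
oom_types() {
  grep -o 'java\.lang\.OutOfMemoryError: .*' "$1" | sort | uniq -c | sort -rn
}
```

The exact message text varies somewhat between JDK versions, so treat the grouping as approximate.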

Generate a heap dump automatically on OOM with -XX:+HeapDumpOnOutOfMemoryError, or take one manually from a running process:

jmap -dump:format=b,file=heap.hprof PID

Analyze the dump with MAT (Memory Analyzer Tool), starting with Leak Suspects or Top Consumers to locate the leak.

GC Issues

Enable detailed GC logs with:

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps

These flags apply to JDK 8 and earlier; JDK 9 and later use unified logging (-Xlog:gc*) instead. Frequent young GC usually indicates many short-lived objects; consider increasing -Xmn or adjusting -XX:SurvivorRatio. Long-running young GC requires analysis of log phases such as Root Scanning, Object Copy, and Ref Proc.
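A quick count of young versus full collections gives a first impression before reading the log in detail. This is a rough sketch that assumes a JDK 8-style log, where full collections are written as "[Full GC" and young collections as "[GC "; other collectors and the JDK 9+ unified format use different markers. The function name gc_counts is hypothetical.

```shell
# Count young and full collections in a JDK 8-style GC log.
# $1 = path to the GC log
gc_counts() {
  awk '
    /Full GC/ { full++; next }
    /\[GC /   { young++ }
    END { printf "young=%d full=%d\n", young, full }' "$1"
}
```

Dividing the counts by the time span the log covers gives a collection frequency to compare before and after a tuning change.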

Full GC triggers include concurrent phase failure, promotion failure, large object allocation failure, and explicit System.gc(). Use jinfo to enable heap dumps before/after Full GC:

jinfo -flag +HeapDumpBeforeFullGC PID
jinfo -flag +HeapDumpAfterFullGC PID

Network

Network problems are complex and often the hardest to diagnose. Common categories are timeouts, TCP queue overflow, RST packets, TIME_WAIT, and CLOSE_WAIT.

Timeouts

Distinguish between connection timeout, read/write timeout, connection‑acquire timeout, and idle‑connection timeout. Keep client‑side timeouts shorter than server‑side values.

TCP Queue Overflow

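Two queues exist: the SYN queue (half-open connections) and the accept queue (fully established connections waiting for the application to accept them). When the accept queue is full, the kernel's behavior depends on tcp_abort_on_overflow: with 0 (the default) it drops the third handshake packet so the client retries, and with 1 it sends an RST.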

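Check overflow statistics with:

netstat -s | egrep "listen|LISTEN"

Inspect current queue sizes with:

ss -lnt

For listening sockets, Recv-Q is the current accept queue length and Send-Q is its maximum. The accept queue size is min(backlog, somaxconn), where backlog is set by the server (Tomcat: acceptCount, Jetty: acceptQueueSize) and somaxconn by /proc/sys/net/core/somaxconn; the SYN queue size depends on /proc/sys/net/ipv4/tcp_max_syn_backlog.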
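The ss check can be reduced to a filter that lists only listeners whose accept queue has reached its limit. A minimal sketch, assuming the usual ss -lnt column order (State, Recv-Q, Send-Q, Local Address:Port) and that, for listening sockets, Recv-Q is the current queue length and Send-Q the maximum; the function name full_accept_queues is hypothetical.

```shell
# Read `ss -lnt` output on stdin and print listeners whose accept queue is at or over its limit.
full_accept_queues() {
  awk 'NR > 1 && $1 == "LISTEN" && $3 > 0 && $2 >= $3 {
         print $4, "queued=" $2, "limit=" $3
       }'
}
```

Running ss -lnt | full_accept_queues during a traffic peak shows whether the backlog setting is the bottleneck; an empty result means no listener was saturated at that instant.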

RST Packets

RST indicates an abnormal connection reset, often caused by closed ports, explicit termination, or queue overflow. Capture packets with tcpdump and analyze in Wireshark.

TIME_WAIT and CLOSE_WAIT

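TIME_WAIT ensures delayed packets are handled safely. Excessive TIME_WAIT on a host that opens many outbound connections can be mitigated by enabling reuse, which also requires TCP timestamps to be on:

net.ipv4.tcp_tw_reuse = 1

Older guides also enable fast recycle with net.ipv4.tcp_tw_recycle = 1, but that setting breaks connections from clients behind NAT and was removed in Linux 4.12, so leave it off.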

Adjust tcp_max_tw_buckets cautiously, as lowering it may cause "time wait bucket table overflow" errors.

CLOSE_WAIT usually results from applications not closing sockets properly; use jstack to locate threads stuck in await() or similar calls.
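Before tuning anything, it helps to know how many sockets are in each state. A minimal sketch that summarizes ss -ant output, assuming the state is the first column and the first line is a header; the function name tcp_state_counts is hypothetical. Note that ss spells the states with hyphens (TIME-WAIT, CLOSE-WAIT).

```shell
# Read `ss -ant` output on stdin and print the number of sockets per TCP state, largest first.
tcp_state_counts() {
  awk 'NR > 1 { n[$1]++ } END { for (s in n) print n[s], s }' | sort -rn
}
```

For example, ss -ant | tcp_state_counts gives one line per state. A CLOSE-WAIT count that keeps growing points at the application not closing sockets, while a large TIME-WAIT count is often normal on a busy client.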

Overall, systematic use of Linux monitoring tools combined with JVM diagnostics enables rapid identification and resolution of performance bottlenecks across CPU, memory, disk, and network layers.
