
Master Java Production Troubleshooting: CPU, Memory, Disk & Network Fixes

This guide walks you through systematic troubleshooting of Java production issues—covering CPU, memory, disk, GC, and network problems—by using tools like jstack, jmap, top, iostat, vmstat, and netstat to pinpoint root causes and apply targeted fixes.

ITFLY8 Architecture Home

Production incidents usually involve CPU, disk, memory, or network problems, and most incidents span more than one of these layers. When diagnosing, check the four areas in turn. Tools such as jstack, jmap, df, free, and top are useful across layers; which ones to reach for depends on the case.

CPU

Check the CPU first, because CPU anomalies are usually the easiest to locate. Common causes are business logic problems (such as infinite loops), frequent GC, and excessive context switching. The most common cause is business or framework logic, which can be analyzed with jstack.

Using jstack to analyze CPU problems

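Find the process PID with ps or top, then list the threads of that process that use the most CPU:

top -H -p pid

Convert the ID of the busy thread (tid) to hexadecimal to get the nid that jstack prints:

printf '%x\n' tid

Then search the jstack output for that nid:

jstack pid | grep 'nid' -C5 --color

The matching stack trace shows what the busy thread is executing. Beyond that single thread, look at the whole dump: WAITING and TIMED_WAITING threads deserve attention, and BLOCKED threads go without saying, because a large number of them usually points to lock contention. You can get an overview with: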

grep "java.lang.Thread.State" jstack.log | sort | uniq -c | sort -nr
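The two steps above (thread ID to nid, then a per-state count) can be wrapped in small helpers. This is a minimal sketch, not part of the original article: the function names tid_to_nid and thread_state_summary are hypothetical, and the dump excerpt is hand-written sample data rather than real jstack output.

```shell
#!/bin/sh
# Convert a decimal thread ID (as shown by `top -H`) to the hex nid used by jstack.
tid_to_nid() {
    printf '0x%x\n' "$1"
}

# Count threads per state in a saved jstack dump, most frequent state first.
thread_state_summary() {
    grep 'java.lang.Thread.State' "$1" \
        | awk '{ print $2 }' \
        | sort | uniq -c | sort -nr \
        | awk '{ print $1, $2 }'
}

# Demo on a tiny hand-written excerpt (hypothetical data, not a real dump).
sample_dump=$(mktemp)
cat > "$sample_dump" <<'EOF'
"worker-1" #12 prio=5 os_prio=0 tid=0x00007f1 nid=0x3039 runnable
   java.lang.Thread.State: RUNNABLE
"worker-2" #13 prio=5 os_prio=0 tid=0x00007f2 nid=0x303a waiting on condition
   java.lang.Thread.State: TIMED_WAITING (sleeping)
"worker-3" #14 prio=5 os_prio=0 tid=0x00007f3 nid=0x303b runnable
   java.lang.Thread.State: RUNNABLE
EOF

nid=$(tid_to_nid 12345)                  # 12345 decimal is 0x3039
grep -A1 "nid=$nid " "$sample_dump"      # show the thread header and its state
thread_state_summary "$sample_dump"      # prints "2 RUNNABLE" then "1 TIMED_WAITING"
rm -f "$sample_dump"
```

Against a live process, the same helpers would be fed by jstack pid > jstack.log instead of the sample file.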

Frequent GC

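Use jstat to see whether GC is too frequent:

jstat -gc pid 1000

The last argument is the sampling interval in milliseconds. S0C/S1C, EC, OC, and MC are the capacities (in KB) of the two survivor spaces, Eden, the old generation, and metaspace; the matching S0U/S1U, EU, OU, and MU columns show how much of each is in use. YGC/YGCT are the count and total time of young GCs, FGC/FGCT are the count and total time of full GCs, and GCT is the total GC time. If GC is frequent, investigate further.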
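To judge "too frequent", it helps to compare the GC counters between the first and last sample. The following is a rough sketch under stated assumptions: gc_delta is a hypothetical helper name, the sample numbers are invented, and the column layout shown is the JDK 8 one (the helper looks columns up by header name, so extra columns in newer JDKs should not matter).

```shell
#!/bin/sh
# Read `jstat -gc <pid> <interval> <count>` output on stdin and print how many
# young and full collections happened between the first and the last sample.
gc_delta() {
    awk '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # header: name -> column
        NR == 2 { y0 = $(col["YGC"]); f0 = $(col["FGC"]) }        # first sample
                { y1 = $(col["YGC"]); f1 = $(col["FGC"]) }        # latest sample
        END     { if (NR >= 2) printf "YGC+%d FGC+%d\n", y1 - y0, f1 - f0 }
    '
}

# Demo with invented numbers; prints "YGC+4 FGC+2".
gc_delta <<'EOF'
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
1024.0 1024.0  0.0   512.0  8192.0   4096.0   20480.0    10240.0  4864.0 4500.0 512.0  480.0      10    0.120   1      0.050    0.170
1024.0 1024.0 512.0   0.0   8192.0   1024.0   20480.0    15360.0  4864.0 4500.0 512.0  480.0      14    0.170   3      0.250    0.420
EOF
```

On a live JVM this would be used as jstat -gc pid 1000 10 | gc_delta, which samples once per second for ten seconds.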

Context switches

Check context switches with vmstat; the cs column shows the number of context switches per second. To monitor a specific process, use:

pidstat -w -p pid

The cswch/s and nvcswch/s columns show voluntary and involuntary context switches per second.
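When pidstat is not installed, Linux exposes the same counters as cumulative totals in /proc/pid/status. This is an alternative not mentioned in the original article; ctxt_switches is a hypothetical helper name, and the sketch assumes a Linux /proc filesystem.

```shell
#!/bin/sh
# Print the cumulative voluntary and involuntary context switch counts
# recorded in a /proc/<pid>/status file.
ctxt_switches() {
    awk -F: '
        $1 == "voluntary_ctxt_switches"    { v = $2 + 0 }
        $1 == "nonvoluntary_ctxt_switches" { n = $2 + 0 }
        END { printf "voluntary=%d nonvoluntary=%d\n", v, n }
    ' "$1"
}

# Demo: report the counters of the awk process itself, if /proc is available.
if [ -r /proc/self/status ]; then
    ctxt_switches /proc/self/status
fi
```

Because the values are totals since the process started, take two readings a few seconds apart and compare them to get a rate.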

Disk

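Disk problems are as fundamental as CPU problems. Check free space with:

df -hl

For performance, use iostat:

iostat -d -k -x

Key columns: %util is the percentage of time the device was busy, and rrqm/s and wrqm/s are the read and write requests merged per second (with -k, throughput appears in rkB/s and wkB/s). Together they show which disk is under pressure.

To identify the process performing the I/O, use iotop. iotop reports thread IDs, so map the tid to its pid with:

readlink -f /proc/*/task/tid/../..

Then inspect that process's I/O with cat /proc/pid/io, or list its open files with lsof -p pid.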
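The free-space check is easy to automate. The sketch below is illustrative only: full_filesystems is a hypothetical helper, the 90 percent threshold is an arbitrary choice, and the input is assumed to be in the POSIX df -P layout (mount points containing spaces are not handled).

```shell
#!/bin/sh
# Read `df -P` output on stdin and print "mountpoint usage" for every
# filesystem whose usage is at or above the given percentage.
full_filesystems() {
    awk -v t="$1" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)                 # "95%" -> "95"
            if (use + 0 >= t + 0) print $6, $5
        }
    '
}

# Demo on invented sample data; prints "/ 95%".
full_filesystems 90 <<'EOF'
Filesystem     1024-blocks     Used Available Capacity Mounted on
/dev/sda1         51475068 48901314   2573754      95% /
/dev/sdb1        103081248 10308124  92773124      10% /data
EOF
```

On a real host the input would come from df -P | full_filesystems 90.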

Memory

Memory problems include OOM, GC issues, and off‑heap memory. Start with free to view usage.

Heap memory

OOM can appear as:

Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread

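The JVM could not create another thread, either because there is not enough memory for its stack or because an OS limit was reached. This is usually caused by thread-pool misuse, so check the code first. If the code is fine, reduce the per-thread stack size with -Xss or raise the OS limits in /etc/security/limits.conf.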

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

The heap has reached the -Xmx limit. Look for leaks first with jstack and jmap; if the usage turns out to be legitimate, increase -Xmx.

Caused by: java.lang.OutOfMemoryError: Metaspace

Metaspace has reached MaxMetaspaceSize. Adjust it with -XX:MaxMetaspaceSize (on JDK 7 and earlier the equivalent area is PermGen, sized with -XX:MaxPermSize).

StackOverflow

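A java.lang.StackOverflowError occurs when a thread needs more stack than -Xss allows. Check the code for runaway recursion first. Increasing -Xss also helps, but a larger stack per thread leaves room for fewer threads and can lead to the "unable to create new native thread" OOM.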

Using jmap to locate heap leaks

Dump the heap:

jmap -dump:format=b,file=filename pid

Analyze the dump with MAT (Eclipse Memory Analyzer) using Leak Suspects, Top Consumers, Thread Overview, or Histogram.

GC issues and threads

Frequent GC can also increase CPU load. Use jstat to monitor how each generation changes. With G1, a full GC may be triggered by a concurrent phase failure, a promotion failure, or a large object allocation failure. Depending on the cause, adjust parameters such as -XX:ConcGCThreads, -XX:G1ReservePercent, -XX:InitiatingHeapOccupancyPercent, or -XX:G1HeapRegionSize, and avoid explicit System.gc() calls. Enable GC logs with

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps

These flags apply to JDK 8 and earlier; JDK 9 and later use unified logging (-Xlog:gc*) instead. Also consider the G1 collector (-XX:+UseG1GC).
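The GC logging flags differ between JDK generations: the -XX:+PrintGC... family belongs to JDK 8 and earlier, while JDK 9 and later use unified logging through -Xlog. A start script can pick the right set from the Java version string. This is a sketch under assumptions: java_major and gc_log_flags are hypothetical helpers, gc.log is a placeholder file name, and the decorator list after -Xlog is just one reasonable choice.

```shell
#!/bin/sh
# Extract the major version from a Java version string,
# e.g. "1.8.0_292" -> 8 and "17.0.2" -> 17.
java_major() {
    v=${1#1.}                 # drop the legacy "1." prefix used up to Java 8
    echo "${v%%[._-]*}"       # keep everything before the first ".", "_" or "-"
}

# Print GC logging flags suitable for the given major version.
gc_log_flags() {
    if [ "$1" -ge 9 ]; then
        # Unified logging (JDK 9+); gc.log is a placeholder path.
        echo "-Xlog:gc*:file=gc.log:time,uptime,level,tags"
    else
        # Legacy flags, as listed in the article.
        echo "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps"
    fi
}

# Demo with two example version strings.
for version in 1.8.0_292 17.0.2; do
    echo "Java $(java_major "$version"): $(gc_log_flags "$(java_major "$version")")"
done
```

The version string itself can be taken from the quoted value that java -version prints on standard error.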

Network

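Network problems are the most complex to troubleshoot. Timeouts are either connection timeouts or read/write timeouts. As a rule, keep the client-side timeout shorter than the server-side timeout. Use tools such as netstat and ss to inspect the SYN and accept queues, the listen backlog, and TCP connection states.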

TCP queue overflow

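TCP uses two queues while establishing connections: the SYN queue (half-open connections) and the accept queue (established connections waiting for accept()). When the accept queue is full, the server drops the client's final ACK, or replies with RST if tcp_abort_on_overflow is set to 1. Check for overflows with:

netstat -s | egrep "listen|LISTEN"
ss -lnt

For listening sockets, ss shows the current accept queue length in Recv-Q and its maximum in Send-Q. To enlarge the queues, adjust the application backlog (Tomcat acceptCount, Jetty acceptQueueSize) and the OS parameters (somaxconn, tcp_max_syn_backlog).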
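For listening sockets, ss -lnt reports the current accept queue length in Recv-Q and the configured maximum in Send-Q, so a full queue can be spotted by comparing the two columns. A minimal sketch follows, assuming that column layout; full_accept_queues is a hypothetical helper and the sample lines are invented.

```shell
#!/bin/sh
# Read `ss -lnt` output on stdin and print "address current/max" for every
# listening socket whose accept queue is at or above its limit.
full_accept_queues() {
    awk '$1 == "LISTEN" && $3 + 0 > 0 && $2 + 0 >= $3 + 0 { print $4, $2 "/" $3 }'
}

# Demo on invented sample data; prints "0.0.0.0:9090 129/128".
full_accept_queues <<'EOF'
State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      128    0.0.0.0:8080       0.0.0.0:*
LISTEN 129    128    0.0.0.0:9090       0.0.0.0:*
EOF
```

On a server this would be run as ss -lnt | full_accept_queues; any output means connections are likely being dropped at that listener.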

RST anomalies

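An RST packet means the connection was reset. It is often caused by a connection attempt to a closed port, by one side terminating the connection abruptly, or by packets arriving for a connection that no longer exists. Capture the traffic with tcpdump (replace en0 with your network interface) and analyze the capture in Wireshark:

tcpdump -i en0 tcp -w xxx.cap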

TIME_WAIT and CLOSE_WAIT

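TIME_WAIT exists so that the connection closes properly and delayed packets from an old connection are not mistaken for a new one. A large number of TIME_WAIT sockets can be reduced with net.ipv4.tcp_tw_reuse=1. The often-quoted net.ipv4.tcp_tw_recycle=1 breaks clients behind NAT and was removed in Linux 4.12, so avoid it. CLOSE_WAIT usually means the application did not close its socket after the peer closed the connection; use jstack to find the threads that are stuck.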
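A quick way to see whether TIME_WAIT or CLOSE_WAIT is piling up is to count connections per TCP state. The sketch below assumes netstat -n style output in which the state is the last column (so it should not be combined with -p, which appends a PID column); tcp_state_counts is a hypothetical helper and the sample connections are invented.

```shell
#!/bin/sh
# Read `netstat -n` output on stdin and print the number of TCP connections
# per state, largest count first.
tcp_state_counts() {
    awk '/^tcp/ { count[$NF]++ } END { for (s in count) print count[s], s }' \
        | sort -nr
}

# Demo on invented sample data.
tcp_state_counts <<'EOF'
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.0.1:8080           10.0.0.2:51234          TIME_WAIT
tcp        0      0 10.0.0.1:8080           10.0.0.3:51235          TIME_WAIT
tcp        0      0 10.0.0.1:8080           10.0.0.4:51236          TIME_WAIT
tcp        0      0 10.0.0.1:8080           10.0.0.5:51237          ESTABLISHED
tcp        1      0 10.0.0.1:8080           10.0.0.6:51238          CLOSE_WAIT
tcp        1      0 10.0.0.1:8080           10.0.0.7:51239          CLOSE_WAIT
EOF
```

On a live host, netstat -n | tcp_state_counts gives the same breakdown for real connections.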

Source: https://fredal.xin/java-error-check

