Master Java Production Troubleshooting: CPU, Memory, Disk & Network Fixes
This guide walks you through systematic troubleshooting of Java production issues—covering CPU, memory, disk, GC, and network problems—by using tools like jstack, jmap, top, iostat, vmstat, and netstat to pinpoint root causes and apply targeted fixes.
Production failures usually come down to CPU, disk, memory, or network problems, and most incidents involve more than one of them. When diagnosing, check these four areas in turn. Tools such as jstack, jmap, df, free, and top are not tied to a single area, so expect to combine them and analyze each case on its own terms.
CPU
Start with CPU. CPU anomalies are usually the easiest to locate. Common causes are business logic problems (such as infinite loops), frequent GC, and excessive context switching. The most common cause is business or framework logic, which jstack can expose through the stack of the offending thread.
Using jstack to analyze CPU problems
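Find the process ID with ps or top, then list the threads that use the most CPU:
top -H -p pid
Convert the ID of the busy thread (tid, shown in the PID column of top -H) to hexadecimal to get the nid that jstack prints:
printf '%x\n' tid
Then search the jstack output for that nid:
jstack pid | grep 'nid' -C5 --color
More often you will analyze the whole dump. A thread that burns CPU is normally RUNNABLE, but WAITING and TIMED_WAITING threads deserve attention too, and BLOCKED threads even more so; an unusually large number of threads in any of these states usually points to a problem. You can get an overview of thread states in a saved dump with:
cat jstack.log | grep "java.lang.Thread.State" | sort -nr | uniq -c
Frequent GC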
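Use jstat to check whether GC is too frequent. The following command samples GC statistics every 1000 ms:
jstat -gc pid 1000
S0C/S1C, EC, OC, and MC are the capacities of the two survivor spaces, Eden, the old generation, and Metaspace. YGC/YGCT and FGC/FGCT are the count and accumulated time of young and full collections, and GCT is the total GC time. If the counts climb quickly, investigate GC further.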
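When jstat is not available on the host, roughly the same counters can be read from inside the JVM. The following is a minimal sketch, assuming it runs in (or is adapted into) the application being diagnosed; the class name GcSnapshot and the five-sample loop are illustrative choices, not part of the original article.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class GcSnapshot {
    /** Total number of collections across all collectors (undefined values of -1 are ignored). */
    static long totalCount(List<GarbageCollectorMXBean> beans) {
        long total = 0;
        for (GarbageCollectorMXBean bean : beans) {
            total += Math.max(0, bean.getCollectionCount());
        }
        return total;
    }

    /** Accumulated collection time in milliseconds across all collectors. */
    static long totalMillis(List<GarbageCollectorMXBean> beans) {
        long total = 0;
        for (GarbageCollectorMXBean bean : beans) {
            total += Math.max(0, bean.getCollectionTime());
        }
        return total;
    }

    /** Share of wall-clock time spent in GC during one sampling interval. */
    static double gcOverheadPercent(long gcMillis, long wallMillis) {
        return wallMillis <= 0 ? 0.0 : 100.0 * gcMillis / wallMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        List<GarbageCollectorMXBean> beans = ManagementFactory.getGarbageCollectorMXBeans();
        long prevCount = totalCount(beans);
        long prevMillis = totalMillis(beans);
        // Sample once per second, like "jstat -gc pid 1000".
        for (int i = 0; i < 5; i++) {
            Thread.sleep(1000);
            long count = totalCount(beans);
            long millis = totalMillis(beans);
            System.out.printf("collections=+%d gcTime=+%dms overhead=%.1f%%%n",
                    count - prevCount, millis - prevMillis,
                    gcOverheadPercent(millis - prevMillis, 1000));
            prevCount = count;
            prevMillis = millis;
        }
    }
}
```

A steadily high overhead percentage is the in-process equivalent of FGC and GCT climbing quickly in jstat output.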
Context switches
Check context switches with vmstat and look at the cs column (context switches per second). To monitor a specific process, use:
pidstat -w -p pid
The cswch/s and nvcswch/s columns show voluntary and involuntary context switches per second.
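To round off the CPU section, the two mechanical steps of the jstack workflow described earlier (converting a decimal thread ID to the hexadecimal nid, and tallying thread states in a saved dump) can be sketched in Java. This is only an illustration: the class name JstackTriage and its command-line arguments are hypothetical, and the pattern assumes the usual "java.lang.Thread.State: STATE" lines that jstack prints.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JstackTriage {
    private static final Pattern STATE =
            Pattern.compile("java\\.lang\\.Thread\\.State: ([A-Z_]+)");

    /** Converts a decimal thread ID (from top -H) to the nid format used by jstack. */
    static String toNid(long tid) {
        return "0x" + Long.toHexString(tid);
    }

    /** Counts how many threads in a jstack dump are in each state. */
    static Map<String, Integer> countStates(String dump) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher matcher = STATE.matcher(dump);
        while (matcher.find()) {
            counts.merge(matcher.group(1), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a saved jstack dump, args[1]: decimal thread ID from top -H
        String dump = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.println("search the dump for nid=" + toNid(Long.parseLong(args[1])));
        countStates(dump).forEach((state, n) -> System.out.println(n + " " + state));
    }
}
```

For example, a thread shown as 66 in top -H corresponds to nid=0x42 in the dump.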
Disk
Disk problems are as basic as CPU problems. Check free space with:
df -hl
For performance, use iostat:
iostat -d -k -x
Key columns: %util shows how busy each device is, while rrqm/s and wrqm/s show read and write requests merged per second; together they indicate which disk is under load. Identify the thread performing I/O with iotop, then map its tid to a pid with:
readlink -f /proc/*/task/tid/../..
Finally, inspect that process's I/O counters with cat /proc/pid/io, or its open files with lsof -p pid.
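The free-space check can also be done from inside a Java process, which is handy for a health endpoint or a startup guard. The sketch below is an assumption-laden illustration rather than a replacement for df: the class name DiskSpaceCheck is hypothetical, and it only reports the filesystem roots that the JVM can see.

```java
import java.io.File;

public class DiskSpaceCheck {
    /** Percentage of a volume that is in use, given its total and usable bytes. */
    static double usedPercent(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            return 0.0;
        }
        return 100.0 * (totalBytes - usableBytes) / totalBytes;
    }

    public static void main(String[] args) {
        // On Linux this is typically just "/"; on Windows it lists each drive.
        for (File root : File.listRoots()) {
            long total = root.getTotalSpace();
            long usable = root.getUsableSpace();
            System.out.printf("%s used=%.1f%% usable=%d MB%n",
                    root.getPath(), usedPercent(total, usable), usable / (1024 * 1024));
        }
    }
}
```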
Memory
Memory problems include OOM, GC issues, and off‑heap memory. Start with free to view usage.
Heap memory
OOM can appear as:
Exception in thread "main" java.lang.OutOfMemoryError: unable to create new native thread
Usually caused by thread-pool misuse, so check the code first. If the code is fine, reduce the per-thread stack size with -Xss, or raise the OS limits (such as nproc) in /etc/security/limits.conf.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
The heap has reached the -Xmx limit. Look for leaks first with jstack and jmap; if the usage is legitimate, increase -Xmx.
Caused by: java.lang.OutOfMemoryError: Metaspace
Metaspace has reached the -XX:MaxMetaspaceSize limit. Raise -XX:MaxMetaspaceSize (on JDK 7 and earlier, the corresponding PermGen setting is -XX:MaxPermSize).
StackOverflow
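A java.lang.StackOverflowError occurs when a thread needs more stack than -Xss allows. Check the code for runaway recursion first. Increasing -Xss also works, but a larger stack for every thread can lead back to the native-thread OOM described above.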
Using jmap to locate heap leaks
Dump the heap:
jmap -dump:format=b,file=filename pid
Analyze the dump with MAT (Eclipse Memory Analyzer). Leak Suspects is the usual starting point; Top Consumers, Thread Overview, and Histogram help narrow down which objects and threads hold the memory.
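A heap dump can also be triggered from inside the application, for example from an admin endpoint, when running jmap against the process is not practical. The sketch below assumes a HotSpot-based JVM, because it relies on com.sun.management.HotSpotDiagnosticMXBean; the class name HeapDumper is hypothetical. Note that it dumps the JVM it runs in, not another pid, that the target file must not already exist, and that recent JDKs require the file name to end in .hprof.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    /**
     * Writes a heap dump of the current JVM to outputFile.
     * liveOnly=true dumps only reachable objects, which keeps the file smaller.
     */
    static void dump(String outputFile, boolean liveOnly) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            bean.dumpHeap(outputFile, liveOnly);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String file = args.length > 0
                ? args[0]
                : "heap-" + System.currentTimeMillis() + ".hprof";
        dump(file, true);
        System.out.println("Heap dump written to " + file);
    }
}
```

The resulting file opens in MAT in the same way as a jmap dump.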
GC issues and threads
Frequent GC also increases CPU load. Use jstat to watch how each generation changes. Full GC may be triggered by a concurrent phase failure, a promotion failure, or a large object allocation failure. With the G1 collector (-XX:+UseG1GC), which is worth considering if you are not already on it, the relevant parameters are -XX:ConcGCThreads, -XX:G1ReservePercent, -XX:InitiatingHeapOccupancyPercent, and -XX:G1HeapRegionSize. Also avoid explicit System.gc() calls. Enable GC logs with:
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
These flags apply to JDK 8 and earlier; JDK 9 and later use unified logging (-Xlog:gc*) instead.
Network
Network problems are the most complex. Timeouts fall into connection timeouts and read/write timeouts. Keep the client timeout shorter than the server timeout. Use netstat and ss to inspect the SYN and accept queues, the listen backlog, and TCP connection states.
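The difference between a connection timeout and a read timeout is easiest to see in code. The following is a minimal sketch using plain java.net sockets; the class name TimeoutProbe, the result strings, and the selfTest helper (which connects to a local server that never responds) are illustrative assumptions, not part of the original article.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class TimeoutProbe {
    /** Connects, then waits for one byte; reports which kind of timeout fired, if any. */
    static String probe(String host, int port, int connectTimeoutMs, int readTimeoutMs) {
        try (Socket socket = new Socket()) {
            try {
                // Connection timeout: how long to wait for the TCP handshake.
                socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            } catch (SocketTimeoutException e) {
                return "CONNECT_TIMEOUT";
            }
            // Read timeout: how long a single read may block waiting for data.
            socket.setSoTimeout(readTimeoutMs);
            try {
                int first = socket.getInputStream().read();
                return first >= 0 ? "OK" : "CLOSED_BY_PEER";
            } catch (SocketTimeoutException e) {
                return "READ_TIMEOUT";
            }
        } catch (IOException e) {
            return "IO_ERROR: " + e.getMessage();
        }
    }

    /** Demonstrates a read timeout against a local listener that never sends data. */
    static String selfTest() {
        try (ServerSocket silent = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            return probe(silent.getInetAddress().getHostAddress(),
                    silent.getLocalPort(), 1000, 200);
        } catch (IOException e) {
            return "IO_ERROR: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(selfTest());
    }
}
```

The try-with-resources block closes the socket on every path, which also avoids leaving connections stuck in CLOSE_WAIT.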
TCP queue overflow
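TCP keeps two queues for a listening socket: the SYN queue (half-open connections) and the accept queue (established connections waiting for accept()). When the accept queue is full, the server drops the final ACK or, if tcp_abort_on_overflow is 1, answers with an RST. Check for overflows with:
netstat -s | egrep "listen|LISTEN"
Inspect queue sizes with ss -lnt: for a listening socket, Send-Q is the maximum accept queue length and Recv-Q is its current length. The accept queue size is min(backlog, somaxconn), so adjust both the application backlog (Tomcat acceptCount, Jetty acceptQueueSize) and the OS parameters (net.core.somaxconn, net.ipv4.tcp_max_syn_backlog).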
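To make the min(backlog, somaxconn) rule concrete, the sketch below opens a listener with an explicit backlog and compares it with the kernel limit. It assumes Linux, where net.core.somaxconn is readable at /proc/sys/net/core/somaxconn; the class name AcceptQueueSketch and the requested backlog of 1024 are hypothetical.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.nio.file.Files;
import java.nio.file.Paths;

public class AcceptQueueSketch {
    /** The kernel caps the backlog passed to listen() at net.core.somaxconn. */
    static int effectiveAcceptQueue(int requestedBacklog, int somaxconn) {
        return Math.min(requestedBacklog, somaxconn);
    }

    /** Reads net.core.somaxconn on Linux; returns -1 where the file is unavailable. */
    static int readSomaxconn() {
        try {
            byte[] raw = Files.readAllBytes(Paths.get("/proc/sys/net/core/somaxconn"));
            return Integer.parseInt(new String(raw).trim());
        } catch (IOException | NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) throws IOException {
        int requested = 1024; // hypothetical value, comparable to Tomcat's acceptCount
        try (ServerSocket server =
                     new ServerSocket(0, requested, InetAddress.getLoopbackAddress())) {
            int somaxconn = readSomaxconn();
            System.out.println("listening on port " + server.getLocalPort());
            if (somaxconn > 0) {
                System.out.println("effective accept queue: "
                        + effectiveAcceptQueue(requested, somaxconn));
            } else {
                System.out.println("somaxconn not readable on this platform");
            }
        }
    }
}
```

If the effective value is lower than the requested backlog, raising the application setting alone has no effect until somaxconn is raised as well.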
RST anomalies
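An RST packet resets a connection. Common causes are connecting to a port that is not listening, one side terminating a connection abruptly, or packets arriving on a connection that is already closed. Capture traffic with:
tcpdump -i en0 tcp -w xxx.cap
Replace en0 with your network interface, then analyze the capture in Wireshark.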
TIME_WAIT and CLOSE_WAIT
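TIME_WAIT exists so that delayed packets are handled and the connection closes cleanly. If large numbers of TIME_WAIT sockets are a problem, net.ipv4.tcp_tw_reuse=1 lets the kernel reuse them for new outgoing connections. Avoid net.ipv4.tcp_tw_recycle=1: it breaks clients behind NAT and was removed in Linux 4.12. CLOSE_WAIT usually means the application did not close its socket after the peer closed the connection; use jstack to find the threads that hold such connections.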
Source: https://fredal.xin/java-error-check
ITFLY8 Architecture Home
