Comprehensive Guide to Java Runtime Error Checking and Troubleshooting (CPU, Memory, Disk, Network, GC)
This article provides a systematic, step‑by‑step guide for diagnosing and resolving Java runtime problems—including CPU spikes, memory leaks, disk I/O bottlenecks, network timeouts, and GC inefficiencies—by using native Linux tools and JVM utilities such as top, ps, jstack, jmap, jstat, iostat, vmstat, pidstat, netstat, ss, and tcpdump.
Online incidents usually involve CPU, disk, memory, and network problems, and most faults span multiple layers; therefore a systematic inspection of these four aspects is recommended, using tools like df, free, top, jstack, and jmap to pinpoint the root cause.
CPU
Start by checking CPU usage; abnormal CPU is often caused by business logic loops, frequent GC, or excessive context switches. The most common cause is a logic error that can be investigated with jstack.
Using jstack to analyze CPU issues
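Identify the process PID with ps (or with top if several processes are candidates), then list that process's threads by CPU usage with top -H -p pid. Convert the busiest thread's ID to hexadecimal with printf '%x\n' tid to obtain the nid, and locate the corresponding stack in the jstack output:
jstack pid | grep 'nid=0x<hex tid>' -C5 --color
A thread that is burning CPU normally shows as RUNNABLE. When reviewing the whole dump, also look at threads in WAITING, TIMED_WAITING, and BLOCKED states, since large numbers of them can point to lock or resource contention; you can get an overview of the state distribution with:
cat jstack.log | grep "java.lang.Thread.State" | sort -nr | uniq -c
Frequent GC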
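Use jstat -gc pid 1000 to sample GC activity every 1000 ms. The columns S0C/S1C, S0U/S1U, EC/EU, OC/OU, and MC/MU give the capacity (C) and usage (U) of the two survivor spaces, Eden, the old generation, and metaspace; YGC/YGCT and FGC/FGCT give the count and cumulative time of young and full collections, and GCT is the total GC time. If GC is too frequent, investigate the underlying cause as described in the GC Problems section.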
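The counters that jstat reports can also be read from inside the JVM. The following is a minimal sketch, not what jstat itself does: it assumes the standard GarbageCollectorMXBean API is an acceptable in-process counterpart, and the class name GcSnapshot and its methods are hypothetical names chosen for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: sums collection counts and times over all collectors,
// roughly comparable to YGC + FGC and GCT in the jstat output.
final class GcSnapshot {
    final long collections;
    final long gcMillis;

    GcSnapshot(long collections, long gcMillis) {
        this.collections = collections;
        this.gcMillis = gcMillis;
    }

    static GcSnapshot take() {
        long count = 0;
        long millis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            count += Math.max(0, gc.getCollectionCount());
            millis += Math.max(0, gc.getCollectionTime());
        }
        return new GcSnapshot(count, millis);
    }

    // Percentage of the elapsed wall-clock time that was spent in GC.
    static double overheadPercent(GcSnapshot before, GcSnapshot after, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        return 100.0 * (after.gcMillis - before.gcMillis) / elapsedMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        GcSnapshot before = take();
        long start = System.nanoTime();
        Thread.sleep(1000); // same interval as jstat -gc pid 1000
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        GcSnapshot after = take();
        System.out.printf("collections=%d overhead=%.2f%%%n",
                after.collections - before.collections,
                overheadPercent(before, after, elapsed));
    }
}
```

A sustained overhead of more than a few percent is a reasonable trigger for the deeper GC analysis later in this article, although the right threshold depends on the application.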
Context Switches
Inspect context-switch counts with vmstat. The cs column shows the number of context switches per second for the whole system; to monitor a specific process, use pidstat -w -p pid, where cswch/s and nvcswch/s are the voluntary and involuntary switches per second.
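Returning to the jstack workflow earlier in this CPU section, the two mechanical steps (converting a thread ID to its nid and tallying thread states) can be sketched in Java. This is an illustrative sketch only: JstackHelper is a hypothetical name, the sample dump lines are made up, and the code assumes the dump prints nid in hexadecimal as described above.

```java
import java.util.Map;
import java.util.TreeMap;

final class JstackHelper {
    private static final String STATE_MARKER = "java.lang.Thread.State:";

    // Formats a decimal thread ID (as shown by top -H) as the string to grep for.
    // Assumption: the dump prints nid in hexadecimal.
    static String toNid(long tid) {
        return "nid=0x" + Long.toHexString(tid);
    }

    // Counts threads per state, like the grep | sort | uniq -c pipeline.
    static Map<String, Integer> countStates(String dump) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : dump.split("\\R")) {
            int at = line.indexOf(STATE_MARKER);
            if (at < 0) {
                continue;
            }
            String rest = line.substring(at + STATE_MARKER.length()).trim();
            if (rest.isEmpty()) {
                continue;
            }
            // "TIMED_WAITING (sleeping)" is counted as "TIMED_WAITING".
            String state = rest.split("\\s+")[0];
            counts.merge(state, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines in the general shape of a thread dump.
        String sample = String.join("\n",
                "\"worker-1\" #12 prio=5 nid=0x42 runnable",
                "   java.lang.Thread.State: RUNNABLE",
                "\"worker-2\" #13 prio=5 nid=0x43 waiting on condition",
                "   java.lang.Thread.State: TIMED_WAITING (sleeping)",
                "\"worker-3\" #14 prio=5 nid=0x44 waiting on condition",
                "   java.lang.Thread.State: TIMED_WAITING (parking)");
        System.out.println(toNid(66));           // prints nid=0x42
        System.out.println(countStates(sample)); // prints {RUNNABLE=1, TIMED_WAITING=2}
    }
}
```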
Disk
Check disk space with df -hl. For performance issues, use iostat -d -k -x: %util shows the percentage of time each device was busy, while rrqm/s and wrqm/s show merged read and write requests per second; together they help identify a saturated device. To find the process responsible, use iotop. It reports thread IDs, which readlink -f /proc/*/task/tid/../.. maps back to a PID. Then examine that process's I/O counters with cat /proc/pid/io and its open files with lsof -p pid.
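To compare the counters in /proc/pid/io between two points in time, it helps to parse them into numbers. A minimal sketch follows, assuming the usual "name: value" layout of that file on Linux; ProcIo is a hypothetical helper name, not an existing tool.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

final class ProcIo {
    // Parses lines such as "read_bytes: 4096" into a name -> value map.
    static Map<String, Long> parse(String text) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (String line : text.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            try {
                counters.put(line.substring(0, colon).trim(),
                        Long.parseLong(line.substring(colon + 1).trim()));
            } catch (NumberFormatException ignored) {
                // Skip lines whose value is not a plain number.
            }
        }
        return counters;
    }

    public static void main(String[] args) throws IOException {
        // Pass a PID as the first argument, or omit it to inspect this JVM (Linux only).
        Path path = Paths.get("/proc", args.length > 0 ? args[0] : "self", "io");
        String text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        Map<String, Long> io = parse(text);
        System.out.println("read_bytes=" + io.get("read_bytes")
                + " write_bytes=" + io.get("write_bytes"));
    }
}
```

Sampling twice and subtracting the values gives the bytes actually read from and written to storage during the interval.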
Memory
Start with free to view overall memory usage. Most Java memory problems originate inside the JVM and surface as an OutOfMemoryError (heap, metaspace, or native threads) or a StackOverflowError.
OutOfMemoryError: unable to create new native thread
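This indicates that the JVM could not create another native thread, typically because native memory for thread stacks is exhausted or an OS limit on threads was reached; check the code for thread leaks (for example, unbounded thread creation instead of a pool), reduce -Xss, or raise the OS limits in /etc/security/limits.conf.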
OutOfMemoryError: Java heap space
The heap has reached the -Xmx limit; look for leaks with jstack and jmap first, and increase -Xmx only if the memory use turns out to be legitimate.
OutOfMemoryError: Metaspace
Metaspace has hit MaxMetaspaceSize; adjust it with -XX:MaxMetaspaceSize. On JVMs before Java 8 the equivalent failure is OutOfMemoryError: PermGen space, which is tuned with -XX:MaxPermSize.
StackOverflowError
The thread's stack has exceeded the size set by -Xss; reduce the recursion depth first, and increase -Xss cautiously, since the larger default applies to every thread.
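The relationship between stack size and recursion depth can be observed directly. The sketch below is only an illustration: it uses the stackSize argument of the Thread constructor as a per-thread stand-in for -Xss (the JVM treats that value as a hint and may round or ignore it), and StackDepthProbe is a hypothetical name.

```java
final class StackDepthProbe {
    // Recurses without bound; the caller catches the resulting StackOverflowError.
    private static void recurse(int[] depth) {
        depth[0]++;
        recurse(depth);
    }

    // Returns the number of frames reached on a thread created with the
    // requested stack size in bytes.
    static int maxDepth(long stackSizeBytes) {
        int[] depth = {0};
        Thread probe = new Thread(null, () -> {
            try {
                recurse(depth);
            } catch (StackOverflowError expected) {
                // depth[0] now holds the number of frames that fit.
            }
        }, "stack-probe", stackSizeBytes);
        probe.start();
        try {
            probe.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return depth[0];
    }

    public static void main(String[] args) {
        System.out.println("512 KiB stack: " + maxDepth(512 * 1024) + " frames");
        System.out.println("4 MiB stack:   " + maxDepth(4L * 1024 * 1024) + " frames");
    }
}
```

The exact numbers vary between runs and JVMs, so treat them as a rough indication of how much headroom a given stack size provides.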
Using jmap to locate heap leaks
Generate a heap dump with jmap -dump:format=b,file=heap.hprof pid and analyze it with Eclipse MAT (Memory Analyzer Tool), focusing on "Leak Suspects" or "Top Consumers".
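When attaching jmap to the process is not possible, a dump in the same format can be triggered from inside the application. This is a sketch under the assumption that the JVM is HotSpot-based and exposes com.sun.management.HotSpotDiagnosticMXBean; HeapDumper is a hypothetical wrapper name.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Path;
import java.nio.file.Paths;

final class HeapDumper {
    // Writes an hprof heap dump to the given file and returns the path.
    // liveOnly = true dumps only reachable objects.
    // The target file must not exist yet; dumpHeap refuses to overwrite.
    static Path dump(Path file, boolean liveOnly) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            bean.dumpHeap(file.toString(), liveOnly);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return file;
    }

    public static void main(String[] args) {
        Path target = Paths.get(args.length > 0 ? args[0] : "heap.hprof");
        System.out.println("Heap dump written to " + dump(target, true).toAbsolutePath());
    }
}
```

The resulting file can be opened in MAT in the same way as a dump produced by jmap.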
Native Memory (off‑heap) issues
Detect off-heap growth with pmap -x pid | sort -rn -k3 | head -30, which lists the 30 largest mappings by resident size. Use gdb to dump suspicious regions, then inspect them with hexdump -C. Enable Native Memory Tracking by starting the JVM with -XX:NativeMemoryTracking=summary (or =detail) and query it with jcmd pid VM.native_memory summary. If direct buffers are the cause, adjust -XX:MaxDirectMemorySize.
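Direct buffer usage, the part of off-heap memory that -XX:MaxDirectMemorySize limits, can also be observed from inside the JVM. The following sketch assumes the standard BufferPoolMXBean API and the pool name "direct" that HotSpot uses; DirectMemoryProbe is a hypothetical name.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

final class DirectMemoryProbe {
    // Returns the bytes currently held by direct buffers, or -1 if the JVM
    // exposes no pool named "direct".
    static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024 * 1024);
        long after = directBytesUsed();
        System.out.println("direct memory before=" + before + " after=" + after
                + " (allocated " + buffer.capacity() + " bytes)");
    }
}
```

Logging this value periodically shows whether growth seen in pmap comes from direct buffers or from another native allocation.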
GC Problems
GC issues often accompany memory problems and can also cause CPU spikes. On JDK 8, enable detailed GC logging with:
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
On JDK 9 and later these flags were replaced by unified logging, for example -Xlog:gc*. For G1, monitor young and full GC frequency and pause times. If young GC is too frequent, tune -Xmn and -XX:SurvivorRatio (with G1, note that fixing the young generation size overrides the pause-time goal). If full GC occurs often, consider increasing the heap size, adjusting -XX:G1ReservePercent, or reducing -XX:InitiatingHeapOccupancyPercent. Use jinfo -flag +HeapDumpBeforeFullGC pid and jinfo -flag +HeapDumpAfterFullGC pid to capture heap dumps before and after a full GC and compare them.
Network
Network issues are complex and include timeouts, TCP queue overflows, RST packets, TIME_WAIT, and CLOSE_WAIT states.
Timeouts
Distinguish the connection timeout (the maximum time the client waits to establish a connection) from the read/write timeout (the maximum wait during data transfer). Keep client timeouts shorter than server limits. Timeouts that are too long let slow requests hold connections open and can exhaust the available connections.
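The two timeouts map to two different calls on a plain java.net.Socket. The sketch below shows the distinction under simple assumptions: TimeoutDemo and its return strings are hypothetical, and the demo talks to a local listener that never answers, so the connection succeeds but the read times out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class TimeoutDemo {
    // Returns "connect-timeout", "read-timeout", "closed", or "ok".
    static String probe(String host, int port, int connectTimeoutMs, int readTimeoutMs) {
        try (Socket socket = new Socket()) {
            try {
                // Connection timeout: limits the TCP handshake only.
                socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            } catch (SocketTimeoutException e) {
                return "connect-timeout";
            }
            // Read timeout: limits each blocking read on the established connection.
            socket.setSoTimeout(readTimeoutMs);
            try {
                return socket.getInputStream().read() < 0 ? "closed" : "ok";
            } catch (SocketTimeoutException e) {
                return "read-timeout";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The listener never calls accept() or writes, but the kernel still
    // completes the handshake, so only the read can time out.
    static String demo() {
        try (ServerSocket silent = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            return probe(silent.getInetAddress().getHostAddress(), silent.getLocalPort(), 500, 200);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints read-timeout
    }
}
```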
TCP Queue Overflow
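Two queues exist: the SYN queue for half-open connections and the accept queue for fully established connections waiting for accept(). When the accept queue is full, the kernel drops the client's final ACK by default, or replies with RST if net.ipv4.tcp_abort_on_overflow is 1. Monitor overflows with netstat -s | egrep "listen|LISTEN" and the current queue state with ss -lnt, where for listening sockets Recv-Q is the current accept queue length and Send-Q is its maximum. Adjust backlog / acceptCount (Tomcat) or acceptQueueSize (Jetty) as needed, keeping in mind that the effective backlog is also capped by net.core.somaxconn.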
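Checking ss -lnt by eye is easy to get wrong on a host with many listeners, so the comparison can be automated. This is a minimal sketch that assumes the usual column order of ss -lnt (State, Recv-Q, Send-Q, Local Address:Port); AcceptQueueCheck and the sample output are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class AcceptQueueCheck {
    // Returns the local addresses of listening sockets whose accept queue
    // (Recv-Q) has reached its configured maximum (Send-Q).
    static List<String> saturatedListeners(String ssOutput) {
        List<String> saturated = new ArrayList<>();
        for (String line : ssOutput.split("\\R")) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 4 || !fields[0].equals("LISTEN")) {
                continue; // header line or non-listening socket
            }
            try {
                long recvQ = Long.parseLong(fields[1]);
                long sendQ = Long.parseLong(fields[2]);
                if (sendQ > 0 && recvQ >= sendQ) {
                    saturated.add(fields[3]);
                }
            } catch (NumberFormatException ignored) {
                // Skip lines that do not follow the expected layout.
            }
        }
        return saturated;
    }

    public static void main(String[] args) {
        // Made-up sample in the shape of ss -lnt output.
        String sample = String.join("\n",
                "State   Recv-Q  Send-Q  Local Address:Port  Peer Address:Port",
                "LISTEN  0       128     0.0.0.0:22          0.0.0.0:*",
                "LISTEN  129     128     *:8080              *:*");
        System.out.println(saturatedListeners(sample)); // prints [*:8080]
    }
}
```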
RST Abnormalities
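RST indicates a connection reset, often caused by connecting to a port with no listener, one side terminating abruptly, or the accept queue overflow described above. Capture traffic with tcpdump -i eth0 tcp -w capture.cap (substituting your interface for eth0) and look for the RST packets in Wireshark.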
TIME_WAIT and CLOSE_WAIT
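TIME_WAIT ensures proper connection teardown by letting delayed packets expire before the port pair is reused. Excessive TIME_WAIT on the side that initiates connections can be mitigated with net.ipv4.tcp_tw_reuse=1. Avoid net.ipv4.tcp_tw_recycle=1: it breaks clients behind NAT and was removed in Linux 4.12. CLOSE_WAIT usually means the application has not closed its socket after the peer closed the connection; investigate with jstack to find threads blocked on I/O or latch waits.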
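On the Java side, the usual way to avoid lingering CLOSE_WAIT sockets is to close every socket on all code paths, which try-with-resources makes hard to forget. The following is a small sketch over the loopback interface; CloseWaitDemo is a hypothetical name and the scenario is deliberately simplified.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class CloseWaitDemo {
    // The remote side closes first; the local side detects end of stream and
    // closes its own socket. Returns true once the local socket is closed.
    static boolean closeAfterPeerEof() {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
            Socket accepted = server.accept();
            try (Socket peer = accepted) {
                peer.setSoTimeout(2000); // avoid blocking forever in this demo
                client.close();          // remote close: peer is now in CLOSE_WAIT
                int eof = peer.getInputStream().read(); // -1 reports the remote close
                if (eof != -1) {
                    throw new IllegalStateException("expected end of stream");
                }
            } // closing peer here is what takes it out of CLOSE_WAIT
            return accepted.isClosed();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(closeAfterPeerEof()); // prints true
    }
}
```

If the inner try-with-resources were replaced by a code path that returns or throws without closing peer, the socket would stay in CLOSE_WAIT until the process exits or the object is garbage collected.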
Overall, combine system‑level commands (top, vmstat, iostat, netstat, ss, pidstat, pmap, strace) with JVM tools (jstack, jmap, jstat, jcmd, MAT) to locate the root cause of Java runtime failures.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.