
Comprehensive Guide to Java Runtime Error Checking and Troubleshooting (CPU, Memory, Disk, Network, GC)

This article provides a systematic, step‑by‑step guide for diagnosing and resolving Java runtime problems—including CPU spikes, memory leaks, disk I/O bottlenecks, network timeouts, and GC inefficiencies—by using native Linux tools and JVM utilities such as top, ps, jstack, jmap, jstat, iostat, vmstat, pidstat, netstat, ss, and tcpdump.

Top Architect

Production incidents usually involve CPU, disk, memory, or network problems, and most faults touch more than one of these layers, so it pays to inspect all four systematically. Tools such as df, free, top, jstack, and jmap help pinpoint the root cause.

CPU

Start by checking CPU usage; abnormal CPU is often caused by business logic loops, frequent GC, or excessive context switches. The most common cause is a logic error that can be investigated with jstack.

Using jstack to analyze CPU issues

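Find the process PID with ps (or with top if several Java processes are running). List the threads that use the most CPU with top -H -p pid. Convert the busiest thread ID to hexadecimal with printf '%x\n' tid to obtain the nid that jstack prints, then locate the matching stack in the jstack output:

jstack pid | grep 'nid' -C5 --color

Here 'nid' stands for the hexadecimal value just computed. A thread that is burning CPU normally appears as RUNNABLE. When reviewing the whole dump, also look at threads in WAITING, TIMED_WAITING, and BLOCKED states, because large numbers of them point to lock contention or stalled dependencies; you can get an overview with: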

grep "java.lang.Thread.State" jstack.log | sort | uniq -c | sort -nr
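
The same lookup can be scripted once a dump has been saved to a file. The following is a minimal sketch, not the original author's tooling: the class name JstackSummary and the sample dump lines are made up for illustration, and it assumes the hexadecimal nid=0x... format described above (check your own dump, since the exact header format can vary between JDK releases).

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JstackSummary {
    private static final Pattern STATE =
            Pattern.compile("java\\.lang\\.Thread\\.State: (\\w+)");

    /** Formats a decimal thread ID (as shown by top -H) as a hexadecimal nid. */
    static String toNid(long tid) {
        return "0x" + Long.toHexString(tid);
    }

    /** Counts how many threads in the dump text are in each state. */
    static Map<String, Integer> countStates(String dump) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = STATE.matcher(dump);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the header line of the thread with the given ID, or null if absent. */
    static String findThreadHeader(String dump, long tid) {
        // Trailing space avoids matching a longer nid that shares the prefix.
        String needle = "nid=" + toNid(tid) + " ";
        for (String line : dump.split("\\R")) {
            if (line.contains(needle)) {
                return line;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical two-thread dump, shortened for the example.
        String dump = String.join("\n",
                "\"worker-1\" #12 prio=5 os_prio=0 tid=0x00007f01 nid=0x3039 runnable [0x00007f02]",
                "   java.lang.Thread.State: RUNNABLE",
                "\"worker-2\" #13 prio=5 os_prio=0 tid=0x00007f03 nid=0x303a waiting on condition [0x00007f04]",
                "   java.lang.Thread.State: TIMED_WAITING (sleeping)");
        System.out.println(toNid(12345));
        System.out.println(countStates(dump));
        System.out.println(findThreadHeader(dump, 12345));
    }
}
```

In practice the dump string would come from the saved jstack output rather than a literal.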

Frequent GC

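Use jstat -gc pid 1000 to sample GC activity every 1000 ms. S0C/S1C and S0U/S1U are the capacity and usage of the two survivor spaces, EC/EU of Eden, OC/OU of the old generation, and MC/MU of Metaspace (all in KB). YGC/YGCT and FGC/FGCT are the counts and cumulative times (in seconds) of young and full collections, and GCT is the total GC time. If the counts climb quickly between samples, GC is too frequent; continue with the GC Problems section.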
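
Roughly the same counters are available from inside the JVM through the standard java.lang.management API, which can be handy when jstat cannot be attached. This is a sketch under that assumption; the class name GcCounters and the five-sample loop are illustrative choices, not part of the original article.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcCounters {
    /** Total number of collections across all collectors (undefined values count as 0). */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount());
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds across all collectors. */
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime());
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long prevCount = totalCollections();
        long prevMillis = totalCollectionMillis();
        for (int i = 0; i < 5; i++) {
            Thread.sleep(1000); // same 1000 ms interval as the jstat example
            long count = totalCollections();
            long millis = totalCollectionMillis();
            System.out.printf("collections/s=%d gcMillis/s=%d%n",
                    count - prevCount, millis - prevMillis);
            prevCount = count;
            prevMillis = millis;
        }
    }
}
```

Unlike jstat, this reports on the JVM it runs in, so it would need to be embedded in the application or exposed over JMX.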

Context Switches

Inspect context‑switch counts with vmstat; the cs column shows system‑wide context switches per second. To monitor a specific process, run pidstat -w -p pid, where cswch/s and nvcswch/s are the voluntary and involuntary switches per second.
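
On Linux the kernel also exposes the raw counters behind pidstat in /proc/<pid>/status. The sketch below reads them; the class name ContextSwitches is hypothetical, and note two assumptions: the values are cumulative since the task started (not per-second rates), and per-thread values are expected under /proc/<pid>/task/<tid>/status.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

final class ContextSwitches {
    /** Extracts a numeric field such as "voluntary_ctxt_switches"; returns -1 if missing. */
    static long field(List<String> statusLines, String name) {
        String prefix = name + ":";
        for (String line : statusLines) {
            if (line.startsWith(prefix)) {
                return Long.parseLong(line.substring(prefix.length()).trim());
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the current process; pass a PID to inspect another one.
        String pid = args.length > 0 ? args[0] : "self";
        List<String> lines = Files.readAllLines(Paths.get("/proc", pid, "status"));
        System.out.println("voluntary:   " + field(lines, "voluntary_ctxt_switches"));
        System.out.println("involuntary: " + field(lines, "nonvoluntary_ctxt_switches"));
    }
}
```

Sampling the two values twice and dividing the difference by the interval gives a rate comparable to the pidstat columns.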

Disk

Check disk space with df -hl. For performance problems, run iostat -d -k -x and watch %util (how busy the device is) together with rrqm/s and wrqm/s (merged read and write requests per second) to identify saturated devices. To find out which process is doing the I/O, use iotop, which reports thread IDs; map a thread ID to its PID with readlink -f /proc/*/task/tid/../.., then inspect that process with cat /proc/pid/io (cumulative read and write counters) or lsof -p pid (open files).
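
An application can also check free space itself before it hits a full disk. The following sketch uses the standard java.nio.file.FileStore API; the class name DiskSpace is illustrative, and the percentage may differ slightly from the Use% column of df, which accounts for reserved blocks differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class DiskSpace {
    /** Percentage (0 to 100) of the file store holding the given path that is allocated. */
    static double usedPercent(Path path) {
        try {
            FileStore store = Files.getFileStore(path);
            long total = store.getTotalSpace();
            if (total <= 0) {
                return 0.0; // pseudo file systems may report no capacity
            }
            long used = total - store.getUnallocatedSpace();
            return 100.0 * used / total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path path = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.printf("%s is %.1f%% full%n", path.toAbsolutePath(), usedPercent(path));
    }
}
```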

Memory

Start with free to view overall memory. Most memory problems are heap‑related and surface as an OutOfMemoryError (OOM); thread stacks can additionally fail with a StackOverflowError. The common cases follow.

OutOfMemoryError: unable to create new native thread

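This indicates that the JVM could not get native memory or OS resources for another thread stack. Check the code for thread leaks (for example, thread pools that are created but never shut down), reduce -Xss, or raise the OS limits in /etc/security/limits.conf.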

OutOfMemoryError: Java heap space

Heap usage has reached the -Xmx limit. Look for leaks with jstack and jmap first, and increase -Xmx only if the workload genuinely needs more heap.
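
To see how close a running JVM is to that limit, heap usage can be read through the standard MemoryMXBean. This is a small sketch; the class name HeapUsage is an assumption for illustration, and it falls back to the committed size when the maximum is reported as undefined.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

final class HeapUsage {
    /** Fraction of the maximum heap (roughly -Xmx) that is currently in use. */
    static double usedFractionOfMax() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the maximum is undefined
        long limit = max > 0 ? max : heap.getCommitted();
        return (double) heap.getUsed() / limit;
    }

    public static void main(String[] args) {
        System.out.printf("heap in use: %.1f%% of max%n", 100.0 * usedFractionOfMax());
    }
}
```

A value that stays high right after full collections is a stronger hint of a leak than a single high reading.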

OutOfMemoryError: Metaspace

Metaspace has hit MaxMetaspaceSize; raise it with -XX:MaxMetaspaceSize. Before Java 8 the equivalent area was PermGen, reported as OutOfMemoryError: PermGen space and sized with -XX:MaxPermSize.
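
Metaspace usage can be watched from inside the JVM via the memory pool MXBeans. The sketch below assumes a HotSpot JVM, where the pool is named "Metaspace"; on other JVMs the pool may be named differently or be absent, in which case the method returns -1. The class name MetaspaceUsage is illustrative.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

final class MetaspaceUsage {
    /** Used bytes of the pool named "Metaspace", or -1 if this JVM has no such pool. */
    static long usedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                return pool.getUsage().getUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("Metaspace used bytes: " + usedBytes());
    }
}
```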

StackOverflowError

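The thread's stack depth has exceeded the size set by -Xss; reduce the recursion depth (or rewrite the recursion as a loop), or increase -Xss cautiously, since the setting applies to every thread.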
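
A small experiment makes the limit concrete. The sketch below recurses until the stack overflows and reports the depth reached, next to an iterative method that does comparable work without growing the stack. The class and method names are made up for the example, and the depth printed depends on the JVM, the -Xss value, and the frame size.

```java
final class StackDepthProbe {
    private static int depth;

    private static void recurse() {
        depth++;
        recurse(); // no base case on purpose: runs until the stack is exhausted
    }

    /** Recurses until StackOverflowError is thrown and returns the depth reached. */
    static int maxDepth() {
        depth = 0;
        try {
            recurse();
        } catch (StackOverflowError expected) {
            // The stack has unwound by the time we get here, so it is safe to continue.
        }
        return depth;
    }

    /** Iterative sum of 1..n: uses constant stack space regardless of n. */
    static long sumIterative(int n) {
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println("recursion depth before overflow: " + maxDepth());
        System.out.println("iterative sum of 1..1000000: " + sumIterative(1_000_000));
    }
}
```

Running it with different -Xss values shows how the reachable depth scales with the stack size.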

Using jmap to locate heap leaks

Generate a heap dump with jmap -dump:format=b,file=heap.hprof pid and analyze it with Eclipse MAT (Memory Analyzer Tool), focusing on "Leak Suspects" or "Top Consumers".
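
When jmap cannot be run against the process, a dump can also be triggered from inside the JVM. This sketch assumes a HotSpot-based JVM, where com.sun.management.HotSpotDiagnosticMXBean is available; the class name HeapDumper and the temp-directory location are illustrative choices. The target file must not already exist, and recent JDKs expect the .hprof extension.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

final class HeapDumper {
    /** Writes heap.hprof into a fresh temp directory and returns its path. */
    static Path dumpToTempDir(boolean liveObjectsOnly) {
        try {
            Path file = Files.createTempDirectory("heapdump").resolve("heap.hprof");
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // liveObjectsOnly=true dumps only reachable objects.
            bean.dumpHeap(file.toString(), liveObjectsOnly);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("heap dump written to " + dumpToTempDir(true));
    }
}
```

The resulting file can be opened in MAT in the same way as one produced by jmap.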

Native Memory (off‑heap) issues

Detect off‑heap growth with pmap -x pid | sort -rn -k3 | head -30. Use gdb to dump suspicious regions, then inspect with hexdump -C. Enable Native Memory Tracking with -XX:NativeMemoryTracking=summary or detail and query via jcmd pid VM.native_memory. Adjust -XX:MaxDirectMemorySize if needed.
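
For the part of off‑heap memory that comes from NIO direct buffers, the JVM keeps its own accounting, which can be read through BufferPoolMXBean. A minimal sketch, assuming a HotSpot JVM where the pool is named "direct" (the class name DirectMemoryUsage is hypothetical); memory allocated by native libraries is not covered by this counter.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

final class DirectMemoryUsage {
    /** Bytes currently held by direct ByteBuffers, or -1 if no "direct" pool exists. */
    static long directBytesUsed() {
        for (BufferPoolMXBean pool
                : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024 * 1024);
        long after = directBytesUsed();
        System.out.println("direct bytes before: " + before);
        System.out.println("direct bytes after:  " + after);
        // Using the buffer here keeps it reachable while the counters are read.
        System.out.println("buffer capacity:     " + buffer.capacity());
    }
}
```

If this counter is flat while pmap shows the process growing, the growth is more likely in native code than in direct buffers.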

GC Problems

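GC issues often accompany memory problems and can also cause CPU spikes. On JDK 8, enable detailed GC logging with

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps

On JDK 9 and later these flags were replaced by unified logging (-Xlog:gc*).

For G1, monitor young and full GC frequency and pause times. If young GC is too frequent, tune -Xmn and -XX:SurvivorRatio, keeping in mind that fixing the young generation size with -Xmn stops G1 from resizing it to meet its pause‑time goal. If full GC occurs often, consider increasing the heap size, adjusting -XX:G1ReservePercent, or reducing -XX:InitiatingHeapOccupancyPercent so that concurrent marking starts earlier. To see what a full GC actually reclaims, run jinfo -flag +HeapDumpBeforeFullGC pid and jinfo -flag +HeapDumpAfterFullGC pid, then compare the two dumps.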

Network

Network issues are complex and include timeouts, TCP queue overflows, RST packets, TIME_WAIT, and CLOSE_WAIT states.

Timeouts

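Distinguish the connection timeout (the maximum time a client waits to establish a socket) from the read/write timeout (the maximum time to wait for data on an established socket). Keep client timeouts shorter than the server's, so that the client gives up before the server does. Timeouts that are missing or too long let stalled requests hold on to connections and can exhaust the connection pool.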
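
The two timeouts map to two separate calls on java.net.Socket. The sketch below sets both and demonstrates the read timeout against a local listener that accepts the TCP handshake but never sends data; the class name TimeoutDemo, the method names, and the 1000 ms and 200 ms values are illustrative assumptions rather than recommendations.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class TimeoutDemo {
    /** Connects, waits for one byte, and returns "data", "eof", or "read-timeout". */
    static String probe(InetSocketAddress target, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(target, connectTimeoutMs); // connection timeout
            socket.setSoTimeout(readTimeoutMs);        // read timeout
            try {
                int b = socket.getInputStream().read();
                return b < 0 ? "eof" : "data";
            } catch (SocketTimeoutException e) {
                return "read-timeout";
            }
        }
    }

    /** Probes a loopback listener that completes the handshake but stays silent. */
    static String probeSilentLocalServer() {
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            InetSocketAddress target =
                    new InetSocketAddress(server.getInetAddress(), server.getLocalPort());
            return probe(target, 1000, 200);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(probeSilentLocalServer());
    }
}
```

Without setSoTimeout the read would block indefinitely, which is how a single stalled peer can pin a pooled connection.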

TCP Queue Overflow

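Two queues exist: the SYN queue (half‑open connections) and the accept queue (fully established connections waiting for accept()). When the accept queue is full, the kernel by default drops the client's final ACK, or sends RST if net.ipv4.tcp_abort_on_overflow is set to 1. Monitor overflows with netstat -s | egrep "listen|LISTEN" and the queue depth with ss -lnt, where for listening sockets Recv-Q is the current accept‑queue length and Send-Q is its maximum. Adjust backlog or acceptCount (Tomcat) or acceptQueueSize (Jetty) as needed; the effective size is also capped by net.core.somaxconn.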

RST Abnormalities

RST indicates a reset, often caused by port‑not‑listening, abrupt termination, or kernel‑level queue overflow. Capture RST traffic with tcpdump -i eth0 tcp -w capture.cap and analyze in Wireshark.

TIME_WAIT and CLOSE_WAIT

TIME_WAIT ensures proper connection teardown; excessive TIME_WAIT on the side that opens connections can be mitigated by enabling net.ipv4.tcp_tw_reuse=1. Avoid net.ipv4.tcp_tw_recycle=1: it breaks clients behind NAT and was removed in Linux 4.12. CLOSE_WAIT usually means the application has not closed a socket after the peer closed its end; investigate with jstack to find threads blocked on I/O or latch waits.
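
On the application side, the usual guard against CLOSE_WAIT is to close the socket on every code path, which try-with-resources does automatically. The following sketch shows a handler that reads until the peer closes and then closes its own end; the class name CloseWaitSafeHandler and the loopback demo are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

final class CloseWaitSafeHandler {
    /** Reads until the peer closes, then closes our end too; returns bytes read. */
    static int drainAndClose(Socket accepted) throws IOException {
        // try-with-resources closes the socket even if read() throws.
        try (Socket socket = accepted; InputStream in = socket.getInputStream()) {
            int count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        }
    }

    /** Loopback demo: a client sends 3 bytes and closes; the handler must end up closed. */
    static boolean demo() {
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            try (Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
                client.getOutputStream().write(new byte[] {1, 2, 3});
            } // client closed here; the server side now sees end of stream
            Socket accepted = server.accept();
            int read = drainAndClose(accepted);
            return read == 3 && accepted.isClosed();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("handler closed its socket: " + demo());
    }
}
```

A handler that returned without closing would leave the accepted socket in CLOSE_WAIT until the object was garbage collected or the process exited.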

Overall, combine system‑level commands (top, vmstat, iostat, netstat, ss, pidstat, pmap, strace) with JVM tools (jstack, jmap, jstat, jcmd, MAT) to locate the root cause of Java runtime failures.
