
Comprehensive Guide to Diagnosing Online Failures: CPU, Memory, Disk, GC, and Network Issues

This article provides a step‑by‑step methodology for troubleshooting online service failures by systematically examining CPU, disk, memory (including heap, OOM, stack overflow, and off‑heap), garbage collection, and network problems using tools such as ps, top, jstack, jmap, jstat, iostat, vmstat, strace, and tcpdump.

Code Ape Tech Column

Online failures often involve multiple layers such as CPU, disk, memory, and network; therefore, when diagnosing issues, it is advisable to check each aspect in order.

Common diagnostic tools like jstack and jmap are not limited to a single problem domain. Typically, start with df, free, and top, then proceed to jstack and jmap for deeper analysis.

CPU

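CPU problems are usually the easiest to locate. Common causes are busy loops in business logic, frequent GC, and excessive context switching. Use ps to find the process ID, then run top -H -p <pid> to list the threads that consume the most CPU. Convert the thread ID to hexadecimal with printf '%x\n' <tid>, then find that thread in the dump with jstack <pid> | grep 'nid=0x<hex-tid>' -C5. A thread that is burning CPU normally appears as RUNNABLE, so read its stack first. When reviewing the whole dump, also look at how many threads are WAITING, TIMED_WAITING, or BLOCKED; a large number in one state often points to lock contention or a stalled dependency. If the load comes from context switching rather than a single hot thread, vmstat 1 shows the switch rate in its cs column.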
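The lookup described above can be sketched as a few small shell helpers. The function names (tid_to_nid, show_thread_stack, count_thread_states) are hypothetical, and the sketch assumes the JDK prints nid in hexadecimal in its thread dumps.

```shell
# Convert a decimal thread ID (from top -H) into the 0x-prefixed hex
# form that jstack prints as nid=.
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# Print the jstack entry for one thread plus the lines that follow it.
# Usage: show_thread_stack <pid> <tid>
show_thread_stack() {
  local pid="$1" tid="$2" nid
  nid="$(tid_to_nid "$tid")"
  # The trailing space stops 0x3039 from also matching 0x30391.
  jstack "$pid" | grep -A 20 "nid=${nid} "
}

# Count threads per state in a thread dump read from stdin.
# Usage: jstack <pid> | count_thread_states
count_thread_states() {
  awk '/java.lang.Thread.State:/ { print $2 }' | sort | uniq -c | sort -rn
}
```

For example, tid_to_nid 12345 prints 0x3039, which is the value to search for in the dump.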

Disk

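Check disk space with df -hl. For performance problems, run iostat -d -k -x: the %util column shows how busy each device is, and the per-device read and write columns show throughput. To find the process responsible, use iotop, which reports thread IDs; convert a thread ID to its process ID with readlink -f /proc/*/task/<tid>/../.., which prints /proc/<pid>. Then inspect that process's I/O counters with cat /proc/<pid>/io and list the files it has open with lsof -p <pid>.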
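As a rough sketch, the two checks above can be turned into filters that read command output from stdin. The helper names filesystems_over and io_bytes are hypothetical, and the first one assumes the POSIX column layout of df -P with mount points that contain no spaces.

```shell
# Print mount points whose usage is at or above a percentage threshold.
# Usage: df -P | filesystems_over 90
filesystems_over() {
  local limit="${1:-90}"
  awk -v limit="$limit" '
    NR > 1 {                      # skip the header row
      use = $5
      sub(/%/, "", use)           # "95%" -> "95"
      if (use + 0 >= limit) print $6, $5
    }'
}

# Extract the bytes actually read from and written to storage.
# Usage: io_bytes < /proc/<pid>/io
io_bytes() {
  awk '
    $1 == "read_bytes:"  { r = $2 }
    $1 == "write_bytes:" { w = $2 }
    END { print "read_bytes=" r, "write_bytes=" w }'
}
```

Running io_bytes twice a few seconds apart and comparing the numbers gives a quick view of whether a process is still writing heavily.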

Memory

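Memory problems include OOM errors, GC issues, and off‑heap usage. Start with free to view overall memory, then identify which kind of OutOfMemoryError the JVM reports.

java.lang.OutOfMemoryError: unable to create new native thread

This is not a heap problem: the JVM could not obtain memory or a thread slot from the operating system. Use jstack to check for unbounded thread creation, for example thread pools that are never shut down. If the thread count is legitimate, reduce the per‑thread stack size with -Xss or raise the OS limits, such as nproc in /etc/security/limits.conf.

java.lang.OutOfMemoryError: Java heap space

The heap is full. Investigate with jstack and jmap to look for a leak before raising -Xmx.

java.lang.OutOfMemoryError: Metaspace

Metaspace is exhausted. This can be mitigated by increasing -XX:MaxMetaspaceSize.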

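A StackOverflowError means a thread needed more stack than -Xss allows. The usual cause is runaway or very deep recursion, so fix the code first; increase -Xss only when the deep call chain is legitimate.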
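Because each error message calls for a different first step, a log triage helper can be sketched as follows. The function name classify_oom and the hint texts are illustrative assumptions, and the exact message wording varies slightly between JDK versions, so the patterns are kept loose.

```shell
# Read log lines from stdin and print a category plus a first step
# for each memory-related error found.
# Usage: classify_oom < application.log | sort | uniq -c
classify_oom() {
  local line
  while IFS= read -r line; do
    case "$line" in
      *"OutOfMemoryError: unable to create"*"native thread"*)
        echo "native-thread: check thread count with jstack, then -Xss and nproc limits" ;;
      *"OutOfMemoryError: Java heap space"*)
        echo "heap: take a heap dump with jmap and look for leaks" ;;
      *"OutOfMemoryError: Metaspace"*)
        echo "metaspace: review class loading, consider -XX:MaxMetaspaceSize" ;;
      *"StackOverflowError"*)
        echo "stack: look for deep recursion, then consider -Xss" ;;
    esac
  done
}
```

Lines that match none of the patterns are ignored, so the helper can be pointed at a full application log.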

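For a quick look at which classes occupy the heap, run jmap -histo:live <pid> (the :live option triggers a full GC first). For a full analysis, generate a heap dump with jmap -dump:format=b,file=heap.hprof <pid> and open it in Eclipse MAT. To locate memory leaks, start with the Leak Suspects report.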
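Before taking a full dump, comparing two class histograms taken some minutes apart can already hint at a leak. The following is a minimal sketch: histo_growth is a hypothetical helper, and it assumes the usual jmap -histo row layout of rank, instance count, bytes, and class name.

```shell
# Print the classes whose byte usage grew most between two histograms.
# Usage:
#   jmap -histo:live <pid> > before.txt
#   (wait a while)
#   jmap -histo:live <pid> > after.txt
#   histo_growth before.txt after.txt 10
histo_growth() {
  local before="$1" after="$2" top="${3:-10}"
  awk '
    # Data rows look like "   1:   1000   100000  [B"; skip headers and totals.
    $1 ~ /^[0-9]+:$/ {
      if (FILENAME == ARGV[1]) {
        old[$4] = $3                      # bytes per class in the first snapshot
      } else {
        delta = $3 - old[$4]              # classes absent before count from 0
        if (delta > 0) print delta, $4
      }
    }' "$before" "$after" | sort -rn | head -n "$top"
}
```

Classes that keep climbing across several comparisons are good candidates to examine first in MAT.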

GC Issues

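GC problems affect both CPU and memory. Monitor how each generation changes with jstat -gc <pid> 1000, which prints one sample every 1000 ms. On JDK 8, enable detailed GC logs with the following startup flags (JDK 9 and later use unified logging, for example -Xlog:gc*):

-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps

For G1, watch for frequent Young GC, long GC pauses, and Full GC triggered by concurrent phase failures, promotion failures, or large object allocation failures. Tune as needed: -Xmn and -XX:SurvivorRatio size the young generation, -XX:G1ReservePercent keeps more memory in reserve for promotion, -XX:InitiatingHeapOccupancyPercent controls when concurrent marking starts, and -XX:MaxDirectMemorySize caps direct (off‑heap) buffer memory.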
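To make the jstat samples easier to read, the old-generation occupancy and GC counts can be pulled out of each line. This is a sketch only: old_gen_usage is a hypothetical helper, and it assumes the jstat -gc header contains the OC, OU, YGC, and FGC columns and that the header is printed once at the top.

```shell
# Summarize jstat -gc samples: old-gen occupancy plus young and full GC counts.
# Usage: jstat -gc <pid> 1000 | old_gen_usage
old_gen_usage() {
  awk '
    NR == 1 {
      # Map header names to column positions so extra columns do not matter.
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    $(col["OC"]) > 0 {
      printf "old=%.1f%% YGC=%d FGC=%d\n",
        100 * $(col["OU"]) / $(col["OC"]), $(col["YGC"]), $(col["FGC"])
    }'
}
```

An old-generation percentage that stays high right after the FGC count increases suggests that Full GC is not reclaiming much, which is worth following up with a heap dump.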

Dump heap before and after Full GC with jinfo -flag +HeapDumpBeforeFullGC <pid> and jinfo -flag +HeapDumpAfterFullGC <pid> for comparative analysis.

Network

Network issues are complex and include timeouts, TCP queue overflows, RST packets, and connections piling up in TIME_WAIT or CLOSE_WAIT. Timeouts come in two kinds: a connection timeout limits how long the client waits to establish a connection, while a read/write timeout limits how long it waits for data once connected. As a rule, the caller's timeout should not exceed the timeout of the service it calls.

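TCP queue overflow can be diagnosed with netstat -s | egrep "listen|LISTEN", where the overflowed counter records accept‑queue overflows and the sockets dropped counter records SYNs dropped at listening sockets, and with ss -lnt, where for a listening socket Recv-Q is the current accept queue length and Send-Q is its maximum. If the counters keep rising, increase the application's listen backlog together with net.core.somaxconn and net.ipv4.tcp_max_syn_backlog.

TIME_WAIT tuning is a separate matter. To cope with large numbers of TIME_WAIT sockets, set net.ipv4.tcp_tw_reuse=1, or reduce net.ipv4.tcp_max_tw_buckets if necessary. Avoid net.ipv4.tcp_tw_recycle=1: it breaks clients behind NAT and was removed in Linux 4.12.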
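The ss -lnt check can be sketched as a filter that flags listeners whose accept queue has reached its limit. The name full_accept_queues is hypothetical, and the sketch assumes the usual ss column order of State, Recv-Q, Send-Q, and local address.

```shell
# Print listening sockets whose accept queue is at or above its backlog.
# Usage: ss -lnt | full_accept_queues
full_accept_queues() {
  awk '
    $1 == "LISTEN" && $3 + 0 > 0 && $2 + 0 >= $3 + 0 {
      print $4, "recv-q=" $2, "backlog=" $3
    }'
}
```

A listener that shows up here repeatedly is either under-provisioned for its backlog or served by an application that is too slow to accept connections.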

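RST packets indicate that a connection was terminated abnormally; capture traffic with tcpdump -i en0 tcp -w capture.cap (replace en0 with your interface name, for example eth0 on Linux) and analyze it in Wireshark. TIME_WAIT and CLOSE_WAIT counts can be monitored with netstat -n or ss -ant. A socket sits in CLOSE_WAIT after the peer has sent FIN and the local side has acknowledged it but the application has not yet called close, so a large number of them usually means application code is failing to close connections, often because a thread is blocked; use jstack to locate such threads.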
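A per-state connection count makes a TIME_WAIT or CLOSE_WAIT build-up visible at a glance. The sketch below uses a hypothetical helper name, tcp_state_counts, and reads ss -ant output, where the state is the first column and is spelled ESTAB, TIME-WAIT, and CLOSE-WAIT (netstat spells them ESTABLISHED, TIME_WAIT, and CLOSE_WAIT and puts the state in the last column).

```shell
# Count TCP connections per state, most frequent first.
# Usage: ss -ant | tcp_state_counts
tcp_state_counts() {
  awk '
    NR > 1 { count[$1]++ }                 # skip the header row
    END { for (s in count) print count[s], s }' | sort -rn
}
```

Running it periodically, for instance under watch, shows whether the CLOSE-WAIT count is steadily growing or merely fluctuating.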

Additional Tools

Use strace -f -e "brk,mmap,munmap" -p <pid> to trace memory allocation system calls, and gdb --batch --pid <pid> -ex "dump memory file.dump <addr> <addr+size>" to dump a memory region to a file for low‑level inspection.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Code Ape Tech Column

Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
