Comprehensive Guide to Java Production Issue Diagnosis and Performance Optimization
This article presents a thorough Java production troubleshooting workflow, covering essential knowledge, tools, and data analysis techniques, with detailed explanations of JVM garbage collection, profiling utilities, and real‑world case studies to help engineers quickly locate and resolve performance and stability problems.
Hello everyone, I'm Chen.
Today I'm sharing a very useful article on diagnosing production problems, including the tools and background knowledge you will need most often.
Online Issue Handling Process
The process is summarized in a slide screenshot, which is still relevant today.
Problem Diagnosis
Three aspects can be considered:
Knowledge: Some problems can be solved by reasoning alone, e.g., recalling that line 83 of the code has a bug.
Tools: When you cannot remember everything or the code is not yours, tools are needed to locate the issue.
Data: Runtime data can also provide many clues.
Knowledge
Knowledge includes many aspects, briefly listed below:
Language (specifically Java): JVM knowledge, multithreading, etc.
Frameworks: Dubbo, Spring, etc.
Components: MySQL, RocketMQ, etc.
Others: Network, operating system, etc.
For example, understanding the whole lifecycle of a Java object from allocation to reclamation is crucial; the diagram below is very clear and should be memorized.
Then also understand common garbage collectors:
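Throughput (in the GC sense) = share of total time spent running application code = run time / (run time + GC time). A higher GC throughput usually means more requests processed per unit time, but the two are not the same metric.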
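The ratio above can be estimated for a running JVM from the standard management beans. The following is a minimal sketch, not part of the original article: the class name GcThroughput is made up for illustration, and for concurrent collectors the reported collection time is not purely pause time, so treat the figure as an approximation.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcThroughput {

    // Throughput as defined above: run time / (run time + GC time).
    static double throughput(long runMillis, long gcMillis) {
        long total = runMillis + gcMillis;
        return total == 0 ? 1.0 : (double) runMillis / total;
    }

    // Sums the accumulated collection time reported by every collector in this JVM.
    static long totalGcMillis() {
        long sum = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                sum += t;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gc = totalGcMillis();
        long run = Math.max(0, uptime - gc); // assumption: uptime = run time + GC time
        System.out.println("uptime(ms)=" + uptime + " gc(ms)=" + gc
                + " throughput=" + throughput(run, gc));
    }
}
```

For example, 990 ms of application time and 10 ms of GC time give a throughput of 0.99.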
Using ParNew + CMS as an example, answer the following questions:
Why use generational collection? – Keyword: efficiency.
When does an object move to the old generation? – Keywords: age, size.
When do Young GC and Full GC occur? – Keywords: Eden shortage, Old shortage, Metaspace shortage, System.gc, etc.
If we understand the above knowledge, consider a practical case: when Young GC is triggered frequently with high latency, how to optimize?
First, think: Young GC is triggered when Eden space is insufficient.
Second, the main cost of Young GC is scanning + copying; scanning is fast, copying is slower.
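Thus, increase the young generation size. With a longer interval between collections, more objects die before the next GC and fewer need to be copied; because copying dominates the cost, the result improves.
Assume the young generation size is M, objects survive for 750 ms, a Young GC runs every 500 ms, scanning takes T1, and copying takes T2:
When size = M: 2 GCs per second; objects are still alive when the GC runs, so each GC takes T1 + T2.
When size = 2M: 1 GC per second; the 1000 ms interval exceeds the 750 ms survival time, so the objects are already dead and nothing is copied, while scanning twice the space takes 2 T1.
Since T2 ≫ T1, 2 T1 < T1 + T2, so the optimization works.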
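The arithmetic above can be checked with a small sketch. The 500 ms interval and 750 ms survival time come from the text; T1 = 1 ms and T2 = 10 ms are hypothetical values picked only so that T2 is much larger than T1. The model is deliberately simplified: objects are either all copied or not copied at all.

```java
public class YoungGcCost {

    static final double BASE_INTERVAL_MS = 500;   // Young GC interval at size M (from the text)
    static final double OBJECT_LIFETIME_MS = 750; // object survival time (from the text)

    // Pause of a single Young GC when the young generation is factor x M.
    static double pauseMs(int factor, double t1, double t2) {
        double interval = BASE_INTERVAL_MS * factor;         // a larger Eden fills more slowly
        double scan = factor * t1;                           // scanning scales with the size
        double copy = OBJECT_LIFETIME_MS > interval ? t2 : 0; // copy only objects still alive
        return scan + copy;
    }

    // Total GC time spent per second of wall-clock time.
    static double gcMsPerSecond(int factor, double t1, double t2) {
        double gcsPerSecond = 1000.0 / (BASE_INTERVAL_MS * factor);
        return gcsPerSecond * pauseMs(factor, t1, t2);
    }

    public static void main(String[] args) {
        double t1 = 1, t2 = 10; // hypothetical scan and copy times in ms
        for (int factor = 1; factor <= 2; factor++) {
            System.out.println("size=" + factor + "M pause(ms)=" + pauseMs(factor, t1, t2)
                    + " gcPerSecond(ms)=" + gcMsPerSecond(factor, t1, t2));
        }
    }
}
```

With these values the pause drops from 11 ms to 2 ms, and GC time per second drops from 22 ms to 2 ms.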
Tools
Java provides several categories of tools:
JDK built‑ins: jstat, jstack, jmap, jconsole, jvisualvm.
Third‑party: MAT (Eclipse plugin), GCHisto, GCeasy (online GC log analysis).
Open source: Arthas, bistoury, async‑profiler.
Understanding their principles is helpful. For CPU profilers there are two main types:
Sampling: low overhead but limited frequency and may suffer from SafePoint bias.
Instrumentation: inserts AOP logic into every method, accurate but high overhead.
For example, Uber’s open‑source uber-common/jvm-profiler is a sampling profiler that suffers from SafePoint bias. In one CPU‑usage investigation, the flame graph it collected was almost useless.
A SafePoint is a point in execution at which a thread's state is fully known to the JVM, so the JVM can safely pause the thread there. If a profiler can only take samples at SafePoints, the samples cluster at those locations instead of where CPU time is actually spent; this is SafePoint bias.
Using async-profiler instead avoids SafePoint bias because it leverages the AsyncGetCallTrace technique. After optimizing based on its flame graph, QPS increased from 58 k to 81 k and CPU usage dropped from 72 % to 41 %.
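To make the sampling approach concrete, here is a minimal sketch of a naive in-process sampler. It is not how jvm-profiler or async-profiler is implemented; the class name NaiveSampler is invented for illustration. It relies on Thread.getAllStackTraces(), which on HotSpot collects stacks at a SafePoint, so its results are subject to exactly the bias described above.

```java
import java.util.HashMap;
import java.util.Map;

public class NaiveSampler {

    // Takes `samples` snapshots, `intervalMs` apart, and counts the top frame
    // of every thread that is RUNNABLE when it is inspected.
    static Map<String, Integer> sample(int samples, long intervalMs) {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                Thread t = e.getKey();
                StackTraceElement[] stack = e.getValue();
                if (t == Thread.currentThread() || stack.length == 0) {
                    continue; // skip the sampler itself and threads without Java frames
                }
                if (t.getState() != Thread.State.RUNNABLE) {
                    continue; // state is read slightly after the snapshot, so this is approximate
                }
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                hits.merge(top, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Thread worker = new Thread(() -> {
            long x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x += System.nanoTime() % 7; // busy work to give the sampler something to see
            }
        }, "busy-worker");
        worker.setDaemon(true);
        worker.start();
        sample(50, 10).forEach((frame, n) -> System.out.println(n + "  " + frame));
        worker.interrupt();
    }
}
```

The counts show where threads were when the JVM could stop them, which is not necessarily where the CPU time went.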
Data
Data includes:
Monitoring data such as APM, metrics, JVM monitoring, distributed tracing, etc.
Runtime data such as business data, access logs, GC logs, system logs, etc.
This part is analyzed case‑by‑case; there is no universal template.
Experience
From experience, common problems can be approached as follows:
Execution exceptions: check logs, debug, replay requests.
Application hangs: use jstack.
High latency: trace, benchmark.
High CPU usage: CPU profiling.
Frequent or slow GC: analyze GC logs.
OOM or high memory usage: dump and analyze memory.
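For the last item in the list above, a heap dump can also be triggered from inside the JVM. The sketch below assumes a HotSpot JVM, since it uses the HotSpot-specific com.sun.management.HotSpotDiagnosticMXBean; the class and method names HeapDumper, dumpLiveHeap, and dumpToTempDir are invented for illustration. Recent JDKs reject target file names that do not end in .hprof, and dumping only live objects typically forces a full GC first.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

public class HeapDumper {

    // Writes a dump of live (reachable) objects to `target`; the file must not exist yet.
    static Path dumpLiveHeap(Path target) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            bean.dumpHeap(target.toString(), true);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience helper: dumps into a fresh temporary directory.
    static Path dumpToTempDir() {
        try {
            Path dir = Files.createTempDirectory("heapdump");
            return dumpLiveHeap(dir.resolve("app.hprof"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = dumpToTempDir();
        System.out.println("heap dump written to " + file + " (" + file.toFile().length() + " bytes)");
    }
}
```

The resulting file can be opened in MAT in the same way as a dump produced by jmap.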
Case Studies
Cobar Hang – Port Open but No Requests Processed
First, take the faulty machine out of rotation so that its state is preserved, then investigate; the logs point to a memory leak.
Question: Can the exact leak location be determined directly from logs? – Answer: No.
Dump the memory for offline analysis; if the file is large, compress it before transferring it.
jmap -dump:format=b,file=/cobar.bin ${pid}
Analyze the dump with Eclipse MAT; the root cause was a bug introduced by a custom modification to Cobar. For more memory‑analysis articles, see:
"A Long Dubbo Gateway Memory Leak Investigation"
"SkyWalking Memory Leak Investigation"
Gateway High Latency
Use Arthas trace to follow the call chain:
trace com.beibei.airborne.embed.extension.PojoUtils generalize
Sentinel Integration Causing Application Hang
After adding a rate‑limiting rule in Sentinel, the application hangs; jstack quickly reveals the problem.
jstack ${pid} > jstack.txt
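When jstack cannot be attached, similar information can be collected from inside the application through the standard ThreadMXBean. This is a minimal sketch, not the method used in the case above; the class name InProcessThreadDump is invented for illustration, and the output format only loosely resembles jstack's.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class InProcessThreadDump {

    // Builds a text dump of all live threads plus a deadlock summary.
    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        boolean monitors = threads.isObjectMonitorUsageSupported();
        boolean synchronizers = threads.isSynchronizerUsageSupported();
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(monitors, synchronizers)) {
            if (info == null) {
                continue; // defensive: skip threads that have already terminated
            }
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            if (info.getLockName() != null) {
                out.append("    waiting on ").append(info.getLockName());
                if (info.getLockOwnerName() != null) {
                    out.append(" held by \"").append(info.getLockOwnerName()).append('"');
                }
                out.append('\n');
            }
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        // Both calls return null when no deadlock is found.
        long[] deadlocked = synchronizers
                ? threads.findDeadlockedThreads()
                : threads.findMonitorDeadlockedThreads();
        out.append("deadlocked threads: ")
           .append(deadlocked == null ? 0 : deadlocked.length).append('\n');
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Threads that are BLOCKED or WAITING on the same lock, together with the "held by" owner, are usually the first thing to look for in a hang.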
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
