Comprehensive Guide to Java Production Issue Diagnosis and Performance Optimization

This article presents a thorough Java production troubleshooting workflow, covering essential knowledge, tools, and data analysis techniques, with detailed explanations of JVM garbage collection, profiling utilities, and real‑world case studies to help engineers quickly locate and resolve performance and stability problems.

Code Ape Tech Column

Hello everyone, I'm Chen.

Today I'm sharing a very useful article on diagnosing production problems, covering the tools and knowledge points most commonly needed in production.

Online Issue Handling Process

The process is shown in the original article as a slide screenshot; it is still relevant.

Problem Diagnosis

Three aspects can be considered:

Knowledge: Some problems can be solved by reasoning alone, e.g., realizing from what you know of the code that line 83 is the culprit.

Tools: When you cannot remember everything or the code is not yours, tools are needed to locate the issue.

Data: Runtime data can also provide many clues.

Knowledge

Knowledge includes many aspects, briefly listed below:

Language (specifically Java): JVM knowledge, multithreading, etc.

Frameworks: Dubbo, Spring, etc.

Components: MySQL, RocketMQ, etc.

Others: Network, operating system, etc.

For example, understanding the whole lifecycle of a Java object from allocation to reclamation is crucial; the diagram below is very clear and should be memorized.

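It also helps to understand the common garbage collectors and the throughput metric used to compare them:

Throughput = application run time / (application run time + GC time). The higher this ratio, the more time is left for application work, which generally means more requests processed per unit time.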
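As a rough illustration of the throughput ratio, the following sketch reads the accumulated GC time of the running JVM through the standard `java.lang.management` beans. The class name `GcThroughput` is hypothetical, and treating (uptime - GC time) as application run time is an assumption: for concurrent collectors the reported collection time is not purely pause time, so the result is only an approximation.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcThroughput {

    /** Throughput as a ratio: run time / (run time + GC time). */
    static double throughput(long appRunMillis, long gcMillis) {
        long total = appRunMillis + gcMillis;
        return total == 0 ? 1.0 : (double) appRunMillis / total;
    }

    /** Sums the accumulated collection time reported by every collector in this JVM. */
    static long totalGcMillis() {
        long sum = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                sum += t;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gc = totalGcMillis();
        // Assumption: everything that is not reported as GC time counts as run time.
        long run = Math.max(0, uptime - gc);
        System.out.printf("uptime=%d ms, gc=%d ms, throughput=%.4f%n",
                uptime, gc, throughput(run, gc));
    }
}
```

For a long-running service, the same numbers are usually taken from GC logs or monitoring rather than computed in-process.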

Using ParNew + CMS as an example, answer the following questions:

Why use generational collection? – Keyword: efficiency.

When does an object move to the old generation? – Keywords: age, size.

When do Young GC and Full GC occur? – Keywords: Eden shortage, Old shortage, Metaspace shortage, System.gc, etc.
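To relate these questions to a running JVM, the sketch below lists the memory pools (Eden, Survivor, old generation, Metaspace and so on) with their current usage, using the standard `MemoryPoolMXBean` API. The class name `MemoryPools` is hypothetical, and the pool names printed depend on the collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

class MemoryPools {

    /** One line per memory pool: kind, name, used bytes and max bytes (-1 when undefined). */
    static List<String> describe() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            String kind = pool.getType() == MemoryType.HEAP ? "heap" : "non-heap";
            lines.add(String.format("%-8s %-30s used=%d max=%d",
                    kind, pool.getName(), usage.getUsed(), usage.getMax()));
        }
        return lines;
    }

    public static void main(String[] args) {
        describe().forEach(System.out::println);
    }
}
```

Watching which pool fills up before a collection gives a first hint about whether Eden, the old generation, or Metaspace is the one running short.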

With that knowledge in hand, consider a practical case: Young GC is triggered frequently and each collection takes a long time. How can it be optimized?

First, think: Young GC is triggered when Eden space is insufficient.

Second, the main cost of Young GC is scanning + copying; scanning is fast, copying is slower.

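Thus, increase the young generation size. Collections then run less often, so more objects have already died by the time each one runs and fewer survivors need copying; because copying dominates the cost, the result improves.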

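Assume the young generation size is M, objects survive for 750 ms, the Young GC interval is 500 ms, scanning takes T1, and copying takes T2:

When size = M: frequency is 2 times/s and each collection takes T1 + T2, because the 500 ms interval is shorter than the 750 ms survival time, so live objects must be copied.

When size = 2M: frequency is 1 time/s and each collection takes 2 T1. The interval grows to 1000 ms, longer than the survival time, so in this simplified model the objects are already dead and nothing needs copying, while scanning twice the space takes twice as long.

Since T2 ≫ T1, 2 T1 < T1 + T2, so the optimization works.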
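The arithmetic of this simplified model can be checked with a few lines of Java. The class name `YoungGcCostModel` and the values T1 = 5 ms and T2 = 50 ms are hypothetical, chosen only so that T2 is much larger than T1; the rule "copy only when the interval is shorter than the survival time" is the simplification used in the text, not a description of a real collector.

```java
class YoungGcCostModel {

    /**
     * GC cost in ms per second of wall time under the simplified model:
     * survivors are copied only when the GC interval is shorter than the
     * object survival time, and scan time grows linearly with size.
     */
    static double costPerSecond(double sizeFactor, double baseIntervalMs,
                                double survivalMs, double t1, double t2) {
        double intervalMs = baseIntervalMs * sizeFactor;  // larger young gen, longer interval
        double scan = t1 * sizeFactor;                    // more space to scan
        double copy = intervalMs < survivalMs ? t2 : 0.0; // nothing to copy once objects have died
        double collectionsPerSecond = 1000.0 / intervalMs;
        return collectionsPerSecond * (scan + copy);
    }

    public static void main(String[] args) {
        double t1 = 5;  // hypothetical scan time in ms
        double t2 = 50; // hypothetical copy time in ms
        System.out.println(costPerSecond(1, 500, 750, t1, t2)); // size M:  prints 110.0
        System.out.println(costPerSecond(2, 500, 750, t1, t2)); // size 2M: prints 10.0
    }
}
```

Counted per second rather than per collection, the gain is even larger, since the number of collections is halved as well.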

Tools

Java provides several categories of tools:

JDK built‑ins: jstat, jstack, jmap, jconsole, jvisualvm.

Third‑party: MAT (Eclipse plugin), GCHisto, GCeasy (online GC log analysis).

Open source: Arthas, bistoury, async‑profiler.

Understanding their principles is helpful. For CPU profilers there are two main types:

Sampling: low overhead but limited frequency and may suffer from SafePoint bias.

Instrumentation: inserts AOP logic into every method, accurate but high overhead.

For example, Uber’s open‑source uber-common/jvm-profiler is a sampling profiler that suffers from SafePoint bias. In one CPU‑usage investigation, the flame graph collected was almost useless.

A SafePoint (safety point) is a point in execution at which the JVM can safely pause a thread. If samples are taken only at SafePoints, they may not reflect where CPU time is actually spent; this is SafePoint bias.
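To make the sampling approach concrete, here is a minimal sketch of a naive sampling profiler built on `Thread.getAllStackTraces()`. The class name `NaiveSamplingProfiler` is hypothetical and this is not how any of the tools named above are implemented. Because this API obtains stacks while threads are stopped at SafePoints, the sketch is subject to the SafePoint bias described above.

```java
import java.util.HashMap;
import java.util.Map;

class NaiveSamplingProfiler {

    /** Samples all RUNNABLE threads and counts how often each top frame is seen. */
    static Map<String, Integer> sample(int samples, long intervalMs) {
        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < samples; i++) {
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                Thread thread = e.getKey();
                StackTraceElement[] stack = e.getValue();
                if (thread == Thread.currentThread() || stack.length == 0) {
                    continue; // skip the profiler itself and threads without Java frames
                }
                if (thread.getState() != Thread.State.RUNNABLE) {
                    continue; // waiting threads do not consume CPU
                }
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                hits.merge(top, 1, Integer::sum);
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Thread worker = new Thread(() -> {
            double x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x += Math.sqrt(x + 1);
            }
        }, "busy-worker");
        worker.setDaemon(true);
        worker.start();

        sample(50, 10).entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .limit(5)
                .forEach(en -> System.out.println(en.getValue() + "  " + en.getKey()));
        worker.interrupt();
    }
}
```

A thread that is RUNNABLE in Java terms may still be blocked in native I/O, so the counts are only a rough indication of CPU use.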

Using async-profiler instead avoids SafePoint bias because it leverages the AsyncGetCallTrace technique. After optimizing based on its flame graph, QPS increased from 58 k to 81 k and CPU usage dropped from 72 % to 41 %.

Data

Data includes:

Monitoring data such as APM, metrics, JVM monitoring, distributed tracing, etc.

Runtime data such as business data, access logs, GC logs, system logs, etc.

This part is analyzed case‑by‑case; there is no universal template.

Experience

From experience, common problems can be approached as follows:

Execution exceptions: check logs, debug, replay requests.

Application hangs: use jstack.

High latency: trace, benchmark.

High CPU usage: CPU profiling.

Frequent or slow GC: analyze GC logs.

OOM or high memory usage: dump and analyze memory.

Case Studies

Cobar Hang – Port Open but No Requests Processed

First, take the faulty machine out of service so that the scene is preserved, then investigate; the logs point to a memory leak.

Question: Can the exact leak location be determined directly from logs? – Answer: No.

Dump the memory for offline analysis; if the file is large, compress it first.

jmap -dump:format=b,file=/cobar.bin ${pid}

Analyze the dump with Eclipse MAT; the root cause was a custom modification in Cobar that introduced a bug. For more memory‑analysis articles, see:

"A Long Dubbo Gateway Memory Leak Investigation"

"SkyWalking Memory Leak Investigation"
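The case above used jmap from outside the process. As an alternative sketch, a heap dump can also be requested from inside a HotSpot JVM through `com.sun.management.HotSpotDiagnosticMXBean`. The class name `HeapDumper` and the file name `app.hprof` are hypothetical, and the bean is HotSpot-specific, so this assumes a HotSpot-based JDK.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

class HeapDumper {

    /** Writes an HPROF heap dump of the current JVM into a fresh temp directory. */
    static Path dumpToTempDir(boolean liveObjectsOnly) {
        try {
            // The target file must not exist yet; a new temp directory guarantees that.
            Path file = Files.createTempDirectory("heapdump").resolve("app.hprof");
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(file.toString(), liveObjectsOnly);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = dumpToTempDir(true);
        System.out.println("wrote " + file + " (" + file.toFile().length() + " bytes)");
    }
}
```

The resulting file can be opened in Eclipse MAT in the same way as a dump produced by jmap. Dumping pauses the application, so on a production machine it is best done after the instance has been taken out of service.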

Gateway High Latency

Use Arthas trace to follow the call chain:

trace com.beibei.airborne.embed.extension.PojoUtils generalize

Sentinel Integration Causing Application Hang

After adding a rate‑limiting rule in Sentinel, the application hangs; jstack quickly reveals the problem.

jstack ${pid} > jstack.txt
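Some of what jstack shows is also available in-process through the standard `ThreadMXBean`. The sketch below, with the hypothetical class name `HangInspector`, counts threads by state and checks for deadlocks. It assumes nothing about the actual cause of the Sentinel hang, which the source does not detail.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

class HangInspector {

    /** Counts live threads by state, a quick first look at where an application is stuck. */
    static Map<Thread.State, Integer> stateHistogram() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> histogram = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info != null) {
                histogram.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return histogram;
    }

    /** Number of threads deadlocked on monitors or ownable synchronizers (0 if none). */
    static int deadlockedThreadCount() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    public static void main(String[] args) {
        System.out.println("thread states: " + stateHistogram());
        System.out.println("deadlocked threads: " + deadlockedThreadCount());
    }
}
```

An unusually large number of BLOCKED or WAITING threads on the same lock is typically the first thing to look for in the full jstack output.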
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Code Ape Tech Column

Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
