Uncovering Java Call Latency Spikes: Memory, GC, and Network Bottlenecks
A Java service experienced occasional five‑minute latency spikes despite similar provider response times, prompting a systematic investigation of container memory usage, page‑cache behavior, young‑generation GC pauses, and network bottlenecks, ultimately revealing and mitigating the root causes.
Phenomenon
In most cases the latency measured by the caller and the latency measured by the provider are similar. Occasionally, however, the caller sees latency far higher than the provider reports: up to five minutes, with more than 20 occurrences.
Monitoring Added
Both the caller and the provider wrapped the JSF interface call in monitoring, with no additional logic inside the monitored span, so the two measurements cover the same call.
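A minimal sketch of what "monitoring around the interface without any additional logic" can look like. `TimedCall` and `Timed` are hypothetical names for illustration, not part of JSF or UMP, and the lambda stands in for the remote call.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;
import java.util.function.Supplier;

/** Hypothetical timing wrapper; not the JSF or UMP API. */
final class TimedCall {
    /** A call result together with its elapsed wall-clock time. */
    static final class Timed<T> {
        final T value;
        final long elapsedMicros;

        Timed(T value, long elapsedMicros) {
            this.value = value;
            this.elapsedMicros = elapsedMicros;
        }
    }

    /** Times only the call itself; nothing else runs inside the measured span. */
    static <T> Timed<T> time(Supplier<T> call) {
        long start = System.nanoTime();
        T value = call.get();
        long elapsedNanos = System.nanoTime() - start;
        return new Timed<>(value, TimeUnit.NANOSECONDS.toMicros(elapsedNanos));
    }

    public static void main(String[] args) {
        Timed<String> timed = time(() -> {
            LockSupport.parkNanos(20_000_000L); // stand-in for the remote interface call
            return "ok";
        });
        System.out.println(timed.value + " took " + timed.elapsedMicros + " us");
    }
}
```

Using the same wrapper on both sides means any gap between the caller's and the provider's numbers comes from what lies between them (serialization, network, queuing, pauses) rather than from differences in what is being timed.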
Investigation Steps
1. Data‑flow analysis
The request path includes:
Caller container and host
Network between caller and provider
Provider container and host
Network from provider back to caller
2. Initial hypothesis
The bottleneck could lie in container or host resources, in network fluctuations, or in another layer. The investigation started with the network.
3. Evidence gathering
3.1 Monitoring
No network monitoring was available. The JDOS team was consulted and suggested checking container memory usage.
Container memory usage (including cache) consistently stays above 99%.
3.1.2 Metric meaning
The metric combines RSS (actual physical memory used by processes) and Page Cache (disk‑file data cached in memory to improve I/O performance).
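Page cache counts toward this metric but is reclaimable: the kernel can drop it when processes need memory. For a Java application, a high reading driven by page cache therefore does not by itself mean the application is short of memory.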
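To see how much of the reported usage is page cache, the two components can be read separately. The sketch below assumes a cgroup v1 container, where `memory.stat` exposes `rss` and `cache` lines in bytes. The default path is an assumption and differs under cgroup v2, which uses `anon` and `file` instead. `CgroupMemoryBreakdown` is a hypothetical name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that splits container memory into RSS and page cache. */
final class CgroupMemoryBreakdown {
    /** Parses "key value" lines as found in a cgroup v1 memory.stat file. */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> stats = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                continue; // skip anything that is not a simple key/value pair
            }
            try {
                stats.put(parts[0], Long.parseLong(parts[1]));
            } catch (NumberFormatException ignored) {
                // non-numeric value: not a counter we care about
            }
        }
        return stats;
    }

    /** Fraction of (rss + cache) that is page cache; 0.0 if both are missing. */
    static double cacheShare(Map<String, Long> stats) {
        long rss = stats.getOrDefault("rss", 0L);
        long cache = stats.getOrDefault("cache", 0L);
        long total = rss + cache;
        return total == 0 ? 0.0 : (double) cache / total;
    }

    public static void main(String[] args) throws IOException {
        // Assumed cgroup v1 location; pass another path as the first argument.
        Path path = Paths.get(args.length > 0 ? args[0] : "/sys/fs/cgroup/memory/memory.stat");
        Map<String, Long> stats = parse(Files.readAllLines(path));
        System.out.printf("rss=%d bytes, cache=%d bytes, cache share=%.1f%%%n",
                stats.getOrDefault("rss", 0L),
                stats.getOrDefault("cache", 0L),
                cacheShare(stats) * 100);
    }
}
```

A large cache share with a modest RSS suggests that file I/O such as logging, rather than the JVM heap, is what pushes the metric toward the limit.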
3.1.3 Reducing container memory usage
Examined other Java clusters and observed periodic drops in memory usage aligned with log‑cleanup intervals.
After log cleanup on the provider side, memory usage decreased, though latency spikes persisted.
3.2 Container processing bottleneck
CPU and memory remained normal before and after scaling the provider from 4 to 8 nodes.
Scaling did not noticeably improve caller latency.
3.3 Latency analysis
The operations team identified longer young‑generation GC (young GC) pause times as a possible contributor.
Young GC pauses correlated with the caller latency spikes, although the monitoring data is only minute‑level, so the correlation is coarse.
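Where minute-level charts are too coarse, the JVM's own counters can be sampled at a shorter interval. The following is a sketch using the standard `GarbageCollectorMXBean` API. The one-second interval and the `GcPauseSampler` name are assumptions. `getCollectionTime()` is an approximate accumulated time, and collector names vary by JVM and GC algorithm.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

/** Hypothetical sampler that reports GC activity per short interval. */
final class GcPauseSampler {
    /** Average time per collection between two samples, in ms; 0.0 if none occurred. */
    static double averagePauseMillis(long countBefore, long timeBeforeMs,
                                     long countAfter, long timeAfterMs) {
        long collections = countAfter - countBefore;
        return collections <= 0 ? 0.0 : (double) (timeAfterMs - timeBeforeMs) / collections;
    }

    public static void main(String[] args) throws InterruptedException {
        List<GarbageCollectorMXBean> beans = ManagementFactory.getGarbageCollectorMXBeans();
        long[] counts = new long[beans.size()];
        long[] times = new long[beans.size()];
        for (int i = 0; i < beans.size(); i++) {
            counts[i] = beans.get(i).getCollectionCount();
            times[i] = beans.get(i).getCollectionTime();
        }

        Thread.sleep(1000); // assumed sampling interval

        for (int i = 0; i < beans.size(); i++) {
            GarbageCollectorMXBean gc = beans.get(i);
            long count = gc.getCollectionCount();
            long time = gc.getCollectionTime();
            System.out.printf("%s: %d collections, avg %.2f ms%n",
                    gc.getName(), count - counts[i],
                    averagePauseMillis(counts[i], times[i], count, time));
        }
    }
}
```

Logging these deltas with timestamps alongside the caller's latency records would allow a spike to be matched against a specific collection rather than a one-minute bucket.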
3.4 Network capture and PFinder
Capturing packets across all caller and provider machines is impractical; instead, a single caller‑provider pair was selected for packet capture while monitoring UMP for spikes.
When UMP shows a spike, check PFinder data; if absent, continue capturing.
Successful capture revealed:
Caller sent request at 22:24:50.775730, received response at 22:24:50.988867 (213 ms).
The provider received the packet at 22:24:50.775723, finished processing at 22:24:50.983, and sent the response at 22:24:50.988776: about 208 ms of processing plus 4.55 ms of handling, which matches the latency the caller observed.
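The comparison can be checked directly from the four capture timestamps. This sketch uses only `java.time`, and `CaptureTimeline` is a hypothetical name. Each span is measured on a single host's clock, so the durations are comparable even if the two hosts' clocks are not perfectly synchronized.

```java
import java.time.Duration;
import java.time.LocalTime;

/** Hypothetical helper for comparing spans taken from a packet capture. */
final class CaptureTimeline {
    /** Microseconds between two ISO local times such as "22:24:50.775730". */
    static long microsBetween(String start, String end) {
        return Duration.between(LocalTime.parse(start), LocalTime.parse(end)).toNanos() / 1_000;
    }

    public static void main(String[] args) {
        // Caller side: request sent -> response received.
        long callerMicros = microsBetween("22:24:50.775730", "22:24:50.988867");
        // Provider side: request received -> response sent.
        long providerMicros = microsBetween("22:24:50.775723", "22:24:50.988776");

        System.out.println("caller round trip:  " + callerMicros + " us");   // 213137 us
        System.out.println("provider residence: " + providerMicros + " us"); // 213053 us
        System.out.println("difference:         " + (callerMicros - providerMicros) + " us"); // 84 us
    }
}
```

For this sample the caller's round trip exceeds the provider's receive-to-respond span by only 84 µs, so nearly all of the 213 ms was spent on the provider side rather than on the network.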
Root‑cause hypotheses
Container resource bottleneck (CPU/memory normal, scaling ineffective).
Young GC pauses adding delay.
Mitigation
Goal
Reduce young GC pause time (no Full GC was observed).
Approach
Increase young‑generation heap size.
Scale out (already attempted).
Redirect MQ consumption to other groups to lower object allocation.
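The source does not state which flags or sizes were used. On HotSpot the young generation is commonly sized with `-Xmn` or `-XX:NewRatio`, and the effect of such a change can be confirmed from inside the JVM. The sketch below assumes HotSpot-style pool names containing "Eden" or "Survivor", which other collectors may not use. `YoungGenReport` is a hypothetical name.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Hypothetical report of young-generation pool sizes for the running JVM. */
final class YoungGenReport {
    /** Assumes HotSpot naming, e.g. "PS Eden Space" or "G1 Survivor Space". */
    static boolean isYoungPool(String poolName) {
        return poolName.contains("Eden") || poolName.contains("Survivor");
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (!isYoungPool(pool.getName())) {
                continue;
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // pool is no longer valid
            }
            // getMax() returns -1 when the maximum is undefined.
            long maxMiB = usage.getMax() < 0 ? -1 : usage.getMax() >> 20;
            System.out.printf("%s: committed=%d MiB, max=%d MiB%n",
                    pool.getName(), usage.getCommitted() >> 20, maxMiB);
        }
    }
}
```

Running this before and after a flag change shows whether the young generation actually grew. Whether a larger young generation shortens or lengthens individual pauses depends on the collector and on how many objects survive each collection, so the result is worth verifying against GC logs.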
Result
After adjustments, caller and provider latency charts aligned, and the discrepancy was resolved.
JD Cloud Developers
JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.
