
Uncovering Java Call Latency Spikes: Memory, GC, and Network Bottlenecks

A Java service experienced occasional five‑minute latency spikes despite similar provider response times, prompting a systematic investigation of container memory usage, page‑cache behavior, young‑generation GC pauses, and network bottlenecks, ultimately revealing and mitigating the root causes.

JD Cloud Developers

Dec 16, 2024

Phenomenon

In most cases the caller's latency and the provider's latency are similar. Occasionally, however, the caller sees latency far higher than the provider reports: up to five minutes, with more than 20 such occurrences observed.

Monitoring Added

Both the caller and the provider added monitoring directly around the JSF interface call, with no additional logic inside the monitored section.

Investigation Steps

1. Data‑flow analysis

The request path includes:

Caller container and host

Network between caller and provider

Provider container and host

Network from provider back to caller

2. Initial hypothesis

The bottleneck could lie in container or host resources, in network fluctuations, or in another layer. The investigation started with the network.

3. Evidence gathering

3.1 Monitoring

No network monitoring was available. The JDOS team was consulted and suggested checking container memory usage.

Container memory usage (including cache) consistently stayed above 99%.

3.1.2 Metric meaning

The metric combines RSS (actual physical memory used by processes) and Page Cache (disk‑file data cached in memory to improve I/O performance).

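For Java applications, page cache does not reduce the memory effectively available to the process, because the kernel can reclaim it when needed. A reading above 99% that includes cache therefore does not by itself indicate memory pressure.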
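The split between RSS and page cache can be checked from inside the container. The following is a minimal sketch, assuming a cgroup v1 layout where `memory.stat` exposes `rss` and `cache` counters; the file paths and the class name `ContainerMemoryBreakdown` are assumptions for illustration, not something the original investigation describes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ContainerMemoryBreakdown {

    /** Parses "key value" lines (the memory.stat layout) into a map. */
    static Map<String, Long> parseStat(List<String> lines) {
        Map<String, Long> stat = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                try {
                    stat.put(parts[0], Long.parseLong(parts[1]));
                } catch (NumberFormatException ignored) {
                    // Skip lines whose value is not a plain integer.
                }
            }
        }
        return stat;
    }

    /** Usage percentage counting only RSS, i.e. excluding reclaimable page cache. */
    static double rssPercent(Map<String, Long> stat, long limitBytes) {
        return 100.0 * stat.getOrDefault("rss", 0L) / limitBytes;
    }

    /** Usage percentage counting RSS plus page cache, like the combined metric. */
    static double rssPlusCachePercent(Map<String, Long> stat, long limitBytes) {
        long used = stat.getOrDefault("rss", 0L) + stat.getOrDefault("cache", 0L);
        return 100.0 * used / limitBytes;
    }

    public static void main(String[] args) throws IOException {
        // Assumed cgroup v1 locations; adjust for the actual container runtime.
        Path statFile = Paths.get("/sys/fs/cgroup/memory/memory.stat");
        Path limitFile = Paths.get("/sys/fs/cgroup/memory/memory.limit_in_bytes");
        if (!Files.isReadable(statFile) || !Files.isReadable(limitFile)) {
            System.out.println("cgroup v1 memory files not found; nothing to report");
            return;
        }
        Map<String, Long> stat = parseStat(Files.readAllLines(statFile));
        long limit = Long.parseLong(Files.readAllLines(limitFile).get(0).trim());
        System.out.printf("rss only: %.1f%%, rss + cache: %.1f%%%n",
                rssPercent(stat, limit), rssPlusCachePercent(stat, limit));
    }
}
```

If the RSS-only figure sits well below the combined one, the high reading is mostly cache, which is consistent with the observation that it drops after log cleanup.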

3.1.3 Reducing container memory usage

Examined other Java clusters and observed periodic drops in memory usage aligned with log‑cleanup intervals.

After log cleanup on the provider side, memory usage decreased, though latency spikes persisted.

3.2 Container processing bottleneck

CPU and memory remained normal before and after scaling the provider from 4 to 8 nodes.

Scaling did not noticeably improve caller latency.

3.3 Latency analysis

The operations team identified longer young‑generation GC (young GC) pause times as a possible contributor.

A correlation between young GC pauses and caller latency was observed, although the monitoring data is coarse (minute‑level).
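Because minute-level data makes it hard to line up a single pause with a single slow call, one option is to sample the JVM's own GC counters at a finer interval. The sketch below uses the standard `GarbageCollectorMXBean` API; the class name `GcPauseSampler`, the sampling interval, and the sample count are hypothetical choices, and the reported time is the collector's accumulated elapsed time, which only approximates pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

final class GcPauseSampler {

    /** Snapshot of {collection count, accumulated time in ms} per collector. */
    static Map<String, long[]> snapshot() {
        Map<String, long[]> snap = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the value is undefined; treat that as 0.
            snap.put(gc.getName(), new long[] {
                Math.max(0, gc.getCollectionCount()),
                Math.max(0, gc.getCollectionTime())
            });
        }
        return snap;
    }

    /** Formats the change between two snapshots of one collector. */
    static String describeDelta(String name, long[] before, long[] after) {
        long count = after[0] - before[0];
        long millis = after[1] - before[1];
        return name + ": " + count + " collections, " + millis + " ms";
    }

    public static void main(String[] args) throws InterruptedException {
        long intervalMillis = 200; // hypothetical sampling interval
        int samples = 5;           // hypothetical number of samples
        Map<String, long[]> previous = snapshot();
        for (int i = 0; i < samples; i++) {
            Thread.sleep(intervalMillis);
            Map<String, long[]> current = snapshot();
            for (Map.Entry<String, long[]> e : current.entrySet()) {
                long[] before = previous.get(e.getKey());
                // Only report collectors that ran during this interval.
                if (before != null && e.getValue()[0] > before[0]) {
                    System.out.println(describeDelta(e.getKey(), before, e.getValue()));
                }
            }
            previous = current;
        }
    }
}
```

The young-generation collector shows up under a collector-specific name (for example `PS Scavenge`, `ParNew`, or `G1 Young Generation`), so its deltas can be read separately from old-generation collections.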

3.4 Network capture and PFinder

Capturing packets across all caller and provider machines is impractical; instead, a single caller‑provider pair was selected for packet capture while monitoring UMP for spikes.

When UMP shows a spike, check PFinder data; if absent, continue capturing.

Successful capture revealed:

Caller sent request at 22:24:50.775730, received response at 22:24:50.988867 (213 ms).

Provider received packet at 22:24:50.775723, processed it by 22:24:50.983, and responded at 22:24:50.988776, totaling ~208 ms processing plus 4.55 ms handling, matching the caller’s observed latency.
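The wire-level durations can be recomputed directly from the captured timestamps. This is a small worked example using `java.time`; the class name `CaptureLatency` is hypothetical, and only the four packet timestamps quoted above are used. The processing and handling breakdown reported for the provider is taken from the source and is not recomputed here.

```java
import java.time.Duration;
import java.time.LocalTime;

final class CaptureLatency {

    /** Elapsed milliseconds between two HH:mm:ss.SSSSSS timestamps on the same day. */
    static double millisBetween(String start, String end) {
        Duration d = Duration.between(LocalTime.parse(start), LocalTime.parse(end));
        return d.toNanos() / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Caller side: request sent -> response received.
        double caller = millisBetween("22:24:50.775730", "22:24:50.988867");
        // Provider side: request packet received -> response packet sent.
        double provider = millisBetween("22:24:50.775723", "22:24:50.988776");
        System.out.printf("caller wire time:   %.3f ms%n", caller);
        System.out.printf("provider wire time: %.3f ms%n", provider);
    }
}
```

Both spans come out at roughly 213 ms, so for this request nearly all of the caller's latency is accounted for on the provider side. Note that the provider's receive timestamp is slightly earlier than the caller's send timestamp, which suggests the two machines' clocks are not perfectly aligned; that is why each duration is computed from timestamps taken on a single machine.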

Root‑cause hypotheses

Container resource bottleneck: unlikely, since CPU and memory were normal and scaling out had no effect.

Young GC pauses adding delay.

Mitigation

Goal

Reduce young GC pause time (no Full GC observed).

Approach

Increase young‑generation heap size.

Scale out (already attempted).

Redirect MQ consumption to other groups to lower object allocation.
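On HotSpot JVMs the young generation is typically sized with options such as `-Xmn` or `-XX:NewRatio`; the article does not say which option or value was used, so none is assumed here. After changing the setting, the resulting heap layout can be confirmed from inside the process. The sketch below uses the standard `MemoryPoolMXBean` API, and the class name `HeapPoolReport` is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

final class HeapPoolReport {

    /** Lists each heap memory pool with its committed and maximum size in MiB. */
    static String report() {
        StringBuilder sb = new StringBuilder();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip metaspace, code cache, and other non-heap pools
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() returns -1 when the maximum is undefined.
            long maxMiB = usage.getMax() < 0 ? -1 : usage.getMax() >> 20;
            sb.append(String.format("%s: committed=%d MiB, max=%d MiB%n",
                    pool.getName(), usage.getCommitted() >> 20, maxMiB));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```

Pool names depend on the collector in use; the eden and survivor pools together make up the young generation, so comparing their sizes before and after the change shows whether the new setting took effect.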

Result

After adjustments, caller and provider latency charts aligned, and the discrepancy was resolved.

