Diagnosing and Optimizing JVM Memory Issues in a Core Service
This article details the identification, analysis, and resolution of JVM memory problems in a core music metadata service, covering GC tuning, large‑object handling, fault‑tolerance strategies, custom Dubbo codec monitoring, and non‑intrusive memory object tracking to improve performance and stability.
The article presents a case study of a core music‑metadata service (referred to below as the core service) that suffered frequent GC pauses, long response times, and RPC timeouts during peak traffic, degrading business functionality.
Log and monitoring analysis revealed that young‑generation GC (YGC) occurred about 12 times per minute (peaking at 24) with an average pause of 327 ms, while full GC (FGC) happened roughly once every ten minutes with a 30 s pause, indicating severe GC pressure. Heap usage spiked sharply (see Figure 2) while CPU usage remained stable, confirming that memory, not CPU, was the bottleneck.
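Pause and frequency figures like these are typically read from JDK 8 GC logs. The article does not list its logging configuration; a typical flag set for collecting this data might look like the following (illustrative, reusing the article's log‑path convention):
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/data/{runuser}/logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M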
Step 1 JVM Optimization
The default JDK 8 collector pair (Parallel Scavenge + Parallel Old) was unsuitable for the core service's workload, which allocates large numbers of short‑lived objects at a high rate. After experiments, the team switched to ParNew + CMS and enlarged the young generation, resulting in the following parameter sets.
Default JVM parameters:
-Xms4096M -Xmx4096M -Xmn1024M -XX:MetaspaceSize=256M -Djava.security.egd=file:/dev/./urandom -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/{runuser}/logs/other
Optimized JVM parameters (4 CPU / 8 GB machine):
-Xms4096M -Xmx4096M -Xmn1536M -XX:MetaspaceSize=256M -XX:+UseConcMarkSweepGC -XX:+CMSScavengeBeforeRemark -Djava.security.egd=file:/dev/./urandom -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/{runuser}/logs/other
After applying the new settings, heap usage dropped noticeably (Figure 3), although Dubbo timeouts still persisted.
Step 2 Fault‑Tolerance Strategy
When the API layer detects an exception from the core service, the offending machine’s IP is reported to a monitoring platform. An alert rule triggers a callback that marks the IP as faulty; the API then removes the IP from Dubbo’s provider list via a custom implementation of AbstractLoadBalance. This automatic exclusion prevents further calls to the problematic node (Figure 4). Prior to this, operators manually restarted machines upon memory alerts.
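The article does not include the load‑balancer source; the sketch below illustrates the idea by extending Dubbo's RandomLoadBalance (which itself extends AbstractLoadBalance). The class name, the static markFaulty/recover entry points, and the fall‑back‑to‑all behavior are illustrative assumptions, not the team's actual code.
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import org.apache.dubbo.common.URL;
import org.apache.dubbo.rpc.Invocation;
import org.apache.dubbo.rpc.Invoker;
import org.apache.dubbo.rpc.cluster.loadbalance.RandomLoadBalance;

// Sketch: skip providers whose IPs the monitoring callback has marked as faulty.
public class FaultAwareLoadBalance extends RandomLoadBalance {
    private static final Set<String> FAULTY_IPS = ConcurrentHashMap.newKeySet();

    // Called by the alert callback when a machine is marked faulty or recovers.
    public static void markFaulty(String ip) { FAULTY_IPS.add(ip); }
    public static void recover(String ip) { FAULTY_IPS.remove(ip); }

    @Override
    protected <T> Invoker<T> doSelect(List<Invoker<T>> invokers, URL url, Invocation invocation) {
        List<Invoker<T>> healthy = invokers.stream()
                .filter(invoker -> !FAULTY_IPS.contains(invoker.getUrl().getHost()))
                .collect(Collectors.toList());
        // If every provider is marked faulty, fall back to the full list rather than fail outright.
        return super.doSelect(healthy.isEmpty() ? invokers : healthy, url, invocation);
    }
}
Like any Dubbo load balancer, such a class would be registered as an SPI extension and selected through the consumer's loadbalance setting.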
Step 3 Large‑Object Optimization
Heap dumps showed a 9 MB response object occupying a large portion of the Netty task queue (258 MB total). Thread dumps identified frequent calls to ExchangeCodec.encodeResponse, indicating that large responses were being serialized and written to the buffer, causing long pauses.
Sample thread dump excerpt:
Thread 5612: (state = IN_JAVA)
- org.apache.dubbo.remoting.exchange.codec.ExchangeCodec.encodeResponse(...)
- org.apache.dubbo.remoting.exchange.codec.ExchangeCodec.encode(...)
- org.apache.dubbo.rpc.protocol.dubbo.DubboCountCodec.encode(...)
- io.netty.handler.codec.MessageToByteEncoder.write(...)
- ...
Heap snapshot analysis (Figure 5) identified the offending interface that returned massive response objects, allowing the team to target and refactor that API.
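The refactoring itself is application‑specific. A common pattern for this class of problem, shown here as an illustrative sketch (SongMetadata and MetadataRepository are hypothetical placeholders, not the article's code), is to page and cap the result set so no single RPC response can grow unbounded:
import java.util.List;

// Hypothetical metadata type; fields elided for brevity.
class SongMetadata { }

// Hypothetical data-access interface supporting offset/limit queries.
interface MetadataRepository {
    List<SongMetadata> findByAlbum(long albumId, int offset, int limit);
}

public class SongQueryService {
    private static final int MAX_PAGE_SIZE = 500; // hard upper bound on any single response

    private final MetadataRepository repository;

    public SongQueryService(MetadataRepository repository) {
        this.repository = repository;
    }

    public List<SongMetadata> listSongs(long albumId, int page, int pageSize) {
        // Clamp the caller-supplied page size so a misbehaving client cannot request a huge payload.
        int capped = Math.min(Math.max(pageSize, 1), MAX_PAGE_SIZE);
        return repository.findByAlbum(albumId, page * capped, capped);
    }
}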
Post‑optimization metrics showed a 76.5 % reduction in total YGC count and a 75.5 % drop in YGC pause time during peaks; FGC occurrences fell to once every three days with a 90.1 % reduction in pause time (Figure 6).
Step 4 Non‑Intrusive Memory Object Monitoring
Dubbo's encodeResponse checks the payload size and throws an ExceedPayloadLimitException for oversized responses, which are then replaced by an empty response. To capture the details of such events without affecting performance, a custom codec implementing Codec2 was introduced: it records the buffer's write position before and after encoding, computes the encoded length, and logs a warning when the length exceeds a configured threshold.
// Custom Dubbo codec example: a completed sketch of the excerpt. Imports, the delegate
// codec field, decode(), and the referenced helper methods are filled in (Dubbo 2.7.x APIs assumed).
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import org.apache.dubbo.remoting.Channel;
import org.apache.dubbo.remoting.Codec2;
import org.apache.dubbo.remoting.buffer.ChannelBuffer;
import org.apache.dubbo.remoting.exchange.Response;
import org.apache.dubbo.rpc.protocol.dubbo.DubboCountCodec;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import cn.hutool.core.util.StrUtil;

public class MusicDubboCountCodec implements Codec2 {
    private static final Logger log = LoggerFactory.getLogger(MusicDubboCountCodec.class);
    // Marker text Dubbo puts in the error message of a response that exceeded the payload limit
    private static final String OVER_PAYLOAD_ERROR_MESSAGE = "Data length too large";
    // Remembers request IDs whose first encoding attempt exceeded the payload limit
    private static final Cache<Long, String> EXCEED_PAYLOAD_LIMIT_CACHE = Caffeine.newBuilder()
            .maximumSize(100)
            .expireAfterWrite(300, TimeUnit.SECONDS)
            .softValues()
            .build();
    private final DubboCountCodec dubboCountCodec = new DubboCountCodec(); // delegate doing the actual wire encoding

    @Override
    public void encode(Channel channel, ChannelBuffer buffer, Object message) throws IOException {
        // The writerIndex delta across the delegated encode is the encoded message size.
        int writeBefore = buffer == null ? 0 : buffer.writerIndex();
        dubboCountCodec.encode(channel, buffer, message);
        checkOverPayload(message);
        int writeAfter = buffer == null ? 0 : buffer.writerIndex();
        warningLengthTooLong(writeAfter - writeBefore, message);
    }

    @Override
    public Object decode(Channel channel, ChannelBuffer buffer) throws IOException {
        return dubboCountCodec.decode(channel, buffer);
    }

    private void checkOverPayload(Object message) {
        if (!(message instanceof Response)) return;
        Response response = (Response) message;
        // Pass 1: the oversized result was replaced by a BAD_RESPONSE; cache its request ID.
        if (Response.BAD_RESPONSE == response.getStatus() &&
                StrUtil.contains(response.getErrorMessage(), OVER_PAYLOAD_ERROR_MESSAGE)) {
            EXCEED_PAYLOAD_LIMIT_CACHE.put(response.getId(), response.getErrorMessage());
            return;
        }
        // Pass 2: the follow-up empty OK response for a cached ID; log the offending call.
        if (Response.OK == response.getStatus() &&
                EXCEED_PAYLOAD_LIMIT_CACHE.getIfPresent(response.getId()) != null) {
            log.warn("Dubbo serialized object exceeds the payload limit, errorMsg is {}, response is {}",
                    EXCEED_PAYLOAD_LIMIT_CACHE.getIfPresent(response.getId()), getResponseMessage(response));
        }
    }

    private void warningLengthTooLong(int length, Object message) {
        if (length > 8 * 1024 * 1024) { // illustrative threshold: warn above 8 MB
            log.warn("Encoded message size {} bytes exceeds threshold, type {}", length, message.getClass().getName());
        }
    }

    private String getResponseMessage(Response response) {
        return String.valueOf(response.getResult());
    }
}

With this monitoring in place, the team quickly identified additional large-object APIs and began targeted optimizations.
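One detail the excerpt leaves out is wiring: Dubbo codecs are SPI extensions, so a custom Codec2 must be registered in an extension file and selected by name in the protocol configuration. A minimal sketch, with an illustrative extension name and package:
# src/main/resources/META-INF/dubbo/org.apache.dubbo.remoting.Codec2
musicDubboCountCodec=com.example.codec.MusicDubboCountCodec

# dubbo.properties (or the equivalent protocol attribute)
dubbo.protocol.codec=musicDubboCountCodec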
Conclusion
Effective JVM memory tuning requires a combination of log analysis, monitoring, heap and thread dump inspection, and careful code changes. Introducing lightweight, non-intrusive object monitoring helps locate problematic payloads without impacting the production system, and continuous refactoring, caching, and scheduled tasks further improve stability and performance.