Diagnosing and Optimizing JVM Memory Issues in a Core Service

This article details the identification, analysis, and resolution of JVM memory problems in a core music metadata service, covering GC tuning, large‑object handling, fault‑tolerance strategies, custom Dubbo codec monitoring, and non‑intrusive memory object tracking to improve performance and stability.

High Availability Architecture

This case study covers a core music-metadata service (referred to below as the core service) that suffered from frequent GC pauses, long response times, and RPC timeouts during peak traffic, degrading business functionality.

Log and monitoring analysis showed that young-generation GC (YGC) ran about 12 times per minute (peaking at 24) with an average pause of 327 ms, while full GC (FGC) ran roughly once every ten minutes with a pause of about 30 s, indicating severe GC pressure. Heap usage spiked sharply (see Figure 2) while CPU usage remained stable, which pointed to memory rather than CPU as the bottleneck.
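To put those figures in perspective, the share of wall-clock time lost to pauses can be estimated with simple arithmetic. The sketch below is illustrative only: the class and method names are hypothetical, and the peak estimate assumes the 327 ms average pause also holds at 24 collections per minute, which the source does not state. Under those assumptions the numbers work out to roughly 6.5 % of time paused for YGC on average, about 13 % at peak, and about 5 % for FGC.

```java
/** Back-of-the-envelope GC pause overhead, using the figures quoted above. */
final class GcPauseOverhead {

    /** Fraction of wall-clock time spent paused, given collections per minute and mean pause in ms. */
    static double pauseFraction(double collectionsPerMinute, double avgPauseMillis) {
        return collectionsPerMinute * avgPauseMillis / 60_000.0;
    }

    public static void main(String[] args) {
        // YGC: about 12 per minute on average, 24 at peak, 327 ms average pause.
        System.out.printf("YGC average: %.1f%%%n", 100 * pauseFraction(12, 327));
        System.out.printf("YGC peak:    %.1f%%%n", 100 * pauseFraction(24, 327));
        // FGC: about one every ten minutes (0.1 per minute), 30 s pause.
        System.out.printf("FGC:         %.1f%%%n", 100 * pauseFraction(0.1, 30_000));
    }
}
```

A single 30 s FGC pause is also far longer than a typical RPC timeout, which is consistent with the timeouts observed at peak.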

Step 1 JVM Optimization

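The default JDK 8 collectors (Parallel Scavenge + Parallel Old) were a poor fit for the core service's workload, which allocates large numbers of short-lived objects. After experiments, the team switched to ParNew + CMS and enlarged the young generation from 1024 MB to 1536 MB (-Xmn), giving the parameter sets below. On JDK 8, -XX:+UseConcMarkSweepGC selects ParNew for the young generation by default, so no separate flag is needed.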

Original JVM parameters (before tuning):

-Xms4096M -Xmx4096M -Xmn1024M -XX:MetaspaceSize=256M -Djava.security.egd=file:/dev/./urandom -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/{runuser}/logs/other

Optimized JVM parameters (4 CPU / 8 GB machine):

-Xms4096M -Xmx4096M -Xmn1536M -XX:MetaspaceSize=256M -XX:+UseConcMarkSweepGC -XX:+CMSScavengeBeforeRemark -Djava.security.egd=file:/dev/./urandom -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/{runuser}/logs/other

After applying the new settings, heap usage dropped noticeably (Figure 3), although Dubbo timeouts still persisted.
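When changing collectors, it can help to confirm at startup that the running JVM actually picked up the intended flags. The following is a minimal sketch using the standard java.lang.management API; the class name is hypothetical. On a JDK 8 HotSpot JVM started with the optimized parameters above, the collector names should be reported as ParNew and ConcurrentMarkSweep, though the exact names are JVM-specific.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Reports which collectors and JVM flags the current process is running with. */
final class GcConfigCheck {

    /** Names of the garbage collectors registered by the running JVM. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    /** JVM arguments passed on the command line, such as -Xmn1536M. */
    static List<String> jvmFlags() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    public static void main(String[] args) {
        System.out.println("collectors: " + collectorNames());
        System.out.println("flags:      " + jvmFlags());
        // Cumulative counts and times since JVM start; sampling these periodically
        // gives per-minute YGC/FGC rates like the ones quoted in this article.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, totalTimeMs=%d%n",
                gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```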

Step 2 Fault‑Tolerance Strategy

When the API layer detects an exception from the core service, the offending machine’s IP is reported to a monitoring platform. An alert rule triggers a callback that marks the IP as faulty; the API then removes the IP from Dubbo’s provider list via a custom implementation of AbstractLoadBalance. This automatic exclusion prevents further calls to the problematic node (Figure 4). Prior to this, operators manually restarted machines upon memory alerts.
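The exclusion logic can be sketched independently of Dubbo. The class below is a simplified stand-in, not the team's implementation: providers are modeled as plain IP strings, and all names are hypothetical. In a real deployment the same filtering would presumably sit inside the doSelect method of an AbstractLoadBalance subclass, operating on Dubbo Invoker objects, with markFaulty driven by the monitoring platform's callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

/** Keeps a set of faulty provider IPs and excludes them when picking a provider. */
final class FaultyNodeFilter {
    private final Set<String> faultyIps = ConcurrentHashMap.newKeySet();

    /** Called when an alert callback reports a problematic node. */
    void markFaulty(String ip) { faultyIps.add(ip); }

    /** Called once the node has recovered. */
    void markHealthy(String ip) { faultyIps.remove(ip); }

    /** Providers not currently marked faulty, in their original order. */
    List<String> healthy(List<String> providerIps) {
        List<String> result = new ArrayList<>();
        for (String ip : providerIps) {
            if (!faultyIps.contains(ip)) {
                result.add(ip);
            }
        }
        // If every provider is marked faulty, fall back to the full list
        // rather than failing all calls outright (a design assumption).
        return result.isEmpty() ? providerIps : result;
    }

    /** Picks a random provider among the healthy candidates. */
    String select(List<String> providerIps) {
        if (providerIps.isEmpty()) {
            throw new IllegalArgumentException("no providers available");
        }
        List<String> candidates = healthy(providerIps);
        return candidates.get(ThreadLocalRandom.current().nextInt(candidates.size()));
    }
}
```

The fallback to the full list is a deliberate choice in this sketch: a false alarm that flags every node should degrade to normal load balancing instead of an outage.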

Step 3 Large‑Object Optimization

Heap dumps showed a 9 MB response object occupying a large portion of the Netty task queue (258 MB total). Thread dumps identified frequent calls to ExchangeCodec.encodeResponse, indicating that large responses were being serialized and written to the buffer, causing long pauses.

Sample thread dump excerpt:

Thread 5612: (state = IN_JAVA)
 - org.apache.dubbo.remoting.exchange.codec.ExchangeCodec.encodeResponse(...)
 - org.apache.dubbo.remoting.exchange.codec.ExchangeCodec.encode(...)
 - org.apache.dubbo.rpc.protocol.dubbo.DubboCountCodec.encode(...)
 - io.netty.handler.codec.MessageToByteEncoder.write(...)
 - ...

Heap snapshot analysis (Figure 5) identified the offending interface that returned massive response objects, allowing the team to target and refactor that API.
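The source does not describe how the offending API was refactored. One common way to keep any single response small is to bound the number of items handled per call, so that callers fetch results in batches. The helper below is a generic sketch of that idea under this assumption; the class name and batch size are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits a large request into bounded batches so no single response grows too big. */
final class Batches {

    /** Partitions items into consecutive sublists of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int from = 0; from < items.size(); from += batchSize) {
            int to = Math.min(from + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(from, to)));
        }
        return batches;
    }
}
```

A caller would then issue one RPC per batch and merge the results, trading a few extra round trips for responses that stay well under the payload limit.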

Post‑optimization metrics showed a 76.5 % reduction in total YGC count and a 75.5 % drop in YGC pause time during peaks; FGC occurrences fell to once every three days with a 90.1 % reduction in pause time (Figure 6).

Step 4 Non‑Intrusive Memory Object Monitoring

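Dubbo's encodeResponse checks the payload size and raises ExceedPayloadLimitException for oversized responses; the consumer then receives an error response (status BAD_RESPONSE) carrying the exception message instead of the result. To capture details of such events without affecting performance, the team introduced a custom codec that implements Codec2 and delegates to DubboCountCodec. It records the buffer's writer index before and after encoding, computes the encoded length from the difference, and logs a warning when the length exceeds a configured threshold. It also correlates each payload error with the original response by response ID so that the offending result can be logged.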

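// Custom Dubbo codec example. Assumptions not in the original excerpt: Dubbo 2.7.x
// (no-arg DubboCountCodec), SLF4J, Hutool StrUtil, the threshold and error-message
// constants, and the bodies of decode, warningLengthTooLong and getResponseMessage.
import cn.hutool.core.util.StrUtil;
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.io.IOException;
import java.util.concurrent.TimeUnit;
import org.apache.dubbo.remoting.Channel;
import org.apache.dubbo.remoting.Codec2;
import org.apache.dubbo.remoting.buffer.ChannelBuffer;
import org.apache.dubbo.remoting.exchange.Response;
import org.apache.dubbo.rpc.protocol.dubbo.DubboCountCodec;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MusicDubboCountCodec implements Codec2 {
    private static final Logger log = LoggerFactory.getLogger(MusicDubboCountCodec.class);
    private static final String OVER_PAYLOAD_ERROR_MESSAGE = "Data length too large"; // assumed
    private static final int WARN_LENGTH_BYTES = 1024 * 1024; // assumed threshold
    // Response id -> payload error message, kept briefly to correlate with the original response.
    private static final Cache<Long, String> EXCEED_PAYLOAD_LIMIT_CACHE = Caffeine.newBuilder()
        .maximumSize(100)
        .expireAfterWrite(300, TimeUnit.SECONDS)
        .softValues()
        .build();
    private final DubboCountCodec dubboCountCodec = new DubboCountCodec();

    @Override
    public void encode(Channel channel, ChannelBuffer buffer, Object message) throws IOException {
        int writeBefore = buffer == null ? 0 : buffer.writerIndex();
        dubboCountCodec.encode(channel, buffer, message);
        checkOverPayload(message);
        int writeAfter = buffer == null ? 0 : buffer.writerIndex();
        warningLengthTooLong(writeAfter - writeBefore, message);
    }

    @Override
    public Object decode(Channel channel, ChannelBuffer buffer) throws IOException {
        return dubboCountCodec.decode(channel, buffer); // decoding is delegated unchanged
    }

    private void checkOverPayload(Object message) {
        if (!(message instanceof Response)) return;
        Response response = (Response) message;
        if (Response.BAD_RESPONSE == response.getStatus() &&
            StrUtil.contains(response.getErrorMessage(), OVER_PAYLOAD_ERROR_MESSAGE)) {
            EXCEED_PAYLOAD_LIMIT_CACHE.put(response.getId(), response.getErrorMessage());
            return;
        }
        String errorMessage = EXCEED_PAYLOAD_LIMIT_CACHE.getIfPresent(response.getId());
        if (Response.OK == response.getStatus() && errorMessage != null) {
            log.warn("Dubbo serialized object size exceeds payload, errorMsg is {}, response is {}",
                errorMessage, getResponseMessage(response));
        }
    }

    private void warningLengthTooLong(int length, Object message) {
        if (length > WARN_LENGTH_BYTES) {
            log.warn("Dubbo encoded length {} bytes exceeds threshold, message is {}", length, message);
        }
    }

    private String getResponseMessage(Response response) {
        return String.valueOf(response.getResult());
    }
}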
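
The length measurement itself does not depend on Dubbo. The sketch below shows the same before/after pattern in isolation, using java.nio.ByteBuffer as a stand-in for Dubbo's ChannelBuffer (position() plays the role of writerIndex()). The class name, threshold, and encoder callback are hypothetical, chosen only to keep the example runnable without third-party libraries.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.function.BiConsumer;

/** Measures how many bytes an encoder writes and flags unusually large messages. */
final class EncodedLengthMonitor {
    private final int warnThresholdBytes;
    private int warnings;

    EncodedLengthMonitor(int warnThresholdBytes) {
        this.warnThresholdBytes = warnThresholdBytes;
    }

    /** Runs the encoder and returns the number of bytes it appended to the buffer. */
    <T> int encodeAndMeasure(ByteBuffer buffer, T message, BiConsumer<ByteBuffer, T> encoder) {
        int before = buffer.position();
        encoder.accept(buffer, message);
        int length = buffer.position() - before;
        if (length > warnThresholdBytes) {
            warnings++;
            System.err.printf("encoded length %d bytes exceeds threshold %d%n",
                length, warnThresholdBytes);
        }
        return length;
    }

    /** Number of messages that exceeded the threshold so far. */
    int warnings() { return warnings; }

    public static void main(String[] args) {
        EncodedLengthMonitor monitor = new EncodedLengthMonitor(8);
        ByteBuffer buffer = ByteBuffer.allocate(64);
        BiConsumer<ByteBuffer, String> utf8 =
            (buf, s) -> buf.put(s.getBytes(StandardCharsets.UTF_8));
        System.out.println(monitor.encodeAndMeasure(buffer, "hi", utf8));
        System.out.println(monitor.encodeAndMeasure(buffer, "0123456789", utf8));
    }
}
```

Because the check only compares two integers after encoding has already happened, it adds negligible cost on the hot path, which is what makes this style of monitoring non-intrusive.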

With this monitoring in place, the team quickly identified additional large‑object APIs and began targeted optimizations.

Conclusion

Effective JVM memory tuning requires a combination of log analysis, monitoring, heap/stack inspection, and careful code changes. Introducing lightweight, non‑intrusive object monitoring helps locate problematic payloads without impacting the production system, and continuous refactoring, caching, and scheduled tasks further improve stability and performance.
