How Switching from CMS to G1 Boosted Java Service Stability and Cut Costs

The article details how a Java team diagnosed frequent Full GC pauses caused by CMS, migrated to G1GC, adjusted JVM flags, scaled container specs, and achieved dramatically lower pause times, reduced instance counts, and significant cost savings while improving overall service stability.

Youzan Coder

Nov 28, 2022

Background and Motivation

Since early 2022 the team has been optimizing business‑critical Java services for stability and lower machine cost. After exhausting code‑level improvements, they turned to JVM tuning. All services run on JDK 1.8.0_201, which supports several garbage collectors; most still use the legacy ParNew + CMS configuration.

Stability Issue Analysis

Two recurring alerts were observed:

Dubbo thread‑pool saturation alerts.

Upstream call timeout logs.

Investigation showed that traffic was stable while provider P99 and average response times spiked, so the providers themselves were responding slowly. JFR (Java Flight Recorder) traces revealed occasional blocking points, but the main culprit was JVM Full GC pauses that coincided with the alerts. Monitoring data showed multiple Full GC events within a single minute, each lasting several seconds. These stop‑the‑world (STW) pauses froze request processing, which filled the thread pools and drove up latency.
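To make the correlation between alerts and GC activity concrete, the following is a minimal sketch of reading per-collector counters from inside the JVM with the standard `java.lang.management` API. The class name `GcSnapshot` is hypothetical and not from the article; the team's actual monitoring stack is not described. Collector names depend on the configured GC, so the sketch does not hard-code them.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: snapshots cumulative GC counts and times per collector. */
final class GcSnapshot {

    /** Returns collector name -> {collection count, accumulated collection time in ms}. */
    static Map<String, long[]> read() {
        Map<String, long[]> stats = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both values are cumulative since JVM start; -1 means the collector does not report them.
            stats.put(gc.getName(), new long[] {gc.getCollectionCount(), gc.getCollectionTime()});
        }
        return stats;
    }

    public static void main(String[] args) {
        read().forEach((name, s) ->
            System.out.printf("%s: count=%d, totalTimeMs=%d%n", name, s[0], s[1]));
    }
}
```

Sampling `read()` at a fixed interval and diffing consecutive snapshots gives a per-interval collection count and time, which can then be lined up against thread-pool alerts.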

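With CMS, large objects were allocated directly into the old generation. Because CMS reclaims the old generation without compacting it, that space fragments over time, and when no contiguous block was large enough for an allocation or promotion, the JVM fell back to a stop‑the‑world Full GC. This happened frequently. The root cause was therefore GC‑induced pauses rather than an application‑level bottleneck.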

Adopting G1GC to Resolve the Issues

G1GC divides the heap into equal‑sized regions and reclaims space by copying live objects out of the regions it collects, which compacts the heap as a side effect and largely avoids the “no contiguous space” problem for large objects. Objects larger than half a region are placed in dedicated Humongous regions, and once Global Concurrent Marking finds them unreachable they can be freed without waiting for a Full GC, dramatically reducing STW duration.

After removing the old CMS flags and adding the following G1 options, the team observed immediate improvement:

-XX:+UseG1GC -XX:+UnlockExperimentalVMOptions -XX:G1NewSizePercent=20

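G1NewSizePercent is an experimental option, which is why -XX:+UnlockExperimentalVMOptions precedes it; the setting raises the minimum young generation size from the default 5% of the heap to 20%. Other G1 defaults (e.g., -XX:MaxGCPauseMillis=200) were left unchanged. Post‑migration monitoring showed that Full GC events disappeared entirely.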
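After changing startup flags it is worth confirming that the running JVM actually picked them up. The sketch below queries flag values through `com.sun.management.HotSpotDiagnosticMXBean`, which is available on HotSpot-based JDKs. The class name `GcFlagCheck` and the `"n/a"` fallback are assumptions made for illustration, not part of the article.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: reads HotSpot flag values from the running JVM. */
final class GcFlagCheck {

    /** Returns the current value of a HotSpot flag, or "n/a" if it cannot be read. */
    static String flag(String name) {
        try {
            HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return "n/a"; // not a HotSpot JVM
            }
            return bean.getVMOption(name).getValue();
        } catch (IllegalArgumentException unknownOrUnavailable) {
            // Thrown when the flag does not exist in this JVM build.
            return "n/a";
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UseG1GC", "G1NewSizePercent", "MaxGCPauseMillis"}) {
            System.out.println(name + " = " + flag(name));
        }
    }
}
```

Running this inside a service started with the flags above should report `UseG1GC = true`; a value of `false` would suggest that a leftover CMS flag or a launcher script is overriding the new settings.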

Limitations of G1GC

G1 performs best with sufficiently large heap sizes (e.g., 4 GB on 4c8g containers). On smaller instances (1c2g, 2c4g) its throughput and reclamation efficiency drop because:

Remembered‑set (RSet) bookkeeping for cross‑region references consumes extra memory, reducing usable heap.

Small heaps get small regions (the minimum is 1 MB), and any object larger than half a region is treated as Humongous. Humongous allocations therefore become frequent, leading to Mixed GC storms and potential Full GCs.

Additional CPU overhead from write barriers, SATB marking, and concurrent refinement threads further stresses limited resources.
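The relationship between heap size, region size, and the Humongous threshold can be sketched numerically. The code below approximates HotSpot's default sizing heuristic when `-XX:G1HeapRegionSize` is not set: aim for about 2048 regions, round down to a power of two, and clamp to the range 1 MB to 32 MB. The class name `G1RegionMath` and the heap sizes in `main` are illustrative assumptions; only the 4 GB heap on 4c8g comes from the article.

```java
/** Hypothetical sketch of G1's default region sizing; an approximation, not the JVM's code. */
final class G1RegionMath {
    static final long MB = 1024L * 1024L;
    static final long MIN_REGION = 1 * MB;
    static final long MAX_REGION = 32 * MB;
    static final long TARGET_REGIONS = 2048;

    /** Approximates the default region size in bytes for the given -Xms and -Xmx (bytes). */
    static long regionSize(long initialHeap, long maxHeap) {
        long average = (initialHeap + maxHeap) / 2;
        long size = Math.max(average / TARGET_REGIONS, MIN_REGION);
        size = Long.highestOneBit(size); // round down to a power of two
        return Math.min(Math.max(size, MIN_REGION), MAX_REGION);
    }

    /** Objects larger than roughly this many bytes are allocated as Humongous. */
    static long humongousThreshold(long regionSize) {
        return regionSize / 2;
    }

    public static void main(String[] args) {
        long[] heapsInGb = {1, 2, 4, 8}; // illustrative heap sizes with -Xms equal to -Xmx
        for (long gb : heapsInGb) {
            long heap = gb * 1024 * MB;
            long region = regionSize(heap, heap);
            System.out.printf("heap=%d GB region=%d KB humongous above ~%d KB%n",
                gb, region / 1024, humongousThreshold(region) / 1024);
        }
    }
}
```

Under this approximation a 2 GB heap gets 1 MB regions, so buffers of about 512 KB already count as Humongous, while a 4 GB heap doubles that threshold to about 1 MB.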

Scaling G1 Benefits with Larger Instances

When heap size is ample, G1’s pause‑time predictability shines. The team upgraded core services from 4c8g (4 GB heap) to 8c16g, halving the number of container instances while maintaining total capacity. This reduced CPU pressure and allowed further instance consolidation.

Practical Migration Steps

Audit dependent middleware (Dubbo, HTTP, MQ, DB, KV, rate‑limiters) and increase thread pool sizes proportionally to the reduced instance count.

Review custom business thread pools and double their size if they were near saturation.

Upgrade container specs to 8c16g.

Perform QA and pre‑release validation.

Deploy to production and monitor RT, CPU, memory, and GC (target: Young GC pauses under 200 ms). If the metrics are stable, gradually reduce the instance count to half.
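The proportional pool resizing in the first two steps can be sketched as simple arithmetic: keep the fleet-wide thread total constant while the instance count shrinks. The class `PoolScaling`, the method name, and the pool size of 200 used in `main` are hypothetical; the article does not give the team's actual pool sizes.

```java
/** Hypothetical helper: resizes a per-instance pool so the fleet-wide thread total is preserved. */
final class PoolScaling {

    /** Returns the per-instance pool size needed after consolidating instances (rounded up). */
    static int scaledPoolSize(int oldPoolSize, int oldInstances, int newInstances) {
        if (oldPoolSize <= 0 || oldInstances <= 0 || newInstances <= 0) {
            throw new IllegalArgumentException("all arguments must be positive");
        }
        long totalThreads = (long) oldPoolSize * oldInstances;
        // Ceiling division so the new fleet never has fewer threads than the old one.
        return (int) ((totalThreads + newInstances - 1) / newInstances);
    }

    public static void main(String[] args) {
        // Assumed example: a pool of 200 threads per instance, 180 instances cut to 90.
        System.out.println(scaledPoolSize(200, 180, 90));
    }
}
```

This only preserves raw thread counts. Whether a doubled pool is appropriate on an 8c16g container still depends on downstream limits such as database connections, so it should be validated in the QA step.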

Results

After the pilot:

Application A: 180 × 4c8g → 80 × 8c16g

Application B: 250 × 4c8g → 110 × 8c16g

Application C: 170 × 4c8g → 80 × 8c16g

Cost dropped sharply while stability improved: Young GC frequency fell to roughly one third of its previous level, Full GC vanished, maximum response time halved, and latency spikes were eliminated. The team also noted lower CPU utilization due to the reduced container count.
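The instance figures above can be converted into provisioned cores to see how much raw capacity changed. This sketch assumes that 4c8g and 8c16g mean 4 and 8 CPU cores per container; the class name `FleetCores` is hypothetical, and the article gives no pricing, so the numbers describe cores only, not money.

```java
/** Hypothetical helper: compares provisioned cores before and after consolidation. */
final class FleetCores {

    static int cores(int instances, int coresPerInstance) {
        return instances * coresPerInstance;
    }

    /** Percentage by which 'after' is smaller than 'before'. */
    static double reductionPercent(int before, int after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        String[] apps = {"A", "B", "C"};
        int[] instancesBefore = {180, 250, 170}; // 4c8g containers, from the results above
        int[] instancesAfter = {80, 110, 80};    // 8c16g containers, from the results above
        int totalBefore = 0;
        int totalAfter = 0;
        for (int i = 0; i < apps.length; i++) {
            int before = cores(instancesBefore[i], 4);
            int after = cores(instancesAfter[i], 8);
            totalBefore += before;
            totalAfter += after;
            System.out.printf("Application %s: %d -> %d cores (%.1f%% fewer)%n",
                apps[i], before, after, reductionPercent(before, after));
        }
        System.out.printf("Total: %d -> %d cores (%.1f%% fewer)%n",
            totalBefore, totalAfter, reductionPercent(totalBefore, totalAfter));
    }
}
```

Under these assumptions the three applications go from 600 to 270 instances and from 2,400 to 2,160 provisioned cores, so the instance count falls much further than the core count.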

Future Outlook

Not all services can adopt G1; small‑heap workloads will likely stay on CMS. The team plans to evaluate newer OpenJDK releases for additional G1 optimizations and consider further upgrades where hardware permits.
