How We Cut Full GC Frequency from 40×/Day to Once Every 10 Days
Over a month of JVM tuning, the author reduced Full GC from more than 40 times per day to once every ten days and halved Young GC duration by adjusting heap sizes, fixing memory leaks, and tuning metaspace, ultimately improving server throughput and stability.
After more than a month of effort, Full GC dropped from about 40 times per day to roughly once every ten days, and Young GC time was cut by more than half. The optimization process is recorded in detail below.
Initially, the production servers (2 CPU / 4 GB RAM, four‑node cluster) suffered frequent Full GC (≈40 times daily) and occasional automatic restarts, indicating severe JVM memory pressure.
The original JVM startup parameters were:
-Xms1000M
-Xmx1800M -Xmn350M -Xss300K
-XX:+DisableExplicitGC
-XX:SurvivorRatio=4 -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70
-XX:+CMSParallelRemarkEnabled
-XX:LargePageSizeInBytes=128M
-XX:+UseFastAccessorMethods
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
Key meanings:
-Xmx1800M sets the maximum heap size.
-Xms1000M sets the initial heap size; matching it to -Xmx avoids repeated heap resizing after GC.
-Xmn350M sets the young generation to 350 MB. Sun's recommendation is about 3/8 of the whole heap, which would be roughly 675 MB for an 1800 MB heap, so 350 MB is well below it.
-Xss300K sets each thread stack size.
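Before changing any of these flags, it can help to confirm what a running JVM was actually started with. The following is a minimal sketch using the standard java.lang.management API; the class name JvmFlagCheck is only illustrative and is not from the original article.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class JvmFlagCheck {
    /** Returns the arguments passed to the JVM at startup, e.g. -Xmx1800M. */
    static List<String> startupFlags() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    /** Approximate maximum heap in MB, as reported by the runtime. */
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        for (String flag : startupFlags()) {
            System.out.println(flag);
        }
        System.out.println("max heap (MB): " + maxHeapMb());
    }
}
```

The value reported by maxMemory() is usually a little below -Xmx, so treat it as a sanity check rather than an exact readout.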
First Optimization
Observing that the young generation was too small, the configuration was changed to increase it and align initial and maximum heap sizes:
-Xmn350M → -Xmn800M
-XX:SurvivorRatio=4 → -XX:SurvivorRatio=8
-Xms1000M → -Xms1800M
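To see what the -Xmn and -XX:SurvivorRatio values imply for eden and survivor space, the arithmetic can be sketched as below. It assumes the usual HotSpot layout, in which the young generation is one eden plus two survivor spaces and SurvivorRatio is the eden-to-survivor ratio. The class name is illustrative, and the real JVM rounds sizes to its own alignment.

```java
final class YoungGenSizing {
    /** Size of ONE survivor space in MB: young / (ratio + 2). */
    static double survivorMb(double youngMb, int survivorRatio) {
        return youngMb / (survivorRatio + 2);
    }

    /** Eden size in MB: what remains after the two survivor spaces. */
    static double edenMb(double youngMb, int survivorRatio) {
        return youngMb - 2 * survivorMb(youngMb, survivorRatio);
    }

    public static void main(String[] args) {
        // Original settings: -Xmn350M -XX:SurvivorRatio=4
        System.out.println("before: eden=" + Math.round(edenMb(350, 4))
                + " MB, survivor=" + Math.round(survivorMb(350, 4)) + " MB");
        // First optimization: -Xmn800M -XX:SurvivorRatio=8
        System.out.println("after:  eden=" + Math.round(edenMb(800, 8))
                + " MB, survivor=" + Math.round(survivorMb(800, 8)) + " MB");
    }
}
```

Under these assumptions eden grows from about 233 MB to 640 MB and each survivor space from about 58 MB to 80 MB. With the maximum heap fixed at 1800 MB, the old generation shrinks correspondingly.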
After running the new settings on two production nodes (prod, prod2) for five days, the Young GC count dropped by more than half and the time spent in Young GC fell by 400 s, but the Full GC count rose by 41, which was a regression.
The first attempt was deemed a failure.
Second Optimization – Memory Leak Investigation
A manager discovered an object T with over ten thousand instances occupying ~20 MB. The cause was an anonymous inner‑class listener retaining the object:
public void doSmthing(T t) {
    redis.addListener(new Listener() {
        public void onTimeout() {
            if (t.success()) {
                // execute operation
            }
        }
    });
}
The listener was never released after timeout, preventing T from being garbage-collected.
After fixing the listener leak, GC behavior improved slightly but the server still restarted unexpectedly.
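The article does not show how the listener leak was fixed. One common approach is to have the listener unregister itself once it has fired, sketched below. ListenerRegistry is a hypothetical stand-in for the redis client and Task for T; removeListener is an assumed method, not a known API of any real client.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

final class ListenerLeakFix {
    interface Listener { void onTimeout(); }

    /** Hypothetical stand-in for the article's T. */
    interface Task { boolean success(); }

    /** Hypothetical stand-in for the article's redis client. */
    static final class ListenerRegistry {
        // CopyOnWriteArrayList lets a listener remove itself while being iterated.
        private final List<Listener> listeners = new CopyOnWriteArrayList<>();
        void addListener(Listener l) { listeners.add(l); }
        void removeListener(Listener l) { listeners.remove(l); }
        int size() { return listeners.size(); }
        void fireTimeout() {
            for (Listener l : listeners) {
                l.onTimeout();
            }
        }
    }

    static void doSomething(ListenerRegistry redis, Task t) {
        redis.addListener(new Listener() {
            @Override
            public void onTimeout() {
                try {
                    if (t.success()) {
                        // execute operation
                    }
                } finally {
                    // Unregister so the registry no longer holds this listener (and t).
                    redis.removeListener(this);
                }
            }
        });
    }
}
```

Once the timeout has fired, the registry holds no reference to the listener, so the captured Task becomes eligible for collection.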
Memory Leak Deep Dive
Further heap dumps revealed tens of thousands of ByteArrayRow objects, indicating massive database query/insert activity. Traffic monitoring showed inbound traffic suddenly spiking to 83 MB/s with no matching increase in business load. The spike was later traced to a query that was missing its module condition and therefore fetched more than 400,000 rows.
Correcting the query eliminated the leak; after redeploying with the original parameters, Full GC occurred only five times over three days.
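One way to keep this kind of query from slipping through again is to reject a missing filter before the SQL is built and to cap the result size. The sketch below is only an illustration: the table name, column names, and row cap are all hypothetical, and the article does not describe its actual query or fix.

```java
final class QueryGuard {
    /** Hypothetical upper bound on rows returned by a single query. */
    static final int MAX_ROWS = 1000;

    /**
     * Builds a parameterized query and refuses to run without a module filter.
     * The table and column names are placeholders for illustration.
     */
    static String buildQuery(String module) {
        if (module == null || module.trim().isEmpty()) {
            throw new IllegalArgumentException("module filter is required");
        }
        return "SELECT id, payload FROM records WHERE module = ? LIMIT " + MAX_ROWS;
    }
}
```

Failing fast on a missing condition turns a silent 400,000-row fetch into an immediate, visible error.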
Third Optimization – Metaspace and Heap Tuning
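With the leak resolved, further tuning focused on metaspace. Its usage had grown to about 200 MB, far above the default initial threshold of about 21 MB, and crossing that threshold as metaspace grew triggered Full GC. The following changes were applied to prod1 and prod2 (prod3 and prod4 were left unchanged):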
-Xmn350M → -Xmn800M
-Xms1000M → -Xms1800M
-XX:MetaspaceSize=200M
-XX:CMSInitiatingOccupancyFraction=75
and
-Xmn350M → -Xmn600M
-Xms1000M → -Xms1800M
-XX:MetaspaceSize=200M
-XX:CMSInitiatingOccupancyFraction=75
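Current metaspace usage can be read from inside the JVM to decide on a value for -XX:MetaspaceSize. This is a minimal sketch using the standard memory pool MXBeans. It assumes a HotSpot JVM on Java 8 or later, where the pool is named "Metaspace"; other JVMs may not expose such a pool, in which case the result is empty.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.OptionalLong;

final class MetaspaceUsage {
    /** Returns bytes currently used by the Metaspace pool, if the JVM exposes one. */
    static OptionalLong usedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) { // HotSpot pool name (assumption)
                MemoryUsage usage = pool.getUsage();  // may be null if the pool is invalid
                if (usage != null) {
                    return OptionalLong.of(usage.getUsed());
                }
            }
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        OptionalLong used = usedBytes();
        if (used.isPresent()) {
            System.out.println("metaspace used (MB): " + used.getAsLong() / (1024 * 1024));
        } else {
            System.out.println("no Metaspace pool reported by this JVM");
        }
    }
}
```

If steady-state usage sits near 200 MB, as it did here, setting the initial threshold at that level avoids the collections that would otherwise be triggered while metaspace grows toward it.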
After ten days of observation, prod1 and prod2 showed dramatically lower Full GC counts and Young GC frequencies than prod3 and prod4, and prod1 achieved the highest throughput, measured by the number of threads started.
Prod4, which kept the original settings, exhibited far higher Full GC and Young GC rates.
Overall, the optimization succeeded: Full GC frequency and duration were cut by more than half, and prod1’s configuration delivered the best throughput.
Summary
Full GC occurring more than once per day is abnormal.
When Full GC spikes, prioritize investigating memory leaks.
After fixing leaks, JVM tuning opportunities become limited; avoid excessive time investment.
If CPU usage stays high even after the code has been checked, ask the operations team or cloud provider; in this case a server-side issue was causing 100% CPU.
High inbound traffic may stem from database queries; verify query conditions.
Regularly monitor GC to detect problems early.
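For the monitoring point, GC counts and accumulated GC time are available through the standard GarbageCollectorMXBean API, which a scheduled job could sample and compare over time. The sketch below is illustrative only. The class name is made up, and the collector names it prints depend on the JVM and the collector flags in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcStats {
    /** Sums collection counts across all collectors, skipping undefined (-1) values. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionTime() is the approximate accumulated time in milliseconds.
            System.out.println(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.println("total collections: " + totalCollections());
    }
}
```

Sampling these counters at a fixed interval and alerting when the old-generation collector's count rises faster than expected would surface a regression like the one in the first optimization early.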
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
