ZGC: Principles, Tuning Practices, and Production Upgrade Experience
The article explains how Meituan’s risk‑control platform eliminated frequent 40 ms CMS pauses by adopting ZGC on JDK 11. It covers ZGC’s concurrent mark‑copy design, practical tuning parameters, real‑world case fixes, and measured latency reductions of up to 74 %, and it notes the trade‑offs.
Many low‑latency, high‑availability Java services suffer from GC pauses, which affect system availability. ZGC, introduced in JDK 11, is a next‑generation low‑pause garbage collector designed for large‑heap, low‑latency scenarios.
The article discusses the pain points of GC, the principles of ZGC, practical tuning, and the results of upgrading to ZGC in Meituan’s risk‑control platform.
GC Pain
A Stop‑The‑World (STW) GC pause halts all application threads. In Meituan’s risk‑control service running CMS, Young GC pauses of ~40 ms occurred about 10 times per minute, which increased response latency and reduced availability.
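To put the figures above in perspective, the following is a minimal sketch that converts pause length and frequency into paused time per minute, and then lists the collectors the running JVM reports through the standard java.lang.management API. The class and method names are hypothetical. Note that the fraction of wall time understates the impact on tail latency, because any request in flight during a pause can be delayed by up to the full pause, and that for collectors doing part of their work concurrently the reported collection time is not necessarily pure pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcPauseBudget {
    // Fraction of wall-clock time spent paused, given pause length and frequency.
    static double pausedFraction(double pauseMillis, double pausesPerMinute) {
        return (pauseMillis * pausesPerMinute) / 60_000.0;
    }

    public static void main(String[] args) {
        // Figures from the text: ~40 ms per Young GC, about 10 times per minute.
        double fraction = pausedFraction(40.0, 10.0);
        System.out.printf("paused %.0f ms per minute (%.2f%% of wall time)%n",
                fraction * 60_000.0, fraction * 100.0);

        // Cumulative counters reported by whichever collectors this JVM uses.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```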
ZGC Principles
ZGC uses a mostly concurrent mark‑copy algorithm. Only three phases are STW: initial mark, final mark, and initial relocate. Their duration depends mainly on the number of GC roots rather than on heap size or the number of live objects, which is how ZGC keeps pauses under 10 ms regardless of heap size.
Key techniques:
Colored pointers store GC metadata (mark and remap state) in the high bits of the object pointer.
Load barriers update references on the fly during concurrent relocation.
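Address space layout: 0‑4 TB corresponds to the Java heap, 4‑8 TB is the M0 space, 8‑12 TB is the M1 space, 12‑16 TB is reserved and unused, and 16‑20 TB is the Remapped space. Every object has a virtual address in each of the three spaces (M0, M1, Remapped), but only one of them is active at a time.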
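The relationship between the colored pointer and the three address ranges can be sketched with plain bit arithmetic. This is a simplified model, not HotSpot code: the bit positions are inferred from the ranges listed above (4 TB = 2^42, 8 TB = 2^43, 16 TB = 2^44), the class and method names are hypothetical, and the "good color" check only approximates what a real load barrier tests.

```java
final class ColoredPointerSketch {
    static final long TB = 1L << 40;
    static final long OFFSET_MASK = (4 * TB) - 1; // low 42 bits: offset within the 0-4 TB heap range
    static final long MARKED0 = 4 * TB;           // bit 42, base of the M0 range
    static final long MARKED1 = 8 * TB;           // bit 43, base of the M1 range
    static final long REMAPPED = 16 * TB;         // bit 44, base of the Remapped range

    // Builds the address of a heap offset as seen through one of the three ranges.
    static long colored(long heapOffset, long colorBit) {
        return (heapOffset & OFFSET_MASK) | colorBit;
    }

    // Strips the color bits to recover the heap offset.
    static long heapOffset(long coloredAddress) {
        return coloredAddress & OFFSET_MASK;
    }

    // A pointer whose color differs from the currently active one must be fixed up on load.
    static boolean needsFixUp(long coloredAddress, long activeColor) {
        return (coloredAddress & activeColor) == 0;
    }

    public static void main(String[] args) {
        long offset = 0x1234_5678L;
        for (long color : new long[] {MARKED0, MARKED1, REMAPPED}) {
            long address = colored(offset, color);
            System.out.printf("range base %d TB -> address 0x%x, offset 0x%x%n",
                    color / TB, address, heapOffset(address));
        }
        long stale = colored(offset, MARKED0);
        System.out.println("stale pointer needs fix-up: " + needsFixUp(stale, REMAPPED));
    }
}
```

All three addresses decode to the same heap offset, which is the sense in which one object is reachable through three virtual addresses while only one color is considered valid at any moment.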
Tuning Practices
Typical ZGC tuning parameters (example):
-Xms10G -Xmx10G
-XX:ReservedCodeCacheSize=256m -XX:InitialCodeCacheSize=256m
-XX:+UnlockExperimentalVMOptions -XX:+UseZGC
-XX:ConcGCThreads=2 -XX:ParallelGCThreads=6
-XX:ZCollectionInterval=120 -XX:ZAllocationSpikeTolerance=5
-XX:+UnlockDiagnosticVMOptions -XX:-ZProactive
-Xlog:safepoint,classhisto*=trace,age*,gc*=info:file=/opt/logs/gc-%t.log:time,tid,tags:filecount=5,filesize=50m
Key tuning points:
Enable fixed‑interval GC (‑XX:ZCollectionInterval) for traffic spikes.
Increase allocation‑spike tolerance (‑XX:ZAllocationSpikeTolerance) to trigger GC earlier.
Adjust concurrent GC threads (‑XX:ConcGCThreads) to speed up marking.
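Why a larger spike tolerance makes GC start earlier can be illustrated with a toy model of an allocation-rate trigger. This is an assumption-laden sketch rather than HotSpot's actual heuristic: it simply multiplies the average allocation rate by the tolerance factor, estimates how long the free memory would last at that rate, and starts a cycle if that time is no longer than the expected GC duration. All names and numbers are hypothetical.

```java
final class AllocationRateTriggerSketch {
    // Returns true if a GC cycle should start now under this simplified model.
    static boolean shouldStartGc(double freeBytes,
                                 double avgAllocBytesPerSec,
                                 double spikeTolerance,
                                 double expectedGcSeconds) {
        // Assume allocation could spike to this multiple of the observed average.
        double worstCaseRate = avgAllocBytesPerSec * spikeTolerance;
        // The +1.0 avoids division by zero when nothing is being allocated.
        double secondsUntilFull = freeBytes / (worstCaseRate + 1.0);
        return secondsUntilFull <= expectedGcSeconds;
    }

    public static void main(String[] args) {
        double free = 4.0 * (1L << 30);    // 4 GB free (hypothetical)
        double rate = 500.0 * (1L << 20);  // 500 MB/s average allocation (hypothetical)
        double gcSeconds = 3.0;            // expected duration of one GC cycle (hypothetical)

        System.out.println("tolerance 2 -> start GC: " + shouldStartGc(free, rate, 2.0, gcSeconds));
        System.out.println("tolerance 5 -> start GC: " + shouldStartGc(free, rate, 5.0, gcSeconds));
    }
}
```

With the same free memory and allocation rate, the higher tolerance already considers the heap at risk, so the cycle starts sooner and leaves more headroom for a traffic spike.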
Case Studies
Four typical issues and solutions:
Memory‑allocation stalls during flash‑sale traffic – use fixed‑interval GC and larger tolerance.
Frequent GC with long pauses – increase concurrent GC threads.
A large number of ClassLoader roots causing 30 ms pauses – upgrade the Aviator component to reduce ClassLoader creation.
Growing CodeCache causing pauses – reduce unnecessary JIT compilation by removing unused expressions.
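The last two cases both involve GC roots that grow quietly over time. One hedged way to watch for that kind of growth from inside the process is the standard java.lang.management API, as sketched below. JMX does not expose a ClassLoader count directly, so loaded and unloaded class counts serve only as a rough proxy; the pool-name match on "Code" is an assumption about how HotSpot names its code cache pools, and the class name is hypothetical.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

final class RootPressureProbe {
    // Sums the used bytes of all memory pools that look like JIT code cache pools.
    static long codeCacheUsedBytes() {
        long used = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Assumption: code cache pools have "Code" in their name
            // (for example "CodeCache" or "CodeHeap '...'").
            if (pool.getName().contains("Code")) {
                MemoryUsage usage = pool.getUsage();
                if (usage != null) {
                    used += usage.getUsed();
                }
            }
        }
        return used;
    }

    public static void main(String[] args) {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        System.out.printf("classes: loaded now=%d, total loaded=%d, unloaded=%d%n",
                classes.getLoadedClassCount(),
                classes.getTotalLoadedClassCount(),
                classes.getUnloadedClassCount());
        System.out.printf("code cache used: %d KB%n", codeCacheUsedBytes() / 1024);
    }
}
```

Sampling these values periodically and plotting them next to pause times is one way to spot the correlation described in the two cases above.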
Upgrade Effects
Latency improvements: TP999 reduced by 12‑142 ms (18‑74 %); TP99 reduced by 5‑28 ms (10‑47 %). Throughput may decline for CPU‑bound workloads because ZGC is a single‑generation collector and incurs load‑barrier overhead.
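For readers unfamiliar with the TP notation, TP99 and TP999 are the 99th and 99.9th percentile latencies. The sketch below shows one common way to compute them (the nearest-rank method) and how a before/after pair turns into a percentage reduction. The sample values are hypothetical and are not Meituan's measurements; the class and method names are likewise illustrative.

```java
import java.util.Arrays;

final class TailLatency {
    // Nearest-rank percentile, expressed as a fraction num/den (99/100 for TP99, 999/1000 for TP999).
    static long percentile(long[] samplesMillis, long num, long den) {
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        // Integer ceiling of n * num / den avoids floating-point rounding surprises.
        long rank = (sorted.length * num + den - 1) / den;
        return sorted[(int) Math.max(rank, 1) - 1];
    }

    // Relative reduction in percent between a "before" and an "after" latency.
    static double reductionPercent(double beforeMillis, double afterMillis) {
        return (beforeMillis - afterMillis) * 100.0 / beforeMillis;
    }

    public static void main(String[] args) {
        // Hypothetical samples: 1 ms, 2 ms, ..., 1000 ms.
        long[] samples = new long[1000];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = i + 1;
        }
        System.out.println("TP99  = " + percentile(samples, 99, 100) + " ms");
        System.out.println("TP999 = " + percentile(samples, 999, 1000) + " ms");
        System.out.println("reduction = " + reductionPercent(100.0, 26.0) + " %");
    }
}
```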
Evaluation
Assess benefit, cost, and risk before upgrading to JDK 11 with ZGC. Benefits include lower pause latency; costs involve compatibility work and configuration changes; risks are mitigated by thorough testing.
Conclusion
ZGC provides sub‑10 ms pauses even for multi‑terabyte heaps, making it suitable for low‑latency services. Meituan’s experience shows that with proper tuning, ZGC can significantly improve service availability.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
