Root Cause Analysis and ZGC Optimization for a High‑Concurrency Ticket Pricing Service

This article details the investigation of a 2% timeout rate in a billion‑request‑per‑day ticket pricing service, identifies GC‑induced stop‑the‑world pauses as the main cause, and demonstrates how switching from ParNew+CMS to G1 and finally to ZGC dramatically reduces latency and timeout rates.

Qunar Tech Salon

The author, a senior backend engineer at Qunar, describes a critical ticket pricing system handling over 7 billion daily calls with sub‑2 ms average latency, where a 2% timeout rate was observed after a core system refactor.

Initial analysis of business metrics showed P99 latency holding steady at around 8 ms, far below the 100 ms timeout threshold. A deeper look, however, showed that the threshold still left too little headroom: the timeout rate spiked whenever upstream services set tighter limits.

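Full‑link tracing using Dubbo access logs and QTRACER showed that, for many timed‑out calls, the provider had not yet received the request when the caller gave up. In other words, roughly 100 ms was unaccounted for between the consumer sending the request and the provider logging its arrival.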
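The comparison of consumer‑side and provider‑side timestamps can be sketched as follows. This is a minimal illustration, not the author's tooling: the class `TimeoutTriage`, its method, and the enum values are hypothetical names, and it assumes both hosts have reasonably synchronized clocks and that a missing provider log entry is represented as -1.

```java
final class TimeoutTriage {
    enum Cause { DELAYED_BEFORE_PROVIDER, SLOW_IN_PROVIDER, NOT_A_TIMEOUT }

    /**
     * Classifies one call from its access-log timestamps (epoch milliseconds).
     * providerReceiveMs is -1 when the provider log has no entry for the call.
     * Assumption: consumer and provider clocks are synchronized closely enough
     * for a millisecond-level comparison.
     */
    static Cause classify(long consumerSendMs, long consumerEndMs,
                          long providerReceiveMs, long timeoutMs) {
        if (consumerEndMs - consumerSendMs < timeoutMs) {
            return Cause.NOT_A_TIMEOUT;
        }
        long deadline = consumerSendMs + timeoutMs;
        boolean arrivedInTime = providerReceiveMs >= 0 && providerReceiveMs < deadline;
        // If the request reached the provider before the deadline, the time was
        // spent inside the provider; otherwise it was lost before arrival.
        return arrivedInTime ? Cause.SLOW_IN_PROVIDER : Cause.DELAYED_BEFORE_PROVIDER;
    }

    public static void main(String[] args) {
        // Illustrative values only, not data from the service.
        System.out.println(classify(1_000, 1_100, 1_150, 100));
        System.out.println(classify(1_000, 1_100, 1_002, 100));
        System.out.println(classify(1_000, 1_008, 1_001, 100));
    }
}
```

Calls that fall into the "delayed before provider" bucket are the ones the article describes: the time is lost before the provider's own logging ever sees the request.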

Potential causes such as thread‑pool exhaustion, Netty I/O blockage, GC stop‑the‑world (STW) pauses, and network delays were listed. Scaling the service horizontally reduced timeouts, confirming the issue lay within the provider.

Thread‑pool size was doubled (Dubbo threads from 400 to 800, Netty iothreads from 16 to 32) but timeout rates remained unchanged, pointing to GC as the culprit.

GC logs showed frequent Young GC (ParNew) pauses of 170‑200 ms, which exceeded the 100 ms timeout window. The author therefore evaluated three GC strategies: tuning ParNew+CMS, switching to G1, and adopting ZGC.
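One way to check GC pauses against a timeout budget is to pull the wall‑clock `real=` field out of each JDK 8 `-XX:+PrintGCDetails` line and count the pauses that exceed it. The sketch below assumes that log format; the class name `GcPauseScan` and the sample lines are hand‑written for illustration and are not taken from the service's logs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GcPauseScan {
    // Matches the wall-clock field printed by -XX:+PrintGCDetails, e.g. "real=0.18 secs".
    private static final Pattern REAL_TIME = Pattern.compile("real=(\\d+\\.\\d+) secs");

    /** Returns the pause in milliseconds, or -1 if the line has no "real=" field. */
    static long pauseMillis(String gcLogLine) {
        Matcher m = REAL_TIME.matcher(gcLogLine);
        return m.find() ? Math.round(Double.parseDouble(m.group(1)) * 1000) : -1;
    }

    /** Counts log lines whose pause is strictly longer than the timeout budget. */
    static long countOverBudget(List<String> lines, long timeoutMillis) {
        return lines.stream()
                .mapToLong(GcPauseScan::pauseMillis)
                .filter(p -> p > timeoutMillis)
                .count();
    }

    public static void main(String[] args) {
        // Synthetic fragments in the JDK 8 format, for illustration only.
        List<String> sample = Arrays.asList(
                "[Times: user=0.35 sys=0.01, real=0.18 secs]",
                "[Times: user=0.10 sys=0.00, real=0.05 secs]",
                "[Times: user=0.38 sys=0.02, real=0.20 secs]");
        System.out.println(countOverBudget(sample, 100) + " pause(s) exceed 100 ms");
    }
}
```

Any request that arrives during a pause longer than the caller's timeout cannot be answered in time, however fast the business logic is.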

After testing, ZGC showed roughly a three‑fold reduction in STW pause time compared with G1 at a similar GC frequency. The service’s timeout rate dropped from 2% to 0.03% (roughly a 67‑fold reduction), bringing the success rate to 99.97%, close to four nines.
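As a quick check of what the two reported timeout rates imply, the calculation below derives the reduction factor and the resulting success rate. Only the 2% and 0.03% figures come from the article; the class and method names are illustrative.

```java
import java.util.Locale;

final class TimeoutRateMath {
    /** Ratio of the old timeout rate to the new one (both given in percent). */
    static double reductionFactor(double beforePercent, double afterPercent) {
        return beforePercent / afterPercent;
    }

    /** Share of requests that did not time out, in percent. */
    static double successPercent(double timeoutPercent) {
        return 100.0 - timeoutPercent;
    }

    public static void main(String[] args) {
        double before = 2.0;   // timeout rate before the GC change, in percent
        double after = 0.03;   // timeout rate with ZGC, in percent
        System.out.printf(Locale.ROOT, "reduction: %.1fx%n", reductionFactor(before, after));
        System.out.printf(Locale.ROOT, "success rate: %.2f%%%n", successPercent(after));
    }
}
```

The ratio comes out at about 67, and the success rate at 99.97%, which sits between three and four nines.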

Key code and configuration snippets:

Dubbo filter that records call timing on the consumer and provider sides:

// Dubbo imports (Activate, Constants, Filter, Invoker, Invocation, Result, RpcException)
// are omitted; their package depends on the Dubbo version in use.
// QTraceClient and QTraceClientGetter are Qunar-internal classes.
public class QTraceFilter {
    @Activate(group = {Constants.CONSUMER}, before = "qaccesslogconsumer")
    public static class Consumer implements Filter {
        private static final QTraceClient traceClient = QTraceClientGetter.getClient();

        @Override
        public Result invoke(Invoker<?> invoker, Invocation inv) throws RpcException {
            final long startTime = System.currentTimeMillis();
            Result result = invoker.invoke(inv);
            // collect consumer metrics (for example, elapsed time since startTime)
            return result;
        }
    }

    @Activate(group = {Constants.PROVIDER}, before = "qaccesslogprovider")
    public static class Provider implements Filter {
        private static final QTraceClient traceClient = QTraceClientGetter.getClient();

        @Override
        public Result invoke(Invoker<?> invoker, Invocation inv) throws RpcException {
            final long startTime = System.currentTimeMillis();
            Result result = invoker.invoke(inv);
            // collect provider metrics (for example, elapsed time since startTime)
            return result;
        }
    }
}

Dubbo protocol adjustment:

<dubbo:protocol name="dubbo" port="20880" id="main" threads="800" iothreads="32"/>

JVM parameters for ParNew+CMS:

-Xms7g -Xmx7g -XX:NewSize=5g -XX:PermSize=256m -server -XX:SurvivorRatio=8 -XX:GCTimeRatio=2 -XX:+UseParNewGC -XX:ParallelGCThreads=2 -XX:+CMSParallelRemarkEnabled -XX:+UseCMSCompactAtFullCollection -XX:+UseFastAccessorMethods -XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled -XX:CMSInitiatingOccupancyFraction=70 -XX:SoftRefLRUPolicyMSPerMB=0 -XX:+DisableExplicitGC -verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:$CATALINA_BASE/logs/gc.log

JVM parameters for ZGC:

-Xmx7g -Xms7g -XX:ReservedCodeCacheSize=256m -XX:InitialCodeCacheSize=256m -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -XX:ConcGCThreads=4 -XX:ZAllocationSpikeTolerance=5 -Xlog:gc*:file=$CATALINA_BASE/logs/gc.log:time
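After changing collector flags, it can be worth confirming from inside the JVM which collectors are actually active. The sketch below uses the standard `java.lang.management` API; the class name `ActiveGcReport` is illustrative. The bean names it prints depend on the collector and the JDK version, so treat the output as a sanity check rather than a fixed string to match.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class ActiveGcReport {
    /** Returns one summary line per garbage collector bean registered in this JVM. */
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(String.format("%s: count=%d, accumulatedMillis=%d",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return lines;
    }

    public static void main(String[] args) {
        describeCollectors().forEach(System.out::println);
    }
}
```

What the accumulated time covers differs between collectors, so the GC log remains the reference for actual pause durations. Note also that `-XX:+UnlockExperimentalVMOptions` is only needed on JDK releases where ZGC is still experimental (JDK 11 through 14).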

The article concludes that ZGC, despite a modest throughput penalty, is highly effective for low‑latency services where STW pauses dominate tail latency, and provides practical guidance for migration and tuning.
