Why Your Java Service Returns 503 Errors – Diagnosing Full GC and Tuning JVM Parameters
The article explains how intermittent 503 errors in a Java service are caused by long‑lasting Full GC pauses, walks through log analysis, shows how to use jstat, jcmd and MAT to pinpoint the problem, and provides a complete set of JVM tuning flags to eliminate the issue.
Problem
A production Java service occasionally returns short-lived 503 errors. Each 503 coincides with a long Full GC pause, that is, a stop-the-world (STW) event under the Parallel Scavenge (PS) collector, which is the default in JDK 8.
Investigation
Check the JVM start-up parameters. The service runs with the default PS collector, whose Full GC is entirely stop-the-world and blocks every application thread until it finishes.
Examine GC logs. Example cases:
Minor GC (case 1): survivor space holds ~217 MB before the GC and is cleared afterwards, so the young generation works correctly.
Full GC (case 2): the Full GC takes a long time even though the old generation uses only ~700 MB. jstat -gccause <pid> reports the cause as Ergonomics, i.e., the JVM's adaptive size policy triggered the collection. Survivor space has been limited to 4 MB, which is far too small and points to unreasonable adaptive sizing.
GC cause classification
Young generation
Allocation Failure – Eden exhausted. Fix: increase Eden size.
System.gc() – explicit call. Fix: remove or disable with -XX:+DisableExplicitGC.
Ergonomics – JVM decides GC is needed. Fix: tune or disable adaptive size policy.
Metadata GC Threshold – Metaspace limit reached. Fix: enlarge Metaspace.
Full GC
Metadata GC Threshold – increase Metaspace.
System.gc() – disable with -XX:+DisableExplicitGC.
Ergonomics – adjust heap size or GC strategy.
Allocation Failure – investigate leaks or enlarge old generation.
Concurrent Mode Failure – tune CMS thresholds.
Promotion Failed – reduce promotion rate or enlarge old generation.
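The cause names listed above should also be visible from inside the service, because HotSpot publishes each collection through its management API. The following is a minimal sketch, assuming a HotSpot JVM where the com.sun.management notification classes are available. The class name GcCauseLogger and the output format are illustrative and not part of the original article.

```java
import com.sun.management.GarbageCollectionNotificationInfo;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.function.Consumer;
import javax.management.NotificationEmitter;
import javax.management.NotificationListener;
import javax.management.openmbean.CompositeData;

// Hypothetical helper: logs collector, action, cause and duration of each GC.
final class GcCauseLogger {

    /** Formats one GC event; kept separate so it can be checked in isolation. */
    static String format(String collector, String action, String cause, long durationMs) {
        return collector + " | " + action + " | cause=" + cause + " | " + durationMs + " ms";
    }

    /** Registers a listener on every collector bean and returns how many were registered. */
    static int install(Consumer<String> sink) {
        NotificationListener listener = (notification, handback) -> {
            if (!GarbageCollectionNotificationInfo.GARBAGE_COLLECTION_NOTIFICATION
                    .equals(notification.getType())) {
                return; // not a GC event
            }
            GarbageCollectionNotificationInfo info = GarbageCollectionNotificationInfo
                    .from((CompositeData) notification.getUserData());
            sink.accept(format(info.getGcName(), info.getGcAction(), info.getGcCause(),
                    info.getGcInfo().getDuration()));
        };
        int registered = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            if (bean instanceof NotificationEmitter) {
                ((NotificationEmitter) bean).addNotificationListener(listener, null, null);
                registered++;
            }
        }
        return registered;
    }

    public static void main(String[] args) throws InterruptedException {
        install(System.out::println);
        System.gc();        // illustrative trigger only; has no effect with -XX:+DisableExplicitGC
        Thread.sleep(500);  // notifications are delivered asynchronously
    }
}
```

Calling GcCauseLogger.install(...) once at start-up with the service's logger as the sink would let each 503 be matched against a logged cause such as Ergonomics. For concurrent collectors the reported duration covers the whole cycle, not only the STW portion.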
Diagnostic commands
Monitor GC and memory every 2 seconds (focus on Old Gen): jstat -gcutil <pid> 2000
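Query heap details: jcmd <pid> GC.heap_info (availability depends on the JDK build; jcmd <pid> help lists the commands the running JVM supports)
Query the cause of the last and current GC: jstat -gccause <pid> 2000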
Generate a live heap histogram (instance count and bytes per class; note that :live forces a Full GC first): jmap -histo:live <pid> | head -n 20
Generate a live heap histogram sorted by memory usage: jmap -histo:live <pid> | sort -n -k3 -r
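When attaching jstat to the process is inconvenient, for example inside a locked-down container, roughly the same numbers can be read in-process from the standard java.lang.management beans. This is a sketch under that assumption; the class name GcSnapshot and the line format are illustrative.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: an in-process approximation of what jstat -gcutil shows.
final class GcSnapshot {

    /** Percentage in use, relative to max, or to committed when max is undefined (-1). */
    static double percentUsed(long used, long committed, long max) {
        long limit = max > 0 ? max : committed;
        return limit <= 0 ? 0.0 : 100.0 * used / limit;
    }

    /** One line per heap pool (Eden, Survivor, Old Gen, ...) and per collector. */
    static List<String> collect() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (pool.getType() != MemoryType.HEAP || usage == null) {
                continue; // skip Metaspace, code cache and invalid pools
            }
            double pct = percentUsed(usage.getUsed(), usage.getCommitted(), usage.getMax());
            lines.add("pool " + pool.getName() + ": " + Math.round(pct) + "% used");
        }
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add("gc " + gc.getName() + ": count=" + gc.getCollectionCount()
                    + ", time=" + gc.getCollectionTime() + " ms");
        }
        return lines;
    }

    public static void main(String[] args) {
        collect().forEach(System.out::println);
    }
}
```

Sampling collect() on a timer and watching the old-generation pool together with the cumulative time of the old-generation collector gives a trend similar to the jstat output. Pool and collector names depend on the collector in use.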
Heap dump analysis with Eclipse MAT
If the above steps do not reveal the root cause, dump the heap with jmap -dump:live,format=b,file=/app/logs/heap-dump.hprof <pid> (jmap -heap <pid> only prints a heap summary) and open the file in Eclipse Memory Analyzer (MAT). Typical MAT views used:
Histogram – object count and shallow size per class.
Dominator Tree – objects that dominate memory usage.
Top Consumers – groups by class/package to find biggest consumers.
Leak Suspects – automatic leak detection.
List Objects – outgoing and incoming references for a selected object.
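A heap dump for MAT can also be produced from inside the JVM, which helps when jmap cannot attach. The sketch below assumes a HotSpot JVM that exposes HotSpotDiagnosticMXBean; the class name HeapDumper and the example path are illustrative.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: writes an .hprof file that Eclipse MAT can open.
final class HeapDumper {

    /**
     * Dumps the heap to target. With liveOnly=true only reachable objects are
     * written, which forces a full collection first (same effect as jmap's "live").
     * dumpHeap throws IOException if the target file already exists.
     */
    static Path dump(Path target, boolean liveOnly) throws IOException {
        if (!target.getFileName().toString().endsWith(".hprof")) {
            // Newer JDKs reject other suffixes, so fail early with a clear message.
            throw new IllegalArgumentException("heap dump file must end with .hprof");
        }
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(target.toAbsolutePath().toString(), liveOnly);
        return target;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: HeapDumper <file.hprof>");
            return;
        }
        System.out.println("wrote " + dump(Paths.get(args[0]), true));
    }
}
```

Like jmap -dump, this pauses the application while the dump is written, so it is best triggered on an instance that has been taken out of rotation.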
Root cause
The JVM parameters were mis‑configured for a container with 6 GB memory. The default PS collector caused long STW pauses. Switching to the CMS collector with ParNew for the young generation and tuning heap and Metaspace sizes eliminated the pauses.
Recommended JVM configuration
Container awareness
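-XX:+UseContainerSupport -XX:+UseCGroupMemoryLimitForHeap
(-XX:+UseContainerSupport is available from JDK 8u191 and is on by default there; -XX:+UseCGroupMemoryLimitForHeap is the older experimental flag from JDK 8u131 and needs -XX:+UnlockExperimentalVMOptions. Keep whichever matches the JDK build.)
Heap size (adjust to container limits)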
-XX:MaxRAMPercentage=75.0 -XX:InitialRAMPercentage=75.0 -XX:MinRAMPercentage=75.0
or explicit -Xms4g -Xmx4g if the container memory is fixed.
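To see what the percentage flags mean for the 6 GB container mentioned in the root cause, the arithmetic can be sketched as follows. The class name HeapBudget is illustrative, and the calculation ignores the alignment the JVM applies internally.

```java
// Hypothetical helper: expected max heap for a container limit and -XX:MaxRAMPercentage.
final class HeapBudget {

    static long maxHeapBytes(long containerLimitBytes, double maxRamPercentage) {
        return (long) (containerLimitBytes * (maxRamPercentage / 100.0));
    }

    public static void main(String[] args) {
        long limit = 6L * 1024 * 1024 * 1024;   // 6 GiB container, as in the article
        long expected = maxHeapBytes(limit, 75.0);
        System.out.println("expected max heap : " + expected / (1024 * 1024) + " MiB");
        // What this JVM actually chose; some collectors report slightly less than -Xmx.
        System.out.println("Runtime.maxMemory : "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024) + " MiB");
    }
}
```

With 75% of 6 GiB the heap may grow to 4.5 GiB, which leaves about 1.5 GiB for Metaspace, the code cache, thread stacks and other native memory. The explicit -Xms4g -Xmx4g alternative leaves 2 GiB.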
Metaspace and code cache
-XX:MaxMetaspaceSize=512m -XX:MetaspaceSize=256m -XX:CompressedClassSpaceSize=128m -XX:ReservedCodeCacheSize=256m -XX:InitialCodeCacheSize=64m
GC algorithm
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:+CMSParallelRemarkEnabled -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways
CMS trigger and concurrency control
-XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -XX:ConcGCThreads=2 -XX:ParallelGCThreads=4 -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSClassUnloadingEnabled
Young/old generation ratio
-XX:NewRatio=2 (young:old ≈ 1:2)
-XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=15 -XX:+UseAdaptiveSizePolicy (for low-latency workloads, disable it with -XX:-UseAdaptiveSizePolicy)
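The effect of NewRatio, SurvivorRatio and CMSInitiatingOccupancyFraction on the generation sizes can be worked out by hand. The sketch below does that arithmetic in whole megabytes; it is an approximation that ignores JVM alignment and adaptive resizing, and the class name GenerationLayout is illustrative.

```java
// Hypothetical helper: approximate generation sizes implied by the ratio flags.
final class GenerationLayout {
    final long youngMb;
    final long oldMb;
    final long edenMb;
    final long survivorMb;    // size of EACH of the two survivor spaces
    final long cmsTriggerMb;  // old-gen occupancy at which a CMS cycle starts

    GenerationLayout(long heapMb, int newRatio, int survivorRatio, int cmsOccupancyPercent) {
        youngMb = heapMb / (newRatio + 1);            // NewRatio = old / young
        oldMb = heapMb - youngMb;
        survivorMb = youngMb / (survivorRatio + 2);   // SurvivorRatio = eden / one survivor
        edenMb = youngMb - 2 * survivorMb;
        cmsTriggerMb = oldMb * cmsOccupancyPercent / 100;
    }

    @Override
    public String toString() {
        return "young=" + youngMb + " MB (eden=" + edenMb + ", survivor=2x" + survivorMb
                + "), old=" + oldMb + " MB, CMS starts at ~" + cmsTriggerMb + " MB";
    }

    public static void main(String[] args) {
        // 4 GB heap with NewRatio=2, SurvivorRatio=8, CMSInitiatingOccupancyFraction=70
        System.out.println(new GenerationLayout(4096, 2, 8, 70));
    }
}
```

For a 4 GB heap this gives about 1365 MB of young generation with two survivor spaces of about 136 MB each, and about 2731 MB of old generation with CMS starting near 1911 MB. Compare the 136 MB survivor spaces with the 4 MB survivor space observed during the investigation.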
GC logging
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintTenuringDistribution -Xloggc:/app/logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=10M
Performance optimisations
-XX:+UseCompressedOops -XX:+UseCompressedClassPointers -XX:+TieredCompilation -XX:CICompilerCount=4
Safety
-Djava.security.egd=file:/dev/./urandom -Duser.timezone=Asia/Shanghai
Failure handling
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/app/logs/heap-dump.hprof -XX:ErrorFile=/app/logs/hs_err_pid%p.log -XX:+DisableExplicitGC
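After a deployment it is worth confirming that the flags actually reached the JVM, since a typo in a start script can silently fall back to defaults. This is a minimal sketch, assuming a HotSpot JVM; the class name VmFlagCheck and the placeholder text for unknown flags are illustrative.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: reads the effective value of -XX flags from the running JVM.
final class VmFlagCheck {
    static final String UNKNOWN = "<not available in this JVM>";

    /** Effective value of one -XX flag, or UNKNOWN if this JVM has no such flag. */
    static String valueOf(String flag) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            return bean.getVMOption(flag).getValue();
        } catch (IllegalArgumentException e) {
            return UNKNOWN; // e.g. CMS flags on a JDK that no longer ships CMS
        }
    }

    /** Values for several flags, in the order given. */
    static Map<String, String> report(String... flags) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String flag : flags) {
            result.put(flag, valueOf(flag));
        }
        return result;
    }

    public static void main(String[] args) {
        report("UseConcMarkSweepGC", "MaxMetaspaceSize", "HeapDumpOnOutOfMemoryError",
               "HeapDumpPath", "DisableExplicitGC")
                .forEach((flag, value) -> System.out.println(flag + " = " + value));
    }
}
```

The same information is available from outside the process with jcmd <pid> VM.flags.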
Architect-Kip
Daily architecture work and learning summaries. Not seeking lengthy articles—only real practical experience.
