Why Your Java Service Returns 503 Errors – Diagnosing Full GC and Tuning JVM Parameters

The article explains how intermittent 503 errors in a Java service are caused by long‑lasting Full GC pauses, walks through log analysis, shows how to use jstat, jcmd and MAT to pinpoint the problem, and provides a complete set of JVM tuning flags to eliminate the issue.

Architect-Kip

Problem

A production Java service occasionally returns short‑lived 503 errors. Each 503 coincides with a long Full GC pause, i.e. a stop‑the‑world (STW) event caused by the default Parallel Scavenge (PS) collector in JDK 8.

Investigation

Check the JVM start‑up parameters – the service runs with the default PS collector, whose Full GC is a stop‑the‑world phase that blocks all application threads for its entire duration.

Examine GC logs. Example cases:

Minor GC (case 1): the survivor space holds ~217 MB before the collection and is cleared afterwards – the young generation is working correctly.

Full GC (case 2): the Full GC takes a long time even though the old generation holds only ~700 MB. jstat -gccause <pid> reports the cause as Ergonomics, i.e. the JVM's adaptive sizing policy triggered the collection. The survivor space has been shrunk to 4 MB, far too small, which points to unreasonable adaptive sizing.

GC cause classification

Young generation (Minor GC)

Allocation Failure – Eden is exhausted (the normal minor GC trigger). Fix: enlarge the young generation only if minor GCs are too frequent.

System.gc() – explicit call. Fix: remove or disable with -XX:+DisableExplicitGC.

Ergonomics – JVM decides GC is needed. Fix: tune or disable adaptive size policy.

Metadata GC Threshold – Metaspace limit reached. Fix: enlarge Metaspace.

Full GC

Metadata GC Threshold – increase Metaspace.

System.gc() – disable with -XX:+DisableExplicitGC.

Ergonomics – adjust heap size or GC strategy.

Allocation Failure – investigate leaks or enlarge old generation.

Concurrent Mode Failure – CMS could not finish before the old generation filled up. Fix: start CMS earlier (lower CMSInitiatingOccupancyFraction) or enlarge the old generation.

Promotion Failed – reduce promotion rate or enlarge old generation.
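The System.gc() cause in the tables above can be observed from inside the process. The following sketch (not from the original article; class and method names are my own) uses the standard GarbageCollectorMXBean API to show that an explicit System.gc() call really does trigger a collection under default flags, which is exactly what -XX:+DisableExplicitGC suppresses:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class ExplicitGcDemo {
    /** Sums collection counts across all registered collectors. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long before = totalCollections();
        // With default flags this forces a stop-the-world collection
        // (GC cause "System.gc()"); with -XX:+DisableExplicitGC it is a no-op.
        System.gc();
        long after = totalCollections();
        System.out.println("collections before=" + before + " after=" + after);
    }
}
```

Run once with and once without -XX:+DisableExplicitGC: in the second run the counter no longer moves when System.gc() is called.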

Diagnostic commands

Monitor GC and memory utilisation every 2 seconds (focus on the Old Gen column): jstat -gcutil <pid> 2000

Query detailed GC info and the most recent GC cause: jcmd <pid> GC.heap_info and jcmd <pid> GC.last_gc_cause

Generate a live heap histogram (object count): jmap -histo:live &lt;pid&gt; | head -n 20

Generate a live heap histogram sorted by memory usage: jmap -histo:live &lt;pid&gt; | sort -n -k3 -r
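The same heap figures that jstat and jmap report externally can also be read in-process, which is handy for exporting them to a metrics system. This is a minimal sketch using the standard MemoryMXBean API (the class name is my own, not from the article):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapWatch {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        // Roughly the aggregate numbers jstat -gcutil shows, but in-process,
        // reported in MB (>> 20 divides by 1024 * 1024).
        System.out.printf("heap used=%d MB committed=%d MB max=%d MB%n",
                heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);
    }
}
```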

Heap dump analysis with Eclipse MAT

If the above steps do not reveal the root cause, dump the heap with jmap -dump:live,format=b,file=heap.hprof <pid> and open the resulting file in Eclipse Memory Analyzer (MAT). (Note that jmap -heap <pid> only prints a heap summary; the -dump option produces the .hprof file MAT needs.) Typical MAT views used:

Histogram – object count and shallow size per class.

Dominator Tree – objects that dominate memory usage.

Top Consumers – groups by class/package to find biggest consumers.

Leak Suspects – automatic leak detection.

List Objects – outgoing and incoming references for a selected object.
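A heap dump for MAT can also be triggered programmatically, for example from an admin endpoint when the process is about to be killed. The sketch below (my own illustration, not from the article) uses the HotSpot-specific HotSpotDiagnosticMXBean; passing liveOnly=true forces a full GC first so the dump contains only reachable objects:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    /** Writes an .hprof heap dump to the given path (the file must not already exist). */
    public static void dump(String path, boolean liveOnly) throws Exception {
        HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(path, liveOnly);
    }

    public static void main(String[] args) throws Exception {
        File out = File.createTempFile("heap", ".hprof");
        out.delete();  // dumpHeap refuses to overwrite an existing file
        dump(out.getAbsolutePath(), true);
        System.out.println("dump size: " + out.length() + " bytes");
    }
}
```

The resulting .hprof file opens directly in MAT's Histogram and Dominator Tree views.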

Root cause

The JVM parameters were mis‑configured for a container with 6 GB memory. The default PS collector caused long STW pauses. Switching to the CMS collector with ParNew for the young generation and tuning heap and Metaspace sizes eliminated the pauses.

Recommended JVM configuration

Container awareness

-XX:+UseContainerSupport

(JDK 8u191+ and JDK 10+; enabled by default in those builds)

-XX:+UseCGroupMemoryLimitForHeap

(experimental, JDK 8u131–8u181 only; requires -XX:+UnlockExperimentalVMOptions and is superseded by UseContainerSupport, so use one or the other depending on the JDK version)

Heap size (adjust to container limits)

-XX:MaxRAMPercentage=75.0
-XX:InitialRAMPercentage=75.0
-XX:MinRAMPercentage=75.0

or explicit -Xms4g -Xmx4g if the container memory is fixed.

Metaspace and code cache

-XX:MaxMetaspaceSize=512m
-XX:MetaspaceSize=256m
-XX:CompressedClassSpaceSize=128m
-XX:ReservedCodeCacheSize=256m
-XX:InitialCodeCacheSize=64m

GC algorithm

-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC
-XX:+CMSParallelRemarkEnabled
-XX:+CMSParallelInitialMarkEnabled
-XX:+CMSEdenChunksRecordAlways

CMS trigger and concurrency control

-XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSInitiatingOccupancyOnly
-XX:ConcGCThreads=2
-XX:ParallelGCThreads=4
-XX:+UseCMSCompactAtFullCollection
-XX:CMSFullGCsBeforeCompaction=0
-XX:+CMSClassUnloadingEnabled

Young/old generation ratio

-XX:NewRatio=2

(young:old ≈ 1:2)

-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=15
-XX:-UseAdaptiveSizePolicy

(disabled: the adaptive sizing policy is what shrank the survivor spaces to 4 MB in this incident, and it is not useful with CMS)

GC logging

-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-Xloggc:/app/logs/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=5
-XX:GCLogFileSize=10M

Performance optimisations

-XX:+UseCompressedOops
-XX:+UseCompressedClassPointers
-XX:+TieredCompilation
-XX:CICompilerCount=4

Safety

-Djava.security.egd=file:/dev/./urandom
-Duser.timezone=Asia/Shanghai

Failure handling

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/app/logs/heap-dump.hprof
-XX:ErrorFile=/app/logs/hs_err_pid%p.log
-XX:+DisableExplicitGC
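Assembled into a single launch command, the recommended flags look roughly like this; the jar name, log paths, and the fixed 4 GB heap are illustrative assumptions for a 6 GB container, not values from the original deployment:

```shell
java \
  -Xms4g -Xmx4g \
  -XX:MetaspaceSize=256m -XX:MaxMetaspaceSize=512m \
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -XX:+CMSParallelRemarkEnabled -XX:+CMSParallelInitialMarkEnabled \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
  -XX:+CMSClassUnloadingEnabled \
  -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=15 \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/app/logs/gc.log \
  -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=10M \
  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/app/logs/heap-dump.hprof \
  -XX:+DisableExplicitGC \
  -jar app.jar
```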