
Analysis of Serverless Scaling Failure Due to Full GC and Sentinel Protection Rules

The article analyzes a serverless scaling failure where newly added instances suffered high CPU and frequent Full GC leading to JVM crashes, reproduces the issue under load, and demonstrates how Sentinel's CPU‑based circuit‑breaker rule mitigates the problem across cold and hot start scenarios.

JD Retail Technology

During a ForceBot full‑chain load test, a colleague observed that a newly scaled serverless instance (scale‑out triggered once load exceeded 50%) exhibited extreme CPU usage, frequent Full GC, and memory that was not reclaimed after each GC, as shown in the monitoring chart.

Analysis Conclusion: All memory was occupied by processing threads; each Full GC reclaimed little memory, causing repeated GC cycles that consumed CPU and paused the application, ultimately driving the JVM into a crash state.

Problem Reproduction: The theoretical analysis was validated by replaying the phenomenon. Under a load of 400 QPS per instance (CPU at 30‑40%), without pre‑warming the Java service, the test intermittently reproduced high CPU, frequent Full GC, and JVM crashes.

Analysis Conclusion: Instantaneous traffic spikes must be avoided to prevent the service from entering an overload state and being torn down.

Solution: A Sentinel system rule was introduced to automatically trigger circuit breaking when CPU usage exceeds 80%. The impact of this rule was compared across several scenarios.
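A rule like this can be loaded through Sentinel's standard `SystemRule` API. The sketch below is illustrative, not the exact configuration used in the test; the resource name `handleRequest` is a hypothetical placeholder.

```java
import java.util.Collections;
import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.system.SystemRule;
import com.alibaba.csp.sentinel.slots.system.SystemRuleManager;

public class CpuProtection {
    public static void init() {
        // System rule: reject new entries when process CPU usage exceeds 80%.
        SystemRule rule = new SystemRule();
        rule.setHighestCpuUsage(0.8); // fraction in [0.0, 1.0]
        SystemRuleManager.loadRules(Collections.singletonList(rule));
    }

    public static void handle() {
        Entry entry = null;
        try {
            entry = SphU.entry("handleRequest"); // guarded resource (hypothetical name)
            // ... business logic ...
        } catch (BlockException e) {
            // Fast-fail here instead of queueing more work on an overloaded JVM.
        } finally {
            if (entry != null) entry.exit();
        }
    }
}
```

Because the rule sheds load before threads and heap fill up, the instance degrades gracefully for a short window instead of spiraling into repeated Full GC.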

1. Cold start without protection rule: Even without a crash, the system required 5‑7 minutes to recover from a near‑crash state, during which CPU stayed high and QPS fluctuated between 50 and 100. After recovery, CPU settled at 30‑40% and QPS rose to 400; no Sentinel rule was in place to trip.

2. Hot start without protection rule: After a warm‑up run, the system no longer entered the quasi‑crash phase.

3. Cold start with protection rule: After restarting the Java process to simulate a cold start and applying the CPU‑80% rule, the system avoided the quasi‑crash state; the impact was limited to the first minute, after which normal performance resumed.

CPU usage during this scenario:

Sentinel circuit‑breaker activity showed about one minute of breakage.

4. Why cold‑start performance is slower:

• HotSpot JVM optimization: Hot code paths are JIT‑compiled and optimized, while cold‑start code runs interpreted or minimally optimized until the JIT warms up.

• Resource readiness: Thread pools and external connections may be created lazily, only after the first requests arrive.

• Crash loop: A slowdown triggers more active threads, generating more objects, leading to more GC cycles, higher CPU consumption, further slowdown, and a feedback loop that can push the system into a quasi‑crash state until JIT optimizations or resource initialization improve performance.
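The feedback loop in the last bullet can be sketched as a toy numerical model (all constants here are hypothetical, chosen only to illustrate the dynamics, not measured from the incident). Using Little's law, in‑flight requests grow with latency, and the model adds a per‑request GC penalty that is larger on a cold, unoptimized instance:

```java
// Toy model of the overload feedback loop: slower responses -> more in-flight
// requests -> more live objects -> longer GC pauses -> even slower responses.
public class CrashLoopModel {
    /**
     * Iterates the loop and returns the steady-state latency in ms,
     * or -1 if latency diverges (the "quasi-crash" state).
     */
    static double simulate(double qps, double baseLatencyMs, double gcPenaltyPerRequestMs) {
        double latency = baseLatencyMs;
        for (int i = 0; i < 1000; i++) {
            double inFlight = qps * latency / 1000.0;             // Little's law
            double next = baseLatencyMs + gcPenaltyPerRequestMs * inFlight;
            if (next > 60_000) return -1;                         // pauses over a minute: crashed
            if (Math.abs(next - latency) < 1e-6) return next;     // converged to steady state
            latency = next;
        }
        return latency;
    }

    public static void main(String[] args) {
        // Warmed-up instance: low per-request GC cost, converges to a stable latency.
        System.out.println(simulate(400, 50, 0.5));
        // Cold instance: each in-flight request is costlier; latency diverges.
        System.out.println(simulate(400, 50, 6.0));
    }
}
```

The qualitative point is that the loop has a tipping point: below it, latency settles; above it, the same traffic drives the instance into the quasi‑crash state, which is exactly why shedding load early (or warming up first) keeps the system on the stable side.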

Side note: This issue is not limited to serverless cold scaling; any sudden traffic surge followed by rapid scale‑out can cause similar crashes.
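One concrete mitigation for the resource‑readiness point above is to start thread‑pool workers eagerly at application startup rather than lazily on the first request, using the standard JDK API:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PrewarmPools {
    /** Creates a fixed-size pool and starts all core threads up-front. */
    static ThreadPoolExecutor prewarmedPool(int coreSize) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                coreSize, coreSize, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
        // JDK API: eagerly starts idle core threads, returns how many were started.
        int started = pool.prestartAllCoreThreads();
        System.out.println("pre-started " + started + " worker threads");
        return pool;
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = prewarmedPool(8);
        System.out.println(pool.getPoolSize()); // 8, even before any task is submitted
        pool.shutdown();
    }
}
```

Pre‑starting workers (and, similarly, pre‑establishing database or RPC connections) removes one source of first‑minute latency, though it does not by itself address JIT warm‑up.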
