Why Does Full GC Stall During Sales Peaks? A Deep Dive into DB Connection Pool Issues
During a major sales promotion, an API suffered timeouts caused by Full GC pauses of more than 500 ms. The pauses were traced to stale MySQL connections from the DBCP pool accumulating in the old generation. This article walks through the investigation steps, the root cause, and the mitigations: switching to G1 GC and adjusting the pool's eviction settings.
Problem Description
During a large promotion, an interface experienced increased timeouts. Monitoring showed Full GC pauses exceeding 500 ms, coinciding with the timeouts.
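One way to cross-check this kind of correlation from inside the JVM is the standard java.lang.management API, which exposes cumulative count and time per collector. The following is a minimal sketch, not the monitoring the team actually used; the class name GcStats and the output format are illustrative, and the collector names it prints depend on which GC is active.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

final class GcStats {
    /** One line per collector: name, cumulative collection count, cumulative time in ms. */
    static List<String> snapshot() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();  // -1 if the collector does not report it
            long timeMs = gc.getCollectionTime();  // -1 if the collector does not report it
            String avg = count > 0 ? String.valueOf(timeMs / count) : "n/a";
            lines.add(gc.getName() + ": count=" + count
                    + ", totalMs=" + timeMs + ", avgMs=" + avg);
        }
        return lines;
    }

    public static void main(String[] args) {
        snapshot().forEach(System.out::println);
    }
}
```

These counters are averages over the life of the process, so a single 500 ms pause can hide behind many short ones. Sampling the snapshot periodically and comparing deltas, or reading the GC log, gives a per-pause view.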
Application Basics
Container: 8C12G
JVM options:
-XX:+UseConcMarkSweepGC -Xms6144m -Xmx6144m -Xmn2048m -XX:ParallelGCThreads=8 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:+ParallelRefProcEnabled
Database: MySQL
Connection pool: DBCP
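The heap layout implied by these JVM options can be worked out with simple arithmetic. The sketch below only restates the flags: the old generation is the total heap minus the young generation, and CMS starts a cycle once old-generation occupancy reaches the configured fraction. The class and method names are illustrative.

```java
final class HeapLayout {
    /** Old generation size: total heap (-Xmx) minus young generation (-Xmn). */
    static long oldGenMb(long xmxMb, long xmnMb) {
        return xmxMb - xmnMb;
    }

    /** Old-gen occupancy at which a CMS cycle is initiated, rounded down to whole MB. */
    static long cmsTriggerMb(long oldGenMb, int occupancyFractionPercent) {
        return oldGenMb * occupancyFractionPercent / 100;
    }

    public static void main(String[] args) {
        long oldGen = oldGenMb(6144, 2048);      // -Xmx6144m, -Xmn2048m
        long trigger = cmsTriggerMb(oldGen, 70); // -XX:CMSInitiatingOccupancyFraction=70
        System.out.println("old gen = " + oldGen + " MB, CMS starts near " + trigger + " MB");
        // prints: old gen = 4096 MB, CMS starts near 2867 MB
    }
}
```

With a 4 GB old generation, stale connection objects can pile up for a long time before a collection is triggered, which is consistent with the long pauses observed.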
Investigation Process
1. Long Full GC pauses suggested that a large number of garbage objects had built up in the old generation.
2. A memory leak was ruled out because heap usage returned to a normal level after each Full GC.
3. Dumped heap before and after Full GC; observed that many database‑related objects were reclaimed after Full GC.
4. Counted the connection objects in the heap dump via OQL; the count exceeded maxActive, which meant that many of them were stale connections no longer held by the pool.
5. Concluded that numerous stale connections entered the old generation, causing long Full GC.
6. Adjusted timeBetweenEvictionRunsMillis from 1 min to 10 s, but the issue persisted.
7. Examined the DBCP source: the evictor runs every timeBetweenEvictionRunsMillis and evicts connections that have been idle longer than minEvictableIdleTimeMillis. If testWhileIdle is true, validationQuery is executed, but the connection's idle time is not refreshed. Idle connections are therefore easily evicted during low traffic, and the objects associated with them are expensive to collect.
8. Noted that the problem only appears during high-traffic promotions because young GCs become frequent, so idle connections quickly reach the tenuring threshold and are promoted into the old generation, where they are reclaimed slowly.
9. Determined that the root cause is the connection pool lacking true “keep‑alive” capability, leading to frequent connection churn and long Full GC pauses.
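The eviction behavior described in steps 7 to 9 can be illustrated with a small simulation. This is a simplified model, not DBCP's code: IdleConn is a hypothetical stand-in for a pooled connection, every idle connection is examined on every pass, evicted connections are immediately recreated (as if minIdle equaled the pool size), and a non-positive minEvictableIdleTimeMillis is assumed to disable idle-time eviction.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

final class EvictorSketch {
    /** Hypothetical pooled connection; it only remembers when it became idle. */
    static final class IdleConn {
        final long idleSinceMs;
        IdleConn(long idleSinceMs) { this.idleSinceMs = idleSinceMs; }
    }

    /** One evictor pass at simulated time nowMs; returns how many connections were evicted. */
    static int evictOnce(Deque<IdleConn> idle, long nowMs, long minEvictableIdleMs) {
        int evicted = 0;
        for (Iterator<IdleConn> it = idle.iterator(); it.hasNext();) {
            IdleConn c = it.next();
            // Assumption: a non-positive threshold disables idle-time eviction.
            if (minEvictableIdleMs > 0 && nowMs - c.idleSinceMs > minEvictableIdleMs) {
                it.remove();
                evicted++;
            }
            // Otherwise a testWhileIdle validation would run here. It does not
            // change idleSinceMs, so a validated connection keeps aging.
        }
        return evicted;
    }

    /** Total connections destroyed and recreated in a pool that receives no traffic. */
    static int churn(int poolSize, long runEveryMs, long minEvictableIdleMs, long durationMs) {
        Deque<IdleConn> idle = new ArrayDeque<>();
        for (int i = 0; i < poolSize; i++) idle.add(new IdleConn(0));
        int total = 0;
        for (long now = runEveryMs; now <= durationMs; now += runEveryMs) {
            int evicted = evictOnce(idle, now, minEvictableIdleMs);
            total += evicted;
            for (int i = 0; i < evicted; i++) idle.add(new IdleConn(now)); // recreate
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("churn, 60 s threshold: " + churn(5, 10_000, 60_000, 600_000));
        System.out.println("churn, threshold 0:    " + churn(5, 10_000, 0, 600_000));
    }
}
```

In this model, a pool of five untouched connections is torn down and rebuilt every 70 simulated seconds when the threshold is 60 s, and never when the threshold is 0. Each discarded connection is garbage that may already sit in the old generation.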
Solution
Switch to the G1 garbage collector.
Set minEvictableIdleTimeMillis to 0 so that connections are no longer evicted on idle time alone.
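As a rough illustration, the pool-side settings could be collected in a java.util.Properties object using DBCP's documented property names. Only the minEvictableIdleTimeMillis value and the 10 s evictor interval come from the article; testWhileIdle=true and the "SELECT 1" validation query are assumptions, and the class name PoolSettings is hypothetical. On the JVM side, the collector switch would typically mean replacing -XX:+UseConcMarkSweepGC and the CMS-specific flags with -XX:+UseG1GC.

```java
import java.util.Properties;

final class PoolSettings {
    /** Eviction-related settings, keyed by DBCP property name. */
    static Properties evictionSettings() {
        Properties p = new Properties();
        p.setProperty("timeBetweenEvictionRunsMillis", "10000"); // evictor runs every 10 s
        p.setProperty("minEvictableIdleTimeMillis", "0");        // value from the article
        p.setProperty("testWhileIdle", "true");                  // assumption
        p.setProperty("validationQuery", "SELECT 1");            // assumption
        return p;
    }

    public static void main(String[] args) {
        Properties p = evictionSettings();
        for (String name : p.stringPropertyNames()) {
            System.out.println(name + "=" + p.getProperty(name));
        }
        // With DBCP on the classpath, the same Properties (plus url, username,
        // password, and so on) could be passed to
        // BasicDataSourceFactory.createDataSource(p).
    }
}
```

With idle-time eviction disabled and testWhileIdle enabled, the evictor pass acts as a periodic health check rather than a teardown, which approximates the keep-alive behavior the pool otherwise lacks.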
Summary of Findings
The DBCP pool does not keep connections alive; stale connections accumulate in the old generation, and their phantom references carry many objects. When they are reclaimed by Full GC, the pause time is high, causing interface timeouts.
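For readers unfamiliar with phantom references, the sketch below shows the mechanism using only java.lang.ref. It is a generic illustration, not the MySQL driver's code: a phantom reference never hands back its referent, and it is placed on its queue only after a GC has found the referent unreachable. System.gc() is only a hint, so the second method may return false on some JVMs.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

final class PhantomSketch {
    /** PhantomReference.get() returns null by specification, even while the referent is alive. */
    static boolean getAlwaysReturnsNull() {
        Object referent = new Object();
        PhantomReference<Object> ref = new PhantomReference<>(referent, new ReferenceQueue<>());
        return ref.get() == null;
    }

    /** Returns true if the reference was enqueued within the given number of GC requests. */
    static boolean enqueuedAfterGc(int attempts) {
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        Object referent = new Object();
        PhantomReference<Object> ref = new PhantomReference<>(referent, queue);
        referent = null; // drop the only strong reference
        try {
            for (int i = 0; i < attempts; i++) {
                System.gc();                          // a request, not a guarantee
                Reference<?> r = queue.remove(100);   // wait up to 100 ms
                if (r == ref) return true;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("get() is null: " + getAlwaysReturnsNull());
        System.out.println("enqueued after GC: " + enqueuedAfterGc(10));
    }
}
```

The reference object itself stays reachable until whoever owns the queue processes and drops it, so the memory behind it is not fully released by the collection that discovers the dead referent.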
Extended Knowledge
Older versions of the Druid pool also lack keep-alive; newer versions offer a "KeepAlive" option.
In Druid, the configured validationQuery is often not executed because the MySQL driver validates via pingInternal.
Both DBCP and Druid hand out idle connections in last-in, first-out order by default; under low load, only the most recently returned connections are reused, while the others sit idle and are repeatedly evicted and recreated.
Phantom references require two GC cycles to be reclaimed; if they reside in the old generation, two Full GCs are needed, increasing GC pressure.
A related concept is the finalize method: objects that override it also need more than one GC cycle to be reclaimed.
CMS default MaxTenuringThreshold is 6, while ParallelGC and G1 default to 15.
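The tenuring threshold in effect for a running JVM can be checked rather than assumed. The sketch below reads it through com.sun.management.HotSpotDiagnosticMXBean, which is specific to HotSpot-based JDKs; the class name TenuringThreshold is illustrative.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

final class TenuringThreshold {
    /** Returns the MaxTenuringThreshold value of the current JVM (HotSpot only). */
    static int current() {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return Integer.parseInt(bean.getVMOption("MaxTenuringThreshold").getValue());
    }

    public static void main(String[] args) {
        System.out.println("MaxTenuringThreshold = " + current());
    }
}
```

A lower threshold means objects survive fewer young collections before promotion, so under frequent young GCs an idle connection reaches the old generation sooner with CMS defaults than with ParallelGC or G1 defaults.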
JD Cloud Developers
JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.
