Why Does Elasticsearch OOM at Night? A Deep Dive into GC Tuning and Parameter Optimization
This case study examines a recurring out‑of‑memory issue in an Elasticsearch 6.8 cluster that spikes during low‑traffic night hours, analyzes GC logs and heap dumps, and details how adjusting G1 GC parameters resolved the problem and stabilized performance.
Background
The company's Elasticsearch (ES) cluster, version 6.8.0, supports critical services such as membership, marketing, and orders. On May 19 at 01:00, a node experienced an OOM and auto‑restarted, prompting an urgent investigation because the cluster underpins all business lines.
Problem Observation
Monitoring revealed that between 23:00 and 07:00 the cluster’s response time fluctuated dramatically, with frequent slow‑GC and OOM logs, while daytime performance remained stable.
Investigation Steps
The initial hypothesis linked the issue to a recent business change; however, discussions showed that the only change was a data‑import service being taken offline, not a new feature going live.
Analyzed heap dumps: unlike typical OOM cases caused by massive aggregations, the dumps showed many normal objects and large long[] arrays allocated in the old generation.
Identified that certain queries triggered heavy SegmentReader reads, leading to large allocations in the old heap.
Compared two dumps taken before and after the spike; no new large objects appeared, confirming the issue lay in GC behavior rather than object size.
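One way to reason about why large long[] arrays end up in the old generation: G1 treats any object larger than half a region as "humongous" and allocates it outside the young generation. The following is a minimal sketch of that size check, not the team's tooling. The 16‑byte array header and the region sizes in the loop are assumptions, and the function names are hypothetical.

```python
# Sketch: estimate when a Java long[] crosses G1's humongous threshold.
# Assumptions (not from the article): a 16-byte array header, 8 bytes per
# long element, and a region size supplied by the caller.

ARRAY_HEADER_BYTES = 16  # assumed header size for a long[] on a 64-bit JVM
LONG_BYTES = 8


def long_array_bytes(length: int) -> int:
    """Approximate heap footprint of a long[] with `length` elements."""
    return ARRAY_HEADER_BYTES + LONG_BYTES * length


def is_humongous(object_bytes: int, region_bytes: int) -> bool:
    """True if the object is larger than half a G1 region."""
    return object_bytes > region_bytes // 2


def max_regular_long_array(region_bytes: int) -> int:
    """Largest long[] length that stays at or below half a region."""
    return (region_bytes // 2 - ARRAY_HEADER_BYTES) // LONG_BYTES


if __name__ == "__main__":
    MIB = 1024 * 1024
    for region_mib in (8, 16, 32):  # illustrative region sizes
        limit = max_regular_long_array(region_mib * MIB)
        print(f"{region_mib} MiB regions: long[] above {limit} elements is humongous")
```

The actual region size of a running node can be confirmed from the JVM's own startup or GC logging rather than assumed.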
GC Root Cause
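Slow‑GC logs consistently contained "to-space exhausted" messages. According to Oracle's G1 GC documentation, this marks an evacuation failure: during a collection pause, G1 runs out of free regions to copy surviving or promoted objects into, which typically happens when the old generation is nearly full. Recovery is expensive and often ends in a full GC that stalls the JVM.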
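A quick way to locate these events is to scan the GC log for the marker string. The sketch below is illustrative only: the sample lines are made up in the style of JDK 8 G1 logging (the article does not show its logs), and the pause‑time regex assumes the ", N secs]" suffix of that format.

```python
import re
from typing import Iterable, List, Optional

# Marker discussed in the text, as it appears in G1 logs.
TO_SPACE_EXHAUSTED = re.compile(r"to-space exhausted")
# Assumed JDK 8 style pause suffix, e.g. ", 1.2345678 secs]".
PAUSE_SECS = re.compile(r"(\d+\.\d+) secs\]")


def find_evacuation_failures(lines: Iterable[str]) -> List[dict]:
    """Return one record per log line that reports to-space exhaustion."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if TO_SPACE_EXHAUSTED.search(line):
            match = PAUSE_SECS.search(line)
            pause: Optional[float] = float(match.group(1)) if match else None
            hits.append({"line": number, "pause_secs": pause, "text": line.strip()})
    return hits


# Made-up sample lines for illustration; not taken from the cluster's logs.
SAMPLE_LOG = [
    "1234.567: [GC pause (G1 Evacuation Pause) (young), 0.0312345 secs]",
    "1238.900: [GC pause (G1 Evacuation Pause) (young) (to-space exhausted), 1.8765432 secs]",
    "1242.233: [Full GC (Allocation Failure)  28G->12G(31G), 14.1234567 secs]",
]

if __name__ == "__main__":
    for hit in find_evacuation_failures(SAMPLE_LOG):
        print(f"line {hit['line']}: pause {hit['pause_secs']} s")
```

On a real node the same function could be fed an open log file, since it accepts any iterable of lines.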
Further analysis with jstat -gc <pid> 1000 1000 (one sample per second, 1000 samples) showed Eden and Old usage growing together with few collections in between. Large long[] allocations (> regionSize/2) bypass the young generation and are placed directly in the old generation as humongous objects. When the old space neared capacity, a to-space exhausted event triggered a costly full GC, which appeared as the observed slow‑GC spikes.
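The pattern described above (Eden and Old both growing while the GC counters stay flat) can be picked out of captured jstat output programmatically. This is a rough sketch, assuming the JDK 8 column layout of jstat -gc (EU, OU, OC, YGC, FGC and so on, sizes in KB). The sample values are invented for illustration and the helper names are hypothetical.

```python
from typing import Dict, List

# Invented sample in the JDK 8 `jstat -gc` column layout (sizes in KB).
SAMPLE_OUTPUT = """\
S0C S1C S0U S1U EC EU OC OU MC MU CCSC CCSU YGC YGCT FGC FGCT GCT
0.0 16384.0 0.0 16384.0 8388608.0 2097152.0 16777216.0 8388608.0 90000.0 85000.0 12000.0 11000.0 100 5.000 0 0.000 5.000
0.0 16384.0 0.0 16384.0 8388608.0 6291456.0 16777216.0 12582912.0 90000.0 85000.0 12000.0 11000.0 100 5.000 0 0.000 5.000
0.0 16384.0 0.0 16384.0 8388608.0 1048576.0 16777216.0 12582912.0 90000.0 85000.0 12000.0 11000.0 101 5.050 0 0.000 5.050
"""


def parse_jstat_gc(output: str) -> List[Dict[str, float]]:
    """Parse `jstat -gc` text into one dict per sample, keyed by column name."""
    rows = [line.split() for line in output.strip().splitlines() if line.strip()]
    header, samples = rows[0], rows[1:]
    return [dict(zip(header, map(float, sample))) for sample in samples]


def old_gen_percent(sample: Dict[str, float]) -> float:
    """Old-generation occupancy as a percentage of its current capacity."""
    return 100.0 * sample["OU"] / sample["OC"]


def grew_without_gc(prev: Dict[str, float], cur: Dict[str, float]) -> bool:
    """True if Eden and Old both grew and no young or full GC ran in between."""
    no_gc = cur["YGC"] == prev["YGC"] and cur["FGC"] == prev["FGC"]
    return no_gc and cur["EU"] > prev["EU"] and cur["OU"] > prev["OU"]


if __name__ == "__main__":
    samples = parse_jstat_gc(SAMPLE_OUTPUT)
    for prev, cur in zip(samples, samples[1:]):
        flag = "  <- grew without GC" if grew_without_gc(prev, cur) else ""
        print(f"old gen {old_gen_percent(cur):.1f}%{flag}")
```

Newer JDKs add columns to jstat -gc; because the parser keys on the header row, extra numeric columns should not affect the fields used here.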
Night‑time spikes were explained by lower overall traffic: daytime traffic generates many small GCs that continuously reclaim old‑generation space, whereas at night the lack of such activity lets old space fill up.
Verification
A test program injected a large volume of data into the cluster during a low‑traffic window on May 20. While the injection ran, no slow‑GC logs or memory spikes appeared; once it stopped, the spikes returned, confirming that regular GC activity keeps the old generation from filling up.
Parameter Optimization
Goal: accelerate old‑generation reclamation and avoid to‑space exhausted.
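Adjusted G1GC settings:
-XX:MaxGCPauseMillis=100 – encourages more frequent, shorter GCs (the default is 200).
-XX:G1MaxNewSizePercent=40 – reduces the maximum young‑generation size (the default is 60), ensuring the old generation retains sufficient free space. This is an experimental flag, so it must be preceded by -XX:+UnlockExperimentalVMOptions.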
Other parameters such as InitiatingHeapOccupancyPercent were considered but rejected because too low a value would cause excessive concurrent and mixed GCs, increasing pause times.
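To make the effect of the young‑generation cap concrete, the arithmetic can be sketched as follows. The 30 GiB heap is a hypothetical figure chosen for illustration (the article does not state the heap size); 60 is G1's default for G1MaxNewSizePercent and 40 is the tuned value.

```python
GIB = 1024 ** 3


def max_young_bytes(heap_bytes: int, max_new_size_percent: int) -> int:
    """Upper bound on young-generation size for a given G1MaxNewSizePercent."""
    return heap_bytes * max_new_size_percent // 100


def room_outside_young(heap_bytes: int, max_new_size_percent: int) -> int:
    """Heap guaranteed to remain for old and humongous regions when young is at its cap."""
    return heap_bytes - max_young_bytes(heap_bytes, max_new_size_percent)


if __name__ == "__main__":
    heap = 30 * GIB  # hypothetical heap size, not taken from the article
    for percent in (60, 40):  # G1 default vs. the tuned value
        young = max_young_bytes(heap, percent) / GIB
        rest = room_outside_young(heap, percent) / GIB
        print(f"G1MaxNewSizePercent={percent}: young <= {young:.0f} GiB, "
              f"at least {rest:.0f} GiB left for old/humongous regions")
```

Under these assumed numbers, lowering the cap from 60 to 40 raises the space that the young generation can never claim from 12 GiB to 18 GiB, which is the headroom the tuning relies on.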
Results
Post‑tuning monitoring showed:
Significant reduction in average request latency both day and night.
Lower latency spikes (90th percentile) and near‑zero error rates.
More stable heap usage with smoother curves.
Validation over three nights confirmed the expected performance improvements.
Conclusion
Even well‑behaved clusters can suffer night‑time OOM due to GC configuration. By understanding the "to‑space exhausted" symptom and tuning G1GC parameters to keep the old generation from filling, stability and latency were restored.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
