
Why Does Elasticsearch OOM at Night? A Deep Dive into GC Tuning and Parameter Optimization

This case study examines a recurring out‑of‑memory issue in an Elasticsearch 6.8 cluster that spikes during low‑traffic night hours, analyzes GC logs and heap dumps, and details how adjusting G1 GC parameters resolved the problem and stabilized performance.

dbaplus Community

Background

The company's Elasticsearch (ES) cluster, version 6.8.0, supports critical services such as membership, marketing, and orders. On May 19 at 01:00, a node experienced an OOM and auto‑restarted, prompting an urgent investigation because the cluster underpins all business lines.

Problem Observation

Monitoring revealed that between 23:00 and 07:00 the cluster’s response time fluctuated dramatically, with frequent slow‑GC and OOM logs, while daytime performance remained stable.

Investigation Steps

The initial hypothesis linked the issue to a recent business change. Discussions with the business teams showed, however, that the only recent change was a data‑import service being taken offline, not a new feature going live.

Analyzed heap dumps: unlike typical OOM cases caused by massive aggregations, the dumps showed many normal objects and large long[] arrays allocated in the old generation.

Identified that certain queries triggered heavy SegmentReader reads, leading to large allocations in the old heap.

Compared two dumps taken before and after the spike; no new large objects appeared, confirming the issue lay in GC behavior rather than object size.

GC Root Cause

Slow‑GC logs consistently contained "to-space exhausted" messages. According to Oracle's G1GC documentation, this message means that an evacuation pause could not find enough free regions to copy surviving or promoted objects into. The evacuation fails, and G1 typically falls back to a long stop‑the‑world full GC, which stalls the JVM.
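To see whether these events cluster at night, the GC log can be bucketed by hour. The following is a minimal sketch, assuming JDK 8 style GC logs written with -XX:+PrintGCDateStamps; the sample lines and their dates are made up for illustration and are not taken from the incident's logs.

```python
import re
from collections import Counter

# Matches the ISO-8601 date stamp at the start of a JDK 8 GC log line
# (written when -XX:+PrintGCDateStamps is set) and captures the hour of day.
STAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T(\d{2}):\d{2}:\d{2}")


def exhausted_by_hour(lines):
    """Count 'to-space exhausted' events per hour of day (0-23)."""
    counts = Counter()
    for line in lines:
        if "to-space exhausted" not in line:
            continue
        match = STAMP.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Illustrative lines only (hypothetical dates and timings).
sample = [
    "2020-01-01T00:58:41.120+0800: 5021.3: [GC pause (G1 Evacuation Pause) (young) (to-space exhausted), 1.9312 secs]",
    "2020-01-01T01:00:07.554+0800: 5107.8: [GC pause (G1 Evacuation Pause) (young) (to-space exhausted), 2.4051 secs]",
    "2020-01-01T01:12:30.002+0800: 5850.2: [GC pause (G1 Evacuation Pause) (mixed) (to-space exhausted), 2.1007 secs]",
    "2020-01-01T14:20:11.870+0800: 53111.0: [GC pause (G1 Evacuation Pause) (young), 0.0312 secs]",
]

for hour, n in sorted(exhausted_by_hour(sample).items()):
    print(f"{hour:02d}:00  {n}")
```

In practice the function would be fed the lines of the node's GC log file instead of the sample list.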

Further analysis with jstat -gc <pid> 1000 1000 (one sample per second, 1,000 samples) showed Eden and Old growing at the same time, with few GCs in between. The Old growth came from the large long[] arrays: an object larger than half a G1 region (> regionSize/2) is a humongous allocation, which G1 places directly in old‑generation regions rather than in Eden. When the old space neared capacity, a to-space exhausted event triggered a costly full GC, which showed up as the observed slow‑GC spikes.
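The size threshold can be made concrete with a little arithmetic. The sketch below is illustrative only: the 16 MiB region size is a hypothetical value for -XX:G1HeapRegionSize (the article does not state the cluster's region size), and the 16-byte array header assumes a 64-bit JVM with compressed class pointers.

```python
MiB = 1024 * 1024


def long_array_bytes(length, header_bytes=16):
    """Approximate heap size of a Java long[]: header plus 8 bytes per element.

    header_bytes=16 is an assumption (64-bit JVM, compressed class pointers).
    """
    return header_bytes + 8 * length


def is_humongous(object_bytes, region_bytes):
    """True if the object is larger than half a G1 region."""
    return object_bytes > region_bytes // 2


def regions_needed(object_bytes, region_bytes):
    """Number of contiguous regions the object spans (ceiling division)."""
    return -(-object_bytes // region_bytes)


region = 16 * MiB  # hypothetical region size, not taken from the article
for length in (100_000, 1_000_000, 5_000_000):
    size = long_array_bytes(length)
    print(length, size, is_humongous(size, region), regions_needed(size, region))
```

With these assumptions, a long[] of one million elements (about 8 MB) stays just under the threshold, while one of five million elements (about 40 MB) is humongous and spans three regions.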

Night‑time spikes were explained by lower overall traffic: daytime traffic generates many small GCs that continuously reclaim old‑generation space, whereas at night the lack of such activity lets old space fill up.
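One way to check this explanation against jstat -gc output is to compare two samples: if the young-GC counter (YGC) barely moves while old-generation usage (OU) keeps climbing, the node is in the quiet-night pattern described above. The sketch below assumes the standard jstat -gc column names, with sizes reported in KB; the sample values are invented for illustration.

```python
def parse_jstat_gc(text):
    """Parse `jstat -gc` output into a list of {column name: float} rows."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, map(float, row))) for row in rows]


def summarize(first, last):
    """Compare two samples: GCs run in between and old-generation growth."""
    return {
        "young_gcs": int(last["YGC"] - first["YGC"]),
        "full_gcs": int(last["FGC"] - first["FGC"]),
        "old_growth_kb": last["OU"] - first["OU"],
        "old_used_pct": 100.0 * last["OU"] / last["OC"],
    }


# Made-up samples for illustration; real input comes from `jstat -gc <pid> ...`.
sample = """
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
 0.0   65536.0  0.0   65536.0 8388608.0 1048576.0 23068672.0 16777216.0 90000.0 85000.0 12000.0 11000.0   1200   60.000   0      0.000   60.000
 0.0   65536.0  0.0   65536.0 8388608.0 6291456.0 23068672.0 20971520.0 90000.0 85000.0 12000.0 11000.0   1200   60.000   0      0.000   60.000
"""

rows = parse_jstat_gc(sample)
print(summarize(rows[0], rows[-1]))
```

Parsing by header name rather than by column position keeps the sketch tolerant of JDK versions that add extra columns.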

Verification

A test program injected a large volume of data into the cluster during a low‑traffic window on May 20. While the injection ran, no slow‑GC logs or memory spikes appeared; once it stopped, the spikes returned. This confirmed that steady allocation, and the regular GC activity it triggers, keeps the old generation from filling up.

Parameter Optimization

Goal: accelerate old‑generation reclamation and avoid to‑space exhausted.

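Adjusted G1GC settings:

-XX:MaxGCPauseMillis=100: sets a lower pause target (the G1 default is 200 ms), which encourages more frequent, shorter GCs.

-XX:G1MaxNewSizePercent=40: lowers the maximum young‑generation size from its default of 60% of the heap, ensuring the old generation retains sufficient free space. This is an experimental flag, so the JVM accepts it only together with -XX:+UnlockExperimentalVMOptions.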

Other parameters such as InitiatingHeapOccupancyPercent were considered but rejected because too low a value would cause excessive concurrent and mixed GCs, increasing pause times.

Results

Post‑tuning monitoring showed:

Significant reduction in average request latency both day and night.

Lower latency spikes (90th percentile) and near‑zero error rates.

More stable heap usage with smoother curves.

Validation over three nights confirmed the expected performance improvements.

Conclusion

Even well‑behaved clusters can suffer night‑time OOM due to GC configuration. By understanding the "to‑space exhausted" symptom and tuning G1GC parameters to keep the old generation from filling, stability and latency were restored.
