Root Cause Analysis and GC Parameter Optimization for Elasticsearch OOM Issues in the Membership Service
This article details a comprehensive investigation of an out‑of‑memory crash in a critical Elasticsearch cluster, explains how GC logs and heap dumps revealed a to‑space‑exhausted condition, and describes the G1GC tuning parameters that eliminated the nightly spikes and stabilized performance.
Background: The membership service relies on an Elasticsearch (ES) cluster (version 6.8.0) that experienced an OOM event at 01:00 on May 19, causing a node to restart and affecting all business lines.
Problem Observation: Monitoring showed large latency spikes between 23:00 and 07:00, accompanied by slow-GC and OOM logs. Heap dumps contained mostly ordinary objects but also a large number of long[] arrays allocated directly in the old generation.
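To correlate the latency spikes with GC activity, the pause durations can be pulled straight out of the GC log. The sketch below assumes a JDK 8-style G1 log (the format produced by -XX:+PrintGCDetails with -XX:+PrintGCDateStamps); the sample lines and the 100 ms threshold are illustrative, not taken from the incident itself.

```python
import re

# Match JDK 8-style G1 pause lines, e.g.
#   2021-05-19T01:03:40+0800: 1442.7: [GC pause (G1 Evacuation Pause) (mixed), 0.8765432 secs]
PAUSE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\S*: .*"
    r"\[GC pause .*?, (?P<secs>\d+\.\d+) secs\]"
)

def long_pauses(lines, threshold_s=0.1):
    """Yield (timestamp, seconds) for every GC pause longer than the threshold."""
    for line in lines:
        m = PAUSE_RE.search(line)
        if m and float(m.group("secs")) > threshold_s:
            yield m.group("ts"), float(m.group("secs"))

# Illustrative log lines, not from the real incident.
sample = [
    "2021-05-19T01:00:12+0800: 1234.5: [GC pause (G1 Evacuation Pause) (young), 0.0123456 secs]",
    "2021-05-19T01:03:40+0800: 1442.7: [GC pause (G1 Evacuation Pause) (mixed), 0.8765432 secs]",
]
for ts, secs in long_pauses(sample):
    print(ts, secs)
```

Plotting the yielded (timestamp, duration) pairs makes the 23:00-07:00 pause clusters visible at a glance.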
Investigation Steps:
The first hypothesis linked the issue to a recent business change, but that change turned out to be a service de-registration rather than a new feature, so it was ruled out.
The second hypothesis focused on the heap dumps; the large long[] arrays pointed to heavy segment reads triggering allocation directly in the old generation.
The third hypothesis returned to GC behavior, and a closer read of the G1GC logs revealed repeated "to-space exhausted" messages.
Root Cause Analysis: "To-space exhausted" is G1's evacuation failure: during a young-generation collection the collector cannot find enough free regions to copy surviving and promoted objects into because the old generation is nearly full, so it falls back to a far more expensive collection, producing the long pauses and slow-GC logs that were observed. Night-time traffic was too light to drive frequent young collections (and the mixed collections they trigger), so the old generation gradually filled until evacuation failed.
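The pattern described above can be confirmed by counting evacuation failures per hour in the GC log. The sketch below again assumes JDK 8-style G1 log lines with date stamps; the sample lines are illustrative, but the "(to-space exhausted)" marker is exactly what G1 appends to a pause line when it runs out of free regions to evacuate into.

```python
import re
from collections import Counter

# Capture the date-plus-hour prefix of a log line, e.g. "2021-05-19T01".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}):")

def exhausted_by_hour(lines):
    """Count G1 evacuation failures ("to-space exhausted") per hour."""
    hours = Counter()
    for line in lines:
        if "to-space exhausted" in line:
            m = TS_RE.match(line)
            if m:
                hours[m.group(1)] += 1
    return hours

# Illustrative log lines, not from the real incident.
sample = [
    "2021-05-19T00:58:02+0800: [GC pause (G1 Evacuation Pause) (young) (to-space exhausted), 1.2345 secs]",
    "2021-05-19T01:00:31+0800: [GC pause (G1 Evacuation Pause) (mixed) (to-space exhausted), 2.3456 secs]",
    "2021-05-19T01:05:10+0800: [GC pause (G1 Evacuation Pause) (young), 0.0123 secs]",
]
print(exhausted_by_hour(sample))
```

A histogram concentrated in the overnight hours, matching the latency spikes, is strong evidence that evacuation failure, not application allocation alone, is the proximate cause.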
Solution – GC Parameter Tuning:
Reduce -XX:MaxGCPauseMillis from the default 200 ms to 100 ms so that G1 collects more often, starting mixed GCs earlier and reclaiming old-generation regions sooner.
Adjust -XX:G1MaxNewSizePercent from the default 60% to 40% so the young generation stays smaller, limiting how much can be promoted to the old generation in a single collection.
These changes avoid triggering "to‑space exhausted" and keep the old generation from filling up.
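In an Elasticsearch deployment these flags would typically be applied through config/jvm.options and rolled out node by node. A minimal sketch of the change (the surrounding lines are assumptions about this particular setup, not the incident's actual file):

```
## config/jvm.options (sketch; keep the cluster's existing -Xms/-Xmx settings)
-XX:+UseG1GC
-XX:MaxGCPauseMillis=100
-XX:G1MaxNewSizePercent=40
```

Note that -XX:G1MaxNewSizePercent is an experimental flag, so it must be preceded by -XX:+UnlockExperimentalVMOptions on the JVM command line for the setting to take effect.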
Results: After applying the new parameters and validating over three nights of load testing, the cluster showed smoother heap usage, lower latency, reduced spikes, and near‑zero error rates. Monitoring graphs confirmed stable heap consumption and improved response times.
Conclusion: Systematic log analysis, dump inspection, and targeted GC tuning resolved the nightly OOM issue, demonstrating the importance of thorough root‑cause investigation and careful JVM parameter configuration for critical backend services.
Tongcheng Travel Technology Center