Why Elasticsearch Stalled: Uncovering Hidden STW Safepoint Issues in an ARM JDK

A detailed investigation of a slow Elasticsearch cluster traced the root cause to massive Stop‑The‑World (STW) safepoint pauses triggered by a buggy ARM‑based JDK build. Switching to the official Kona JDK removed the frequent ForceSafepoint interruptions and restored performance.

Tencent Cloud Middleware

The author investigated an Elasticsearch cluster that became extremely slow after a few hours of operation, with queries timing out until the service was restarted. Initial suspicion fell on the large number of Lucene Merge Threads seen in thread dumps from three nodes (referred to as 39, 158 and 211).

Thread Dump Analysis

Thread counts per node showed:

Node 39: 366 total threads, 264 RUNNABLE, 64 WAITING, 28 TIMED_WAITING.

Node 158: 341 total threads, 221 RUNNABLE, 88 WAITING, 32 TIMED_WAITING.

Node 211: 282 total threads, 162 RUNNABLE, 92 WAITING, 28 TIMED_WAITING.
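Per-state counts like these can be reproduced from a raw dump with a few lines of scripting. The following is a minimal sketch, assuming the dump is in the text format produced by jstack, where each Java thread is followed by a `java.lang.Thread.State:` line. The sample excerpt, including the node name `node-a`, is hypothetical and only illustrates the layout.

```python
import re
from collections import Counter

# Matches the state line jstack prints under each Java thread header.
STATE_RE = re.compile(r"^\s*java\.lang\.Thread\.State:\s+(\w+)")


def count_thread_states(dump_text):
    """Return a Counter of thread states found in a jstack-style dump."""
    states = Counter()
    for line in dump_text.splitlines():
        match = STATE_RE.match(line)
        if match:
            states[match.group(1)] += 1
    return states


# Hypothetical three-thread excerpt, for illustration only.
SAMPLE_DUMP = '''\
"elasticsearch[node-a][search][T#1]" #71 daemon prio=5 os_prio=0 tid=0x0000ffff8c0a1000 nid=0x1a2b runnable [0x0000ffff5c1fe000]
   java.lang.Thread.State: RUNNABLE
    at java.util.HashMap.hash(HashMap.java:339)

"elasticsearch[node-a][refresh][T#2]" #72 daemon prio=5 os_prio=0 tid=0x0000ffff8c0a2000 nid=0x1a2c waiting on condition [0x0000ffff5c0fd000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)

"elasticsearch[node-a][scheduler][T#1]" #73 daemon prio=5 os_prio=0 tid=0x0000ffff8c0a3000 nid=0x1a2d waiting on condition [0x0000ffff5bffc000]
   java.lang.Thread.State: TIMED_WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
'''

if __name__ == "__main__":
    for state, count in sorted(count_thread_states(SAMPLE_DUMP).items()):
        print(state, count)
```

JVM-internal threads (GC workers, the VM thread) carry no state line, so the sum of these counts can be lower than the total number of threads in the dump.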

A further breakdown by thread pool showed that node 39 had 77 Lucene Merge Threads, while the other two nodes had none. Further analysis also showed lock contention on ExpiringCache#put and heavy HashMap#hash calculations.
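A breakdown by thread pool can be approximated by grouping thread names. The sketch below assumes Elasticsearch-style thread names of the form `elasticsearch[<node>][<pool>][T#<n>]` and merge threads whose names contain `Lucene Merge Thread`. The helper names `pool_of` and `count_pools`, and the sample headers, are hypothetical.

```python
import re
from collections import Counter

# A jstack thread header starts with the quoted thread name.
HEADER_RE = re.compile(r'^"(.+?)"\s')
# Assumed Elasticsearch pool thread naming: elasticsearch[node][pool][T#n].
ES_POOL_RE = re.compile(r"^elasticsearch\[[^\]]*\]\[([^\]]+)\]\[T#\d+\]$")


def pool_of(thread_name):
    """Map a thread name to a coarse pool label."""
    if "Lucene Merge Thread" in thread_name:
        return "Lucene Merge Thread"
    match = ES_POOL_RE.match(thread_name)
    if match:
        return match.group(1)
    # Fallback: collapse numbers so "worker-1" and "worker-2" group together.
    return re.sub(r"\d+", "N", thread_name)


def count_pools(dump_text):
    """Count threads per pool label in a jstack-style dump."""
    pools = Counter()
    for line in dump_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            pools[pool_of(match.group(1))] += 1
    return pools


# Hypothetical thread headers, for illustration only.
SAMPLE_HEADERS = '''\
"elasticsearch[node-a][[logs-1][0]: Lucene Merge Thread #3]" #201 daemon prio=5 os_prio=0 tid=0x0000ffff8c0b1000 nid=0x2a01 runnable
"elasticsearch[node-a][[logs-2][1]: Lucene Merge Thread #7]" #202 daemon prio=5 os_prio=0 tid=0x0000ffff8c0b2000 nid=0x2a02 runnable
"elasticsearch[node-a][search][T#1]" #71 daemon prio=5 os_prio=0 tid=0x0000ffff8c0a1000 nid=0x1a2b runnable
"elasticsearch[node-a][search][T#2]" #72 daemon prio=5 os_prio=0 tid=0x0000ffff8c0a2000 nid=0x1a2c runnable
"VM Thread" os_prio=0 tid=0x0000ffff8c001000 nid=0x1001 runnable
'''

if __name__ == "__main__":
    for pool, count in count_pools(SAMPLE_HEADERS).most_common():
        print(pool, count)
```

Running the same grouping on dumps from each node makes an outlier, such as one node holding all the merge threads, easy to spot.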

Environment Investigation

The cluster consisted of three nodes with 500+ indices, each holding about 70 active shards. Write traffic was low (from a few KB to a few MB per minute), which led to many small segment files and frequent Flush operations.

GC Log and STW Investigation

GC logs showed very low GC frequency, so the focus shifted to STW (Stop‑The‑World) pauses recorded via -XX:+PrintGCApplicationStoppedTime. Initial STW logs displayed short pause times (≈0.0003 s), but the frequency of pauses was unusually high.

Statistical analysis of the STW logs showed a baseline of about 5 seconds of total pause time per minute, rising to 20‑30 seconds per minute during the incident window. With application threads stopped for a third to half of every minute, CPU usage appeared low even though the thread dumps showed most threads as RUNNABLE.
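The per-minute totals can be computed directly from the GC log. The following is a minimal sketch, assuming JDK 8 style log lines written with -XX:+PrintGCDateStamps and -XX:+PrintGCApplicationStoppedTime, where each pause is reported as "Total time for which application threads were stopped". The sample lines and the function name `pauses_per_minute` are hypothetical.

```python
import re
from collections import defaultdict

# Captures the timestamp truncated to the minute and the pause in seconds.
STOPPED_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}\.\d+[+-]\d{4}: "
    r".*Total time for which application threads were stopped: "
    r"(\d+\.\d+) seconds"
)


def pauses_per_minute(lines):
    """Return {minute: (pause_count, total_pause_seconds)}."""
    stats = defaultdict(lambda: [0, 0.0])
    for line in lines:
        match = STOPPED_RE.match(line)
        if match:
            minute, seconds = match.group(1), float(match.group(2))
            stats[minute][0] += 1
            stats[minute][1] += seconds
    return {minute: (count, total) for minute, (count, total) in stats.items()}


# Hypothetical log lines, for illustration only.
SAMPLE_LOG = [
    "2021-06-01T10:15:02.123+0800: 3605.120: Total time for which application threads were stopped: 0.0003410 seconds, Stopping threads took: 0.0000520 seconds",
    "2021-06-01T10:15:02.125+0800: 3605.122: Total time for which application threads were stopped: 0.0002980 seconds, Stopping threads took: 0.0000410 seconds",
    "2021-06-01T10:15:40.000+0800: 3643.000: [GC pause (G1 Evacuation Pause) (young), 0.0123456 secs]",
    "2021-06-01T10:15:59.900+0800: 3662.900: Total time for which application threads were stopped: 0.0004100 seconds, Stopping threads took: 0.0000600 seconds",
    "2021-06-01T10:16:00.050+0800: 3663.050: Total time for which application threads were stopped: 0.0003000 seconds, Stopping threads took: 0.0000500 seconds",
]

if __name__ == "__main__":
    for minute, (count, total) in sorted(pauses_per_minute(SAMPLE_LOG).items()):
        print(f"{minute}  pauses={count}  stopped={total:.6f}s")
```

Bucketing on the timestamp string avoids time zone handling; a minute whose total approaches tens of seconds is the pattern described above.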

Safepoint Type Identification

Enabling -XX:+PrintSafepointStatistics and related options showed that most pauses were of type ForceSafepoint, not the expected biased‑locking or GC‑related safepoints.
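To see which safepoint types dominate, the statistics output can be tallied by operation name. This is a minimal sketch, assuming the JDK 8 layout of -XX:+PrintSafepointStatistics, in which each record starts with a JVM uptime stamp followed by the VM operation name and bracketed thread and time columns. The sample records are hypothetical.

```python
import re
from collections import Counter

# "<uptime>: <vm operation name>   [ threads ... ]   [ time ... ]  <traps>"
# The name may contain spaces (for example "no vm operation").
VMOP_RE = re.compile(r"^\s*\d+\.\d+:\s+(.+?)\s+\[")


def count_vm_operations(lines):
    """Count safepoint records per VM operation name."""
    ops = Counter()
    for line in lines:
        match = VMOP_RE.match(line)
        if match:
            ops[match.group(1)] += 1
    return ops


# Hypothetical records, for illustration only.
SAMPLE_STATS = [
    "         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count",
    "12.345: ForceSafepoint                   [     366          0              0    ]      [     0     0     0     0     0    ]  0",
    "12.346: ForceSafepoint                   [     366          1              1    ]      [     0     0     0     0     0    ]  0",
    "13.001: RevokeBias                       [     366          0              0    ]      [     0     0     0     0     0    ]  0",
    "14.500: no vm operation                  [     366          0              0    ]      [     0     0     0     0     0    ]  0",
    "15.200: G1IncCollectionPause             [     366          0              0    ]      [     0     0     0     0    12    ]  0",
]

if __name__ == "__main__":
    for name, count in count_vm_operations(SAMPLE_STATS).most_common():
        print(name, count)
```

The column header line does not start with a timestamp, so it is skipped without special handling.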

ForceSafepoint is a catch‑all category, so its exact trigger was unclear. Comparing logs from different environments showed that the problematic cluster ran on an ARM‑based JDK of unknown provenance, while a similar x86 cluster did not exhibit the issue.
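A quick way to compare environments is to record the JVM's own vendor, version and architecture properties on each host. The sketch below assumes the `java -XshowSettings:properties -version` command, which prints system properties to standard error as `key = value` lines. The sample output, including the vendor string, is made up for illustration, and `read_jvm_properties` is a hypothetical helper that is defined but not called here.

```python
import subprocess

# Properties that usually distinguish one JDK build from another.
KEYS_OF_INTEREST = ("java.vendor", "java.runtime.version", "java.vm.name", "os.arch")


def parse_properties(text):
    """Parse 'key = value' lines from -XshowSettings:properties output."""
    props = {}
    for line in text.splitlines():
        if " = " in line:
            key, value = line.split(" = ", 1)
            props[key.strip()] = value.strip()
    return props


def read_jvm_properties(java_bin="java"):
    """Run the given java binary and return its system properties."""
    result = subprocess.run(
        [java_bin, "-XshowSettings:properties", "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    # The settings are written to stderr.
    return parse_properties(result.stderr)


# Made-up output, for illustration only.
SAMPLE_OUTPUT = """\
Property settings:
    java.runtime.version = 1.8.0_252-b09
    java.vendor = Example Vendor
    java.vm.name = OpenJDK 64-Bit Server VM
    os.arch = aarch64
"""

if __name__ == "__main__":
    props = parse_properties(SAMPLE_OUTPUT)
    for key in KEYS_OF_INTEREST:
        print(f"{key}: {props.get(key, '<missing>')}")
```

Collecting these four values from every node and from a known-good cluster gives a side-by-side view of JDK build and architecture differences.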

Root Cause and Resolution

After consulting the Kona JDK team, the author tested a debug‑instrumented ARM JDK build. It reduced the ForceSafepoint frequency dramatically but still showed higher pause counts than a proper release build. Replacing the problematic JDK with the official Kona JDK all but eliminated the pauses (down to single‑digit counts per minute) and restored normal query latency.

Key metrics before and after the fix:

Original JDK: 5 000‑18 000 STW pauses per minute, total pause time 10‑30 seconds.

Kona JDK: <10 pauses per minute, total pause time 100‑200 ms.
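The figures above can be put in perspective with some simple arithmetic: the share of each minute spent stopped, and the range the average pause length must fall in. This is a small sketch using only the numbers quoted in the list; the function names are hypothetical, and since the source does not say which pause count goes with which total, only bounds on the average are computed.

```python
def stopped_fraction(total_pause_seconds, window_seconds=60.0):
    """Fraction of the window during which application threads were stopped."""
    return total_pause_seconds / window_seconds


def mean_pause_bounds_ms(min_count, max_count, min_total_s, max_total_s):
    """Lower and upper bound on the mean pause length, in milliseconds."""
    lowest = min_total_s / max_count * 1000.0
    highest = max_total_s / min_count * 1000.0
    return lowest, highest


if __name__ == "__main__":
    # Original JDK: 10-30 s of pause per minute.
    print(f"original JDK: {stopped_fraction(10):.1%} to {stopped_fraction(30):.1%} of each minute stopped")
    # Kona JDK: 100-200 ms of pause per minute.
    print(f"Kona JDK: {stopped_fraction(0.1):.2%} to {stopped_fraction(0.2):.2%} of each minute stopped")
    # Original JDK: 5 000-18 000 pauses per minute.
    low, high = mean_pause_bounds_ms(5000, 18000, 10, 30)
    print(f"original JDK mean pause between {low:.2f} ms and {high:.2f} ms")
```

Each individual pause is short, so the damage comes from their frequency: thousands of millisecond-scale stops add up to a large share of wall-clock time.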

Takeaways

The investigation demonstrated the importance of thorough environment profiling (JDK version, architecture) when diagnosing performance anomalies. It also highlighted that large numbers of Merge threads can be a symptom rather than the cause, and that ForceSafepoint pauses can severely degrade throughput even when CPU usage appears low.

JVM options used to collect the GC, STW and safepoint logs:

-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-Xloggc:logs/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=32
-XX:GCLogFileSize=32m
-XX:+PrintSafepointStatistics
-XX:PrintSafepointStatisticsCount=10
-XX:+UnlockDiagnosticVMOptions
-XX:-DisplayVMOutput
-XX:+LogVMOutput
-XX:LogFile=<vm_log_path>