Diagnosing Elasticsearch node_concurrent_recoveries Slowness: Root Cause & Fix

A detailed investigation of an Elasticsearch timeout incident reveals how an overly aggressive node_concurrent_recoveries setting caused CPU saturation, disk I/O spikes, and shard relocation overload, and outlines the steps taken to isolate the faulty node and restore cluster performance.

Sohu Tech Products

Dec 6, 2023

This article documents a performance incident in an Elasticsearch 6.x cluster where read requests began timing out around 19:30, despite stable traffic and no recent deployments.

Initial diagnostics showed timeout errors in the business logs and no traffic spikes. The cluster comprised more than 30 data nodes.

Monitoring dashboards highlighted one node (referred to as instance A) with abnormal metrics: es.node.threadpool.search.queue pinned at 1000 (the default capacity of the search queue, so the queue was most likely full), es.node.threadpool.search.rejected above 100, CPU usage up by more than 50%, and an elevated count of completed searches, suggesting that the node hosted hot indices.
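A comparable check can be run directly against the cluster with the _cat thread pool API. The following is a minimal sketch rather than the tooling used during the incident: the function name, the ES_HOST placeholder and the two thresholds are assumptions. Note that the rejected column of _cat/thread_pool is a cumulative count since node start, so it is not directly comparable to a dashboard rate.

```shell
# Reads rows of "node_name queue rejected" and prints the names of nodes whose
# search queue or rejection count crosses the given limits.
# QUEUE_LIMIT and REJECT_LIMIT are illustrative defaults that mirror the
# figures quoted above; tune them for your own cluster.
saturated_search_nodes() {
  awk -v q="${QUEUE_LIMIT:-1000}" -v r="${REJECT_LIMIT:-100}" \
    '($2 + 0 >= q + 0) || ($3 + 0 > r + 0) { print $1 }'
}

# Usage against a live cluster (ES_HOST is a placeholder, e.g. http://localhost:9200):
#   curl -s "$ES_HOST/_cat/thread_pool/search?h=node_name,queue,rejected" | saturated_search_nodes
```

Keeping the parsing in a small function makes it easy to rerun while the incident is ongoing and to compare nodes side by side.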

Further host‑level analysis identified three machines (X, Y, Z). Machines X and Y showed sustained high disk I/O, while machine Z experienced a sharp CPU load increase (CPU idle dropped to single digits, load average rose fourfold) and disk I/O rose from 20% to ~50%.

Process snapshots on machine Z revealed two processes consuming over 2000% CPU each; one of them was the Elasticsearch process of instance A.

Hot thread and task inspections pointed to long-running shard relocation tasks on instance A:

curl -XGET /_nodes/xx.xx.xx.xx/hot_threads?pretty -s
curl -XGET '/_cat/tasks?v&s=store' -s | grep A

The cluster settings were then retrieved:

curl -XGET '/_cluster/settings?pretty' -s

They showed cluster.routing.allocation.node_concurrent_recoveries set to 5, well above the default of 2. This setting caps the number of concurrent incoming and outgoing shard recoveries on each node, so the higher value allowed a burst of shard relocations to run at once. The burst consumed excessive CPU and disk I/O on instance A, which in turn led to the observed request timeouts.
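The number of recoveries in flight on each node can also be read from the _cat recovery API, which makes the effect of node_concurrent_recoveries visible. This is a minimal sketch under stated assumptions: the helper name and the ES_HOST placeholder are hypothetical, and the column list is chosen so that target_node is the last field.

```shell
# Reads rows whose last column is target_node and prints "<count> <node>",
# with the busiest recovery target first.
recoveries_per_target() {
  awk 'NF > 0 { n[$NF]++ } END { for (t in n) print n[t], t }' | sort -k1,1nr -k2,2
}

# Usage against a live cluster (ES_HOST is a placeholder):
#   curl -s "$ES_HOST/_cat/recovery?active_only=true&h=index,shard,type,stage,source_node,target_node" \
#     | recoveries_per_target
```

A node that repeatedly tops this list while its search queue is full is a reasonable candidate for the kind of contention described here.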

To verify the hypothesis, the problematic node was excluded from allocation using:

curl -XPUT /_cluster/settings?pretty -H 'Content-Type:application/json' -d '{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "xx.xx.xx.xx"
  }
}'

After exclusion, request latency dropped dramatically and timeout errors decreased.
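Excluding a node by IP makes the allocator move its shards elsewhere, so the exclusion can be confirmed by watching the shard count on that address fall to zero. This is a minimal sketch, assuming a hypothetical helper name and an ES_HOST placeholder. Once the investigation is over, the exclusion can be lifted by setting cluster.routing.allocation.exclude._ip to null.

```shell
# Reads rows of "index shard prirep state ip" and prints how many shards are
# still located on the IP address passed as the first argument.
shards_remaining_on() {
  awk -v ip="$1" '$NF == ip { n++ } END { print n + 0 }'
}

# Usage against a live cluster (ES_HOST is a placeholder):
#   curl -s "$ES_HOST/_cat/shards?h=index,shard,prirep,state,ip" | shards_remaining_on xx.xx.xx.xx
```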

Permanent remediation recommendations include resetting cluster.routing.allocation.node_concurrent_recoveries to its default (2), optionally enabling cluster.routing.use_adaptive_replica_selection for smarter shard selection, and scaling or redistributing instances if resource contention persists.
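The first two recommendations could be applied in a single settings update. The sketch below makes several assumptions: the function names and the ES_HOST placeholder are hypothetical, the override is assumed to live in the transient block (use "persistent" if that is where the value 5 was set), and adaptive replica selection requires Elasticsearch 6.1 or later. Assigning null removes an override, so the setting falls back to its default of 2.

```shell
ES_HOST="${ES_HOST:-http://localhost:9200}"   # placeholder, not from the incident

# Prints the settings body: drop the recovery override and enable adaptive
# replica selection.
remediation_payload() {
  cat <<'JSON'
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": null,
    "cluster.routing.use_adaptive_replica_selection": true
  }
}
JSON
}

# Sends the body to the cluster settings API; call this explicitly when ready.
apply_remediation() {
  remediation_payload | curl -s -XPUT "$ES_HOST/_cluster/settings?pretty" \
    -H 'Content-Type: application/json' --data-binary @-
}
```

Separating the payload from the request makes it possible to review the exact change before running apply_remediation.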

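Additional best-practice notes cover the disk watermark settings (cluster.routing.allocation.disk.watermark.low, high and flood_stage). With the default values, Elasticsearch stops allocating new shards to a node above 85% disk usage, starts relocating shards away from a node above 90%, and at 95% places every index with a shard on that node into read-only mode.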
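Per-node disk usage from the _cat allocation API can be compared against these thresholds. This is a minimal sketch, assuming the default watermarks of 85%, 90% and 95%; the helper name and the ES_HOST placeholder are hypothetical, and a cluster with customized watermarks would need different cut-offs.

```shell
# Reads rows of "node disk.percent" and labels each node with the highest
# default watermark it has crossed. Rows without a percentage are skipped.
classify_disk_usage() {
  awk 'NF >= 2 {
    level = "ok"
    if ($2 + 0 >= 95) level = "flood_stage"
    else if ($2 + 0 >= 90) level = "high"
    else if ($2 + 0 >= 85) level = "low"
    print $1, $2, level
  }'
}

# Usage against a live cluster (ES_HOST is a placeholder):
#   curl -s "$ES_HOST/_cat/allocation?h=node,disk.percent" | classify_disk_usage
```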
