
Elasticsearch Cluster Recovery Pitfall: Excessive Shard Recovery Concurrency Leads to Cluster Hang

This article details a real‑world Elasticsearch cluster recovery issue where setting the shard recovery concurrency too high saturated the generic thread pool, causing the entire cluster to hang, and explains the underlying concepts, reproduction steps, analysis, and mitigation measures.

Tencent Database Technology

In this post we share an Elasticsearch (ES) troubleshooting experience that occurred during cluster recovery when the shard recovery concurrency was set excessively high, causing the cluster to become unresponsive.

1. Scenario Description

A Tencent Cloud ES cluster with 15 nodes, over 2700 indices, more than 15000 shards and several terabytes of data experienced a node restart. The cluster entered a yellow state with many unassigned shards. To speed up recovery, the default node_concurrent_recoveries value of 2 was increased to 100 using the following request:

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'{
    "persistent": {
        "cluster.routing.allocation.node_concurrent_recoveries": 100,
        "indices.recovery.max_bytes_per_sec": "40mb"
    }
}'

2. Symptoms

After the change, the number of unassigned shards initially dropped quickly, then stalled at a fixed value. All nodes' generic thread pools became saturated, preventing any new index creation or operations that rely on the generic pool, while existing reads/writes continued to work.
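These symptoms can be observed directly through the cat APIs. A sketch of the checks, assuming the cluster is reachable on localhost:9200:

```shell
# Count of unassigned shards -- in this incident it stalled at a fixed value
curl -s "localhost:9200/_cluster/health?filter_path=status,unassigned_shards&pretty"

# Per-node generic pool usage: active pinned at max with a growing queue means saturation
curl -s "localhost:9200/_cat/thread_pool/generic?v&h=node_name,name,active,queue,rejected"
```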

3. ES Thread Pools

Elasticsearch nodes have several thread pools, the most relevant being:

generic : handles background tasks such as node discovery and shard recovery; it is a scaling pool whose maximum size is processors * 4, bounded to the range [128, 512].

index : used for index/delete operations; default size equals the number of processors.

search : processes search requests; default size is int((processors * 3) / 2) + 1 .

get : handles get requests; default size equals the number of processors.

write : processes document indexing and bulk requests; default size equals the number of processors.

For a full list see the official ES documentation.
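The pool sizes a given node actually computed from its processor count can be inspected through the nodes info API; for example (field names per the 6.x API):

```shell
# Dump each node's thread pool configuration (type, min/max size, queue_size)
curl -s "localhost:9200/_nodes/thread_pool?filter_path=nodes.*.thread_pool&pretty"
```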

4. Shard Recovery Process

ES defines three cluster health states: green (all primary and replica shards assigned), yellow (all primaries assigned, some replicas unassigned), and red (some primaries unassigned). When a node fails, the cluster enters the shard recovery phase (PEER recovery) where the master updates cluster metadata and target nodes pull shard data from source nodes.

The PEER recovery concurrency is controlled by the following settings:

cluster.routing.allocation.node_concurrent_incoming_recoveries : max incoming recoveries per node.

cluster.routing.allocation.node_concurrent_outgoing_recoveries : max outgoing recoveries per node.

cluster.routing.allocation.node_concurrent_recoveries : sets both incoming and outgoing limits to the same value.

Setting these values too high can exhaust the generic thread pool.
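A more conservative approach is to raise only one direction a small amount and then verify the effective values. A sketch, where the value 4 is an illustrative choice, not a recommendation from the original incident:

```shell
# Raise only incoming recoveries, leaving outgoing at its default of 2
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'{
    "persistent": {
        "cluster.routing.allocation.node_concurrent_incoming_recoveries": 4
    }
}'

# Verify what is actually in effect, including defaults
curl -s "localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation*&pretty"
```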

5. Problem Reproduction and Analysis

A three‑node ES 6.4.3 cluster was built (1 CPU, 2 GB RAM per node) with 300 indices, each having 5 primary shards and 1 replica (total 3000 shards). The following setting increased the recovery concurrency to 200:

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'{
    "persistent": {
        "cluster.routing.allocation.node_concurrent_recoveries": 200 // set recovery concurrency
    }
}'
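The 300 test indices for the reproduction can be created with a simple loop. A minimal sketch, where the index name pattern and the localhost endpoint are assumptions:

```shell
# Create 300 indices with 5 primaries and 1 replica each (3000 shards total)
for i in $(seq 1 300); do
  curl -s -X PUT "localhost:9200/test-index-$i" \
    -H 'Content-Type: application/json' -d'{
      "settings": { "number_of_shards": 5, "number_of_replicas": 1 }
  }'
done
```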

After stopping one node to simulate a failure, shard recovery stalled, and the generic thread pool reached its maximum of 128 threads. Existing read/write operations continued, but any action requiring the generic pool (e.g., creating a new index) hung.

Stack traces showed all generic threads blocked in the PEER recovery phase. The recovery workflow involves multiple remote calls (steps 2, 4, 5, 7, 10, and 12 of the recovery flow) that block the calling thread on AQS-based synchronization objects until the remote response arrives. When the generic pools on both the source and target nodes are saturated, each side's requests queue up waiting for threads that are themselves blocked waiting on the other side, producing a distributed deadlock.
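The blocked threads can be confirmed with a thread dump of the ES JVM. A sketch, where the process lookup via jps is an assumption about how the node was started, and the -A 15 context window is arbitrary:

```shell
# Find the Elasticsearch JVM and dump its stacks, keeping generic-pool threads
# (named like elasticsearch[node-1][generic][T#1]) plus surrounding context
ES_PID=$(jps | awk '/Elasticsearch/ {print $1}')
jstack "$ES_PID" | grep -A 15 '\[generic\]'
```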

6. Mitigation

To avoid side effects, the Tencent Cloud ES service now rejects configuration changes that set the recovery concurrency beyond a safe threshold and informs the user. An issue has also been filed with the Elasticsearch project (https://github.com/elastic/elasticsearch/issues/36195) for further investigation.
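For operators hitting this on a self-managed cluster, one immediate remediation is to drop the concurrency back to the default. A sketch; setting the value to null restores the default of 2:

```shell
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'{
    "persistent": {
        "cluster.routing.allocation.node_concurrent_recoveries": null
    }
}'
```

Note that with the generic pools already saturated, the settings update may be slow to propagate; recoveries already in flight will still need to drain.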

7. Conclusion

The article demonstrates how an overly aggressive shard recovery concurrency setting can saturate the generic thread pool and cause a cluster-wide hang, providing detailed analysis and practical mitigation steps for ES operators.

Tags: Elasticsearch, troubleshooting, thread pool, cluster recovery, shard recovery
Written by

Tencent Database Technology

Tencent's Database R&D team supports internal services such as WeChat Pay, WeChat Red Packets, Tencent Advertising, and Tencent Music, and provides external support on Tencent Cloud for TencentDB products like CynosDB, CDB, and TDSQL. This public account aims to promote and share professional database knowledge, growing together with database enthusiasts.
