How to Diagnose and Fix Elasticsearch Throttling Allocation Issues
This guide explains how to use the Elasticsearch GET /_cluster/allocation/explain API to identify throttling deciders, interpret the underlying allocation limits, and adjust persistent or transient cluster settings, such as cluster.routing.allocation.node_concurrent_recoveries and indices.recovery.max_bytes_per_sec, to resolve shard allocation bottlenecks.
Use the GET /_cluster/allocation/explain API to see why a shard is unassigned or is not being moved. For each candidate node, the response lists a "deciders" array; a "throttling" decider with decision "THROTTLE" indicates that the node has reached a limit on concurrent shard recoveries, for example the outgoing limit defined by the setting cluster.routing.allocation.node_concurrent_outgoing_recoveries (default 2).
If the "throttling" decider returns THROTTLE, it usually means that a node's recovery concurrency limit has been hit. When cluster resource utilization is low, you can raise the recovery concurrency settings to speed up shard allocation; when utilization is high, consider lowering them.
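As a rough illustration of reading the explain output, the Python sketch below picks out THROTTLE decisions from a response that has already been parsed into a dict. The field names follow the general shape of the allocation explain response (node_allocation_decisions, deciders, decider, decision, explanation). The sample payload is hand-written and abbreviated for illustration, its explanation text is a placeholder rather than real cluster output, and the function name is an assumption of this sketch.

```python
def find_throttled_nodes(explain_response):
    """Return (node_name, decider, explanation) for every THROTTLE decision."""
    hits = []
    for node in explain_response.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "THROTTLE":
                hits.append((
                    node.get("node_name"),
                    decider.get("decider"),
                    decider.get("explanation"),
                ))
    return hits


# Hand-written, abbreviated sample (NOT captured from a real cluster).
SAMPLE_EXPLAIN = {
    "index": "my-index",
    "shard": 0,
    "primary": False,
    "current_state": "unassigned",
    "can_allocate": "throttled",
    "node_allocation_decisions": [
        {
            "node_name": "node-1",
            "node_decision": "throttled",
            "deciders": [
                {
                    "decider": "throttling",
                    "decision": "THROTTLE",
                    "explanation": "placeholder explanation text",
                }
            ],
        },
        {
            "node_name": "node-2",
            "node_decision": "no",
            "deciders": [
                {
                    "decider": "same_shard",
                    "decision": "NO",
                    "explanation": "placeholder explanation text",
                }
            ],
        },
    ],
}

for name, decider, explanation in find_throttled_nodes(SAMPLE_EXPLAIN):
    print(f"{name}: {decider} -> {explanation}")
```

Only node-1 is reported here, because node-2 is rejected by a different decider with a NO decision rather than being throttled.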
Key allocation settings include:
cluster.routing.allocation.node_initial_primaries_recoveries: number of concurrent initial primary shard recoveries per node (default 4).
cluster.routing.allocation.cluster_concurrent_rebalance: number of concurrent shard rebalances across the cluster (default 2).
cluster.routing.allocation.node_concurrent_recoveries: shortcut that sets both the incoming and the outgoing recovery limit per node.
cluster.routing.allocation.node_concurrent_incoming_recoveries: concurrent incoming recoveries per node (default 2).
cluster.routing.allocation.node_concurrent_outgoing_recoveries: concurrent outgoing recoveries per node (default 2).
indices.recovery.max_bytes_per_sec: bandwidth limit for recovery traffic per node (default 40mb).
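Before changing any of these settings, it can help to check which value is currently in effect. The sketch below assumes you have already fetched GET /_cluster/settings?include_defaults=true&flat_settings=true and parsed the JSON into a dict with "transient", "persistent", and "defaults" sections. It looks a key up in that order, since transient values override persistent ones, which override the reported defaults. The function name and all sample values are illustrative assumptions, not output from a real cluster.

```python
def effective_setting(settings_response, key):
    """Return (scope, value) for the first scope that defines `key`."""
    for scope in ("transient", "persistent", "defaults"):
        value = settings_response.get(scope, {}).get(key)
        if value is not None:
            return scope, value
    return None, None


# Illustrative sample only; with flat_settings=true the values are strings.
SAMPLE_SETTINGS = {
    "persistent": {"indices.recovery.max_bytes_per_sec": "60mb"},
    "transient": {"cluster.routing.allocation.node_concurrent_recoveries": "8"},
    "defaults": {
        "cluster.routing.allocation.node_concurrent_recoveries": "2",
        "cluster.routing.allocation.cluster_concurrent_rebalance": "2",
        "indices.recovery.max_bytes_per_sec": "40mb",
    },
}

for key in (
    "cluster.routing.allocation.node_concurrent_recoveries",
    "cluster.routing.allocation.cluster_concurrent_rebalance",
    "indices.recovery.max_bytes_per_sec",
):
    scope, value = effective_setting(SAMPLE_SETTINGS, key)
    print(f"{key} = {value} (from {scope})")
```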
Solution
Adjust the relevant settings as needed. Increase the initial primary recovery count if appropriate, but keep the rebalance count modest so that rebalancing does not degrade read/write performance. As a rule of thumb, keep the concurrent recovery and rebalance values at or below the number of CPU cores on a node.
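The rule of thumb about CPU cores can be applied mechanically. The following sketch caps every integer-valued setting at a core count that you supply and leaves other values, such as the bandwidth string, untouched. The helper name is hypothetical, and the core count must be that of the Elasticsearch data node, not of the machine running the script.

```python
def cap_to_cores(proposed, cores):
    """Cap integer concurrency values at `cores`; pass other values through."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    capped = {}
    for key, value in proposed.items():
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, int) and not isinstance(value, bool):
            capped[key] = min(value, cores)
        else:
            capped[key] = value
    return capped


proposed = {
    "cluster.routing.allocation.node_concurrent_recoveries": 8,
    "cluster.routing.allocation.cluster_concurrent_rebalance": 8,
    "indices.recovery.max_bytes_per_sec": "60mb",
}

# Assumes a data node with 4 CPU cores (an example figure).
print(cap_to_cores(proposed, cores=4))
```

With 4 cores, both concurrency values drop from 8 to 4 while the bandwidth limit stays at "60mb".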
Persistent settings survive a full cluster restart, while transient settings are reset by one; if the same setting is present in both, the transient value takes precedence. Use the PUT _cluster/settings API to apply changes. Example request body:
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": 8,
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 8,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 8,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 8,
    "cluster.routing.allocation.cluster_concurrent_rebalance": 8,
    "indices.recovery.max_bytes_per_sec": "60mb"
  },
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 8,
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": 8,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 8,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": 8,
    "cluster.routing.allocation.cluster_concurrent_rebalance": 8,
    "indices.recovery.max_bytes_per_sec": "60mb"
  }
}
If you only need a temporary change, modify only the transient settings.
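For a temporary change, the request body can contain only the transient block. The sketch below builds such a body and sends it to PUT /_cluster/settings using only the Python standard library. It assumes an unauthenticated cluster reachable at a base URL you provide (http://localhost:9200 in the usage note is a placeholder), and both function names are hypothetical. Setting a key to None serializes as JSON null, which resets that setting.

```python
import json
import urllib.request


def build_settings_body(settings, scope="transient"):
    """Wrap `settings` in a single 'transient' or 'persistent' block."""
    if scope not in ("transient", "persistent"):
        raise ValueError("scope must be 'transient' or 'persistent'")
    return {scope: dict(settings)}


def apply_settings(base_url, body):
    """Send the body to PUT /_cluster/settings and return the parsed reply.

    Assumes no authentication or TLS configuration is required.
    """
    request = urllib.request.Request(
        url=base_url.rstrip("/") + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Temporary increase: only the transient block is sent.
temporary = build_settings_body({
    "cluster.routing.allocation.node_concurrent_recoveries": 8,
    "indices.recovery.max_bytes_per_sec": "60mb",
})

# Later, reset the same keys by sending null values.
reset = build_settings_body({
    "cluster.routing.allocation.node_concurrent_recoveries": None,
    "indices.recovery.max_bytes_per_sec": None,
})

print(json.dumps(temporary, indent=2))
print(json.dumps(reset, indent=2))
```

To apply it against a test cluster you would call, for example, apply_settings("http://localhost:9200", temporary). The call is left out of the block so that the example runs without a cluster.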