
Why Your Elasticsearch Cluster Stalls at Red and How to Recover It Fast

A foreign enterprise's Elasticsearch cluster holding 10 TB of data in 200 shards stayed red after a restart. This article walks through the diagnosis and a step-by-step recovery plan: shard-level actions, the recovery API, concurrent-recovery tuning, delayed allocation, recovery speed limits, and, as a last resort, manual index deletion.

dbaplus Community

Apr 24, 2023

Background

This write-up comes from a WeChat discussion and a Tencent Meeting call about a foreign enterprise's Elasticsearch cluster (version 7.17.4) holding roughly 10 TB of data across 200 primary shards. More than 20 hours after a restart, the cluster was still red, recovery was stuck at around 30 %, and Kibana could not connect.

Symptoms and Errors

The most visible symptom was startup time: a restart that previously took about 8 hours now never completed. The logs showed the following error trace:

Caused by: org.elasticsearch.action.UnavailableShardsException: [.monitoring-kibana-7-2023.01.17][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-kibana-7-2023.01.17][0]] containing [2] requests]
... 11 more

Additional warning logs showed unexpected errors while indexing monitoring documents.

Proposed Shard‑Level Actions

Cancel the allocation of the problematic shards.

Move shards to another node.

Allocate the shards to a node.

Update number_of_replicas to 2.

Consider other custom actions.
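The move, cancel, and allocate actions above correspond to commands of the cluster reroute API (POST _cluster/reroute). The following is a minimal Python sketch that only builds the request body; it does not send anything. The helper names and the node names node-1 and node-2 are hypothetical, and the three commands are combined into one body purely to show their shapes.

```python
import json

# Index named in the error trace above.
INDEX = ".monitoring-kibana-7-2023.01.17"


def move(index, shard, from_node, to_node):
    """Build a 'move' command: relocate a started shard copy to another node."""
    return {"move": {"index": index, "shard": shard,
                     "from_node": from_node, "to_node": to_node}}


def cancel(index, shard, node, allow_primary=False):
    """Build a 'cancel' command: cancel allocation of a shard copy on a node."""
    return {"cancel": {"index": index, "shard": shard,
                       "node": node, "allow_primary": allow_primary}}


def allocate_replica(index, shard, node):
    """Build an 'allocate_replica' command: assign an unassigned replica."""
    return {"allocate_replica": {"index": index, "shard": shard, "node": node}}


def reroute_body(*commands):
    """Wrap commands in the body expected by POST _cluster/reroute."""
    return {"commands": list(commands)}


# node-1 and node-2 are placeholder node names, not taken from the incident.
body = reroute_body(
    move(INDEX, 0, "node-1", "node-2"),
    cancel(INDEX, 0, "node-1"),
    allocate_replica(INDEX, 0, "node-2"),
)
print(json.dumps(body, indent=2))
```

In practice only the one command that fits the situation would be sent. Note that allocate_replica applies to replicas only; forcing an unassigned primary requires allocate_stale_primary or allocate_empty_primary with accept_data_loss set to true, which can lose data.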

Root Cause Findings

Cluster planning was poor: more than 10 TB of data sat on only two nodes, both left with the default roles.

Some indices were far too large for their shard count, for example a single 600 GB index.

Developers and ops were unaware of the number of shards and replicas configured.

The cluster had previously taken 8 hours to start; the issue became critical during the Chinese New Year holiday.

These factors left the cluster stuck in red, unable to allocate primary shards.

Recovery Strategies

1. Use the Recovery API

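Run the following request to list ongoing and completed shard recoveries. The h parameter selects the columns by alias (index, shard, time, type, stage, source host, target host, files, files percent, bytes, bytes percent), and s sorts by index name in descending order: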

GET _cat/recovery?v=true&h=i,s,t,ty,st,shost,thost,f,fp,b,bp&s=index:desc
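With 200 shards the table gets long, so it can help to filter it down to the recoveries that are still running. The sketch below assumes the same request was made with format=json added and that each row is a JSON object keyed by the short aliases requested in h; the sample rows are invented for illustration and are not data from the incident.

```python
def unfinished(rows):
    """Return (index, shard, stage, bytes_percent) for recoveries not yet done,
    sorted so the least complete recovery comes first."""
    pending = []
    for row in rows:
        if row["st"] != "done":
            percent = float(row["bp"].rstrip("%"))  # "31.5%" -> 31.5
            pending.append((row["i"], int(row["s"]), row["st"], percent))
    return sorted(pending, key=lambda item: item[3])


# Invented sample rows in the assumed format=json shape.
sample = [
    {"i": "logs-a", "s": "0", "st": "done", "bp": "100.0%"},
    {"i": "logs-b", "s": "1", "st": "index", "bp": "31.5%"},
    {"i": "logs-c", "s": "0", "st": "translog", "bp": "100.0%"},
]

for index, shard, stage, percent in unfinished(sample):
    print(f"{index}[{shard}] stage={stage} bytes={percent}%")
```

Running this periodically and comparing the output shows whether a recovery is progressing slowly or is genuinely stuck.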

2. Tune Concurrent Recoveries

Increase the number of concurrent incoming and outgoing recoveries:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 3
  }
}

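The default is 2 per node; raising it can speed up recovery if CPU, disk, and network resources allow. Because the setting is transient, a full cluster restart clears it.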

3. Delayed Allocation Policy

Delay the re-allocation of replica shards that become unassigned when a node leaves the cluster (the default delay is 1m):

PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "6m"
  }
}

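While the timer runs, the master does not rebuild the missing replicas on other nodes, so the cluster stays yellow. If the departed node returns within the window, its existing shard copies are reused and a costly full re-allocation is avoided. The setting does not protect primaries: an index whose only primary copy was on the departed node still turns red.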

4. Limit Recovery Speed to Prevent Overload

Adjust the maximum recovery bandwidth:

PUT _cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}

The default is 40mb per node, so 100mb raises the limit rather than lowering it. Lower values protect cluster stability; higher values speed up recovery but may exhaust disk and network resources.
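To get a feel for what the limit means at this scale, a rough back-of-the-envelope calculation helps. The sketch below assumes, pessimistically, that the full 10 TB has to be copied through the throttle on one node; the limit applies to peer and snapshot recoveries, so the real figure depends on how much data is actually copied between nodes. The function name is illustrative.

```python
TB = 1024 ** 4  # Elasticsearch byte units are powers of 1024
MB = 1024 ** 2


def recovery_hours(total_bytes, max_bytes_per_sec, nodes=1):
    """Lower bound on transfer time if every node runs at the full limit."""
    seconds = total_bytes / (max_bytes_per_sec * nodes)
    return seconds / 3600


print(round(recovery_hours(10 * TB, 40 * MB), 1))   # default limit: 72.8
print(round(recovery_hours(10 * TB, 100 * MB), 1))  # raised limit: 29.1
```

Even under these simplified assumptions, the gap between roughly 73 and 29 hours shows why the limit is worth checking before a long recovery, provided the disks and network can sustain the higher rate.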

Index‑Deletion Acceleration

When large, obsolete indices exist, physically deleting them can reduce startup pressure. The steps are:

Identify the index UUID to delete, e.g., znUfwfE3Rt22GMMqANMbQQ, via GET _cat/indices?v&s=docs.count:desc.

Locate the index files on the Elasticsearch data path, e.g., ./indices/znUfwfE3Rt22GMMqANMbQQ.

Back up the data, then remove the directory manually.

Restart the cluster.

This method was tested on a single‑node cluster and worked, but the author warns it should be a last resort and only performed after backup.
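Steps 2 and 3 can be combined by moving the index directory into a backup location instead of deleting it outright, which leaves a way back. The sketch below is a hedged illustration, not the author's procedure: the function name is hypothetical, the 22-character UUID check is an assumption based on the example UUID above, and the demo runs against a temporary directory rather than a real data path. It should only ever be run while the Elasticsearch node is stopped.

```python
import re
import shutil
import tempfile
from pathlib import Path

# Assumption: index UUIDs are 22 URL-safe characters, like the example above.
UUID_RE = re.compile(r"^[A-Za-z0-9_-]{22}$")


def quarantine_index_dir(indices_dir, index_uuid, backup_dir):
    """Move <indices_dir>/<index_uuid> into backup_dir and return the new path."""
    if not UUID_RE.match(index_uuid):
        raise ValueError(f"not a plausible index UUID: {index_uuid!r}")
    src = Path(indices_dir) / index_uuid
    if not src.is_dir():
        raise FileNotFoundError(src)
    dest = Path(backup_dir) / index_uuid
    if dest.exists():
        raise FileExistsError(dest)
    Path(backup_dir).mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dest))
    return dest


# Demo against a throwaway directory tree, not a real Elasticsearch data path.
with tempfile.TemporaryDirectory() as tmp:
    indices = Path(tmp) / "indices"
    (indices / "znUfwfE3Rt22GMMqANMbQQ").mkdir(parents=True)
    moved = quarantine_index_dir(indices, "znUfwfE3Rt22GMMqANMbQQ",
                                 Path(tmp) / "backup")
    print(moved.name)
```

The UUID check also guards against a mistyped argument such as a relative path, so the function cannot move anything other than a direct child of the indices directory.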

Conclusion

The cluster's prolonged red state stemmed from poor shard and replica planning, oversized indices, and a lack of operational awareness. Monitoring progress with the recovery API, raising concurrent recoveries, using delayed allocation, and setting the recovery bandwidth limit with care can restore health without drastic measures. Manual index deletion should be considered only when it is safe and after thorough verification.

Key take‑aways: maintain proper shard sizing, use ILM for large indices, keep an Elasticsearch‑savvy person on‑call, and validate changes in a test environment before applying to production.
