Tagged articles

Cluster Recovery

11 articles

Oct 12, 2025 · Operations

When etcd Certificates Expire: How One Failure Crippled an Entire Kubernetes Cluster

A midnight alarm traced a cascade of failures across a Kubernetes cluster to an expired etcd TLS certificate. The resulting full outage took over half an hour to diagnose, remediate, and recover from, underscoring the need for proactive certificate management and automated monitoring.

Cluster Recovery · Etcd · certificate expiration

0 likes · 44 min read

MaGe Linux Operations

Jul 23, 2025 · Operations

How We Rescued a Crashed K8s Cluster: etcd 100% Fragmentation Recovery

This article details a P0 production incident where a Kubernetes cluster became completely unresponsive due to 100% etcd database fragmentation, describing the step‑by‑step diagnosis, emergency recovery actions, root‑cause analysis, and long‑term preventive measures for reliable cluster operation.

Cluster Recovery · Etcd · Monitoring

0 likes · 12 min read

dbaplus Community

Mar 5, 2024 · Operations

How to Recover a Failing Elasticsearch Cluster: Master Loss, Shard Corruption, and More

This guide explains Elasticsearch cluster architecture, node roles, and metadata storage, then details step‑by‑step recovery procedures for master‑node loss, complete master outage, data‑node failures, shard allocation problems, corrupted shards, translog issues, and missing segment files, including relevant API commands and tool usage.

Cluster Recovery · Data Node · Elasticsearch

0 likes · 17 min read

Sohu Tech Products

Feb 21, 2024 · Operations

Troubleshooting and Recovery of ZooKeeper Election Port Failure in a Codis Cache Cluster

When a ZooKeeper observer was added to a Codis cache cluster, the election port (3888) proved unreachable because the QuorumCnxManager listener thread had vanished. After telnet and log checks, the team recovered by performing a rolling upgrade to ZooKeeper 3.4.13, rebuilding the data directory, performing a rolling restart, and decommissioning the temporary node, which restored full cluster quorum and normal Codis‑Proxy operation.

Cluster Recovery · QuorumCnxManager · Zookeeper

0 likes · 10 min read

Zhuanzhuan Tech

Feb 7, 2024 · Operations

Recovering a ZooKeeper Cluster with Codis: Diagnosis, Testing, and Migration Strategies

This article details a real‑world investigation of a ZooKeeper election‑port failure that prevented adding observer nodes to a Codis cache cluster, outlines systematic connectivity checks, log analysis, and two migration plans, and finally presents step‑by‑step procedures for rolling upgrades, configuration adjustments, and successful cluster restoration.

Cluster Recovery · Codis · Zookeeper

0 likes · 12 min read

dbaplus Community

Apr 24, 2023 · Operations

Why Your Elasticsearch Cluster Stalls at Red and How to Recover It Fast

An Elasticsearch cluster at a large foreign enterprise, holding 10 TB of data across 200 shards, got stuck in a red state after a restart. The article gives a detailed diagnosis and a step‑by‑step recovery plan covering shard actions, recovery API tuning, delayed allocation, speed limits, and cautious index deletion to restore normal operation.

Cluster Recovery · Index management · Recovery API

0 likes · 10 min read

dbaplus Community

Mar 7, 2023 · Operations

How We Rescued a ClickHouse Logging Cluster After Zookeeper‑Induced Read‑Only Failure

A production logging system became unavailable, and Kafka backlog alerts prompted an investigation that uncovered read‑only ClickHouse tables caused by mismatched ZooKeeper metadata after a TTL policy change. Recovery proceeded step by step through ZooKeeper restarts, metadata fixes, and table reconstruction.

ClickHouse · Cluster Recovery · Flink

0 likes · 9 min read

Xiaolei Talks DB

Mar 16, 2022 · Operations

How to Recover a TiKV Cluster After Multiple Node Failures

This article demonstrates how to simulate and recover TiKV cluster failures by shutting down one, two, or three nodes, explains the impact on Raft groups and region availability, and provides step‑by‑step commands for disabling PD scheduling, using tikv‑ctl, and restoring data integrity.

Cluster Recovery · Data loss · PD

0 likes · 28 min read

Ops Development Stories

Feb 25, 2022 · Operations

Recovering a Ceph 16 Cluster After System Disk Failure

This guide walks through the step‑by‑step process of restoring a Ceph 16 cluster when a node's system disk fails, covering host removal, node re‑initialization, Docker and Cephadm installation, host addition, labeling, OSD recreation, and final verification.

Ceph · Cluster Recovery · Operations

0 likes · 7 min read

Tencent Database Technology

Feb 27, 2019 · Operations

Elasticsearch Cluster Recovery Pitfall: Excessive Shard Recovery Concurrency Leads to Cluster Hang

This article details a real‑world Elasticsearch cluster recovery issue where setting the shard recovery concurrency too high saturated the generic thread pool, causing the entire cluster to hang, and explains the underlying concepts, reproduction steps, analysis, and mitigation measures.

Cluster Recovery · shard-recovery · thread-pool

0 likes · 10 min read

MaGe Linux Operations

Aug 12, 2014 · Operations

How to Detect and Recover from RabbitMQ Network Partitions

This article explains why RabbitMQ clusters struggle with network partitions, how to detect partition events via logs and rabbitmqctl, the impact on queues and bindings, and step‑by‑step methods—including manual recovery commands and automatic handling modes—to restore a healthy cluster.

Cluster Recovery · Operations · RabbitMQ

0 likes · 7 min read
