Tagged articles
11 articles
Ops Community
Oct 12, 2025 · Operations

When etcd Certificates Expire: How One Failure Crippled an Entire Kubernetes Cluster

A midnight alarm revealed that an expired etcd TLS certificate had set off a cascade of failures across a Kubernetes cluster. The resulting full outage took more than half an hour to diagnose, remediate, and restore, underscoring the need for proactive certificate management and automated monitoring.

Cluster Recovery · Kubernetes · certificate expiration
44 min read
MaGe Linux Operations
Jul 23, 2025 · Operations

How We Rescued a Crashed K8s Cluster: etcd 100% Fragmentation Recovery

This article details a P0 production incident where a Kubernetes cluster became completely unresponsive due to 100% etcd database fragmentation, describing the step‑by‑step diagnosis, emergency recovery actions, root‑cause analysis, and long‑term preventive measures for reliable cluster operation.

Cluster Recovery · Kubernetes · Operations
12 min read
dbaplus Community
Mar 5, 2024 · Operations

How to Recover a Failing Elasticsearch Cluster: Master Loss, Shard Corruption, and More

This guide explains Elasticsearch cluster architecture, node roles, and metadata storage, then details step‑by‑step recovery procedures for master‑node loss, complete master outage, data‑node failures, shard allocation problems, corrupted shards, translog issues, and missing segment files, including relevant API commands and tool usage.

Cluster Recovery · Data Node · Elasticsearch
17 min read
Sohu Tech Products
Feb 21, 2024 · Operations

Troubleshooting and Recovery of ZooKeeper Election Port Failure in a Codis Cache Cluster

When a ZooKeeper observer was being added to a Codis cache cluster, the election port (3888) was unreachable because the QuorumCnxManager listener thread had vanished, a fault investigated with telnet and log checks. Recovery succeeded through a rolling upgrade to ZooKeeper 3.4.13, a rebuild of the data directory, a rolling restart, and decommissioning of the temporary node, which restored full cluster quorum and normal Codis‑Proxy operation.

Cluster Recovery · QuorumCnxManager · Version Upgrade
10 min read
Zhuanzhuan Tech
Feb 7, 2024 · Operations

Recovering a ZooKeeper Cluster with Codis: Diagnosis, Testing, and Migration Strategies

This article details a real‑world investigation of a ZooKeeper election‑port failure that prevented adding observer nodes to a Codis cache cluster, outlines systematic connectivity checks, log analysis, and two migration plans, and finally presents step‑by‑step procedures for rolling upgrades, configuration adjustments, and successful cluster restoration.

Cluster Recovery · Codis · Log Management
12 min read
dbaplus Community
Apr 24, 2023 · Operations

Why Your Elasticsearch Cluster Stalls at Red and How to Recover It Fast

A large Elasticsearch cluster at a foreign enterprise, with 10 TB of data and 200 shards, got stuck in a red state after a restart. The article gives a detailed diagnosis and a step‑by‑step recovery plan covering shard actions, recovery API tuning, delayed allocation, speed limits, and cautious index deletion to restore normal operation.

Cluster Recovery · Index Management · Recovery API
10 min read
dbaplus Community
Mar 7, 2023 · Operations

How We Rescued a ClickHouse Logging Cluster After ZooKeeper‑Induced Read‑Only Failure

After Kafka backlog alerts signaled that a production logging system had become unavailable, an investigation uncovered read‑only ClickHouse tables caused by mismatched ZooKeeper metadata following a TTL policy change. Recovery proceeded step by step through ZooKeeper restarts, metadata fixes, and table reconstruction.

ClickHouse · Cluster Recovery · Flink
9 min read
Xiaolei Talks DB
Mar 16, 2022 · Operations

How to Recover a TiKV Cluster After Multiple Node Failures

This article demonstrates how to simulate and recover TiKV cluster failures by shutting down one, two, or three nodes, explains the impact on Raft groups and region availability, and provides step‑by‑step commands for disabling PD scheduling, using tikv‑ctl, and restoring data integrity.

Cluster Recovery · Data loss · PD
28 min read
Ops Development Stories
Feb 25, 2022 · Operations

Recovering a Ceph 16 Cluster After System Disk Failure

This guide walks through the step‑by‑step process of restoring a Ceph 16 cluster when a node's system disk fails, covering host removal, node re‑initialization, Docker and Cephadm installation, host addition, labeling, OSD recreation, and final verification.

Ceph · Cluster Recovery · Operations
7 min read
Tencent Database Technology
Feb 27, 2019 · Operations

Elasticsearch Cluster Recovery Pitfall: Excessive Shard Recovery Concurrency Leads to Cluster Hang

This article details a real‑world Elasticsearch cluster recovery issue where setting the shard recovery concurrency too high saturated the generic thread pool, causing the entire cluster to hang, and explains the underlying concepts, reproduction steps, analysis, and mitigation measures.

Cluster Recovery · shard-recovery · thread-pool
10 min read
MaGe Linux Operations
Aug 12, 2014 · Operations

How to Detect and Recover from RabbitMQ Network Partitions

This article explains why RabbitMQ clusters struggle with network partitions, how to detect partition events via logs and rabbitmqctl, the impact on queues and bindings, and step‑by‑step methods—including manual recovery commands and automatic handling modes—to restore a healthy cluster.

Backend · Cluster Recovery · Operations
7 min read