Tagged articles
2 articles
dbaplus Community
Nov 24, 2025 · Operations

How We Rescued a Critical etcd Outage in 4 Hours: Step‑by‑Step Recovery Guide

A midnight Kubernetes incident caused API server timeouts, failing etcd health checks, and a full service outage. The article covers the investigation, the root cause (massive etcd database fragmentation), and the four-stage emergency recovery that restored the cluster within 4 hours, and it closes with preventive measures.

Kubernetes · Operations · database fragmentation
10 min read
MaGe Linux Operations
Jul 23, 2025 · Operations

How We Rescued a Crashed K8s Cluster: etcd 100% Fragmentation Recovery

This article details a P0 production incident where a Kubernetes cluster became completely unresponsive due to 100% etcd database fragmentation, describing the step‑by‑step diagnosis, emergency recovery actions, root‑cause analysis, and long‑term preventive measures for reliable cluster operation.

Cluster Recovery · Kubernetes · Operations
12 min read