How We Recovered from a Critical etcd Outage in 4 Hours: Step‑by‑Step Recovery Guide
A late‑night Kubernetes failure caused API server timeouts, etcd health‑check failures, and a full service outage. This write‑up covers the investigation, the root cause (massive etcd database fragmentation), the four‑stage emergency recovery that restored the cluster in about 4 hours, and the preventive measures adopted afterward.
Incident Overview
At 03:17 on December 15, 2024, a critical alert chain reported a Kubernetes API server timeout, etcd health‑check failures, abnormal pod states, and a complete business service shutdown. The author, an operator with eight years of experience, recognized the severity immediately.
Initial Investigation
After SSHing into the master node, requests to the API server were refused: both kubectl get nodes and kubectl cluster-info timed out, confirming that the API server was unresponsive. The etcd service reported a degraded state, and health checks timed out on all three nodes.
Deep Diagnosis
Log inspection with journalctl -u etcd -n 100 showed warnings about entries taking too long to apply, along with "database space exceeded" errors. Endpoint status reported DB sizes of 8.2‑8.3 GB on each node, far above the few hundred megabytes seen in normal operation.
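Fragmentation analysis, using etcdctl endpoint hashkv --cluster together with size calculations, showed an actual data size of 156 MB against a file size of 8.4 GB. That gives a fragmentation rate of (8.4 GB - 156 MB) / 8.4 GB ≈ 98.1%: almost the entire database file was space no longer holding live data.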
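The fragmentation figure can be reproduced with a small helper. This is a minimal sketch, not the author's tooling: the function name frag_rate is hypothetical, and it assumes both sizes are given in the same unit (here MB, taking 8.4 GB as 8400 MB).

```shell
#!/bin/bash
# frag_rate TOTAL IN_USE
# Prints the share of the database file that holds no live data, as a
# percentage with one decimal place. Both arguments must use the same unit.
frag_rate() {
    awk -v total="$1" -v used="$2" 'BEGIN {
        if (total <= 0) { print "0.0"; exit }   # avoid division by zero
        printf "%.1f\n", (total - used) / total * 100
    }'
}

# Figures from the incident: 8.4 GB file, 156 MB of live data
frag_rate 8400 156   # prints 98.1
```

On a live cluster, the two inputs would normally come from the dbSize and dbSizeInUse fields reported by etcdctl endpoint status --write-out=json.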
Root Cause Analysis
Three main factors were identified:
Frequent pod‑restart storms: over 2.8 million pod state changes in the past 24 hours.
Historical version accumulation: approximately 1.84 million keys with no compaction, leading to massive version bloat.
Misconfiguration: auto‑compaction disabled (auto‑compaction‑retention: "0") and quota‑backend‑bytes set to 8 GB, which was exceeded.
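The misconfiguration in the last item is easy to check for ahead of time. The sketch below assumes etcd is started from a flat YAML config file with top-level "key: value" lines; the helper names config_value and compaction_disabled, and the example path, are hypothetical. It only recognizes a missing value or a literal 0 as "disabled".

```shell
#!/bin/bash
# config_value FILE KEY
# Prints the value of a top-level "key: value" line, with quotes and spaces removed.
config_value() {
    local file=$1 key=$2
    sed -n "s/^${key}:[[:space:]]*//p" "$file" | tr -d "\"' " | head -n 1
}

# compaction_disabled FILE
# Succeeds (exit 0) when auto-compaction-retention is missing or set to 0,
# which is how etcd represents "auto-compaction off".
compaction_disabled() {
    local retention
    retention=$(config_value "$1" auto-compaction-retention)
    [ -z "$retention" ] || [ "$retention" = "0" ]
}

# Example usage (the path is an assumption; adjust to your deployment):
#   if compaction_disabled /etc/etcd/etcd.conf.yml; then
#       echo "WARNING: etcd auto-compaction is disabled"
#   fi
```

If etcd is configured through command-line flags in a systemd unit or a static pod manifest instead, the same check would have to read the --auto-compaction-retention flag from there.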
Emergency Recovery Strategy
The recovery was executed in four stages:
Temporary storage expansion (15 min): increased the quota limit and restarted etcd.
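# Temporarily raise the quota limit to 12 GiB (12884901888 bytes).
# quota-backend-bytes is an etcd server setting, not a key in the store, so it
# cannot be changed with "etcdctl put". Set it in the etcd configuration
# (quota-backend-bytes: 12884901888, or the --quota-backend-bytes flag), then:
systemctl restart etcd
Manual compaction (45 min): captured the current revision and compacted the history, retaining the latest 1,000 revisions.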
# Get current revision
rev=$(etcdctl endpoint status --write-out="json" | jq '.[0].Status.header.revision')
echo "Current revision: $rev"
# Compact keeping last 1000 revisions
etcdctl compact $((rev-1000))
watch 'etcdctl endpoint status --write-out=table'
Defragmentation (180 min): iterated over each etcd endpoint and ran defragmentation, pausing between nodes.
for endpoint in 10.0.1.10:2379 10.0.1.11:2379 10.0.1.12:2379; do
    echo "Defragmenting $endpoint..."
    etcdctl --endpoints="$endpoint" defrag
    sleep 60
done
Service verification (23 min): confirmed API server health, node readiness, and that no non‑running pods remained.
# Verify API server
kubectl cluster-info
# Verify nodes
kubectl get nodes
# Verify pods
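# Count pods not in Running state (--no-headers keeps the header row out of the count)
kubectl get pods --all-namespaces --no-headers | grep -v Running | wc -l
The database size dropped from 8.4 GB to roughly 180 MB, and all services resumed normal operation.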
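Two details in the stages above are easy to get wrong under pressure: the compaction target when the revision is small, and the pod count, which also picks up pods that finished normally. The helpers below are a hedged sketch rather than the commands used during the incident. The names compaction_target and count_unhealthy_pods are hypothetical, and the pod counter assumes the default column order of kubectl get pods --all-namespaces --no-headers (NAMESPACE, NAME, READY, STATUS, ...).

```shell
#!/bin/bash
# compaction_target REVISION [KEEP]
# Prints the revision to pass to "etcdctl compact" so that the latest KEEP
# revisions (default 1000) are retained. Fails if there is nothing to compact.
compaction_target() {
    local rev=$1 keep=${2:-1000}
    if (( rev <= keep )); then
        echo "revision $rev is not above keep=$keep; nothing to compact" >&2
        return 1
    fi
    echo $(( rev - keep ))
}

# count_unhealthy_pods
# Reads "kubectl get pods --all-namespaces --no-headers" output on stdin and
# counts pods whose STATUS (4th column) is neither Running nor Completed.
count_unhealthy_pods() {
    awk '$4 != "Running" && $4 != "Completed" { n++ } END { print n + 0 }'
}

# Example usage on a live cluster:
#   target=$(compaction_target "$rev" 1000) && etcdctl compact "$target"
#   kubectl get pods --all-namespaces --no-headers | count_unhealthy_pods
```

If etcd raised a NOSPACE alarm during the outage, it typically also has to be cleared with etcdctl alarm disarm once space has been reclaimed; the source does not say whether that step was needed here.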
Preventive Measures
To avoid recurrence, the following were implemented:
Automated periodic compaction:
auto-compaction-mode: periodic
auto-compaction-retention: "5m"
Adjusted quota‑backend‑bytes to 8 GB and added Prometheus alerts for low space and high fragmentation:
groups:
  - name: etcd-alerts
    rules:
      - alert: EtcdDatabaseQuotaLowSpace
        expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.8
        for: 5m
      - alert: EtcdHighFragmentation
        expr: (etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_bytes) / etcd_mvcc_db_total_size_in_bytes > 0.5
        for: 10m
Daily health‑check script that monitors fragmentation and triggers defragmentation when the rate exceeds 50%:
#!/bin/bash
# ETCD_ENDPOINTS: space-separated list of etcd endpoints, expected in the environment
check_fragmentation() {
    for endpoint in $ETCD_ENDPOINTS; do
        # Fragmentation rate (%) = (dbSize - dbSizeInUse) / dbSize * 100
        frag_rate=$(etcdctl endpoint status --endpoints="$endpoint" --write-out=json | jq '.[] | ((.Status.dbSize - .Status.dbSizeInUse) / .Status.dbSize * 100)')
        if (( $(echo "$frag_rate > 50" | bc -l) )); then
            echo "WARNING: $endpoint fragmentation rate: ${frag_rate}%"
            etcdctl defrag --endpoints="$endpoint"
        fi
    done
}
check_fragmentation
Key Takeaways
Regular monitoring and automated maintenance prevent catastrophic outages.
Automation reduces human error during emergency response.
Well‑designed alert thresholds and a tested recovery playbook are essential for high‑availability services.
dbaplus Community
