How We Rescued a Crashed K8s Cluster: etcd 100% Fragmentation Recovery
This article details a P0 production incident in which a Kubernetes cluster became completely unresponsive after etcd database fragmentation reached roughly 98% and the backend quota was exhausted. It covers the step‑by‑step diagnosis, emergency recovery actions, root‑cause analysis, and long‑term preventive measures for reliable cluster operation.
K8s Cluster Crash Investigation: 100% etcd Database Fragmentation Rescue
Incident Level: P0 | Impact Scope: All Services | Recovery Time: 4h 23m
Disaster Strikes: Early‑Morning Alarm Chain
2024‑12‑15 03:17 – while the on‑call engineer was in bed, a flood of alerts arrived:
CRITICAL: K8s API Server response timeout
CRITICAL: etcd cluster health check failed
CRITICAL: All Pod statuses abnormal
CRITICAL: Business services fully offline
Realising the severity, the engineer immediately suspected a major failure.
Initial Investigation: Symptoms Worse Than Expected
SSH into the master node and check cluster status:
$ kubectl get nodes
The connection to the server localhost:8080 was refused
$ kubectl cluster-info
Unable to connect to the server: dial tcp 10.0.1.10:6443: i/o timeout
The API server was completely unresponsive, prompting a check of the etcd cluster:
$ systemctl status etcd
● etcd.service - etcd
Active: active (running) but degraded
$ etcdctl endpoint health --cluster
10.0.1.10:2379 is unhealthy: took too long
10.0.1.11:2379 is unhealthy: took too long
10.0.1.12:2379 is unhealthy: took too long
All etcd nodes timed out – not a simple network glitch.
Deep Diagnosis: Shocking Findings
Step 1 – Check etcd Logs
$ journalctl -u etcd -n 100
Dec 15 03:15:23 etcd[1234]: apply entries took too long [2.357658s] for 1 entries
Dec 15 03:15:45 etcd[1234]: database space exceeded
Dec 15 03:16:02 etcd[1234]: mvcc: database space exceeded
The log repeatedly reported "database space exceeded".
Step 2 – Inspect Database Size
$ etcdctl endpoint status --write-out=table --cluster
+------------------+------------------+---------+---------+-----------+-----------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM |
+------------------+------------------+---------+---------+-----------+-----------+
| 10.0.1.10:2379 | 8e9e05c52164694d | 3.4.13 | 8.2 GB | true | 5 |
| 10.0.1.11:2379 | 8e9e05c52164694e | 3.4.13 | 8.1 GB | false | 5 |
| 10.0.1.12:2379 | 8e9e05c52164694f | 3.4.13 | 8.3 GB | false | 5 |
+------------------+------------------+---------+---------+-----------+-----------+
Each node held over 8 GB of data – far beyond the normal few hundred MB.
Step 3 – Fragmentation Check
$ etcdctl defrag --data-dir=/var/lib/etcd
Failed to defrag etcd member: rpc error: database space exceeded
$ du -sh /var/lib/etcd/
8.4G /var/lib/etcd/
Further analysis showed a fragmentation rate of 98.1%:
$ etcdctl endpoint hashkv --cluster
10.0.1.10:2379, 3841678299 (rev 1847293)
10.0.1.11:2379, 3841678299 (rev 1847293)
10.0.1.12:2379, 3841678299 (rev 1847293)
# Calculate fragmentation
Actual data size: 156 MB
Database file size: 8.4 GB
Fragmentation: (8.4 GB - 156 MB) / 8.4 GB = 98.1%
Root‑Cause Analysis: Accumulated Historical Events
Log and metric back‑trace revealed three main contributors:
1. Massive Pod Restart Storm
# Count etcd PUT operations for pods
$ grep "PUT /registry/pods" /var/log/etcd.log | wc -l
2847293
# Pods created/deleted in the last 24h
$ kubectl get events --all-namespaces --field-selector reason=Created | wc -l
45623
A buggy application restarted continuously, generating more than 2.8 million pod writes to etcd within a month.
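One way to spot the offending workload is to rank pods by restart count. The sketch below is an illustration, not the command used during the incident: `top_restarts` is a hypothetical helper, and it assumes the default column layout of `kubectl get pods --all-namespaces --no-headers` (namespace, name, ready, status, restarts, age).

```bash
# top_restarts [N]: read `kubectl get pods --all-namespaces --no-headers`
# output on stdin and print the N pods (default 5) with the most restarts,
# one per line, as "<restarts> <namespace>/<name>".
# Hypothetical helper; assumes RESTARTS is the fifth whitespace-separated column.
top_restarts() {
  awk '$5 ~ /^[0-9]+$/ { print $5, $1 "/" $2 }' | sort -rn | head -n "${1:-5}"
}

# Usage against a live cluster:
#   kubectl get pods --all-namespaces --no-headers | top_restarts 10
```

Because the helper only reads stdin, it can be tried on saved `kubectl` output before being pointed at a production cluster.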
2. Historical Version Accumulation
$ etcdctl get --prefix --keys-only /registry/ | wc -l
1847293
# Compact old revisions (example)
$ etcdctl compact $(etcdctl endpoint status --write-out="json" | jq '.[0].Status.header.revision - 1000')
More than 1.8 million historical key revisions remained uncompacted.
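The revision arithmetic in the compaction example is fragile if `jq` returns nothing or the store has fewer revisions than the retention window. A minimal sketch of a guard, assuming a hypothetical helper named `compact_target` and the same 1000-revision window used in the article:

```bash
# compact_target REV [KEEP]: print the revision to pass to `etcdctl compact`
# so that the newest KEEP revisions (default 1000) are retained.
# Hypothetical helper. Fails with a non-zero exit and no output if REV is
# not a plain number or if there are not more than KEEP revisions.
compact_target() {
  rev=$1
  keep=${2:-1000}
  case "$rev" in
    ''|*[!0-9]*) echo "invalid revision: '$rev'" >&2; return 1 ;;
  esac
  if [ "$rev" -le "$keep" ]; then
    echo "revision $rev is within the last $keep; nothing to compact" >&2
    return 1
  fi
  echo $((rev - keep))
}

# Usage (assumes etcdctl and jq are installed, as in the commands above):
#   rev=$(etcdctl endpoint status --write-out=json | jq '.[0].Status.header.revision')
#   target=$(compact_target "$rev") && etcdctl compact "$target"
```

With the revision reported by `hashkv` (1847293), the helper yields 1846293, which matches the compacted revision shown in the recovery log.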
3. Misconfiguration
# /etc/etcd/etcd.conf.yml
auto-compaction-retention: "0" # disabled auto‑compaction
quota-backend-bytes: 8589934592 # 8 GB limit reached
Automatic compaction was turned off and the quota limit was hit.
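These two settings are easy to audit before they cause trouble. The following is a rough sketch, assuming a hypothetical helper named `audit_etcd_conf` that reads the YAML config on stdin. It only understands simple top-level `key: value` lines, not arbitrary YAML.

```bash
# audit_etcd_conf: read an etcd YAML config on stdin and report the quota
# and auto-compaction settings discussed above.
# Hypothetical helper; handles only simple top-level `key: value` lines.
audit_etcd_conf() {
  awk '
    { sub(/[ \t]*#.*$/, "") }                 # strip trailing comments
    $1 == "auto-compaction-retention:" {
      retention = $2; gsub(/"/, "", retention); seen = 1
    }
    $1 == "quota-backend-bytes:" {
      printf "quota: %.1f GiB\n", $2 / 1073741824
    }
    END {
      # etcd treats an unset or "0" retention as auto-compaction disabled
      if (!seen || retention == "0")
        print "auto-compaction: disabled"
      else
        print "auto-compaction: retention " retention
    }
  '
}

# Usage:
#   audit_etcd_conf < /etc/etcd/etcd.conf.yml
```

Run against the configuration shown above, it reports a quota of 8.0 GiB and auto-compaction disabled.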
Emergency Rescue: Step‑by‑Step Recovery
Phase 1 – Expand Storage (15 min)
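# Temporarily raise the quota to 12 GB. The quota is a server setting, not a key in the
# keyspace, so change it in the config file on every member rather than with `etcdctl put`
$ sed -i 's/^quota-backend-bytes:.*/quota-backend-bytes: 12884901888/' /etc/etcd/etcd.conf.yml
# Restart etcd service
$ systemctl restart etcd
Phase 2 – Manual Compaction (45 min)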
# Get current revision
rev=$(etcdctl endpoint status --write-out="json" | jq '.[0].Status.header.revision')
# Compact, keeping the latest 1000 revisions
$ etcdctl compact $((rev-1000))
compacted revision 1846293
# Wait for compaction to finish
$ watch 'etcdctl endpoint status --write-out=table'
Phase 3 – Defragmentation (180 min)
# Defragment each node sequentially
for endpoint in 10.0.1.10:2379 10.0.1.11:2379 10.0.1.12:2379; do
  echo "Defragmenting $endpoint..."
  etcdctl --endpoints="$endpoint" defrag
  sleep 60
done
# Verify size after defragmentation
$ etcdctl endpoint status --write-out=table --cluster
+------------------+------------------+---------+---------+-----------+-----------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | RAFT TERM |
+------------------+------------------+---------+---------+-----------+-----------+
| 10.0.1.10:2379 | 8e9e05c52164694d | 3.4.13 | 178 MB | true | 5 |
| 10.0.1.11:2379 | 8e9e05c52164694e | 3.4.13 | 181 MB | false | 5 |
| 10.0.1.12:2379 | 8e9e05c52164694f | 3.4.13 | 175 MB | false | 5 |
+------------------+------------------+---------+---------+-----------+-----------+
Phase 4 – Service Verification (23 min)
# Verify API server
$ kubectl cluster-info
Kubernetes master is running at https://10.0.1.10:6443
# Verify node status
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master-1 Ready master 45d v1.19.3
k8s-master-2 Ready master 45d v1.19.3
k8s-master-3 Ready master 45d v1.19.3
k8s-worker-1 Ready <none> 45d v1.19.3
# Verify pods
# --no-headers keeps the header row out of the count
$ kubectl get pods --all-namespaces --no-headers | grep -v Running | wc -l
0
All services returned to normal operation.
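The recovery log above does not show it, but etcd normally keeps a NOSPACE alarm raised after the quota has been hit until the alarm is explicitly disarmed, and it rejects writes while the alarm is active. It may therefore be worth confirming that no alarm remains. The sketch below assumes a hypothetical helper named `has_nospace_alarm` that inspects the output of `etcdctl alarm list`.

```bash
# has_nospace_alarm: read `etcdctl alarm list` output on stdin and succeed
# (exit 0) if any member still reports a NOSPACE alarm. Hypothetical helper.
has_nospace_alarm() {
  grep -q 'alarm:NOSPACE'
}

# Usage against a live cluster:
#   if etcdctl alarm list | has_nospace_alarm; then
#     # disarm only after compaction and defragmentation have freed space
#     etcdctl alarm disarm
#   fi
```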
Preventive Measures: Permanent Solutions
1. Automated Compaction Configuration
# Optimised etcd config
auto-compaction-mode: periodic
auto-compaction-retention: "5m"
quota-backend-bytes: 8589934592
max-request-bytes: 1572864
2. Enhanced Monitoring Alerts
# Prometheus rule for low space
- alert: EtcdDatabaseQuotaLowSpace
  expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.8
  for: 5m
# Rule for high fragmentation
- alert: EtcdHighFragmentation
  expr: (etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_bytes) / etcd_mvcc_db_total_size_in_bytes > 0.5
  for: 10m
3. Automated Health‑Check Script
#!/bin/bash
# etcd-health-check.sh – daily health check
# ETCD_ENDPOINTS: space-separated list of client endpoints (defaults to this cluster's members)
ETCD_ENDPOINTS="${ETCD_ENDPOINTS:-10.0.1.10:2379 10.0.1.11:2379 10.0.1.12:2379}"
check_fragmentation() {
  for endpoint in $ETCD_ENDPOINTS; do
    # endpoint status returns a JSON array, so read the first element
    frag_rate=$(etcdctl endpoint status --endpoints="$endpoint" --write-out=json | jq -r '.[0].Status | (.dbSize - .dbSizeInUse) / .dbSize * 100')
    if (( $(echo "$frag_rate > 50" | bc -l) )); then
      echo "WARNING: $endpoint fragmentation rate: $frag_rate%"
      etcdctl defrag --endpoints="$endpoint"
    fi
  done
}
check_fragmentation
Automated compaction, monitoring, and health checks together guard the cluster against a repeat of this fragmentation failure.
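The fragmentation arithmetic in the health-check script can also be isolated into a pure function, which makes it testable without a running cluster and removes the dependency on `bc`. This is a minimal sketch with hypothetical helper names (`frag_pct`, `needs_defrag`), where both arguments are byte counts such as `dbSize` and `dbSizeInUse`.

```bash
# frag_pct DB_SIZE IN_USE: print fragmentation as a percentage with one
# decimal place, i.e. (DB_SIZE - IN_USE) / DB_SIZE * 100. Hypothetical helper.
frag_pct() {
  awk -v size="$1" -v used="$2" 'BEGIN {
    if (size + 0 <= 0) { print "0.0"; exit }   # empty or zero size: report no fragmentation
    printf "%.1f\n", (size - used) / size * 100
  }'
}

# needs_defrag DB_SIZE IN_USE [THRESHOLD]: succeed if fragmentation exceeds
# THRESHOLD percent (default 50, the same cut-off the script above uses).
needs_defrag() {
  awk -v pct="$(frag_pct "$1" "$2")" -v limit="${3:-50}" \
    'BEGIN { exit !(pct + 0 > limit + 0) }'
}

# Usage with the sizes from the incident (8.4 GB file, 156 MB live data):
#   frag_pct 8400000000 156000000
#   needs_defrag 8400000000 156000000 && echo "defragmentation recommended"
```

With the incident's figures the function reproduces the 98.1% calculated during diagnosis.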
MaGe Linux Operations
