
How We Rescued a Crashed K8s Cluster: etcd 100% Fragmentation Recovery

This article details a P0 production incident where a Kubernetes cluster became completely unresponsive due to 100% etcd database fragmentation, describing the step‑by‑step diagnosis, emergency recovery actions, root‑cause analysis, and long‑term preventive measures for reliable cluster operation.

MaGe Linux Operations

Jul 23, 2025

K8s Cluster Crash Investigation: 100% etcd Database Fragmentation Rescue

Incident Level: P0 | Impact Scope: All Services | Recovery Time: 4h 23m

Disaster Strikes: Early‑Morning Alarm Chain

2024‑12‑15 03:17 – while the on‑call engineer was in bed, a flood of alerts arrived:

CRITICAL: K8s API Server response timeout
CRITICAL: etcd cluster health check failed
CRITICAL: All Pod statuses abnormal
CRITICAL: Business services fully offline

With the API server, etcd, and every business service alerting at once, this was clearly a control-plane failure rather than a single faulty workload.

Initial Investigation: Symptoms Worse Than Expected

SSH into the master node and check cluster status:

$ kubectl get nodes
The connection to the server localhost:8080 was refused

$ kubectl cluster-info
Unable to connect to the server: dial tcp 10.0.1.10:6443: i/o timeout

The API server was completely unresponsive, so the next check was the etcd cluster. The service was still running, but in a degraded state:

$ systemctl status etcd
● etcd.service - etcd
   Active: active (running)

$ etcdctl endpoint health --cluster
10.0.1.10:2379 is unhealthy: took too long
10.0.1.11:2379 is unhealthy: took too long
10.0.1.12:2379 is unhealthy: took too long

All etcd nodes timed out – not a simple network glitch.

Deep Diagnosis: Shocking Findings

Step 1 – Check etcd Logs

$ journalctl -u etcd -n 100
Dec 15 03:15:23 etcd[1234]: apply entries took too long [2.357658s] for 1 entries
Dec 15 03:15:45 etcd[1234]: database space exceeded
Dec 15 03:16:02 etcd[1234]: mvcc: database space exceeded

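The log repeatedly reported "database space exceeded": the backend had hit its storage quota, at which point etcd raises a NOSPACE alarm and rejects further writes.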
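To gauge how often the quota error occurs rather than reading the journal by eye, the matching lines can be counted. This is a minimal sketch; the function name `count_quota_errors` is hypothetical and it assumes the log text arrives on standard input.

```shell
#!/usr/bin/env bash
# count_quota_errors: read log text on stdin and print how many lines
# contain the quota error seen in this incident.
count_quota_errors() {
  # grep -c exits non-zero when nothing matches; keep the function's
  # exit status clean so it is safe under `set -e`.
  grep -c 'database space exceeded' || true
}

# Example (not run here):
#   journalctl -u etcd --since "1 hour ago" | count_quota_errors
```

A count that keeps rising between runs suggests clients are still retrying writes against a member that is out of space.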

Step 2 – Inspect Database Size

$ etcdctl endpoint status --write-out=table --cluster
+------------------+------------------+---------+---------+-----------+-----------+
| ENDPOINT         | ID               | VERSION | DB SIZE | IS LEADER | RAFT TERM |
+------------------+------------------+---------+---------+-----------+-----------+
| 10.0.1.10:2379   | 8e9e05c52164694d | 3.4.13  | 8.2 GB  | true      | 5         |
| 10.0.1.11:2379   | 8e9e05c52164694e | 3.4.13  | 8.1 GB  | false     | 5         |
| 10.0.1.12:2379   | 8e9e05c52164694f | 3.4.13  | 8.3 GB  | false     | 5         |
+------------------+------------------+---------+---------+-----------+-----------+

Each node held over 8 GB of data – far beyond the normal few hundred MB.

Step 3 – Fragmentation Check

$ etcdctl defrag
Failed to defrag etcd member: rpc error: database space exceeded

$ du -sh /var/lib/etcd/
8.4G    /var/lib/etcd/

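A hash check showed all three members in agreement at revision 1847293, ruling out data divergence, and comparing the live data size with the file size on disk put fragmentation at 98.1%: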

$ etcdctl endpoint hashkv --cluster
10.0.1.10:2379, 3841678299 (rev 1847293)
10.0.1.11:2379, 3841678299 (rev 1847293)
10.0.1.12:2379, 3841678299 (rev 1847293)

# Calculate fragmentation
Actual data size: 156 MB
Database file size: 8.4 GB
Fragmentation: (8.4 GB - 156 MB) / 8.4 GB = 98.1%
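The same arithmetic can be wrapped in a small helper so it is repeatable. This is a sketch only: `frag_pct` is a hypothetical name, both arguments must use the same unit, and 8.4 GB is taken as 8400 MB for the example.

```shell
#!/usr/bin/env bash
# frag_pct TOTAL IN_USE
# Print the share of the database file that is reclaimable, as a
# percentage with one decimal place. Both sizes must use the same unit.
frag_pct() {
  awk -v total="$1" -v used="$2" 'BEGIN {
    if (total <= 0) { print "0.0"; exit }   # avoid division by zero
    printf "%.1f\n", (total - used) / total * 100
  }'
}

# Incident figures, in MB:
#   frag_pct 8400 156    # prints 98.1
```

After the recovery described later, feeding in the post-defrag sizes should give a figure close to zero.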

Root‑Cause Analysis: Accumulated Historical Events

Tracing back through logs and metrics revealed three main contributors:

1. Massive Pod Restart Storm

# Count etcd PUT operations for pods
$ grep "PUT /registry/pods" /var/log/etcd.log | wc -l
2847293
# Pods created/deleted in the last 24h
$ kubectl get events --all-namespaces --field-selector reason=Created | wc -l
45623

A buggy application stuck in a restart loop generated over 2.8 million pod writes in a month, flooding etcd.
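A restart loop like this is easier to catch early by ranking pods on their restart count. The sketch below assumes the default column layout of `kubectl get pods --all-namespaces --no-headers` (namespace, name, ready, status, restarts, age); `top_restarters` is a hypothetical helper name and the default threshold of 10 is arbitrary.

```shell
#!/usr/bin/env bash
# top_restarters [MIN]
# Read pod listing rows on stdin and print "restarts namespace/name" for
# every pod with at least MIN restarts (default 10), highest first.
top_restarters() {
  awk -v min="${1:-10}" '$5 + 0 >= min + 0 { print $5 + 0, $1 "/" $2 }' | sort -rn
}

# Example (not run here):
#   kubectl get pods --all-namespaces --no-headers | top_restarters 50
```

Each restart rewrites the pod object and emits events, so the pods at the top of this list are the ones generating the most etcd revisions.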

2. Historical Version Accumulation

$ etcdctl get --prefix --keys-only /registry/ | wc -l
1847293
# Compact old revisions (example; the subcommand is "compact")
$ etcdctl compact $(etcdctl endpoint status --write-out="json" | jq '.[0].Status.header.revision - 1000')

The store had passed 1.8 million entries, and because compaction had never run, every superseded revision was still on disk.

3. Misconfiguration

# /etc/etcd/etcd.conf.yml
auto-compaction-retention: "0"   # disabled auto‑compaction
quota-backend-bytes: 8589934592 # 8 GB limit reached

With automatic compaction turned off, old revisions were never discarded, and the database grew until it hit the 8 GB quota.
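A setting like this is easy to miss in review, so a quick lint of the config file can flag it. The sketch below is illustrative: `lint_etcd_conf` is a hypothetical name, and it assumes the flat `key: value` YAML layout shown above rather than parsing YAML in general.

```shell
#!/usr/bin/env bash
# lint_etcd_conf FILE
# Warn when auto-compaction is missing or set to 0 in an etcd config file.
lint_etcd_conf() {
  local file="$1" retention
  # Take the text after the key, then drop trailing comments, quotes
  # and trailing whitespace.
  retention=$(sed -n 's/^auto-compaction-retention:[[:space:]]*//p' "$file" \
    | sed 's/#.*//; s/"//g; s/[[:space:]]*$//' | head -n 1)
  if [ -z "$retention" ] || [ "$retention" = "0" ]; then
    echo "WARN: auto-compaction is disabled"
  else
    echo "OK: auto-compaction-retention=$retention"
  fi
}

# Example (not run here):
#   lint_etcd_conf /etc/etcd/etcd.conf.yml
```

Running a check like this in CI or configuration management is one way to stop a disabled retention value from reaching production unnoticed.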

Emergency Rescue: Step‑by‑Step Recovery

Phase 1 – Expand Storage (15 min)

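# Temporarily raise the quota in /etc/etcd/etcd.conf.yml on each member.
# The quota is a startup setting; "etcdctl put" would only write an ordinary key.
quota-backend-bytes: 12884901888   # 12 GB
# Restart etcd, one member at a time
$ systemctl restart etcd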
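The quota values in this article are whole GiB counts written out in bytes, which is easy to mistype. A small conversion helper, sketched here under the hypothetical name `gib_to_bytes`, reproduces both figures.

```shell
#!/usr/bin/env bash
# gib_to_bytes N
# Convert a whole number of GiB into bytes for quota-backend-bytes.
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

# gib_to_bytes 8     # prints 8589934592  (the original quota)
# gib_to_bytes 12    # prints 12884901888 (the temporary quota)
```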

Phase 2 – Manual Compaction (45 min)

# Get current revision
$ rev=$(etcdctl endpoint status --write-out="json" | jq '.[0].Status.header.revision')
# Compact, keeping the latest 1000 revisions
$ etcdctl compact $((rev-1000))
compacted revision 1846293
# Wait for compaction to finish
$ watch 'etcdctl endpoint status --write-out=table'
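The subtraction used to pick the compaction revision goes negative on a young cluster, so a guarded version may be safer in scripts. This is a sketch; `compact_target` is a hypothetical helper and the default of 1000 retained revisions simply mirrors the command above.

```shell
#!/usr/bin/env bash
# compact_target REV [KEEP]
# Print the revision to pass to "etcdctl compact" so that the newest KEEP
# revisions (default 1000) survive. Print nothing and return 1 when the
# store has not yet accumulated more than KEEP revisions.
compact_target() {
  local rev="$1" keep="${2:-1000}"
  if [ "$rev" -le "$keep" ]; then
    return 1
  fi
  echo $(( rev - keep ))
}

# Example (not run here):
#   target=$(compact_target "$rev") && etcdctl compact "$target"
```

With the incident's revision of 1847293, the helper yields 1846293, matching the output shown above.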

Phase 3 – Defragmentation (180 min)

# Defragment each node sequentially
for endpoint in 10.0.1.10:2379 10.0.1.11:2379 10.0.1.12:2379; do
  echo "Defragmenting $endpoint..."
  etcdctl --endpoints="$endpoint" defrag
  sleep 60
done
# Clear the NOSPACE alarm so etcd accepts writes again
$ etcdctl alarm disarm
# Verify size after defragmentation
$ etcdctl endpoint status --write-out=table --cluster
+------------------+------------------+---------+---------+-----------+-----------+
| ENDPOINT         | ID               | VERSION | DB SIZE | IS LEADER | RAFT TERM |
+------------------+------------------+---------+---------+-----------+-----------+
| 10.0.1.10:2379   | 8e9e05c52164694d | 3.4.13  | 178 MB  | true      | 5         |
| 10.0.1.11:2379   | 8e9e05c52164694e | 3.4.13  | 181 MB  | false     | 5         |
| 10.0.1.12:2379   | 8e9e05c52164694f | 3.4.13  | 175 MB  | false     | 5         |
+------------------+------------------+---------+---------+-----------+-----------+
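Defragmentation blocks a member while it rebuilds its file, so a common precaution is to process followers first and the leader last. The sketch below only computes that order; `defrag_order` is a hypothetical helper, and the two-column input (endpoint, then true or false for leadership) is an assumed format that would have to be extracted from the status output.

```shell
#!/usr/bin/env bash
# defrag_order
# Read "ENDPOINT IS_LEADER" pairs on stdin and print the endpoints with
# followers first and the leader last.
defrag_order() {
  awk '
    $2 == "true" { leader = leader $1 "\n"; next }   # hold the leader back
    { print $1 }                                     # followers go first
    END { printf "%s", leader }
  '
}

# Example (not run here):
#   printf '%s\n' "10.0.1.10:2379 true" "10.0.1.11:2379 false" | defrag_order
```

Feeding the result into the sequential loop keeps quorum intact, because only one member is busy at any time and the leader is touched only once both followers are healthy again.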

Phase 4 – Service Verification (23 min)

# Verify API server
$ kubectl cluster-info
Kubernetes master is running at https://10.0.1.10:6443

# Verify node status
$ kubectl get nodes
NAME          STATUS   ROLES   AGE   VERSION
k8s-master-1  Ready    master  45d   v1.19.3
k8s-master-2  Ready    master  45d   v1.19.3
k8s-master-3  Ready    master  45d   v1.19.3
k8s-worker-1  Ready    <none>  45d   v1.19.3

# Verify pods
$ kubectl get pods --all-namespaces --no-headers | grep -v Running | wc -l
0

All services returned to normal operation.

Preventive Measures: Permanent Solutions

1. Automated Compaction Configuration

# Optimised etcd config
auto-compaction-mode: periodic
auto-compaction-retention: "5m"
quota-backend-bytes: 8589934592
max-request-bytes: 1572864

2. Enhanced Monitoring Alerts

# Prometheus rule for low space
- alert: EtcdDatabaseQuotaLowSpace
  expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.8
  for: 5m

# Rule for high fragmentation
- alert: EtcdHighFragmentation
  expr: (etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes) / etcd_mvcc_db_total_size_in_bytes > 0.5
  for: 10m
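To sanity-check the two thresholds, the expressions can be evaluated by hand against known sizes. This sketch mirrors the 0.8 and 0.5 thresholds from the rules above but ignores the `for:` hold time; `alerts_firing` is a hypothetical helper, and all three arguments must share one unit.

```shell
#!/usr/bin/env bash
# alerts_firing TOTAL IN_USE QUOTA
# Print the name of each alert whose expression would be true for the
# given database size, in-use size and quota.
alerts_firing() {
  awk -v total="$1" -v used="$2" -v quota="$3" 'BEGIN {
    if (quota > 0 && total / quota > 0.8)
      print "EtcdDatabaseQuotaLowSpace"
    if (total > 0 && (total - used) / total > 0.5)
      print "EtcdHighFragmentation"
  }'
}

# Incident sizes in MB (8.4 GB file, 156 MB live data, 8 GiB quota):
#   alerts_firing 8400 156 8192
```

With the incident figures both alerts fire, while the post-recovery sizes of roughly 178 MB total trigger neither.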

3. Automated Health‑Check Script

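#!/bin/bash
# etcd-health-check.sh – daily health check
# Space-separated endpoint list; override via the environment
ETCD_ENDPOINTS="${ETCD_ENDPOINTS:-10.0.1.10:2379 10.0.1.11:2379 10.0.1.12:2379}"

check_fragmentation() {
  for endpoint in $ETCD_ENDPOINTS; do
    # endpoint status returns a JSON array, so index the first element
    frag_rate=$(etcdctl endpoint status --endpoints="$endpoint" --write-out=json | jq -r '.[0].Status | (.dbSize - .dbSizeInUse) / .dbSize * 100')
    if (( $(echo "$frag_rate > 50" | bc -l) )); then
      echo "WARNING: $endpoint fragmentation rate: $frag_rate%"
      etcdctl defrag --endpoints="$endpoint"
    fi
  done
}

check_fragmentation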

By automating compaction, monitoring, and health checks, the cluster is protected against future fragmentation disasters.
