Operations 10 min read

How We Rescued a Critical etcd Outage in 4 Hours: Step‑by‑Step Recovery Guide

A midnight Kubernetes disaster caused API server timeouts, etcd health failures, and a full service outage, prompting a detailed investigation, root‑cause analysis of massive database fragmentation, and a four‑stage emergency recovery that restored the cluster within 4 hours while outlining preventive measures.

dbaplus Community
dbaplus Community
dbaplus Community
How We Rescued a Critical etcd Outage in 4 Hours: Step‑by‑Step Recovery Guide

Incident Overview

At 03:17 on December 15, 2024, a critical alert chain reported a Kubernetes API server timeout, etcd health‑check failures, abnormal pod states, and a complete business service shutdown. The author, an operator with eight years of experience, recognized the severity immediately.

Initial Investigation

SSH into the master node revealed connection refusals: kubectl get nodes and kubectl cluster-info Both commands timed out, confirming that the API server was unresponsive. etcd status showed a degraded service and health checks timed out for all three nodes.

Deep Diagnosis

Log inspection with journalctl -u etcd -n 100 showed entries taking too long and messages about "database space exceeded". Endpoint status displayed DB sizes of 8.2‑8.3 GB on each node, far larger than the normal few hundred megabytes.

Fragmentation analysis using etcdctl endpoint hashkv --cluster and size calculations revealed an actual data size of 156 MB versus a file size of 8.4 GB, yielding a fragmentation rate of 98.1%.

Root Cause Analysis

Three main factors were identified:

Frequent pod‑restart storms: over 2.8 million pod state changes in the past 24 hours.

Historical version accumulation: approximately 1.84 million keys with no compaction, leading to massive version bloat.

Misconfiguration: auto‑compaction disabled (auto‑compaction‑retention: "0") and quota‑backend‑bytes set to 8 GB, which was exceeded.

Emergency Recovery Strategy

The recovery was executed in four stages:

Temporary storage expansion (15 min): increased the quota limit and restarted etcd.

# Temporarily raise quota limit
etcdctl put quota-backend-bytes 12884901888  # 12 GB
systemctl restart etcd

Manual compaction (45 min): captured the current revision and compacted to retain the latest 1,000 versions.

# Get current revision
rev=$(etcdctl endpoint status --write-out="json" | jq '.[0].Status.header.revision')
echo "Current revision: $rev"
# Compact keeping last 1000 revisions
etcdctl compact $((rev-1000))
watch 'etcdctl endpoint status --write-out=table'

Defragmentation (180 min): iterated over each etcd endpoint and ran defragmentation, pausing between nodes.

for endpoint in 10.0.1.10:2379 10.0.1.11:2379 10.0.1.12:2379; do
  echo "Defragmenting $endpoint..."
  etcdctl --endpoints=$endpoint defrag
  sleep 60
done

Service verification (23 min): confirmed API server health, node readiness, and that no non‑running pods remained.

# Verify API server
kubectl cluster-info
# Verify nodes
kubectl get nodes
# Verify pods
kubectl get pods --all-namespaces | grep -v Running | wc -l

The database size dropped from 8.4 GB to roughly 180 MB, and all services resumed normal operation.

Preventive Measures

To avoid recurrence, the following were implemented:

Automated periodic compaction:

auto-compaction-mode: periodic
auto-compaction-retention: "5m"

Adjusted quota‑backend‑bytes to 8 GB and added Prometheus alerts for low space and high fragmentation:

groups:
- name: etcd-alerts
  rules:
  - alert: EtcdDatabaseQuotaLowSpace
    expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.8
    for: 5m
  - alert: EtcdHighFragmentation
    expr: (etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_bytes) / etcd_mvcc_db_total_size_in_bytes > 0.5
    for: 10m

Daily health‑check script that monitors fragmentation and triggers defragmentation when the rate exceeds 50%:

#!/bin/bash
check_fragmentation() {
  for endpoint in $ETCD_ENDPOINTS; do
    frag_rate=$(etcdctl endpoint status --endpoints=$endpoint --write-out=json | jq '.[] | ((.Status.dbSize - .Status.dbSizeInUse) / .Status.dbSize * 100)')
    if (( $(echo "$frag_rate > 50" | bc -l) )); then
      echo "WARNING: $endpoint fragmentation rate: ${frag_rate}%"
      etcdctl defrag --endpoints=$endpoint
    fi
  done
}
check_fragmentation

Key Takeaways

Regular monitoring and automated maintenance prevent catastrophic outages.

Automation reduces human error during emergency response.

Well‑designed alert thresholds and a tested recovery playbook are essential for high‑availability services.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

OperationsKubernetesetcddatabase fragmentation
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.