
How to Safely Backup and Restore etcd in a Kubernetes Cluster

This guide explains why etcd is critical for Kubernetes, walks through creating snapshots with etcdctl and automating backups with scripts and cron, and details step-by-step procedures for restoring a failed etcd cluster: stopping services, cleaning data directories, and restarting components to recover the whole cluster.


etcd is a critical component of a Kubernetes cluster, storing all cluster state such as namespaces, pods, services, and routing information. Loss of etcd data can prevent cluster recovery, so backing up etcd is essential for disaster recovery.

1. etcd Cluster Backup

Different etcd versions use slightly different etcdctl flags, but the snapshot save operation is common to all of them. The backup only needs to be performed on a single etcd node.

Since Kubernetes 1.13, only etcd v3 is supported, so the snapshot contains only v3 data.

The example uses a binary‑deployed Kubernetes v1.18.6 with Calico.

1) View etcd data directories

<code># etcd data directory:
[root@k8s-master01 ~]# cat /opt/k8s/bin/environment.sh | grep "ETCD_DATA_DIR="
export ETCD_DATA_DIR="/data/k8s/etcd/data"

# etcd WAL directory:
[root@k8s-master01 ~]# cat /opt/k8s/bin/environment.sh | grep "ETCD_WAL_DIR="
export ETCD_WAL_DIR="/data/k8s/etcd/wal"

# list the directories
[root@k8s-master01 ~]# ls /data/k8s/etcd/data/
member
[root@k8s-master01 ~]# ls /data/k8s/etcd/data/member/
snap
[root@k8s-master01 ~]# ls /data/k8s/etcd/wal/
0000000000000000-0000000000000000.wal  0.tmp</code>

2) Perform the snapshot backup

<code># mkdir -p /data/etcd_backup_dir
ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/cert/ca.pem \
  --cert=/etc/etcd/cert/etcd.pem \
  --key=/etc/etcd/cert/etcd-key.pem \
  --endpoints=https://172.16.60.231:2379 \
  snapshot save /data/etcd_backup_dir/etcd-snapshot-`date +%Y%m%d`.db</code>
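Before trusting a snapshot, it is worth verifying it. `etcdctl snapshot status` inspects the file offline, so no certificates or running cluster are needed; a minimal sketch, assuming the backup path used above:

```shell
# Print the snapshot's hash, revision, total key count, and size.
# The date-stamped filename matches the save command above.
SNAPSHOT="/data/etcd_backup_dir/etcd-snapshot-$(date +%Y%m%d).db"
ETCDCTL_API=3 etcdctl snapshot status "${SNAPSHOT}" --write-out=table
```

A snapshot whose status command fails or reports zero keys should not be relied on for recovery.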

Copy the snapshot to the other etcd nodes:

<code># rsync -e "ssh -p22" -avpgolr /data/etcd_backup_dir/etcd-snapshot-20200820.db root@k8s-master02:/data/etcd_backup_dir/
# rsync -e "ssh -p22" -avpgolr /data/etcd_backup_dir/etcd-snapshot-20200820.db root@k8s-master03:/data/etcd_backup_dir/</code>

Schedule the backup with cron (example runs daily at 5 AM):

<code># chmod 755 /data/etcd_backup_dir/etcd_backup.sh
# crontab -l
0 5 * * * /bin/bash -x /data/etcd_backup_dir/etcd_backup.sh > /dev/null 2>&1</code>

2. etcd Cluster Restore

Backup can be taken from a single node, but restoration must be performed on every etcd node.

Simulate data loss by deleting the etcd data directory:

<code># rm -rf /data/k8s/etcd/data/*</code>

Check cluster health (it will report unhealthy until etcd is restored and restarted):

<code># kubectl get cs
# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" \
  --cert=/etc/etcd/cert/etcd.pem \
  --key=/etc/etcd/cert/etcd-key.pem \
  --cacert=/etc/kubernetes/cert/ca.pem endpoint health</code>

Stop all master kube‑apiserver services and etcd services before restoring:

<code># systemctl stop kube-apiserver
# systemctl stop etcd</code>

Delete old data and WAL directories on each node (otherwise restore will fail):

<code># rm -rf /data/k8s/etcd/data && rm -rf /data/k8s/etcd/wal</code>

Run the restore command on each node (replace the IP address accordingly):

<code># Node 172.16.60.231
ETCDCTL_API=3 etcdctl \
  --name=k8s-etcd01 \
  --endpoints="https://172.16.60.231:2379" \
  --cert=/etc/etcd/cert/etcd.pem \
  --key=/etc/etcd/cert/etcd-key.pem \
  --cacert=/etc/kubernetes/cert/ca.pem \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-advertise-peer-urls=https://172.16.60.231:2380 \
  --initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
  --data-dir=/data/k8s/etcd/data \
  --wal-dir=/data/k8s/etcd/wal \
  snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db

# Node 172.16.60.232 (similar command with --name=k8s-etcd02 and its IP)
# Node 172.16.60.233 (similar command with --name=k8s-etcd03 and its IP)</code>
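The per-node restore commands differ only in `--name`, the endpoint, and the advertised peer URL. A hedged sketch that factors those out, where `NODE_NAME` and `NODE_IP` are assumed variables set to the local node's identity before running:

```shell
# Set these to the local node's identity before running.
NODE_NAME="k8s-etcd01"   # k8s-etcd02 / k8s-etcd03 on the other nodes
NODE_IP="172.16.60.231"  # 172.16.60.232 / 172.16.60.233 respectively

ETCDCTL_API=3 etcdctl \
  --name="${NODE_NAME}" \
  --endpoints="https://${NODE_IP}:2379" \
  --cert=/etc/etcd/cert/etcd.pem \
  --key=/etc/etcd/cert/etcd-key.pem \
  --cacert=/etc/kubernetes/cert/ca.pem \
  --initial-cluster-token=etcd-cluster-0 \
  --initial-advertise-peer-urls="https://${NODE_IP}:2380" \
  --initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
  --data-dir=/data/k8s/etcd/data \
  --wal-dir=/data/k8s/etcd/wal \
  snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db
```

Note that `--initial-cluster` and `--initial-cluster-token` are identical on all nodes; only the node-local values change.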

Start etcd services on all nodes and verify health:

<code># systemctl start etcd
# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" \
  --cert=/etc/etcd/cert/etcd.pem \
  --key=/etc/etcd/cert/etcd-key.pem \
  --cacert=/etc/kubernetes/cert/ca.pem endpoint health</code>
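Beyond endpoint health, it is worth confirming that all three members rejoined and agree on a leader. `member list` and `endpoint status` (same certificates and endpoints as above) show this:

```shell
ENDPOINTS="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379"

# All three members should appear as started.
ETCDCTL_API=3 etcdctl --endpoints="${ENDPOINTS}" \
  --cert=/etc/etcd/cert/etcd.pem \
  --key=/etc/etcd/cert/etcd-key.pem \
  --cacert=/etc/kubernetes/cert/ca.pem \
  member list --write-out=table

# Per-member DB size, leader flag, and raft index; indexes should converge.
ETCDCTL_API=3 etcdctl --endpoints="${ENDPOINTS}" \
  --cert=/etc/etcd/cert/etcd.pem \
  --key=/etc/etcd/cert/etcd-key.pem \
  --cacert=/etc/kubernetes/cert/ca.pem \
  endpoint status --write-out=table
```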

Start the kube‑apiserver services and confirm the cluster is healthy:

<code># systemctl start kube-apiserver
# kubectl get cs</code>

After restoration, pods gradually return to the Running state, indicating the Kubernetes cluster has been fully recovered.

3. Final Summary

Kubernetes backup focuses on etcd. The restoration order is:

1) Stop kube-apiserver
2) Stop etcd
3) Restore etcd data
4) Start etcd
5) Start kube-apiserver

Key points:

Only one etcd node needs to be backed up; the snapshot can then be synchronized to the other nodes.

A single node's snapshot is sufficient to restore the whole cluster, but the restore command must be run on every etcd node.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
