How to Safely Back Up and Restore etcd in a Kubernetes Cluster
This guide explains why etcd is critical to Kubernetes, walks through creating snapshots with etcdctl, shows how to automate backups with a script and cron, and details the step-by-step procedure for restoring a failed etcd cluster: stopping services, cleaning data directories, and restarting components to recover the whole cluster.
etcd is a critical component of a Kubernetes cluster, storing all cluster state such as namespaces, pods, services, and routing information. Loss of etcd data can prevent cluster recovery, so backing up etcd is essential for disaster recovery.
1. etcd Cluster Backup
Different etcd versions use slightly different etcdctl commands, but the snapshot save operation is common to all of them, and the backup needs to be run on only one etcd node.
Because Kubernetes 1.13 and later support only etcd v3, the backup contains v3 data only.
The example uses a binary‑deployed Kubernetes v1.18.6 with Calico.
1) View etcd data directories
<code>etcd data directory:
[root@k8s-master01 ~]# cat /opt/k8s/bin/environment.sh | grep "ETCD_DATA_DIR="
export ETCD_DATA_DIR="/data/k8s/etcd/data"
etcd WAL directory:
[root@k8s-master01 ~]# cat /opt/k8s/bin/environment.sh | grep "ETCD_WAL_DIR="
export ETCD_WAL_DIR="/data/k8s/etcd/wal"
# list directories
[root@k8s-master01 ~]# ls /data/k8s/etcd/data/
member
[root@k8s-master01 ~]# ls /data/k8s/etcd/data/member/
snap
[root@k8s-master01 ~]# ls /data/k8s/etcd/wal/
0000000000000000-0000000000000000.wal 0.tmp</code>
2) Perform the snapshot backup
<code># mkdir -p /data/etcd_backup_dir
ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/cert/ca.pem \
--cert=/etc/etcd/cert/etcd.pem \
--key=/etc/etcd/cert/etcd-key.pem \
--endpoints=https://172.16.60.231:2379 \
snapshot save /data/etcd_backup_dir/etcd-snapshot-`date +%Y%m%d`.db</code>
Copy the snapshot to the other etcd nodes:
<code># rsync -e "ssh -p22" -avpgolr /data/etcd_backup_dir/etcd-snapshot-20200820.db root@k8s-master02:/data/etcd_backup_dir/
# rsync -e "ssh -p22" -avpgolr /data/etcd_backup_dir/etcd-snapshot-20200820.db root@k8s-master03:/data/etcd_backup_dir/</code>
Schedule the backup with cron (example runs daily at 5 AM):
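The body of etcd_backup.sh is not shown above. Below is a minimal sketch: the certificate paths and endpoint are taken from the manual backup command, while the function layout, the snapshot status check, and the 7-day retention window are assumptions to adjust for your environment.

```shell
#!/bin/bash
# etcd_backup.sh - hypothetical daily backup script (paths/endpoint copied from
# the manual backup above; retention window is an assumption).
BACKUP_DIR=/data/etcd_backup_dir
ENDPOINT=https://172.16.60.231:2379
RETENTION_DAYS=7

snapshot_path() {
  # Date-stamped file name, e.g. /data/etcd_backup_dir/etcd-snapshot-20200820.db
  echo "${BACKUP_DIR}/etcd-snapshot-$(date +%Y%m%d).db"
}

backup() {
  mkdir -p "${BACKUP_DIR}"
  ETCDCTL_API=3 etcdctl \
    --cacert=/etc/kubernetes/cert/ca.pem \
    --cert=/etc/etcd/cert/etcd.pem \
    --key=/etc/etcd/cert/etcd-key.pem \
    --endpoints="${ENDPOINT}" \
    snapshot save "$(snapshot_path)"
  # Sanity-check the snapshot before trusting it
  ETCDCTL_API=3 etcdctl snapshot status "$(snapshot_path)" --write-out=table
  # Prune snapshots older than the retention window
  find "${BACKUP_DIR}" -name 'etcd-snapshot-*.db' -mtime "+${RETENTION_DAYS}" -delete 2>/dev/null || true
}

# Take the backup only where etcdctl is installed
if command -v etcdctl >/dev/null 2>&1; then
  backup
fi
```

Running snapshot status immediately after the save gives an early warning if the snapshot file is truncated or unreadable, rather than discovering it during a restore.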
<code># chmod 755 /data/etcd_backup_dir/etcd_backup.sh
# crontab -l
0 5 * * * /bin/bash -x /data/etcd_backup_dir/etcd_backup.sh > /dev/null 2>&1</code>
2. etcd Cluster Restore
The backup can be taken on a single node, but the restore must be performed on every etcd node.
Simulate data loss
<code># rm -rf /data/k8s/etcd/data/*</code>
Check cluster health (initially unhealthy, then recovers after etcd restarts):
<code># kubectl get cs
# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" \
--cert=/etc/etcd/cert/etcd.pem \
--key=/etc/etcd/cert/etcd-key.pem \
--cacert=/etc/kubernetes/cert/ca.pem endpoint health</code>
Stop all master kube‑apiserver services and etcd services before restoring:
<code># systemctl stop kube-apiserver
# systemctl stop etcd</code>
Delete old data and WAL directories on each node (otherwise restore will fail):
<code># rm -rf /data/k8s/etcd/data && rm -rf /data/k8s/etcd/wal</code>
Run the restore command on each node (replace the IP address accordingly):
<code># Node 172.16.60.231
ETCDCTL_API=3 etcdctl \
--name=k8s-etcd01 \
--endpoints="https://172.16.60.231:2379" \
--cert=/etc/etcd/cert/etcd.pem \
--key=/etc/etcd/cert/etcd-key.pem \
--cacert=/etc/kubernetes/cert/ca.pem \
--initial-cluster-token=etcd-cluster-0 \
--initial-advertise-peer-urls=https://172.16.60.231:2380 \
--initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
--data-dir=/data/k8s/etcd/data \
--wal-dir=/data/k8s/etcd/wal \
snapshot restore /data/etcd_backup_dir/etcd-snapshot-20200820.db
# Node 172.16.60.232 (similar command with --name=k8s-etcd02 and its IP)
# Node 172.16.60.233 (similar command with --name=k8s-etcd03 and its IP)</code>
Start etcd services on all nodes and verify health:
<code># systemctl start etcd
# ETCDCTL_API=3 etcdctl --endpoints="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379" \
--cert=/etc/etcd/cert/etcd.pem \
--key=/etc/etcd/cert/etcd-key.pem \
--cacert=/etc/kubernetes/cert/ca.pem endpoint health</code>
Start the kube‑apiserver services and confirm the cluster is healthy:
<code># systemctl start kube-apiserver
# kubectl get cs</code>
After restoration, pods gradually return to the Running state, indicating the Kubernetes cluster has been fully recovered.
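For a deeper check than endpoint health, etcdctl endpoint status reports each member's DB size, leader, and raft term, and kubectl can list any pods that have not yet returned to Running. The sketch below reuses the endpoints and certificate paths from the commands above; the command -v guards are only there so the snippet degrades gracefully on hosts missing the tools.

```shell
# Post-restore sanity checks (endpoints and cert paths assumed from the commands above)
ENDPOINTS="https://172.16.60.231:2379,https://172.16.60.232:2379,https://172.16.60.233:2379"

if command -v etcdctl >/dev/null 2>&1; then
  # All members should show the same raft term and exactly one leader
  ETCDCTL_API=3 etcdctl --endpoints="${ENDPOINTS}" \
    --cert=/etc/etcd/cert/etcd.pem \
    --key=/etc/etcd/cert/etcd-key.pem \
    --cacert=/etc/kubernetes/cert/ca.pem \
    endpoint status --write-out=table
fi

if command -v kubectl >/dev/null 2>&1; then
  # List any pods that have not yet returned to Running
  kubectl get pods --all-namespaces --field-selector=status.phase!=Running || true
fi
```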
3. Final Summary
Kubernetes backup focuses on etcd. The restoration order is:
Stop kube‑apiserver
Stop etcd
Restore etcd data
Start etcd
Start kube‑apiserver
Key points:
Only one etcd node needs to be backed up; the snapshot can be synchronized to other nodes.
A single node’s snapshot is enough to restore the whole cluster, but the restore command must be run on each etcd node.
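The five-step restoration order can be sketched as one helper script run on each etcd node. This is a sketch rather than the original author's tooling: NODE_NAME, NODE_IP, and SNAPSHOT are assumed defaults taken from the walkthrough and must be set per node.

```shell
#!/bin/bash
# restore_etcd.sh - hypothetical per-node restore helper following the order above.
# Set NODE_NAME/NODE_IP for each node, e.g. k8s-etcd02 / 172.16.60.232.
NODE_NAME=${NODE_NAME:-k8s-etcd01}
NODE_IP=${NODE_IP:-172.16.60.231}
SNAPSHOT=${SNAPSHOT:-/data/etcd_backup_dir/etcd-snapshot-20200820.db}

restore_node() {
  systemctl stop kube-apiserver                    # 1. stop kube-apiserver (on every master)
  systemctl stop etcd                              # 2. stop etcd
  rm -rf /data/k8s/etcd/data /data/k8s/etcd/wal    #    old data must be gone or restore fails
  ETCDCTL_API=3 etcdctl \
    --name="${NODE_NAME}" \
    --initial-cluster-token=etcd-cluster-0 \
    --initial-advertise-peer-urls="https://${NODE_IP}:2380" \
    --initial-cluster=k8s-etcd01=https://172.16.60.231:2380,k8s-etcd02=https://172.16.60.232:2380,k8s-etcd03=https://172.16.60.233:2380 \
    --data-dir=/data/k8s/etcd/data \
    --wal-dir=/data/k8s/etcd/wal \
    snapshot restore "${SNAPSHOT}"                 # 3. restore etcd data
  systemctl start etcd                             # 4. start etcd
  systemctl start kube-apiserver                   # 5. start kube-apiserver
}

# Run only on hosts that actually have the tooling installed
if command -v etcdctl >/dev/null 2>&1 && command -v systemctl >/dev/null 2>&1; then
  restore_node
fi
```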
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on the evolution of operations work and hope to accompany you throughout your operations career.