
How to Safely Backup and Restore Etcd in Kubernetes: A Step‑by‑Step Guide

This article explains why regular Etcd snapshots are essential for Kubernetes disaster recovery and provides detailed, command‑line procedures for restoring Etcd data on both single‑node and high‑availability clusters, including necessary configuration adjustments and verification steps.

Open Source Linux

1. Overview

In a Kubernetes cluster, all operational resource data is stored in the Etcd database. To ensure rapid recovery after node failures, cluster migrations, or other anomalies, regular disaster‑recovery backups of Etcd data are required.

Etcd makes backup straightforward: a snapshot taken on a single member captures the entire cluster state. With that snapshot, the cluster can be restored even if all control‑plane nodes are lost.

Note: Even in a highly available Etcd cluster, a backup on one node is sufficient, but it is strongly recommended to back up on all Etcd nodes and regularly copy snapshots to a dedicated storage server.
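The backup itself is a single etcdctl snapshot save call. A minimal sketch follows, assuming the certificate paths, endpoint, and snapshot naming (etcd-2024-0206-snapshot.db) used in the restore examples of section 2.1. The function names snapshot_path and backup_etcd are illustrative and not part of etcdctl.

```shell
#!/usr/bin/env bash
# Sketch of an Etcd snapshot backup. Paths, endpoint, and function names
# are assumptions taken from the examples in this article; adjust them.

BACKUP_DIR="${BACKUP_DIR:-/var/backups/kube_etcd}"
ENDPOINT="${ENDPOINT:-https://10.20.30.31:2379}"

# Build the snapshot file name, e.g. etcd-2024-0206-snapshot.db.
# $1 = backup directory, $2 = date stamp (defaults to today as YYYY-MMDD).
snapshot_path() {
  local dir="$1"
  local stamp="${2:-$(date +%Y-%m%d)}"
  printf '%s/etcd-%s-snapshot.db\n' "$dir" "$stamp"
}

# Save a snapshot from one endpoint into the backup directory.
backup_etcd() {
  local target
  target="$(snapshot_path "$BACKUP_DIR")"
  mkdir -p "$BACKUP_DIR" || return 1
  ETCDCTL_API=3 etcdctl \
    --cacert=/opt/etcd/ssl/ca.pem \
    --cert=/opt/etcd/ssl/server.pem \
    --key=/opt/etcd/ssl/server-key.pem \
    --endpoints="$ENDPOINT" \
    snapshot save "$target"
}
```

Calling backup_etcd from a cron job or systemd timer, and then copying the resulting file to the storage server, would cover the "regularly" part of the recommendation.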

2. Practical Etcd Snapshot Restoration

2.1 Single‑Node Recovery

Description: When a single node’s resource data is lost, the following steps restore the data quickly.

Procedure:

(1) Stop the Etcd service on the node:

systemctl stop etcd

(2) Back up the Etcd data directory:

mv /var/lib/etcd /var/lib/etcd.bak

(3) Restore Etcd data using the snapshot file:

etcdctl --cacert=/opt/etcd/ssl/ca.pem \
  --cert=/opt/etcd/ssl/server.pem \
  --key=/opt/etcd/ssl/server-key.pem \
  --endpoints 10.20.30.31:2379 \
  snapshot restore /var/backups/kube_etcd/etcd-2024-0206-snapshot.db \
  --name=etcd01 \
  --initial-cluster=etcd01=https://10.20.30.31:2380 \
  --initial-advertise-peer-urls=https://10.20.30.31:2380 \
  --data-dir=/var/lib/etcd
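
Note 1: The etcdctl client uses the v3 API by default since Etcd 3.4; on older releases, set ETCDCTL_API=3.
Note 2: Replace IP addresses, certificates, keys, and snapshot file paths with those of your actual cluster.
Note 3: snapshot restore works on the local snapshot file and does not contact the cluster, so the certificate and --endpoints flags have no effect on this command. On Etcd 3.5 and later, etcdctl snapshot restore is deprecated in favor of etcdutl snapshot restore.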
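The restore expects a readable snapshot and a target data directory that is not already in use, which is why step (2) moves /var/lib/etcd aside first. A small pre-flight check can catch both problems before the restore runs. This is only a sketch: check_restore_preconditions is a hypothetical helper, not an etcdctl feature, and it assumes the old directory was moved away entirely.

```shell
#!/usr/bin/env bash
# Pre-flight check before "snapshot restore" (illustrative helper, not an
# etcdctl feature). Returns 0 only when both checks pass.
# $1 = snapshot file, $2 = target data directory.
check_restore_preconditions() {
  local snapshot="$1"
  local data_dir="$2"

  # The snapshot must exist and be non-empty.
  if [ ! -s "$snapshot" ]; then
    echo "snapshot missing or empty: $snapshot" >&2
    return 1
  fi

  # The old data directory must already have been moved aside (step 2).
  if [ -e "$data_dir" ]; then
    echo "data directory still exists: $data_dir" >&2
    return 1
  fi

  return 0
}
```

For the example above, the call would be check_restore_preconditions /var/backups/kube_etcd/etcd-2024-0206-snapshot.db /var/lib/etcd, with the restore command run only if it returns success.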

(4) Start the Etcd service:

systemctl start etcd

(5) Verify Etcd node status:

etcdctl --cacert=/opt/etcd/ssl/ca.pem \
  --cert=/opt/etcd/ssl/server.pem \
  --key=/opt/etcd/ssl/server-key.pem \
  --endpoints "https://10.20.30.31:2379" endpoint status --write-out=table

2.2 High‑Availability Cluster Recovery

Restoring an HA Etcd cluster requires restoring each node individually, and every node must be restored from the same snapshot file. The example below uses a three‑node cluster.

(1) Gather node information for the HA cluster, then install a fresh Etcd service on each node (example service file shown).

# /usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
Wants=network-online.target

[Service]
Type=notify
EnvironmentFile=/opt/etcd/cfg/etcd.conf
ExecStart=/opt/etcd/bin/etcd
Restart=on-failure
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

Example member configuration in /opt/etcd/cfg/etcd.conf (node 10.20.31.103 shown):

# [Member]
ETCD_NAME="etcd01"
ETCD_DATA_DIR="/var/lib/etcd"
ETCD_LISTEN_PEER_URLS="https://10.20.31.103:2380"
ETCD_LISTEN_CLIENT_URLS="https://10.20.31.103:2379,http://127.0.0.1:2379"

# [Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.20.31.103:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://10.20.31.103:2379"
ETCD_INITIAL_CLUSTER="etcd01=https://10.20.31.103:2380,etcd02=https://10.20.31.104:2380,etcd03=https://10.20.31.105:2380"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_ENABLE_V2="true"

# [Security]
ETCD_CERT_FILE="/opt/etcd/ssl/server.pem"
ETCD_KEY_FILE="/opt/etcd/ssl/server-key.pem"
ETCD_TRUSTED_CA_FILE="/opt/etcd/ssl/ca.pem"
ETCD_CLIENT_CERT_AUTH="true"
ETCD_PEER_CERT_FILE="/opt/etcd/ssl/server.pem"
ETCD_PEER_KEY_FILE="/opt/etcd/ssl/server-key.pem"
ETCD_PEER_TRUSTED_CA_FILE="/opt/etcd/ssl/ca.pem"
ETCD_PEER_CLIENT_CERT_AUTH="true"

Verify the new cluster with etcdctl:

/opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem \
  --cert=/opt/etcd/ssl/server.pem \
  --key=/opt/etcd/ssl/server-key.pem \
  --endpoints "https://10.20.31.103:2379,https://10.20.31.104:2379,https://10.20.31.105:2379" \
  endpoint status --write-out=table

(2) Stop the Etcd service on all nodes:

systemctl stop etcd

(3) Back up the Etcd data directory on each node:

mv /var/lib/etcd /var/lib/etcd.bak

(4) Restore each node from the same snapshot file (example for node 103):

/opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem \
  --cert=/opt/etcd/ssl/server.pem \
  --key=/opt/etcd/ssl/server-key.pem \
  snapshot restore snapshot.db \
  --name etcd01 \
  --initial-cluster=etcd01=https://10.20.31.103:2380,etcd02=https://10.20.31.104:2380,etcd03=https://10.20.31.105:2380 \
  --initial-cluster-token=etcd-cluster \
  --initial-advertise-peer-urls=https://10.20.31.103:2380 \
  --data-dir=/var/lib/etcd

Repeat the above command for nodes 104 and 105, adjusting --name and --initial-advertise-peer-urls accordingly.
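To make the per-node differences explicit, the sketch below prints the restore command for each of the three members instead of running it. The node list mirrors ETCD_INITIAL_CLUSTER from the example configuration, and the helper names initial_cluster and print_restore_cmd are illustrative. The certificate flags are left out here on the assumption that the restore only reads the local snapshot file.

```shell
#!/usr/bin/env bash
# Print the restore command for every member of the three-node example.
# The node list mirrors ETCD_INITIAL_CLUSTER; helper names are illustrative.

NODES="etcd01=10.20.31.103 etcd02=10.20.31.104 etcd03=10.20.31.105"

# Join the "name=ip" pairs into the --initial-cluster value.
initial_cluster() {
  local out="" node
  for node in $NODES; do
    out="${out:+$out,}${node%%=*}=https://${node#*=}:2380"
  done
  printf '%s\n' "$out"
}

# Print (not run) the restore command for one member.
# $1 = member name, $2 = member IP, $3 = snapshot file.
print_restore_cmd() {
  printf '%s snapshot restore %s --name=%s --initial-cluster=%s --initial-cluster-token=etcd-cluster --initial-advertise-peer-urls=https://%s:2380 --data-dir=/var/lib/etcd\n' \
    /opt/etcd/bin/etcdctl "$3" "$1" "$(initial_cluster)" "$2"
}

# One command per member; run each one on the node it names.
for node in $NODES; do
  print_restore_cmd "${node%%=*}" "${node#*=}" snapshot.db
done
```

Only --name and --initial-advertise-peer-urls change between the three printed commands; --initial-cluster and --initial-cluster-token stay identical on every node.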

(5) Start the Etcd service on all nodes:

systemctl start etcd

(6) Verify cluster status:

/opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem \
  --cert=/opt/etcd/ssl/server.pem \
  --key=/opt/etcd/ssl/server-key.pem \
  --endpoints "https://10.20.31.103:2379,https://10.20.31.104:2379,https://10.20.31.105:2379" \
  endpoint status --write-out=table

After these steps, the HA Etcd cluster is restored and ready to serve.
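Right after the services start, the members may need a short time to elect a leader, so a single status check can fail even though the restore succeeded. One way to handle this is a small retry loop around a health check. The sketch below assumes that etcdctl endpoint health exits with a non-zero status while any listed endpoint is unhealthy; retry_until_ok and cluster_healthy are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Retry a command until it succeeds or the attempts run out.
# $1 = max attempts, $2 = seconds between attempts, rest = command to run.
retry_until_ok() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt will follow.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Health check for the three-node example cluster (assumed paths and IPs).
cluster_healthy() {
  /opt/etcd/bin/etcdctl --cacert=/opt/etcd/ssl/ca.pem \
    --cert=/opt/etcd/ssl/server.pem \
    --key=/opt/etcd/ssl/server-key.pem \
    --endpoints "https://10.20.31.103:2379,https://10.20.31.104:2379,https://10.20.31.105:2379" \
    endpoint health
}
```

For example, retry_until_ok 10 3 cluster_healthy would try up to ten times with a three-second pause between attempts.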

3. Summary

With a single snapshot file, you can restore an Etcd cluster using etcdctl snapshot restore, which creates a new data directory on each node. The restore overwrites certain snapshot metadata (member ID, cluster ID), so each restored member loses its former identity and cannot accidentally join another existing cluster.
