Why Skipping Backups Makes Kubernetes Operations Impossible
The article explains that running production Kubernetes clusters without regular backup and recovery plans exposes businesses to severe risks such as cluster failures, data loss, and prolonged downtime, and it details practical etcd physical and Velero logical backup strategies to mitigate these threats.
Kubernetes is the dominant container platform, and production clusters face operational risks such as cluster anomalies, accidental deletions, and etcd data corruption, which can cause service interruption or data loss.
Backup solution selection
etcd physical backup
Applicable scenario: cluster‑level failures (e.g., etcd crash, whole‑cluster outage).
Advantages: very fast backup/restore (minutes), suitable for emergency recovery.
Disadvantages: can only restore the entire cluster, not individual applications.
Velero logical backup
Applicable scenario: business‑level failures (e.g., accidental namespace deletion, application faults).
Advantages: can back up and restore specific namespaces or resources; backup files are editable.
Disadvantages: slower backup/restore and cannot recover etcd‑level cluster configuration.
Best practice: combine both methods. Hourly etcd physical backups serve as baseline protection and daily Velero logical backups as fine‑grained protection, so that both cluster‑level and application‑level failures are covered.
Etcd Physical Backup
Applicable scenarios:
Routine backups taken as part of regular cluster inspection.
Before upgrading the cluster or modifying core components, trigger a manual backup so that a mistake can be rolled back.
Create a backup script that runs automatically and retains backups for seven days, so that old snapshots do not exhaust the disk.
#!/usr/bin/env bash
set -euo pipefail # exit on errors, unset variables, and failed pipeline stages
# Certificate paths (kubeadm defaults; adjust if your cluster stores them elsewhere)
ETCD_CA_CERT="/etc/kubernetes/pki/etcd/ca.crt"
ETCD_CERT="/etc/kubernetes/pki/etcd/server.crt"
ETCD_KEY="/etc/kubernetes/pki/etcd/server.key"
BACKUP_DIR="/opt/etcd_backup" # backup storage path
# Create the backup directory if it does not exist
mkdir -p "${BACKUP_DIR}"
# Take the etcd snapshot
ETCDCTL_API=3 /usr/local/bin/etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert="${ETCD_CA_CERT}" --cert="${ETCD_CERT}" --key="${ETCD_KEY}" \
snapshot save "${BACKUP_DIR}/etcd-snapshot-$(date +%Y%m%d.%H%M%S).db"
# Delete backups older than 7 days (runs only if the new snapshot succeeded)
find "${BACKUP_DIR}" -name "*.db" -mtime +7 -exec rm -f {} \;
Set up a cron job to run the script hourly.
# Edit crontab
crontab -e
# Add the following line to execute the backup script every hour
0 */1 * * * /bin/bash /opt/etcd_backup.sh >> /opt/etcd_backup.log 2>&1
Etcd Restore
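Applicable scenario: etcd data corruption or a completely unavailable cluster.
The restore overwrites all current etcd data, so any change made after the snapshot was taken is lost. Restore from the most recent valid snapshot.
Prepare for restore: stop the kube-apiserver and etcd services and clear the old etcd data directory (or move it aside), because the restore expects to create a fresh data directory.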
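The preparation step can be sketched as follows. This assumes a kubeadm-style control plane in which kube-apiserver and etcd run as static pods defined under /etc/kubernetes/manifests; the original does not say how the services are managed, so treat that as an assumption. The function names and the stash directory are hypothetical. Instead of deleting the old data, the sketch renames it so it can still be inspected later.

```shell
#!/usr/bin/env bash
# Sketch only: assumes a kubeadm-style control plane (static pods).

# Move the kube-apiserver and etcd manifests out of the static pod directory;
# the kubelet stops a static pod once its manifest disappears.
stop_control_plane() {
  local manifest_dir="${1:-/etc/kubernetes/manifests}"
  local stash_dir="${2:-/opt/etcd_restore_stash}"   # hypothetical location
  local name
  mkdir -p "${stash_dir}" || return 1
  for name in kube-apiserver.yaml etcd.yaml; do
    if [ -f "${manifest_dir}/${name}" ]; then
      mv "${manifest_dir}/${name}" "${stash_dir}/" || return 1
    fi
  done
}

# Rename the old data directory so the restore can create a fresh one and the
# previous state stays available for inspection.
stash_etcd_data() {
  local data_dir="${1:-/var/lib/etcd}"
  if [ -d "${data_dir}" ]; then
    mv "${data_dir}" "${data_dir}.bak-$(date +%Y%m%d.%H%M%S)"
  fi
}
```

Call stop_control_plane first, confirm that the etcd and kube-apiserver containers have actually exited (for example with crictl ps), and only then call stash_etcd_data. Moving the manifests back into place after the restore starts both components again.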
Execute the restore command:
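# Run on every etcd member with the same snapshot; --name and --initial-advertise-peer-urls are that member's own values
ETCDCTL_API=3 /usr/local/bin/etcdctl snapshot restore /tmp/etcd-snapshot-xxx.db \
--name etcd1 \
--initial-cluster "etcd1=https://<ip>:2380,etcd2=https://<ip>:2380,etcd3=https://<ip>:2380" \
--initial-advertise-peer-urls https://<ip>:2380 \
--data-dir=/var/lib/etcd
After all members are restored, start etcd and kube-apiserver again, then verify the restore:
ETCDCTL_API=3 /usr/local/bin/etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint health
Velero Logical Backup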
Applicable scenarios:
Accidental deletion of a business namespace (e.g., prod).
Backing up the current environment before a release to enable rollback.
Cross‑cluster migration of workloads.
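Install the Velero CLI (the server‑side components must also be deployed to the cluster with velero install and an object storage provider, which is not shown here):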
wget https://github.com/vmware-tanzu/velero/releases/download/v1.8.1/velero-v1.8.1-linux-amd64.tar.gz
tar -xvf velero-v1.8.1-linux-amd64.tar.gz
cp velero-v1.8.1-linux-amd64/velero /usr/bin/
chmod +x /usr/bin/velero
velero version
Create a scheduled backup task, e.g., daily backup of the prod namespace:
velero create schedule prod-daily-backup \
--schedule="0 1 * * *" \
--include-namespaces=prod \
--ttl=168h
Velero Restore
Applicable scenarios:
Accidental deletion or corruption of a namespace or its resources.
Rollback after a failed release.
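The restore commands in this section need a concrete backup name. Backups created by a schedule are named after the schedule with a timestamp suffix, so the newest one can be picked with a small helper. The following is a minimal sketch; the function name latest_backup_for_schedule is hypothetical, and the usage line assumes the default table output of velero backup get, with the name in the first column and the status in the second.

```shell
#!/usr/bin/env bash
# Sketch: pick the most recent backup created by a Velero schedule.

# Reads backup names (one per line) on stdin and prints the newest one whose
# name is "<schedule>-<digits>". The timestamp suffix has a fixed width, so
# lexicographic order matches chronological order.
latest_backup_for_schedule() {
  local schedule="$1"
  grep -E "^${schedule}-[0-9]+$" | sort | tail -n 1
}

# Usage (requires a working Velero installation; column layout is an assumption):
#   velero backup get | awk 'NR > 1 && $2 == "Completed" { print $1 }' \
#     | latest_backup_for_schedule prod-daily-backup
```

Filtering on the Completed status keeps a failed or partially failed backup from being chosen just because it is the newest.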
Restore an entire namespace (common case):
velero restore create --from-backup prod-daily-backup-xxx
Restore specific resources, e.g., only the Deployments in the prod namespace:
velero restore create \
--from-backup prod-daily-backup-xxx \
--include-resources=deployments \
--include-namespaces=prod
Conclusion
Etcd physical backups protect overall cluster health, while Velero logical backups address application‑level failures. Combining both provides dual protection, and regular verification ensures that data can be restored quickly when incidents occur.
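Regular verification can start with a cheap freshness check on the snapshot directory. The sketch below is one possible approach rather than part of the original procedure: it reports whether a non-empty snapshot newer than a given age exists. The function name and the 120-minute default are assumptions; the default directory is the one used by the backup script.

```shell
#!/usr/bin/env bash
# Sketch: fail if no recent, non-empty etcd snapshot exists in the backup dir.

check_snapshot_freshness() {
  local dir="${1:-/opt/etcd_backup}"
  local max_age_min="${2:-120}"   # assumed threshold: two missed hourly runs
  local fresh
  # -size +0 keeps only non-empty files; -mmin -N keeps files modified less
  # than N minutes ago.
  fresh="$(find "${dir}" -maxdepth 1 -name '*.db' -size +0 -mmin "-${max_age_min}" | head -n 1)"
  if [ -z "${fresh}" ]; then
    echo "no non-empty snapshot newer than ${max_age_min} minutes in ${dir}" >&2
    return 1
  fi
  echo "${fresh}"
}
```

A fresh file is not proof of a usable backup. etcdctl snapshot status on the reported file shows its hash, revision, and key count, and a periodic restore drill into a scratch data directory is the stronger test.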
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
