
Why Skipping Backups Makes Kubernetes Operations Impossible

The article explains that running production Kubernetes clusters without regular backup and recovery plans exposes businesses to severe risks such as cluster failures, data loss, and prolonged downtime, and it details practical etcd physical and Velero logical backup strategies to mitigate these threats.

ITPUB

Kubernetes is the dominant container platform, and production clusters face operational risks such as cluster anomalies, accidental deletions, and etcd data corruption, which can cause service interruption or data loss.

Backup solution selection

etcd physical backup

Applicable scenario: cluster‑level failures (e.g., etcd crash, whole‑cluster outage).

Advantages: very fast backup/restore (minutes), suitable for emergency recovery.

Disadvantages: can only restore the entire cluster, not individual applications.

Velero logical backup

Applicable scenario: business‑level failures (e.g., accidental namespace deletion, application faults).

Advantages: can back up and restore specific namespaces or resources; backup files are editable.

Disadvantages: slower backup/restore and cannot recover etcd‑level cluster configuration.

Best practice: combine both methods. Hourly etcd physical backups provide baseline protection against cluster-level failures, and daily Velero logical backups provide fine-grained protection against application-level failures.

Etcd Physical Backup

Applicable scenarios:

Routine cluster inspection, where taking a fresh snapshot is a required step.

Before upgrading the cluster or modifying core components, when a manual backup should be triggered so that a mistake can be rolled back.

Create a backup script that runs automatically, retains backups for seven days, and prevents disk exhaustion.

#!/usr/bin/env bash
set -euo pipefail  # exit on errors, unset variables, and failed pipeline stages

# Certificate paths (kubeadm defaults; adjust if your cluster stores them elsewhere)
ETCD_CA_CERT="/etc/kubernetes/pki/etcd/ca.crt"
ETCD_CERT="/etc/kubernetes/pki/etcd/server.crt"
ETCD_KEY="/etc/kubernetes/pki/etcd/server.key"
BACKUP_DIR="/opt/etcd_backup"  # backup storage path

# Create the backup directory if it does not exist
mkdir -p "${BACKUP_DIR}"

# Take the etcd snapshot first, so that a failed backup never triggers cleanup
ETCDCTL_API=3 /usr/local/bin/etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert="${ETCD_CA_CERT}" --cert="${ETCD_CERT}" --key="${ETCD_KEY}" \
  snapshot save "${BACKUP_DIR}/etcd-snapshot-$(date +%Y%m%d.%H%M%S).db"

# Delete snapshots that are more than 7 days old
find "${BACKUP_DIR}" -name "*.db" -mtime +7 -exec rm -f {} \;
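
Age-based cleanup ties retention to the clock rather than to the number of usable snapshots on disk. As a possible alternative, the sketch below keeps the newest N snapshot files by count. The helper name `prune_snapshots` is hypothetical (it is not part of etcd), and the sketch assumes the `etcd-snapshot-YYYYmmdd.HHMMSS.db` naming used by the backup script, which sorts chronologically.

```shell
#!/usr/bin/env bash
# Sketch: keep only the newest N snapshots in a directory.
# prune_snapshots is a hypothetical helper, not an etcd command.
prune_snapshots() {
  local dir="$1" keep="$2" excess
  # The glob expands in lexical order; timestamped names sort oldest first.
  set -- "${dir}"/etcd-snapshot-*.db
  # If nothing matched, there is nothing to prune.
  [ -e "$1" ] || return 0
  excess=$(( $# - keep ))
  # Remove files from the old end of the list until only "keep" remain.
  while [ "${excess}" -gt 0 ]; do
    rm -f -- "$1"
    shift
    excess=$(( excess - 1 ))
  done
}
```

With hourly backups, calling `prune_snapshots /opt/etcd_backup 168` would keep roughly seven days of snapshots.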

Set up a cron job to run the script hourly.

# Edit crontab
crontab -e
# Add the following line to execute the backup script every hour
0 */1 * * * /bin/bash /opt/etcd_backup.sh >> /opt/etcd_backup.log 2>&1
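
An hourly job only helps if the files it produces are readable, so it may be worth checking each snapshot after it is written. The sketch below is one way to do that with `etcdctl snapshot status`. `verify_snapshot` and the `ETCDCTL_BIN` variable are hypothetical names introduced for this example. On etcd 3.5 and later, `etcdctl snapshot status` is deprecated in favor of `etcdutl snapshot status`, so the binary may need to be swapped.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a snapshot file before trusting it.
# ETCDCTL_BIN and verify_snapshot are hypothetical names for this example.
ETCDCTL_BIN="${ETCDCTL_BIN:-/usr/local/bin/etcdctl}"

verify_snapshot() {
  local snapshot="$1"
  # A missing or zero-byte file usually means the save was interrupted.
  if [ ! -s "${snapshot}" ]; then
    echo "snapshot missing or empty: ${snapshot}" >&2
    return 1
  fi
  # "snapshot status" reads the file offline and prints its hash, revision,
  # key count, and size; the function returns the command's exit status.
  ETCDCTL_API=3 "${ETCDCTL_BIN}" snapshot status "${snapshot}" --write-out=table
}
```

A cron wrapper could call `verify_snapshot` on the newest file and raise an alert when it returns a non-zero status.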

Etcd Restore

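Applicable scenario: etcd data corruption or a completely unavailable cluster.

Restoring overwrites all current etcd data and returns the cluster to the state captured in the snapshot. Anything written after the snapshot was taken is lost, so restore from the most recent healthy snapshot.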

Prepare for restore: stop the kube-apiserver and etcd services on every control-plane node, then move or remove the old etcd data directory.
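
How the services are stopped depends on how the cluster was installed. The sketch below assumes a kubeadm-style control-plane node, where kube-apiserver and etcd run as static pods and the kubelet stops a pod once its manifest leaves the manifest directory. The function names, the holding directory, and the default paths are assumptions for illustration. On clusters that run etcd under systemd, `systemctl stop etcd` would take the place of the first function.

```shell
#!/usr/bin/env bash
# Sketch for a kubeadm-style control-plane node (static pods).
# Function names and default paths are assumptions for this example.

# Move the kube-apiserver and etcd manifests aside so the kubelet stops them.
stop_static_pods() {
  local manifests="${1:-/etc/kubernetes/manifests}"
  local holding="${2:-/etc/kubernetes/manifests.stopped}"
  local name
  mkdir -p "${holding}"
  for name in kube-apiserver etcd; do
    if [ -f "${manifests}/${name}.yaml" ]; then
      mv "${manifests}/${name}.yaml" "${holding}/"
    fi
  done
}

# Rename the old data directory instead of deleting it, so it can still be
# inspected later; the restore needs a data directory that holds no data.
set_aside_data_dir() {
  local data_dir="${1:-/var/lib/etcd}"
  if [ -d "${data_dir}" ]; then
    mv "${data_dir}" "${data_dir}.bak.$(date +%Y%m%d.%H%M%S)"
  fi
}
```

After the restore has completed, moving the manifests back into the manifest directory lets the kubelet start both components again.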

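Execute the restore command on each etcd member, using the same snapshot file everywhere but that member's own name and peer URL:

ETCDCTL_API=3 /usr/local/bin/etcdctl snapshot restore /tmp/etcd-snapshot-xxx.db \
  --name etcd1 \
  --initial-cluster "etcd1=https://<ip>:2380,etcd2=https://<ip>:2380,etcd3=https://<ip>:2380" \
  --initial-advertise-peer-urls https://<ip>:2380 \
  --data-dir=/var/lib/etcd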

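Verify the restore once etcd is running again (on a TLS-enabled cluster the health check needs the same endpoint and certificates as the backup script):

ETCDCTL_API=3 /usr/local/bin/etcdctl --endpoints=https://127.0.0.1:2379 endpoint health \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key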

Velero Logical Backup

Applicable scenarios:

Accidental deletion of a business namespace (e.g., prod).

Backing up the current environment before a release to enable rollback.

Cross‑cluster migration of workloads.

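Install the Velero CLI (the server-side components must also be deployed into the cluster with velero install and a configured backup storage location before any backup can run):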

wget https://github.com/vmware-tanzu/velero/releases/download/v1.8.1/velero-v1.8.1-linux-amd64.tar.gz
tar -xvf velero-v1.8.1-linux-amd64.tar.gz
cp velero-v1.8.1-linux-amd64/velero /usr/bin/
chmod +x /usr/bin/velero
velero version

Create a scheduled backup task, e.g., daily backup of the prod namespace:

velero create schedule prod-daily-backup \
  --schedule="0 1 * * *" \
  --include-namespaces=prod \
  --ttl=168h
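
The schedule covers the daily case, while the pre-release scenario listed earlier calls for an on-demand backup taken just before deployment. A minimal sketch of such a step follows. `pre_release_backup` and the `VELERO_BIN` variable are hypothetical names, and the backup naming scheme is an assumption rather than a Velero convention.

```shell
#!/usr/bin/env bash
# Sketch: take an on-demand backup of one namespace before a release.
# VELERO_BIN and pre_release_backup are hypothetical names for this example.
VELERO_BIN="${VELERO_BIN:-velero}"

pre_release_backup() {
  local namespace="$1" ttl="${2:-168h}"
  local name="${namespace}-pre-release-$(date +%Y%m%d%H%M%S)"
  # --wait blocks until the backup finishes, so a release pipeline only
  # continues once a restore point exists.
  "${VELERO_BIN}" backup create "${name}" \
    --include-namespaces "${namespace}" \
    --ttl "${ttl}" \
    --wait
}
```

Running `pre_release_backup prod` would create a backup named along the lines of `prod-pre-release-<timestamp>`, which can later be passed to `velero restore create --from-backup`.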

Velero Restore

Applicable scenarios:

Accidental deletion or corruption of a namespace or its resources.

Rollback after a failed release.

Restore an entire namespace (common case):

velero restore create --from-backup prod-daily-backup-xxx

Restore specific resources, e.g., only the Deployments in the prod namespace:

velero restore create \
  --from-backup prod-daily-backup-xxx \
  --include-resources=deployments \
  --include-namespaces=prod
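
By default, a Velero restore skips resources that already exist in the cluster rather than overwriting them, so restoring into a live namespace may change less than expected. One way to inspect a backup safely, or to rehearse a restore as part of regular verification, is to map the namespace to a scratch name. The sketch below assumes a scratch namespace of your choosing; `restore_to_scratch` and `VELERO_BIN` are hypothetical names for this example.

```shell
#!/usr/bin/env bash
# Sketch: restore a backup into a scratch namespace for inspection.
# VELERO_BIN and restore_to_scratch are hypothetical names for this example.
VELERO_BIN="${VELERO_BIN:-velero}"

restore_to_scratch() {
  local backup="$1" source_ns="$2" scratch_ns="$3"
  # --namespace-mappings rewrites the namespace of every restored object,
  # leaving whatever currently runs in the source namespace untouched.
  "${VELERO_BIN}" restore create \
    --from-backup "${backup}" \
    --include-namespaces "${source_ns}" \
    --namespace-mappings "${source_ns}:${scratch_ns}" \
    --wait
}
```

The available backup names can be listed with `velero backup get`; a call such as `restore_to_scratch prod-daily-backup-xxx prod prod-restore-test` would then recreate the backed-up objects under `prod-restore-test`.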

Conclusion

Etcd physical backups protect overall cluster health, while Velero logical backups address application‑level failures. Combining both provides dual protection, and regular verification ensures that data can be restored quickly when incidents occur.
