
Building a High‑Availability etcd Backup Solution for Kubernetes

This article explains how to design and implement a reliable etcd disaster‑recovery system for a Kubernetes‑based platform, covering backup scheduling, data synchronization across sites, code structure, and the advantages and drawbacks of the approach.

360 Zhihui Cloud Developer

The article focuses on the etcd backup design for the Hulk virtualization team's Stark platform, which is built on Kubernetes and Docker. Since etcd stores critical cluster state, its loss would cripple the scheduler, making robust backup and disaster recovery essential.

Background

Docker is praised for its elegance, but many overlook the underlying technologies. This piece explores etcd, the key‑value store used by Kubernetes, its role in storing cluster information, container metadata, and network configuration, and shares practical experience from production use.

Solution Design

In a healthy cluster, a single leader holds the latest data. The system periodically creates a tarball of the leader’s data, pushes it to a storage middleware, which then replicates the backup to another data center, retaining the most recent two days of data.
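One way a backup script could decide whether the local member is the leader is sketched below. It assumes the etcd v2 HTTP API, where the /v2/stats/self endpoint reports a "state" field such as "StateLeader" or "StateFollower". The function names and any client URL passed in (for example http://127.0.0.1:2379) are illustrative and do not come from the original project.

```python
import json
from urllib.request import urlopen


def is_leader(self_stats: dict) -> bool:
    """Return True if a /v2/stats/self payload reports this member as leader."""
    return self_stats.get("state") == "StateLeader"


def fetch_self_stats(client_url: str, timeout: float = 3.0) -> dict:
    """Fetch this member's statistics from the etcd v2 stats endpoint.

    client_url is an assumed client address such as "http://127.0.0.1:2379".
    """
    with urlopen(f"{client_url}/v2/stats/self", timeout=timeout) as resp:
        return json.load(resp)
```

A node would call is_leader(fetch_self_stats(url)) and skip the backup run when the result is False, so that only the leader's data is archived.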

The backup architecture treats all nodes equally; the cluster continues serving requests while backups run, and automatic failover occurs if a node fails.

To prevent overlapping backups, a file lock (flock) is employed.

*/5 * * * * flock -xn /root/etcd_DR/bin/../conf/mytest.lock -c 'python /root/etcd_DR/bin/rsync_remote_backup.py --hostlist /root/etcd_DR/bin/../conf/hostlist.cnf >> /root/etcd_DR/bin/../log/testlo
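The same non-blocking exclusive lock that flock -xn provides on the command line can be taken from inside a Python script. The sketch below uses the standard fcntl module; the helper name exclusive_lock is hypothetical and is not taken from rsync_remote_backup.py.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path: str):
    """Yield True if the lock was acquired, False if another holder has it.

    Mirrors `flock -xn`: exclusive (LOCK_EX) and non-blocking (LOCK_NB).
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Another run still holds the lock; the caller should skip this run.
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A caller would wrap the backup in `with exclusive_lock(path) as acquired:` and return early when acquired is False, which matches the skip-if-busy behavior of the cron entry.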

etcd also ships a simple built-in backup command. The help output below is from the etcdctl v2 API; etcd v3 uses etcdctl snapshot save instead:

etcdctl backup --help
NAME:
  etcdctl backup - backup an etcd directory
USAGE:
  etcdctl backup [command options]
OPTIONS:
  --data-dir        Path to the etcd data dir
  --wal-dir         Path to the etcd wal dir
  --backup-dir      Path to the backup dir
  --backup-wal-dir  Path to the backup wal dir
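As an illustration of how a script might drive this command, the sketch below assembles the argument list from the four options shown in the help text and runs it with subprocess. The function names and the example paths are assumptions, not part of the original project.

```python
import subprocess
from typing import List, Optional


def build_backup_command(data_dir: str, backup_dir: str,
                         wal_dir: Optional[str] = None,
                         backup_wal_dir: Optional[str] = None) -> List[str]:
    """Build the argv for `etcdctl backup` using only the documented options."""
    cmd = ["etcdctl", "backup", "--data-dir", data_dir, "--backup-dir", backup_dir]
    if wal_dir:
        cmd += ["--wal-dir", wal_dir]
    if backup_wal_dir:
        cmd += ["--backup-wal-dir", backup_wal_dir]
    return cmd


def run_backup(data_dir: str, backup_dir: str) -> None:
    """Run the backup; raises CalledProcessError if etcdctl exits non-zero."""
    subprocess.run(build_backup_command(data_dir, backup_dir), check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems when a directory path contains spaces.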

Project Structure

[root@k0149v ~]# tree etcd_DR/etcd_DR/
├── bin
│   ├── etcd
│   ├── init
│   └── rsync_remote_backup.py
├── changelist
├── conf
│   ├── hostlist.cnf
│   ├── members
│   ├── mytest.lock
│   └── option.spt
├── etcd_DR.sh
├── log
│   ├── etcdlog_-t_start_2016-08-30_12:31:00.log
│   └── testlog
├── README
└── status.log
3 directories, 13 files

bin/init is the environment initialization script; it runs before the cron‑triggered rsync_remote_backup.py.

etcd_DR.sh invokes bin/etcd to add, modify, or delete crontab entries, which starts or stops the system.

bin/rsync_remote_backup.py is the core component: it checks node health, uses etcdctl backup to package the data, and pushes the archive to the storage middleware, retaining only the latest three days.
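The retention step could look roughly like the sketch below, which deletes archives older than a configurable number of days. It assumes backups are .tar.gz files in a single directory and that file modification time reflects backup time; the function name and parameters are hypothetical.

```python
import os
import time
from typing import List, Optional


def prune_old_backups(backup_dir: str, keep_days: int,
                      now: Optional[float] = None) -> List[str]:
    """Delete *.tar.gz files older than keep_days and return their names."""
    now = time.time() if now is None else now
    cutoff = now - keep_days * 86400  # seconds in a day
    removed = []
    for name in sorted(os.listdir(backup_dir)):
        path = os.path.join(backup_dir, name)
        if not (name.endswith(".tar.gz") and os.path.isfile(path)):
            continue  # leave non-archive files untouched
        if os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(name)
    return removed
```

Keeping the retention window as a parameter lets the same routine serve sites with different retention policies.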

status.log records system status codes (e.g., ONGOING=2, STOP=3, INTERACTION=7) for troubleshooting.
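For troubleshooting, the numeric codes can be mapped back to names. The sketch below covers only the three codes the article lists; the actual layout of status.log is not described, so the helper simply translates one integer and reports anything else as unknown.

```python
from enum import IntEnum


class Status(IntEnum):
    """Status codes named in the article; the real system may define more."""
    ONGOING = 2
    STOP = 3
    INTERACTION = 7


def describe(code: int) -> str:
    """Translate a numeric status code into its name, or flag it as unknown."""
    try:
        return Status(code).name
    except ValueError:
        return f"UNKNOWN({code})"
```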

Advantages

High switch‑over success rate: backup components run on every etcd node, eliminating single‑point failures during failover.

Fast switch‑over: new leader election occurs in milliseconds.

Near‑real‑time data: with the cron job firing every five minutes, backup data lags the cluster by roughly one backup interval at most.

Low data loss: as long as any node remains alive, the backup system can retrieve data.

Drawbacks

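Potential stall from the flock file lock: if a transfer hangs while holding the lock, every later cron run exits immediately (because of the -n flag) and no new backup is taken until the stuck process is cleared.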

Real‑time sync demands a stable network; however, the current environment meets this requirement.

Conclusion

Cluster failure recovery: If the entire cluster becomes unusable, rebuild it by restoring the latest backup to any node using the --force-new-cluster flag, which resets the cluster ID and member list, then manually add remaining nodes.
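The recovery sequence can be written out as the two commands an operator would run. The sketch below only builds the argument lists and executes nothing; it assumes the etcd v2 command-line syntax (etcd --data-dir ... --force-new-cluster, then etcdctl member add NAME PEER_URL), and the paths, member name, and peer URL in any call are placeholders.

```python
from typing import List, Tuple


def restore_commands(restored_data_dir: str, member_name: str,
                     peer_url: str) -> Tuple[List[str], List[str]]:
    """Return (start_cmd, add_member_cmd) for a forced single-node rebuild.

    start_cmd brings up one node from the restored backup as a new cluster;
    add_member_cmd registers one further node, and is repeated per node.
    """
    start_cmd = ["etcd", "--data-dir", restored_data_dir, "--force-new-cluster"]
    add_member_cmd = ["etcdctl", "member", "add", member_name, peer_url]
    return start_cmd, add_member_cmd
```

Once the first node is healthy, it should be restarted without --force-new-cluster so that a later restart does not reset the membership again.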

Handling backup conflicts: Include timestamps in backup filenames and record the backup location for troubleshooting. Example backup file:

-rw-r--r--. 1 root root 315075 Aug 30 17:15 2016-August-30_17-15-01_k0608v.add.bjyt.qihoo.net.tar.gz
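A filename in this shape can be produced with strftime and the node's hostname. The sketch below reproduces the pattern of the example file and packs a directory into a gzip tarball; the function names are hypothetical, and the month name assumes the default C/English locale.

```python
import os
import tarfile
from datetime import datetime


def backup_filename(hostname: str, when: datetime) -> str:
    """Build a name like 2016-August-30_17-15-01_<hostname>.tar.gz."""
    return f"{when.strftime('%Y-%B-%d_%H-%M-%S')}_{hostname}.tar.gz"


def make_archive(source_dir: str, dest_dir: str,
                 hostname: str, when: datetime) -> str:
    """Pack source_dir into a timestamped .tar.gz under dest_dir; return its path."""
    archive_path = os.path.join(dest_dir, backup_filename(hostname, when))
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store the directory under its own base name rather than the full path.
        tar.add(source_dir, arcname=os.path.basename(os.path.normpath(source_dir)))
    return archive_path
```

Because both the timestamp and the hostname are part of the name, archives produced by different nodes or different runs cannot overwrite each other.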

The solution has been running continuously on the Stark platform with healthy backup status, and ongoing monitoring will continue.
