Building a High‑Availability etcd Backup Solution for Kubernetes
This article explains how to design and implement a reliable etcd disaster‑recovery system for a Kubernetes‑based platform, covering backup scheduling, data synchronization across sites, code structure, and the advantages and drawbacks of the approach.
The article focuses on the etcd backup design for the Hulk virtualization team's Stark platform, which is built on Kubernetes and Docker. Since etcd stores critical cluster state, its loss would cripple the scheduler, making robust backup and disaster recovery essential.
Background
Docker is praised for its elegance, but many overlook the underlying technologies. This piece explores etcd, the key‑value store used by Kubernetes, its role in storing cluster information, container metadata, and network configuration, and shares practical experience from production use.
Solution Design
In a healthy cluster, a single leader holds the latest data. The system periodically creates a tarball of the leader’s data, pushes it to a storage middleware, which then replicates the backup to another data center, retaining the most recent two days of data.
The backup architecture treats all nodes equally; the cluster continues serving requests while backups run, and automatic failover occurs if a node fails.
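The packaging step of this design can be sketched in a few lines of Python. This is a minimal illustration, not the platform's actual code: the function name `make_tarball` and both paths are hypothetical, and the sketch assumes the directory being packed is a backup copy of the data rather than the live etcd data directory.

```python
import os
import tarfile


def make_tarball(backup_dir, archive_path):
    """Pack backup_dir into a gzip-compressed tar archive and return its path."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores only the directory's base name inside the archive,
        # so the absolute path on the source host is not embedded.
        base_name = os.path.basename(os.path.normpath(backup_dir))
        tar.add(backup_dir, arcname=base_name)
    return archive_path
```

The resulting archive is a single file, which is convenient to push to the storage middleware and to replicate to a second data center.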
To prevent overlapping backups, a file lock (flock) is employed.
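The production setup takes this lock from cron with the flock utility, as in the crontab entry after this sketch. For illustration only, the same non-blocking exclusive lock can be taken from inside a Python script with the standard fcntl module. The function name `try_lock` is hypothetical and is not part of the project.

```python
import fcntl


def try_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    handle = open(lock_path, "w")
    try:
        # LOCK_EX requests an exclusive lock (like flock -x);
        # LOCK_NB fails immediately instead of waiting (like flock -n).
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    # The lock is held until this file object is closed or the process exits.
    return handle
```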
*/5 * * * * flock -xn /root/etcd_DR/bin/../conf/mytest.lock -c 'python /root/etcd_DR/bin/rsync_remote_backup.py --hostlist /root/etcd_DR/bin/../conf/hostlist.cnf >> /root/etcd_DR/bin/../log/testlo
The official simple backup command is also provided:
etcdctl backup --help
NAME:
   etcdctl backup - backup an etcd directory
USAGE:
   etcdctl backup [command options]
OPTIONS:
   --data-dir          Path to the etcd data dir
   --wal-dir           Path to the etcd wal dir
   --backup-dir        Path to the backup dir
   --backup-wal-dir    Path to the backup wal dir
Project Structure
[root@k0149v ~]# tree etcd_DR/
etcd_DR/
├── bin
│ ├── etcd
│ ├── init
│ └── rsync_remote_backup.py
├── changelist
├── conf
│ ├── hostlist.cnf
│ ├── members
│ ├── mytest.lock
│ └── option.spt
├── etcd_DR.sh
├── log
│ ├── etcdlog_-t_start_2016-08-30_12:31:00.log
│ └── testlog
├── README
└── status.log
3 directories, 13 files
/bin/init is the environment initialization script executed before the cron‑triggered rsync_remote_backup.py.
etcd_DR.sh invokes /bin/etcd to add, modify, or delete crontab entries, controlling the system start/stop.
/bin/rsync_remote_backup.py is the core component that checks node health, uses etcdctl backup to package data, and pushes it to the storage middleware, retaining only the latest three days.
status.log records system status codes (e.g., ONGOING=2, STOP=3, INTERACTION=7) for troubleshooting.
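The retention step can be sketched as follows. This is an assumption-laden illustration rather than the logic of rsync_remote_backup.py: it assumes backup file names start with a timestamp in the format of the example file shown in the conclusion (2016-August-30_17-15-01_<host>.tar.gz), and the names `STAMP_FORMAT`, `backup_time`, and `expired` are hypothetical. The retention window is passed in as a parameter.

```python
from datetime import datetime, timedelta

# Matches a prefix such as "2016-August-30_17-15-01" (English month names).
STAMP_FORMAT = "%Y-%B-%d_%H-%M-%S"


def backup_time(filename):
    """Parse the timestamp prefix of a backup file name."""
    date_part, time_part, _host = filename.split("_", 2)
    return datetime.strptime(date_part + "_" + time_part, STAMP_FORMAT)


def expired(filenames, now, keep_days):
    """Return the names whose timestamp is older than keep_days before now."""
    cutoff = now - timedelta(days=keep_days)
    return [name for name in filenames if backup_time(name) < cutoff]
```

A caller would delete the files returned by `expired(names, datetime.now(), keep_days=3)` after a new backup has been uploaded successfully, so that a failed run never removes the last good copy.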
Advantages
High switch‑over success rate: backup components run on every etcd node, eliminating single‑point failures during failover.
Fast switch‑over: new leader election occurs in milliseconds.
Real‑time data: backup data stays synchronized with the cluster.
Low data loss: as long as any node remains alive, the backup system can retrieve data.
Drawbacks
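Potential blocking from the flock file lock: if a transfer hangs while holding the lock, every later cron run exits immediately (flock -n does not wait), so backups stop until the stuck process is cleared.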
Real‑time sync demands a stable network; however, the current environment meets this requirement.
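One way to limit the lock-related drawback is to bound how long a transfer may run, so that a hung transfer cannot hold the lock indefinitely. The sketch below is a suggestion and is not described in the original design; the function name `run_with_timeout` is hypothetical, and the command passed to it stands in for whatever transfer command the backup script runs.

```python
import subprocess


def run_with_timeout(command, timeout_seconds):
    """Run command; return its exit code, or None if it was killed on timeout."""
    try:
        # subprocess.run kills the child process when the timeout expires
        # and then raises TimeoutExpired.
        completed = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode
```

With a five-minute cron interval, a timeout somewhat shorter than five minutes would let the next scheduled run acquire the lock normally.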
Conclusion
Cluster failure recovery: If the entire cluster becomes unusable, rebuild it by restoring the latest backup to any node using the --force-new-cluster flag, which resets the cluster ID and member list, then manually add remaining nodes.
Handling backup conflicts: Include timestamps in backup filenames and record the backup location for troubleshooting. Example backup file:
-rw-r--r--. 1 root root 315075 Aug 30 17:15 2016-August-30_17-15-01_k0608v.add.bjyt.qihoo.net.tar.gz
The solution has been running continuously on the Stark platform with healthy backup status, and ongoing monitoring will continue.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.