Common ETCD Issues and Recovery Procedures
This guide explains ETCD’s high‑availability architecture and provides detailed step‑by‑step recovery procedures for single‑node failures, majority‑node outages, and database‑space‑exceeded errors, including status checks, member removal and addition, snapshot restoration, compaction, defragmentation, and alarm clearing.
ETCD is a highly available distributed key/value store that uses the Raft consensus algorithm for leader election and consistent state replication.
The article outlines common operational issues and step‑by‑step recovery procedures for three scenarios: a single node failure, loss of quorum (more than half of the nodes down), and a “database space exceeded” error.
1. Recovering from a single node failure
Check cluster status and remove the faulty member, then re‑add the node and restart the service.
Check status: etcdctl endpoint status
Remove member: etcdctl member remove $ID
Add member: etcdctl member add $name --peer-urls=https://x.x.x.x:2380
Delete the old data directory on the faulty node, set ETCD_INITIAL_CLUSTER_STATE="existing", and start the service: systemctl start etcd
Verify with etcdctl endpoint status
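The steps above can be sketched as a small shell routine. This is a hedged illustration, not the article's exact procedure: the member name (node3), peer URL, and data directory path are placeholder assumptions; substitute your own values.

```shell
# Sketch of the single-node recovery sequence. Run on a healthy member
# unless a comment says otherwise. Placeholder values throughout.

recover_failed_member() {
  local name=$1 peer_url=$2

  # 1. Look up the failed member's ID in the member list output
  #    (fields: ID, status, name, peer URLs, client URLs, learner).
  local id
  id=$(etcdctl member list | awk -F', ' -v n="$name" '$3 == n {print $1}')

  # 2. Remove the stale member, then re-register it with its peer URL.
  etcdctl member remove "$id"
  etcdctl member add "$name" --peer-urls="$peer_url"
}

# On the failed node itself: wipe the stale data directory, mark the
# node as joining an existing cluster, and restart the service, e.g.:
#   rm -rf /var/lib/etcd/data.etcd
#   # set ETCD_INITIAL_CLUSTER_STATE="existing" in the etcd config
#   systemctl start etcd
#   etcdctl endpoint status   # verify the member rejoined
```

The member must be removed before the node is wiped and restarted; re-adding it first would let the old node rejoin with stale state.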
2. Recovering when more than half of the nodes are down
Restore the cluster from a snapshot using etcdctl snapshot restore, adjust permissions, and restart the service.
Restore command (run on every member, substituting that member’s --name and --initial-advertise-peer-urls): etcdctl --name=x.x.x.x-name-3 --endpoints="https://x.x.x.x:2379" --cert=/var/lib/etcd/cert/etcd-client.pem --key=/var/lib/etcd/cert/etcd-client-key.pem --cacert=/var/lib/etcd/cert/ca.pem --initial-cluster-token=xxxxxxxxxx --initial-advertise-peer-urls=https://x.x.x.x:2380 --initial-cluster=x.x.x.x-name-1=https://x.x.x.x:2380,x.x.x.x-name-2=https://x.x.x.x:2380,x.x.x.x-name-3=https://x.x.x.x:2380 --data-dir=/var/lib/etcd/data.etcd/ snapshot restore snapshot.db
Set ownership of the restored data directory: chown -R etcd:etcd /var/lib/etcd/data.etcd/
Start service: systemctl start etcd
Check members: etcdctl member list
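The restore step above can be wrapped per member. This is a hedged sketch: the member names (name-1..3), IPs, cluster token, and data directory are placeholders standing in for the redacted values in the command above.

```shell
# Sketch: every member restores from the SAME snapshot file with the
# SAME cluster layout, but its OWN name and peer URL. Placeholder values.

CLUSTER="name-1=https://10.0.0.1:2380,name-2=https://10.0.0.2:2380,name-3=https://10.0.0.3:2380"
TOKEN=etcd-cluster-token   # assumed token; use your cluster's own

restore_member() {
  local name=$1 peer_url=$2
  # snapshot restore is a purely local operation: it rebuilds the data
  # directory from the snapshot without contacting any endpoint.
  etcdctl snapshot restore snapshot.db \
    --name="$name" \
    --initial-cluster="$CLUSTER" \
    --initial-cluster-token="$TOKEN" \
    --initial-advertise-peer-urls="$peer_url" \
    --data-dir=/var/lib/etcd/data.etcd/
  # Afterwards, on that node:
  #   chown -R etcd:etcd /var/lib/etcd/data.etcd/
  #   systemctl start etcd
}
```

Once all members are restored and started, etcdctl member list should show the full cluster again.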
3. Resolving “database space exceeded” errors
Back up the data, obtain the current revision, compact the store, defragment, and clear the alarm.
Backup: etcdctl snapshot save snapshot.db
Get revision: etcdctl --write-out="json" --cacert /var/lib/etcd/cert/ca.pem --key /var/lib/etcd/cert/etcd-client-key.pem --cert /var/lib/etcd/cert/etcd-client.pem --endpoints='*.*.*.*:2379' endpoint status | grep -o '"revision":[0-9]*'
Compact: etcdctl --cacert /var/lib/etcd/cert/ca.pem --key /var/lib/etcd/cert/etcd-client-key.pem --cert /var/lib/etcd/cert/etcd-client.pem --endpoints='*.*.*.*:2379' compact $revision
Defragment: etcdctl --cacert /var/lib/etcd/cert/ca.pem --key /var/lib/etcd/cert/etcd-client-key.pem --cert /var/lib/etcd/cert/etcd-client.pem --endpoints='*.*.*.*:2379' defrag
Clear alarm: etcdctl --write-out="table" --cacert /var/lib/etcd/cert/ca.pem --key /var/lib/etcd/cert/etcd-client-key.pem --cert /var/lib/etcd/cert/etcd-client.pem --endpoints='*.*.*.*:2379' alarm disarm
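The last three steps belong together and their order matters: compact first (discard old key history), then defragment (return the freed pages to the filesystem), then disarm the NOSPACE alarm so writes are accepted again. A hedged sketch, with the TLS flags from the commands above omitted for brevity:

```shell
# Sketch of the space-reclamation sequence. Add the --cacert/--cert/--key
# flags from the commands above when running against a TLS cluster.

reclaim_space() {
  local revision=$1
  # Compact the key-value store history up to the given revision.
  etcdctl compact "$revision"
  # Defragmentation releases the space freed by compaction back to disk.
  etcdctl defrag
  # Clear the NOSPACE alarm so the cluster accepts writes again.
  etcdctl alarm disarm
}
```

Disarming before compacting and defragmenting would leave the database full and the alarm would simply trip again.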
The article recommends regular backups, periodic compaction, and monitoring integration with Prometheus to maintain cluster stability.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.