
Common ETCD Issues and Recovery Procedures

This guide explains ETCD’s high‑availability architecture and provides detailed step‑by‑step recovery procedures for single‑node failures, majority‑node outages, and database‑space‑exceeded errors, including status checks, member removal and addition, snapshot restoration, compaction, defragmentation, and alarm clearing.

360 Tech Engineering

ETCD is a highly available distributed key/value store that uses the Raft algorithm for leader election and state consistency.
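Raft's majority requirement is what determines how many failures a cluster can survive, and it explains why the recovery procedures below split into "one node down" and "majority down". A quick sketch of the arithmetic:

```shell
# Raft quorum: a cluster of n members stays available as long as
# floor(n/2) + 1 members can communicate (integer division in shell
# arithmetic truncates, so n/2+1 gives exactly that).
n=3
quorum=$(( n / 2 + 1 ))
tolerated=$(( n - quorum ))
echo "cluster=$n quorum=$quorum tolerated_failures=$tolerated"
# A 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2.
```

Losing a single node leaves quorum intact, so the cluster keeps serving while you replace the member; losing the majority does not, which is why that case requires a snapshot restore.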

This article outlines common operational issues and step‑by‑step recovery procedures for three scenarios: a single node failure, loss of a majority of nodes, and the “database space exceeded” error.

1. Recovering from a single node failure

Check cluster status and remove the faulty member, then re‑add the node and restart the service.

Check status: etcdctl endpoint status

Remove member: etcdctl member remove $ID

Add member: etcdctl member add $name --peer-urls=https://x.x.x.x:2380

Delete the old data directory, set ETCD_INITIAL_CLUSTER_STATE="existing", and start: systemctl start etcd

Verify with etcdctl endpoint status
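The steps above can be sketched as one sequence. The member ID comes from parsing `etcdctl member list` output, which prints comma‑separated fields; the sample line below is illustrative, and the node name, IP, and data directory are placeholders:

```shell
# Step 1: find the faulty member's ID. `etcdctl member list` prints
# comma-separated fields: ID, status, name, peer URLs, client URLs.
sample_line='8e9e05c52164694d, started, node-2, https://10.0.0.2:2380, https://10.0.0.2:2379'
faulty_id=$(echo "$sample_line" | cut -d',' -f1)
echo "$faulty_id"   # 8e9e05c52164694d

# Steps 2-4, run against the live cluster (placeholder name/IP/paths):
# etcdctl member remove "$faulty_id"
# etcdctl member add node-2 --peer-urls=https://10.0.0.2:2380
# rm -rf /var/lib/etcd/data.etcd                 # wipe the stale data dir on the node
# export ETCD_INITIAL_CLUSTER_STATE="existing"   # join the cluster, don't bootstrap a new one
# systemctl start etcd
# etcdctl endpoint status                        # verify the member rejoined
```

Setting the cluster state to "existing" matters: with the default "new", the restarted node would try to bootstrap a fresh cluster instead of joining the surviving members.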

2. Recovering when more than half of the nodes are down

Restore the cluster from a snapshot using etcdctl snapshot restore, adjust permissions, and restart the service.

Restore command:
etcdctl --name=x.x.x.x-name-3 \
  --endpoints="https://x.x.x.x:2379" \
  --cert=/var/lib/etcd/cert/etcd-client.pem \
  --key=/var/lib/etcd/cert/etcd-client-key.pem \
  --cacert=/var/lib/etcd/cert/ca.pem \
  --initial-cluster-token=xxxxxxxxxx \
  --initial-advertise-peer-urls=https://x.x.x.x:2380 \
  --initial-cluster=x.x.x.x-name-1=https://x.x.x.x:2380,x.x.x.x-name-2=https://x.x.x.x:2380,x.x.x.x-name-3=https://x.x.x.x:2380 \
  --data-dir=/var/lib/etcd/data.etcd/ \
  snapshot restore snapshot.db

Set ownership: chown -R etcd:etcd data.etcd/

Start service: systemctl start etcd

Check members: etcdctl member list
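Because --initial-cluster must be identical on every node while --name and the peer URL differ per node, it helps to build the cluster string once and reuse it. A sketch with placeholder member names and IPs:

```shell
# Placeholder member names and IPs (assumptions; substitute your own).
N1=name-1; IP1=10.0.0.1
N2=name-2; IP2=10.0.0.2
N3=name-3; IP3=10.0.0.3
CLUSTER="${N1}=https://${IP1}:2380,${N2}=https://${IP2}:2380,${N3}=https://${IP3}:2380"
echo "$CLUSTER"

# On each node, restore the same snapshot with that node's own --name
# and peer URL (shown for node 3; run the live commands on real hosts):
# etcdctl snapshot restore snapshot.db \
#   --name="$N3" \
#   --initial-advertise-peer-urls="https://${IP3}:2380" \
#   --initial-cluster="$CLUSTER" \
#   --initial-cluster-token=xxxxxxxxxx \
#   --data-dir=/var/lib/etcd/data.etcd/
# chown -R etcd:etcd /var/lib/etcd/data.etcd/
# systemctl start etcd
# etcdctl member list   # all three members should appear once all nodes are up
```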

3. Resolving “database space exceeded” errors

Backup the data, obtain the current revision, compact the store, defragment, and clear the alarm.

Backup: etcdctl snapshot save snapshot.db

Get revision: etcdctl --write-out="json" --cacert /var/lib/etcd/cert/ca.pem --key /var/lib/etcd/cert/etcd-client-key.pem --cert /var/lib/etcd/cert/etcd-client.pem --endpoints='*.*.*.*:2379' endpoint status | grep -o '"revision":[0-9]*'

Compact: etcdctl --cacert /var/lib/etcd/cert/ca.pem --key /var/lib/etcd/cert/etcd-client-key.pem --cert /var/lib/etcd/cert/etcd-client.pem --endpoints='*.*.*.*:2379' compact $revision

Defragment: etcdctl --cacert /var/lib/etcd/cert/ca.pem --key /var/lib/etcd/cert/etcd-client-key.pem --cert /var/lib/etcd/cert/etcd-client.pem --endpoints='*.*.*.*:2379' defrag

Clear alarm: etcdctl --write-out="table" --cacert /var/lib/etcd/cert/ca.pem --key /var/lib/etcd/cert/etcd-client-key.pem --cert /var/lib/etcd/cert/etcd-client.pem --endpoints='*.*.*.*:2379' alarm disarm
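The revision extraction in step two can be isolated: endpoint status with --write-out=json returns JSON whose header carries the current revision, and the grep/cut pipeline below pulls out the number. The sample JSON is illustrative, not live cluster output:

```shell
# Illustrative JSON in the shape `endpoint status --write-out=json` returns.
status_json='[{"Endpoint":"10.0.0.1:2379","Status":{"header":{"revision":848,"raft_term":5}}}]'
revision=$(echo "$status_json" | grep -o '"revision":[0-9]*' | head -1 | cut -d':' -f2)
echo "$revision"   # 848

# With $revision in hand, run against the live cluster:
# etcdctl compact "$revision"   # drop key revisions superseded before this point
# etcdctl defrag               # return the freed space to the filesystem
# etcdctl alarm disarm         # clear the NOSPACE alarm so writes resume
```

Note that compaction alone does not shrink the database file; defragmentation is what releases the space, and the alarm must be disarmed explicitly or the cluster stays read‑only.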

To keep the cluster stable, schedule regular backups, compact and defragment the store periodically, and integrate monitoring with Prometheus so that quota and quorum problems surface before they become outages.
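A scheduled backup can be as simple as a cron job that saves date‑stamped snapshots and prunes old ones. A sketch, where the backup directory, certificate paths, and seven‑day retention are assumptions:

```shell
# Build a date-stamped snapshot name, e.g. etcd-snapshot-20240101-030000.db
ts=$(date +%Y%m%d-%H%M%S)
backup="etcd-snapshot-${ts}.db"
echo "$backup"

# Live commands for the cron job (e.g. daily at 03:00):
# etcdctl snapshot save "/backup/etcd/$backup" \
#   --cacert /var/lib/etcd/cert/ca.pem \
#   --cert /var/lib/etcd/cert/etcd-client.pem \
#   --key /var/lib/etcd/cert/etcd-client-key.pem
# find /backup/etcd -name 'etcd-snapshot-*.db' -mtime +7 -delete   # keep 7 days
```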

Tags: distributed systems, operations, backup, ETCD, recovery
Written by 360 Tech Engineering

Official tech channel of 360, building the most professional technology aggregation platform for the brand.
