
Common ETCD Issues and Recovery Procedures

This guide explains ETCD’s high‑availability architecture and provides detailed step‑by‑step recovery procedures for single‑node failures, majority‑node outages, and database‑space‑exceeded errors, including status checks, member removal and addition, snapshot restoration, compaction, defragmentation, and alarm clearing.

360 Tech Engineering

ETCD is a highly available distributed key/value store that uses the Raft algorithm for leader election and state consistency.
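Raft's majority requirement is what determines how many failures a cluster can survive, and it explains why the recovery procedures below split into "one node down" and "majority down". A quick sketch of the arithmetic:

```shell
# Raft quorum: a cluster of n members stays available as long as
# floor(n/2) + 1 members can communicate (integer division in shell
# arithmetic truncates, so n/2+1 gives exactly that).
n=3
quorum=$(( n / 2 + 1 ))
tolerated=$(( n - quorum ))
echo "cluster=$n quorum=$quorum tolerated_failures=$tolerated"
# A 3-node cluster tolerates 1 failure; a 5-node cluster tolerates 2.
```

Losing a single node leaves quorum intact, so the cluster keeps serving while you replace the member; losing the majority does not, which is why that case requires a snapshot restore.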

This article outlines common operational issues and step‑by‑step recovery procedures for three scenarios: a single node failure, loss of a majority of nodes, and the “database space exceeded” error.

1. Recovering from a single node failure

Check cluster status and remove the faulty member, then re‑add the node and restart the service.

Check status: etcdctl endpoint status

Remove member: etcdctl member remove $ID

Add member: etcdctl member add $name --peer-urls=https://x.x.x.x:2380

Delete the old data directory, set ETCD_INITIAL_CLUSTER_STATE="existing", and start: systemctl start etcd

Verify with etcdctl endpoint status
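The steps above can be sketched as one sequence. The member ID comes from parsing `etcdctl member list` output, which prints comma‑separated fields; the sample line below is illustrative, and the node name, IP, and data directory are placeholders:

```shell
# Step 1: find the faulty member's ID. `etcdctl member list` prints
# comma-separated fields: ID, status, name, peer URLs, client URLs.
sample_line='8e9e05c52164694d, started, node-2, https://10.0.0.2:2380, https://10.0.0.2:2379'
faulty_id=$(echo "$sample_line" | cut -d',' -f1)
echo "$faulty_id"   # 8e9e05c52164694d

# Steps 2-4, run against the live cluster (placeholder name/IP/paths):
# etcdctl member remove "$faulty_id"
# etcdctl member add node-2 --peer-urls=https://10.0.0.2:2380
# rm -rf /var/lib/etcd/data.etcd                 # wipe the stale data dir on the node
# export ETCD_INITIAL_CLUSTER_STATE="existing"   # join the cluster, don't bootstrap a new one
# systemctl start etcd
# etcdctl endpoint status                        # verify the member rejoined
```

Setting the cluster state to "existing" matters: with the default "new", the restarted node would try to bootstrap a fresh cluster instead of joining the surviving members.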

2. Recovering when more than half of the nodes are down

Restore the cluster from a snapshot using etcdctl snapshot restore, adjust permissions, and restart the service.

Restore command:
etcdctl --name=x.x.x.x-name-3 \
  --endpoints="https://x.x.x.x:2379" \
  --cert=/var/lib/etcd/cert/etcd-client.pem \
  --key=/var/lib/etcd/cert/etcd-client-key.pem \
  --cacert=/var/lib/etcd/cert/ca.pem \
  --initial-cluster-token=xxxxxxxxxx \
  --initial-advertise-peer-urls=https://x.x.x.x:2380 \
  --initial-cluster=x.x.x.x-name-1=https://x.x.x.x:2380,x.x.x.x-name-2=https://x.x.x.x:2380,x.x.x.x-name-3=https://x.x.x.x:2380 \
  --data-dir=/var/lib/etcd/data.etcd/ \
  snapshot restore snapshot.db

Set ownership: chown -R etcd:etcd data.etcd/

Start service: systemctl start etcd

Check members: etcdctl member list
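Because --initial-cluster must be identical on every node while --name and the peer URL differ per node, it helps to build the cluster string once and reuse it. A sketch with placeholder member names and IPs:

```shell
# Placeholder member names and IPs (assumptions; substitute your own).
N1=name-1; IP1=10.0.0.1
N2=name-2; IP2=10.0.0.2
N3=name-3; IP3=10.0.0.3
CLUSTER="${N1}=https://${IP1}:2380,${N2}=https://${IP2}:2380,${N3}=https://${IP3}:2380"
echo "$CLUSTER"

# On each node, restore the same snapshot with that node's own --name
# and peer URL (shown for node 3; run the live commands on real hosts):
# etcdctl snapshot restore snapshot.db \
#   --name="$N3" \
#   --initial-advertise-peer-urls="https://${IP3}:2380" \
#   --initial-cluster="$CLUSTER" \
#   --initial-cluster-token=xxxxxxxxxx \
#   --data-dir=/var/lib/etcd/data.etcd/
# chown -R etcd:etcd /var/lib/etcd/data.etcd/
# systemctl start etcd
# etcdctl member list   # all three members should appear once all nodes are up
```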

3. Resolving “database space exceeded” errors

Backup the data, obtain the current revision, compact the store, defragment, and clear the alarm.

Backup: etcdctl snapshot save snapshot.db

Get revision: etcdctl --write-out="json" --cacert /var/lib/etcd/cert/ca.pem --key /var/lib/etcd/cert/etcd-client-key.pem --cert /var/lib/etcd/cert/etcd-client.pem --endpoints='*.*.*.*:2379' endpoint status | grep -o '"revision":[0-9]*'

Compact: etcdctl --cacert /var/lib/etcd/cert/ca.pem --key /var/lib/etcd/cert/etcd-client-key.pem --cert /var/lib/etcd/cert/etcd-client.pem --endpoints='*.*.*.*:2379' compact $revision

Defragment: etcdctl --cacert /var/lib/etcd/cert/ca.pem --key /var/lib/etcd/cert/etcd-client-key.pem --cert /var/lib/etcd/cert/etcd-client.pem --endpoints='*.*.*.*:2379' defrag

Clear alarm: etcdctl --write-out="table" --cacert /var/lib/etcd/cert/ca.pem --key /var/lib/etcd/cert/etcd-client-key.pem --cert /var/lib/etcd/cert/etcd-client.pem --endpoints='*.*.*.*:2379' alarm disarm
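The revision extraction in step two can be isolated: endpoint status with --write-out=json returns JSON whose header carries the current revision, and the grep/cut pipeline below pulls out the number. The sample JSON is illustrative, not live cluster output:

```shell
# Illustrative JSON in the shape `endpoint status --write-out=json` returns.
status_json='[{"Endpoint":"10.0.0.1:2379","Status":{"header":{"revision":848,"raft_term":5}}}]'
revision=$(echo "$status_json" | grep -o '"revision":[0-9]*' | head -1 | cut -d':' -f2)
echo "$revision"   # 848

# With $revision in hand, run against the live cluster:
# etcdctl compact "$revision"   # drop key revisions superseded before this point
# etcdctl defrag               # return the freed space to the filesystem
# etcdctl alarm disarm         # clear the NOSPACE alarm so writes resume
```

Note that compaction alone does not shrink the database file; defragmentation is what releases the space, and the alarm must be disarmed explicitly or the cluster stays read‑only.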

To keep the cluster stable, schedule regular backups, compact and defragment the store periodically, and integrate monitoring with Prometheus so that quota and quorum problems surface before they become outages.
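A scheduled backup can be as simple as a cron job that saves date‑stamped snapshots and prunes old ones. A sketch, where the backup directory, certificate paths, and seven‑day retention are assumptions:

```shell
# Build a date-stamped snapshot name, e.g. etcd-snapshot-20240101-030000.db
ts=$(date +%Y%m%d-%H%M%S)
backup="etcd-snapshot-${ts}.db"
echo "$backup"

# Live commands for the cron job (e.g. daily at 03:00):
# etcdctl snapshot save "/backup/etcd/$backup" \
#   --cacert /var/lib/etcd/cert/ca.pem \
#   --cert /var/lib/etcd/cert/etcd-client.pem \
#   --key /var/lib/etcd/cert/etcd-client-key.pem
# find /backup/etcd -name 'etcd-snapshot-*.db' -mtime +7 -delete   # keep 7 days
```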

Tags: distributed systems, operations, backup, ETCD, recovery
Written by 360 Tech Engineering

Official tech channel of 360, building the most professional technology aggregation platform for the brand.
