etcd Operations Handbook: Backup, Restore, Scaling, and Performance Tuning for Kubernetes
This guide explains why mastering etcd is essential for Kubernetes stability and walks through its core concepts, Raft consensus, MVCC storage, deployment, backup and restore procedures, scaling from three to five nodes, performance optimization, monitoring, alerting, troubleshooting, upgrade strategies, security hardening, and real‑world best‑practice recommendations.
Why etcd matters
etcd stores the entire state of a Kubernetes cluster. If etcd fails, the control plane cannot list pods, create new resources, or update existing ones: kubectl get pods returns errors instead of output, the API server reports failures, and the outage can become cluster‑wide.
Core concepts
etcd is a distributed, highly available KV store that provides strong consistency via the Raft consensus algorithm. Its main features include:
Strong consistency – all nodes agree on the same data.
Watch mechanism – clients can watch keys or key ranges for changes.
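Transactional support – multi‑key transactions provide atomic compare‑and‑swap operations, built on MVCC (multi‑version concurrency control) storage.
Leases – keys can be attached to a lease with a TTL and are removed when the lease expires.
Small data size – etcd is designed for small values; the default request size limit is 1.5 MiB.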
Raft consensus
Raft solves consistency by breaking the problem into three sub‑problems: leader election, log replication, and safety. Each term has a single leader; followers replicate log entries; committed entries are never rolled back.
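The majority rule behind leader election and log commitment comes down to simple arithmetic. The sketch below is illustrative only; quorum and fault_tolerance are hypothetical helper names, not part of etcd.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of etcd): majority arithmetic
# for an n-member Raft cluster.

# Smallest number of members that forms a majority.
quorum() {
  local n="$1"
  echo $(( n / 2 + 1 ))
}

# Number of members that can fail while a majority remains.
fault_tolerance() {
  local n="$1"
  echo $(( n - (n / 2 + 1) ))
}

for n in 1 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerates=$(fault_tolerance "$n")"
done
```

A four‑member cluster tolerates no more failures than a three‑member one, which is why odd cluster sizes are the usual choice.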
# Leader election example
# 1. All nodes start as Followers
# 2. If a follower does not receive a heartbeat within ElectionTimeout (default 1 s), it becomes a Candidate
# 3. The candidate votes for itself and requests votes from other nodes
# 4. If it receives a majority, it becomes Leader
Deployment from scratch
The article provides a step‑by‑step binary deployment for a three‑node etcd cluster, including hardware checks, OS tuning, network parameters, and systemd unit configuration.
# Example systemd unit (/etc/systemd/system/etcd.service)
[Unit]
Description=etcd
After=network-online.target local-fs.target
Wants=network-online.target
[Service]
Type=notify
User=etcd
ExecStart=/usr/local/bin/etcd --config-file /etc/etcd/etcd.conf.yml
Restart=on-failure
RestartSec=10s
LimitNOFILE=65536
OOMScoreAdjust=-999
[Install]
WantedBy=multi-user.target
Key configuration items include listen-client-urls, advertise-client-urls, listen-peer-urls, initial-cluster, TLS certificates, quota-backend-bytes (8 GiB), and max-request-bytes (10 MiB).
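As a rough illustration of how those items fit together, the following sketch prints a config file for one member. The render_etcd_config helper, the member names, the IP addresses, and the certificate paths are all assumptions made for the example; only the quota and request‑size values come from the text.

```shell
#!/usr/bin/env bash
# Sketch: print an etcd.conf.yml for one member. Names, IPs, and certificate
# paths are placeholders; quota and request-size values follow the text.

render_etcd_config() {
  local name="$1" ip="$2" initial_cluster="$3"
  cat <<EOF
name: ${name}
data-dir: /var/lib/etcd
listen-client-urls: https://${ip}:2379,https://127.0.0.1:2379
advertise-client-urls: https://${ip}:2379
listen-peer-urls: https://${ip}:2380
initial-advertise-peer-urls: https://${ip}:2380
initial-cluster: ${initial_cluster}
initial-cluster-state: new
quota-backend-bytes: $(( 8 * 1024 * 1024 * 1024 ))
max-request-bytes: $(( 10 * 1024 * 1024 ))
client-transport-security:
  cert-file: /etc/etcd/ssl/etcd.pem
  key-file: /etc/etcd/ssl/etcd-key.pem
  trusted-ca-file: /etc/etcd/ssl/ca.pem
  client-cert-auth: true
peer-transport-security:
  cert-file: /etc/etcd/ssl/etcd.pem
  key-file: /etc/etcd/ssl/etcd-key.pem
  trusted-ca-file: /etc/etcd/ssl/ca.pem
  client-cert-auth: true
EOF
}

render_etcd_config etcd-0 10.0.0.11 \
  "etcd-0=https://10.0.0.11:2380,etcd-1=https://10.0.0.12:2380,etcd-2=https://10.0.0.13:2380"
```

Redirecting the output to /etc/etcd/etcd.conf.yml on each node, with that node's own name and IP, would match the systemd unit shown earlier.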
Backup and restore
Three backup strategies are compared: simple etcdctl snapshot, scheduled snapshots with remote storage, and continuous S3 backup. The recommended production approach is scheduled snapshots because they are automated and recoverable.
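A scheduled snapshot job could be sketched along the following lines. BACKUP_DIR, KEEP, and the function names are assumptions for illustration, and the upload to remote storage is deliberately left out because it depends on the storage backend in use.

```shell
#!/usr/bin/env bash
# Sketch of a scheduled snapshot job. BACKUP_DIR, KEEP, and the certificate
# paths are assumptions; adjust them to the actual deployment.
BACKUP_DIR="${BACKUP_DIR:-/backup}"
KEEP="${KEEP:-7}"

# Take one snapshot and print its status so a broken file is noticed early.
take_snapshot() {
  local out
  out="$BACKUP_DIR/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"
  etcdctl snapshot save "$out" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key &&
    etcdctl snapshot status "$out" --write-out=table
}

# Keep only the newest $2 snapshots in directory $1. The timestamp in the
# file name sorts lexicographically, so a reverse sort lists the newest first.
prune_snapshots() {
  local dir="$1" keep="$2"
  printf '%s\n' "$dir"/etcd-snapshot-*.db | sort -r | tail -n +"$(( keep + 1 ))" |
    while IFS= read -r f; do
      rm -f -- "$f"
    done
}

# Typical use from cron or a systemd timer (not run here):
#   take_snapshot && prune_snapshots "$BACKUP_DIR" "$KEEP"
```

Pruning only after a successful snapshot means a failing backup job never deletes the last good copies.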
# Manual snapshot
etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
Snapshot verification uses etcdctl snapshot status, which shows the hash, revision, total keys, and size.
# Verify snapshot
etcdctl snapshot status /backup/etcd-snapshot-20260610-100000.db \
--write-out=table
+----------+----------+------------+------------+
| HASH | REVISION | TOTAL KEY | TOTAL SIZE |
+----------+----------+------------+------------+
| c3f2a8b9 | 12345678 | 8421 | 268435456 |
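+----------+----------+------------+------------+
Restoration can be performed for a single node or the whole cluster. For a whole‑cluster restore, every member is restored from the same snapshot file into an empty data directory, each with its own --name and --initial-advertise-peer-urls. The example restores a single node: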
etcdctl snapshot restore /backup/etcd-snapshot-20260610-030000.db \
--name=etcd-0 \
--initial-cluster=etcd-0=https://10.0.0.11:2380,etcd-1=https://10.0.0.12:2380,etcd-2=https://10.0.0.13:2380 \
--initial-advertise-peer-urls=https://10.0.0.11:2380 \
--data-dir=/var/lib/etcd
Scaling (3 → 5 nodes)
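Scaling adds one member at a time with etcdctl member add, starts the new node with initial-cluster-state set to existing, and waits for the cluster to report healthy before adding the next member, so that quorum is never put at risk. The article also updates initial‑cluster on the existing nodes and restarts them one by one; etcd reads that setting only when a member first bootstraps, so this step mainly keeps the configuration files consistent.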
# Add new member on an existing node
etcdctl member add etcd-3 \
--peer-urls=https://10.0.0.14:2380 \
--endpoints=https://10.0.0.11:2379 \
--cacert=/etc/etcd/ssl/ca.pem \
--cert=/etc/etcd/ssl/etcd.pem \
--key=/etc/etcd/ssl/etcd-key.pem
After updating initial‑cluster on all nodes, each node is restarted in turn, waiting for the cluster to become healthy before proceeding to the next node.
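The "wait until healthy" step is easy to get wrong by hand, so a small retry wrapper may help. The wait_until helper below is hypothetical, and the attempt count and delay in the commented example are arbitrary choices.

```shell
#!/usr/bin/env bash
# Sketch: retry a health probe until it succeeds or the attempts run out.
# wait_until is a hypothetical helper; the probe is whatever command follows
# the two numeric arguments.

wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$(( i + 1 ))
  done
  return 1
}

# Example probe between node restarts (not run here):
#   wait_until 30 2 etcdctl endpoint health --cluster \
#     --endpoints=https://10.0.0.11:2379 \
#     --cacert=/etc/etcd/ssl/ca.pem \
#     --cert=/etc/etcd/ssl/etcd.pem \
#     --key=/etc/etcd/ssl/etcd-key.pem \
#     || { echo "cluster not healthy, stopping rollout" >&2; exit 1; }
```

Stopping the rollout on the first failed probe keeps a single unhealthy member from turning into a loss of quorum.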
Performance tuning
Key performance bottlenecks are disk I/O, network latency, and CPU contention. Recommended system settings include disabling swap, turning off transparent hugepages, using the none I/O scheduler on NVMe (called noop on older, non‑multiqueue kernels), and tuning kernel network parameters (e.g., net.core.rmem_max, net.core.wmem_max, BBR congestion control).
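Before changing anything, it can be useful to read back what a host is currently using. The sketch below relies on the sysfs convention of marking the active choice with square brackets; active_choice and report_host_settings are hypothetical helpers, and the device name nvme0n1 is an assumption.

```shell
#!/usr/bin/env bash
# Sketch: sysfs files such as /sys/kernel/mm/transparent_hugepage/enabled and
# /sys/block/<dev>/queue/scheduler list all choices and mark the active one
# with square brackets, e.g. "always madvise [never]".

# Print the bracketed (active) choice from such a line.
active_choice() {
  printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Report current host settings; the device name nvme0n1 is an assumption.
report_host_settings() {
  local thp=/sys/kernel/mm/transparent_hugepage/enabled
  local sched=/sys/block/nvme0n1/queue/scheduler
  if [ -r "$thp" ]; then
    echo "transparent hugepages: $(active_choice "$(cat "$thp")")"
  fi
  if [ -r "$sched" ]; then
    echo "I/O scheduler: $(active_choice "$(cat "$sched")")"
  fi
  if [ -r /proc/swaps ]; then
    # The first line of /proc/swaps is a header.
    echo "active swap devices: $(( $(wc -l < /proc/swaps) - 1 ))"
  fi
}
```

Calling report_host_settings on each etcd node before and after tuning gives a quick record of what actually changed.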
# Example sysctl tuning for network
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216
net.core.somaxconn=32768
net.ipv4.tcp_max_syn_backlog=8192
net.ipv4.tcp_congestion_control=bbr
etcd‑specific parameters such as election-timeout (default 1000 ms) and heartbeat-interval (default 100 ms) can be increased to improve stability under high load.
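The two values should be changed together. The sketch below checks a candidate pair against the commonly cited constraints that election-timeout be at least five times heartbeat-interval and no more than 50000 ms. Both thresholds are assumptions here and should be confirmed against the documentation for the deployed etcd version; check_raft_timing is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a heartbeat-interval / election-timeout pair (in ms).
# The 5x ratio and the 50000 ms ceiling are assumptions to verify against
# the etcd documentation for the version in use.

check_raft_timing() {
  local heartbeat_ms="$1" election_ms="$2"
  if [ "$election_ms" -lt $(( heartbeat_ms * 5 )) ]; then
    echo "election-timeout ${election_ms}ms is below 5x heartbeat-interval ${heartbeat_ms}ms" >&2
    return 1
  fi
  if [ "$election_ms" -gt 50000 ]; then
    echo "election-timeout ${election_ms}ms exceeds 50000ms" >&2
    return 1
  fi
  echo "ok: heartbeat-interval=${heartbeat_ms}ms election-timeout=${election_ms}ms"
}

# The defaults quoted in the text.
check_raft_timing 100 1000
```

All members should run with the same pair of values, since mismatched timing between members can itself cause unnecessary elections.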
Compaction and defragmentation
Because MVCC stores every revision, regular compaction removes old versions, and defrag reclaims physical space. The article provides a combined script that compacts to the latest revision and then runs etcdctl defrag, logging the operation to syslog.
#!/usr/bin/env bash
set -euo pipefail
ENDPOINT="https://127.0.0.1:2379"
CACERT="/etc/kubernetes/pki/etcd/ca.crt"
CERT="/etc/kubernetes/pki/etcd/server.crt"
KEY="/etc/kubernetes/pki/etcd/server.key"
# Get the latest revision (the MVCC revision, not the Raft applied index)
REV=$(etcdctl endpoint status --endpoints="$ENDPOINT" \
--cacert="$CACERT" --cert="$CERT" --key="$KEY" -w json \
| jq -r '.[0].Status.header.revision')
# Compact away all revisions older than $REV
etcdctl compact "$REV" \
--endpoints="$ENDPOINT" --cacert="$CACERT" --cert="$CERT" --key="$KEY"
# Defragment this member to reclaim disk space (the member blocks while it runs)
etcdctl defrag --endpoints="$ENDPOINT" --cacert="$CACERT" --cert="$CERT" --key="$KEY"
logger -t etcd-compact "compact and defrag completed at rev $REV"
Monitoring and alerting
etcd exposes extensive Prometheus metrics. Critical health metrics include etcd_server_has_leader, etcd_server_leader_changes_seen_total, and write‑latency metrics such as etcd_disk_wal_fsync_duration_seconds. A sample alerting rule set is provided for leader loss, frequent leader changes, high proposal failure rate, slow WAL fsync, high network RTT, DB size approaching quota, and backup failures.
groups:
  - name: etcd
    rules:
      - alert: EtcdNoLeader
        expr: sum(etcd_server_has_leader) == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "etcd cluster has no leader"
      - alert: EtcdHighNumberOfLeaderChanges
        expr: increase(etcd_server_leader_changes_seen_total[1h]) > 5
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "etcd leader changes frequently"
      - alert: EtcdHighFsyncDuration
        expr: histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le,instance)) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "etcd WAL fsync P99 > 10ms"
Security hardening
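Enable mutual TLS for both client‑server and peer communication, configure RBAC (root user, read‑only role), enforce network isolation with firewall rules, and optionally encrypt backups with GPG. The article also mentions the --logger=zap flag for audit logging; that flag selects etcd's structured zap logger, so any audit trail has to be built from those logs rather than from a dedicated audit feature.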
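The RBAC part might be sketched as the command sequence below, printed in dry‑run mode by default. The role name readonly, the user name viewer, and the /registry/ prefix are assumptions for illustration, and the TLS flags are omitted for brevity. With client certificate authentication enabled, etcd maps the certificate's common name to a user, so enabling auth on an etcd that backs a live Kubernetes cluster should be rehearsed in staging first.

```shell
#!/usr/bin/env bash
# Sketch: print (or run) the etcdctl commands for a root user and a
# read-only role. The role "readonly", user "viewer", and prefix
# "/registry/" are assumptions; TLS flags are omitted for brevity.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

setup_rbac() {
  run etcdctl user add root            # prompts for a password when executed
  run etcdctl role add readonly
  run etcdctl role grant-permission readonly read /registry/ --prefix=true
  run etcdctl user add viewer
  run etcdctl user grant-role viewer readonly
  run etcdctl auth enable              # requires the root user to exist
}

setup_rbac
```

Running with DRY_RUN=0 executes the commands instead of printing them.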
Common troubleshooting cases
Seven real‑world cases are described, each with symptoms, diagnostic steps, root cause, and remediation. Examples include frequent elections caused by disk I/O contention, quota exhaustion due to large ConfigMaps, split‑brain from network partitions, startup failures from permission or certificate issues, and upgrade problems when moving from etcd 3.4 to 3.5 without a snapshot‑restore.
Upgrade and maintenance
Upgrade strategy follows a rolling approach: backup, test in a staging environment, upgrade one node at a time, wait for leader election stability (≈30 s), verify health metrics, run compact and defrag, and monitor for at least one hour. Version paths that advance one minor version at a time (3.3 → 3.4, 3.4 → 3.5) are compatible; direct upgrades that skip a minor version or cross a major version require a snapshot‑restore.
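The version‑path rule can be checked mechanically before a rollout. The sketch assumes the rule is "same major version, at most one minor version forward, no downgrade"; is_single_minor_step is a hypothetical helper, not an etcd tool.

```shell
#!/usr/bin/env bash
# Sketch: decide whether an upgrade path stays within one minor version step.
# Assumed rule: same major version, minor version increases by 0 or 1.

is_single_minor_step() {
  local from="${1#v}" to="${2#v}"
  local from_major="${from%%.*}" to_major="${to%%.*}"
  local from_rest="${from#*.}" to_rest="${to#*.}"
  local from_minor="${from_rest%%.*}" to_minor="${to_rest%%.*}"
  local diff=$(( to_minor - from_minor ))
  [ "$from_major" = "$to_major" ] && [ "$diff" -ge 0 ] && [ "$diff" -le 1 ]
}

for pair in "3.4.30 3.5.13" "3.3.27 3.5.13"; do
  # Intentional word splitting: each pair holds "from to".
  set -- $pair
  if is_single_minor_step "$1" "$2"; then
    echo "$1 -> $2: rolling upgrade path"
  else
    echo "$1 -> $2: not a single minor step"
  fi
done
```

A check like this fits naturally at the top of an upgrade script, before any node is touched.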
FAQ
Recommended cluster size: 3 nodes (tolerates 1 failure) or 5 nodes (tolerates 2 failures).
Cross‑AZ deployment requires RTT < 50 ms and a majority in the same AZ.
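Database size should stay below the 8 GiB quota used in this guide (etcd's default quota is 2 GiB); growth beyond 4 GiB is a signal to check compaction and defragmentation.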
etcd does not support horizontal sharding natively; use API server sharding or external tools for large scales.
Key count should stay under 1 M for good performance.
K8s Secrets are only base64‑encoded; use external secret managers or encryption for true secrecy.
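The database‑size guidance can be turned into a quick percentage check. The db_usage_pct helper below is hypothetical and uses integer arithmetic only.

```shell
#!/usr/bin/env bash
# Sketch: express the database size as a percentage of the backend quota.
# db_usage_pct is a hypothetical helper; both arguments are in bytes.

db_usage_pct() {
  local db_bytes="$1" quota_bytes="$2"
  echo $(( db_bytes * 100 / quota_bytes ))
}

GIB=$(( 1024 * 1024 * 1024 ))
echo "4 GiB of an 8 GiB quota: $(db_usage_pct $(( 4 * GIB )) $(( 8 * GIB )))%"
```

On a live member the first argument could come from the dbSize field of etcdctl endpoint status -w json, for example via jq '.[0].Status.dbSize'.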
Real‑world case study
A financial services company migrated from a single‑SSD three‑node setup to dedicated NVMe nodes, upgraded to etcd 3.5.13, introduced RBAC, automated backup to S3, and added comprehensive Prometheus alerts. After the migration, leader changes dropped to zero, write‑latency P99 fell from 50 ms to 5 ms, and monthly disaster‑recovery drills became routine.
Maintenance checklist
Daily: health check, backup verification, silence non‑critical alerts.
Weekly: restore test, metric trend review, certificate expiry audit.
Monthly: version upgrade review, capacity planning, alert rule refinement.
Quarterly: full disaster‑recovery drill, performance stress test, documentation update.
Appendix
Quick reference of common etcdctl commands.
Table of important configuration parameters.
Complete Prometheus alert rule set and recommended Grafana dashboards (etcd‑dashboard, kube‑etcd).
Further reading: official etcd docs, Kubernetes etcd ops guide, and books such as "Designing Data‑Intensive Applications".
Ops Community
A leading IT operations community where professionals share and grow together.
