etcd Operations Handbook: Backup, Restore, Scaling, and Performance Tuning for Kubernetes

This guide explains why mastering etcd is essential for Kubernetes stability and walks through its core concepts, Raft consensus, MVCC storage, deployment, backup and restore procedures, scaling from three to five nodes, performance optimization, monitoring, alerting, troubleshooting, upgrade strategies, security hardening, and real‑world best‑practice recommendations.

Ops Community

Why etcd matters

etcd stores the entire state of a Kubernetes cluster. If etcd fails, the control plane cannot list pods, create new resources, or update existing ones, leading to missing kubectl get pod output, API server errors, and potential cluster‑wide outages.

Core concepts

etcd is a distributed, highly available KV store that provides strong consistency via the Raft consensus algorithm. Its main features include:

Strong consistency – all nodes agree on the same data.

Watch mechanism – clients can watch keys or key ranges for changes.

Transactional support – mini‑transactions provide atomic compare‑and‑swap operations, and MVCC (multi‑version concurrency control) keeps a revision history for every key.

Leases – keys can have TTLs.

Small data size – etcd is designed for small key‑value metadata; the default request size limit is 1.5 MiB.
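Leases are the easiest of these features to try from the command line. The sketch below is illustrative only: the helper name put_with_ttl is invented here, and it assumes etcdctl picks up its connection settings from the ETCDCTL_ENDPOINTS, ETCDCTL_CACERT, ETCDCTL_CERT, and ETCDCTL_KEY environment variables.

```shell
# put_with_ttl KEY VALUE TTL_SECONDS
# Grants a lease and attaches KEY to it, so the key is removed when the
# lease expires. The function name is hypothetical, not an etcdctl command.
put_with_ttl() {
  local key="$1" value="$2" ttl="$3" lease_id
  # "etcdctl lease grant" prints: lease <id> granted with TTL(<n>s)
  lease_id=$(etcdctl lease grant "$ttl" | awk '{print $2}') || return 1
  [ -n "$lease_id" ] || return 1
  etcdctl put "$key" "$value" --lease="$lease_id"
}
```

For example, put_with_ttl /demo/heartbeat alive 60 writes a key that disappears after 60 seconds unless the lease is renewed. Running etcdctl watch --prefix /demo/ in a second terminal shows the put and, once the lease expires, the delete.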

Raft consensus

Raft solves consistency by breaking the problem into three sub‑problems: leader election, log replication, and safety. Each term has a single leader; followers replicate log entries; committed entries are never rolled back.

# Leader election example
# 1. All nodes start as Followers
# 2. If a follower does not receive a heartbeat within ElectionTimeout (default 1 s), it becomes a Candidate
# 3. The candidate votes for itself and requests votes from other nodes
# 4. If it receives a majority, it becomes Leader
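The majority required in step 4 is also what determines how many members a cluster can lose. As a small sketch (the function names quorum and fault_tolerance are illustrative, not part of etcd), the arithmetic looks like this:

```shell
# quorum N: votes a candidate needs to win an election in an N-member cluster.
quorum() { echo $(( $1 / 2 + 1 )); }

# fault_tolerance N: members that can fail while a majority is still reachable.
fault_tolerance() { echo $(( ($1 - 1) / 2 )); }

for n in 1 3 4 5; do
  printf 'members=%d quorum=%d tolerates=%d\n' \
    "$n" "$(quorum "$n")" "$(fault_tolerance "$n")"
done
```

A four-member cluster needs three votes and so tolerates only one failure, the same as three members, which is why odd cluster sizes are preferred.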

Deployment from scratch

The article provides a step‑by‑step binary deployment for a three‑node etcd cluster, including hardware checks, OS tuning, network parameters, and systemd unit configuration.

# Example systemd unit (/etc/systemd/system/etcd.service)
[Unit]
Description=etcd
After=network-online.target local-fs.target
Wants=network-online.target

[Service]
Type=notify
User=etcd
ExecStart=/usr/local/bin/etcd --config-file /etc/etcd/etcd.conf.yml
Restart=on-failure
RestartSec=10s
LimitNOFILE=65536
OOMScoreAdjust=-999

[Install]
WantedBy=multi-user.target

Key configuration items include listen-client-urls, advertise-client-urls, listen-peer-urls, initial-cluster, TLS certificates, quota-backend-bytes (set to 8 GiB; the default is 2 GiB), and max-request-bytes (set to 10 MiB; the default is 1.5 MiB).
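As a rough sketch of how these items fit together in /etc/etcd/etcd.conf.yml, the helper below prints a config for one member. The function name render_etcd_conf, the certificate paths under /etc/etcd/ssl, the data directory, and the cluster token are assumptions made for this example, not values taken from the original article.

```shell
# render_etcd_conf NAME IP INITIAL_CLUSTER
# Prints a YAML config for one member. Certificate paths, data-dir, and the
# cluster token are placeholders chosen for this sketch.
render_etcd_conf() {
  local name="$1" ip="$2" cluster="$3"
  cat <<EOF
name: ${name}
data-dir: /var/lib/etcd
listen-client-urls: https://${ip}:2379,https://127.0.0.1:2379
advertise-client-urls: https://${ip}:2379
listen-peer-urls: https://${ip}:2380
initial-advertise-peer-urls: https://${ip}:2380
initial-cluster: ${cluster}
initial-cluster-state: new
initial-cluster-token: etcd-cluster-example
quota-backend-bytes: $(( 8 * 1024 * 1024 * 1024 ))
max-request-bytes: $(( 10 * 1024 * 1024 ))
client-transport-security:
  cert-file: /etc/etcd/ssl/etcd.pem
  key-file: /etc/etcd/ssl/etcd-key.pem
  trusted-ca-file: /etc/etcd/ssl/ca.pem
  client-cert-auth: true
peer-transport-security:
  cert-file: /etc/etcd/ssl/etcd.pem
  key-file: /etc/etcd/ssl/etcd-key.pem
  trusted-ca-file: /etc/etcd/ssl/ca.pem
  client-cert-auth: true
EOF
}
```

Called as render_etcd_conf etcd-0 10.0.0.11 "etcd-0=https://10.0.0.11:2380,etcd-1=https://10.0.0.12:2380,etcd-2=https://10.0.0.13:2380", it produces a file that differs between members only in the name and IP address.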

Backup and restore

Three backup strategies are compared: simple etcdctl snapshot, scheduled snapshots with remote storage, and continuous S3 backup. The recommended production approach is scheduled snapshots because they are automated and recoverable.

# Manual snapshot
etcdctl snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

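Snapshot verification uses etcdctl snapshot status, which shows the hash, revision, total keys, and size. On etcd 3.5 and later this subcommand is deprecated in etcdctl in favor of etcdutl snapshot status.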

# Verify snapshot
etcdctl snapshot status /backup/etcd-snapshot-20260610-100000.db \
  --write-out=table
+----------+----------+------------+------------+
|   HASH   | REVISION | TOTAL KEYS | TOTAL SIZE |
+----------+----------+------------+------------+
| c3f2a8b9 | 12345678 |       8421 |  268435456 |
+----------+----------+------------+------------+
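The scheduled-snapshot strategy can be sketched as two small functions run from cron or a systemd timer. The names take_snapshot and prune_snapshots are made up for this example, the TLS paths are copied from the manual command, and the upload to remote storage is left out because the original does not name a tool for it.

```shell
# take_snapshot DIR: save a timestamped snapshot into DIR and print its path.
# TLS paths match the manual example; adjust them for your cluster.
take_snapshot() {
  local dir="$1" file
  file="$dir/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"
  etcdctl snapshot save "$file" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key >&2 || return 1
  echo "$file"
}

# prune_snapshots DIR KEEP: delete all but the KEEP newest snapshots.
# The timestamped names sort chronologically, so a reverse name sort
# lists the newest file first.
prune_snapshots() {
  local dir="$1" keep="$2"
  find "$dir" -maxdepth 1 -type f -name 'etcd-snapshot-*.db' | sort -r |
    tail -n +"$(( keep + 1 ))" |
    while IFS= read -r f; do rm -f -- "$f"; done
}
```

A job that runs take_snapshot /backup && prune_snapshots /backup 28 every six hours would keep about a week of history; both numbers are illustrative.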

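A snapshot restore creates a new logical cluster with fresh member and cluster IDs, so every member must be restored from the same snapshot file, each with its own --name and --initial-advertise-peer-urls. A single failed member of an otherwise healthy cluster is normally replaced with member remove and member add instead. The example shows the restore command for the first member, etcd-0 (the target --data-dir must be empty or absent; on etcd 3.5 and later, etcdutl snapshot restore is the preferred form):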

etcdctl snapshot restore /backup/etcd-snapshot-20260610-030000.db \
  --name=etcd-0 \
  --initial-cluster=etcd-0=https://10.0.0.11:2380,etcd-1=https://10.0.0.12:2380,etcd-2=https://10.0.0.13:2380 \
  --initial-advertise-peer-urls=https://10.0.0.11:2380 \
  --data-dir=/var/lib/etcd
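For a whole-cluster restore, the same command is repeated on each host with only the member name and peer URL changed. The sketch below prints the three commands for review rather than running them; the function name print_restore_cmds and the cluster token value are assumptions for this example.

```shell
# print_restore_cmds SNAPSHOT: print one restore command per member.
# Member names and IPs follow the example above; the cluster token is a
# placeholder. Run each printed command on the matching host.
print_restore_cmds() {
  local snapshot="$1" member name ip
  local cluster="etcd-0=https://10.0.0.11:2380,etcd-1=https://10.0.0.12:2380,etcd-2=https://10.0.0.13:2380"
  for member in etcd-0:10.0.0.11 etcd-1:10.0.0.12 etcd-2:10.0.0.13; do
    name=${member%%:*}
    ip=${member#*:}
    echo "etcdctl snapshot restore $snapshot" \
      "--name=$name" \
      "--initial-cluster=$cluster" \
      "--initial-cluster-token=etcd-cluster-restored" \
      "--initial-advertise-peer-urls=https://$ip:2380" \
      "--data-dir=/var/lib/etcd"
  done
}
```

Using a new --initial-cluster-token for the restored cluster helps keep its members from accidentally joining peers that still carry the old cluster identity.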

Scaling (3 → 5 nodes)

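Scaling adds one member at a time: register the new node with etcdctl member add, start it with initial-cluster-state set to existing and the initial-cluster value that member add prints, and wait for the cluster to report healthy before adding the next. The article also updates initial-cluster on the existing nodes and restarts them one by one; because etcd reads the initial-* settings only when a member first bootstraps, this keeps the configuration files consistent but is not what changes membership.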

# Add new member on an existing node
etcdctl member add etcd-3 \
  --peer-urls=https://10.0.0.14:2380 \
  --endpoints=https://10.0.0.11:2379 \
  --cacert=/etc/etcd/ssl/ca.pem \
  --cert=/etc/etcd/ssl/etcd.pem \
  --key=/etc/etcd/ssl/etcd-key.pem

After updating initial‑cluster on all nodes, each node is restarted, waiting for the cluster to become healthy before proceeding to the next node.
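The "wait for healthy" step can be automated with a small polling loop. This is a sketch, not the article's script: the function name wait_healthy and its default of 30 attempts two seconds apart are assumptions, and connection flags are expected to come from the ETCDCTL_* environment variables.

```shell
# wait_healthy [ATTEMPTS] [DELAY_SECONDS]
# Polls "etcdctl endpoint health --cluster" until every member reports
# healthy, or gives up after ATTEMPTS tries.
wait_healthy() {
  local attempts="${1:-30}" delay="${2:-2}" i=1
  while [ "$i" -le "$attempts" ]; do
    if etcdctl endpoint health --cluster >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$(( i + 1 ))
  done
  return 1
}
```

On each node, something like systemctl restart etcd && wait_healthy gates the move to the next node; a non-zero exit means the rollout should stop for investigation.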

Performance tuning

Key performance bottlenecks are disk I/O, network latency, and CPU contention. Recommended system settings include disabling swap, turning off transparent hugepages, using the none I/O scheduler on NVMe (called noop on older, non‑multiqueue kernels), and tuning kernel network parameters (e.g., net.core.rmem_max, net.core.wmem_max, BBR congestion control).

# Example sysctl tuning for network
net.core.rmem_max=16777216
net.core.wmem_max=16777216
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_wmem=4096 65536 16777216
net.core.somaxconn=32768
net.ipv4.tcp_max_syn_backlog=8192
net.ipv4.tcp_congestion_control=bbr
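The swap, hugepage, and I/O scheduler settings can be checked without changing anything. The following read-only sketch assumes a Linux host; the function names are invented for this example and the device name passed in (for instance nvme0n1) is whatever backs the etcd data directory.

```shell
# active_choice: read a sysfs "choice" line on stdin and print the entry
# in square brackets, which is the value currently in effect.
active_choice() { sed -n 's/.*\[\([^]]*\)\].*/\1/p'; }

# report_host_tuning DEVICE: print the current settings for one block device.
report_host_tuning() {
  local dev="$1"
  # /proc/swaps has one header line followed by one line per swap device.
  echo "swap devices: $(( $(wc -l < /proc/swaps) - 1 ))"
  echo "transparent hugepages: $(active_choice < /sys/kernel/mm/transparent_hugepage/enabled)"
  echo "I/O scheduler for $dev: $(active_choice < "/sys/block/$dev/queue/scheduler")"
}
```

On a host tuned as described, report_host_tuning nvme0n1 would be expected to show zero swap devices, hugepages set to never, and the scheduler set to none.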

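etcd‑specific parameters such as election-timeout (default 1000 ms) and heartbeat-interval (default 100 ms) can be increased to reduce spurious leader elections on slow disks or high‑latency networks, at the cost of slower detection of a genuinely failed leader.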
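When raising these two values, keep in mind that etcd refuses to start unless election-timeout is at least five times heartbeat-interval, and that the election timeout has an upper limit of 50000 ms. A minimal pre-flight check (the function name check_timeouts is hypothetical) might look like this:

```shell
# check_timeouts HEARTBEAT_MS ELECTION_MS
# Returns 0 if the election timeout is at least 5x the heartbeat interval
# and no more than 50000 ms; otherwise prints the reason and returns 1.
check_timeouts() {
  local hb="$1" el="$2"
  if [ "$el" -lt $(( 5 * hb )) ]; then
    echo "election-timeout ${el}ms is less than 5x heartbeat-interval ${hb}ms" >&2
    return 1
  fi
  if [ "$el" -gt 50000 ]; then
    echo "election-timeout ${el}ms exceeds the 50000ms upper limit" >&2
    return 1
  fi
}
```

For example, check_timeouts 250 2500 passes, while check_timeouts 300 1000 is rejected. The specific values are illustrative; suitable numbers depend on the measured round-trip time between members.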

Compaction and defragmentation

Because MVCC stores every revision, regular compaction removes old versions, and defrag reclaims physical space. The article provides a combined script that compacts to the latest revision and then runs etcdctl defrag, logging the operation to syslog.

#!/usr/bin/env bash
set -euo pipefail
ENDPOINT="https://127.0.0.1:2379"
CACERT="/etc/kubernetes/pki/etcd/ca.crt"
CERT="/etc/kubernetes/pki/etcd/server.crt"
KEY="/etc/kubernetes/pki/etcd/server.key"
# Get the current KV revision (not the Raft index, which counts log entries)
REV=$(etcdctl endpoint status --endpoints="$ENDPOINT" \
  --cacert="$CACERT" --cert="$CERT" --key="$KEY" -w json \
  | jq -r '.[0].Status.header.revision')
# Compact: discard all key history older than $REV
etcdctl compact "$REV" \
  --endpoints="$ENDPOINT" --cacert="$CACERT" --cert="$CERT" --key="$KEY"
# Defrag this member only; it blocks reads and writes while it runs,
# so run the script on one member at a time
etcdctl defrag --endpoints="$ENDPOINT" --cacert="$CACERT" --cert="$CERT" --key="$KEY"
logger -t etcd-compact "compact and defrag completed at rev $REV"
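Because defragmentation blocks the member while it runs, it can help to check first how much space it would actually reclaim. The sketch below assumes the dbSize and dbSizeInUse fields that recent etcd versions report in etcdctl endpoint status -w json; the function name reclaimable_pct is made up for this example.

```shell
# reclaimable_pct DB_SIZE IN_USE
# Prints the percentage of the backend file that a defrag could give back,
# given the physical size and the logically in-use size in bytes.
reclaimable_pct() {
  local db_size="$1" in_use="$2"
  [ "$db_size" -gt 0 ] || return 1
  echo $(( (db_size - in_use) * 100 / db_size ))
}
```

The two inputs can be extracted with jq -r '.[0].Status | "\(.dbSize) \(.dbSizeInUse)"'. A script could then skip the defrag step when the result is below a chosen threshold such as 30 percent; that threshold is an assumption, not a value from the original.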

Monitoring and alerting

etcd exposes extensive Prometheus metrics. Critical health metrics include etcd_server_has_leader, etcd_server_leader_changes_seen_total, and write‑latency metrics such as etcd_disk_wal_fsync_duration_seconds. A sample alerting rule set is provided for leader loss, frequent leader changes, high proposal failure rate, slow WAL fsync, high network RTT, DB size approaching quota, and backup failures.

groups:
- name: etcd
  rules:
  - alert: EtcdNoLeader
    expr: sum(etcd_server_has_leader) == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "etcd cluster has no leader"
  - alert: EtcdHighNumberOfLeaderChanges
    expr: increase(etcd_server_leader_changes_seen_total[1h]) > 5
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: "etcd leader changes frequently"
  - alert: EtcdHighFsyncDuration
    expr: histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) by (le,instance)) > 0.01
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "etcd WAL fsync P99 > 10ms"
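The "DB size approaching quota" condition can also be checked by hand against the /metrics endpoint. The sketch below parses Prometheus text format from standard input; the function names are invented here, and it assumes the etcd_mvcc_db_total_size_in_bytes and etcd_server_quota_backend_bytes metrics are exposed by the running version.

```shell
# metric_value NAME: print the value of an unlabelled metric from
# Prometheus text-format input on stdin.
metric_value() { awk -v m="$1" '$1 == m { print $2; exit }'; }

# quota_used_pct: read a /metrics dump on stdin and print the DB size as a
# whole-number percentage of the backend quota.
quota_used_pct() {
  awk '$1 == "etcd_mvcc_db_total_size_in_bytes" { size = $2 }
       $1 == "etcd_server_quota_backend_bytes"  { quota = $2 }
       END { if (quota > 0) printf "%d\n", size * 100 / quota; else exit 1 }'
}
```

A typical invocation pipes curl into it, for example curl -s --cacert ca.crt --cert server.crt --key server.key https://127.0.0.1:2379/metrics | quota_used_pct, with the certificate paths adjusted to the cluster.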

Security hardening

Enable mutual TLS for both client‑server and peer communication, configure RBAC (root user, read‑only role), enforce network isolation with firewall rules, and optionally encrypt backups with GPG. The article also mentions the --logger=zap flag for audit purposes; note that this selects etcd's structured logger (the default in 3.5) rather than enabling a dedicated audit log.
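The read-only role can be sketched with standard etcdctl auth subcommands. This example assumes the root user already exists and has been given the root role, and it leaves out etcdctl auth enable, which should only be run once every client has credentials. The function name add_readonly_user and its arguments are hypothetical.

```shell
# add_readonly_user USER ROLE PREFIX
# Creates ROLE with read-only access to every key under PREFIX, then
# creates USER and binds it to that role. "user add" prompts for a password.
add_readonly_user() {
  local user="$1" role="$2" prefix="$3"
  etcdctl role add "$role" &&
    etcdctl role grant-permission "$role" read "$prefix" --prefix=true &&
    etcdctl user add "$user" &&
    etcdctl user grant-role "$user" "$role"
}
```

For instance, add_readonly_user viewer readonly /registry/ would give a monitoring account read access to the Kubernetes key space without the ability to modify it.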

Common troubleshooting cases

Seven real‑world cases are described, each with symptoms, diagnostic steps, root cause, and remediation. Examples include frequent elections caused by disk I/O contention, quota exhaustion due to large ConfigMaps, loss of quorum on the minority side of a network partition (often reported as split‑brain, which Raft itself prevents), startup failures from permission or certificate issues, and upgrade problems when moving from etcd 3.4 to 3.5 without a snapshot‑restore.

Upgrade and maintenance

Upgrade strategy follows a rolling approach: backup, test in a staging environment, upgrade one node at a time, wait for leader election stability (≈30 s), verify health metrics, run compact and defrag, and monitor for at least one hour. Rolling upgrades are supported one minor version at a time (3.3 → 3.4, 3.4 → 3.5); skipping a minor version (for example 3.3 → 3.5) is not supported as a rolling upgrade, so either step through each minor version or fall back to a snapshot‑restore.
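A rollout script can guard against an unsupported version jump before it touches the first node. The check below is a sketch under the assumption that versions are plain MAJOR.MINOR.PATCH strings; the function name can_roll is invented for this example.

```shell
# can_roll FROM TO
# Succeeds if TO has the same major version as FROM and is either the same
# minor version (a patch upgrade) or exactly one minor version ahead.
can_roll() {
  local from="$1" to="$2" fmaj fmin tmaj tmin rest
  fmaj=${from%%.*}; rest=${from#*.}; fmin=${rest%%.*}
  tmaj=${to%%.*};   rest=${to#*.};   tmin=${rest%%.*}
  [ "$fmaj" -eq "$tmaj" ] &&
    [ $(( tmin - fmin )) -ge 0 ] &&
    [ $(( tmin - fmin )) -le 1 ]
}
```

With this in place, can_roll 3.4.30 3.5.13 succeeds, while can_roll 3.3.27 3.5.13 fails and signals that an intermediate 3.4 step is needed.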

FAQ

Recommended cluster size: 3 nodes (tolerates 1 failure) or 5 nodes (tolerates 2 failures).

Cross‑AZ deployment requires RTT < 50 ms and a majority in the same AZ.

Database size should stay below 8 GiB; > 4 GiB may need compaction.

etcd does not support horizontal sharding natively; use API server sharding or external tools for large scales.

Key count should stay under 1 M for good performance.

By default, K8s Secrets are stored in etcd only base64‑encoded, not encrypted; use an external secret manager or enable encryption at rest for true secrecy.

Real‑world case study

A financial services company migrated from a single‑SSD three‑node setup to dedicated NVMe nodes, upgraded to etcd 3.5.13, introduced RBAC, automated backup to S3, and added comprehensive Prometheus alerts. After the migration, leader changes dropped to zero, write‑latency P99 fell from 50 ms to 5 ms, and monthly disaster‑recovery drills became routine.

Maintenance checklist

Daily: health check, backup verification, silence non‑critical alerts.

Weekly: restore test, metric trend review, certificate expiry audit.

Monthly: version upgrade review, capacity planning, alert rule refinement.

Quarterly: full disaster‑recovery drill, performance stress test, documentation update.

Appendix

Quick reference of common etcdctl commands.

Table of important configuration parameters.

Complete Prometheus alert rule set and recommended Grafana dashboards (etcd‑dashboard, kube‑etcd).

Further reading: official etcd docs, Kubernetes etcd ops guide, and books such as "Designing Data‑Intensive Applications".
