
Why Etcd Clusters Use Odd Nodes & What Happens During Leader Election

This article explains etcd’s Raft‑based consensus, why odd‑numbered nodes are recommended, details the leader election process with log excerpts, discusses split‑brain and consistency guarantees, and provides step‑by‑step instructions for generating certificates, deploying an etcd cluster, and using etcdctl commands.

Full-Stack DevOps & Kubernetes

Sep 18, 2020

Etcd Leader Election Process

etcd replicates its data across members with the Raft consensus algorithm. Consider a three-node cluster (nodes 100, 101 and 102) that is in term 7 with node 101 as leader. Stopping the leader triggers a new election among the remaining nodes, and the candidate that collects votes from a majority of the cluster becomes leader for the next term.

Key log excerpts (simplified for clarity):

91d63231b87fadda [term 7] received MsgTimeoutNow from 8a4bb0af2f19bd46 and starts an election to get leadership.
91d63231b87fadda became candidate at term 8
91d63231b87fadda received MsgVoteResp from 91d63231b87fadda at term 8
91d63231b87fadda [logterm: 7, index: 4340153] sent MsgVote request to 8a4bb0af2f19bd46 at term 8
91d63231b87fadda [logterm: 7, index: 4340153] sent MsgVote request to 9feab580a25dd270 at term 8
etcd[24203]: lost the TCP streaming connection with peer 8a4bb0af2f19bd46 (stream MsgApp v2 reader)
9feab580a25dd270 [term: 7] received a MsgVote message with higher term from 91d63231b87fadda [term: 8]
9feab580a25dd270 became follower at term 8
9feab580a25dd270 [logterm: 7, index: 4340153, vote: 0] cast MsgVote for 91d63231b87fadda
9feab580a25dd270 elected leader 91d63231b87fadda at term 8
91d63231b87fadda elected leader 91d63231b87fadda at term 8

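Node 102 (ID 91d63231b87fadda) receives its own vote and a vote from node 100 (ID 9feab580a25dd270), reaching a majority (2 of 3) and becoming leader for term 8. The MsgTimeoutNow message in the first line shows that the outgoing leader (ID 8a4bb0af2f19bd46) handed leadership over as it shut down, so node 102 started campaigning immediately instead of waiting for an election timeout.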
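The "cast MsgVote" line reflects Raft's voting rule: a member grants at most one vote per term, and only to a candidate whose log is at least as up to date as its own. The sketch below is a simplified illustration of that check; the function name grants_vote and its argument order are invented for this example, and etcd's real implementation (in its Go Raft library) handles more cases.

```shell
# grants_vote VOTED_FOR CAND_ID CAND_LOGTERM CAND_INDEX MY_LOGTERM MY_INDEX
# Returns 0 (grant) or 1 (reject). VOTED_FOR is 0 when no vote has been
# cast in the current term, mirroring "vote: 0" in the log excerpt.
grants_vote() {
  local voted_for=$1 cand=$2 cand_logterm=$3 cand_index=$4
  local my_logterm=$5 my_index=$6

  # One vote per term: reject if already committed to another candidate.
  if [ "$voted_for" != "0" ] && [ "$voted_for" != "$cand" ]; then
    return 1
  fi
  # The candidate's log must be at least as up to date as the voter's:
  # a higher last log term wins; on a tie, the longer log wins.
  if [ "$cand_logterm" -gt "$my_logterm" ]; then
    return 0
  fi
  if [ "$cand_logterm" -eq "$my_logterm" ] && [ "$cand_index" -ge "$my_index" ]; then
    return 0
  fi
  return 1
}

# Values from the excerpt: both nodes are at logterm 7, index 4340153.
if grants_vote 0 91d63231b87fadda 7 4340153 7 4340153; then
  echo "vote granted"
else
  echo "vote rejected"
fi
```

With the values from the excerpt the logs are equally up to date and no vote has been cast yet, so the vote is granted.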

Must an etcd cluster have an odd number of nodes?

Raft requires a strict majority (> 50 %) of votes to elect a leader. Official guidance recommends 3, 5 or 7 nodes. An even-sized cluster also works, but it tolerates no more failures than the next smaller odd-sized cluster, because the extra node raises the majority threshold by one.

1 node – single‑instance, no clustering.

2 nodes – can elect a leader, but loss of either node eliminates quorum → no fault tolerance.

3 nodes – tolerates one failure.

4 nodes – still tolerates only one failure (majority = 3).

5 nodes – tolerates two failures.

6 nodes – tolerates two failures (majority = 4).

Growing an odd-sized cluster by one node therefore increases hardware cost without improving availability or write performance; it may even reduce throughput, because each write must be replicated to a larger majority.
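The figures in the list follow from two formulas: the quorum is floor(n/2) + 1, and the number of tolerated failures is n minus the quorum. A small sketch that reproduces them (the function names quorum and tolerance are chosen only for this illustration):

```shell
# quorum N: smallest strict majority of N members.
quorum() { echo $(( $1 / 2 + 1 )); }

# tolerance N: members that can fail while a quorum still survives.
tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

for n in 1 2 3 4 5 6 7; do
  printf '%d nodes: quorum=%d, tolerated failures=%d\n' \
    "$n" "$(quorum "$n")" "$(tolerance "$n")"
done
```

Each even size prints the same tolerance as the odd size just below it, which is why the extra member buys nothing in availability.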

Does adding more etcd nodes improve performance?

etcd forms a Raft group; every write must be persisted on a majority of members before the leader can commit. Adding nodes raises the number of replicas that must acknowledge each write, increasing network latency and CPU/memory usage. Consequently, write throughput declines as the cluster grows. For most production workloads, 3, 5 or 7 nodes provide a good balance between availability and performance. Larger clusters rarely bring measurable benefits.
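To see why a larger majority costs latency, consider a toy model: the leader can commit as soon as enough followers have acknowledged the entry to complete a quorum, so the commit waits for the k-th fastest follower, where k is the quorum minus one. The sketch below assumes this simplified model only; it ignores disk fsync, batching and pipelining, and the round-trip times are made-up numbers.

```shell
# commit_latency_ms RTT...: given one round-trip time (ms) per follower,
# print the RTT of the acknowledgement that completes the quorum.
commit_latency_ms() {
  local followers=$#
  local quorum=$(( (followers + 1) / 2 + 1 ))
  local needed=$(( quorum - 1 ))   # the leader itself counts toward the quorum
  if [ "$needed" -eq 0 ]; then
    echo 0
    return
  fi
  # Sort the RTTs numerically and pick the needed-th fastest follower.
  printf '%s\n' "$@" | sort -n | sed -n "${needed}p"
}

commit_latency_ms 2 9          # 3-node cluster: waits for the fastest follower
commit_latency_ms 2 9 12 15    # 5-node cluster: waits for the second fastest
```

In this model the three-node cluster commits after the fastest follower replies, while the five-node cluster has to wait for a slower one.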

In Kubernetes HA setups the typical pattern is three master nodes, each running an etcd member.

Does etcd suffer from split‑brain?

A split-brain occurs when two disjoint subsets of a cluster each believe they have a valid leader and accept writes independently. etcd prevents this by requiring a strict majority of the full membership both to elect a leader and to commit a write. After a partition, at most one side can hold a majority: that side remains the available cluster and continues to serve reads and writes, while the minority side loses quorum and becomes unavailable.

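Example: a 5-node cluster with node 5 as leader is split by a network partition. Whichever side still contains at least three members holds quorum: it keeps its leader or, if the leader is on the other side, elects a new one in a higher term, and continues normal operation. The side with two or fewer members cannot gather a majority, so it cannot elect a leader. It rejects writes and linearizable reads (both need quorum) and can serve only serializable, possibly stale, reads. If node 5 ends up on the minority side, it steps down as leader.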

Is etcd strongly consistent?

Yes. Writes are always linearizable because they are committed through Raft, and reads are linearizable when issued as quorum reads:

etcd v2 – set quorum=true in the SDK or rely on the default etcdctl behaviour to obtain linearizable reads.

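etcd v3 – reads are linearizable by default, both in the client SDK and in etcdctl (--consistency=l). To trade consistency for lower latency, opt in to serializable reads with clientv3.WithSerializable() in the Go client or --consistency=s with etcdctl; these are answered from the contacted member's local state and may be stale.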
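As a sketch, the two read modes can be wrapped in small helpers around etcdctl's --consistency flag. The helper names and the ETCDCTL override variable are assumptions made for this example, and the TLS and endpoint flags are omitted for brevity.

```shell
# ETCDCTL may be overridden; it defaults to the etcdctl binary on PATH.
ETCDCTL=${ETCDCTL:-etcdctl}

# Linearizable read (the default): confirmed through quorum, never stale.
get_linearizable() { "$ETCDCTL" get "$1" --consistency=l; }

# Serializable read: answered from the contacted member's local state,
# so it may be stale but still works on a member that has lost quorum.
get_serializable() { "$ETCDCTL" get "$1" --consistency=s; }
```

For instance, get_serializable foo (with foo as a hypothetical key) should still return a value from a member in a minority partition, whereas get_linearizable foo would time out there.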

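Early v3 releases routed every linearizable read through the Raft log, which added latency. Since v3.1, etcd uses Raft's ReadIndex mechanism instead: the leader records its current commit index, confirms with a quorum that it is still the leader, and answers the read once the state machine has applied that index. This preserves linearizability without writing a log entry for each read.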

Etcd Cluster Deployment

Below is a concise, step‑by‑step procedure to create a TLS‑secured three‑node etcd cluster (etcd v3.4). Adjust IP addresses and hostnames as needed.

1. Generate a Certificate Authority (CA)

Create ca-config.json to define expiry and profile:

{
  "signing": {
    "default": { "expiry": "87600h" },
    "profiles": {
      "demo": {
        "usages": ["signing","key encipherment","server auth","client auth"],
        "expiry": "87600h"
      }
    }
  }
}

Create ca-csr.json for the CA subject:

{
  "CN": "demo",
  "key": { "algo": "rsa", "size": 2048 },
  "names": [{ "C": "CN", "ST": "BeiJing", "L": "BeiJing", "O": "demo", "OU": "cloudnative" }]
}

Generate the CA certificate and key:

cfssl gencert -initca ca-csr.json | cfssljson -bare ca

Resulting files: ca.pem (public), ca-key.pem (private), ca.csr.

2. Generate etcd member certificates

Create etcd-csr.json – list the IPs of all etcd members in the hosts array:

{
  "CN": "demo",
  "hosts": ["127.0.0.1", "ip1", "ip2", "ip3"],
  "key": { "algo": "rsa", "size": 2048 },
  "names": [{ "C": "CN", "ST": "BeiJing", "L": "BeiJing", "O": "demo", "OU": "cloudnative" }]
}

Generate the member certificate signed by the CA:

cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=demo etcd-csr.json | cfssljson -bare etcd

Resulting files: etcd.pem, etcd-key.pem, etcd.csr.

Copy ca.pem, etcd.pem and etcd-key.pem to the same directory on each cluster node (e.g., /etc/etcd/ssl/).

3. Start etcd on each node

Example command for node 0 (replace the IP placeholders with the actual addresses of the three nodes):

./etcd \
  --name=etcd-0 \
  --client-cert-auth=true \
  --cert-file=/etc/etcd/ssl/etcd.pem \
  --key-file=/etc/etcd/ssl/etcd-key.pem \
  --peer-cert-file=/etc/etcd/ssl/etcd.pem \
  --peer-key-file=/etc/etcd/ssl/etcd-key.pem \
  --trusted-ca-file=/etc/etcd/ssl/ca.pem \
  --peer-trusted-ca-file=/etc/etcd/ssl/ca.pem \
  --initial-advertise-peer-urls https://100.0.0.0:2380 \
  --listen-peer-urls https://100.0.0.0:2380 \
  --listen-client-urls https://100.0.0.0:2379,https://127.0.0.1:2379 \
  --advertise-client-urls https://100.0.0.0:2379 \
  --initial-cluster-token etcd-cluster \
  --initial-cluster etcd-0=https://100.0.0.0:2380,etcd-1=https://100.0.0.1:2380,etcd-2=https://100.0.0.2:2380 \
  --initial-cluster-state new \
  --quota-backend-bytes=8589934592 \
  --auto-compaction-retention=10 \
  --enable-pprof=true \
  --data-dir=/var/lib/etcd

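Repeat the command on the other two nodes, changing --name, --initial-advertise-peer-urls, --listen-peer-urls, --listen-client-urls and --advertise-client-urls to match each node's name and IP. The --initial-cluster and --initial-cluster-token values must be identical on all three nodes.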
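To avoid copy-and-paste mistakes, the per-node flags can be generated from a single member list. This is only a sketch: the MEMBERS variable and the function names are invented here, and the IPs are the article's placeholders.

```shell
# Hypothetical member list as "name=ip" pairs, using the placeholder IPs above.
MEMBERS="etcd-0=100.0.0.0 etcd-1=100.0.0.1 etcd-2=100.0.0.2"

# initial_cluster: the --initial-cluster value, identical on every node.
initial_cluster() {
  local out="" m
  for m in $MEMBERS; do
    out="${out:+$out,}${m%%=*}=https://${m#*=}:2380"
  done
  echo "$out"
}

# node_flags NAME: print the flags that differ from node to node.
node_flags() {
  local name=$1 ip="" m
  for m in $MEMBERS; do
    if [ "${m%%=*}" = "$name" ]; then
      ip=${m#*=}
    fi
  done
  if [ -z "$ip" ]; then
    echo "unknown member: $name" >&2
    return 1
  fi
  printf '%s\n' \
    "--name=$name" \
    "--initial-advertise-peer-urls https://$ip:2380" \
    "--listen-peer-urls https://$ip:2380" \
    "--listen-client-urls https://$ip:2379,https://127.0.0.1:2379" \
    "--advertise-client-urls https://$ip:2379"
}

node_flags etcd-1
echo "--initial-cluster $(initial_cluster)"
```

The printed lines can be pasted into the start command for the chosen node; the certificate, quota and data-directory flags stay the same everywhere.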

etcdctl Commands (TLS‑enabled cluster)

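All etcdctl invocations must provide the CA certificate, the client certificate and the client key. Example to query cluster status (replace the placeholder hosts in --endpoints with the addresses of your nodes):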

ETCDCTL_API=3 ./etcdctl \
  --endpoints=https://0:2379,https://1:2379,https://2:2379 \
  --cacert /etc/etcd/ssl/ca.pem \
  --cert /etc/etcd/ssl/etcd.pem \
  --key /etc/etcd/ssl/etcd-key.pem \
  endpoint status --write-out=table

Commonly used subcommands:

version – show the etcdctl and API version.

member list – list cluster members.

endpoint status – show each endpoint's leader flag, database size, Raft term and Raft index.

endpoint health – health check with latency.

put <key> <value> – write a key; writing to an existing key overwrites it.

get <key> – read a key.

del <key> – delete a key.

watch <key> – monitor changes.

get / --prefix --keys-only – list all keys under the / prefix.

snapshot save <file> – back up the keyspace to a snapshot file.

update <key> <value>, mkdir <dir>, rmdir <dir> – v2 API only (ETCDCTL_API=2). The v3 keyspace is flat, so use put to modify a key and del <prefix> --prefix to remove everything under a prefix.
