How Raft Achieves Consensus: Leader Election, Log Replication, and State Machine Explained
This article explains the core mechanisms of the Raft consensus algorithm—including leader election, log replication, safety guarantees, message structures, state transitions, and key Go implementations in etcd-raft—providing code examples and detailed analysis of functions such as becomeLeader, tickElection, and appendEntry.
In essence, the Raft algorithm achieves consensus by always following the leader, ensuring a series of values and node logs stay consistent.
Leader election: after a leader fails, the cluster quickly selects a new leader.
Log replication: only the leader can write logs; it replicates them to followers and forces followers to stay identical.
Safety: only one leader per term, committed entries survive leader changes, and all state machines apply identical log entries.
Leader Election
Raft treats every operation—election, commit, etc.—as a message struct fed into the Raft state machine. The etcd‑raft library defines this struct as follows:
type Message struct {
Type MessageType `protobuf:"varint,1,opt,name=type,enum=raftpb.MessageType" json:"type"` // message type
To uint64 `protobuf:"varint,2,opt,name=to" json:"to"` // receiver node ID
From uint64 `protobuf:"varint,3,opt,name=from" json:"from"` // sender node ID
Term uint64 `protobuf:"varint,4,opt,name=term" json:"term"` // sender's term (0 for local messages)
LogTerm uint64 `protobuf:"varint,5,opt,name=logTerm" json:"logTerm"` // term of first entry in the message
Index uint64 `protobuf:"varint,6,opt,name=index" json:"index"` // log index for follower reports
Entries []Entry `protobuf:"bytes,7,rep,name=entries" json:"entries"` // entries to replicate (MsgApp)
Commit uint64 `protobuf:"varint,8,opt,name=commit" json:"commit"` // committed log index
Snapshot Snapshot `protobuf:"bytes,9,opt,name=snapshot" json:"snapshot"` // snapshot data
Reject bool `protobuf:"varint,10,opt,name=reject" json:"reject"` // whether the message is rejected
RejectHint uint64 `protobuf:"varint,11,opt,name=rejectHint" json:"rejectHint"` // hint for rejected messages
Context []byte `protobuf:"bytes,12,opt,name=context" json:"context,omitempty"` // optional context information
XXX_unrecognized []byte `json:"-"`
}The MessageType enum contains 19 values, only a subset of which are used for a given message:
MsgHup = 0 // follower election timeout
MsgBeat = 1 // leader heartbeat
MsgProp = 2 // client write request
MsgApp = 3 // leader sends entries to follower
MsgAppResp = 4 // follower response to MsgApp
MsgVote = 5 // candidate requests vote
MsgVoteResp = 6 // vote response
MsgSnap = 7 // leader sends snapshot
MsgHeartbeat = 8 // leader heartbeat (alternative name)
MsgHeartbeatResp = 9 // follower heartbeat response
MsgUnreachable = 10 // message cannot be delivered
MsgSnapStatus = 11 // snapshot status after send failure
MsgCheckQuorum = 12 // leader checks quorum
MsgTransferLeader = 13 // leader transfer (local)
MsgTimeoutNow = 14 // force immediate election
MsgReadIndex = 15 // read‑only request
MsgReadIndexResp = 16 // read‑only response
MsgPreVote = 17 // pre‑candidate vote request
MsgPreVoteResp = 18 // pre‑candidate vote responseThe node struct wraps the etcd‑raft library, exposing channels for different message categories:
type node struct {
propc chan msgWithResult // MsgProp messages
recvc chan pb.Message // all other messages
confc chan pb.ConfChangeV2 // configuration changes
confstatec chan pb.ConfState // current cluster configuration
readyc chan Ready // Ready instances for the upper layer
advancec chan struct{} // signals that Ready has been processed
tickc chan struct{} // logical clock ticks
done chan struct{} // closed when node stops
stop chan struct{} // triggers node shutdown
status chan chan Status // status queries
rn *RawNode
}Raft nodes can be in three states: Candidate, Follower, or Leader. Each state has its own tick and step functions, and the state machine transitions are illustrated below:
etcd‑raft adds a PreCandidate state to avoid unnecessary elections when logs are far behind. The state‑transition functions are defined in raft.go:
func (r *raft) becomeFollower(term uint64, lead uint64) {
r.step = stepFollower
r.reset(term)
r.tick = r.tickElection
r.lead = lead
r.state = StateFollower
}
func (r *raft) becomePreCandidate() {
if r.state == StateLeader {
panic("invalid transition [leader -> pre-candidate]")
}
r.step = stepCandidate
r.prs.ResetVotes()
r.tick = r.tickElection
r.lead = None
r.state = StatePreCandidate
}
func (r *raft) becomeCandidate() {
if r.state == StateLeader {
panic("invalid transition [leader -> candidate]")
}
r.step = stepCandidate
r.reset(r.Term + 1)
r.tick = r.tickElection
r.Vote = r.id
r.state = StateCandidate
}
func (r *raft) becomeLeader() {
if r.state == StateFollower {
panic("invalid transition [follower -> leader]")
}
r.step = stepLeader
r.reset(r.Term)
r.tick = r.tickHeartbeat
r.lead = r.id
r.state = StateLeader
r.prs.Progress[r.id].BecomeReplicate()
r.pendingConfIndex = r.raftLog.lastIndex()
emptyEnt := pb.Entry{Data: nil}
if !r.appendEntry(emptyEnt) {}
r.reduceUncommittedSize([]pb.Entry{emptyEnt})
}The tickElection and tickHeartbeat functions drive time‑based actions:
func (r *raft) tickElection() {
r.electionElapsed++
if r.promotable() && r.pastElectionTimeout() {
r.electionElapsed = 0
r.Step(pb.Message{From: r.id, Type: pb.MsgHup})
}
}
func (r *raft) tickHeartbeat() {
r.heartbeatElapsed++
r.electionElapsed++
if r.electionElapsed >= r.electionTimeout {
r.electionElapsed = 0
if r.checkQuorum {
r.Step(pb.Message{From: r.id, Type: pb.MsgCheckQuorum})
}
if r.state == StateLeader && r.leadTransferee != None {
r.abortLeaderTransfer()
}
}
if r.state != StateLeader {
return
}
if r.heartbeatElapsed >= r.heartbeatTimeout {
r.heartbeatElapsed = 0
r.Step(pb.Message{From: r.id, Type: pb.MsgBeat})
}
}Log replication is performed by appendEntry, which sets term and index, appends to the raft log, updates the leader’s progress, and attempts to commit:
func (r *raft) appendEntry(es ...pb.Entry) (accepted bool) {
li := r.raftLog.lastIndex()
for i := range es {
es[i].Term = r.Term
es[i].Index = li + 1 + uint64(i)
}
li = r.raftLog.append(es...)
r.prs.Progress[r.id].MaybeUpdate(li)
r.maybeCommit()
return true
}The Progress.MaybeUpdate method adjusts the follower’s match and next indices:
func (pr *Progress) MaybeUpdate(n uint64) bool {
var updated bool
if pr.Match < n {
pr.Match = n
updated = true
pr.ProbeAcked()
}
pr.Next = max(pr.Next, n+1)
return updated
}Commit decisions are made in maybeCommit based on the majority of matched indices:
func (r *raft) maybeCommit() bool {
mci := r.prs.Committed()
return r.raftLog.maybeCommit(mci, r.Term)
}All Raft messages are processed by the Step function, which first checks the term and then dispatches based on message type (e.g., MsgHup, MsgVote, MsgPreVote, etc.). The function may invoke state‑specific handlers such as hup, campaign, or the current step implementation.
func (r *raft) Step(m pb.Message) error {
switch {
case m.Term == 0:
// local message, ignore
case m.Term > r.Term:
// handle higher term
case m.Term < r.Term:
// ignore stale term
}
switch m.Type {
case pb.MsgHup:
if r.preVote {
// start pre‑vote
} else {
r.hup(campaignElection)
}
case pb.MsgVote, pb.MsgPreVote:
canVote := r.Vote == m.From || (r.Vote == None && r.lead == None) || (m.Type == pb.MsgPreVote && m.Term > r.Term)
if canVote && r.raftLog.isUpToDate(m.Index, m.LogTerm) {
r.send(pb.Message{To: m.From, Term: m.Term, Type: voteRespMsgType(m.Type)})
if m.Type == pb.MsgVote {
r.electionElapsed = 0
r.Vote = m.From
}
} else {
r.send(pb.Message{To: m.From, Term: r.Term, Type: voteRespMsgType(m.Type), Reject: true})
}
default:
if err := r.step(r, m); err != nil {
return err
}
}
return nil
}When a follower’s election timer expires, it triggers tickElection, which creates a MsgHup that moves the node to PreCandidate, then sends MsgPreVote to the cluster.
Reference: GeeksTime etcd practical course and the book "etcd Inside".
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
