Backend Development 21 min read

How Raft Achieves Consensus: Leader Election, Log Replication, and State Machine Explained

This article explains the core mechanisms of the Raft consensus algorithm—including leader election, log replication, safety guarantees, message structures, state transitions, and key Go implementations in etcd-raft—providing code examples and detailed analysis of functions such as becomeLeader, tickElection, and appendEntry.

Ops Development Stories
Ops Development Stories
Ops Development Stories
How Raft Achieves Consensus: Leader Election, Log Replication, and State Machine Explained

In essence, the Raft algorithm achieves consensus by always following the leader, ensuring a series of values and node logs stay consistent.

Leader election: after a leader fails, the cluster quickly selects a new leader.

Log replication: only the leader can write logs; it replicates them to followers and forces followers to stay identical.

Safety: only one leader per term, committed entries survive leader changes, and all state machines apply identical log entries.

Leader Election

Raft treats every operation—election, commit, etc.—as a message struct fed into the Raft state machine. The etcd‑raft library defines this struct as follows:

<code>type Message struct {
    Type MessageType `protobuf:"varint,1,opt,name=type,enum=raftpb.MessageType" json:"type"` // message type
    To uint64 `protobuf:"varint,2,opt,name=to" json:"to"` // receiver node ID
    From uint64 `protobuf:"varint,3,opt,name=from" json:"from"` // sender node ID
    Term uint64 `protobuf:"varint,4,opt,name=term" json:"term"` // sender's term (0 for local messages)
    LogTerm uint64 `protobuf:"varint,5,opt,name=logTerm" json:"logTerm"` // term of first entry in the message
    Index uint64 `protobuf:"varint,6,opt,name=index" json:"index"` // log index for follower reports
    Entries []Entry `protobuf:"bytes,7,rep,name=entries" json:"entries"` // entries to replicate (MsgApp)
    Commit uint64 `protobuf:"varint,8,opt,name=commit" json:"commit"` // committed log index
    Snapshot Snapshot `protobuf:"bytes,9,opt,name=snapshot" json:"snapshot"` // snapshot data
    Reject bool `protobuf:"varint,10,opt,name=reject" json:"reject"` // whether the message is rejected
    RejectHint uint64 `protobuf:"varint,11,opt,name=rejectHint" json:"rejectHint"` // hint for rejected messages
    Context []byte `protobuf:"bytes,12,opt,name=context" json:"context,omitempty"` // optional context information
    XXX_unrecognized []byte `json:"-"`
}
</code>

The

MessageType

enum contains 19 values, only a subset of which are used for a given message:

<code>MsgHup = 0          // follower election timeout
MsgBeat = 1         // leader heartbeat
MsgProp = 2         // client write request
MsgApp = 3          // leader sends entries to follower
MsgAppResp = 4      // follower response to MsgApp
MsgVote = 5         // candidate requests vote
MsgVoteResp = 6     // vote response
MsgSnap = 7         // leader sends snapshot
MsgHeartbeat = 8    // leader heartbeat (alternative name)
MsgHeartbeatResp = 9 // follower heartbeat response
MsgUnreachable = 10 // message cannot be delivered
MsgSnapStatus = 11   // snapshot status after send failure
MsgCheckQuorum = 12  // leader checks quorum
MsgTransferLeader = 13 // leader transfer (local)
MsgTimeoutNow = 14   // force immediate election
MsgReadIndex = 15   // read‑only request
MsgReadIndexResp = 16 // read‑only response
MsgPreVote = 17      // pre‑candidate vote request
MsgPreVoteResp = 18  // pre‑candidate vote response
</code>

The

node

struct wraps the etcd‑raft library, exposing channels for different message categories:

<code>type node struct {
    propc      chan msgWithResult   // MsgProp messages
    recvc      chan pb.Message      // all other messages
    confc      chan pb.ConfChangeV2 // configuration changes
    confstatec chan pb.ConfState    // current cluster configuration
    readyc     chan Ready           // Ready instances for the upper layer
    advancec   chan struct{}        // signals that Ready has been processed
    tickc      chan struct{}        // logical clock ticks
    done       chan struct{}        // closed when node stops
    stop       chan struct{}        // triggers node shutdown
    status     chan chan Status     // status queries
    rn         *RawNode
}
</code>

Raft nodes can be in three states: Candidate, Follower, or Leader. Each state has its own tick and step functions, and the state machine transitions are illustrated below:

etcd‑raft adds a PreCandidate state to avoid unnecessary elections when logs are far behind. The state‑transition functions are defined in

raft.go

:

<code>func (r *raft) becomeFollower(term uint64, lead uint64) {
    r.step = stepFollower
    r.reset(term)
    r.tick = r.tickElection
    r.lead = lead
    r.state = StateFollower
}

func (r *raft) becomePreCandidate() {
    if r.state == StateLeader {
        panic("invalid transition [leader -> pre-candidate]")
    }
    r.step = stepCandidate
    r.prs.ResetVotes()
    r.tick = r.tickElection
    r.lead = None
    r.state = StatePreCandidate
}

func (r *raft) becomeCandidate() {
    if r.state == StateLeader {
        panic("invalid transition [leader -> candidate]")
    }
    r.step = stepCandidate
    r.reset(r.Term + 1)
    r.tick = r.tickElection
    r.Vote = r.id
    r.state = StateCandidate
}

func (r *raft) becomeLeader() {
    if r.state == StateFollower {
        panic("invalid transition [follower -> leader]")
    }
    r.step = stepLeader
    r.reset(r.Term)
    r.tick = r.tickHeartbeat
    r.lead = r.id
    r.state = StateLeader
    r.prs.Progress[r.id].BecomeReplicate()
    r.pendingConfIndex = r.raftLog.lastIndex()
    emptyEnt := pb.Entry{Data: nil}
    if !r.appendEntry(emptyEnt) {}
    r.reduceUncommittedSize([]pb.Entry{emptyEnt})
}
</code>

The

tickElection

and

tickHeartbeat

functions drive time‑based actions:

<code>func (r *raft) tickElection() {
    r.electionElapsed++
    if r.promotable() && r.pastElectionTimeout() {
        r.electionElapsed = 0
        r.Step(pb.Message{From: r.id, Type: pb.MsgHup})
    }
}

func (r *raft) tickHeartbeat() {
    r.heartbeatElapsed++
    r.electionElapsed++
    if r.electionElapsed >= r.electionTimeout {
        r.electionElapsed = 0
        if r.checkQuorum {
            r.Step(pb.Message{From: r.id, Type: pb.MsgCheckQuorum})
        }
        if r.state == StateLeader && r.leadTransferee != None {
            r.abortLeaderTransfer()
        }
    }
    if r.state != StateLeader {
        return
    }
    if r.heartbeatElapsed >= r.heartbeatTimeout {
        r.heartbeatElapsed = 0
        r.Step(pb.Message{From: r.id, Type: pb.MsgBeat})
    }
}
</code>

Log replication is performed by

appendEntry

, which sets term and index, appends to the raft log, updates the leader’s progress, and attempts to commit:

<code>func (r *raft) appendEntry(es ...pb.Entry) (accepted bool) {
    li := r.raftLog.lastIndex()
    for i := range es {
        es[i].Term = r.Term
        es[i].Index = li + 1 + uint64(i)
    }
    li = r.raftLog.append(es...)
    r.prs.Progress[r.id].MaybeUpdate(li)
    r.maybeCommit()
    return true
}
</code>

The

Progress.MaybeUpdate

method adjusts the follower’s match and next indices:

<code>func (pr *Progress) MaybeUpdate(n uint64) bool {
    var updated bool
    if pr.Match < n {
        pr.Match = n
        updated = true
        pr.ProbeAcked()
    }
    pr.Next = max(pr.Next, n+1)
    return updated
}
</code>

Commit decisions are made in

maybeCommit

based on the majority of matched indices:

<code>func (r *raft) maybeCommit() bool {
    mci := r.prs.Committed()
    return r.raftLog.maybeCommit(mci, r.Term)
}
</code>

All Raft messages are processed by the

Step

function, which first checks the term and then dispatches based on message type (e.g.,

MsgHup

,

MsgVote

,

MsgPreVote

, etc.). The function may invoke state‑specific handlers such as

hup

,

campaign

, or the current

step

implementation.

<code>func (r *raft) Step(m pb.Message) error {
    switch {
    case m.Term == 0:
        // local message, ignore
    case m.Term > r.Term:
        // handle higher term
    case m.Term < r.Term:
        // ignore stale term
    }
    switch m.Type {
    case pb.MsgHup:
        if r.preVote {
            // start pre‑vote
        } else {
            r.hup(campaignElection)
        }
    case pb.MsgVote, pb.MsgPreVote:
        canVote := r.Vote == m.From || (r.Vote == None && r.lead == None) || (m.Type == pb.MsgPreVote && m.Term > r.Term)
        if canVote && r.raftLog.isUpToDate(m.Index, m.LogTerm) {
            r.send(pb.Message{To: m.From, Term: m.Term, Type: voteRespMsgType(m.Type)})
            if m.Type == pb.MsgVote {
                r.electionElapsed = 0
                r.Vote = m.From
            }
        } else {
            r.send(pb.Message{To: m.From, Term: r.Term, Type: voteRespMsgType(m.Type), Reject: true})
        }
    default:
        if err := r.step(r, m); err != nil {
            return err
        }
    }
    return nil
}
</code>

When a follower’s election timer expires, it triggers

tickElection

, which creates a

MsgHup

that moves the node to

PreCandidate

, then sends

MsgPreVote

to the cluster.

Reference: GeeksTime etcd practical course and the book "etcd Inside".

distributed systemsGoRaftconsensusetcd
Ops Development Stories
Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.