Fundamentals 19 min read

Mastering Distributed Consensus: Inside Raft and the SOFAJRaft Library

This article explains the fundamentals of distributed consensus algorithms, compares Paxos, Zab, and Raft, dives deep into Raft's leader election, log replication, and commit mechanics, and showcases the Java‑based SOFAJRaft library with its architecture, linearizable read optimizations, and real‑world use cases.

Alibaba Cloud Developer

Jun 1, 2021

Mastering Distributed Consensus: Inside Raft and the SOFAJRaft Library

Distributed Consensus Algorithms

Multiple participants must reach a single, immutable decision about an event.

Common algorithms

Paxos – the foundational algorithm, but only single‑proposal; multi‑paxos adds high engineering complexity.

Zab – used by ZooKeeper, widely adopted but not provided as a generic library.

Raft – designed for understandability; many implementations such as etcd, braft, TiKV.

Raft Overview

Raft relies on a strong leader that alone can receive client requests and replicate proposals to followers.

The leader communicates proactively with all followers, sending proposals and collecting majority acknowledgments.

The leader also sends periodic heartbeats to maintain its leadership.

Followers are passive; they only respond to messages from the leader or a candidate.

State Machine Replication

Clients' write operations are turned into a write‑ahead log (WAL). The leader writes the log locally and replicates it to followers. Once a majority acknowledges, the leader applies the operation to the state machine and replies to the client.

Raft Core Concepts

Node Roles

Follower – passive, initial state.

Leader – handles all client requests and log replication.

Candidate – triggered by election timeout to start a new election.

Message Types

RequestVote RPC – sent by candidates.

AppendEntries (Heartbeat) RPC – sent by the leader.

InstallSnapshot RPC – sent by the leader.

Term and Election

Time is divided into monotonically increasing terms.

Each term begins with a leader election; at most one leader per term (a term may have no leader).

Election uses random timeouts to reduce split‑vote collisions.

Raft Functional Decomposition

Leader Election

When a follower times out it becomes a candidate, increments its term, and sends RequestVote RPCs. The candidate wins if it receives votes from a majority.

Log Replication

Log entries are tuples (TermId, LogIndex, LogValue). Replication ensures continuity (no gaps) and consistency (identical entries on nodes with the same term and index).

Commit Index

The commit index marks the highest log position that has been replicated on a majority and can be applied to the state machine. Leaders advance it in AppendEntries RPCs; followers update their local commit index accordingly.

SOFAJRaft Library

SOFAJRaft is a pure‑Java implementation of the Raft algorithm, providing the following components:

Node – the entry point for clients; exposes apply(task) to submit commands.

LogStorage, MetadataStorage, SnapshotStorage – persistent stores for logs, meta‑information, and optional snapshots.

StateMachine – user‑defined logic executed via onApply.

Replicator and ReplicatorGroup – handle AppendEntries and heartbeat communication.

RPC Server/Client – internal communication between nodes.

KV Store – a typical application built on top of SOFAJRaft.

Linearizable Reads

ReadIndex optimization records the current commit index, sends a heartbeat to confirm leadership, and waits until the local apply index exceeds the recorded index before serving the read.

Lease Read further reduces latency by using a short lease period during which the leader can serve reads without an extra heartbeat.

Future “wait‑free” reads aim to eliminate the remaining latency by ensuring the leader’s state machine is up‑to‑date after applying its first log entry in a term.

Implementation Example

// KV store linearizable read using SOFAJRaft
public void readFromQuorum(String key, AsyncContext asyncContext) {
    byte[] reqContext = new byte[4];
    Bits.putInt(reqContext, 0, requestId.incrementAndGet());
    this.node.readIndex(reqContext, new ReadIndexClosure() {
        @Override
        public void run(Status status, long index, byte[] reqCtx) {
            if (status.isOk()) {
                try {
                    asyncContext.sendResponse(new ValueCommand(fsm.getValue(key)));
                } catch (KeyNotFoundException e) {
                    asyncContext.sendResponse(GetCommandProcessor.createKeyNotFoundResponse());
                }
            } else {
                asyncContext.sendResponse(new BooleanCommand(false, status.getErrorMsg()));
            }
        }
    });
}

Use Cases

Leader election and distributed lock services (e.g., ZooKeeper).

Highly reliable metadata management.

Distributed storage systems such as message queues, file systems, and block stores.

Embedded strong‑consistency KV stores (e.g., RheaKV).

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

State Machine SOFAJRaft Raft Leader Election distributed consensus Log Replication linearizable read

Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.