Mastering Distributed Consensus: Inside Raft and the SOFAJRaft Library
This article explains the fundamentals of distributed consensus algorithms, compares Paxos, Zab, and Raft, dives deep into Raft's leader election, log replication, and commit mechanics, and showcases the Java‑based SOFAJRaft library with its architecture, linearizable read optimizations, and real‑world use cases.
Distributed Consensus Algorithms
Multiple participants must reach a single, immutable decision about an event.
Common algorithms
Paxos – the foundational algorithm, but only single‑proposal; multi‑paxos adds high engineering complexity.
Zab – used by ZooKeeper, widely adopted but not provided as a generic library.
Raft – designed for understandability; many implementations such as etcd, braft, TiKV.
Raft Overview
Raft relies on a strong leader that alone can receive client requests and replicate proposals to followers.
The leader communicates proactively with all followers, sending proposals and collecting majority acknowledgments.
The leader also sends periodic heartbeats to maintain its leadership.
Followers are passive; they only respond to messages from the leader or a candidate.
State Machine Replication
Clients' write operations are turned into a write‑ahead log (WAL). The leader writes the log locally and replicates it to followers. Once a majority acknowledges, the leader applies the operation to the state machine and replies to the client.
Raft Core Concepts
Node Roles
Follower – passive, initial state.
Leader – handles all client requests and log replication.
Candidate – triggered by election timeout to start a new election.
Message Types
RequestVote RPC – sent by candidates.
AppendEntries (Heartbeat) RPC – sent by the leader.
InstallSnapshot RPC – sent by the leader.
Term and Election
Time is divided into monotonically increasing terms.
Each term begins with a leader election; at most one leader per term (a term may have no leader).
Election uses random timeouts to reduce split‑vote collisions.
Raft Functional Decomposition
Leader Election
When a follower times out it becomes a candidate, increments its term, and sends RequestVote RPCs. The candidate wins if it receives votes from a majority.
Log Replication
Log entries are tuples (TermId, LogIndex, LogValue). Replication ensures continuity (no gaps) and consistency (identical entries on nodes with the same term and index).
Commit Index
The commit index marks the highest log position that has been replicated on a majority and can be applied to the state machine. Leaders advance it in AppendEntries RPCs; followers update their local commit index accordingly.
SOFAJRaft Library
SOFAJRaft is a pure‑Java implementation of the Raft algorithm, providing the following components:
Node – the entry point for clients; exposes apply(task) to submit commands.
LogStorage, MetadataStorage, SnapshotStorage – persistent stores for logs, meta‑information, and optional snapshots.
StateMachine – user‑defined logic executed via onApply.
Replicator and ReplicatorGroup – handle AppendEntries and heartbeat communication.
RPC Server/Client – internal communication between nodes.
KV Store – a typical application built on top of SOFAJRaft.
Linearizable Reads
ReadIndex optimization records the current commit index, sends a heartbeat to confirm leadership, and waits until the local apply index exceeds the recorded index before serving the read.
Lease Read further reduces latency by using a short lease period during which the leader can serve reads without an extra heartbeat.
Future “wait‑free” reads aim to eliminate the remaining latency by ensuring the leader’s state machine is up‑to‑date after applying its first log entry in a term.
Implementation Example
// KV store linearizable read using SOFAJRaft
public void readFromQuorum(String key, AsyncContext asyncContext) {
byte[] reqContext = new byte[4];
Bits.putInt(reqContext, 0, requestId.incrementAndGet());
this.node.readIndex(reqContext, new ReadIndexClosure() {
@Override
public void run(Status status, long index, byte[] reqCtx) {
if (status.isOk()) {
try {
asyncContext.sendResponse(new ValueCommand(fsm.getValue(key)));
} catch (KeyNotFoundException e) {
asyncContext.sendResponse(GetCommandProcessor.createKeyNotFoundResponse());
}
} else {
asyncContext.sendResponse(new BooleanCommand(false, status.getErrorMsg()));
}
}
});
}Use Cases
Leader election and distributed lock services (e.g., ZooKeeper).
Highly reliable metadata management.
Distributed storage systems such as message queues, file systems, and block stores.
Embedded strong‑consistency KV stores (e.g., RheaKV).
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
