How Raft Powers Reliable Distributed Systems – From Consensus Basics to SOFAJRaft
This article explains distributed consensus, compares Paxos, ZAB and Raft, details Raft's leader election, log replication and commit mechanics, introduces read‑optimisation techniques such as ReadIndex and Lease Read, and showcases the Java‑based SOFAJRaft library with its architecture, performance tricks and real‑world use cases.
Distributed Consensus Algorithms
Distributed consensus requires multiple participants to agree on a single value; once a value has been decided, the decision cannot be revoked.
Common Algorithms
Paxos – widely regarded as the foundation of consensus algorithms, but hard to understand and implement, especially Multi‑Paxos.
ZAB – the protocol used in ZooKeeper; widely deployed, but not available as a general‑purpose library.
Raft – praised for understandability; many implementations exist (etcd, braft, TiKV).
Raft Overview
Strong leader: at most one leader exists per term, and it handles all client requests.
Leader sends proposals to followers and collects majority acknowledgments.
Leader sends periodic heartbeats to maintain leadership.
State Machine Replication
Clients write to the leader, which records each operation as a write‑ahead log (WAL) entry and replicates it to the followers. Once a majority of nodes have persisted the entry, the leader applies it to the state machine and replies to the client.
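The property this relies on can be illustrated with a minimal sketch (not SOFAJRaft code): replicas that apply the same committed entries in the same order to a deterministic state machine end up in the same state. The `Put` and `KvStateMachine` names are hypothetical and exist only for this illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical command type: one replicated log entry carries one put operation.
final class Put {
    final String key;
    final String value;

    Put(String key, String value) {
        this.key = key;
        this.value = value;
    }
}

// Hypothetical deterministic state machine: the same log in yields the same state out.
final class KvStateMachine {
    private final Map<String, String> data = new HashMap<>();
    private long lastApplied = 0; // index of the last applied entry (1-based)

    // Apply committed entries strictly in log order.
    void applyAll(List<Put> committedLog) {
        for (Put p : committedLog) {
            data.put(p.key, p.value);
            lastApplied++;
        }
    }

    // Defensive copy so callers cannot mutate replica state.
    Map<String, String> snapshot() {
        return new HashMap<>(data);
    }

    long lastApplied() {
        return lastApplied;
    }
}
```

Feeding the same list of `Put` entries to two `KvStateMachine` instances yields equal snapshots, which is why Raft only has to agree on the log, not on the state itself.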
Raft Roles
Follower – passive, only responds to leader or candidate messages.
Leader – processes client requests and replicates logs.
Candidate – initiates leader election after timeout.
Message Types
RequestVote RPC – sent by candidates.
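AppendEntries RPC – sent by the leader to replicate log entries; with no entries it doubles as the heartbeat.
InstallSnapshot RPC – sent by the leader to followers that have fallen too far behind to catch up from the log.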
Term and Election
Time is divided into monotonically increasing terms. Each term begins with a leader election; at most one leader per term, but a term may have no leader.
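Terms act as a logical clock. The sketch below illustrates the standard Raft term rule under stated assumptions: a node rejects messages from older terms and steps down to follower when it sees a newer one. `TermTracker` is a hypothetical helper written for this article, not a SOFAJRaft class.

```java
// Hypothetical helper illustrating the Raft term rule; not part of SOFAJRaft.
final class TermTracker {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    private long currentTerm;
    private Role role;

    TermTracker(long currentTerm, Role role) {
        this.currentTerm = currentTerm;
        this.role = role;
    }

    // Returns false when the message comes from a stale term and must be rejected.
    boolean observe(long messageTerm) {
        if (messageTerm < currentTerm) {
            return false;
        }
        if (messageTerm > currentTerm) {
            // A newer term exists: adopt it and fall back to follower.
            currentTerm = messageTerm;
            role = Role.FOLLOWER;
        }
        return true;
    }

    // A node whose election timer fires starts a new term as a candidate.
    void startElection() {
        currentTerm++;
        role = Role.CANDIDATE;
    }

    long currentTerm() {
        return currentTerm;
    }

    Role role() {
        return role;
    }
}
```

Because every term number is used for at most one election, a stale leader that reconnects after a partition is forced to step down as soon as it observes a higher term.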
Leader Election Details
Randomized election timeout reduces split‑vote collisions.
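Candidate increments its term, votes for itself, sends RequestVote RPCs, and becomes leader if it receives votes from a majority.
Log completeness restricts who can win: a node grants its vote only if the candidate's log is at least as up‑to‑date as its own, so the elected leader holds every committed entry.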
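The election rules above can be sketched as three small checks. This is an illustrative sketch of the standard Raft rules, not SOFAJRaft's implementation; `ElectionSketch` and its method names are hypothetical, and the timeout range of one to two times the base value is an assumption chosen for the example.

```java
import java.util.Random;

// Hypothetical helper; names and the timeout range are assumptions for illustration.
final class ElectionSketch {

    // Randomized timeout in [baseMs, 2 * baseMs) so nodes rarely time out together.
    static long randomElectionTimeoutMs(long baseMs, Random rng) {
        return baseMs + (long) (rng.nextDouble() * baseMs);
    }

    // A voter grants its vote only if the candidate's log is at least as up-to-date:
    // a higher last term wins; with equal last terms, the longer log wins.
    static boolean isCandidateLogUpToDate(long candLastTerm, long candLastIndex,
                                          long myLastTerm, long myLastIndex) {
        if (candLastTerm != myLastTerm) {
            return candLastTerm > myLastTerm;
        }
        return candLastIndex >= myLastIndex;
    }

    // Strict majority of the whole cluster, counting the candidate's own vote.
    static boolean hasMajority(int votes, int clusterSize) {
        return votes > clusterSize / 2;
    }
}
```

Note that the comparison looks at the last term before the log length: a short log ending in a newer term beats a long log ending in an older one.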
Log Replication
Log entries are tuples (TermId, LogIndex, LogValue). Replication must preserve continuity (no gaps) and validity (identical term and index imply identical value).
Followers check prevTermId and prevLogIndex in AppendEntries for consistency.
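If logs diverge, the leader decrements nextIndex and retries until it finds the point where the follower's log matches; the follower then discards its conflicting entries and appends the leader's.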
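The consistency check and the back‑off loop can be sketched as follows. This is a simplified in‑memory illustration, not SOFAJRaft code: `LogEntry`, `FollowerLog` and `LeaderReplicator` are hypothetical names, indexes are 1‑based, and networking, persistence and terms on the RPC itself are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical log entry; the log index is implied by the position in the list.
final class LogEntry {
    final long term;
    final String value;

    LogEntry(long term, String value) {
        this.term = term;
        this.value = value;
    }
}

// Hypothetical follower-side log; log index i lives at list position i - 1.
final class FollowerLog {
    private final List<LogEntry> entries = new ArrayList<>();

    // Returns false when the entry at prevLogIndex is missing or has a different term.
    boolean appendEntries(long prevLogIndex, long prevTerm, List<LogEntry> newEntries) {
        if (prevLogIndex > entries.size()) {
            return false; // accepting would leave a gap
        }
        if (prevLogIndex > 0 && entries.get((int) prevLogIndex - 1).term != prevTerm) {
            return false; // logs diverge at or before prevLogIndex
        }
        for (int i = 0; i < newEntries.size(); i++) {
            int pos = (int) prevLogIndex + i;
            LogEntry incoming = newEntries.get(i);
            if (pos < entries.size() && entries.get(pos).term != incoming.term) {
                entries.subList(pos, entries.size()).clear(); // drop the conflicting suffix
            }
            if (pos >= entries.size()) {
                entries.add(incoming);
            }
        }
        return true;
    }

    long lastIndex() {
        return entries.size();
    }

    long termAt(long index) {
        return entries.get((int) index - 1).term;
    }
}

// Hypothetical leader-side loop: back off nextIndex until the follower accepts.
final class LeaderReplicator {
    static long syncFollower(List<LogEntry> leaderLog, FollowerLog follower) {
        long nextIndex = leaderLog.size() + 1;
        while (true) {
            long prevIndex = nextIndex - 1;
            long prevTerm = prevIndex == 0 ? 0 : leaderLog.get((int) prevIndex - 1).term;
            List<LogEntry> suffix = leaderLog.subList((int) prevIndex, leaderLog.size());
            if (follower.appendEntries(prevIndex, prevTerm, suffix)) {
                return nextIndex; // first index the follower had to receive
            }
            nextIndex--; // mismatch: try one entry earlier
        }
    }
}
```

The loop always terminates, because a request with prevLogIndex 0 has nothing to compare against and is accepted by any follower.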
Commit Index Advancement
CommitIndex marks the highest log entry known to be replicated on a majority of nodes. The leader includes its current CommitIndex in AppendEntries requests; followers apply all entries ≤ CommitIndex.
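One common way to compute the new CommitIndex is to sort the replication progress of all nodes and take the value that a majority has reached. The sketch below assumes a hypothetical `matchIndex` array holding the last replicated index per node, with the leader's own last index included; it also applies the Raft rule that a leader advances the commit point only through entries of its current term. `CommitIndexSketch` is not a SOFAJRaft class.

```java
import java.util.Arrays;
import java.util.function.LongUnaryOperator;

// Hypothetical helper; matchIndex must include the leader's own last log index.
final class CommitIndexSketch {

    static long advance(long[] matchIndex, long commitIndex, long currentTerm,
                        LongUnaryOperator termAt) {
        long[] sorted = matchIndex.clone();
        Arrays.sort(sorted);
        int majority = sorted.length / 2 + 1;
        // At least `majority` nodes have replicated up to this index.
        long candidate = sorted[sorted.length - majority];
        // Never move backwards, and only commit entries from the leader's own term.
        if (candidate > commitIndex && termAt.applyAsLong(candidate) == currentTerm) {
            return candidate;
        }
        return commitIndex;
    }
}
```

With five nodes at indexes 5, 3, 4, 2 and 5, three nodes have reached index 4, so the commit point can move to 4 provided that entry belongs to the current term.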
AppendEntries RPC Structure
(currentTerm, logEntries[], prevTerm, prevLogIndex, commitTerm, commitLogIndex)
SOFAJRaft Library
Pure Java implementation of Raft with several enhancements.
Core Features
Leader election, log replication and recovery, snapshot & log compaction.
Membership changes, leader transfer, symmetric/asymmetric network partition tolerance.
Metrics collection and Jepsen‑based fault‑injection testing.
Architecture
Node – entry point exposing apply(task) to submit commands.
LogStorage, MetadataStorage, SnapshotStorage – persistent components.
StateMachine – user business logic (onApply).
Replicator and ReplicatorGroup – handle AppendEntries and heartbeats.
RPC Server/Client – communication between nodes.
Read Optimizations
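Linearizable read through the Raft log – every read is replicated like a write, which is costly.
ReadIndex – the leader records its current commit index as the ReadIndex, confirms via a heartbeat round to a majority that it is still the leader, then waits until the apply index ≥ ReadIndex before serving the read.
Lease Read – the leader relies on a short lease to skip the heartbeat round, further reducing latency; correctness then depends on clock drift between nodes staying bounded.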
Wait‑Free (planned) – would eliminate the final waiting step.
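The ReadIndex and Lease Read steps can be sketched as three small checks. This is a conceptual sketch rather than SOFAJRaft's internal logic: `ReadSketch` and its methods are hypothetical, the leadership check is passed in as a callback standing in for the heartbeat round, and the lease test assumes a monotonic clock expressed in nanoseconds.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper illustrating ReadIndex and Lease Read; not SOFAJRaft code.
final class ReadSketch {

    // Step 1 and 2 of ReadIndex: capture the commit index, then confirm leadership.
    // Returns the read index, or -1 if leadership could not be confirmed.
    static long readIndex(long commitIndex, BooleanSupplier confirmLeadershipByHeartbeat) {
        long captured = commitIndex;
        return confirmLeadershipByHeartbeat.getAsBoolean() ? captured : -1;
    }

    // Step 3: the read may be served once the state machine has applied up to the read index.
    static boolean canServe(long readIndex, long appliedIndex) {
        return readIndex >= 0 && appliedIndex >= readIndex;
    }

    // Lease Read replaces the heartbeat round with a time check against the last
    // moment a majority acknowledged the leader.
    static boolean leaseValid(long lastQuorumAckNanos, long nowNanos, long leaseNanos) {
        return nowNanos - lastQuorumAckNanos < leaseNanos;
    }
}
```

The read index is captured before the leadership check on purpose: any write acknowledged before the read started is at or below that index, so waiting for it to be applied is enough for linearizability.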
Code Example
// KV store linearizable read via ReadIndex
public void readFromQuorum(String key, AsyncContext asyncContext) {
    // Opaque request context handed back to the closure; here a 4-byte request ID.
    byte[] reqContext = new byte[4];
    Bits.putInt(reqContext, 0, requestId.incrementAndGet());
    this.node.readIndex(reqContext, new ReadIndexClosure() {
        @Override
        public void run(Status status, long index, byte[] reqCtx) {
            if (status.isOk()) {
                // The ReadIndex request succeeded, so the local state machine can be read directly.
                try {
                    asyncContext.sendResponse(new ValueCommand(fsm.getValue(key)));
                } catch (KeyNotFoundException e) {
                    asyncContext.sendResponse(GetCommandProcessor.createKeyNotFoundResponse());
                }
            } else {
                // The ReadIndex request failed; report the error to the client.
                asyncContext.sendResponse(new BooleanCommand(false, status.getErrorMsg()));
            }
        }
    });
}
Use Cases
Leader election, distributed lock services, high‑reliability metadata management.
Distributed storage systems such as message queues, file systems, block stores.
Real‑world deployments: AntQ Streams QCoordinator, Schema Registry, SOFA service registry, RheaKV.
Practical Example – Simple KV Store
Design a KV store on top of SOFAJRaft, illustrating the region, store, and PD (placement driver) concepts.
Advanced Example – RheaKV
Embedded, strongly consistent KV store built with SOFAJRaft and RocksDB, featuring self‑diagnosis, auto‑optimization, and self‑recovery.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
