Fundamentals 19 min read

How Raft Powers Reliable Distributed Systems – From Consensus Basics to SOFAJRaft

This article explains distributed consensus, compares Paxos, ZAB and Raft, details Raft's leader election, log replication and commit mechanics, introduces read‑optimisation techniques such as ReadIndex and Lease Read, and showcases the Java‑based SOFAJRaft library with its architecture, performance tricks and real‑world use cases.

Alibaba Cloud Developer
Alibaba Cloud Developer
Alibaba Cloud Developer
How Raft Powers Reliable Distributed Systems – From Consensus Basics to SOFAJRaft

Distributed Consensus Algorithms

Multiple participants must reach complete agreement on a single value; the decision is irrevocable.

Common Algorithms

Paxos – considered the foundation of consensus but complex, especially multi‑Paxos.

Zab – used in ZooKeeper, widely adopted but not a generic library.

Raft – praised for understandability; many implementations exist (etcd, braft, TiKV).

Raft Overview

Strong Leader : only one leader exists at a time and handles all client requests.

Leader sends proposals to followers and collects majority acknowledgments.

Leader sends periodic heartbeats to maintain leadership.

State Machine Replication

Clients write to the leader, which converts operations to a write‑ahead log (WAL) and replicates it to followers. Once a majority have persisted the entry, the leader applies it to the state machine and replies to the client.

Raft Roles

Follower – passive, only responds to leader or candidate messages.

Leader – processes client requests and replicates logs.

Candidate – initiates leader election after timeout.

Message Types

RequestVote RPC – sent by candidates.

AppendEntries (Heartbeat) RPC – sent by leader.

InstallSnapshot RPC – sent by leader.

Term and Election

Time is divided into monotonically increasing terms. Each term begins with a leader election; at most one leader per term, but a term may have no leader.

Leader Election Details

Randomized election timeout reduces split‑vote collisions.

Candidate increments term, sends RequestVote RPC, and becomes leader if it receives a majority.

Log completeness is used to select the most up‑to‑date candidate.

Log Replication

Log entries are tuples (TermId, LogIndex, LogValue). Replication must preserve continuity (no gaps) and validity (identical term and index imply identical value).

Followers check prevTermId and prevLogIndex in AppendEntries for consistency.

If logs diverge, leader decrements nextIndex and retries until followers match.

Commit Index Advancement

CommitIndex marks the highest log entry known to be replicated on a majority. Leaders broadcast the current CommitIndex; followers apply all entries ≤ CommitIndex.

AppendEntries RPC Structure

(currentTerm, logEntries[], prevTerm, prevLogIndex, commitTerm, commitLogIndex)

SOFAJRaft Library

Pure Java implementation of Raft with several enhancements.

Core Features

Leader election, log replication and recovery, snapshot & log compaction.

Membership changes, leader transfer, symmetric/asymmetric network partition tolerance.

Metrics collection and Jepsen‑based fault‑injection testing.

Architecture

Node – entry point exposing apply(task) to submit commands.

LogStorage, MetadataStorage, SnapshotStorage – persistent components.

StateMachine – user business logic (onApply).

Replicator and ReplicatorGroup – handle AppendEntries and heartbeats.

RPC Server/Client – communication between nodes.

Read Optimizations

Linearizable read via full Raft log – costly.

ReadIndex – leader records commit index, sends heartbeat, waits until apply index ≥ ReadIndex.

Lease Read – uses a short lease to skip the heartbeat step, further reducing latency.

Wait‑Free (planned) – would eliminate the final waiting step.

Code Example

// KV store linearizable read
public void readFromQuorum(String key, AsyncContext asyncContext) {
    byte[] reqContext = new byte[4];
    Bits.putInt(reqContext, 0, requestId.incrementAndGet());
    this.node.readIndex(reqContext, new ReadIndexClosure() {
        @Override
        public void run(Status status, long index, byte[] reqCtx) {
            if (status.isOk()) {
                try {
                    asyncContext.sendResponse(new ValueCommand(fsm.getValue(key)));
                } catch (KeyNotFoundException e) {
                    asyncContext.sendResponse(GetCommandProcessor.createKeyNotFoundResponse());
                }
            } else {
                asyncContext.sendResponse(new BooleanCommand(false, status.getErrorMsg()));
            }
        }
    });
}

Use Cases

Leader election, distributed lock services, high‑reliability metadata management.

Distributed storage systems such as message queues, file systems, block stores.

Real‑world deployments: AntQ Streams QCoordinator, Schema Registry, SOFA service registry, RheaKV.

Practical Example – Simple KV Store

Design a KV store on top of SOFAJRaft, illustrating region, store, and PD concepts.

Advanced Example – RheaKV

Embedded, strongly consistent KV store built with SOFAJRaft and RocksDB, featuring self‑diagnosis, auto‑optimization, and self‑recovery.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsSOFAJRaftRaftConsensus AlgorithmLog Replication
Alibaba Cloud Developer
Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.