Mastering Raft: Building a Distributed KV Store in Elixir from Scratch
This article walks through the core concepts of the Raft consensus algorithm, explains why Elixir is a good fit, details the leader election and log‑replication mechanisms, and shows how to integrate a user state machine to create a fully functional distributed key‑value database with snapshots and recovery.
Introduction
Raft is a distributed consensus algorithm introduced by Diego Ongaro and John Ousterhout in 2013. It is widely used in systems that require strong consistency, such as etcd, Consul, CockroachDB, and many internal middleware platforms.
Core Concepts
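Replicated log and state machine: the log is an append-only, ordered sequence of entries; each node applies committed entries in order to its state machine, so nodes with the same log reach the same state.
Leader, Follower, Candidate: the three roles a node can hold. At most one leader is elected per term, and that leader coordinates log replication and client requests.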
Understanding these concepts is essential because the log is the only reliable source of truth in a Raft cluster.
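The relationship between the log and the state machine can be sketched in a few lines of Elixir. This is a hypothetical illustration, not code from the article's project: the module name LogSketch, the entry shape, and the {:put, key, value} / {:delete, key} command tuples are assumptions.

```elixir
defmodule LogSketch do
  @moduledoc "Hypothetical illustration; not taken from the article's code."

  # Append a command to the log, assigning the next 1-based index.
  def append(log, term, command) do
    index = length(log) + 1
    log ++ [%{index: index, term: term, command: command}]
  end

  # Replay every entry up to commit_index, in order, against an initial state.
  def replay(log, commit_index, initial \\ %{}) do
    log
    |> Enum.take_while(&(&1.index <= commit_index))
    |> Enum.reduce(initial, fn entry, state -> apply_command(state, entry.command) end)
  end

  defp apply_command(state, {:put, key, value}), do: Map.put(state, key, value)
  defp apply_command(state, {:delete, key}), do: Map.delete(state, key)
end
```

Because replay/3 is deterministic, any two nodes that hold the same log prefix and the same commit index compute the same map.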
Why Elixir?
Excellent networking support inherited from Erlang.
Powerful interactive shell for debugging.
Lightweight concurrency model that makes building distributed protocols straightforward.
OTP’s gen_statem provides a ready‑made state‑machine framework.
The implementation also draws inspiration from the production-grade Raft library Dragonboat.
Leader Election Implementation
The election process follows the original Raft paper:
Term: a monotonically increasing number that defines a leader's tenure.
Election timer: each follower runs a randomized timer; if it expires before a valid heartbeat arrives, the follower becomes a candidate and requests votes.
Message types: request_vote, append_entries, etc.
State-machine framework: implemented with Erlang's gen_statem.
Key details include:
When a node becomes a candidate, it increments its term and votes for itself.
Followers grant a vote only if they have not already voted for another candidate in that term and the candidate's log is at least as up-to-date as theirs.
If a candidate receives votes from a majority, it becomes the leader.
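The voting rules above can be sketched as pure functions. This is a minimal sketch under stated assumptions, not the article's implementation: the module name ElectionSketch, the map fields, and the {last_log_term, last_log_index} tuples are hypothetical. The one-vote-per-term check follows the Raft paper.

```elixir
defmodule ElectionSketch do
  @moduledoc "Hypothetical illustration of vote granting; not the article's code."

  # Both arguments are {last_log_term, last_log_index} tuples.
  # The later term wins; with equal terms, the longer log wins.
  def log_up_to_date?({c_term, c_index}, {v_term, v_index}) do
    c_term > v_term or (c_term == v_term and c_index >= v_index)
  end

  # voter:   %{current_term: t, voted_for: id | nil, last_log: {term, index}}
  # request: %{term: t, candidate_id: id, last_log: {term, index}}
  # Returns {granted?, updated_voter}.
  def handle_request_vote(voter, request) do
    if request.term < voter.current_term do
      # Requests from an older term are always rejected.
      {false, voter}
    else
      # Seeing a newer term adopts it and clears any earlier vote.
      voter =
        if request.term > voter.current_term do
          %{voter | current_term: request.term, voted_for: nil}
        else
          voter
        end

      can_vote = voter.voted_for in [nil, request.candidate_id]

      if can_vote and log_up_to_date?(request.last_log, voter.last_log) do
        {true, %{voter | voted_for: request.candidate_id}}
      else
        {false, voter}
      end
    end
  end

  # A candidate wins once its votes exceed half of the cluster size.
  def majority?(votes, cluster_size), do: votes > div(cluster_size, 2)
end
```

In a three-node cluster the candidate's own vote plus one granted vote already satisfies majority?/2.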
Leader Responsibilities
After election, the leader periodically sends heartbeat messages to followers to maintain its authority and to drive log replication.
Followers reset their election timers upon receiving a valid heartbeat.
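The interplay between heartbeats and the election timer maps naturally onto gen_statem state timeouts. The following is a minimal sketch, not the article's code: the module name TimerSketch, the 150..300 ms default range, and the :heartbeat and :role messages are assumptions made for illustration.

```elixir
defmodule TimerSketch do
  @moduledoc "Hypothetical sketch of a follower's election timer; not the article's code."
  @behaviour :gen_statem

  # timeout_range is in milliseconds; 150..300 is an assumed default.
  def start_link(timeout_range \\ 150..300) do
    :gen_statem.start_link(__MODULE__, timeout_range, [])
  end

  def heartbeat(pid), do: :gen_statem.cast(pid, :heartbeat)
  def role(pid), do: :gen_statem.call(pid, :role)

  @impl true
  def callback_mode, do: :state_functions

  @impl true
  def init(timeout_range) do
    data = %{range: timeout_range, term: 0}
    {:ok, :follower, data, [election_timer(timeout_range)]}
  end

  # A heartbeat restarts the state timeout with a fresh random value.
  def follower(:cast, :heartbeat, data) do
    {:keep_state_and_data, [election_timer(data.range)]}
  end

  # No heartbeat arrived in time: bump the term and become a candidate.
  def follower(:state_timeout, :election, data) do
    {:next_state, :candidate, %{data | term: data.term + 1}}
  end

  def follower({:call, from}, :role, data) do
    {:keep_state_and_data, [{:reply, from, {:follower, data.term}}]}
  end

  # The candidate state only answers role queries in this sketch.
  def candidate({:call, from}, :role, data) do
    {:keep_state_and_data, [{:reply, from, {:candidate, data.term}}]}
  end

  def candidate(_type, _event, _data), do: :keep_state_and_data

  defp election_timer(range), do: {:state_timeout, Enum.random(range), :election}
end
```

Setting a state timeout while one is already running replaces it, which is why a heartbeat only needs to return the timer action again. A real candidate would also send vote requests and start its own timer; both are omitted here.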
Log Replication
The leader appends client commands to its local log and replicates them to followers using AppendEntries (called Replicate in this implementation). Once a log entry is stored on a majority of nodes, it is considered committed and can be applied to the state machine.
The safety of the algorithm relies on strict term and index checks during replication and commit phases.
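On the follower side, the term and index checks can be sketched as follows. This is a hypothetical illustration based on the AppendEntries rules in the Raft paper, not the article's Replicate handler: the module name ReplicationSketch, the entry shape, and the return tuples are assumptions.

```elixir
defmodule ReplicationSketch do
  @moduledoc "Hypothetical follower-side append check; not the article's code."

  # log is a list of %{index: i, term: t, command: c} with 1-based indexes.
  # Index 0 stands for "before the first entry" and has term 0.
  def term_at(_log, 0), do: 0

  def term_at(log, index) do
    case Enum.at(log, index - 1) do
      nil -> nil
      entry -> entry.term
    end
  end

  # Accept the entries only if the follower's log contains an entry at
  # prev_index whose term is prev_term; otherwise reject without changes.
  def handle_append(log, prev_index, prev_term, entries) do
    if term_at(log, prev_index) == prev_term do
      {:ok, merge(log, entries)}
    else
      {:reject, log}
    end
  end

  defp merge(log, []), do: log

  defp merge(log, [%{index: index, term: term} | rest] = entries) do
    case term_at(log, index) do
      # Past the end of the log: append everything that remains.
      nil -> log ++ entries
      # The same entry is already stored: skip it.
      ^term -> merge(log, rest)
      # Conflict: drop this entry and all that follow, then append.
      _other -> Enum.take(log, index - 1) ++ entries
    end
  end
end
```

Entries that already match are skipped instead of truncated, so a delayed or duplicated message cannot remove entries the follower has already accepted.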
Log Replication Safety Discussion
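Both followers and the leader check terms before accepting or committing entries. Followers reject messages from a stale term, which prevents a deposed leader from overwriting newer logs. The leader advances the commit index by counting replicas only for entries from its current term; entries from earlier terms become committed indirectly once a current-term entry is committed.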
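The leader-side commit rule can be sketched as a single function. This is a minimal sketch following the commit rule in the Raft paper, not the article's code: the module name CommitSketch, the terms_by_index map, and the match_indexes list (one entry per node, including the leader's own last index) are assumptions.

```elixir
defmodule CommitSketch do
  @moduledoc "Hypothetical leader-side commit rule; not the article's code."

  # terms_by_index: %{log_index => term} for the leader's log.
  # match_indexes:  highest replicated index per node, leader included.
  # Returns the new commit index.
  def advance_commit(terms_by_index, match_indexes, current_term, commit_index) do
    # The element at position div(n, 2) of the descending list is the
    # highest index that a majority of nodes have stored.
    sorted = Enum.sort(match_indexes, :desc)
    replicated = Enum.at(sorted, div(length(sorted), 2))

    # Terms never decrease along the log, so if this entry is not from the
    # current term, no earlier entry is either, and nothing new can commit.
    if replicated > commit_index and Map.get(terms_by_index, replicated) == current_term do
      replicated
    else
      commit_index
    end
  end
end
```

For example, a term-2 leader whose majority-replicated entry is still from term 1 leaves the commit index unchanged until one of its own term-2 entries reaches a majority.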
User State Machine
The state machine executes deterministic business commands derived from committed log entries. It provides APIs for:
Starting and stopping the machine.
Applying commands (PUT, DELETE).
Reading current state.
Creating and loading snapshots.
Snapshots capture both Raft metadata (current term, commit index, cluster configuration) and the application’s key‑value data, allowing fast recovery.
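A state machine with this shape can be sketched as a small module. This is a hypothetical illustration, not the article's statemachine.ex: the module name KvSketch and its function names are assumptions, and the snapshot is serialized with :erlang.term_to_binary/1 purely to keep the example dependency-free.

```elixir
defmodule KvSketch do
  @moduledoc "Hypothetical key-value state machine; not the article's code."

  def new, do: %{}

  # Commands must be deterministic so every node reaches the same state.
  def apply_command(state, {:put, key, value}), do: Map.put(state, key, value)
  def apply_command(state, {:delete, key}), do: Map.delete(state, key)

  def read(state, key), do: Map.get(state, key)

  # meta carries the Raft metadata saved alongside the data,
  # for example %{term: 2, commit_index: 3}.
  def save_snapshot(state, meta, path) do
    File.write!(path, :erlang.term_to_binary(%{meta: meta, data: state}))
  end

  # Returns {meta, state} as stored by save_snapshot/3.
  def load_snapshot(path) do
    %{meta: meta, data: data} = path |> File.read!() |> :erlang.binary_to_term()
    {meta, data}
  end
end
```

After a restart, a node would load the snapshot and then replay only the log entries that follow the saved commit index.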
Distributed KVDB Implementation
A simple key‑value store is built on top of the Raft core:
State machine: in-memory map with JSON-based snapshot serialization.
KVDB API: show, put, delete, read, save_snapshot.
Reads are served as a “dirty read” from the local map, so they may return stale data on a node that has not yet applied the latest entries; writes and deletes are proposals that go through the Raft log.
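The consequence of reading from the local map can be shown with two replicas that have applied different amounts of the same committed log. This is a hypothetical sketch, not the article's KVDB API: the module name DirtyReadSketch and the replica shape are assumptions, and only put commands are modeled.

```elixir
defmodule DirtyReadSketch do
  @moduledoc "Hypothetical illustration of a stale local read; not the article's code."

  def new_replica, do: %{applied: 0, data: %{}}

  # Apply the committed commands this replica has not applied yet.
  def catch_up(replica, committed) do
    committed
    |> Enum.drop(replica.applied)
    |> Enum.reduce(replica, fn {:put, key, value}, r ->
      %{r | applied: r.applied + 1, data: Map.put(r.data, key, value)}
    end)
  end

  # A dirty read looks only at the local map and never consults the leader.
  def read(replica, key), do: Map.get(replica.data, key)
end
```

A replica that has already applied the put of foo1 returns "bar1", while one that has not caught up yet returns nil for the same key until catch_up/2 runs.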
Verification
Three Erlang nodes are started, forming a stable Raft cluster (node 1 becomes leader). The following scenarios are demonstrated:
Write foo1:bar1 on the leader; reads on followers return bar1.
Delete foo1 on a follower; subsequent reads on all nodes return no value.
Create a snapshot, restart a node, and verify that the node recovers the key‑value state from disk.
Conclusion
The article provides a deep dive into Raft’s core mechanisms, demonstrates a complete Elixir implementation, and shows how to build a distributed KV store with snapshot support. It also outlines open challenges such as consistent reads, handling network partitions, optimal commit/apply timing, and dynamic membership.
Project Structure
├── README.md
├── Taskfile.yaml
├── lib
│   └── ex_raft
│       ├── config.ex
│       ├── core
│       │   ├── candidate.ex
│       │   ├── common.ex
│       │   ├── follower.ex
│       │   ├── free.ex
│       │   ├── leader.ex
│       │   └── prevote.ex
│       ├── debug.ex
│       ├── exception.ex
│       ├── guards.ex
│       ├── log_store
│       │   ├── cub.ex
│       │   └── inmem.ex
│       ├── log_store.ex
│       ├── message_handlers
│       │   ├── candidate.ex
│       │   ├── follower.ex
│       │   ├── leader.ex
│       │   └── prevote.ex
│       ├── mock
│       │   └── statemachine.ex
│       ├── models
│       │   ├── replica.ex
│       │   └── replica_state.ex
│       ├── pb
│       │   ├── ex_raft.pb.ex
│       │   ├── ex_raft.proto
│       │   └── gen.sh
│       ├── remote
│       │   ├── client.ex
│       │   └── erlang.ex
│       ├── remote.ex
│       ├── replica.ex
│       ├── serialize.ex
│       ├── server.ex
│       ├── statemachine.ex
│       ├── typespecs.ex
│       └── utils
│           ├── buffer.ex
│           └── uvaint.ex
├── mix.exs
├── mix.lock
├── test
│   ├── ex_raft_test.exs
│   ├── log_store.exs
│   └── test_helper.exs
└── tmp
