How to Build a Raft‑Based Distributed Scheduler on Mesos with Go
This article explains the fundamentals of consensus in distributed systems, compares Paxos and Raft, and provides a step‑by‑step guide with code snippets on embedding the etcd/raft library into the open‑source Mesos scheduler Swan to achieve reliable multi‑node data synchronization.
Background
The open‑source project swan runs on Apache Mesos. Its built‑in distributed storage did not meet all requirements, so the team embedded a Raft consensus module to guarantee data synchronization across multiple nodes. The implementation uses the Go Raft library from etcd/raft (CoreOS).
Raft Consensus Overview
Raft provides a simple, understandable consensus protocol. Each server assumes one of three roles:
Leader – handles client requests and replicates log entries to followers.
Follower – passive, receives log entries from the leader.
Candidate – requests votes to become leader during an election.
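An election succeeds when a candidate obtains a majority (⌊N/2⌋+1) of votes. A cluster of N nodes therefore tolerates up to ⌊(N−1)/2⌋ node failures. Raft clusters are typically configured with an odd number of nodes (e.g., 3 or 5), because adding one node to an odd-sized cluster raises the majority threshold without raising the number of failures the cluster can survive.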
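The majority arithmetic is easy to check in a few lines of Go. This is only an illustration of the formula; the function names quorum and tolerated are made up for this sketch and do not come from etcd/raft or swan.

```go
package main

import "fmt"

// quorum returns the number of votes needed to win an election
// in a cluster of n voting members: floor(n/2) + 1.
func quorum(n int) int {
	return n/2 + 1
}

// tolerated returns how many members may fail while the remaining
// members can still form a majority: n - quorum(n).
func tolerated(n int) int {
	return n - quorum(n)
}

func main() {
	for _, n := range []int{3, 4, 5} {
		fmt.Printf("nodes=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), tolerated(n))
	}
}
```

A four-node cluster needs three votes and still survives only one failure, which is why the fourth node buys nothing over three.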
Implementation in swan
Node Startup
When a RaftNode starts it knows the intended cluster size and writes the IDs of the other nodes into its configuration. To enable auto‑join, the Peers parameter may be left empty. If the service has run before, the node reads the last state from the Write‑Ahead Log (WAL) and resumes from that point.
// Pseudo-code for starting a Raft node
node := NewRaftNode(id, peers, storage)
if storage.HasSnapshot() {
	// Resume from previously persisted state instead of starting empty.
	node.Restore(storage.LoadSnapshot())
}
node.Start()
Transport Layer
Node‑to‑node communication uses the httptransport implementation provided by etcd. A gRPC‑based transport can be substituted (see swarmkit). The transport is created, peer addresses are added, and a TCP listener is started to receive Raft messages.
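t := transport.NewHTTPTransport()
for _, peer := range peerAddrs {
	t.AddPeer(peer.ID, peer.URL)
}
// Fail fast if the address cannot be bound instead of ignoring the error.
listener, err := net.Listen("tcp", localAddr)
if err != nil {
	log.Fatalf("raft transport: listen on %s: %v", localAddr, err)
}
go t.Serve(listener)
Data Submission and Event Handling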
Clients submit data via the Propose method, which forwards the request to the current leader. The leader validates the request, replicates the entry to followers, and notifies the service through the Ready() channel. Because Propose is asynchronous, services that require strong consistency may convert the flow to a synchronous one (e.g., by waiting on the Ready notification or using a helper such as swarmkit’s ProposeValue).
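One common way to turn the asynchronous flow into a synchronous one is a waiter registry: the caller registers a channel under a request ID before proposing, and the Ready loop closes that channel once the matching entry has been applied. The sketch below is a minimal, self-contained illustration of that pattern. The waiters type, proposeSync, and the propose callback are hypothetical names for this example, and the fake propose function in main stands in for a real Raft node; this is not swan's or swarmkit's actual code.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// waiters maps a request ID to a channel that is closed once the
// entry carrying that ID has been committed and applied.
type waiters struct {
	mu sync.Mutex
	m  map[uint64]chan struct{}
}

func newWaiters() *waiters {
	return &waiters{m: make(map[uint64]chan struct{})}
}

// register must be called before the proposal is handed to Raft,
// so that a fast commit cannot race past the waiter.
func (w *waiters) register(id uint64) <-chan struct{} {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch := make(chan struct{})
	w.m[id] = ch
	return ch
}

// trigger is called by the Ready loop after applying a committed entry.
// Unknown IDs are ignored (for example, entries proposed by other nodes).
func (w *waiters) trigger(id uint64) {
	w.mu.Lock()
	ch, ok := w.m[id]
	delete(w.m, id)
	w.mu.Unlock()
	if ok {
		close(ch)
	}
}

// cancel drops a waiter whose caller gave up.
func (w *waiters) cancel(id uint64) {
	w.mu.Lock()
	delete(w.m, id)
	w.mu.Unlock()
}

// proposeSync blocks until the entry is applied or ctx expires.
// propose stands in for the asynchronous Propose call.
func proposeSync(ctx context.Context, w *waiters, id uint64, propose func(uint64) error) error {
	done := w.register(id)
	if err := propose(id); err != nil {
		w.cancel(id)
		return err
	}
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		w.cancel(id)
		return ctx.Err()
	}
}

func main() {
	w := newWaiters()
	// Fake "Raft": report the entry as applied 10ms after it is proposed.
	propose := func(id uint64) error {
		go func() {
			time.Sleep(10 * time.Millisecond)
			w.trigger(id)
		}()
		return nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	if err := proposeSync(ctx, w, 1, propose); err != nil {
		fmt.Println("proposal failed:", err)
		return
	}
	fmt.Println("proposal applied")
}
```

The timeout matters: if leadership changes while a proposal is in flight, the entry may never commit, so the caller needs a way to give up and retry.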
The Ready processing typically performs three tasks:
Detect leadership changes (e.g., a follower becoming leader) and emit a leadershipChange event for external components.
Apply CommittedEntries to the service's state and persist the result to stable storage.
Advance the Raft state machine by calling node.Advance().
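The three steps can be sketched without the real library. The types below (entry, ready, node) are simplified stand-ins invented for this example; the actual raft.Ready struct carries more fields and reports leader information through its SoftState field. A slice plays the role of the WAL and a map plays the role of the service's store.

```go
package main

import "fmt"

// entry is a simplified log entry (illustrative only).
type entry struct {
	Index uint64
	Data  string
}

// ready is a simplified stand-in for one batch from the Ready() channel.
type ready struct {
	Lead             uint64  // leader ID as seen by this node (0 = unknown)
	Entries          []entry // new entries to persist first
	CommittedEntries []entry // entries that are safe to apply
}

// node holds just enough state to demonstrate the processing order.
type node struct {
	id       uint64
	isLeader bool
	wal      []entry           // stands in for the write-ahead log
	state    map[uint64]string // stands in for the service's store
	events   []string          // leadership change events emitted so far
}

// handleReady processes one batch: leadership check, persistence and
// application of entries, then the point where Advance would be called.
func (n *node) handleReady(rd ready) {
	// 1. Detect a leadership transition and emit an event.
	nowLeader := rd.Lead == n.id
	if nowLeader != n.isLeader {
		n.isLeader = nowLeader
		n.events = append(n.events,
			fmt.Sprintf("leadershipChange isLeader=%t", nowLeader))
	}
	// 2. Persist new entries, then apply the committed ones.
	n.wal = append(n.wal, rd.Entries...)
	for _, e := range rd.CommittedEntries {
		n.state[e.Index] = e.Data
	}
	// 3. With the real library, node.Advance() would be called here
	//    to signal that this batch has been fully processed.
}

func main() {
	n := &node{id: 1, state: map[uint64]string{}}
	n.handleReady(ready{Lead: 1, Entries: []entry{{Index: 1, Data: "a"}}})
	n.handleReady(ready{Lead: 1, CommittedEntries: []entry{{Index: 1, Data: "a"}}})
	fmt.Println(n.events, len(n.wal), n.state)
}
```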
External services must listen for the leadership change event; write operations must be directed to the leader, either by proxying client requests or by invoking leader‑only gRPC APIs.
WAL, Snapshot and Log Compaction
Each entry received on Ready() is first saved to the WAL via wal.Save. When the number of persisted entries exceeds a configured threshold, a snapshot is created with saveSnapshot. After the snapshot is created, wal.ReleaseLockTo releases the WAL's locks on segment files whose entries are now covered by the snapshot, so those files can be purged. Snapshots serve two purposes:
Compact the log to prevent unbounded growth and reduce replay time.
Provide a point‑in‑time state that can be transferred to new or recovering nodes.
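The threshold-triggered snapshot can be illustrated with an in-memory model. Everything here (store, snapshot, the threshold field) is a hypothetical simplification for this sketch: a real node would write the snapshot to disk and release old WAL segments rather than truncate a slice.

```go
package main

import "fmt"

// entry is a simplified key/value log entry (illustrative only).
type entry struct {
	Index uint64
	Key   string
	Value string
}

// snapshot captures the full state as of a given log index.
type snapshot struct {
	Index uint64
	State map[string]string
}

// store keeps a log, the applied state, and the latest snapshot.
type store struct {
	threshold int // compact once the log holds more entries than this
	log       []entry
	state     map[string]string
	snap      snapshot
}

func newStore(threshold int) *store {
	return &store{threshold: threshold, state: make(map[string]string)}
}

// apply appends and applies an entry, then compacts if the log has
// grown past the threshold.
func (s *store) apply(e entry) {
	s.log = append(s.log, e)
	s.state[e.Key] = e.Value
	if len(s.log) > s.threshold {
		s.compact()
	}
}

// compact copies the current state into a snapshot and drops the log
// entries that the snapshot now represents.
func (s *store) compact() {
	cp := make(map[string]string, len(s.state))
	for k, v := range s.state {
		cp[k] = v
	}
	s.snap = snapshot{Index: s.log[len(s.log)-1].Index, State: cp}
	s.log = nil
}

func main() {
	s := newStore(3)
	for i := uint64(1); i <= 5; i++ {
		s.apply(entry{Index: i, Key: fmt.Sprintf("k%d", i), Value: "v"})
	}
	fmt.Printf("snapshot index=%d, entries left in log=%d\n",
		s.snap.Index, len(s.log))
}
```

On restart, a node built this way would load the snapshot and replay only the entries left in the log, which is the replay-time saving described above.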
Operational Notes
Running a three‑node Raft cluster in swan involves starting three RaftNode instances that share the same cluster configuration, each with its own node ID. After the initial node is up, the same startup procedure is repeated for the remaining nodes. With three nodes the cluster tolerates the loss of a single node; if the leader fails, the remaining nodes automatically hold a new election.
Adding or removing nodes requires invoking ProposeConfChange with the appropriate configuration change entry. Clients must be aware of leader changes and either retry on the new leader or use a proxy layer that forwards write requests.
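Client-side retry on a leader change might look like the sketch below. It assumes that a non-leader node rejects writes with an error naming the current leader; the notLeaderError type, the writeFunc signature, and the node addresses are all invented for this example and are not part of swan or etcd/raft.

```go
package main

import (
	"errors"
	"fmt"
)

// notLeaderError is a hypothetical error returned by a follower,
// carrying the address of the node it believes is the leader.
type notLeaderError struct {
	Leader string
}

func (e *notLeaderError) Error() string {
	return "not leader; current leader is " + e.Leader
}

// writeFunc sends one write request to the node at addr.
type writeFunc func(addr string, data []byte) error

// writeWithRedirect sends data to addr and follows leader hints for
// at most maxHops redirects. It returns the address that was tried last.
func writeWithRedirect(send writeFunc, addr string, data []byte, maxHops int) (string, error) {
	for hop := 0; hop <= maxHops; hop++ {
		err := send(addr, data)
		var nl *notLeaderError
		if errors.As(err, &nl) && nl.Leader != "" {
			addr = nl.Leader // retry against the hinted leader
			continue
		}
		return addr, err // success, or an error that a redirect cannot fix
	}
	return addr, errors.New("too many leader redirects")
}

func main() {
	// Fake cluster in which node-2 is the leader.
	send := func(addr string, data []byte) error {
		if addr != "node-2" {
			return &notLeaderError{Leader: "node-2"}
		}
		return nil
	}
	addr, err := writeWithRedirect(send, "node-1", []byte("job"), 3)
	fmt.Println(addr, err)
}
```

The hop limit guards against two nodes pointing at each other during an election; a proxy layer inside the cluster could apply the same logic on the client's behalf.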
References
Raft animation: http://thesecretlivesofdata.com/raft/
Original Raft paper: https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf
etcd/raft source code: https://github.com/coreos/etcd/tree/master/raft
dbaplus Community