How to Build a Raft‑Based Distributed Scheduler on Mesos with Go

This article explains the fundamentals of consensus in distributed systems, compares Paxos and Raft, and provides a step‑by‑step guide with code snippets on embedding the etcd/raft library into the open‑source Mesos scheduler Swan to achieve reliable multi‑node data synchronization.

dbaplus Community
dbaplus Community
dbaplus Community
How to Build a Raft‑Based Distributed Scheduler on Mesos with Go

Background

The open‑source project swan runs on Apache Mesos. Its built‑in distributed storage did not meet all requirements, so the team embedded a Raft consensus module to guarantee data synchronization across multiple nodes. The implementation uses the Go Raft library from etcd/raft (CoreOS).

Raft Consensus Overview

Raft provides a simple, understandable consensus protocol. Each server assumes one of three roles:

Leader – handles client requests and replicates log entries to followers.

Follower – passive, receives log entries from the leader.

Candidate – requests votes to become leader during an election.

An election succeeds when a candidate obtains a majority (⌊N/2⌋+1) of votes. For fault tolerance, a Raft cluster is typically configured with an odd number of nodes (e.g., 3 or 5) so that the system can tolerate up to ⌊N/2⌋ node failures.

Implementation in swan

Node Startup

When a RaftNode starts it knows the intended cluster size and writes the IDs of the other nodes into its configuration. To enable auto‑join, the Peers parameter may be left empty. If the service has run before, the node reads the last state from the Write‑Ahead Log (WAL) and resumes from that point.

// Pseudo‑code for starting a Raft node
node := NewRaftNode(id, peers, storage)
if storage.HasSnapshot() {
    node.Restore(storage.LoadSnapshot())
}
node.Start()

Transport Layer

Node‑to‑node communication uses the httptransport implementation provided by etcd. A gRPC‑based transport can be substituted (see swarmkit). The transport is created, peer addresses are added, and a TCP listener is started to receive Raft messages.

t := transport.NewHTTPTransport()
for _, peer := range peerAddrs {
    t.AddPeer(peer.ID, peer.URL)
}
listener, _ := net.Listen("tcp", localAddr)
go t.Serve(listener)
Raft node startup diagram
Raft node startup diagram

Data Submission and Event Handling

Clients submit data via the Propose method, which forwards the request to the current leader. The leader validates the request, replicates the entry to followers, and notifies the service through the Ready() channel. Because Propose is asynchronous, services that require strong consistency may convert the flow to a synchronous one (e.g., by waiting on the Ready notification or using a helper such as swarmkit’s ProposeValue).

The Ready processing typically performs three tasks:

Detect leadership changes (e.g., a follower becoming leader) and emit a leadershipChange event for external components.

Persist CommittedEntries to stable storage.

Advance the Raft state machine by calling node.Advance().

External services must listen for the leadership change event; write operations must be directed to the leader, either by proxying client requests or by invoking leader‑only gRPC APIs.

WAL, Snapshot and Log Compression

Each entry received on Ready() is first saved to the WAL via wal.Save. When the number of persisted entries exceeds a configured threshold, a snapshot is created with saveSnapshot. After snapshot creation the WAL releases the entries that are now represented in the snapshot using wal.ReleaseLockTo. Snapshots serve two purposes:

Compress the log to prevent unbounded growth and reduce replay time.

Provide a point‑in‑time state that can be transferred to new or recovering nodes.

Snapshot generation diagram
Snapshot generation diagram

Operational Notes

Running a three‑node Raft cluster in swan involves starting three identical RaftNode instances with the same configuration. After the initial node is up, the same startup procedure is repeated for the remaining nodes. The cluster tolerates the loss of a single node; a new election is automatically triggered if the leader fails.

Adding or removing nodes requires invoking ProposeConfChange with the appropriate configuration change entry. Clients must be aware of leader changes and either retry on the new leader or use a proxy layer that forwards write requests.

References

Raft animation: http://thesecretlivesofdata.com/raft/

Original Raft paper: https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf

etcd/raft source code: https://github.com/coreos/etcd/tree/master/raft

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Backend DevelopmentGoMesosRaftConsensus
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.