Backend Development 14 min read

How to Build a Raft‑Based Distributed Scheduler on Mesos with Go

This article explains the fundamentals of consensus in distributed systems, compares Paxos and Raft, and provides a step‑by‑step guide with code snippets on embedding the etcd/raft library into the open‑source Mesos scheduler Swan to achieve reliable multi‑node data synchronization.

dbaplus Community

Jan 4, 2017

How to Build a Raft‑Based Distributed Scheduler on Mesos with Go

Background

The open‑source project swan runs on Apache Mesos. Its built‑in distributed storage did not meet all requirements, so the team embedded a Raft consensus module to guarantee data synchronization across multiple nodes. The implementation uses the Go Raft library from etcd/raft (CoreOS).

Raft Consensus Overview

Raft provides a simple, understandable consensus protocol. Each server assumes one of three roles:

Leader – handles client requests and replicates log entries to followers.

Follower – passive, receives log entries from the leader.

Candidate – requests votes to become leader during an election.

An election succeeds when a candidate obtains a majority (⌊N/2⌋+1) of votes. For fault tolerance, a Raft cluster is typically configured with an odd number of nodes (e.g., 3 or 5) so that the system can tolerate up to ⌊N/2⌋ node failures.

Implementation in swan

Node Startup

When a RaftNode starts it knows the intended cluster size and writes the IDs of the other nodes into its configuration. To enable auto‑join, the Peers parameter may be left empty. If the service has run before, the node reads the last state from the Write‑Ahead Log (WAL) and resumes from that point.

// Pseudo‑code for starting a Raft node
node := NewRaftNode(id, peers, storage)
if storage.HasSnapshot() {
    node.Restore(storage.LoadSnapshot())
}
node.Start()

Transport Layer

Node‑to‑node communication uses the httptransport implementation provided by etcd. A gRPC‑based transport can be substituted (see swarmkit). The transport is created, peer addresses are added, and a TCP listener is started to receive Raft messages.

t := transport.NewHTTPTransport()
for _, peer := range peerAddrs {
    t.AddPeer(peer.ID, peer.URL)
}
listener, _ := net.Listen("tcp", localAddr)
go t.Serve(listener)

Data Submission and Event Handling

Clients submit data via the Propose method, which forwards the request to the current leader. The leader validates the request, replicates the entry to followers, and notifies the service through the Ready() channel. Because Propose is asynchronous, services that require strong consistency may convert the flow to a synchronous one (e.g., by waiting on the Ready notification or using a helper such as swarmkit’s ProposeValue).

The Ready processing typically performs three tasks:

Detect leadership changes (e.g., a follower becoming leader) and emit a leadershipChange event for external components.

Persist CommittedEntries to stable storage.

Advance the Raft state machine by calling node.Advance().

External services must listen for the leadership change event; write operations must be directed to the leader, either by proxying client requests or by invoking leader‑only gRPC APIs.

WAL, Snapshot and Log Compression

Each entry received on Ready() is first saved to the WAL via wal.Save. When the number of persisted entries exceeds a configured threshold, a snapshot is created with saveSnapshot. After snapshot creation the WAL releases the entries that are now represented in the snapshot using wal.ReleaseLockTo. Snapshots serve two purposes:

Compress the log to prevent unbounded growth and reduce replay time.

Provide a point‑in‑time state that can be transferred to new or recovering nodes.

Operational Notes

Running a three‑node Raft cluster in swan involves starting three identical RaftNode instances with the same configuration. After the initial node is up, the same startup procedure is repeated for the remaining nodes. The cluster tolerates the loss of a single node; a new election is automatically triggered if the leader fails.

Adding or removing nodes requires invoking ProposeConfChange with the appropriate configuration change entry. Clients must be aware of leader changes and either retry on the new leader or use a proxy layer that forwards write requests.

References

Raft animation: http://thesecretlivesofdata.com/raft/

Original Raft paper: https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf

etcd/raft source code: https://github.com/coreos/etcd/tree/master/raft

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Backend Development Go Mesos Raft Consensus

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.