
Consistency, CAP Theorem, and Distributed Consensus Protocols (2PC, 3PC, Paxos, Raft, Zookeeper)

The article explains how the CAP theorem forces trade‑offs between consistency, availability and partition tolerance, then surveys distributed commit protocols (2PC, 3PC) and consensus algorithms (Paxos, Raft, Zookeeper’s ZAB), and shows their practical use in systems such as ZanKV that combine Raft with RocksDB for strongly consistent, fault‑tolerant key‑value storage.

Youzan Coder

1. Consistency and CAP Theory

In a distributed environment, Consistency (C) means that all replicas hold the same value at the same time. Availability (A) requires the system to remain serviceable even when some nodes fail. Partition tolerance (P) means the system continues to operate despite network partitions; when a partition occurs, a choice must be made between C and A.

The impossibility of satisfying C, A, and P simultaneously is illustrated with a five‑node example (n1‑n5) split across two data centers. If the network between the centers is cut, guaranteeing consistency forces writes in one center to fail, losing availability. Guaranteeing availability forces writes to succeed in both centers, leading to divergent data and loss of consistency.

Common practical scenarios:

CA without P – rarely viable in practice: network partitions are inevitable (natural disasters, cable cuts), so partition tolerance cannot simply be given up.

CP without A – strong consistency requires waiting for all replicas, which may block availability (e.g., two‑phase commit).

AP without C – high availability with asynchronous replication can cause stale reads (e.g., Redis master‑slave).
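The AP trade-off above can be made concrete with a toy model. In this sketch (hypothetical Primary/Replica classes, not any real system's API), asynchronous replication keeps writes available during a partition, at the cost of the replica serving a stale value:

```python
# Toy model of "AP without C": writes stay available on the primary,
# but a partitioned replica answers reads with stale data.

class Primary:
    def __init__(self):
        self.data = {}
        self.replica = Replica()
        self.partitioned = False  # simulated network partition

    def write(self, key, value):
        self.data[key] = value              # always succeeds (availability)
        if not self.partitioned:
            self.replica.apply(key, value)  # best-effort async replication

class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)  # may be stale (no consistency guarantee)

p = Primary()
p.write("x", 1)
p.partitioned = True          # partition occurs
p.write("x", 2)               # write still succeeds on the primary
stale = p.replica.read("x")   # replica still sees the old value
```

A CP system would instead reject the second write (or the stale read) until the partition heals.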

2. Consistency Protocols

2.1 Two‑Phase Commit (2PC)

2PC is a coordination algorithm used to keep distributed transactions atomic. It involves a coordinator and participants. The protocol consists of:

Prepare phase : the coordinator sends a Prepare message; each participant writes redo/undo logs and replies with agreement or abort.

Commit phase : if all participants agree, the coordinator sends Commit ; otherwise it sends Abort .
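The two phases reduce to a simple decision rule: commit only if every participant votes yes. A minimal sketch (hypothetical Participant class, ignoring logging and network failures):

```python
# Two-phase commit decision rule: unanimous "yes" votes in the prepare
# phase are required before the coordinator broadcasts Commit.

class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "INIT"

    def prepare(self):
        # Phase 1: write redo/undo logs (elided), then vote.
        self.state = "PREPARED" if self.can_commit else "ABORTED"
        return self.can_commit

    def finish(self, decision):
        # Phase 2: apply the coordinator's global decision.
        self.state = decision

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # collect all votes
    decision = "COMMIT" if all(votes) else "ABORT"
    for p in participants:
        p.finish(decision)
    return decision

outcome = two_phase_commit([Participant(True), Participant(False)])
```

One "no" vote (or one unreachable participant, in a real deployment) forces a global abort, which is why 2PC is safe but blocking.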

Failure handling:

Coordinator failure – a standby coordinator takes over and queries participants.

Participant failure – the coordinator waits for the participant to restart.

Both coordinator and participant failure – 2PC cannot resolve this; a three‑phase commit is needed.

Drawbacks:

Synchronous blocking: all participants hold locks until the transaction commits. Example SQL statement: update table set status=1 where current_day=20181103 locks all rows matching current_day=20181103 until the global decision arrives.

Single‑point‑of‑failure : if the coordinator crashes during commit, all participants remain blocked.

Coordinator and participant simultaneous crash : the system cannot determine the transaction outcome.

2.2 Three‑Phase Commit (3PC)

3PC splits 2PC's prepare phase into two steps (can-commit and pre-commit) and introduces timeouts on both coordinator and participants, so nodes can fall back to a default decision instead of blocking forever when the coordinator and a participant crash together.

Can‑commit (prepare) phase – participants acknowledge readiness without locking resources.

Pre‑commit phase – participants write redo/undo logs after receiving the pre‑commit message.

Commit phase – final Commit is sent; participants apply the transaction.

The main drawback is that 3PC still cannot handle network partitions: partitioned participants may time out and commit by default while others abort, leaving divergent state.
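The timeout behavior that distinguishes 3PC from 2PC can be stated as a small rule (phase names follow the list above; this is a sketch of the standard default actions, not a full protocol):

```python
# 3PC timeout rule: a participant that loses contact with the
# coordinator acts on a default based on its current phase.

def on_timeout(phase):
    if phase == "CAN_COMMIT":
        return "ABORT"    # nothing logged yet, aborting is always safe
    if phase == "PRE_COMMIT":
        return "COMMIT"   # everyone reached pre-commit, so commit by default
    return "NO_OP"        # decision already made
```

The commit-by-default rule in PRE_COMMIT is exactly what breaks down under a partition: one side may time out and commit while the other aborts.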

2.3 Paxos Protocol

Paxos, devised by Lamport (circulated in 1990, published in 1998), achieves consensus on a single value among distributed nodes as long as a majority remain reachable. Each node can act as both proposer and acceptor.

It inspired Zookeeper’s ZAB protocol.
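The heart of single-decree Paxos is the acceptor's promise rule. A minimal sketch (illustrative only, omitting proposers, learners, and quorum counting):

```python
# Single-decree Paxos acceptor: never accept a proposal numbered lower
# than one already promised, and report any previously accepted value
# so a new proposer must re-propose it.

class Acceptor:
    def __init__(self):
        self.promised = -1     # highest proposal number promised so far
        self.accepted = None   # (number, value) of last accepted proposal

    def prepare(self, n):
        # Phase 1b: promise n if it is the highest number seen.
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        # Phase 2b: accept unless a higher-numbered promise exists.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

a = Acceptor()
ok1, _ = a.prepare(1)
ok2 = a.accept(1, "x=1")
ok3, prev = a.prepare(2)     # promise 2; reveal the accepted (1, "x=1")
ok4 = a.accept(1, "x=9")     # rejected: superseded by the promise for 2
```

The returned `prev` is what forces a competing proposer to adopt the already-accepted value, which is how Paxos converges on one value.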

2.4 Raft Protocol

Raft provides an easier‑to‑understand alternative to Paxos. Key concepts:

Roles : Leader, Follower, Candidate.

Term : monotonically increasing election term identifier.

RequestVote and AppendEntries RPCs.

Election timeout triggers a new election when a follower loses contact with the leader.

Properties:

Leaders only append logs; they never modify existing entries.

Consistency is guaranteed by logical term and log indices rather than physical timestamps.

Leader election occurs when a follower’s election timeout expires or when a node starts up. The election proceeds until a candidate receives votes from a majority.

Log replication works as follows: the leader receives a client request (e.g., x=1 ), writes it to its local log, replicates the entry to followers, and once a majority acknowledges, the entry is committed and applied to the state machine.
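The "majority acknowledges" step above is just an order statistic over replication progress. A sketch of the commit rule, assuming a simplified model where we track the highest replicated index on each server (including the leader itself):

```python
# Raft commit rule sketch: an entry is committed once it is stored on
# a majority of servers; the commit index is the largest log index
# replicated on at least a majority.

def commit_index(match_index, cluster_size):
    # match_index: highest replicated index per server (leader included).
    sorted_idx = sorted(match_index, reverse=True)
    majority = cluster_size // 2 + 1
    return sorted_idx[majority - 1]   # largest index a majority has stored

# 5-node cluster: leader and one follower at index 5, others lagging.
idx = commit_index([5, 5, 4, 2, 1], 5)   # 3 of 5 servers hold index >= 4
```

Entries up to the returned index can be applied to the state machine; anything beyond it is still uncommitted.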

Two fundamental invariants ensure consistency:

If two logs share the same term and index, their entries are identical.

If two logs contain an entry with the same term and index, then the logs are identical in all entries up through that index (the Log Matching property).
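These invariants can be checked mechanically. A sketch assuming a log entry is a (term, command) tuple indexed from 0:

```python
# Checker for Raft's Log Matching property: same (term, index) at some
# position implies identical logs up through that position.

def logs_match_up_to(log_a, log_b, index):
    if index >= len(log_a) or index >= len(log_b):
        return False                          # entry missing in one log
    if log_a[index][0] != log_b[index][0]:
        return False                          # terms differ at this index
    return log_a[:index + 1] == log_b[:index + 1]

log_a = [(1, "a"), (1, "b"), (2, "c")]
log_b = [(1, "a"), (1, "b"), (2, "c")]
log_c = [(1, "a"), (1, "x"), (2, "c")]   # diverges at index 1
```

In a correct Raft implementation `logs_match_up_to` can never return False when the terms at `index` agree; a log like `log_c` would indicate a protocol violation.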

3. Zookeeper Principles

Zookeeper implements the ZAB (Zookeeper Atomic Broadcast) protocol, a variant of Paxos where a single leader (proposer) coordinates consensus.

Core concepts:

Data node (znode): the smallest unit in Zookeeper’s hierarchical namespace.

Transaction and zxid: each state‑changing operation receives a globally unique 64‑bit transaction ID (zxid) composed of a high‑order election epoch and a low‑order per‑epoch counter.

Transaction log and snapshots: logs record all state‑changing operations; snapshots capture the full in‑memory state at intervals.

Roles: leader, follower, observer.

Node states: LOOKING (election), FOLLOWING, LEADING.
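The zxid layout described above (high-order epoch, low-order counter in a 64-bit integer) can be expressed directly with bit operations:

```python
# zxid packing: high 32 bits = election epoch, low 32 bits = per-epoch
# counter. Integer comparison on zxids then orders transactions by
# epoch first, counter second.

def make_zxid(epoch, counter):
    return (epoch << 32) | (counter & 0xFFFFFFFF)

def zxid_epoch(zxid):
    return zxid >> 32

def zxid_counter(zxid):
    return zxid & 0xFFFFFFFF

z = make_zxid(epoch=3, counter=17)
```

Because the epoch occupies the high bits, any transaction from a later epoch compares greater than every transaction from an earlier one, which is what leader election and sync rely on.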

Common misconceptions:

Writes are immediately readable – in fact they become visible only after a majority of nodes have persisted the transaction; reads may be served by any node and can therefore return stale data unless the client issues a sync first.

Cluster size must be odd – Zookeeper only requires a majority; a 3‑node and a 4‑node cluster both tolerate exactly one failure, so odd sizes simply avoid paying for a node that adds no fault tolerance.

Election process:

Nodes start in LOOKING state and exchange votes.

The node with the highest zxid (or highest server ID if zxids are equal) wins.

Once a majority agrees, the winner becomes leader.
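The voting rule above maps naturally onto tuple comparison. A sketch (real Zookeeper votes also carry an election epoch that is compared first; that is omitted here):

```python
# Zookeeper-style vote comparison: higher zxid wins, server id breaks
# ties. Representing a vote as (zxid, server_id) lets Python's tuple
# ordering implement the rule directly.

def better_vote(a, b):
    """Each vote is (zxid, server_id); return the winning vote."""
    return max(a, b)   # compares zxid first, then server id

w = better_vote((0x200000005, 1), (0x200000003, 9))   # higher zxid wins
t = better_vote((0x200000005, 1), (0x200000005, 3))   # tie on zxid
```

Every LOOKING node repeatedly adopts the best vote it has seen and rebroadcasts it; once one vote is held by a majority, its server becomes leader.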

Synchronization after leader election uses three strategies:

Direct differential sync : when a learner’s last zxid lies between the leader’s min and max committed zxids, only missing entries are sent.

Rollback‑then‑diff sync : if the learner is ahead of the leader, it rolls back to the last common zxid before receiving missing entries.

Full sync : when the learner’s last zxid is older than the leader’s min committed zxid, a snapshot (SNAP) is transferred.
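The choice among the three strategies is a comparison of the learner's last zxid against the leader's committed range. A sketch (the names DIFF/TRUNC/SNAP mirror Zookeeper's internal sync message types; the selection logic here is simplified):

```python
# Sync-strategy selection after leader election, based on where the
# learner's last zxid falls relative to the leader's committed log.

def choose_sync(learner_zxid, leader_min_zxid, leader_max_zxid):
    if learner_zxid > leader_max_zxid:
        return "TRUNC"   # learner is ahead: roll back, then diff
    if learner_zxid >= leader_min_zxid:
        return "DIFF"    # within the committed range: send only the gap
    return "SNAP"        # too far behind: transfer a full snapshot
```

In practice the boundary between DIFF and SNAP matters for recovery time: a long outage pushes a learner below the leader's retained log and forces a full snapshot transfer.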

4. Raft + RocksDB in ZanKV (Youzan Distributed KV Store)

ZanKV (open‑sourced as ZanRedisDB) combines Raft consensus with RocksDB’s LSM‑tree storage to provide a CP‑style key‑value service compatible with the Redis protocol. Writes are sent as Redis commands (e.g., set x 1), Raft replicates the operation to the shard’s replicas and commits once a majority acknowledge, and RocksDB persists the data. The design ensures strong consistency even under network partitions or node failures, and supports easy scaling via a mapping table that assigns shards to nodes.
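The shard-to-node mapping mentioned above might look like the following sketch (hypothetical; ZanRedisDB's actual hashing and routing differ in detail):

```python
# Shard routing sketch: a stable hash assigns each key to a shard, and
# a mapping table assigns each shard to a node. Scaling or failover
# only rewrites the table, never the hash.

import hashlib

def shard_for_key(key, num_shards):
    # Stable hash so every client routes the same key to the same shard.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

def node_for_key(key, shard_table):
    # shard_table maps shard id -> node address.
    return shard_table[shard_for_key(key, len(shard_table))]

table = {0: "node-a", 1: "node-b", 2: "node-c", 3: "node-d"}
target = node_for_key("x", table)
```

Indirection through the table is what makes rebalancing cheap: moving a shard means copying its Raft group to a new node and updating one table entry.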

5. Summary

The article covered three main aspects: the CAP theorem and an intuitive argument for it, consistency protocols (with emphasis on Raft), and Zookeeper’s underlying mechanisms. It highlighted the importance of write‑ahead logging (WAL) to prevent data loss and explained how mature consensus algorithms (Paxos, ZAB, Raft) address consistency, availability, and fault tolerance in distributed storage systems.
