Understanding Kafka’s Transition from ZooKeeper to KRaft: Controller Quorum and Raft‑Based Consensus
This article explains how Kafka is moving away from ZooKeeper by introducing the KRaft protocol, describing the Controller Quorum mechanism, Raft‑style state machine, message types, leader election, metadata log replication, and safety guarantees in a distributed system.
Preface
Headlines announcing that "Kafka will deprecate ZooKeeper" have circulated widely, and many readers regard this as the most significant change to this mature messaging system in recent years. Because ZooKeeper handles controller election, broker registration, topic‑partition registration and leader election, consumer/producer metadata management, and load balancing, removing the dependency on it is not a trivial task.
This article focuses on a fundamental and important aspect: the Controller Quorum mechanism based on the Raft consensus protocol. If you are not familiar with Raft, please read the prerequisite article first.
From Single‑Node Controller to Controller Quorum
Currently the controller is a single broker elected via ZooKeeper, responsible for maintaining the state of all brokers, partitions, and replicas. Without ZooKeeper, the controller must store metadata itself, and a single point of failure would be catastrophic. Future versions will replace the single controller with a quorum of n nodes, where n is odd and at least 3. Such a quorum tolerates up to (n ‑ 1)/2 failures, because a majority of nodes must remain available. Only one node is the active controller at a time, elected via an internal Raft‑style protocol (KRaft).
The state machine for quorum nodes is described next.
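The quorum arithmetic can be checked with a minimal sketch. The function names are hypothetical and only express the majority rule; they are not part of Kafka.

```python
def majority(n_voters: int) -> int:
    """Smallest number of voters that forms a majority."""
    if n_voters < 1:
        raise ValueError("a quorum needs at least one voter")
    return n_voters // 2 + 1


def tolerated_failures(n_voters: int) -> int:
    """Number of voters that may fail while a majority is still reachable."""
    return n_voters - majority(n_voters)
```

Under this rule, tolerated_failures(3) and tolerated_failures(4) are both 1, which is one way to see why an even-sized quorum adds a node without adding fault tolerance.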
Quorum Node State Machine
Under KRaft, a quorum node can be in one of four states:
Candidate – actively initiates an election;
Leader – obtains a majority of votes during election;
Follower – has voted for a candidate or is replicating the leader’s log;
Observer – a follower without voting rights (not considered in this article).
The state transition diagram is similar to classic Raft.
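As a rough illustration, the four states and their transitions can be written as a lookup table. The event names below are invented for this sketch, and the transition set is modelled on classic Raft rather than taken from Kafka's code.

```python
from enum import Enum


class State(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"
    OBSERVER = "observer"


# (current state, event) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (State.FOLLOWER, "fetch_timeout"): State.CANDIDATE,
    (State.FOLLOWER, "end_quorum_epoch"): State.CANDIDATE,
    (State.CANDIDATE, "election_timeout"): State.CANDIDATE,  # retry the election
    (State.CANDIDATE, "won_majority"): State.LEADER,
    (State.CANDIDATE, "begin_quorum_epoch"): State.FOLLOWER,
    (State.CANDIDATE, "higher_epoch_seen"): State.FOLLOWER,
    (State.LEADER, "higher_epoch_seen"): State.FOLLOWER,
}


def next_state(state: State, event: str) -> State:
    """Return the next state; unlisted (state, event) pairs leave it unchanged."""
    return TRANSITIONS.get((state, event), state)
```

An Observer has no entry in the table, so it never becomes a Candidate in this sketch, which matches its lack of voting rights.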
Message Definitions
Classic Raft defines two RPC messages (AppendEntries and RequestVote) and uses a push model. KRaft adopts a pull model and defines several RPC messages:
Vote – election vote information sent by a Candidate;
BeginQuorumEpoch – sent when a new leader is elected to inform other nodes;
EndQuorumEpoch – sent when the current leader steps down, triggering a new election (graceful shutdown);
Fetch – used by Followers/Observers to pull the leader’s log; it also serves as a liveness probe.
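The pull model allows consistency checks to be performed on the leader side and speeds up bootstrapping new followers. The trade‑off is that a stale ("zombie") leader may take longer to discover that it has been superseded, and replication latency includes the round trip of each Fetch request.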
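A leader-side consistency check on a Fetch request might look like the sketch below. It assumes the follower reports the offset it wants next and the epoch of its last entry. The field names, the "ok"/"diverging" status strings, and the list-of-epochs log model are all simplifications made for illustration, not Kafka's wire format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FetchRequest:
    fetch_offset: int        # next offset the follower wants to read
    last_fetched_epoch: int  # epoch of the follower's last entry, -1 if its log is empty


def handle_fetch(leader_epochs: List[int], req: FetchRequest) -> Tuple[str, List[Tuple[int, int]]]:
    """leader_epochs[i] is the epoch of the leader's entry at offset i.

    Returns ("ok", [(offset, epoch), ...]) when the follower's log is a prefix
    of the leader's log, or ("diverging", []) when the follower must truncate.
    """
    if req.fetch_offset > 0:
        prev = req.fetch_offset - 1
        if prev >= len(leader_epochs) or leader_epochs[prev] != req.last_fetched_epoch:
            return "diverging", []
    return "ok", list(enumerate(leader_epochs))[req.fetch_offset:]
```

Because the check runs where the authoritative log lives, a follower with a divergent tail is detected on its next Fetch without the leader tracking per-follower state beforehand.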
Leader Election
An election is triggered when any of the following occurs:
No Fetch response is received within quorum.fetch.timeout.ms after a Fetch request, indicating a suspected leader failure;
An EndQuorumEpoch request is received from the current leader, indicating it has stepped down;
A candidate does not receive a majority of votes within quorum.election.timeout.ms, causing the election to be aborted and restarted.
The subsequent voting process follows classic Raft, with additional handling for invalid votes (e.g., mismatched cluster ID or epoch).
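The vote-handling rules can be sketched as follows, under the assumption that a voter rejects requests with a mismatched cluster ID or a stale epoch, grants at most one vote per epoch, and requires the candidate's log to be at least as up to date as its own (the classic Raft rule). All class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoteRequest:
    cluster_id: str
    candidate_id: int
    candidate_epoch: int
    last_epoch: int   # epoch of the candidate's last log entry
    last_offset: int  # end offset of the candidate's log


@dataclass
class VoterState:
    cluster_id: str
    current_epoch: int
    voted_for: Optional[int]
    last_epoch: int
    last_offset: int


def handle_vote(voter: VoterState, req: VoteRequest) -> bool:
    """Return True if the vote is granted; updates the voter's state in place."""
    if req.cluster_id != voter.cluster_id:
        return False  # invalid vote: request from a different cluster
    if req.candidate_epoch < voter.current_epoch:
        return False  # invalid vote: stale epoch
    if req.candidate_epoch > voter.current_epoch:
        voter.current_epoch = req.candidate_epoch  # newer epoch resets the vote
        voter.voted_for = None
    if voter.voted_for not in (None, req.candidate_id):
        return False  # already voted for someone else in this epoch
    log_ok = (req.last_epoch, req.last_offset) >= (voter.last_epoch, voter.last_offset)
    if log_ok:
        voter.voted_for = req.candidate_id
    return log_ok
```

Comparing (epoch, offset) tuples orders logs first by the epoch of the last entry and then by length, which is how classic Raft decides whether a candidate's log is up to date.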
Metadata Log Replication
After removing ZooKeeper, Kafka treats metadata as a log stored in an internal topic that has a single partition, similar to how consumer offsets are stored.
The metadata record format is almost identical to regular messages but must include the leader’s epoch:
Record => Offset LeaderEpoch ControlType Key Value Timestamp
Followers replicate the leader’s log by pulling from this metadata topic, effectively acting as consumers.
Kafka uses the high‑watermark (HW) concept to determine which metadata entries have been replicated to a majority of the quorum’s voters (the leader included), ensuring durability.
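One way to derive the high‑watermark from replication progress is to take the largest offset that a majority of voters have reached. The sketch below assumes the leader knows each voter's log end offset, its own included. The function name is hypothetical.

```python
from typing import List


def high_watermark(voter_end_offsets: List[int]) -> int:
    """Largest end offset reached by a majority of voters (leader included)."""
    if not voter_end_offsets:
        raise ValueError("need at least one voter")
    ordered = sorted(voter_end_offsets, reverse=True)
    majority = len(ordered) // 2 + 1
    # The majority-th largest end offset is reached by exactly that many voters or more.
    return ordered[majority - 1]
```

With end offsets of 10, 8, and 5 across three voters, two of them have reached offset 8, so the function returns 8.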
State Machine Safety Guarantees
In terms of safety, KRaft’s election safety, leader append‑only property, log matching, and leader completeness are almost identical to classic Raft. The original article illustrates how safety is preserved with an example scenario involving multiple leader crashes and epoch changes.
KRaft adds a stronger constraint: a newly elected leader will not advance the high‑watermark until it has successfully committed logs belonging to its own epoch. Consequently, even if a previous leader’s logs were partially replicated, they are not considered committed, preserving safety.
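This constraint can be expressed as a guard on high‑watermark advancement. The sketch treats the high‑watermark as an exclusive end offset, meaning entries below it are committed, and assumes the leader remembers the offset of the first entry it appended in its own epoch. Both are modelling choices for illustration, and the names are hypothetical.

```python
def maybe_advance_high_watermark(current_hw: int,
                                 majority_end_offset: int,
                                 epoch_start_offset: int) -> int:
    """Return the leader's new high-watermark.

    majority_end_offset: end offset replicated on a majority of voters.
    epoch_start_offset: offset of the first entry appended in the leader's own epoch.
    """
    if majority_end_offset <= epoch_start_offset:
        # No entry from the leader's own epoch is on a majority yet: hold the HW.
        return current_hw
    return max(current_hw, majority_end_offset)
```

Once an entry from the leader's own epoch is committed, every earlier entry is committed along with it, which is why holding the high‑watermark back until that point is safe.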
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.