How Kafka’s Broker Controller Keeps Your Data Flowing – Inside the Replication Engine

This article dives deep into Kafka’s internal mechanics, explaining how brokers replicate data, how the controller coordinates the cluster via ZooKeeper, the roles of leader and follower replicas, ISR management, request handling, fail‑over strategies, and consumer group rebalancing, all illustrated with diagrams.

Programmer DD
Programmer DD
Programmer DD
How Kafka’s Broker Controller Keeps Your Data Flowing – Inside the Replication Engine

Cluster Member Relationships

Kafka runs on top of ZooKeeper, which itself is a cluster, so Kafka can also form a cluster. Each broker has a unique broker.id that can be set manually or generated automatically.

Kafka can generate a new broker.id using broker.id.generation.enable and reserved.broker.max.id . The default is true, and the generated IDs start from 1001.

When a broker starts, it registers a temporary node under /brokers/ids in ZooKeeper. This node is used for health checks. If a broker with the same ID tries to start, registration fails because the node already exists.

If you start another broker with the same ID, you get an error because ZooKeeper already has that ID.

If a broker disconnects due to a pause or GC, its temporary node is removed and other components are notified.

When a broker shuts down, its node disappears, but the ID remains in other structures such as the topic’s replica list. After a full shutdown, a new broker with the same ID can join and inherit the old partitions.

Broker Controller Role

The controller is a core component of Kafka. Only one broker becomes the controller after the cluster starts, using ZooKeeper to manage and coordinate the whole cluster. ZooKeeper is a hierarchical node store; each node is called a znode. A znode can be persistent or temporary, and it supports a watcher mechanism that notifies clients of changes.

Controller Election

Kafka elects a controller by having the first broker create a temporary node /controller in ZooKeeper. Subsequent brokers attempt to create the same node, receive a “node already exists” exception, and then set a watch on /controller. When the controller node changes, other brokers are notified. This design introduces a single‑point‑of‑failure problem.

Controller Functions

Topic Management

: The controller handles creation, deletion, and partition addition for topics. Partition Reassignment: The controller executes the kafka-reassign-partitions script to redistribute partitions. Preferred Leader Election: The controller can trigger a preferred leader election to balance load. Cluster Member Management: It manages broker additions, removals, and failures. Data Service: The controller stores the full cluster metadata and periodically sends updates to all brokers.

When a broker leaves, the controller determines which partitions need a new leader, selects one, and notifies the relevant brokers. The new leader then starts handling producer and consumer requests, while followers replicate from it.

Broker Controller Data Storage

The controller’s data can be categorized into three groups:

All broker information, including partitions, replicas, and the status of each broker.

Topic information, such as partition leaders and the ISR set.

Operational tasks, like partitions undergoing preferred leader election or reassignment.

This metadata is also stored in ZooKeeper; during initialization the controller reads it from ZooKeeper into its cache.

Fail‑Over

Because only one broker can be the controller, Kafka provides a fail‑over mechanism. If the current controller crashes, ZooKeeper detects the missing temporary node, and the remaining brokers compete to become the new controller by creating /controller. The winner reads the metadata from ZooKeeper and initializes its cache.

Note: The cache lives in the broker; ZooKeeper stores the persistent metadata.

Problems with the Old Design

State changes were handled by multiple listeners concurrently, requiring complex synchronization and making debugging difficult.

State propagation could be out of sync, leading to potential data loss.

The controller created extra I/O threads for topic deletion, hurting performance.

Heavy use of ReentrantLock for thread safety slowed down processing.

New Design (Kafka 0.11+)

The controller switched from a multithreaded model to a single‑threaded event‑queue model.

An Event Executor Thread processes events from an event queue, handling both the queue and controller context.

All ZooKeeper interactions became asynchronous, improving efficiency by up to tenfold.

Requests are prioritized; for example, a StopReplica request can be given higher priority than regular produce requests.

Replication Mechanism

Kafka is described as a distributed, partitioned, replicated commit‑log service. Replication ensures high availability by keeping copies of data on multiple brokers.

Each topic is divided into partitions; each partition has one leader replica and one or more follower replicas.

Leader Replica

The leader is elected when a partition is created and handles all client requests for that partition.

Follower Replica

Followers do not serve client requests. They pull data asynchronously from the leader and write it to their own logs. If a follower falls behind the leader for more than replica.lag.time.max.ms (default 10 s), it is considered out‑of‑sync and removed from the ISR set.

If a follower is fully in sync, it becomes part of the ISR set and can be elected leader if the current leader fails.

ISR (In‑Sync Replicas)

Kafka maintains a dynamic set of in‑sync replicas (ISR). A follower stays in ISR as long as it replicates within the configured lag time. If it falls behind, it is removed; if it catches up later, it can re‑join ISR.

Unclean Leader Election

If the ISR set becomes empty, Kafka can optionally elect a leader from out‑of‑sync replicas by enabling unclean.leader.election.enable. This sacrifices consistency for availability and is generally discouraged.

Kafka Request Processing Flow

Brokers handle client, replica, and controller requests using a request/response model over TCP sockets.

Synchronous requests block the client until a response is ready, while asynchronous requests create separate threads, which can be costly.

Kafka uses a reactive (Reactor) model rather than pure sync or async.

The broker’s SocketServer accepts connections and hands them to a network thread pool ( Processor). The number of network threads is configurable via num.network.threads (default 3). Each processor uses a shared request queue and a response queue. An I/O thread pool (default 8 threads, configurable via num.io.threads) processes the requests.

Produce requests write messages to the log.

Fetch requests read messages from disk or page cache.

Requests contain a header with fields such as request type (API key), version, correlation ID, and client ID.

Request Types

Produce : Clients send messages to a leader; the leader writes to its log and may wait for followers depending on the acks setting.

Fetch : Consumers request messages; Kafka can use zero‑copy to send data directly from file to network.

Metadata : Clients request cluster metadata (brokers, partitions, leaders) to know where to send produce/fetch requests.

Kafka Rebalance Process

Consumer groups use a group coordinator (a broker) to manage membership and partition assignment. The coordinator tracks states via a state machine: Empty, Dead, PreparingRebalance, CompletingRebalance, Stable.

From the Consumer Perspective

Consumers first send a JoinGroup request to join a group. The first member becomes the group leader, collects subscriptions, and computes the assignment. Then the leader sends a SyncGroup request with the assignment. Once all members receive the assignment, the group reaches the Stable state and starts consuming.

From the Coordinator Perspective

New member joins the group.

Member leaves voluntarily via LeaveGroup.

Member crashes (no heartbeat within session.timeout.ms).

Member commits offsets during rebalance.

Each of these events triggers a rebalance, causing the group to transition through PreparingRebalance, CompletingRebalance, and back to Stable.

Rebalance Steps

All members join and submit their subscriptions.

The leader computes the partition assignment.

Leader sends the assignment via SyncGroup.

Members receive the assignment and start consuming.

If the leader or any broker fails during rebalance, the group may transition to Dead and later recover.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

BackendZooKeeperKafkaReplicationISRrebalanceBroker Controller
Programmer DD
Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.