
How Kafka’s High‑Level Consumer Works, Rebalance Challenges, and the Next‑Gen Design

This article explains Kafka’s High‑Level and Low‑Level consumer models, the semantics of Consumer Groups, the rebalance algorithm and its drawbacks, and outlines the planned redesign in Kafka 0.9.x that introduces a central Coordinator to solve herd and split‑brain issues.

dbaplus Community

High Level Consumer

Kafka’s High‑Level Consumer provides a high‑level abstraction that hides offset handling and offers semantics such as single‑consumer (unicast) or broadcast consumption, allowing client programs to read data without managing low‑level details.

Consumer Group

Offsets are stored in ZooKeeper (or, from version 0.8.2, in a dedicated Kafka topic) under the name of a Consumer Group, which is global to the whole cluster. Every High‑Level Consumer instance belongs to a group; if none is specified, a default group is used. In ZooKeeper, a group's offsets live under /consumers/[group]/offsets/[topic]/[partition].

Unlike traditional message queues that delete messages after consumption, Kafka keeps messages after they are consumed, until its retention policy removes them. Within a Consumer Group it guarantees that each message is consumed by only one consumer, while different groups can each consume the same message, which enables diversified processing.

Kafka’s design supports both real‑time (e.g., Storm) and batch (e.g., Hadoop) processing by assigning each processing pipeline to a separate Consumer Group.

Consumer Group Test

A test creates a topic topic1, one consumer in group1, and three consumers in group2. A producer sends three keyed messages (1, 2, 3). The group1 consumer receives all three messages, while each group2 consumer receives exactly one keyed message, demonstrating the group semantics.
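The outcome of this test can be reproduced with a small simulation. This is only a sketch of the group semantics, not Kafka client code: it assumes topic1 has three partitions, and both the modulo partitioner and the way partitions are handed to consumers are hypothetical simplifications.

```python
from collections import defaultdict


def deliver(keys, groups, num_partitions):
    """Simulate delivery of keyed messages to consumer groups.

    Every group sees every message, but inside a group each partition
    (and so each message) is handled by exactly one consumer.
    """
    received = defaultdict(list)
    for group, consumers in groups.items():
        for key in keys:
            partition = key % num_partitions               # hypothetical partitioner
            owner = consumers[partition % len(consumers)]  # one owner per partition
            received[(group, owner)].append(key)
    return dict(received)


groups = {"group1": ["c1"], "group2": ["c1", "c2", "c3"]}
result = deliver([1, 2, 3], groups, num_partitions=3)
```

Here the single consumer of group1 ends up with all three keys, while each of the three consumers in group2 ends up with exactly one.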

Consumer Rebalance

Rebalance redistributes partitions among consumers when the number of consumers changes. Kafka assigns partitions to consumers on a per‑partition basis, not per‑message, which reduces communication overhead but can lead to uneven load.

When the number of consumers is less than the number of partitions, some consumers handle multiple partitions; when greater, some consumers receive no partitions.

Examples with three partitions and up to four consumers illustrate how partitions are allocated and how consumers lose or gain ownership as they start or stop.

The rebalance algorithm is:

Sort all partitions of the target topic and store in PT.

Sort all consumers in the group and store in CG; the i‑th consumer is Ci.

Compute N = ceil(|PT| / |CG|).

Revoke Ci’s current partition ownership.

Assign partitions PT[i·N … (i+1)·N‑1] to Ci.
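The five steps above can be sketched in a few lines. This is a minimal illustration of the algorithm as described here, not the code Kafka ships; the function name assign and the consumer IDs are made up for the example, and the "revoke" step is modeled simply by building a fresh mapping on every call.

```python
import math


def assign(partitions, consumers):
    """Range-style assignment following the steps listed above."""
    if not consumers:
        return {}
    pt = sorted(partitions)           # step 1: sorted partitions, PT
    cg = sorted(consumers)            # step 2: sorted consumers, CG
    n = math.ceil(len(pt) / len(cg))  # step 3: N = ceil(|PT| / |CG|)
    # Steps 4 and 5: drop old ownership, then give Ci the slice
    # PT[i*N .. (i+1)*N - 1]; slices past the end are simply empty.
    return {c: pt[i * n:(i + 1) * n] for i, c in enumerate(cg)}


for count in range(1, 5):
    members = [f"c{i}" for i in range(count)]
    print(count, assign([0, 1, 2], members))
```

With three partitions, a fourth consumer receives an empty list. The formula as stated can also be uneven when consumers are fewer than partitions: four partitions over three consumers gives N = 2, so the third consumer gets nothing.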

In version 0.8.2.1 each consumer registers a ZooKeeper watch; any change in consumers or brokers triggers a rebalance.

On startup, each High‑Level Consumer registers its ID under /consumers/[group]/ids/[id] and sets watches on /consumers/[group]/ids, /brokers/ids, and optionally /brokers/topics.

It then forces a rebalance within its own group.

Drawbacks of this approach include the “herd effect” (all consumers rebalance on any change) and “split‑brain” (different consumers see inconsistent ZooKeeper views), leading to incorrect rebalance attempts.

Future Redesign (Kafka 0.9.x)

The community plans to introduce a central Coordinator to handle failure detection and rebalance, eliminating herd and split‑brain problems and reducing ZooKeeper load.

Design Directions

Simplify consumer client: Reduce dependencies to make non‑Java clients easier to implement.

Central Coordinator: A broker‑elected coordinator would manage rebalance logic for a group.

Manual offset management: Allow applications to store offsets in external databases, requiring APIs that expose message metadata.

Rebalance callbacks: Enable user‑defined callbacks to persist state during rebalance.

Non‑blocking consumer API: Support stream‑processing patterns such as filter, group‑by, and join.
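To make the manual offset management direction more concrete, here is one way an application might keep offsets in an external database once the consumer API exposes topic, partition, and offset for each message. It is a sketch under assumptions: SQLite stands in for the external store, and the table names and the helpers open_store, process, and resume_offset are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the result and offset tables (names are illustrative)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS results "
        "(topic TEXT, partition_id INTEGER, msg_offset INTEGER, value TEXT)"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS offsets "
        "(topic TEXT, partition_id INTEGER, next_offset INTEGER, "
        "PRIMARY KEY (topic, partition_id))"
    )
    return db


def process(db, topic, partition, offset, value):
    """Store the processed result and the next offset in one transaction."""
    with db:  # commits on success, rolls back if an exception is raised
        db.execute(
            "INSERT INTO results VALUES (?, ?, ?, ?)",
            (topic, partition, offset, value.upper()),
        )
        db.execute(
            "INSERT OR REPLACE INTO offsets VALUES (?, ?, ?)",
            (topic, partition, offset + 1),
        )


def resume_offset(db, topic, partition):
    """Return the offset to fetch from after a restart or rebalance."""
    row = db.execute(
        "SELECT next_offset FROM offsets WHERE topic = ? AND partition_id = ?",
        (topic, partition),
    ).fetchone()
    return row[0] if row else 0
```

Because the result and the offset are written in the same transaction, a restarted consumer resumes exactly where its stored results end.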

Coordinator‑Based Rebalance Flow

Consumer sends ConsumerMetadataRequest to discover its group’s coordinator.

Coordinator returns ConsumerMetadataResponse with broker information.

Consumer periodically sends HeartbeatRequest. If the response contains IllegalGeneration, the consumer stops fetching, commits offsets, and sends JoinGroupRequest to the coordinator.

Coordinator assigns partitions and returns a JoinGroupResponse containing the new generation ID and partition list.

Consumer resumes fetching with the new assignment.

ConsumerMetadataRequest {
    GroupId => String
}
ConsumerMetadataResponse {
    ErrorCode => int16
    Coordinator => Broker
}
HeartbeatRequest {
    GroupId => String
    GroupGenerationId => int32
    ConsumerId => String
}
HeartbeatResponse {
    ErrorCode => int16
}
JoinGroupRequest {
    GroupId => String
    SessionTimeout => int32
    Topics => [String]
    ConsumerId => String
    PartitionAssignmentStrategy => String
}
JoinGroupResponse {
    ErrorCode => int16
    GroupGenerationId => int32
    ConsumerId => String
    PartitionsToOwn => [TopicName [Partition]]
}
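The exchange of heartbeats and join requests can be illustrated with an in-memory toy. This is not the broker implementation: ToyCoordinator, the numeric error codes, and the round-robin assignment are assumptions made for the sketch, and only the field names follow the schemas above.

```python
from dataclasses import dataclass, field

NO_ERROR = 0
ILLEGAL_GENERATION = 1  # hypothetical value; the text names the error, not its code


@dataclass
class JoinGroupResponse:
    error_code: int
    group_generation_id: int
    consumer_id: str
    partitions_to_own: list


@dataclass
class ToyCoordinator:
    partitions: list
    generation: int = 0
    members: list = field(default_factory=list)

    def join_group(self, consumer_id):
        """A new member bumps the generation; every member gets a slice."""
        if consumer_id not in self.members:
            self.members.append(consumer_id)
            self.members.sort()
            self.generation += 1
        index = self.members.index(consumer_id)
        owned = self.partitions[index::len(self.members)]  # assumed strategy
        return JoinGroupResponse(NO_ERROR, self.generation, consumer_id, owned)

    def heartbeat(self, consumer_id, generation_id):
        """A stale generation ID tells the consumer to stop and re-join."""
        if generation_id != self.generation:
            return ILLEGAL_GENERATION
        return NO_ERROR


coord = ToyCoordinator(partitions=[0, 1, 2])
first = coord.join_group("c1")    # c1 owns everything in generation 1
second = coord.join_group("c2")   # c2 joins, generation becomes 2
stale = coord.heartbeat("c1", first.group_generation_id)
rejoined = coord.join_group("c1")  # c1 re-joins and picks up its new share
```

In this run c1's heartbeat with the old generation ID is rejected, which is the signal described in the flow above for it to stop fetching, commit offsets, and send a new JoinGroupRequest.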

Consumer State Machine

Down – consumer stopped.

Start up & discover coordinator – sends JoinGroupRequest after locating coordinator.

Part of a group – periodically sends heartbeats; on errors transitions to other states.

Rediscover coordinator – re‑issues metadata request when coordinator changes.

Stopped consumption – commits offsets and waits to re‑join.
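The states above can be written down as a transition table. The five states come from the list; the event names and the exact edges between states are assumptions chosen for illustration, since the article does not spell them out.

```python
from enum import Enum


class ConsumerState(Enum):
    DOWN = "down"
    DISCOVER_COORDINATOR = "start up & discover coordinator"
    PART_OF_GROUP = "part of a group"
    REDISCOVER_COORDINATOR = "rediscover coordinator"
    STOPPED_CONSUMPTION = "stopped consumption"


S = ConsumerState

# (current state, event) -> next state; event names are hypothetical.
TRANSITIONS = {
    (S.DOWN, "start"): S.DISCOVER_COORDINATOR,
    (S.DISCOVER_COORDINATOR, "joined"): S.PART_OF_GROUP,
    (S.PART_OF_GROUP, "illegal_generation"): S.STOPPED_CONSUMPTION,
    (S.PART_OF_GROUP, "coordinator_changed"): S.REDISCOVER_COORDINATOR,
    (S.REDISCOVER_COORDINATOR, "coordinator_found"): S.PART_OF_GROUP,
    (S.STOPPED_CONSUMPTION, "joined"): S.PART_OF_GROUP,
}


def step(state, event):
    """Return the next state, or raise if the event is not valid here."""
    if event == "shutdown":  # a consumer can be stopped from any state
        return S.DOWN
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state.name}")
```

Keeping the edges in a table makes invalid transitions fail loudly instead of leaving the consumer in an undefined state.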

Failure Detection

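Consumers and the coordinator exchange heartbeats. If one side hears nothing from the other within session.timeout.ms, it treats its counterpart as dead: the coordinator drops a silent consumer and triggers a rebalance, while a consumer that loses contact with its coordinator goes back to coordinator discovery.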
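On the coordinator side, this kind of timeout check might look like the sketch below. The class HeartbeatTracker and its methods are hypothetical, and time is passed in explicitly as milliseconds so the behavior is easy to follow.

```python
class HeartbeatTracker:
    """Track the last heartbeat per consumer and report expired sessions."""

    def __init__(self, session_timeout_ms):
        self.session_timeout_ms = session_timeout_ms
        self.last_seen = {}

    def record(self, consumer_id, now_ms):
        """Call whenever a HeartbeatRequest arrives from a consumer."""
        self.last_seen[consumer_id] = now_ms

    def expired(self, now_ms):
        """Remove and return consumers silent for longer than the timeout."""
        dead = sorted(
            consumer_id
            for consumer_id, seen in self.last_seen.items()
            if now_ms - seen > self.session_timeout_ms
        )
        for consumer_id in dead:
            del self.last_seen[consumer_id]
        return dead


tracker = HeartbeatTracker(session_timeout_ms=30_000)
tracker.record("c1", now_ms=0)
tracker.record("c2", now_ms=20_000)
dead = tracker.expired(now_ms=35_000)  # only c1 has been silent too long
```

A non-empty result is the point at which a coordinator would start a rebalance for the group.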

Coordinator Responsibilities

Track health of all consumers in its groups.

Load group and member metadata from ZooKeeper on startup.

Reject requests with CoordinatorStartupNotComplete until ready.

Watch for topic/partition changes and trigger rebalance.

The coordinator’s own state machine includes Down, Catch‑up, Ready, Prepare for rebalance, and Rebalancing.
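How a coordinator might gate requests on its own state can be sketched as follows. The state names come from the list above and the error name from the responsibilities section; the function admit and its return convention are assumptions.

```python
from enum import Enum


class CoordinatorState(Enum):
    DOWN = "down"
    CATCH_UP = "catch-up"
    READY = "ready"
    PREPARE_FOR_REBALANCE = "prepare for rebalance"
    REBALANCING = "rebalancing"


COORDINATOR_STARTUP_NOT_COMPLETE = "CoordinatorStartupNotComplete"


def admit(state):
    """Return an error name if a request must be rejected, else None."""
    if state is CoordinatorState.DOWN:
        # A stopped coordinator cannot answer at all.
        raise ConnectionError("coordinator is down")
    if state is CoordinatorState.CATCH_UP:
        # Still loading group metadata, so clients are told to retry later.
        return COORDINATOR_STARTUP_NOT_COMPLETE
    return None
```

Once the coordinator has left the catch-up state, requests are admitted and handled according to whether a rebalance is in progress.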

Coordinator Failover

If a coordinator fails at any stage of the rebalance protocol, the new coordinator re‑reads metadata from ZooKeeper and may initiate a fresh rebalance, ensuring the group eventually reaches a consistent assignment.

Upcoming articles will cover Kafka performance testing methods and reports.
