Big Data 12 min read

Static Members and Incremental Cooperative Rebalancing in Apache Kafka

Apache Kafka 2.3 introduced static members and incremental cooperative rebalancing to reduce disruptive global rebalances, allowing workers to retain assignments during failures, schedule delayed rebalances, and improve scalability for Kafka Connect clusters, balancing availability and fault tolerance.

Architects Research Society
Architects Research Society
Architects Research Society
Static Members and Incremental Cooperative Rebalancing in Apache Kafka

Static Members

To reduce temporary failures that cause consumer rebalances, Apache Kafka 2.3 introduced the concept of static members in KIP‑345.

The core idea is that each consumer instance is attached to a unique identifier configured via group.instance.id . The membership protocol was extended so that this ID is propagated to the broker coordinator through JoinGroup requests.

If a consumer is restarted or terminated due to a transient fault, the broker coordinator will not notify other consumers to rebalance until session.timeout expires. One reason is that a stopped consumer does not send a LeaveGroup request.

When the consumer finally rejoins the group, the broker coordinator returns the cached assignment without performing any further rebalance.

When using static membership, it is recommended to increase the consumer session.timeout so that the broker coordinator does not trigger overly frequent rebalances.

Static membership helps limit the number of unwanted rebalances, thereby minimizing the “stop‑the‑world” impact. The downside is increased partition unavailability because the coordinator may only detect a failed consumer after several minutes (depending on session.timeout.ms ). This is the classic trade‑off between availability and fault‑tolerance in distributed systems.

Incremental Cooperative Rebalancing

Starting with version 2.3, Apache Kafka also introduced a new embedded protocol to improve per‑member resource availability while minimizing stop‑the‑world effects.

The basic idea is to perform rebalancing incrementally and cooperatively – i.e., multiple small rebalances instead of a single global rebalance.

Incremental cooperative rebalancing was first implemented for Kafka Connect via KIP‑415 (partially available in Kafka 2.3). Users of Kafka 2.4 and KIP‑429 can also use it.

Kafka Connect Limitations

Kafka Connect uses the group membership protocol to evenly assign connectors and tasks across workers in a Connect cluster. When a node fails/restarts, tasks are added/removed, or configurations are updated, workers coordinate to rebalance connectors and tasks.

Prior to Kafka 2.3, any of these events caused all existing connectors to stop (a stop‑the‑world scenario), making it hard to scale clusters with dozens of connectors.

Incremental cooperative rebalancing tries to solve this problem in two ways:

Only stop tasks/members for revoked resources.

Handle temporary resource‑allocation imbalances between members, either immediately or with a delay (useful for rolling restarts).

In practice, the incremental cooperative rebalancing principle reduces to three concrete designs:

Simple cooperative rebalance.

Delayed resolution of imbalances.

Incremental resolution of imbalances.

To illustrate how design II (delayed resolution) works, we walk through a Kafka Connect example.

Delayed Resolution of Imbalances

Consider a simple Connect cluster with three workers and the initial task/connector assignment shown below:

1 – Initial assignment

Assume worker W2 fails without a specific reason and leaves the group after a session timeout. A rebalance is triggered and the remaining workers W1 and W3 re‑join the group, sending JoinGroup requests that include their previous assignments. The existing member_metadata field of the group membership protocol shares these assignments.

2 – W2 leaves, triggering rebalance (W1, W3 join)

W1 is elected as the group leader and computes the new task/connector assignment by comparing with the previous assignment. The leader detects tasks and connectors that were not present in the prior assignment.

3 – W1 becomes leader and calculates tasks

W1 sends the new assignment, including revoked tasks. Notice that W1 does **not** immediately try to resolve the missing assignments; instead, it postpones the solution and schedules the next rebalance, giving the failed member a chance to reappear. The delay is controlled by the new configuration scheduled.rebalance.max.delay.ms (default 5 minutes).

Note: With incremental cooperative rebalance, when a member receives a new assignment it starts processing any newly assigned partitions (or tasks/connectors). If the assignment also contains revoked partitions, the member stops processing them, commits offsets, and immediately joins the group again. This increases the number of rebalances but only stops the resources whose assignment changed.

4 – W1 and W3 receive tasks

Before the delay expires, W2 rejoins the group, triggering another rebalance. W1 and W2 re‑join the group.

5 – W2 rejoins before delay expires, triggering rebalance

However, before the scheduled rebalance delay expires, W1 does not reassign the lost tasks/connectors.

6 – W1 remains leader and recalculates tasks

After the remaining delay expires, a final rebalance is triggered and all workers re‑join the group.

7 – W1, W2, W3 receive tasks

Finally, the leader reassigns A‑Task‑1 and Connector‑B to W2. Throughout all rebalances, W1 and W3 never stopped processing their assigned tasks.

8 – After delay, all members join

Conclusion

The rebalance protocol is a crucial component of Apache Kafka’s consumer mechanism, but it can also serve as a generic protocol for coordinating group members and distributing resources among them (e.g., Kafka Connect). Static membership and incremental cooperative rebalancing are important features that make the Kafka protocol more robust and scalable, delivering significant improvements.

For more details on the rebalance protocol and its operation, see the links below.

KIP‑429, KIP‑415, KIP‑453

https://cwiki.apache.org/confluence/display/KAFKA/Incremental+Cooperative+Rebalancing%3A+Support+and+Policies

https://kafka.apache.org/protocol

The Magical Rebalance Protocol of Apache Kafka by Gwen Shapira

https://www.slideshare.net/ConfluentInc/everything-you-always-wanted-to-know-about-kafkas-rebalance-protocol-but-were-afraid-to-ask-matthias-j-sax-confluent-kafka-summit-london-2019

https://www.slideshare.net/ConfluentInc/rebalance-protocol-insideout-a-developer-perspective

Original article: https://medium.com/streamthoughts/apache-kafka-rebalance-protocol-or-the-magic-behind-your-streams-applications-e94baf68e4f2

Translated article: http://jiagoushi.pro/node/1114

Discussion: join the Knowledge Planet “Chief Architect Circle” or the small account “jiagoushi_pro”.

distributed systemsApache KafkaKafka ConnectIncremental RebalancingStatic Members
Architects Research Society
Written by

Architects Research Society

A daily treasure trove for architects, expanding your view and depth. We share enterprise, business, application, data, technology, and security architecture, discuss frameworks, planning, governance, standards, and implementation, and explore emerging styles such as microservices, event‑driven, micro‑frontend, big data, data warehousing, IoT, and AI architecture.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.