Static Members and Incremental Cooperative Rebalancing in Apache Kafka
Apache Kafka 2.3 introduced static membership and incremental cooperative rebalancing to reduce disruptive global rebalances. Together, these features let members retain their assignments across restarts, let the leader schedule delayed rebalances instead of reacting immediately, and improve the scalability of Kafka Connect clusters, trading a little failure-detection speed for much better availability.
Static Members
To reduce the number of rebalances caused by transient consumer failures, Apache Kafka 2.3 introduced static membership in KIP-345.
The core idea is that each consumer instance is assigned a unique identifier, configured via group.instance.id. The group membership protocol was extended so that this ID is propagated to the broker-side group coordinator in JoinGroup requests.
If a consumer is restarted or terminated by a transient fault, the group coordinator does not ask the other consumers to rebalance until session.timeout.ms expires. This works because a static member does not send a LeaveGroup request when it shuts down.
When the consumer finally rejoins the group, the broker coordinator returns the cached assignment without performing any further rebalance.
When using static membership, it is recommended to increase the consumer's session.timeout.ms so that the group coordinator does not trigger a rebalance for every short-lived restart.
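As a sketch, the two settings above can be combined in a consumer configuration like the following (the group name, instance ID, and timeout value are illustrative, not prescriptive):

```properties
# Static membership (KIP-345, Kafka 2.3+)
group.id=payments-consumers
# Must be unique per instance and stable across restarts
# (e.g., derived from the hostname or pod name)
group.instance.id=payments-consumer-1

# Give the instance time to restart without triggering a rebalance;
# 5 minutes here is an illustrative value, well above the 10 s default of that era
session.timeout.ms=300000
```

The key operational requirement is that group.instance.id survives restarts: if a replacement process comes back with the same ID before the session expires, it simply receives its cached assignment.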
Static membership helps limit the number of unwanted rebalances, thereby minimizing the “stop‑the‑world” impact. The downside is increased partition unavailability because the coordinator may only detect a failed consumer after several minutes (depending on session.timeout.ms ). This is the classic trade‑off between availability and fault‑tolerance in distributed systems.
Incremental Cooperative Rebalancing
Starting with version 2.3, Apache Kafka also introduced a new embedded protocol to improve per‑member resource availability while minimizing stop‑the‑world effects.
The basic idea is to perform rebalancing incrementally and cooperatively – i.e., multiple small rebalances instead of a single global rebalance.
Incremental cooperative rebalancing was first implemented for Kafka Connect via KIP-415 (shipped in Kafka 2.3). Kafka 2.4 extended it to the consumer client through KIP-429.
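On the consumer side (KIP-429), cooperative rebalancing is opted into by choosing a cooperative partition assignor. A minimal sketch (the group name is illustrative):

```properties
# Cooperative rebalancing for plain consumers (KIP-429, Kafka 2.4+)
group.id=orders-consumers
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
```

With this assignor, a rebalance only revokes the partitions that actually move to another member, rather than revoking everything up front as the eager protocol does.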
Kafka Connect Limitations
Kafka Connect uses the group membership protocol to evenly assign connectors and tasks across workers in a Connect cluster. When a node fails/restarts, tasks are added/removed, or configurations are updated, workers coordinate to rebalance connectors and tasks.
Prior to Kafka 2.3, any of these events caused all existing connectors and tasks to stop and restart (a stop-the-world rebalance), making it hard to operate clusters with dozens of connectors.
Incremental cooperative rebalancing tries to solve this problem in two ways:
Only stop the tasks/members whose resources are revoked.
Handle temporary resource‑allocation imbalances between members, either immediately or with a delay (useful for rolling restarts).
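In Kafka Connect, this behavior is governed by the worker's connect.protocol setting. A brief sketch of the relevant worker configuration (the protocol names are real; the choice shown is the default since 2.3):

```properties
# Kafka Connect worker configuration
# eager      = old stop-the-world rebalancing
# compatible = incremental cooperative rebalancing, used once all workers support it
connect.protocol=compatible
```

The "compatible" mode downgrades to eager rebalancing automatically if any worker in the cluster only speaks the old protocol, which makes rolling upgrades safe.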
In practice, the incremental cooperative rebalancing principle reduces to three concrete designs:
Simple cooperative rebalance.
Delayed resolution of imbalances.
Incremental resolution of imbalances.
To illustrate how design II (delayed resolution) works, we walk through a Kafka Connect example.
Delayed Resolution of Imbalances
Consider a simple Connect cluster with three workers and the initial task/connector assignment shown below:
1 – Initial assignment
Assume worker W2 fails for some reason and leaves the group after the session timeout. A rebalance is triggered: the remaining workers W1 and W3 re-join the group, sending JoinGroup requests that include their current assignments. These assignments are shared through the existing member_metadata field of the group membership protocol.
2 – W2 leaves, triggering rebalance (W1, W3 join)
W1 is elected group leader and computes the new task/connector assignment by comparing the members' reported assignments with the previous one. This lets the leader detect the tasks and connectors that were previously assigned but are now unaccounted for (W2's lost tasks).
3 – W1 becomes leader and calculates tasks
W1 sends each worker its new assignment, including any revoked tasks. Notice that W1 does not immediately try to resolve the missing assignments; instead, it postpones the resolution and schedules a follow-up rebalance, giving the failed member a chance to reappear. The delay is controlled by the new worker configuration scheduled.rebalance.max.delay.ms (default: 5 minutes).
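A sketch of that worker setting (the value shown is the default):

```properties
# Maximum total delay the leader may wait before resolving an imbalance
# and reassigning lost tasks/connectors (KIP-415; default 5 minutes)
scheduled.rebalance.max.delay.ms=300000
```

Tuning this is a trade-off: a shorter delay reassigns lost work sooner, while a longer one avoids churn during rolling restarts, where workers are expected to come back quickly.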
Note: with incremental cooperative rebalancing, when a member receives a new assignment it keeps processing its retained resources and starts any newly assigned partitions (or tasks/connectors). If the assignment also contains revoked partitions, the member stops processing them, commits their offsets, and immediately re-joins the group. This increases the number of rebalances, but each one only stops the resources whose assignment actually changed.
4 – W1 and W3 receive tasks
Before the delay expires, W2 rejoins the group, triggering another rebalance; W1 and W3 re-join as well.
5 – W2 rejoins before delay expires, triggering rebalance
However, because the scheduled rebalance delay has not yet expired, W1 still does not reassign the lost tasks/connectors; it only schedules a new rebalance for the time remaining in the delay.
6 – W1 remains leader and recalculates tasks
After the remaining delay expires, a final rebalance is triggered and all workers re‑join the group.
7 – W1, W2, W3 receive tasks
Finally, the leader reassigns A‑Task‑1 and Connector‑B to W2. Throughout all rebalances, W1 and W3 never stopped processing their assigned tasks.
8 – After delay, all members join
Conclusion
The rebalance protocol is a crucial component of Apache Kafka's consumer machinery, but it also serves as a generic protocol for coordinating group members and distributing resources among them, as Kafka Connect demonstrates. Static membership and incremental cooperative rebalancing make that protocol markedly more robust and scalable.
For more details on the rebalance protocol and its operation, see the links below.
KIP‑345, KIP‑415, KIP‑429
https://cwiki.apache.org/confluence/display/KAFKA/Incremental+Cooperative+Rebalancing%3A+Support+and+Policies
https://kafka.apache.org/protocol
The Magical Rebalance Protocol of Apache Kafka by Gwen Shapira
https://www.slideshare.net/ConfluentInc/everything-you-always-wanted-to-know-about-kafkas-rebalance-protocol-but-were-afraid-to-ask-matthias-j-sax-confluent-kafka-summit-london-2019
https://www.slideshare.net/ConfluentInc/rebalance-protocol-insideout-a-developer-perspective
Original article: https://medium.com/streamthoughts/apache-kafka-rebalance-protocol-or-the-magic-behind-your-streams-applications-e94baf68e4f2
Translated article: http://jiagoushi.pro/node/1114