
Mastering Kafka ISR: How In‑Sync Replicas Ensure Consistency and High Availability

This article explains Kafka's In‑Sync Replicas (ISR) mechanism, detailing its definitions, dynamic scaling, interaction with High Watermark, extreme unclean leader election scenarios, and practical tuning and troubleshooting tips for maintaining strong consistency and high availability in production clusters.

Cognitive Technology Team

Why ISR Is Needed

Early Kafka versions used a simple leader‑follower model: the leader handled reads and writes while followers pulled data asynchronously. Under failures such as network partitions or broker crashes, this could cause permanent data loss, for example when the leader crashed before any follower had replicated a just‑acknowledged message.

Core Concepts

AR (Assigned Replicas) : All replicas configured for a partition, defined by replication.factor.

ISR (In‑Sync Replicas) : Replicas that are fully caught up with the leader within an allowed lag; the leader is always part of ISR.

OSR (Out‑of‑Sync Replicas) : Replicas that fall behind due to network delay, GC pauses, or high load and are temporarily removed from ISR.

HW (High Watermark) : The smallest Log End Offset (LEO) among all ISR members, representing the offset up to which messages are considered committed.
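These sets can be inspected directly at runtime. Below is a minimal sketch using the Java AdminClient; the topic name orders and the bootstrap address are placeholders, and allTopicNames() assumes a recent kafka-clients (3.1+).

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import java.util.List;
import java.util.Properties;

public class IsrInspector {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("orders"))
                    .allTopicNames().get().get("orders");
            desc.partitions().forEach(p ->
                    // replicas() is the AR set; isr() is the subset currently in sync
                    System.out.printf("partition %d leader=%s AR=%s ISR=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}
```

A healthy partition prints identical AR and ISR lists; any replica present in AR but missing from ISR is currently in OSR.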

Key Points

The leader has a special status: it is by definition always a member of ISR. If the leader fails and no other replica remains in ISR, the partition becomes unavailable (unless unclean election is allowed, discussed below).

ISR is dynamic: replicas are added or removed based on real‑time sync performance.

Dynamic Scaling of ISR (Shrink & Expand)

1. When Is a Replica Kicked Out?

Kafka determines sync status based on a time threshold, replica.lag.time.max.ms (default 10 s). Versions before 0.9 additionally used a message‑count threshold (replica.lag.max.messages), which proved unreliable under bursty traffic; later versions rely solely on time.

If the time since a follower last fully caught up to the leader's LEO (tracked per replica as lastCaughtUpTimeMs) exceeds replica.lag.time.max.ms, the replica is demoted to OSR.

The leader also records each follower's last fetch time, so a replica that stops fetching altogether is evicted by the same threshold, as sketched below.
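A simplified model of this time‑based check (an illustration, not Kafka's internal code): the leader remembers, per follower, the last moment the follower's fetch position reached the leader's LEO, and considers it out of sync once that timestamp is older than the threshold.

```java
// Simplified model of the time-based ISR check (not Kafka's actual code).
// A follower stays in ISR only if it has fully caught up to the leader's LEO
// within the last replicaLagTimeMaxMs milliseconds.
class FollowerState {
    long lastCaughtUpTimeMs;  // last time this follower's fetch reached the leader's LEO

    boolean isInSync(long nowMs, long replicaLagTimeMaxMs) {
        return nowMs - lastCaughtUpTimeMs <= replicaLagTimeMaxMs;
    }
}
```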

2. Shrink Process

The broker hosting the leader detects that a follower has fallen out of sync and triggers an ISR update.

The out‑of‑sync follower is moved from ISR to OSR.

The leader sends the new ISR list to the Controller.

The Controller persists the change to ZooKeeper (or the KRaft metadata log) and broadcasts the updated metadata to all brokers.

If producers use acks=all (or -1), the leader only waits for acknowledgments from the current ISR, preventing stalls caused by a slow follower.
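For illustration, a minimal producer configured this way; the topic name and bootstrap address are placeholders.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class IsrAwareProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // acks=all: the leader acknowledges only after every current ISR member
        // has the record; combined with min.insync.replicas this bounds data loss
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        // idempotence avoids duplicates when a send is retried after a timeout
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // get() blocks until the ISR has acknowledged or the request fails
            producer.send(new ProducerRecord<>("orders", "key-1", "value-1")).get();
        }
    }
}
```

Note that acks=all alone does not fix the replica count: if ISR shrinks to the leader only, the write is still acknowledged; pair it with min.insync.replicas (see tuning section) to enforce a floor.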

3. Expand Process

The lagging follower catches up to the leader’s LEO.

The leader detects the follower is in sync and adds it back to ISR.

The Controller again broadcasts the refreshed ISR list.

ISR and High Watermark (HW) Collaboration

HW is the minimum LEO among ISR members. Offsets below HW are considered committed and visible to consumers; offsets above HW are uncommitted even if the leader has written them.
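As a compact illustration of this rule (mirroring the numbers in the workflow example below):

```java
import java.util.Collection;

class HighWatermark {
    // HW = min(LEO) over the current ISR; offsets below HW are committed.
    // Example: ISR LEOs {101, 101, 90} -> HW = 90. If the replica at 90 is
    // evicted from ISR, HW advances to min(101, 101) = 101.
    static long of(Collection<Long> isrLogEndOffsets) {
        return isrLogEndOffsets.stream().mapToLong(Long::longValue).min().orElse(0L);
    }
}
```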

Workflow Example

Producer writes a message at offset 100; leader advances LEO to 101.

Followers pull the data; one follower stays at LEO 90 due to network delay.

If the lag exceeds replica.lag.time.max.ms, the slow follower is removed from ISR, reducing the ISR set.

HW is recomputed based on the remaining ISR, marking offset 100 as committed.

During a failover, the Controller elects the new leader from the remaining ISR, ensuring no committed data is lost.

Extreme Scenario: Unclean Leader Election

When the leader crashes and ISR becomes empty, Kafka can either forbid or allow unclean leader election via unclean.leader.election.enable.

Option A – Disable (default)

Cluster refuses to elect a non‑ISR replica as leader.

Partition stays unavailable until a valid ISR member returns.

Pros: zero data loss, strong consistency.

Cons: reduced availability during severe failures.

Option B – Enable

Cluster may elect a lagging OSR replica as leader.

Partition quickly becomes available.

Cons: possible data loss and rollback of messages not yet replicated to ISR.

Best practice: keep unclean.leader.election.enable=false in most production environments; prioritize fixing the root cause of frequent ISR shrinkage rather than sacrificing consistency.
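Where the setting matters most, the topic‑level override can be pinned explicitly so a broker‑level change cannot flip it. A sketch using the AdminClient; topic name and bootstrap address are placeholders.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class DisableUncleanElection {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            // Set a topic-level override that takes precedence over the broker default
            AlterConfigOp op = new AlterConfigOp(
                    new ConfigEntry("unclean.leader.election.enable", "false"),
                    AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
        }
    }
}
```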

Production Tuning & Troubleshooting

Key Monitoring Metrics

UnderReplicatedPartitions : partitions where ISR count < AR count.

OfflinePartitionsCount : partitions without a leader.

ReplicaFetcherManager MaxLag : maximum lag of any follower behind its leader.

NetworkProcessorAvgIdlePercent and RequestHandlerAvgIdlePercent : indicate network and I/O thread saturation.
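All of these are exposed as JMX MBeans on the broker. A minimal sketch that reads UnderReplicatedPartitions; the host and port are placeholders and assume the broker was started with remote JMX enabled.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class UrpCheck {
    public static void main(String[] args) throws Exception {
        // Assumes the broker exposes JMX on broker1:9999
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbsc = connector.getMBeanServerConnection();
            ObjectName urp = new ObjectName(
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions");
            // A sustained non-zero value means some partition's ISR < AR
            Object value = mbsc.getAttribute(urp, "Value");
            System.out.println("UnderReplicatedPartitions = " + value);
        }
    }
}
```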

Common Causes of ISR Shrinkage

Full GC : long stop‑the‑world pauses; check gc.log and compare pause duration to replica.lag.time.max.ms. Mitigate by tuning JVM heap and using G1/ZGC.

Disk I/O Bottleneck : slow follower writes; monitor with iostat (%util, await) and broker logs for slow fetches. Remedy by using SSDs, adding disks, or increasing num.io.threads.

Insufficient Network Bandwidth : cross‑datacenter replication or traffic spikes; monitor with iftop, nload and increase bandwidth or improve rack awareness.

High CPU Load : compression or encryption overhead; identify with top and consider lighter codecs (e.g., snappy) or disable unnecessary encryption.

Configuration Mismatch : differing message.max.bytes or other settings across brokers; align server.properties across the cluster.
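Configuration drift in particular can be audited from outside the box. A sketch that compares one setting across brokers with the AdminClient; the broker IDs 0–2 and the bootstrap address are placeholders.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class BrokerConfigAudit {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Compare message.max.bytes across brokers 0, 1 and 2
            List<ConfigResource> brokers = List.of(
                    new ConfigResource(ConfigResource.Type.BROKER, "0"),
                    new ConfigResource(ConfigResource.Type.BROKER, "1"),
                    new ConfigResource(ConfigResource.Type.BROKER, "2"));
            Map<ConfigResource, Config> configs =
                    admin.describeConfigs(brokers).all().get();
            configs.forEach((broker, config) ->
                    System.out.printf("broker %s message.max.bytes=%s%n",
                            broker.name(), config.get("message.max.bytes").value()));
        }
    }
}
```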

Parameter Tuning Recommendations

replica.lag.time.max.ms : default 10000 ms (raised to 30000 ms in Kafka 2.5); lower it for low‑latency systems, raise it for cross‑region replication.

min.insync.replicas : default 1; set to 2 with replication.factor=3 to tolerate one broker failure while preserving durability.

unclean.leader.election.enable : default false; keep it disabled unless the business explicitly tolerates data loss.
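Putting the durability settings together, a sketch that creates a topic with replication.factor=3 and a min.insync.replicas=2 override; the topic name, partition count, and bootstrap address are placeholders.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateDurableTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions, replication.factor=3, min.insync.replicas=2:
            // acks=all writes then survive one broker failure without stalling
            NewTopic topic = new NewTopic("orders", 6, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```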

Conclusion & Outlook

Kafka’s ISR mechanism is the core of its distributed consistency, balancing the CAP trade‑off between consistency and availability. In normal operation ISR guarantees strong consistency via HW; during partial failures ISR shrinkage maintains availability; in extreme cases unclean leader election lets operators choose between data loss and service downtime. With the shift to KRaft (removing ZooKeeper), metadata handling becomes more efficient, but ISR logic remains unchanged, making it essential knowledge for Kafka operators and architects.
