
Mastering Kafka High Availability: Replication, Leader‑Follower, ISR, and Ack Strategies

This article explains Kafka's high‑availability architecture: multi‑replica replication, leader‑follower election and failover, the role of In‑Sync Replicas, and how producer acknowledgment settings combine with min.insync.replicas to keep streaming reliable and minimize the risk of data loss.

Mike Chen's Internet Architecture

Replication Mechanism

Kafka creates multiple replicas of each partition and places them on different brokers, so the loss of a single broker does not take the partition offline. Each partition has exactly one Leader, which by default handles all client read and write requests. Followers act as backups: they continuously fetch the Leader's log and append it to their own copy, independently of client requests.
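
To make the placement idea concrete, here is a minimal sketch in Python. It is a simplified round-robin model, not Kafka's actual assignment algorithm, and the function name assign_replicas is hypothetical. It only illustrates that the replicas of one partition land on distinct brokers and that leadership can be spread across them.

```python
def assign_replicas(num_partitions, replication_factor, broker_ids):
    """Toy placement: replica r of partition p goes to broker index (p + r) mod N.

    Assumption: this is an illustrative model only. Real Kafka also randomizes
    the starting broker and can take rack information into account.
    """
    n = len(broker_ids)
    if replication_factor > n:
        raise ValueError("replication factor cannot exceed the number of brokers")
    assignment = {}
    for p in range(num_partitions):
        # The first broker in each list plays the role of the preferred Leader.
        assignment[p] = [broker_ids[(p + r) % n] for r in range(replication_factor)]
    return assignment


layout = assign_replicas(num_partitions=3, replication_factor=2, broker_ids=[1, 2, 3])
for partition, replicas in layout.items():
    print(f"partition {partition}: leader={replicas[0]} followers={replicas[1:]}")
```

Because each partition starts on a different broker in this model, leadership (and therefore client load) is spread across the cluster rather than concentrated on one machine.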

Leader‑Follower Election and Failover

Kafka routes all writes for a partition through its Leader, and by default reads as well; Followers exist for redundancy rather than load sharing, which keeps the consistency protocol simple. When the broker hosting a Leader fails, the controller (backed by ZooKeeper or KRaft) automatically elects a new Leader from the In‑Sync Replicas (ISR) list. Clients refresh their metadata and retry against the new Leader, so failover is largely transparent to applications.
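
The election rule can be sketched as a small Python function. This is a simplified model under stated assumptions, not the controller's real code: elect_leader and its parameters are hypothetical names, and the allow_unclean flag stands in for the unclean.leader.election.enable setting.

```python
def elect_leader(replicas, isr, live_brokers, allow_unclean=False):
    """Pick a new Leader for one partition, or return None if no candidate exists.

    replicas:     assigned replica broker ids, in preference order
    isr:          set of broker ids currently in the In-Sync Replicas list
    live_brokers: set of broker ids that are currently alive
    """
    # Clean election: the first assigned replica that is both alive and in sync.
    for broker in replicas:
        if broker in live_brokers and broker in isr:
            return broker
    # Unclean election (opt-in): any live replica, accepting possible data loss.
    if allow_unclean:
        for broker in replicas:
            if broker in live_brokers:
                return broker
    return None  # the partition stays offline until an eligible replica returns


# Broker 1 (the old Leader) has failed; broker 2 is alive and in sync.
print(elect_leader(replicas=[1, 2, 3], isr={1, 2}, live_brokers={2, 3}))
```

Returning None in the clean case reflects the trade-off: the partition is unavailable for a while, but no acknowledged data is silently dropped.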

In‑Sync Replicas (ISR)

The ISR is the core mechanism behind Kafka's data consistency. For each partition, the Leader maintains a dynamic set of replicas that are fully caught up with its log. A follower that has not caught up to the Leader's log end within replica.lag.time.max.ms is removed from the ISR; once it catches up, it rejoins. By default (unclean.leader.election.enable=false), only replicas in the ISR are eligible to become the new Leader, which balances strong synchronization against performance.
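
A minimal Python sketch of the time-based ISR check follows. It is an illustrative model only: the function update_isr and the last_caught_up_ms bookkeeping are assumptions made for this example, and the 30,000 ms threshold is just a sample value for replica.lag.time.max.ms, so check the default for your Kafka version.

```python
def update_isr(leader_id, last_caught_up_ms, now_ms, replica_lag_time_max_ms=30_000):
    """Recompute the ISR for one partition.

    last_caught_up_ms maps each follower id to the last time (in ms) at which
    it was fully caught up with the Leader's log end (hypothetical bookkeeping).
    """
    isr = {leader_id}  # the Leader is always in sync with itself
    for follower_id, caught_up_at in last_caught_up_ms.items():
        # A follower stays in the ISR only while its lag time is within the limit.
        if now_ms - caught_up_at <= replica_lag_time_max_ms:
            isr.add(follower_id)
    return isr


# Follower 2 caught up 5 s ago, follower 3 has been behind for 40 s.
print(update_isr(leader_id=1, last_caught_up_ms={2: 95_000, 3: 60_000}, now_ms=100_000))
```

Running the same check again after follower 3 catches up would put it back in the set, which is the shrink-and-expand behavior described above.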

Producer ACK Strategy and min.insync.replicas

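To minimize the risk of data loss, producers must configure acknowledgments together with the min.insync.replicas parameter. acks=0 does not wait for any broker response and acks=1 waits only for the Leader's write; both favor performance but can lose data if the Leader fails before Followers replicate the message. acks=all (or -1) requires the message to be written to all replicas currently in the ISR before it is acknowledged. min.insync.replicas, which can be set as a broker default or per topic, defines a safety threshold for acks=all writes: if the ISR size falls below this value, the broker rejects those writes, which prevents messages from being acknowledged when only a single replica remains. A common practice is to set it to “replication factor ‑ 1”, with a value of at least 2 (for example, replication factor 3 with min.insync.replicas=2).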
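
The interaction between acks and min.insync.replicas can be sketched in a few lines of Python. This is a toy model of the broker-side decision, not Kafka's implementation: the function replicas_required_for_ack and the exception class NotEnoughReplicasError are hypothetical names chosen for this example.

```python
class NotEnoughReplicasError(Exception):
    """Raised in this sketch when an acks=all write cannot meet min.insync.replicas."""


def replicas_required_for_ack(acks, isr_size, min_insync_replicas):
    """Return how many replicas must hold a record before the producer is acknowledged."""
    acks = str(acks)
    if acks == "0":
        return 0  # fire and forget: the producer does not wait for any broker
    if acks == "1":
        return 1  # only the Leader's local write is required
    if acks in ("all", "-1"):
        # The threshold only applies to acks=all writes.
        if isr_size < min_insync_replicas:
            raise NotEnoughReplicasError(
                f"ISR size {isr_size} is below min.insync.replicas={min_insync_replicas}"
            )
        return isr_size  # every replica currently in the ISR must have the record
    raise ValueError(f"unsupported acks value: {acks}")


# Replication factor 3, min.insync.replicas=2, all replicas in sync.
print(replicas_required_for_ack("all", isr_size=3, min_insync_replicas=2))
```

With a replication factor of 3 and min.insync.replicas=2, this model keeps accepting writes when one replica drops out of the ISR and starts rejecting them when only the Leader remains.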
