Kafka Transactions, Replication Issues, HW/LEO Evolution, and Reliability Mechanisms
This article explains how Kafka implements transactions, handles under‑replicated partitions, manages high‑watermark and log‑end‑offset evolution, uses leader epochs for consistency, discusses read‑committed isolation, explains why read‑write separation is not supported, and describes delay queues, dead‑letter/retry queues, auditing, tracing, lag calculation, key metrics, and performance‑optimising design features.
How Kafka Implements Transactions
Kafka transactions allow an application to treat consuming messages, producing messages, and committing consumer offsets as a single atomic operation that either fully succeeds or fails, even when spanning multiple partitions.
The producer must provide a unique transactionalId; upon start it requests a PID from the transaction coordinator, establishing a one‑to‑one mapping between transactionalId and PID.
Before sending data to a <Topic, Partition>, the producer sends an AddPartitionsToTxnRequest to the transaction coordinator, which records the <Transaction, Topic, Partition> in the internal __transaction_state topic and sets its state to BEGIN.
After processing AddOffsetsToTxnRequest, the producer also sends a TxnOffsetCommitRequest to the GroupCoordinator, storing the offsets involved in the transaction in the __consumer_offsets topic.
When all writes are complete, the application calls commitTransaction() or abortTransaction(). The producer then sends an EndTxnRequest to the TransactionCoordinator, which writes a PREPARE_COMMIT or PREPARE_ABORT record to __transaction_state, writes the final COMMIT or ABORT markers to the user topics and __consumer_offsets, and finally writes a COMPLETE_COMMIT or COMPLETE_ABORT record to indicate the transaction has finished.
Consumers set isolation.level=read_committed to hide uncommitted messages; such messages are cached inside the consumer until the transaction is committed, or discarded if the transaction is aborted.
What Are Under‑Replicated Partitions and How to Handle Them?
Normally all replicas of a partition belong to the ISR set, but replicas that fall behind the leader beyond configured thresholds are removed from ISR and become under‑replicated (or out‑of‑sync) replicas.
Since Kafka 0.9, the broker parameter replica.lag.time.max.ms (default 10 000 ms) determines when a follower lagging in time is considered out‑of‑sync. Earlier versions also used replica.lag.max.messages (default 4000) to detect lag based on message count; the union of both criteria defines the set of failed replicas.
Typical causes of replica failure include a stuck follower process (e.g., long Full GC), slow I/O causing the follower to fall behind, newly added replicas that have not yet caught up, and followers that go offline and later re‑join before synchronising.
Mitigation Measures
The metric UnderReplicatedPartitions counts partitions whose leader has at least one out‑of‑sync follower. A stable non‑zero value across many brokers usually indicates a broker is permanently down, often due to hardware, OS, or JVM issues.
Frequent fluctuations of UnderReplicatedPartitions without broker loss suggest performance problems; investigators should narrow the scope to a specific broker and examine OS, GC, network, or disk metrics (e.g., iowait, ioutil).
HW and LEO Evolution Across Replicas
Consider a partition with three replicas on broker0, broker1, and broker2. broker0 hosts the leader; the others are followers. The message flow is:
Producer sends messages to the leader.
Leader appends messages to its local log and updates its Log End Offset (LEO).
Followers request data from the leader.
Leader reads its log and sends the requested data to each follower.
Followers append the received messages to their logs and update their own LEO.
Initially the leader’s LEO may be 5 while all replicas have High Watermark (HW) 0. As followers fetch data, they receive the leader’s HW in the FetchResponse and set their own HW to the minimum of their LEO and the leader’s HW. Subsequent fetches cause the leader to recompute HW as the minimum LEO among ISR replicas, which then propagates back to followers.
Reliability Improvements: HW and Leader Epoch
High Watermark (HW)
HW marks the highest offset that is guaranteed to be replicated to all ISR replicas; consumers can only read up to HW.
HW equals the smallest LEO among ISR replicas.
Leader Epoch
Leader epoch is a monotonically increasing number (starting at 0) that changes whenever a new leader is elected. Each replica maintains a vector <LeaderEpoch => StartOffset> indicating the first offset written under a given epoch.
When a follower restarts, it sends an OffsetsForLeaderEpochRequest to the current leader. If the follower’s epoch differs, the leader returns the start offset for the next epoch, allowing the follower to truncate its log safely.
Leader‑epoch information also resolves data‑inconsistency scenarios after simultaneous failures, ensuring that only the log entries belonging to the latest epoch are retained.
Why Kafka Does Not Support Read‑Write Separation
Separating reads to followers introduces consistency windows where followers lag behind the leader, and adds latency due to the extra network‑disk‑memory hops. Kafka instead distributes leader partitions evenly across brokers, balancing read and write load without a dedicated read‑only replica.
Implementation of Delay Queues in Kafka
Delayed messages are first written to internal “delay” topics that are invisible to users. A custom DelayService process consumes these topics, holds messages in a java.util.concurrent.DelayQueue until their scheduled delivery time, and then forwards them to the real target topic.
Each delay level (e.g., 5 s, 10 s, 30 s, 1 min, etc.) has its own internal topic and dedicated fetch and dispatch threads, ensuring ordered delivery and preventing memory overload by throttling fetches when a partition’s pending count exceeds a threshold.
Dead‑Letter and Retry Queues
Failed messages can be routed to a retry queue with increasing back‑off intervals (e.g., Q1 = 5 s, Q2 = 10 s). After exceeding the maximum retry count, messages are moved to a dead‑letter queue for manual inspection.
Message Auditing
Auditing records timestamps or globally unique IDs in the message payload or headers, enabling detection of loss, duplication, and end‑to‑end latency. Tools such as Uber’s Chaperone, Confluent Control Center, and LinkedIn’s Kafka Monitor rely on these embedded fields.
Message Tracing
Tracing captures the full lifecycle of a message—production, broker storage, and consumption—by appending trace records to a dedicated trace_topic. Consumers and producers can later query this topic to reconstruct the message’s path.
Lag Calculation (read_uncommitted vs. read_committed)
With isolation.level=read_uncommitted, lag = HW – ConsumerOffset. With read_committed, lag = LSO (Last Stable Offset) – ConsumerOffset. The calculation involves fetching group metadata, consumer offsets, and the appropriate high‑watermark or LSO via DescribeGroupsRequest, OffsetFetchRequest, and ListOffsetRequest.
Key Kafka Metrics to Monitor
BytesIn/BytesOut – inbound/outbound traffic per broker.
NetworkProcessorAvgIdlePercent – network thread pool idle ratio (should stay >30%).
RequestHandlerAvgIdlePercent – I/O thread pool idle ratio (should stay >30%).
UnderReplicatedPartitions – number of partitions lacking full replication.
ISRShrink/ISRExpand – frequency of ISR membership changes.
ActiveControllerCount – should be 1 on the controller broker; >1 indicates a split‑brain.
Design Features That Give Kafka High Performance
Partitioning – distributes load across many brokers and enables parallel consumption.
Batching and End‑to‑End Compression – reduces network overhead.
Sequential Disk I/O – appends to log files for fast reads/writes.
Zero‑Copy Transfer – uses sendfile (or FileChannel.transferTo) to move data from disk to network without extra copies.
Log Segmentation – splits large logs into manageable segments.
Sparse Index Files – store index entries every configurable byte interval, balancing lookup speed and index size.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
