Big Data 42 min read

Kafka Transactions: Implementation Details and End‑to‑End Process

This article explains how Apache Kafka implements transactional messaging, covering the atomic write guarantees across partitions, the role of TransactionCoordinator, transaction state management, producer and consumer handling, code examples, and the end‑to‑end workflow for exactly‑once semantics.

Big Data Technology & Architecture
Big Data Technology & Architecture
Big Data Technology & Architecture
Kafka Transactions: Implementation Details and End‑to‑End Process

Overview

Kafka transactions provide atomic writes across multiple partitions, enabling exactly‑once semantics (EOS) that go beyond the per‑partition guarantees of the idempotent producer. The implementation is based on a TransactionCoordinator, transaction logs, and a series of state transitions.

Exactly‑Once Guarantees

Idempotent Producer – exactly‑once, in‑order delivery per partition.

Transactions – atomic writes across partitions.

Exactly‑once stream processing – end‑to‑end EOS for consume‑process‑write pipelines.

TransactionCoordinator

The TransactionCoordinator (TC) acts like a 2PC coordinator. It receives requests such as FindCoordinator, InitProducerId, AddPartitionsToTxn, AddOffsetsToTxn, and EndTxn. TC stores transaction metadata in the internal topic __transaction_state, persists state changes, and sends Transaction Marker messages to partition leaders.

Transaction Log (__transaction_state)

All transaction metadata (transactionalId, producerId, epoch, status, involved partitions, timestamps) is serialized into __transaction_state. This compacted topic allows a new TC to recover the full state after a leader change.

State Machine

Transaction states include Empty, Ongoing, PrepareCommit, PrepareAbort, CompleteCommit, CompleteAbort, Dead, and PrepareEpochFence. Normal flow:

Empty → Ongoing → PrepareCommit → CompleteCommit → Empty

(or abort path).

Producer API Workflow

Configure transactional.id and call initTransactions() to obtain a PID.

Start a transaction with beginTransaction().

Send records; the producer adds partitions to the transaction via AddPartitionsToTxn.

Optionally call sendOffsetsToTransaction() for consume‑process‑produce scenarios.

Commit or abort with commitTransaction() / abortTransaction(), which triggers EndTxn and the TC’s marker writes.

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("transactional.id", "test-transactional");
KafkaProducer producer = new KafkaProducer<>(props);
producer.initTransactions();
producer.beginTransaction();
producer.send(new ProducerRecord(topic, "0", msg));
producer.commitTransaction();

Consumer Handling

Consumers use the isolation.level setting. With read_committed, the broker only returns records up to the Last Stable Offset (LSO), which is the highest offset where all prior transactional records are known to be committed or aborted. Abort markers are stored in per‑segment .txnindex files, allowing the broker to filter aborted data without client‑side buffering.

Fencing Mechanisms

Both TransactionCoordinator and producers employ fencing via epochs. A newer coordinator or producer with a higher epoch invalidates older instances, causing ProducerFencedException on stale producers.

Snapshots and Recovery

Each partition periodically writes a PID snapshot ( .snapshot) containing the latest producerId, epoch, and sequence numbers. After a broker restart, the snapshot plus the log enables fast recovery of idempotent state.

Failure Scenarios

Producer retries on beginTransaction or commitTransaction failures.

Coordinator failures are handled by replaying __transaction_state and completing pending PrepareCommit / PrepareAbort phases.

Transaction time‑outs trigger automatic aborts.

Conclusion

The combination of idempotent producers, a dedicated TransactionCoordinator, transaction logs, LSO, and fencing provides Kafka with robust exactly‑once guarantees across partitions, enabling reliable stream processing frameworks such as Kafka Streams and Flink.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsKafkaConsumerTransactionsProducer
Big Data Technology & Architecture
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.