
Understanding Flink's Exactly-Once Guarantees: Checkpoint, Two‑Phase Commit, and Kafka Integration

This article explains how Apache Flink achieves end‑to‑end exactly‑once semantics through replayable sources, checkpoint‑based snapshots (taken asynchronously and incrementally), and two‑phase‑commit sinks, and how Flink coordinates with external systems such as Kafka to make writes transactional.

DataFunTalk

Background

Since version 1.4.0, Flink has provided end‑to‑end exactly‑once guarantees, ensuring that each record affects the results exactly once: neither lost nor duplicated.

To achieve end‑to‑end exactly‑once, Flink requires:

A source that supports data replay, so records can be re‑read after a failure (e.g., Kafka lets a consumer rewind to a recorded offset).

Flink's internal checkpoint mechanism, which snapshots operator state.

A sink that writes idempotently or transactionally, so replayed records do not produce duplicates after failure recovery.

Checkpoint

Flink uses a distributed snapshot mechanism based on checkpoints, allowing a job to recover from the latest snapshot after a fail‑over, thereby preserving exactly‑once processing internally.

The core components of a Flink checkpoint are:

Barrier: a lightweight marker that flows in order with the data stream, carrying a unique checkpoint ID.

Asynchronous: snapshots are written to storage asynchronously, so the data‑processing pipeline is not blocked.

Incremental: instead of full snapshots (which can reach gigabytes or terabytes), Flink stores only the changes since the previous checkpoint.
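The interplay of barriers and incremental snapshots can be illustrated with a toy simulation. This is a hypothetical sketch, not Flink's API: `Barrier` and `CountingOperator` are invented names, and the "incremental snapshot" here is simply the set of keys modified since the last barrier.

```python
class Barrier:
    """Marker injected into the stream; carries a unique checkpoint ID."""
    def __init__(self, checkpoint_id):
        self.checkpoint_id = checkpoint_id

class CountingOperator:
    """Counts records per key; snapshots only keys changed since the last barrier."""
    def __init__(self):
        self.state = {}        # full operator state
        self.dirty = set()     # keys modified since the last checkpoint
        self.snapshots = {}    # checkpoint_id -> incremental delta

    def process(self, item):
        if isinstance(item, Barrier):
            # Snapshot only the delta, not the full (possibly huge) state.
            delta = {k: self.state[k] for k in sorted(self.dirty)}
            self.snapshots[item.checkpoint_id] = delta
            self.dirty.clear()
        else:
            self.state[item] = self.state.get(item, 0) + 1
            self.dirty.add(item)

op = CountingOperator()
for item in ["a", "b", "a", Barrier(1), "b", Barrier(2)]:
    op.process(item)
# Checkpoint 1 captures {"a": 2, "b": 1}; checkpoint 2 captures only {"b": 2},
# because only "b" changed between the two barriers.
```

In real Flink the barrier also triggers asynchronous upload of the snapshot to durable storage, which this single-threaded sketch omits.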

Transactional Write

Core Idea: a transaction is tied to each checkpoint; results are committed to the sink only when that checkpoint completes.

Two implementation strategies:

Write‑Ahead Log (WAL): results are first buffered in operator state and flushed to the sink after the checkpoint completes. If a failure occurs mid‑flush, the flush is repeated on recovery, so this approach yields at‑least‑once rather than true exactly‑once.

Two‑Phase Commit (2PC): for each checkpoint, the sink opens a transaction, buffers incoming data into it, pre‑commits (writes the data without making it visible), and commits only when the checkpoint is acknowledged. This requires an external sink that supports transactions.
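The WAL approach and its weakness can be shown in a few lines. This is an illustrative sketch (the `WalSink` class and its method names are invented, not Flink APIs): records stay in operator state until the checkpoint completes, and the comment marks exactly where the duplicate-write window lies.

```python
class WalSink:
    """Write-ahead-log sink sketch: buffer in state, flush on checkpoint."""
    def __init__(self, external_store):
        self.buffer = []             # results held in operator state (the WAL)
        self.store = external_store  # stand-in for the external sink

    def invoke(self, record):
        self.buffer.append(record)   # buffered, not yet visible externally

    def on_checkpoint_complete(self):
        # Flush only after the checkpoint succeeds. If the process crashes
        # halfway through this loop, some records are already written, but
        # the buffer is restored from the checkpoint on recovery and the
        # flush repeats: at-least-once, not exactly-once.
        for record in self.buffer:
            self.store.append(record)
        self.buffer.clear()

store = []
sink = WalSink(store)
sink.invoke("r1")
sink.invoke("r2")
assert store == []               # nothing visible before the checkpoint
sink.on_checkpoint_complete()
assert store == ["r1", "r2"]     # visible only afterwards
```

2PC closes this window by making the external system itself hold the uncommitted data, so a repeated commit is idempotent.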

Flink provides the abstract class TwoPhaseCommitSinkFunction, where developers implement beginTransaction, preCommit, commit, and abort to achieve exactly‑once semantics. For a file‑based sink, the four hooks map as follows:

beginTransaction: creates a temporary file in the target filesystem.

preCommit: flushes buffered data to the temporary file and closes it.

commit: atomically moves the temporary file to its final destination.

abort: deletes the temporary file on failure.
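The file-sink pattern above can be sketched in plain Python. The method names mirror the Flink hooks, but `TwoPhaseFileSink` itself is an illustrative stand-in, not the real `TwoPhaseCommitSinkFunction` (which, among other things, passes a transaction handle into each hook and manages multiple in-flight transactions).

```python
import os
import tempfile

class TwoPhaseFileSink:
    """Sketch of a transactional file sink: temp file -> flush -> atomic rename."""
    def __init__(self, target_dir):
        self.target_dir = target_dir
        self.txn = None  # (open file handle, temp path)

    def begin_transaction(self):
        # Create a temporary file invisible to downstream readers.
        fd, path = tempfile.mkstemp(dir=self.target_dir, suffix=".inprogress")
        self.txn = (os.fdopen(fd, "w"), path)

    def invoke(self, record):
        self.txn[0].write(record + "\n")

    def pre_commit(self):
        # Flush buffered data to disk and close the file; nothing is final yet.
        f, _ = self.txn
        f.flush()
        os.fsync(f.fileno())
        f.close()

    def commit(self, final_name):
        # Atomically move the temp file to its final destination.
        _, path = self.txn
        os.rename(path, os.path.join(self.target_dir, final_name))
        self.txn = None

    def abort(self):
        # On failure, delete the temp file so no partial data leaks.
        f, path = self.txn
        if not f.closed:
            f.close()
        os.remove(path)
        self.txn = None

target = tempfile.mkdtemp()
sink = TwoPhaseFileSink(target)
sink.begin_transaction()
sink.invoke("result-1")
sink.pre_commit()               # data durable, but still named *.inprogress
sink.commit("part-0")           # rename makes it visible in one atomic step
```

The atomic rename is what makes a retried commit safe: repeating it after recovery either performs the same move or finds it already done.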

Flink‑Kafka Exactly‑Once

Flink's asynchronous snapshots and two‑phase commit can only deliver end‑to‑end exactly‑once semantics if the external systems involved (e.g., Kafka 0.11+ with transactions) also provide transactional guarantees.

End‑to‑end exactly‑once means that data from the source to the sink must be processed exactly once, which requires both Flink’s distributed snapshots and two‑phase commit, plus transactional support from the external sink.

The process consists of four stages:

When Flink initiates a checkpoint, the pre‑commit phase begins: the JobManager injects a checkpoint barrier into the data stream at the sources.

The barriers propagate through all operators; as each operator completes its snapshot, the pre‑commit phase finishes.

After every operator has pre‑committed, the commit action is triggered; if any pre‑commit fails, the job rolls back to the latest completed checkpoint.

Once all pre‑commits succeed, the commit must eventually succeed as well; if a commit attempt fails, Flink restarts and retries it rather than rolling back.
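The four stages above amount to a classic two-phase-commit protocol with the JobManager as coordinator. The following simulation is illustrative only (`TxnSink` and `run_checkpoint` are invented names): it pre-commits every participant and commits only when all pre-commits succeed, aborting everywhere otherwise.

```python
class TxnSink:
    """Participant sketch: holds pre-committed data until commit or abort."""
    def __init__(self, fail_precommit=False):
        self.fail_precommit = fail_precommit
        self.pending = []
        self.committed = []

    def pre_commit(self, data):
        if self.fail_precommit:
            raise RuntimeError("pre-commit failed")
        self.pending = list(data)     # written, but not yet visible

    def commit(self):
        self.committed += self.pending
        self.pending = []

    def abort(self):
        self.pending = []             # discard the pre-committed data

def run_checkpoint(sinks, data):
    """Coordinator sketch (stand-in for the JobManager's role)."""
    try:
        for s in sinks:               # stages 1-2: every sink pre-commits
            s.pre_commit(data)
    except RuntimeError:
        for s in sinks:               # stage 3: any failure aborts everywhere
            s.abort()
        return False
    for s in sinks:                   # stage 4: all pre-commits succeeded,
        s.commit()                    # so the commit is issued to every sink
    return True

ok = run_checkpoint([TxnSink(), TxnSink()], ["x"])
bad = run_checkpoint([TxnSink(), TxnSink(fail_precommit=True)], ["x"])
# ok is True (both sinks committed); bad is False (both sinks aborted)
```

Note what the sketch cannot show: in real Flink, a commit that fails after all pre-commits succeeded is retried after recovery, which is why the sink's commit must be idempotent.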

Key operators in this workflow are:

SourceOperator: consumes messages from Kafka and records offsets.

TransformationOperator: processes data and participates in checkpoints.

SinkOperator: writes results back to Kafka.

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
