Big Data 15 min read

Understanding Fault Tolerance and Exactly-Once Semantics in Apache Flink

This article explains Apache Flink's fault‑tolerance mechanisms, including checkpointing, barrier alignment, the differences between At‑Least‑Once and Exactly‑Once semantics, configuration options, incremental checkpointing, and the requirements for external sources and sinks to achieve end‑to‑end exactly‑once processing.

Big Data Technology & Architecture

Mar 13, 2019

Understanding Fault Tolerance and Exactly-Once Semantics in Apache Flink

In streaming scenarios, data continuously flows into Apache Flink, triggering computation for each record. When a task fails due to network or machine issues, Flink relies on its state and fault‑tolerance mechanisms to recover.

Fault tolerance means the system can automatically detect failures and resume normal operation without losing correctness. Flink achieves this by periodically creating distributed snapshots of the data stream and its state through a process called Checkpointing .

Checkpointing works by inserting special barriers into the data stream. Sources emit barriers, which travel downstream; when all upstream barriers for a checkpoint align at an operator, the operator takes a snapshot of its state and forwards the barrier further. This alignment ensures a consistent global state across the directed acyclic graph (DAG).

The snapshots are stored in a StateBackend and can be used to restore the job from the last successful checkpoint during a failover.

Flink supports two processing guarantees:

At‑Least‑Once – every record is processed at least once (no data loss, but possible duplicates).

Exactly‑Once – every record is processed exactly once (no loss and no duplication), which incurs higher latency.

The difference lies in how early‑arriving events are handled before barrier alignment. In Exactly‑Once mode, early events are buffered until all barriers arrive, increasing latency but preventing duplicate processing. In At‑Least‑Once mode, events are processed immediately, reducing latency but allowing duplicates.

Implementation details involve two classes that implement CheckpointBarrierHandler:

public interface CheckpointBarrierHandler {
    ...
    BufferOrEvent getNextNonBlocked() throws Exception;
    ...
}

• BarrierBuffer (Exactly‑Once) blocks input channels until all barriers align, buffering events in the meantime.

• BarrierTracker (At‑Least‑Once) simply tracks barrier arrival without blocking, allowing immediate processing.

Configuration parameters for checkpointing include: checkpointMode – AT_LEAST_ONCE or EXACTLY_ONCE checkpointInterval – interval in milliseconds checkpointTimeout – timeout in milliseconds

Externalized checkpoint cleanup options: RETAIN_ON_CANCELLATION or DELETE_ON_CANCELLATION Incremental Checkpointing reduces the size of checkpoint files by only persisting changes since the previous checkpoint, which is crucial for long‑running jobs with large state.

End‑to‑end Exactly‑Once also requires support from external sources (to provide a recoverable position) and sinks (typically using a two‑phase commit protocol, e.g., Kafka).

The overall checkpointing workflow is illustrated by the following diagram:

Additional diagrams show barrier alignment in a word‑count job and the incremental checkpointing process.

Understanding these mechanisms helps developers design robust Flink applications and prepares them for deeper topics such as windowing and retraction.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Apache Flink fault tolerance Exactly-once checkpointing

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.