Understanding Flink's Asynchronous Barrier Snapshot (ABS) Algorithm for Checkpointing
This article explains how Apache Flink implements fault‑tolerant checkpointing using the Asynchronous Barrier Snapshot (ABS) algorithm, a localized version of the Chandy‑Lamport distributed snapshot, covering barriers, snapshot alignment, exactly‑once versus at‑least‑once semantics, and handling of cyclic dataflow graphs.
Previously the author briefly introduced the Chandy‑Lamport distributed snapshot algorithm; Flink's checkpointing relies on a localized version called the Asynchronous Barrier Snapshot (ABS) algorithm, proposed by Stephen Ewen, Kostas Tzoumas and others in the paper "Lightweight Asynchronous Snapshots for Distributed Dataflows".
Checkpoint & Snapshot
Checkpoints provide fault‑tolerance for Flink streaming jobs. When a failure occurs, Flink restarts the affected operators and restores them to the state captured by the last successful checkpoint, which is essentially a snapshot of the job at a specific moment.
A Flink job can be modeled as a directed graph where vertices are operators and edges are data streams, matching the process‑link model of the original Chandy‑Lamport algorithm.
Each snapshot must contain both the state of each operator and the data carried by the streams.
Operator state is easy to record, but stream state is hard because of its large volume.
Time cannot be stopped; snapshots must avoid a stop‑the‑world approach to prevent latency spikes and throughput loss.
The solution has two key ideas: (1) each operator records its own state to form a global snapshot, and (2) a marker (the checkpoint barrier) partitions the stream into time‑bounded segments.
Barrier
Similar to the marker message in Chandy‑Lamport, Flink uses a checkpoint barrier generated periodically by the JobManager (the interval is set via StreamExecutionEnvironment.enableCheckpointing()) and broadcast to all source operators, flowing downstream with the data.
All data between barrier n‑1 and barrier n belongs to checkpoint n. Downstream operators detect the barrier and trigger snapshot actions without worrying about the impossibility of freezing time.
Snapshotting
Consider a parallelism‑2 WordCount example (a DAG). The snapshot procedure proceeds as follows:
a) Source operators receive the barrier, snapshot their local state (including source offsets), and forward the barrier downstream.
b) & c) Non‑source operators block the input stream that received the barrier, continue consuming other inputs until barriers arrive on all inputs, then snapshot their state and forward the barrier.
d) After snapshotting, operators unblock their inputs and resume processing. Sink operators acknowledge the barrier to the JobManager, completing the checkpoint.
If an operator has a single input, it can snapshot immediately upon receiving the barrier; with multiple inputs, it must wait for all barriers to avoid mixing data from successive checkpoints. This waiting phase is called alignment, and operators use an internal buffer during alignment.
Exactly‑Once vs At‑Least‑Once
The barrier alignment is the foundation of Flink's exactly‑once semantics; it guarantees that operators with multiple inputs correctly separate data belonging to different checkpoints.
For latency‑sensitive applications that can tolerate occasional duplicates, Flink allows at‑least‑once semantics by disabling barrier alignment via StreamExecutionEnvironment.enableCheckpointing(). In this mode, operators do not block on the first barrier, so some data from checkpoint n+1 may be included in checkpoint n and be re‑processed after recovery.
Asynchronous
"Asynchronous" refers to the non‑blocking write of snapshot data to the state backend: after an operator has collected all barriers and triggered a snapshot, it continues processing while the snapshot is written in the background, minimizing latency.
With asynchronous writes, a checkpoint is considered successful only when all stateful operators acknowledge the write in addition to all sinks acknowledging the barrier.
DCG (Directed Cyclic Graph)
If the job’s execution plan contains cycles, a deadlock can occur because operators on the cycle would wait indefinitely for barriers. ABS handles back‑edges specially: the operator at the end of a back‑edge records a local snapshot once all non‑back‑edge inputs have received barriers, then continues to record data arriving from the back‑edge until it sees the same barrier again. This way the back‑edge state is captured and can be restored.
Author: LittleMagic – Original article
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
