Big Data 15 min read

Flink Checkpoint Principle Analysis and Failure Cause Investigation

The article thoroughly explains Apache Flink’s checkpoint mechanism—including state types, coordinator workflow, exactly‑once versus at‑least‑once semantics, common failure sources such as code exceptions, storage or network issues, and practical configuration tips like interval settings, local recovery and externalized checkpoints.

Youzan Coder

Feb 28, 2020

Flink Checkpoint Principle Analysis and Failure Cause Investigation

This article provides a comprehensive analysis of Apache Flink Checkpoint mechanism, which is a fault tolerance recovery mechanism ensuring real-time programs can recover from exceptions or machine failures. The content covers the fundamental concepts of Flink Checkpoint and state management, including Operator State and KeyedState (such as MapState, ListState, ValueState).

The article explains the Checkpoint workflow in detail: JobManager triggers Checkpoint through CheckpointCoordinator, broadcasts CheckpointBarrier to downstream operators via Source tasks, and TaskManager performs checkpoint after receiving all input barriers. The Checkpoint completes only when all Sink operators confirm completion. Key methods in CheckpointCoordinator include triggerCheckpoint, triggerSavepoint, restoreSavepoint, restoreLatestCheckpointedState, and receiveAcknowledgeMessage.

Two semantic options are discussed: Exactly_Once (default) ensures each data affects state only once, while At_Least_Once may calculate data multiple times. The difference lies in CheckpointBarrier alignment: Exactly_Once blocks data processing until all barriers align, whereas At_Least_Once continues processing.

Common failure causes include: unhandled exceptions in user code, external storage interaction failures (Kafka, Redis, HBase), memory issues/OOM, and network problems. The article provides configuration recommendations such as setting minimum checkpoint interval, enabling local recovery with state.backend.local-recovery, and increasing checkpoint retention count.

Code example for enabling Checkpoint:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.enableCheckpointing(10000);

env.getCheckpointConfig().enableExternalizedCheckpoints(ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

State Management Apache Flink fault tolerance Real-Time Computing Checkpoint Exactly-once

Written by

Youzan Coder

Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.