Big Data 15 min read

Understanding Apache Flink Architecture, Data Transfer, Event‑Time Processing, State Management, and Checkpointing

This article explains Apache Flink's distributed system architecture—including JobManager, ResourceManager, TaskManager, and Dispatcher—covers session and job deployment modes, data transfer mechanisms, event‑time handling with watermarks, various state types and backends, scaling strategies, and the checkpoint/savepoint recovery process.


Apache Flink is a distributed stream‑processing engine that must manage cluster resources, persistent storage, and failure recovery; its architecture consists of a JobManager (or JobMaster) that builds JobGraphs and ExecutionGraphs, a ResourceManager (often YARN) that allocates slots on TaskManagers, TaskManagers that run multiple task threads, and a Dispatcher that provides a REST API for job submission.

Flink can run in two deployment modes: Session mode, where a long-running Flink cluster (ApplicationMaster and TaskManagers) is started once and each submitted job gets its own JobManager; and Job mode, where a fresh Flink cluster is launched for each job. Example commands:

# start a YARN session with 4 TaskManagers, 4 GB memory each, 4 slots per TM
cd flink-1.7.0/
./bin/yarn-session.sh -n 4 -jm 1024m -tm 4096m -s 4

# submit a single job to YARN (per-job mode)
./bin/flink run -m yarn-cluster -yn 4 -yjm 1024m -ytm 4096m ./examples/batch/WordCount.jar

Application deployment can be done in Framework mode (JAR submitted to Dispatcher/JM/YARN) or Library mode (container image such as Docker for micro‑service style jobs).

Task execution differs between streaming (nodes pre‑started) and batch (nodes launched per stage). Slots are the basic resource unit; a slot may host multiple tasks (slot sharing) or be dedicated to a single task, depending on configuration.

Back-pressure is handled via the local and remote exchange mechanisms: when a downstream task slows down, its buffers fill up and the upstream task is forced to reduce its sending rate; for remote exchanges, Netty's high/low water marks pause transmission once the send buffer exceeds a threshold.
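Newer Flink versions (1.5+) refine remote back-pressure with credit-based flow control: the receiver grants the sender a number of buffer "credits", and the sender pauses when credits run out. A toy pure-Java sketch of that idea (class and method names are illustrative, not the actual Netty implementation):

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of credit-based flow control: the sender may only push
// as many buffers as the receiver has granted credits for.
class CreditBasedChannel {
    private final Queue<String> inFlight = new ArrayDeque<>();
    private int credits;

    CreditBasedChannel(int initialCredits) { this.credits = initialCredits; }

    // Sender side: returns false (back-pressure) when no credits remain.
    boolean trySend(String buffer) {
        if (credits == 0) return false;   // upstream must pause
        credits--;
        inFlight.add(buffer);
        return true;
    }

    // Receiver side: consuming a buffer frees one credit for the sender.
    String receive() {
        credits++;
        return inFlight.poll();
    }
}
```

The benefit over pure TCP/Netty back-pressure is that a single slow channel no longer blocks the whole multiplexed connection.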

Data transfer between TaskManagers uses 32 KB network buffers sent over long-lived TCP connections; when sender and receiver reside in the same TaskManager, records are still serialized into a buffer and deserialized by the receiver, which avoids sharing mutable objects between threads and reduces GC pressure, at the cost of serialization overhead.
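The buffering scheme can be pictured as records serialized, length-prefixed, into a fixed-size buffer that is handed off once full. A minimal sketch, assuming the 32 KB buffer size mentioned above (the class is illustrative, not Flink's actual serializer stack):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy model of Flink's network buffers: records are serialized into a
// fixed-size buffer; a full buffer is handed to the network stack.
class NetworkBufferWriter {
    static final int BUFFER_SIZE = 32 * 1024;           // 32 KB, as in Flink
    private ByteBuffer current = ByteBuffer.allocate(BUFFER_SIZE);
    final List<ByteBuffer> flushed = new ArrayList<>();

    void write(String record) {
        byte[] bytes = record.getBytes(StandardCharsets.UTF_8);
        if (current.remaining() < bytes.length + 4) flush();
        current.putInt(bytes.length).put(bytes);        // length-prefixed record
    }

    void flush() {
        current.flip();                                 // make readable
        flushed.add(current);
        current = ByteBuffer.allocate(BUFFER_SIZE);
    }
}
```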

Event-time processing requires timestamps and watermarks. Watermarks are monotonically increasing special records; a watermark with timestamp T asserts that all future records will carry timestamps greater than T. Operators use timers registered with the internal timer service to emit results when a watermark passes the timer's timestamp. Among user functions, only process functions (e.g., ProcessFunction) can directly read record timestamps and the current watermark.
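The timer/watermark interplay can be sketched in plain Java (a toy model of the internal timer service, not Flink's actual implementation): timers fire exactly when a watermark advances past their timestamp, and watermarks never move backwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model of Flink's internal timer service: timers are registered for
// event-time timestamps and fire when a watermark passes them.
class EventTimeTimerService {
    private final TreeMap<Long, String> timers = new TreeMap<>();
    private long currentWatermark = Long.MIN_VALUE;

    void registerTimer(long timestamp, String payload) {
        timers.put(timestamp, payload);
    }

    // Advancing the watermark fires all timers with timestamp <= watermark.
    List<String> advanceWatermark(long watermark) {
        currentWatermark = Math.max(currentWatermark, watermark); // monotonic
        List<String> fired = new ArrayList<>();
        while (!timers.isEmpty() && timers.firstKey() <= currentWatermark) {
            fired.add(timers.pollFirstEntry().getValue());
        }
        return fired;
    }
}
```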

Flink supports three watermark generation strategies: SourceFunction‑generated timestamps, AssignerWithPeriodicWatermarks (periodic extraction), and AssignerWithPunctuatedWatermarks (per‑record extraction).
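A common periodic strategy tracks the maximum timestamp seen so far and emits a watermark lagging behind it by a fixed out-of-orderness bound. The sketch below mirrors the logic of Flink's built-in BoundedOutOfOrdernessTimestampExtractor in plain Java (the class name here is illustrative):

```java
// Toy periodic watermark generator: watermark = max timestamp seen minus
// a fixed bound allowing for out-of-order records.
class BoundedOutOfOrdernessGenerator {
    private final long maxOutOfOrderness;
    private long maxTimestamp = Long.MIN_VALUE;

    BoundedOutOfOrdernessGenerator(long maxOutOfOrderness) {
        this.maxOutOfOrderness = maxOutOfOrderness;
    }

    // Called for every record (timestamp extraction).
    void onEvent(long eventTimestamp) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    // Called periodically by the framework to emit the current watermark.
    long currentWatermark() {
        return maxTimestamp - maxOutOfOrderness;
    }
}
```

Because the watermark only depends on the running maximum, late records never move it backwards, preserving monotonicity.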

State management is divided into Operator State (scoped to one parallel instance of an operator) and Keyed State (partitioned by key, so each parallel instance holds the state of its assigned keys). Operator State includes ListState, UnionListState, and BroadcastState; Keyed State includes ValueState, ListState, and MapState. State can be stored as raw bytes or as managed state, with managed state being the recommended approach.
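The key property of Keyed State is isolation per key: a function processing a record sees only the state scoped to that record's key. A toy illustration in plain Java (simulating what a keyed ProcessFunction with a ValueState counter does; not the Flink API):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of Keyed State: each key sees its own isolated "ValueState",
// here used to count events per key.
class KeyedCounter {
    private final Map<String, Long> valueStatePerKey = new HashMap<>();

    // Processing a record updates only the state scoped to that record's key.
    long processElement(String key) {
        long updated = valueStatePerKey.getOrDefault(key, 0L) + 1;
        valueStatePerKey.put(key, updated);
        return updated;
    }
}
```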

State backends (MemoryStateBackend, FsStateBackend, RocksDBStateBackend) determine where state is kept and how checkpoints are written; RocksDB supports asynchronous and incremental checkpoints, though it has a 2 GB per‑key/value size limit.

Scaling stateful operators can be achieved by redistributing keyed state, list state, union list state, or broadcast state across parallelism changes.
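For keyed state, redistribution works through key groups: keys are hashed into a fixed number of key groups (the max parallelism), and whole key groups are reassigned when parallelism changes, so individual keys never need to be re-shuffled. A sketch of the assignment logic (Flink additionally murmur-hashes the key's hashCode; this simplified version uses hashCode directly):

```java
// Sketch of Flink's key-group mechanism for rescaling keyed state:
// keys map to a fixed number of key groups, and key groups map to
// operator instances; rescaling only reassigns whole key groups.
class KeyGroupAssignment {
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.abs(key.hashCode() % maxParallelism);
    }

    // Same formula as Flink's computeOperatorIndexForKeyGroup.
    static int operatorIndexFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }
}
```

List state is redistributed by splitting the list entries across the new instances, while union list state broadcasts the full list to every instance on restore.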

Flink's lightweight checkpointing builds on a variant of the Chandy-Lamport algorithm, using checkpoint barriers that travel with the stream. The JobManager triggers a checkpoint by instructing each source to inject a barrier into its stream; a downstream task with multiple inputs aligns the barriers, buffering records from inputs whose barrier has already arrived, then snapshots its local state and broadcasts the barrier to its outputs. The checkpoint is complete once every task, including the sinks, has acknowledged its snapshot to the JobManager.
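Barrier alignment for a multi-input task can be sketched as follows (a toy model, not Flink's actual CheckpointBarrierHandler): once a barrier arrives on one input channel, further records from that channel are held back until barriers have arrived on all channels, at which point the task may snapshot.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Toy model of checkpoint barrier alignment for a task with several
// input channels.
class BarrierAligner {
    private final int numChannels;
    private final Set<Integer> blockedChannels = new HashSet<>();
    final Queue<String> buffered = new ArrayDeque<>();

    BarrierAligner(int numChannels) { this.numChannels = numChannels; }

    // Returns true if the record may be processed now, false if buffered.
    boolean onRecord(int channel, String record) {
        if (blockedChannels.contains(channel)) {
            buffered.add(record);       // hold back until alignment completes
            return false;
        }
        return true;
    }

    // Returns true once barriers from all channels have arrived,
    // i.e. the task may snapshot its state and forward the barrier.
    boolean onBarrier(int channel) {
        blockedChannels.add(channel);
        if (blockedChannels.size() == numChannels) {
            blockedChannels.clear();    // snapshot point; unblock channels
            return true;
        }
        return false;
    }
}
```

The buffered records are replayed after the snapshot, which is exactly the alignment cost that the at-least-once mode avoids.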

Optimizations include asynchronous and incremental checkpoints for large state (RocksDB), and the option to skip barrier alignment and keep processing records during a checkpoint, which weakens the guarantee to at-least-once.

Savepoints extend checkpoints with additional metadata, are created manually, and can be used to restart jobs with different parallelism, migrate to new clusters, or debug faulty runs.

Recovery from consistent checkpoints relies on sources being resettable (e.g., Kafka) and on exactly‑once sink implementations or idempotent updates to avoid duplicate downstream results.
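Idempotent updates make replay after recovery safe because re-writing a result leaves the sink unchanged. A minimal sketch of an upsert-style idempotent sink (the class is illustrative, standing in for e.g. a keyed upsert into a database):

```java
import java.util.HashMap;
import java.util.Map;

// Toy idempotent sink: upserts keyed results so that replaying records
// after a restore from checkpoint does not produce duplicates.
class IdempotentSink {
    final Map<String, Long> store = new HashMap<>();

    void upsert(String key, long value) {
        store.put(key, value);   // re-writing the same result is harmless
    }
}
```

Without such a sink (or a transactional two-phase-commit sink), replaying records between the restored checkpoint and the failure point would emit duplicate results downstream.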

Reference: Stream Processing with Apache Flink by Vasiliki Kalavri and Fabian Hueske.

Tags: big data, stream processing, state management, Apache Flink, event time, checkpointing
Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
