Understanding State Management and Scaling in Apache Flink
This article explains how Apache Flink uses state for incremental stream processing, describes the different state backends, details the persistence mechanism, and shows how both OperatorState and KeyedState are redistributed during scaling using partition and key‑group algorithms.
In stream processing, Apache Flink relies on state to perform incremental computations, storing intermediate results such as aggregation values or source offsets so that failures can be recovered without reprocessing the entire data history.
What Is State
State in Flink represents the internal data (both computation results and metadata) that operators keep during execution. It is persisted as snapshots to survive node failures.
Why State Is Needed
Unlike batch jobs, Flink processes data continuously and must retain the result of previous computations. State enables fail‑over recovery and supports incremental processing.
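The idea of incremental processing with recoverable state can be sketched in plain Java. This is not Flink API: the class `RunningSum` and its methods are hypothetical names chosen for illustration. The sketch keeps a running sum and an input offset as state, copies them into a snapshot, and after a simulated failure resumes from that snapshot instead of replaying the whole input.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only (not a Flink class): a running sum whose
// state (sum + offset) can be snapshotted and restored.
class RunningSum {
    long sum = 0;    // intermediate aggregation result
    int offset = 0;  // index of the next record to read

    // Process records from the current offset up to (not including) `end`.
    void processUpTo(List<Integer> input, int end) {
        while (offset < end) {
            sum += input.get(offset);
            offset++;
        }
    }

    // A snapshot is simply a copy of the state: {sum, offset}.
    long[] snapshot() {
        return new long[] {sum, offset};
    }

    // Rebuild an operator from a snapshot, e.g. after a failure.
    static RunningSum restore(long[] snap) {
        RunningSum r = new RunningSum();
        r.sum = snap[0];
        r.offset = (int) snap[1];
        return r;
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(5, 1, 4, 2, 8);
        RunningSum op = new RunningSum();
        op.processUpTo(input, 3);
        long[] snap = op.snapshot();                  // state after 3 records

        // Simulated failure: only the last two records are replayed.
        RunningSum recovered = RunningSum.restore(snap);
        recovered.processUpTo(input, input.size());
        System.out.println(recovered.sum);            // prints 20
    }
}
```

Without the snapshot, the recovered operator would have to start at offset 0 and re-read every record to reach the same result.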
State Implementations
HeapStateBackend – memory‑only, for debugging.
FsStateBackend – keeps working state on the JVM heap and writes checkpoints to a file system such as HDFS, which incurs high network I/O.
RocksDBStateBackend – local RocksDB with asynchronous HDFS persistence (the most common choice).
NiagaraStateBackend – Alibaba’s proprietary distributed backend.
State Persistence Logic
Flink typically uses RocksDB + HDFS: data is first written locally to RocksDB, then asynchronously synced to HDFS, combining low‑latency local writes with durable remote storage.
State Classification
Flink distinguishes two main types of state:
KeyedState – scoped to a key (e.g., a GroupBy field). Each key has its own isolated state.
OperatorState – scoped to a parallel operator instance rather than to a key; used, for example, by source connectors to store offsets and other operator‑wide metadata.
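The difference between the two kinds of state can be sketched as plain data structures. The class below is a hypothetical illustration, not Flink's state API: keyed state is modeled as one value per key, and operator state as a list of partition offsets owned by the operator instance as a whole.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; these are not Flink classes.
class StateKindsSketch {
    // Keyed state: one independent value per key (here, a count per key).
    final Map<String, Long> countPerKey = new HashMap<>();

    // Operator state: owned by the operator instance as a whole, here the
    // next offset to read for each source partition, as {partition, offset}.
    final List<long[]> partitionOffsets = new ArrayList<>();

    // Update only the state that belongs to this record's key.
    void onRecord(String key) {
        countPerKey.merge(key, 1L, Long::sum);
    }

    // Record progress on one partition, adding the entry if it is new.
    void advance(int partition, long newOffset) {
        for (long[] entry : partitionOffsets) {
            if (entry[0] == partition) {
                entry[1] = newOffset;
                return;
            }
        }
        partitionOffsets.add(new long[] {partition, newOffset});
    }
}
```

Keyed state grows with the number of distinct keys, while the operator state in this sketch stays as small as the number of partitions. That size difference is why the two are redistributed differently when a job is rescaled.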
Scaling and State Reallocation
When the parallelism of a job changes, Flink must redistribute state among the new operator instances.
OperatorState Scaling
OperatorState is stored as a List<Tuple2<InputSplit, Long>>, where each tuple holds a partition identifier and its current offset. The redistribution algorithm assigns partitions to instances using a simple modulo operation:
List<Integer> assignedPartitions = new LinkedList<>();
for (int i = 0; i < partitions; i++) {
    if (i % consumerCount == consumerIndex) {
        assignedPartitions.add(i);
    }
}
This ensures each parallel instance receives a disjoint subset of partitions.
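As a worked example, the modulo rule can be wrapped in a small method and run for two parallelism values. The class and method names (`PartitionAssignmentSketch`, `assign`) are hypothetical, and the partition count of 6 is an arbitrary choice for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around the modulo rule described in the text.
class PartitionAssignmentSketch {
    // Partitions owned by instance `consumerIndex` out of `consumerCount`.
    static List<Integer> assign(int partitions, int consumerCount, int consumerIndex) {
        List<Integer> assigned = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            if (i % consumerCount == consumerIndex) {
                assigned.add(i);
            }
        }
        return assigned;
    }

    public static void main(String[] args) {
        int partitions = 6; // arbitrary example value
        for (int parallelism : new int[] {2, 3}) {
            for (int idx = 0; idx < parallelism; idx++) {
                System.out.println("parallelism=" + parallelism
                        + " instance " + idx + " -> "
                        + assign(partitions, parallelism, idx));
            }
        }
    }
}
```

With six partitions, going from 2 to 3 instances changes the assignment from [0, 2, 4] and [1, 3, 5] to [0, 3], [1, 4], and [2, 5]. Most partitions change owner, but each entry is only a partition identifier and an offset, so reshuffling them is cheap.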
KeyedState Scaling
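KeyedState cannot be redistributed with a naïve modulo over the parallelism: changing the parallelism would reassign almost every key, and because keyed state may be large, each new instance would have to scan the whole state to find its keys. Flink instead uses Key‑Groups as the atomic unit of state distribution. A key‑group is the set of keys that hash to the same group index; the total number of key‑groups equals the job’s maxParallelism (4096 in the setup described here; open‑source Flink, when the value is not set explicitly, derives a default between 128 and 32768 from the job’s parallelism).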
Key assignment to a key‑group is performed by hashing the key and taking the remainder:
public static int assignToKeyGroup(Object key, int maxParallelism) {
    return computeKeyGroupForKeyHash(key.hashCode(), maxParallelism);
}

public static int computeKeyGroupForKeyHash(int keyHash, int maxParallelism) {
    return HashPartitioner.INSTANCE.partition(keyHash, maxParallelism);
}
During scaling, each operator instance receives a contiguous range of key‑groups calculated by KeyGroupRangeAssignment.computeKeyGroupRangeForOperatorIndex. The algorithm first distributes an equal number of key‑groups to each instance and then spreads any remainder among the first few instances.
public static KeyGroupRange computeKeyGroupRangeForOperatorIndex(
        int maxParallelism, int parallelism, int operatorIndex) {
    GroupRange splitRange = GroupRange.of(0, maxParallelism)
            .getSplitRange(parallelism, operatorIndex);
    int startGroup = splitRange.getStartGroup();
    int endGroup = splitRange.getEndGroup();
    // endGroup is exclusive; KeyGroupRange takes an inclusive end.
    return new KeyGroupRange(startGroup, endGroup - 1);
}
Thus, when scaling from parallelism 2 to 3, most state remains local, and only the key‑groups that move to new instances are transferred over the network.
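The following self-contained sketch implements the range split as it is described in the text: an equal share for every instance, with the remainder spread over the first few instances. It does not reproduce Flink's source. The class `KeyGroupSketch` and its methods are hypothetical, `Math.floorMod` on `hashCode()` stands in for the real hash partitioner, and maxParallelism is set to 10 only to keep the output readable.

```java
// Hypothetical sketch of key-group assignment; not Flink's implementation.
class KeyGroupSketch {
    // Key -> key-group: hash the key and take the non-negative remainder.
    // Assumption: plain hashCode() replaces the real hash partitioner.
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // Inclusive {start, end} range of key-groups for one instance.
    // Every instance gets maxParallelism / parallelism groups; the first
    // (maxParallelism % parallelism) instances get one extra group.
    static int[] rangeFor(int maxParallelism, int parallelism, int operatorIndex) {
        int base = maxParallelism / parallelism;
        int remainder = maxParallelism % parallelism;
        int start = operatorIndex * base + Math.min(operatorIndex, remainder);
        int size = base + (operatorIndex < remainder ? 1 : 0);
        return new int[] {start, start + size - 1};
    }

    // Instance that owns a given key-group under this scheme.
    static int ownerOf(int keyGroup, int maxParallelism, int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            int[] r = rangeFor(maxParallelism, parallelism, i);
            if (keyGroup >= r[0] && keyGroup <= r[1]) {
                return i;
            }
        }
        throw new IllegalArgumentException("key-group out of range: " + keyGroup);
    }

    public static void main(String[] args) {
        int maxParallelism = 10; // small value chosen for readability
        for (int p : new int[] {2, 3}) {
            for (int i = 0; i < p; i++) {
                int[] r = rangeFor(maxParallelism, p, i);
                System.out.println("parallelism=" + p + " instance " + i
                        + " -> [" + r[0] + ", " + r[1] + "]");
            }
        }
    }
}
```

Under these assumptions, parallelism 2 yields the ranges [0, 4] and [5, 9], and parallelism 3 yields [0, 3], [4, 6], and [7, 9]. Key-groups 0 to 3 stay on instance 0 and key-groups 5 and 6 stay on instance 1; only key-group 4 and key-groups 7 to 9 change owner. A key never changes its key-group, because the hash is taken modulo maxParallelism, which rescaling leaves untouched.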
Because a key‑group cannot be split, a job’s parallelism can never exceed its number of key‑groups. By keeping that number high (default 4096), Flink minimizes the chance that scaling is blocked by insufficient groups.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
