Big Data 14 min read

Understanding State Management and Scaling in Apache Flink

This article explains how Apache Flink uses state for incremental stream processing, describes the different state backends, details the persistence mechanism, and shows how both OperatorState and KeyedState are redistributed during scaling using partition and key‑group algorithms.

Big Data Technology & Architecture

Mar 27, 2019

Understanding State Management and Scaling in Apache Flink

In stream processing, Apache Flink relies on state to perform incremental computations, storing intermediate results such as aggregation values or source offsets so that failures can be recovered without reprocessing the entire data history.

What Is State

State in Flink represents the internal data (both computation results and metadata) that operators keep during execution. It is persisted as snapshots to survive node failures.

Why State Is Needed

Unlike batch jobs, Flink processes data continuously and must retain the result of previous computations. State enables fail‑over recovery and supports incremental processing.

State Implementations

HeapStateBackend – memory‑only, for debugging.

FsStateBackend – stores checkpoints on HDFS, incurs high network I/O.

RocksDBStateBackend – local RocksDB with asynchronous HDFS persistence (the most common choice).

NiagaraStateBackend – Alibaba’s proprietary distributed backend.

State Persistence Logic

Flink typically uses RocksDB + HDFS: data is first written locally to RocksDB, then asynchronously synced to HDFS, combining low‑latency local writes with durable remote storage.

State Classification

Flink distinguishes two main types of state:

KeyedState – scoped to a key (e.g., a GroupBy field). Each key has its own isolated state.

OperatorState – used by source connectors to store offsets and other operator‑wide metadata.

Scaling and State Reallocation

When the parallelism of a job changes, Flink must redistribute state among the new operator instances.

OperatorState Scaling

OperatorState is stored as a List<Tuple2<InputSplit, Long>>, where each tuple holds a partition identifier and its current offset. The redistribution algorithm assigns partitions to instances using a simple modulo operation:

List<Integer> assignedPartitions = new LinkedList<>();
for (int i = 0; i < partitions; i++) {
    if (i % consumerCount == consumerIndex) {
        assignedPartitions.add(i);
    }
}

This ensures each parallel instance receives a disjoint subset of partitions.

KeyedState Scaling

KeyedState cannot be redistributed with a naïve modulo because the state size may be large. Flink uses Key‑Groups as the atomic unit of state distribution. A key‑group is a range of keys; the total number of key‑groups equals the job’s maxParallelism (default 4096).

Key assignment to a key‑group is performed by hashing the key and taking the remainder:

public static int assignToKeyGroup(Object key, int maxParallelism) {
    return computeKeyGroupForKeyHash(key.hashCode(), maxParallelism);
}

public static int computeKeyGroupForKeyHash(int keyHash, int maxParallelism) {
    return HashPartitioner.INSTANCE.partition(keyHash, maxParallelism);
}

During scaling, each operator instance receives a contiguous range of key‑groups calculated by KeyGroupRangeAssignment.computeKeyGroupRangeForOperatorIndex. The algorithm first distributes an equal number of key‑groups to each instance and then spreads any remainder among the first few instances.

public static KeyGroupRange computeKeyGroupRangeForOperatorIndex(
        int maxParallelism, int parallelism, int operatorIndex) {
    GroupRange splitRange = GroupRange.of(0, maxParallelism)
        .getSplitRange(parallelism, operatorIndex);
    int startGroup = splitRange.getStartGroup();
    int endGroup = splitRange.getEndGroup();
    return new KeyGroupRange(startGroup, endGroup - 1);
}

Thus, when scaling from parallelism 2 to 3, most state remains local, and only the key‑groups that move to new instances are transferred over the network.

By keeping the number of key‑groups high (default 4096), Flink minimizes the chance that scaling is blocked by insufficient groups.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

State Management Apache Flink scaling KeyedState OperatorState

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.