Big Data 10 min read

Understanding Mini‑Batch Streaming Aggregation in Flink SQL

This article explains Flink SQL’s streaming aggregation Mini‑Batch feature, covering its purpose, configuration parameters, underlying optimizer rules, operator implementations, watermark handling, buffer processing, and the optional Local‑Global two‑phase aggregation optimization for improving throughput and reducing state overhead in large‑scale data pipelines.

Big Data Technology & Architecture

Feb 23, 2022

Understanding Mini‑Batch Streaming Aggregation in Flink SQL

Streaming aggregation (streaming aggregation) is a common scenario in real‑time business logic, but it can easily cause performance issues. Flink SQL allows users to implement streaming aggregation via the GROUP BY clause and provides built‑in optimizations such as Mini‑Batch.

Mini‑Batch Overview – Similar to Spark Streaming’s micro‑batch, Mini‑Batch buffers incoming records in the operator until a size or time threshold is reached, then performs aggregation. This reduces the number of state reads/writes per key, especially beneficial when the state backend (e.g., RocksDB) has high serialization costs.

Enabling Mini‑Batch requires three configuration parameters:

val tEnv: TableEnvironment = ...
val configuration = tEnv.getConfig().getConfiguration()

configuration.setString("table.exec.mini-batch.enabled", "true")          // enable
configuration.setString("table.exec.mini-batch.allow-latency", "5 s")   // latency timeout
configuration.setString("table.exec.mini-batch.size", "5000")          // buffer size

The optimizer rule MiniBatchIntervalInferRule creates the physical node StreamExecMiniBatchAssigner, which is attached after the source. Its translateToPlanInternal() method decides whether to use a processing‑time or event‑time assigner based on the Mini‑Batch mode.

For processing‑time mode, the operator ProcTimeMiniBatchAssignerOperator emits watermarks to mark batch boundaries and registers timers according to the table.exec.mini-batch.allow-latency interval. For event‑time mode, RowTimeMiniBatchAssginerOperator forwards only watermarks that align with the batch interval.

When a group aggregation node StreamExecGroupAggregate detects Mini‑Batch is enabled, it creates a KeyedMapBundleOperator with a MiniBatchGroupAggFunction. The operator maintains a buffer ( bundle) and a BundleTrigger that fires when the buffer reaches the configured size or when a watermark arrives.

@Override
public void processElement(StreamRecord<IN> element) throws Exception {
    final IN input = element.getValue();
    final K bundleKey = getKey(input);
    final V bundleValue = bundle.get(bundleKey);
    final V newBundleValue = function.addInput(bundleValue, input);
    bundle.put(bundleKey, newBundleValue);
    numOfElements++;
    bundleTrigger.onElement(input);
}

@Override
public void finishBundle() throws Exception {
    if (!bundle.isEmpty()) {
        numOfElements = 0;
        function.finishBundle(bundle, collector);
        bundle.clear();
    }
    bundleTrigger.reset();
}

The MiniBatchGroupAggFunction uses code‑generation to create an AggsHandleFunction that performs accumulator updates and emits changelog results.

Local‑Global Optimization – By setting table.optimizer.agg-phase-strategy" to "TWO_PHASE", Flink applies a two‑phase aggregation (local then global) using the optimizer rule TwoStageOptimizedAggregateRule. The physical nodes StreamExecLocalGroupAggregate and StreamExecGlobalGroupAggregate are created, each employing generated functions MiniBatchLocalGroupAggFunction and MiniBatchGlobalGroupAggFunction respectively.

Overall, Mini‑Batch reduces state operation overhead at the cost of added latency, and the optional Local‑Global two‑phase aggregation further mitigates data skew and improves throughput for large‑scale streaming jobs.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Flink SQL Streaming Aggregation Mini-Batch

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.