Big Data 13 min read

Understanding Google Dataflow: Model, Windowing, Triggers, and Incremental Processing

This article explains the Google Dataflow model, covering its unified batch‑and‑stream architecture, windowing and triggering mechanisms, core primitives, time domains, and how these concepts form the foundation of modern big‑data stream processing systems.

Big Data Technology & Architecture

Jul 23, 2019

Understanding Google Dataflow: Model, Windowing, Triggers, and Incremental Processing

0. Introduction

Today we continue the discussion on streaming computation. After Alibaba’s acquisition of Apache Flink, Flink’s popularity surged, and together with Apache Spark they dominate real‑time stream processing. Google Dataflow is considered a cornerstone of modern streaming, as described in the book “Streaming Systems”.

“There were two main reasons for Flink’s rise to prominence: Its rapid adoption of the Dataflow/Beam programming model, which put it in the position of being the most semantically capable fully open source streaming system on the planet at the time. Followed shortly thereafter by its highly efficient snapshotting implementation (derived from research in Chandy and Lamport’s original paper ‘Distributed Snapshots: Determining Global States of Distributed Systems’), which gave it the strong consistency guarantees needed for correctness.” — Tyler Akidau, Slava Chernyak, Reuven Lax, Streaming Systems

In short, Flink succeeded because it adopted the Google Dataflow/Beam programming model and implemented a distributed asynchronous snapshot algorithm derived from Chandy‑Lamport.

Apache Spark’s 2018 paper also notes that Structured Streaming combines elements of Google Dataflow, incremental queries and Spark Streaming.

Therefore calling Google Dataflow the foundation of modern stream processing is not an exaggeration. This article reviews the Dataflow model, mainly based on the 2015 VLDB paper “The dataflow model: a practical approach to balancing correctness, latency, and cost in massive‑scale, unbounded, out‑of‑order data processing”.

1. Overview

Google Dataflow provides a unified batch‑and‑stream system and runs on Google Cloud. Internally it uses Flume (Google’s FlumeJava, not Apache Flume) and MillWheel, Google’s internal streaming engine.

The core ideas of the Dataflow model are:

Event‑time ordered processing, feature‑based window aggregation, and a balance among correctness, latency and cost.

Decompose data processing into four dimensions: What, Where, When, How.

These dimensions lead to a decoupling of logical concepts from physical implementation.

The model consists of several components:

Windowing Model : supports non‑aligned event‑time windows.

Triggering Model : declarative API to control when window results are emitted.

Incremental Processing Model : integrates updates into the Window and Trigger models.

Scalable Implementation : SDK built on MillWheel and Flume, hidden from users.

Core Principle : design philosophy of the model.

2. Core Concepts

2.1 Bounded/Unbounded vs Streaming/Batch

Dataflow distinguishes between bounded (finite) and unbounded (infinite) datasets, while “Streaming” and “Batch” refer to the execution engine.

2.2 Window

A window groups a subset of data for operations such as aggregation, outer join, or time‑bounded joins. Three window types are supported: Fixed (Tumbling), Sliding, and Session windows.

2.3 Time Domain

Two time notions are used: Event Time (when the event occurred) and Processing Time (when the system processes the event). Watermarks visualize the skew between them.

3. Dataflow Model

3.1 Core Primitives

Dataflow provides two primitive operations on (key, value) pairs: ParDo (map/flatMap‑like transformation) and GroupByKey (aggregation), the latter requiring a window.

3.2 Window

Dataflow implements GroupByKeyAndWindow with non‑aligned windows via two steps: AssignWindows (assign data to zero or more windows) and MergeWindows (merge overlapping windows). Data are represented as (key, value, event_time, window) tuples.

3.3 Triggers & Incremental Processing

Triggers decide when to emit window results. Three handling strategies exist:

Discarding : results are dropped after a trigger.

Accumulating : results are retained so later data can update them (similar to Lambda architecture).

Accumulating & Retracting : retains results and can retract previous output before emitting an updated result.

4. Summary

The four dimensions map to concrete mechanisms: What → Transformation, Where → Window, When → Watermark & Trigger, How → Discarding/Accumulating/Accumulating & Retracting. This systematic approach influences modern stream engines such as Apache Flink and Apache Spark.

For more details, refer to the original VLDB paper.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Streaming Windowing Google Cloud Triggers DataFlow

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.