Understanding Google Dataflow: Model, Windowing, Triggers, and Incremental Processing
This article explains the Google Dataflow model, covering its unified batch‑and‑stream architecture, windowing and triggering mechanisms, core primitives, time domains, and how these concepts form the foundation of modern big‑data stream processing systems.
0. Introduction
Today we continue the discussion on streaming computation. After Alibaba’s acquisition of Apache Flink, Flink’s popularity surged, and together with Apache Spark they dominate real‑time stream processing. Google Dataflow is considered a cornerstone of modern streaming, as described in the book “Streaming Systems”.
“There were two main reasons for Flink’s rise to prominence: Its rapid adoption of the Dataflow/Beam programming model, which put it in the position of being the most semantically capable fully open source streaming system on the planet at the time. Followed shortly thereafter by its highly efficient snapshotting implementation (derived from research in Chandy and Lamport’s original paper ‘Distributed Snapshots: Determining Global States of Distributed Systems’), which gave it the strong consistency guarantees needed for correctness.” — Tyler Akidau, Slava Chernyak, Reuven Lax, Streaming Systems
In short, Flink succeeded because it adopted the Google Dataflow/Beam programming model and implemented a distributed asynchronous snapshot algorithm derived from Chandy‑Lamport.
Apache Spark’s 2018 paper also notes that Structured Streaming combines elements of Google Dataflow, incremental queries and Spark Streaming.
Therefore calling Google Dataflow the foundation of modern stream processing is not an exaggeration. This article reviews the Dataflow model, mainly based on the 2015 VLDB paper “The dataflow model: a practical approach to balancing correctness, latency, and cost in massive‑scale, unbounded, out‑of‑order data processing”.
1. Overview
Google Dataflow provides a unified batch‑and‑stream system and runs on Google Cloud. Internally it uses Flume (Google’s FlumeJava, not Apache Flume) and MillWheel, Google’s internal streaming engine.
The core ideas of the Dataflow model are:
Event‑time ordered processing, feature‑based window aggregation, and a balance among correctness, latency and cost.
Decompose data processing into four dimensions: What, Where, When, How.
These dimensions lead to a decoupling of logical concepts from physical implementation.
The model consists of several components:
Windowing Model : supports non‑aligned event‑time windows.
Triggering Model : declarative API to control when window results are emitted.
Incremental Processing Model : integrates updates into the Window and Trigger models.
Scalable Implementation : SDK built on MillWheel and Flume, hidden from users.
Core Principle : design philosophy of the model.
2. Core Concepts
2.1 Bounded/Unbounded vs Streaming/Batch
Dataflow distinguishes between bounded (finite) and unbounded (infinite) datasets, while “Streaming” and “Batch” refer to the execution engine.
2.2 Window
A window groups a subset of data for operations such as aggregation, outer join, or time‑bounded joins. Three window types are supported: Fixed (Tumbling), Sliding, and Session windows.
2.3 Time Domain
Two time notions are used: Event Time (when the event occurred) and Processing Time (when the system processes the event). Watermarks visualize the skew between them.
3. Dataflow Model
3.1 Core Primitives
Dataflow provides two primitive operations on (key, value) pairs: ParDo (map/flatMap‑like transformation) and GroupByKey (aggregation), the latter requiring a window.
3.2 Window
Dataflow implements GroupByKeyAndWindow with non‑aligned windows via two steps: AssignWindows (assign data to zero or more windows) and MergeWindows (merge overlapping windows). Data are represented as (key, value, event_time, window) tuples.
3.3 Triggers & Incremental Processing
Triggers decide when to emit window results. Three handling strategies exist:
Discarding : results are dropped after a trigger.
Accumulating : results are retained so later data can update them (similar to Lambda architecture).
Accumulating & Retracting : retains results and can retract previous output before emitting an updated result.
4. Summary
The four dimensions map to concrete mechanisms: What → Transformation, Where → Window, When → Watermark & Trigger, How → Discarding/Accumulating/Accumulating & Retracting. This systematic approach influences modern stream engines such as Apache Flink and Apache Spark.
For more details, refer to the original VLDB paper.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
