An Introduction to Apache Beam and Its Beam Model for Unified Batch and Stream Processing
This article introduces Apache Beam, its Beam Model, and how the Beam SDK enables developers to write unified, flexible pipelines for both bounded batch jobs and unbounded streaming workloads, illustrating concepts with mobile‑gaming examples and detailed code snippets.
Apache Beam, which grew out of the Google Cloud Dataflow SDK and model, was contributed by Google to the Apache Software Foundation in February 2016 and is widely seen as another major Google contribution to the big‑data ecosystem, following MapReduce, GFS, and BigQuery. Its main goal is to unify the batch and stream processing paradigms, providing a simple, flexible, feature‑rich SDK for handling unbounded, out‑of‑order, web‑scale datasets.
The Beam project focuses on the programming model and API definition, not on any specific execution engine. Beam aims to let programs written with its SDK run on any distributed compute engine. This article introduces the Beam Model and shows how the Beam SDK can be used to write distributed data‑processing logic, giving readers a first‑hand understanding of Beam and how streaming systems handle unordered infinite data.
Apache Beam Architecture
As distributed data processing has evolved, many frameworks have emerged, from early Hadoop MapReduce to Apache Spark and Storm, and more recently Flink and Apex. Each new framework promises higher performance, richer features, and lower latency, but switching comes at a high cost: learning a new framework and rewriting all existing business logic.
The solution consists of two parts: a programming model that unifies and standardizes distributed data‑processing requirements (e.g., batch and stream), and pipelines that can be executed on any runner, allowing users to switch execution engines freely. Apache Beam was created to address these challenges.
Beam consists of the Beam SDK and Beam Runners. The SDK defines the API for writing business logic; the generated pipeline is handed to a Runner for execution. Currently the SDK is implemented in Java, with a Python version under development. Supported runners include Apache Flink, Apache Spark, Google Cloud Dataflow, and discussions are ongoing for Storm, Hadoop, Gearpump, etc. The basic architecture is shown in Figure 1.
Figure 1 Apache Beam architecture diagram
Note that while the Beam community strives for full feature parity across runners, some implementations may lack certain capabilities (e.g., MapReduce‑based runners struggle with streaming features). Google Cloud Dataflow offers the most complete support, and among open‑source runners, Apache Flink is the most feature‑complete.
Beam Model
The Beam Model is the programming paradigm behind the Beam SDK. Before describing it, we briefly introduce the problem domain and basic concepts.
Data. Distributed processing deals with two data types: bounded (finite) datasets such as files in HDFS or HBase tables, which exist beforehand and are persisted; and unbounded (infinite) streams such as Kafka logs or Twitter firehose, which arrive continuously and cannot be fully persisted.
Generally, batch frameworks target bounded data, while streaming frameworks target unbounded data. From a logical perspective the two are similar; the same computation (e.g., hourly sum of retweets) should run on both without changing business logic.
Time. Processing time is when a record arrives at the processing system; event time is when the record was actually generated. The two often differ: a tweet created at 12:00 may not be processed until 12:01:30. Batch jobs usually ignore timestamps, whereas streaming jobs typically need to window data by event time.
Out‑of‑order. In a stream, records may arrive out of event‑time order. If windows are defined on processing time, window membership simply follows arrival order, but event‑time windows must cope with late‑arriving data, which is a genuinely hard problem.
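The difference between the two notions of time can be made concrete with a small plain‑Java sketch (this is not Beam code; the class and method names are ours) that assigns a record to a one‑hour window by event time, so a record that arrives late still lands in the window where it was generated:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class EventTimeWindows {
    // Truncate a timestamp to the start of its one-hour window.
    static Instant windowStart(Instant timestamp) {
        return timestamp.truncatedTo(ChronoUnit.HOURS);
    }

    public static void main(String[] args) {
        // A record generated at 11:59 but not processed until 12:01:30:
        Instant eventTime = Instant.parse("2016-07-01T11:59:00Z");
        Instant processingTime = Instant.parse("2016-07-01T12:01:30Z");

        // Event-time windowing puts the record in the [11:00, 12:00) window
        // where it was generated, regardless of when it arrived...
        System.out.println(windowStart(eventTime));      // 2016-07-01T11:00:00Z
        // ...while processing-time windowing would assign it to [12:00, 13:00).
        System.out.println(windowStart(processingTime)); // 2016-07-01T12:00:00Z
    }
}
```

The same record ends up in different windows depending on which notion of time is used, which is exactly why out‑of‑order arrival is only a problem for event‑time windows.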
The Beam Model targets infinite, out‑of‑order event‑time streams (finite datasets are a special case). It abstracts the processing concerns into four dimensions:
What – what computation to perform (e.g., sum, join, model training). Specified by operators in a Pipeline.
Where – the scope over which data is computed (e.g., fixed or sliding windows based on processing time or event time). Specified by windowing in the Pipeline.
When – when results are emitted (e.g., every minute within a one‑hour event‑time window). Controlled by watermarks and triggers.
How – how late data is handled (e.g., emit incremental results, combine with on‑time results, or discard). Specified by accumulation mode.
These four dimensions form the core of the Beam Model; to build a pipeline, developers simply invoke the SDK APIs corresponding to each dimension, independent of the underlying execution engine.
Beam SDK
Unlike Flink or Spark, the Beam SDK uses a single unified API to represent sources, sinks, and transforms. Below we walk through four example Beam pipelines for a mobile‑gaming scenario:
User Score – batch job that computes each user’s total score from a bounded dataset.
Team Score per Hour – batch job that aggregates team scores hourly.
Leaderboard – streaming job that produces two metrics: hourly team scores and each user’s real‑time cumulative score.
Game Stats – streaming job that aggregates hourly team scores along with more complex per‑hour metrics, such as each user’s online time.
Note: example code is taken from Beam’s source repository (apache/incubator-beam).
Below we analyse each task using the WWWH dimensions and show the corresponding Beam SDK code.
User Score
Computing each user’s cumulative score is a simple batch task; re‑run the job whenever new score data arrives. The WWWH analysis for this task is:
What – extract (user, score) pairs from the raw events and sum the scores per user.
Where – a single global window covering the entire bounded dataset.
When – emit the results once, when the dataset has been fully processed.
How – not applicable: a bounded dataset has no late data.
In the Java Beam SDK, the “What” of this task is implemented by a composite transform, ExtractAndSumScore, which groups records by user and sums their scores: MapElements extracts a key/value pair from each record, and Sum aggregates the values per key. Because Beam allows several operations to be fused into a single named transform like this, the pipeline stays clear and the logic is easy to reuse.
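To show the logic concretely without pulling in Beam as a dependency, here is a plain‑Java sketch of the same computation (the GameEvent class and its fields are illustrative, not from the Beam example code); in the real pipeline, this per‑key sum is what MapElements plus Sum express:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UserScoreSketch {
    // One parsed score record; the field names here are illustrative.
    static class GameEvent {
        final String user;
        final int score;
        GameEvent(String user, int score) { this.user = user; this.score = score; }
    }

    // The "What" dimension: extract (user, score) pairs and sum per key.
    static Map<String, Integer> extractAndSumScore(List<GameEvent> events) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (GameEvent e : events) {
            totals.merge(e.user, e.score, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<GameEvent> events = new ArrayList<>();
        events.add(new GameEvent("alice", 12));
        events.add(new GameEvent("bob", 7));
        events.add(new GameEvent("alice", 3));
        System.out.println(extractAndSumScore(events)); // {alice=15, bob=7}
    }
}
```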
Team Score per Hour
Aggregating team scores per hour adds a windowing requirement but can still be expressed as a batch job. The WWWH analysis is:
What – extract (team, score) pairs and sum the scores per team.
Where – one‑hour fixed windows based on event time.
When – emit each window’s result once, when the dataset has been fully processed.
How – not applicable: a bounded dataset has no late data.
Compared with the User Score task, only the “Where” dimension changes (window definition) and the grouping key switches from user to team; the “What” logic remains the same.
AddEventTimestamps extracts event‑time timestamps; FixedWindowsTeam defines a one‑hour fixed window; ExtractAndSumScore is reused with team as the key. The new windowing code is independent of the computation code, making the pipeline easy to extend.
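Again sketching the logic in plain Java rather than the Beam API (class and field names are ours): only the “Where” changes relative to the user‑score task. Each record is assigned to a one‑hour fixed window by its event time, and the unchanged per‑key sum is now keyed by team plus window:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HourlyTeamScoreSketch {
    static class GameEvent {
        final String team;
        final int score;
        final Instant eventTime; // extracted from the record, as AddEventTimestamps does
        GameEvent(String team, int score, Instant eventTime) {
            this.team = team; this.score = score; this.eventTime = eventTime;
        }
    }

    // "Where": assign each record to a one-hour fixed window by event time;
    // the "What" (sum per key) is unchanged, the key is now team + window.
    static Map<String, Integer> teamScorePerHour(List<GameEvent> events) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (GameEvent e : events) {
            Instant window = e.eventTime.truncatedTo(ChronoUnit.HOURS);
            totals.merge(e.team + "@" + window, e.score, Integer::sum);
        }
        return totals;
    }
}
```

Because the windowing step is separated from the summing step, swapping in a different window size (or switching the key back to user) leaves the rest of the logic untouched, which mirrors the modularity the Beam pipeline achieves.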
Leaderboard
The first two tasks are batch jobs on bounded data; the leaderboard requires real‑time results on unbounded streams. With Beam, the same logical pipeline can be used for both batch and streaming; only the source and sink differ.
Streaming jobs can emit intermediate results before a window is complete, providing low‑latency analytics (e.g., hourly team scores updated every minute).
Ensuring correctness despite out‑of‑order data involves handling watermarks, late data, and accumulation modes, allowing low latency while guaranteeing final results.
These concerns correspond to the “When” and “How” dimensions. “When” can be divided into four stages:
Early – emit early results before the window closes.
On‑Time – emit results when the watermark indicates the window’s input is complete, i.e., when the watermark passes the end of the window.
Late – handle data that arrives after the window’s on‑time emission.
Final – after a maximum allowed lateness (e.g., two hours), discard further late data and clean up state.
For the hourly team‑score streaming task, the desired logic is: use an event‑time one‑hour window, emit team scores every five minutes, emit updated results for late data every ten minutes, and discard data arriving more than two hours after the window end. The WWWH representation is shown in Figure 2.
In Beam SDK code, the four dimensions map to: LeaderboardTeamFixedWindows (Where – window definition), Trigger (When – output conditions), Accumulation (How – result handling), ExtractTeamScore (What – computation).
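The interplay of allowed lateness and accumulation is easiest to see in a small simulation. The sketch below is plain Java, not the Beam API (the class, method, and constant names are ours): it models one‑hour event‑time windows with two hours of allowed lateness and accumulating panes, so each trigger firing would re‑emit the full updated total, while data arriving after window end plus allowed lateness is discarded (the “Final” stage):

```java
import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.LinkedHashMap;
import java.util.Map;

public class LeaderboardSketch {
    static final Duration ALLOWED_LATENESS = Duration.ofHours(2);

    // Accumulated team totals per window; with accumulating panes, each
    // trigger firing would re-emit the full updated total for the window.
    final Map<String, Integer> totals = new LinkedHashMap<>();

    // Returns true if the record was applied, false if it arrived after
    // window end + allowed lateness and was therefore discarded.
    boolean addScore(String team, int score, Instant eventTime, Instant watermark) {
        Instant windowStart = eventTime.truncatedTo(ChronoUnit.HOURS);
        Instant windowEnd = windowStart.plus(Duration.ofHours(1));
        if (watermark.isAfter(windowEnd.plus(ALLOWED_LATENESS))) {
            return false; // "Final" stage: state for this window has been cleaned up
        }
        totals.merge(team + "@" + windowStart, score, Integer::sum);
        return true;
    }
}
```

A record for the [12:00, 13:00) window is still accepted while the watermark is before 15:00 (window end plus two hours), but discarded afterwards; in Beam, the same decision is expressed declaratively through the trigger and allowed‑lateness settings rather than hand‑written state checks.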
Conclusion
The Beam Model provides an elegant abstraction for processing infinite, out‑of‑order streams. Its WWWH dimensions clearly separate business logic from execution concerns, enabling unified handling of bounded and unbounded data. Many projects such as Flink and Spark Streaming have adopted ideas from Beam. This article introduced the Beam Model and demonstrated how to design real‑world data‑processing tasks with it. Apache Beam is currently an incubating project; readers can follow the official website or mailing lists for updates.
References
1. Apache Beam (incubating), https://beam.apache.org
2. Tyler Akidau, “The world beyond batch: Streaming 101”, https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
3. Tyler Akidau, “The world beyond batch: Streaming 102”, https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
4. “Dataflow/Beam & Spark: A Programming Model Comparison”, https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison