Master Apache Beam: Build a Portable Word Count Pipeline in Minutes
This tutorial introduces Apache Beam’s unified programming model for batch and streaming, explains its core concepts and terminology, compares it with other runners, and walks through a complete Java word‑count example—including dependencies, pipeline construction, transforms, and execution with DirectRunner.
1. Overview
In this tutorial we introduce Apache Beam and explore its basic concepts. We first demonstrate use cases and benefits, then present core terminology, and finally illustrate all important aspects with a simple example.
2. What is Apache Beam?
Apache Beam (Batch+Stream) is a unified programming model for batch and streaming data processing jobs. It provides an SDK to define and build pipelines and runners that execute them on various distributed back‑ends such as Apache Apex, Flink, Gearpump, Samza, Spark, Google Cloud Dataflow, and Hazelcast Jet.
3. Why Choose Apache Beam?
Beam merges batch and streaming processing, making it easy to switch between them. It improves portability and flexibility by focusing on logical pipeline definitions rather than underlying details, and supports multiple SDK languages (Java, Python, Go, Scala).
4. Core Concepts
The key concepts are:
PCollection – a bounded or unbounded dataset.
PTransform – an operation that takes one or more PCollections and produces zero or more PCollections.
Pipeline – a directed acyclic graph of PCollections and PTransforms.
PipelineRunner – executes the pipeline on a chosen distributed back‑end.
5. Word Count Example
We design a word‑count pipeline with the following steps: read lines, split into words, lowercase, trim punctuation, filter stopwords, and count unique words.
5.1 Build the Beam pipeline
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-sdks-java-core</artifactId>
<version>${beam.version}</version>
</dependency>We add the DirectRunner runtime dependency:
<dependency>
<groupId>org.apache.beam</groupId>
<artifactId>beam-runners-direct-java</artifactId>
<version>${beam.version}</version>
<scope>runtime</scope>
</dependency>5.3 Implementation
Creating the pipeline:
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);Six‑step word‑count pipeline:
PCollection<KV<String, Long>> wordCount = p
.apply("(1) Read all lines", TextIO.read().from(inputFilePath))
.apply("(2) Flatmap to a list of words", FlatMapElements.into(TypeDescriptors.strings())
.via(line -> Arrays.asList(line.split("\\s"))))
.apply("(3) Lowercase all", MapElements.into(TypeDescriptors.strings())
.via(word -> word.toLowerCase()))
.apply("(4) Trim punctuations", MapElements.into(TypeDescriptors.strings())
.via(word -> trim(word)))
.apply("(5) Filter stopwords", Filter.by(word -> !isStopWord(word)))
.apply("(6) Count words", Count.perElement());Each apply() call performs a specific transformation, from reading the input file to counting occurrences.
Finally we write the results:
wordCount.apply(MapElements.into(TypeDescriptors.strings())
.via(count -> count.getKey() + " --> " + count.getValue()))
.apply(TextIO.write().to(outputFilePath));Running the pipeline:
p.run().waitUntilFinish();The output consists of word‑count pairs such as apache --> 3, beam --> 5, etc.
6. Conclusion
We have learned what Apache Beam is, why it is popular, and how to implement a basic word‑count job using its unified model.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
