Big Data 8 min read

Master Apache Beam: Build a Portable Word Count Pipeline in Minutes

This tutorial introduces Apache Beam’s unified programming model for batch and streaming, explains its core concepts and terminology, compares it with other runners, and walks through a complete Java word‑count example—including dependencies, pipeline construction, transforms, and execution with DirectRunner.

Programmer DD
Programmer DD
Programmer DD
Master Apache Beam: Build a Portable Word Count Pipeline in Minutes

1. Overview

In this tutorial we introduce Apache Beam and explore its basic concepts. We first demonstrate use cases and benefits, then present core terminology, and finally illustrate all important aspects with a simple example.

2. What is Apache Beam?

Apache Beam (Batch+Stream) is a unified programming model for batch and streaming data processing jobs. It provides an SDK to define and build pipelines and runners that execute them on various distributed back‑ends such as Apache Apex, Flink, Gearpump, Samza, Spark, Google Cloud Dataflow, and Hazelcast Jet.

3. Why Choose Apache Beam?

Beam merges batch and streaming processing, making it easy to switch between them. It improves portability and flexibility by focusing on logical pipeline definitions rather than underlying details, and supports multiple SDK languages (Java, Python, Go, Scala).

4. Core Concepts

The key concepts are:

PCollection – a bounded or unbounded dataset.

PTransform – an operation that takes one or more PCollections and produces zero or more PCollections.

Pipeline – a directed acyclic graph of PCollections and PTransforms.

PipelineRunner – executes the pipeline on a chosen distributed back‑end.

5. Word Count Example

We design a word‑count pipeline with the following steps: read lines, split into words, lowercase, trim punctuation, filter stopwords, and count unique words.

5.1 Build the Beam pipeline

<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-core</artifactId>
    <version>${beam.version}</version>
</dependency>

We add the DirectRunner runtime dependency:

<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-runners-direct-java</artifactId>
    <version>${beam.version}</version>
    <scope>runtime</scope>
</dependency>

5.3 Implementation

Creating the pipeline:

PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);

Six‑step word‑count pipeline:

PCollection<KV<String, Long>> wordCount = p
    .apply("(1) Read all lines", TextIO.read().from(inputFilePath))
    .apply("(2) Flatmap to a list of words", FlatMapElements.into(TypeDescriptors.strings())
        .via(line -> Arrays.asList(line.split("\\s"))))
    .apply("(3) Lowercase all", MapElements.into(TypeDescriptors.strings())
        .via(word -> word.toLowerCase()))
    .apply("(4) Trim punctuations", MapElements.into(TypeDescriptors.strings())
        .via(word -> trim(word)))
    .apply("(5) Filter stopwords", Filter.by(word -> !isStopWord(word)))
    .apply("(6) Count words", Count.perElement());

Each apply() call performs a specific transformation, from reading the input file to counting occurrences.

Finally we write the results:

wordCount.apply(MapElements.into(TypeDescriptors.strings())
    .via(count -> count.getKey() + " --> " + count.getValue()))
    .apply(TextIO.write().to(outputFilePath));

Running the pipeline:

p.run().waitUntilFinish();

The output consists of word‑count pairs such as apache --> 3, beam --> 5, etc.

6. Conclusion

We have learned what Apache Beam is, why it is popular, and how to implement a basic word‑count job using its unified model.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

JavaStreamingDistributed ProcessingApache Beamword countDataflow
Programmer DD
Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.