Big Data 14 min read

Apache Beam Overview: Architecture, Programming Model, PCollection, Pipeline and Transform

This article provides a comprehensive introduction to Apache Beam, covering its unified batch‑and‑stream processing architecture, programming model, workflow patterns, Lambda and Kappa architectures, the characteristics of PCollection, pipeline construction, core transforms, I/O handling, and includes practical code examples.

Big Data Technology & Architecture

May 10, 2020

Apache Beam Overview: Architecture, Programming Model, PCollection, Pipeline and Transform

1. Introduction

Big data processing is often underestimated, and without a robust data‑processing pipeline, AI cannot achieve true intelligence. Apache Beam offers a unified API for both batch and streaming workloads, allowing developers to focus on data‑processing logic rather than infrastructure differences.

2. Programming Model

Real‑world scenarios involve complex data‑flow requirements. Four common design patterns are illustrated:

2.1 Workflow

Copy pattern : duplicate a data module to multiple downstream modules.

PCollection<String> lines = pipeline.apply(TextIO.read().from("url").withHintMatchesManyFiles());

Filter pattern : remove records that do not meet specific criteria.

pipeline.apply(Create.of(kvs))
        .apply(Filter.by(kv -> kv.getKey().equals("id:1")))
        .apply(JdbcIO.<KV<String,String>>write()...);

Split pattern : keep all data but route it to different categories for separate processing.

Merge pattern : combine multiple data streams into a single dataset for further processing.

2.2 Lambda Architecture

Proposed by Nathan Marz, Lambda Architecture consists of three layers—Batch, Speed, and Serving—to provide scalable, fault‑tolerant big‑data processing.

2.3 Kappa Architecture

Jay Kreps introduced Kappa Architecture, which relies on a single streaming engine (e.g., Kafka) to handle both real‑time and historical data, simplifying the stack at the cost of higher operational complexity.

2.4 Summary

Focusing on immutable, unordered, and size‑agnostic data collections enables a flexible, extensible processing pipeline.

3. PCollection

PCollection (Parallel Collection) is the core data abstraction in Beam, similar to Spark's RDD. It is immutable, unordered, and can represent bounded or unbounded datasets.

Coders are required for serialization; they can be registered globally or set per transform.

PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
CoderRegistry cr = p.getCoderRegistry();
cr.registerCoder(Integer.class, BigEndianIntegerCoder.class);

4. Pipeline

A Pipeline defines the end‑to‑end data‑processing workflow, from source ingestion to final output.

PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(DirectRunner.class);
Pipeline pipeline = Pipeline.create(options);
List<KV<String,String>> kvs = new ArrayList<>();
for (int i = 0; i < 10; i++) {
    kvs.add(KV.of("id:"+i, "name:"+i));
}
pipeline.apply(Create.of(kvs))
        .apply(Filter.by(kv -> kv.getKey().equals("id:1")))
        .apply(JdbcIO.<KV<String,String>>write()...);
pipeline.run().waitUntilFinish();

5. Transform

Transforms are the basic processing units. Common transforms include ParDo (parallel Do) and GroupByKey. ParDo is implemented by extending the DoFn class with lifecycle methods such as @Setup, @StartBundle, @ProcessElement, @FinishBundle, and @Teardown.

static class DoFnTest<T> extends DoFn<T,T> {
    @Setup public void setUp() { ... }
    @StartBundle public void startBundle() { ... }
    @ProcessElement public void processElement(ProcessContext c) { ... }
    @FinishBundle public void finishBundle() { ... }
    @Teardown public void teardown() { ... }
}

Beam handles bundle failures by re‑processing the entire bundle, ensuring fault tolerance.

6. Pipeline I/O

Reading data is performed with Read transforms (e.g., TextIO.read(), JdbcIO.read()), which produce a PCollection. Writing data uses Write transforms (e.g., TextIO.write(), JdbcIO.write()).

PCollection<String> inputs = p.apply(TextIO.read().from(filepath));
PCollection<Row> rows = pipeline.apply(JdbcIO.read()
    .withDataSourceConfiguration(...)
    .withQuery("select * from table")
    .withRowMapper(...));

p.apply(TextIO.write().to("outputPath").withSuffix(".txt"));

7. Author Introduction

Li Meng, currently at Zhiyin Smart Data Technology Co., focuses on data‑center architecture and middleware development in the cloud‑computing and big‑data domain.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data data processing pipeline Lambda architecture transform Apache Beam PCollection

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.