Apache Beam Overview: Architecture, Programming Model, PCollection, Pipeline and Transform
This article provides a comprehensive introduction to Apache Beam, covering its unified batch‑and‑stream processing architecture, programming model, workflow patterns, Lambda and Kappa architectures, the characteristics of PCollection, pipeline construction, core transforms, I/O handling, and includes practical code examples.
1. Introduction
Big data processing is often underestimated, and without a robust data‑processing pipeline, AI cannot achieve true intelligence. Apache Beam offers a unified API for both batch and streaming workloads, allowing developers to focus on data‑processing logic rather than infrastructure differences.
2. Programming Model
Real‑world scenarios involve complex data‑flow requirements. Four common design patterns are illustrated:
2.1 Workflow
Copy pattern : duplicate a data module to multiple downstream modules.
PCollection<String> lines = pipeline.apply(TextIO.read().from("url").withHintMatchesManyFiles());Filter pattern : remove records that do not meet specific criteria.
pipeline.apply(Create.of(kvs))
.apply(Filter.by(kv -> kv.getKey().equals("id:1")))
.apply(JdbcIO.<KV<String,String>>write()...);Split pattern : keep all data but route it to different categories for separate processing.
Merge pattern : combine multiple data streams into a single dataset for further processing.
2.2 Lambda Architecture
Proposed by Nathan Marz, Lambda Architecture consists of three layers—Batch, Speed, and Serving—to provide scalable, fault‑tolerant big‑data processing.
2.3 Kappa Architecture
Jay Kreps introduced Kappa Architecture, which relies on a single streaming engine (e.g., Kafka) to handle both real‑time and historical data, simplifying the stack at the cost of higher operational complexity.
2.4 Summary
Focusing on immutable, unordered, and size‑agnostic data collections enables a flexible, extensible processing pipeline.
3. PCollection
PCollection (Parallel Collection) is the core data abstraction in Beam, similar to Spark's RDD. It is immutable, unordered, and can represent bounded or unbounded datasets.
Coders are required for serialization; they can be registered globally or set per transform.
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
CoderRegistry cr = p.getCoderRegistry();
cr.registerCoder(Integer.class, BigEndianIntegerCoder.class);4. Pipeline
A Pipeline defines the end‑to‑end data‑processing workflow, from source ingestion to final output.
PipelineOptions options = PipelineOptionsFactory.create();
options.setRunner(DirectRunner.class);
Pipeline pipeline = Pipeline.create(options);
List<KV<String,String>> kvs = new ArrayList<>();
for (int i = 0; i < 10; i++) {
kvs.add(KV.of("id:"+i, "name:"+i));
}
pipeline.apply(Create.of(kvs))
.apply(Filter.by(kv -> kv.getKey().equals("id:1")))
.apply(JdbcIO.<KV<String,String>>write()...);
pipeline.run().waitUntilFinish();5. Transform
Transforms are the basic processing units. Common transforms include ParDo (parallel Do) and GroupByKey. ParDo is implemented by extending the DoFn class with lifecycle methods such as @Setup, @StartBundle, @ProcessElement, @FinishBundle, and @Teardown.
static class DoFnTest<T> extends DoFn<T,T> {
@Setup public void setUp() { ... }
@StartBundle public void startBundle() { ... }
@ProcessElement public void processElement(ProcessContext c) { ... }
@FinishBundle public void finishBundle() { ... }
@Teardown public void teardown() { ... }
}Beam handles bundle failures by re‑processing the entire bundle, ensuring fault tolerance.
6. Pipeline I/O
Reading data is performed with Read transforms (e.g., TextIO.read(), JdbcIO.read()), which produce a PCollection. Writing data uses Write transforms (e.g., TextIO.write(), JdbcIO.write()).
PCollection<String> inputs = p.apply(TextIO.read().from(filepath));
PCollection<Row> rows = pipeline.apply(JdbcIO.read()
.withDataSourceConfiguration(...)
.withQuery("select * from table")
.withRowMapper(...));
p.apply(TextIO.write().to("outputPath").withSuffix(".txt"));7. Author Introduction
Li Meng, currently at Zhiyin Smart Data Technology Co., focuses on data‑center architecture and middleware development in the cloud‑computing and big‑data domain.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
