Understanding Apache Flink’s Core Design: “Batch Is a Special Case of Stream” and Its Architecture

This article explains Apache Flink’s fundamental design principle that treats batch as a special case of stream, compares native streaming with micro‑batching, describes its deployment modes, fault‑tolerance mechanisms, unified data and scheduling layers, and outlines Alibaba’s architectural optimizations for the platform.

Big Data Technology & Architecture

Mar 12, 2019

Core Design Principle

Apache Flink’s "lifeblood" is the idea that "batch is a special case of stream." This principle drives its native‑streaming design and gives it much lower latency than micro‑batching engines such as Spark Streaming.

Performance Advantage

Flink achieves millisecond‑level latency by processing each record as soon as it arrives, whereas Spark’s micro‑batching adds roughly 0.5–2 seconds of delay while each batch accumulates.

Computation Models

Micro‑Batching: treats a stream as a series of tiny batches. Each record must wait for its batch to fill before it is processed, which raises latency.

Native Streaming: processes each record as it arrives, so no batching delay is added.
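The latency difference between the two models can be illustrated with a small simulation. This is a minimal sketch rather than a benchmark: the arrival times, the 1-second batch interval, and the 1 ms per-record processing cost are all made-up assumptions, and the function names are hypothetical.

```python
import math

def native_latencies(arrivals, per_record_cost=0.001):
    """Native streaming: each record is processed the moment it arrives."""
    return [per_record_cost for _ in arrivals]

def micro_batch_latencies(arrivals, batch_interval=1.0, per_record_cost=0.001):
    """Micro-batching: a record waits until its batch interval closes."""
    latencies = []
    for t in arrivals:
        # The batch containing t closes at the next multiple of batch_interval.
        batch_close = (math.floor(t / batch_interval) + 1) * batch_interval
        latencies.append((batch_close - t) + per_record_cost)
    return latencies

# Arrival times in seconds (illustrative values only).
arrivals = [0.10, 0.45, 0.90, 1.20, 1.75]
native = native_latencies(arrivals)
batched = micro_batch_latencies(arrivals, batch_interval=1.0)

print("native max latency (s):", max(native))
print("micro-batch max latency (s):", round(max(batched), 3))
```

In this model the micro-batch latency is dominated by the time a record spends waiting for its batch to close, so shrinking the batch interval reduces the gap but never removes it.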

Deployment Modes

Local: runs in a single JVM for development and testing.

Cluster: runs standalone or on a resource manager such as YARN or Mesos. The cluster uses a master‑slave architecture (a JobManager coordinating TaskManagers) and supports high availability (HA) to avoid a single point of failure.

Cloud: integrates with cloud services such as Google Compute Engine, AWS EC2, or Alibaba ECS.

Fault Tolerance

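Flink offers at‑least‑once and exactly‑once state guarantees through checkpointing: barriers injected into the stream trigger consistent snapshots of operator state, which are persisted through state backends. Combined with replayable sources and sinks that use two‑phase commit, this extends to end‑to‑end exactly‑once semantics.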
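The core idea behind checkpoint-based recovery, snapshotting the source position together with operator state and replaying from that point after a failure, can be sketched in a few lines. This is a deliberately simplified model, not Flink's implementation: it uses a single operator, has no barrier alignment across multiple inputs, and covers only state consistency (not sink output). The class and function names are hypothetical.

```python
import copy

class CountingJob:
    """Toy job: reads a replayable source and keeps per-key counts as state."""

    def __init__(self, source):
        self.source = source        # replayable list of records
        self.offset = 0             # next source position to read
        self.counts = {}            # operator state
        self.checkpoint = (0, {})   # last completed snapshot: (offset, state)

    def process_one(self):
        key = self.source[self.offset]
        self.counts[key] = self.counts.get(key, 0) + 1
        self.offset += 1

    def take_checkpoint(self):
        # Snapshot the source position and the operator state together,
        # so both describe the same point in the stream.
        self.checkpoint = (self.offset, copy.deepcopy(self.counts))

    def recover(self):
        # Roll back to the last snapshot; later records will be replayed.
        self.offset, state = self.checkpoint
        self.counts = copy.deepcopy(state)

def run(source, checkpoint_every, fail_at=None):
    """Run the job, optionally simulating one failure at offset `fail_at`."""
    job = CountingJob(source)
    failed = False
    while job.offset < len(source):
        if fail_at is not None and not failed and job.offset == fail_at:
            failed = True
            job.recover()
            continue
        job.process_one()
        if job.offset % checkpoint_every == 0:
            job.take_checkpoint()
    return job.counts

records = ["a", "b", "a", "c", "a", "b", "c"]
print(run(records, checkpoint_every=3))
print(run(records, checkpoint_every=3, fail_at=5))
```

Both runs end with the same counts: the records processed between the last checkpoint and the failure are replayed, but because the state was rolled back along with the source offset, each record affects the state exactly once.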

Unified Architecture

Stream and batch workloads in Flink share the same data transmission layer (pipelined or blocking batch exchange), task scheduling layer, and user API layer (DataStream, DataSet, Table API, SQL).
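The difference between pipelined and blocking data exchange can be shown with plain Python generators. This is only an analogy under simplifying assumptions (one producer, one consumer, everything in memory), and the function names are hypothetical. The same operator code runs in both modes; only the hand-off between stages changes.

```python
def source(records, log):
    """Upstream task: emits records one at a time."""
    for r in records:
        log.append(f"produce {r}")
        yield r

def double(upstream, log):
    """Downstream task: the same operator logic is used in both modes."""
    for r in upstream:
        log.append(f"consume {r}")
        yield r * 2

def run_pipelined(records):
    # Pipelined exchange: each record flows downstream as soon as it exists.
    log = []
    result = list(double(source(records, log), log))
    return result, log

def run_blocking(records):
    # Blocking exchange: upstream output is fully materialized first.
    log = []
    materialized = list(source(records, log))
    result = list(double(iter(materialized), log))
    return result, log

result_p, log_p = run_pipelined([1, 2, 3])
result_b, log_b = run_blocking([1, 2, 3])
print(result_p == result_b)
print(log_p)
print(log_b)
```

The results are identical, but the logs differ: the pipelined run interleaves produce and consume events, while the blocking run finishes all produce events before the first consume. A bounded input processed in either mode is the sense in which a batch job is a stream job with an end.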

SQL and Optimization

The Table API and SQL are built on top of DataStream/DataSet. Queries are optimized with Apache Calcite, using both a rule‑based planner (HepPlanner) and a cost‑based planner (VolcanoPlanner).
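To make the two optimization styles concrete, here is a toy sketch in plain Python. It does not use Calcite's API, and the plan representation, the single rewrite rule, the cost formula, the row counts, and the selectivity value are all illustrative assumptions. The rule-based part rewrites a plan until no rule changes it; the cost-based part enumerates alternatives and keeps the cheapest by estimated cost.

```python
from itertools import permutations

# Plan nodes (hypothetical format):
#   ("scan", table_name)
#   ("filter", (predicate, ...), child_plan)

def merge_filters(plan):
    """Rule: Filter(p1, Filter(p2, x)) -> Filter(p1 + p2, x)."""
    if plan[0] == "filter" and plan[2][0] == "filter":
        return ("filter", plan[1] + plan[2][1], plan[2][2])
    return plan

def apply_rules(plan, rules):
    """Rule-based pass: rewrite bottom-up until no rule changes the plan."""
    if plan[0] == "scan":
        return plan
    # Rewrite the child first, then the current node.
    plan = (plan[0], plan[1], apply_rules(plan[2], rules))
    while True:
        new_plan = plan
        for rule in rules:
            new_plan = rule(new_plan)
        if new_plan == plan:
            return plan
        plan = new_plan

def join_cost(order, rows, selectivity=0.01):
    """Estimated cost of a left-deep join: sum of intermediate result sizes."""
    size = rows[order[0]]
    cost = 0.0
    for table in order[1:]:
        size = size * rows[table] * selectivity
        cost += size
    return cost

def cheapest_join_order(rows):
    """Cost-based choice: enumerate join orders and keep the cheapest."""
    return min(permutations(rows), key=lambda order: join_cost(order, rows))

plan = ("filter", ("a > 1",),
        ("filter", ("b < 5",),
         ("filter", ("c = 0",), ("scan", "t"))))
print(apply_rules(plan, [merge_filters]))

rows = {"orders": 1_000_000, "customers": 10_000, "regions": 10}
print(cheapest_join_order(rows))
```

The rule-based pass collapses the three stacked filters into one without consulting any statistics, while the cost-based search uses the row counts to defer the largest table to the last join.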

Component Stack and Libraries

Flink includes libraries such as CEP (complex event processing), ML (machine learning), and Gelly (graph processing), and offers a rich set of operators for single‑ and multi‑stream processing.

Alibaba Enhancements

Alibaba’s Flink edition adds QP/QE/QO layers for unified query optimization, a DAG API for a common runtime abstraction, and further unifies batch and stream execution graphs.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
