Understanding Apache Flink’s Core Design: “Batch Is a Special Case of Stream” and Its Architecture
This article explains Apache Flink’s fundamental design principle that treats batch as a special case of stream, compares native streaming with micro‑batching, describes its deployment modes, fault‑tolerance mechanisms, unified data and scheduling layers, and outlines Alibaba’s architectural optimizations for the platform.
Core Design Principle
Apache Flink’s "lifeblood" is the concept that "batch is a special case of stream": a bounded data set is treated as a stream that happens to end. This idea guides its native‑streaming system design and enables lower‑latency processing than micro‑batching engines such as Spark Streaming.
Performance Advantage
Flink achieves millisecond‑level latency by processing each incoming record immediately, whereas Spark’s micro‑batching incurs roughly 0.5–2 seconds of delay because records wait for the current batch to fill.
Computation Models
Micro‑Batching: treats a stream as a series of tiny batches, which adds latency because each record waits for its batch to accumulate.
Native Streaming: processes each record as it arrives, which gives the lowest latency of the two models.
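The difference between the two models can be sketched with a toy simulation. This is not Flink or Spark code: the function names, the arrival times, and the 1‑second batch interval are assumptions chosen for illustration, and processing time itself is treated as zero so that only the waiting time is visible.

```python
import math

def native_streaming_latency(arrival_times):
    """Each record is handled the moment it arrives, so it never waits."""
    return [0.0 for _ in arrival_times]

def micro_batch_latency(arrival_times, batch_interval):
    """Each record waits until the batch window it falls into closes."""
    latencies = []
    for t in arrival_times:
        # The window containing t closes at the next multiple of batch_interval.
        window_end = (math.floor(t / batch_interval) + 1) * batch_interval
        latencies.append(window_end - t)
    return latencies

arrivals = [0.25, 0.5, 1.75, 2.0]  # seconds; hypothetical arrival times
print(native_streaming_latency(arrivals))                 # [0.0, 0.0, 0.0, 0.0]
print(micro_batch_latency(arrivals, batch_interval=1.0))  # [0.75, 0.5, 0.25, 1.0]
```

In this sketch a record that arrives right at the start of a window waits for the full interval, which is why the batch interval sets the floor on latency in a micro‑batching engine.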
Deployment Modes
Local: runs in a single JVM for development and testing.
Cluster: supports standalone deployment or integration with resource managers such as YARN and Mesos. It uses a master‑slave architecture (a JobManager coordinating TaskManagers) and offers HA (high availability) to avoid a single point of failure.
Cloud: integrates with cloud services such as Google Compute Engine, AWS EC2, or Alibaba ECS.
Fault Tolerance
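Flink provides At‑Least‑Once and Exactly‑Once guarantees through a checkpointing mechanism: barriers injected into the stream mark consistent snapshot points, and state backends persist the operator state captured at each barrier. Combined with replayable sources and sinks that use two‑phase commit, this enables end‑to‑end exactly‑once semantics.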
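How checkpoints and a two‑phase‑commit sink combine can be sketched with a toy model. This is not Flink’s API: the class `TwoPhaseCommitSink`, the function `run_with_checkpoints`, the barrier spacing of two records, and the `fail_before` parameter are all hypothetical names and values used only to show the idea that output is staged at a barrier and published only after the checkpoint completes.

```python
class TwoPhaseCommitSink:
    """Toy sink: writes are staged per checkpoint and published only on commit."""

    def __init__(self):
        self.pending = []        # records written since the last barrier
        self.pre_committed = {}  # checkpoint_id -> staged records
        self.committed = []      # records visible to downstream readers

    def write(self, record):
        self.pending.append(record)

    def pre_commit(self, checkpoint_id):
        # Phase 1: runs when the checkpoint barrier reaches the sink.
        self.pre_committed[checkpoint_id] = self.pending
        self.pending = []

    def commit(self, checkpoint_id):
        # Phase 2: runs once the whole checkpoint has completed.
        self.committed.extend(self.pre_committed.pop(checkpoint_id))

    def abort(self):
        # On failure, drop everything that was never committed.
        self.pending = []
        self.pre_committed.clear()


def run_with_checkpoints(records, fail_before=None):
    """Process records with a barrier after every 2 records.

    If fail_before is set, simulate one crash just before that record index
    is written, then resume from the last completed checkpoint.
    """
    sink = TwoPhaseCommitSink()
    offset = 0          # source position saved by the last checkpoint
    checkpoint_id = 0
    crashed = False
    while offset < len(records):
        chunk = records[offset:offset + 2]
        failed_now = False
        for i, record in enumerate(chunk):
            if fail_before == offset + i and not crashed:
                crashed = failed_now = True
                sink.abort()  # uncommitted output is discarded
                break
            sink.write(record)
        if failed_now:
            continue  # offset unchanged: the source replays from the checkpoint
        checkpoint_id += 1
        sink.pre_commit(checkpoint_id)
        sink.commit(checkpoint_id)
        offset += len(chunk)
    return sink.committed


print(run_with_checkpoints(list("abcde"), fail_before=3))
# ['a', 'b', 'c', 'd', 'e']
```

In the failing run, record `c` is written once before the crash and again after the replay, yet it appears only once in the committed output because the first write was never committed. A sink that published every write immediately would give at‑least‑once output instead.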
Unified Architecture
Flink shares a common data transmission layer (supporting both pipelined and batch exchange), task scheduling layer, and user API layer (DataStream, DataSet, Table API, SQL) for both stream and batch workloads.
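The two exchange styles in the data transmission layer can be illustrated with Python generators. This is a conceptual sketch only, not Flink code: `pipelined_exchange`, `blocking_exchange`, and `trace` are hypothetical names, and the "batch" exchange is modeled simply as buffering the full upstream output before the consumer starts.

```python
def source(n, log):
    """Produce n records, logging each one as it is produced."""
    for i in range(n):
        log.append(f"produce {i}")
        yield i

def pipelined_exchange(upstream):
    """Hand each record downstream as soon as it is produced."""
    for record in upstream:
        yield record

def blocking_exchange(upstream):
    """Materialize the full upstream result before the consumer starts."""
    buffered = list(upstream)
    for record in buffered:
        yield record

def trace(exchange, n=2):
    """Run source -> exchange -> consumer and return the order of events."""
    log = []
    for record in exchange(source(n, log)):
        log.append(f"consume {record}")
    return log

print(trace(pipelined_exchange))
# ['produce 0', 'consume 0', 'produce 1', 'consume 1']
print(trace(blocking_exchange))
# ['produce 0', 'produce 1', 'consume 0', 'consume 1']
```

The producer and consumer code is identical in both runs; only the exchange between them changes, which mirrors the idea of one shared runtime serving both workloads.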
SQL and Optimization
The Table API and SQL are built on top of DataStream/DataSet, and their queries are optimized by Apache Calcite using both a rule‑based planner (HepPlanner) and a cost‑based planner (VolcanoPlanner).
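The two planning styles can be contrasted with a toy example. This is not Calcite’s API and does not reflect how HepPlanner or VolcanoPlanner are implemented: the tuple‑based plan format, the single rewrite rule, the table sizes, and the cost model (sum of intermediate result sizes for a left‑deep join order) are all assumptions made for illustration.

```python
from itertools import permutations

def merge_filters(plan):
    """Rule: Filter(p1, Filter(p2, x)) -> Filter(p1 AND p2, x), at the plan root."""
    if plan[0] == "filter" and plan[2][0] == "filter":
        return ("filter", ("and", plan[1], plan[2][1]), plan[2][2])
    return plan

def rule_based_optimize(plan, rules):
    """Apply rewrite rules repeatedly until no rule changes the plan."""
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new_plan = rule(plan)
            if new_plan != plan:
                plan, changed = new_plan, True
    return plan

def join_cost(order, rows, selectivity):
    """Estimated cost of a left-deep join order: sum of intermediate sizes."""
    size = rows[order[0]]
    cost = 0
    for table in order[1:]:
        size = size * rows[table] * selectivity
        cost += size
    return cost

def cost_based_optimize(rows, selectivity):
    """Enumerate every join order and keep the one with the lowest estimate."""
    return min(permutations(sorted(rows)),
               key=lambda order: join_cost(order, rows, selectivity))

plan = ("filter", "a > 1", ("filter", "b < 5", ("scan", "t")))
print(rule_based_optimize(plan, [merge_filters]))
# ('filter', ('and', 'a > 1', 'b < 5'), ('scan', 't'))

rows = {"orders": 1000, "users": 100, "regions": 10}  # hypothetical row counts
print(cost_based_optimize(rows, selectivity=0.01))
# ('regions', 'users', 'orders')
```

The rule‑based pass applies a rewrite whenever its pattern matches, without looking at data sizes. The cost‑based pass compares alternatives using statistics, which is why it joins the two small tables first.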
Component Stack and Libraries
Flink includes libraries such as CEP (complex event processing), ML (machine learning), and Gelly (graph processing), and offers a rich set of operators for single‑ and multi‑stream processing.
Alibaba Enhancements
Alibaba’s Flink edition adds QP/QE/QO layers for unified query optimization, a DAG API for a common runtime abstraction, and further unifies batch and stream execution graphs.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
