Big Data 13 min read

Understanding Flink’s Architecture: From APIs to Cluster Deployment

This article explains Flink’s three‑layer architecture (APIs & Libraries, Core, Deploy), details its programming interfaces, runtime engine, deployment options, and core concepts such as stateful computation and time semantics, providing a comprehensive guide for building robust stream and batch applications.

Python Crawling & Data Mining

Aug 21, 2021

Understanding Flink’s Architecture: From APIs to Cluster Deployment

Flink System Architecture Overview

Flink’s architecture is divided into three layers: APIs & Libraries, Core, and Deploy. The APIs layer provides DataStream API for stream processing and DataSet API for batch processing. The Libraries layer builds domain‑specific frameworks such as CEP, Table API, Flink ML, and Gelly. The Core layer implements the runtime engine, handling job graph conversion, scheduling, and execution. The Deploy layer supports various deployment modes including Standalone, YARN, Kubernetes, and cloud platforms.

Programming Interfaces

Flink offers four programming interfaces. Flink SQL enables relational‑style queries for both stream and batch workloads. Table API offers a DSL similar to DataSet/DataFrame and can interoperate with DataStream/DataSet APIs. DataStream & DataSet APIs are the original low‑level APIs for unbounded and bounded data respectively. The Stateful Processing Function API gives fine‑grained control over state and timers but is more complex and usually unnecessary because rich operators already encapsulate its functionality.

Runtime Execution Engine

Jobs written with the above APIs are compiled into a JobGraph, submitted to the cluster, and executed by the runtime. The runtime includes a ResourceManager and TaskManagers that manage slots, schedule tasks, and support both bounded and unbounded jobs.

Physical Deployment Layer

The deployment layer abstracts different resource managers such as Standalone, Hadoop YARN, and Kubernetes, providing slot resources to TaskManagers.

Core Concepts

Stateful Computation

Stateful computation stores intermediate results in memory or external stores (e.g., RocksDB) so that subsequent operators can access them. This enables complex use‑cases like CEP, windowed aggregations, and incremental calculations.

Time Semantics and Watermarks

Flink distinguishes event time, processing time, and ingestion time. Event time reflects the timestamp embedded in the data and is used with watermarks to handle out‑of‑order events. Processing time uses the machine’s clock, offering low latency but no deterministic ordering. Ingestion time is generated when data enters Flink and can be used when event time is unavailable.

Conclusion

Understanding Flink’s layered architecture, rich programming interfaces, runtime engine, deployment options, and core concepts such as stateful processing and time semantics is essential for building robust stream and batch applications.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

architecture Flink stream processing Stateful Computing Time Semantics

Written by

Python Crawling & Data Mining

Life's short, I code in Python. This channel shares Python web crawling, data mining, analysis, processing, visualization, automated testing, DevOps, big data, AI, cloud computing, machine learning tools, resources, news, technical articles, tutorial videos and learning materials. Join us!

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.