
How Flink’s Stream‑Batch Fusion Is Transforming Real‑Time Big Data

The article explores Apache Flink’s eight years as a top‑level Apache project, Alibaba’s extensive contributions, the rise of unified stream‑batch computing, its impact on real‑time data integration, cloud‑native deployment, and the emerging Flink‑based data‑warehouse and serverless solutions.

Alibaba Cloud Big Data AI Platform

Nov 29, 2022

Flink’s Evolution and Community Impact

Apache Flink, one of the most active big‑data projects, has been a top‑level Apache project for eight years. Originally incubated in 2014, it quickly graduated to a top‑level project and now serves as a real‑time analytics engine supporting both stream and batch modes.

Alibaba has been a major driver of Flink since 2015. It deployed Flink in search scenarios in 2016 and kept modifying it to handle massive scale. By 2017 Alibaba had become the largest Flink user, and the Flink team had grown to over a hundred members.

In 2019 Alibaba open‑sourced Blink, its internal Flink fork, contributing over a million lines of code, which greatly accelerated community development. During the 2021 Double‑11 event Flink processed 4 billion records per second, amounting to 7 TB of data per second.

From Stream Computing to Unified Stream‑Batch Computing

After overtaking Storm and Spark Streaming, Flink became the de facto standard for stream computing.

Flink’s early advantage came from its stateful stream processing and distributed snapshot technology, providing high‑performance pure stream execution and strong consistency guarantees.
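The recovery idea behind stateful processing with snapshots can be illustrated with a toy, single‑process sketch. This is not Flink's API or its distributed snapshot algorithm; the names `CountingOperator`, `snapshot`, and `restore` are hypothetical. It only shows why saving operator state together with the source offset lets a job recover from a failure without counting any record twice.

```python
from collections import Counter
from copy import deepcopy


class CountingOperator:
    """Toy stateful operator that counts events per key (hypothetical, not a Flink class)."""

    def __init__(self):
        self.counts = Counter()  # operator state
        self.offset = 0          # index of the next record to read from the source

    def process(self, source, upto):
        """Consume records from the current offset up to (not including) `upto`."""
        for key in source[self.offset:upto]:
            self.counts[key] += 1
        self.offset = max(self.offset, upto)

    def snapshot(self):
        """Capture state and source offset together as one consistent checkpoint."""
        return {"counts": deepcopy(self.counts), "offset": self.offset}

    def restore(self, checkpoint):
        """Reset state and read position to what the checkpoint recorded."""
        self.counts = deepcopy(checkpoint["counts"])
        self.offset = checkpoint["offset"]


source = ["a", "b", "a", "c", "a", "b"]  # a replayable log of keys

op = CountingOperator()
op.process(source, upto=3)
checkpoint = op.snapshot()   # state after the first three records
op.process(source, upto=5)   # progress that is lost in the simulated crash

# Recovery: a fresh operator restores the checkpoint and replays from its offset.
recovered = CountingOperator()
recovered.restore(checkpoint)
recovered.process(source, upto=len(source))

# Every record is reflected in the state exactly once.
assert recovered.counts == Counter(source)
```

Because the offset is stored inside the checkpoint, the records processed after the snapshot and before the crash are simply replayed, so the restored counts never include them twice.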

Unlike Spark Streaming, which runs streams as micro‑batches on a batch engine and is limited in performance as a result, Flink executes streams natively. In batch mode, Flink now passes the TPC‑DS benchmark with performance comparable to mainstream batch engines, and the project aims to deliver the best combined stream‑batch experience.

Why Unified Stream‑Batch Matters

Traditional enterprises relied on offline batch jobs for daily reports, but modern digital services demand real‑time analytics for risk control, recommendation, and monitoring. Maintaining separate real‑time and offline pipelines leads to duplicated development and inconsistent business metrics.

A unified stream‑batch engine lets developers write a single program that serves both real‑time and offline use cases, ensuring consistent logic and data semantics, especially in highly latency‑sensitive scenarios such as search and recommendation.
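The "write once, run in both modes" idea can be sketched in plain Python, without any Flink dependency. The record shape `(user_id, amount)` and the function names `revenue_per_user`, `run_batch`, and `run_streaming` are assumptions made for illustration. The point is that the business logic lives in one place, so the streaming result converges to exactly what the batch run computes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[str, float]  # (user_id, amount), an assumed record shape


def revenue_per_user(events: Iterable[Event]) -> Iterator[Dict[str, float]]:
    """Shared business logic: running revenue per user.

    Yields the updated totals after every event, so a streaming caller can
    read each intermediate result while a batch caller keeps only the last.
    """
    totals: Dict[str, float] = defaultdict(float)
    for user, amount in events:
        totals[user] += amount
        yield dict(totals)


def run_batch(events: Iterable[Event]) -> Dict[str, float]:
    """Bounded input: consume everything and return only the final totals."""
    result: Dict[str, float] = {}
    for result in revenue_per_user(events):
        pass
    return result


def run_streaming(events: Iterable[Event]) -> List[Dict[str, float]]:
    """Stream-style input: collect every incremental update as it is emitted."""
    return list(revenue_per_user(iter(events)))


orders = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]

final_batch = run_batch(orders)
updates = run_streaming(orders)

# The last streaming update equals the batch result: one logic, one answer.
assert updates[-1] == final_batch
```

With two separately maintained pipelines, the equivalent of `revenue_per_user` would exist twice, and any drift between the copies would surface as the inconsistent metrics described above.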

Full‑Incremental Data Integration with Flink

Real‑time data integration accounts for roughly one‑third of all stream‑processing workloads. Traditional stacks require separate tools for batch and streaming, making full‑incremental synchronization complex. Leveraging Flink’s stream‑batch fusion enables a single pipeline that continuously syncs data with exactly‑once guarantees.

Flink CDC, built on Flink’s checkpointing and incremental snapshot algorithms, provides lock‑free, zero‑downtime data replication across many databases (MySQL, Oracle, PostgreSQL, MongoDB, TiDB, PolarDB, OceanBase, etc.), and the project has attracted contributions from companies like NetEase, Tencent, and ByteDance.
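Full‑incremental synchronization can be pictured as two phases: copy a snapshot taken at a known changelog position, then replay every change recorded from that position on. The sketch below is a deliberately simplified, single‑process model and not the incremental snapshot algorithm that Flink CDC uses, which the article does not detail. The `Change` tuple layout and the `sync` and `apply_change` helpers are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, int, Optional[str]]  # (op, primary_key, new_value)


def apply_change(table: Dict[int, str], change: Change) -> None:
    """Apply one change event; upserts and deletes are idempotent by key."""
    op, key, value = change
    if op == "delete":
        table.pop(key, None)
    else:  # "insert" or "update"
        table[key] = value


def sync(snapshot: Dict[int, str], snapshot_position: int,
         changelog: List[Change]) -> Dict[int, str]:
    """Full phase: copy the snapshot. Incremental phase: replay later changes."""
    target = dict(snapshot)
    for change in changelog[snapshot_position:]:
        apply_change(target, change)
    return target


# The source database's changelog; positions are list indexes.
changelog: List[Change] = [
    ("insert", 1, "alice"),
    ("insert", 2, "bob"),
    ("update", 1, "alice-v2"),
    ("delete", 2, None),
    ("insert", 3, "carol"),
]

# Snapshot of the source as of changelog position 2 (after the first two events).
snapshot = {1: "alice", 2: "bob"}
target = sync(snapshot, snapshot_position=2, changelog=changelog)

# Reference: the source's current state, obtained by applying every change.
source_now: Dict[int, str] = {}
for change in changelog:
    apply_change(source_now, change)

assert target == source_now
```

Recording the changelog position alongside the snapshot is what lets the two phases join without gaps or duplicates in this model; a single pipeline that owns both phases can track that position itself instead of handing it between separate tools.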

Flink in the Cloud‑Native Era

Flink was designed with cloud‑native principles from the start, supporting Kubernetes deployment, containerized execution, and stateless resource scheduling. This enables easy deployment without Hadoop dependencies, better isolation, and multi‑tenant management, and it paves the way for serverless execution.

Adaptive resource scaling and automatic parallelism tuning further exploit cloud elasticity, allowing Flink jobs to adjust to workload fluctuations without manual provisioning.
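One way to picture adaptive scaling is as a policy that maps the observed input rate to a number of parallel tasks. The function below is a hypothetical sketch, not Flink's scheduler or autoscaler logic: `target_parallelism`, its parameters, and the 80% utilization target are all assumptions chosen for illustration.

```python
import math


def target_parallelism(input_rate: float,
                       per_task_rate: float,
                       min_parallelism: int = 1,
                       max_parallelism: int = 128,
                       target_utilization: float = 0.8) -> int:
    """Pick a task count so each task runs at roughly `target_utilization`.

    input_rate:    observed records per second entering the job
    per_task_rate: records per second one task can sustain at full load
    """
    if per_task_rate <= 0:
        raise ValueError("per_task_rate must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")

    # Capacity each task should actually be asked to carry, leaving headroom.
    usable_rate = per_task_rate * target_utilization
    needed = math.ceil(input_rate / usable_rate)

    # Clamp to the bounds the operator is willing to pay for.
    return max(min_parallelism, min(max_parallelism, needed))


normal_load = target_parallelism(input_rate=50_000, per_task_rate=10_000)
traffic_spike = target_parallelism(input_rate=2_000_000, per_task_rate=10_000)
idle = target_parallelism(input_rate=0, per_task_rate=10_000)
```

Under these assumed numbers, the normal load needs more tasks than the raw ratio of 5 suggests because of the headroom, the spike is capped at `max_parallelism`, and the idle job falls back to `min_parallelism`.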

Productization and Future Directions

Alibaba Cloud offers a cloud‑native Flink product built around Flink SQL, providing real‑time data warehousing, integration, risk control, and feature engineering for enterprises. The platform also supports serverless operation, charging only for consumed resources.

Upcoming plans include a multi‑cloud PaaS serverless Flink service and continued community focus on a unified stream‑batch storage layer (e.g., Flink Table Store) to close the gap between streaming and batch storage.

