How Flink Became the Real‑Time Big Data Standard – Insights from Alibaba’s Wang Feng

This interview with Alibaba researcher Wang Feng (aka Mo Wen) looks back on Apache Flink’s eight years as a top‑level Apache project and covers its unified stream‑batch architecture, the rise of Flink Table Store and Flink CDC, and how cloud‑native deployments are reshaping real‑time big data processing.

Programmer DD

Nov 26, 2022

From Stream Computing to Unified Stream‑Batch Computing

After overtaking Storm and Spark Streaming, Flink became the de facto standard for stream computing and, in Wang Feng’s view, no longer has a serious technical rival in that space.

Apache Flink, a real‑time big data analysis engine supporting both stream and batch modes, joined the Apache Software Foundation as a top‑level project eight years ago. Alibaba has been a major driver since 2015, contributing extensive code (including the Blink fork) and scaling Flink to handle 40 billion records per second during the 2021 Double‑11 event.

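Flink’s early advantage came from its stateful stream processing and distributed snapshot technology, which together provide strong consistency guarantees: operator state is checkpointed periodically, so a job can recover from failure without losing or double‑counting data. Unlike Spark Streaming, which builds on the batch‑oriented Spark engine and handles streams as a series of micro‑batches, Flink executes streams natively, record by record.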
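The idea behind checkpointed state can be illustrated with a small Python sketch. It is not Flink code: the `CountingOperator` class and `run_with_failure` function are hypothetical names made up for this example, and a single in‑process operator stands in for what Flink does across a distributed job. The point it tries to show is that saving state and input position together lets a restarted job produce the same result as one that never failed.

```python
import copy


class CountingOperator:
    """Toy stateful operator that counts events per key (hypothetical, not a Flink API)."""

    def __init__(self):
        self.counts = {}  # keyed state
        self.offset = 0   # how many input records have been processed

    def process(self, events, upto):
        # Handle records one at a time, from the current offset up to `upto`.
        for key in events[self.offset:upto]:
            self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def snapshot(self):
        # A checkpoint captures the state and the input position together.
        return {"counts": copy.deepcopy(self.counts), "offset": self.offset}

    def restore(self, checkpoint):
        self.counts = copy.deepcopy(checkpoint["counts"])
        self.offset = checkpoint["offset"]


def run_with_failure(events, checkpoint_at, fail_at):
    op = CountingOperator()
    op.process(events, checkpoint_at)
    checkpoint = op.snapshot()
    op.process(events, fail_at)  # work done after the checkpoint is lost in the crash

    recovered = CountingOperator()
    recovered.restore(checkpoint)
    recovered.process(events, len(events))  # replay from the checkpointed offset
    return recovered.counts


events = ["a", "b", "a", "c", "a", "b"]
print(run_with_failure(events, checkpoint_at=2, fail_at=4))
```

Because the replay starts exactly at the checkpointed offset, records processed between the checkpoint and the crash are counted once, not twice.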

In batch processing, Flink has matured significantly. It can now run the full TPC‑DS benchmark with performance comparable to mainstream batch engines, and the community continues to improve batch capabilities while building on Flink’s native streaming strengths to deliver the best unified stream‑batch experience.

The need for unified stream‑batch arises from business demands for real‑time analytics, risk control, recommendation, and monitoring, which cannot wait for nightly batch jobs. Maintaining separate real‑time and offline pipelines leads to duplicated development, inconsistent data definitions, and inefficient resource usage.

Unified stream‑batch allows a single engine to handle both real‑time and offline workloads, ensuring consistent data semantics and enabling shared resource pools for tasks such as search and recommendation.
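A minimal Python sketch of why a single definition of the logic keeps real‑time and offline results consistent. The function names and the order data are invented for illustration, and plain Python iterables stand in for bounded and unbounded sources; this is not how Flink is programmed.

```python
from collections import defaultdict


def revenue_per_user(orders):
    """Single definition of the business logic: running total per user."""
    totals = defaultdict(int)
    for user, amount in orders:
        totals[user] += amount
        yield user, totals[user]


def run_batch(orders):
    # Bounded input: consume everything and keep only the final totals.
    final = {}
    for user, total in revenue_per_user(orders):
        final[user] = total
    return final


def run_streaming(order_source):
    # Unbounded-style input: emit every update as soon as it is computed.
    return list(revenue_per_user(order_source))


orders = [("alice", 10), ("bob", 5), ("alice", 7)]
print(run_batch(orders))
print(run_streaming(iter(orders)))
```

Since both modes call the same `revenue_per_user`, the last streaming update for each user matches the batch result, which is the consistency that two separately maintained pipelines tend to lose.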

Streaming Data Warehouse: A New Architecture

Unified stream‑batch is a technical concept; the streaming data warehouse is the architecture built on it.

Flink’s SQL layer now expresses unified semantics, letting users write one SQL statement that runs on both real‑time and batch data. However, storage remains fragmented: real‑time data is written to Kafka‑like systems, while batch data lands in Hive, Iceberg, or Hudi. This dual‑storage model still requires maintaining two pipelines.

The industry currently lacks a production‑ready unified storage layer that supports efficient reads and writes in both streaming and batch modes. Apache Hudi is a leading lakehouse project, but according to Wang Feng it struggles with large‑scale updates. Consequently, the Flink community is focusing on Flink Table Store, launched at the end of 2021, to provide a truly unified storage solution.

Flink Table Store has shipped two releases in 2022, with contributions from Alibaba, ByteDance, and others, and is becoming the foundation for the next‑generation streaming data warehouse.
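What "unified storage" has to offer can be sketched in a few lines of Python. The `ToyTableStore` class below is purely hypothetical and is not the Flink Table Store API or its on‑disk design; it only illustrates one table serving both a batch read (the current rows) and a streaming read (the changes since a given position).

```python
class ToyTableStore:
    """In-memory stand-in for a table with both batch and streaming access."""

    def __init__(self):
        self._changelog = []  # ordered (op, key, value) records
        self._rows = {}       # latest value per primary key

    def upsert(self, key, value):
        self._changelog.append(("+U", key, value))
        self._rows[key] = value

    def delete(self, key):
        if key in self._rows:
            self._changelog.append(("-D", key, self._rows.pop(key)))

    def batch_read(self):
        # Batch query: a snapshot of the rows as they are right now.
        return dict(self._rows)

    def stream_read(self, from_position=0):
        # Streaming query: changes since `from_position`, plus the next position.
        return self._changelog[from_position:], len(self._changelog)


store = ToyTableStore()
store.upsert("order-1", 100)
store.upsert("order-2", 250)
changes, position = store.stream_read()
store.upsert("order-1", 120)
store.delete("order-2")
print(store.batch_read())
print(store.stream_read(position))
```

With a single table behind both access paths, a real‑time job and an offline query see the same data rather than two copies kept in a message queue and a data lake.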

Unified Full and Incremental Data Integration

Real‑time data integration accounts for roughly one‑third of streaming workloads worldwide. Traditional approaches use separate tools for the full (batch) load and the incremental (streaming) sync, which makes it hard to combine the two into one consistent pipeline.

Building on Flink’s unified stream‑batch execution, Flink CDC reads a database’s existing data (the full snapshot) and then switches to its ongoing changes (the incremental phase) within a single job, without locking the source tables, and relies on Flink’s checkpointing for exactly‑once guarantees. The project now supports many databases (MySQL, Oracle, PostgreSQL, MongoDB, TiDB, PolarDB, OceanBase) and has attracted contributions from companies such as NetEase, Tencent, and Bilibili.
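The full‑then‑incremental pattern can be sketched in plain Python. Everything here is an assumption made for illustration: the function names, the `(op, key, value)` event shape, and the in‑memory dictionaries are not Flink CDC APIs, and the sketch does not model the lock‑free snapshot mechanism itself. It only shows the handoff from snapshot to change log and why checkpointing the replica together with its log offset avoids applying a change twice after a restart.

```python
def full_snapshot(source_rows):
    # Phase 1: copy the rows that already exist in the source table.
    return dict(source_rows)


def apply_changes(replica, change_log, start_offset):
    # Phase 2: apply change events from `start_offset` onward.
    for op, key, value in change_log[start_offset:]:
        if op == "delete":
            replica.pop(key, None)
        else:  # "insert" or "update"
            replica[key] = value
    return len(change_log)  # next offset to read from


def replicate(source_rows, change_log):
    replica = full_snapshot(source_rows)
    offset = apply_changes(replica, change_log, 0)
    return replica, offset


def replicate_with_restart(source_rows, change_log, crash_after):
    replica = full_snapshot(source_rows)
    offset = apply_changes(replica, change_log[:crash_after], 0)
    checkpoint = (dict(replica), offset)  # replica state and offset saved together
    # Simulated crash: restore from the checkpoint and resume at its offset.
    replica, offset = dict(checkpoint[0]), checkpoint[1]
    offset = apply_changes(replica, change_log, offset)
    return replica, offset


source = {1: "a", 2: "b"}
log = [("update", 1, "a2"), ("insert", 3, "c"), ("delete", 2, None)]
print(replicate(source, log))
print(replicate_with_restart(source, log, crash_after=1))
```

Both runs end with the same replica and offset, which is the behavior the checkpoint‑based exactly‑once guarantee is meant to provide.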

Flink in the Cloud‑Native Era

Flink’s architecture has long suited cloud‑native deployment: its resource scheduling, shuffle, and state management fit containerized environments, and it runs natively on Kubernetes. Running on Kubernetes removes the dependency on Hadoop, simplifies operations, and enables multi‑tenant isolation and, in the future, serverless capabilities.

Alibaba Cloud now offers a cloud‑native Flink product built on Flink SQL, providing real‑time data warehouse, integration, risk control, and feature engineering solutions. The service also supports a serverless model where users pay only for the resources they consume, and a multi‑cloud PaaS serverless offering is slated for global beta soon.

Interview Guest Profile

Wang Feng (alias “Mo Wen”) is an Alibaba researcher who graduated from Beihang University in 2006. He leads Alibaba Cloud’s open‑source big data platform and serves as vice‑chair of the Alibaba Open‑Source Committee’s big data and AI direction. Since 2015, he has championed Flink in China, driving its adoption across Alibaba’s real‑time data pipelines and contributing hundreds of open‑source projects to the community.

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
