Big Data 14 min read

Why Alibaba Chose Apache Flink: A Deep Dive into Its Big Data Journey

This article explains how Alibaba adopted Apache Flink as a unified, low‑latency, high‑throughput big data engine, covering its origins, technical advantages over Spark, large‑scale deployment, state management, checkpointing, API unification, and future directions in streaming and batch processing.

Alibaba Cloud Developer

Oct 15, 2018

Why Alibaba Chose Apache Flink: A Deep Dive into Its Big Data Journey

Why Alibaba Chose Flink

With the explosion of data in the digital era, Alibaba needed a unified engine that could handle both batch and streaming workloads without requiring developers to write separate code bases. Apache Flink, a low‑latency, high‑throughput stream processing engine with exactly‑once semantics, became the solution.

Background and Comparison

Traditional big‑data engines either focus on batch (Spark, Hive) or streaming (Storm, Samza). Spark simulates streaming on top of batch, while Flink adopts a stream‑first approach that can also simulate batch, offering better scalability and extensibility.

Alibaba’s Deployment

Flink was first launched in Alibaba in 2016 for search and recommendation, and now runs across all Alibaba businesses on Hadoop YARN and HDFS. The platform processes billions of events per second with millisecond latency and provides exactly‑once consistency, meeting financial‑grade reliability.

Scale has grown from a few hundred servers to tens of thousands, handling petabytes of state data and processing over a trillion events daily, with peak throughput exceeding 4.7 × 10⁸ events per second during major sales events.

Core Technologies

Flink’s state management stores variables such as counters, offsets, and aggregates internally, eliminating external dependencies and improving performance. Periodic checkpoints, persisted to distributed storage, enable fault‑tolerant recovery without data loss, using the classic Chandy‑Lamport algorithm.

Event‑time processing and watermarks address out‑of‑order data, allowing accurate ordering based on the original occurrence time rather than arrival time.

Contributions to the Community

Alibaba re‑architected Flink’s distributed scheduler to run on both YARN and Kubernetes, introduced a layered job scheduling model, and implemented incremental checkpoints that persist only changed state, greatly improving stability at large scale.

Unified API and SQL

To bridge the gap between batch and streaming APIs, Alibaba proposes a DAG‑based unified API where developers define dataflow without distinguishing between batch and stream semantics. A unified SQL layer treats streams as continuously updating tables and batches as static tables, enabling a single query language for both.

Future Directions

Alibaba aims to strengthen Flink’s batch capabilities, achieve seamless batch‑stream switching, expand language support beyond Java and Scala to Python and Go, and integrate machine‑learning libraries for end‑to‑end AI pipelines.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Alibaba Apache Flink Unified Engine

Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.