Why Alibaba Chose Apache Flink: A Deep Dive into Its Big Data Journey
This article explains how Alibaba adopted Apache Flink as a unified, low‑latency, high‑throughput big data engine, covering its origins, technical advantages over Spark, large‑scale deployment, state management, checkpointing, API unification, and future directions in streaming and batch processing.
Why Alibaba Chose Flink
With the explosion of data in the digital era, Alibaba needed a unified engine that could handle both batch and streaming workloads without requiring developers to write separate code bases. Apache Flink, a low‑latency, high‑throughput stream processing engine with exactly‑once semantics, became the solution.
Background and Comparison
Traditional big‑data engines either focus on batch (Spark, Hive) or streaming (Storm, Samza). Spark simulates streaming on top of batch, while Flink adopts a stream‑first approach that can also simulate batch, offering better scalability and extensibility.
Alibaba’s Deployment
Flink was first launched in Alibaba in 2016 for search and recommendation, and now runs across all Alibaba businesses on Hadoop YARN and HDFS. The platform processes billions of events per second with millisecond latency and provides exactly‑once consistency, meeting financial‑grade reliability.
Scale has grown from a few hundred servers to tens of thousands, handling petabytes of state data and processing over a trillion events daily, with peak throughput exceeding 4.7 × 10⁸ events per second during major sales events.
Core Technologies
Flink’s state management stores variables such as counters, offsets, and aggregates internally, eliminating external dependencies and improving performance. Periodic checkpoints, persisted to distributed storage, enable fault‑tolerant recovery without data loss, using the classic Chandy‑Lamport algorithm.
Event‑time processing and watermarks address out‑of‑order data, allowing accurate ordering based on the original occurrence time rather than arrival time.
Contributions to the Community
Alibaba re‑architected Flink’s distributed scheduler to run on both YARN and Kubernetes, introduced a layered job scheduling model, and implemented incremental checkpoints that persist only changed state, greatly improving stability at large scale.
Unified API and SQL
To bridge the gap between batch and streaming APIs, Alibaba proposes a DAG‑based unified API where developers define dataflow without distinguishing between batch and stream semantics. A unified SQL layer treats streams as continuously updating tables and batches as static tables, enabling a single query language for both.
Future Directions
Alibaba aims to strengthen Flink’s batch capabilities, achieve seamless batch‑stream switching, expand language support beyond Java and Scala to Python and Go, and integrate machine‑learning libraries for end‑to‑end AI pipelines.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
