Open Source Big Data Platform 3.0: Streaming Lakehouse, Serverless Architecture, and AI Integration
The talk outlines the evolution of Alibaba Cloud's open‑source big data platform from Hadoop‑based EMR to a 3.0 architecture featuring a streaming lakehouse, full serverless compute and storage, AI‑driven operations, and upcoming vector search services, highlighting technical motivations, challenges, and product releases.
Speaker and Topic: Wang Feng, Alibaba Cloud researcher and head of the open‑source big data platform, presented “Open Source Big Data Platform 3.0 Technical Interpretation”.
Historical Background: The platform originated from internal Hadoop usage in 2009, evolved into the first cloud‑native product EMR (1.0), and later incorporated Apache Flink for real‑time streaming, marking the 2.0 stage with a data‑lake‑centric architecture.
3.0 Technical Explorations:
**Streaming Lakehouse** – a unified real‑time data‑warehouse architecture that merges streaming analytics with lakehouse storage.
**Serverless Transformation** – all core compute and storage components have been re‑engineered to be serverless, enabling elastic resource allocation.
**AI Integration** – AI is embedded for intelligent operations, data management, and future vector‑search services.
Streaming Lakehouse Details: A lakehouse separates storage from compute, which improves scalability and query performance. However, the existing table formats (Iceberg, Delta, Hudi) limit fine‑grained real‑time updates. To address this, the community launched Apache Paimon, a lake format designed for real‑time upserts and compatible with Flink, Spark, Presto, and StarRocks. According to the talk, it delivers >4× upsert and >10× scan speedups over Hudi.
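To make "real‑time upserts" concrete, here is a minimal pure‑Python sketch of the general idea behind primary‑key upserts on a lake table: writers append small sorted runs of change records, and readers merge them so that the newest record per key wins. This is a conceptual illustration only, not Paimon's actual implementation or API; the names `Record` and `merge_runs` and the sample orders are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# One change record: (primary key, sequence number, row or None for a delete).
# This record layout is an assumption made for the illustration.
Record = Tuple[str, int, Optional[dict]]


def merge_runs(runs: Iterable[List[Record]]) -> Dict[str, dict]:
    """Merge change records from several runs; the highest sequence number wins per key."""
    latest: Dict[str, Tuple[int, Optional[dict]]] = {}
    for run in runs:
        for key, seq, row in run:
            if key not in latest or seq > latest[key][0]:
                latest[key] = (seq, row)
    # Keys whose newest record is a delete marker are dropped from the snapshot.
    return {key: row for key, (_, row) in latest.items() if row is not None}


# Three append-only runs: an initial load, an update plus a delete, and an insert.
base_run = [("order-1", 1, {"status": "created"}), ("order-2", 2, {"status": "created"})]
update_run = [("order-1", 3, {"status": "paid"}), ("order-2", 4, None)]
new_run = [("order-3", 5, {"status": "created"})]

snapshot = merge_runs([base_run, update_run, new_run])
print(snapshot)
```

Because each write is only an append of a small run, updates stay cheap; the cost of reconciling versions moves to read or compaction time, which is the trade‑off that upsert‑oriented formats try to optimize.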
Serverless Compute Products:
**Serverless Flink** – integrates with Alibaba Cloud storage, provides a one‑click SQL development platform, and offers 2–3× performance over open‑source Flink.
**Serverless StarRocks** – a fully managed OLAP engine with vectorized C++ execution, supporting massive concurrency.
**Serverless Spark** – a fully managed Spark service that combines Flink and StarRocks advantages, includes a serverless shuffle service based on Apache Celeborn, and eliminates local disk dependencies.
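Apache Celeborn is a remote shuffle service, and that is how the Spark offering can avoid local disks: map tasks push intermediate records to the service instead of writing shuffle files locally, and reduce tasks fetch their partition from it. The toy sketch below shows that data flow with an in‑memory stand‑in. The class `RemoteShuffleService` and its `push`/`fetch` methods are hypothetical names for this illustration, not Celeborn's API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class RemoteShuffleService:
    """Toy in-memory stand-in for a remote shuffle service (hypothetical API)."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        self._partitions: Dict[int, List[Tuple[str, int]]] = defaultdict(list)

    def push(self, key: str, value: int) -> None:
        # Map tasks send each record to the service rather than to local disk.
        partition_id = sum(key.encode()) % self.num_partitions  # deterministic partitioner
        self._partitions[partition_id].append((key, value))

    def fetch(self, partition_id: int) -> List[Tuple[str, int]]:
        # Reduce tasks read one whole partition from the service.
        return list(self._partitions[partition_id])


def map_task(service: RemoteShuffleService, lines: List[str]) -> None:
    for line in lines:
        for word in line.split():
            service.push(word, 1)


def reduce_task(service: RemoteShuffleService, partition_id: int) -> Dict[str, int]:
    counts: Dict[str, int] = defaultdict(int)
    for key, value in service.fetch(partition_id):
        counts[key] += value
    return dict(counts)


service = RemoteShuffleService(num_partitions=2)
map_task(service, ["flink spark flink"])      # first map task
map_task(service, ["spark starrocks flink"])  # second map task

word_counts: Dict[str, int] = {}
for pid in range(service.num_partitions):
    word_counts.update(reduce_task(service, pid))
print(word_counts)
```

Since no intermediate state lives on the compute nodes, those nodes can be added or released freely, which is what a serverless compute layer needs.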
Serverless Storage:
**OSS‑HDFS** – a fully managed, HDFS‑compatible file system built on OSS, providing infinite‑scale cloud HDFS with pay‑as‑you‑go elasticity.
AI‑Driven Operations: Tools such as EMR Doctor and Flink Advisor use knowledge bases and machine‑learning models to automatically diagnose cluster issues, reducing problem‑identification time by 30% and improving resource utilization by 75%.
Future Directions: A fully managed serverless vector‑search service based on Milvus will be introduced, completing the AI‑enabled big data ecosystem.
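For readers unfamiliar with vector search, the core operation is to find the stored embeddings most similar to a query embedding. The sketch below does this by brute force with cosine similarity in pure Python. It is an illustration of the concept only: it does not use Milvus, the document IDs and three‑dimensional vectors are made up, and production systems such as Milvus rely on approximate nearest‑neighbor indexes rather than a full scan.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]], k: int) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the k best matches."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings; real ones would come from an embedding model.
index = {
    "doc-flink": [0.9, 0.1, 0.0],
    "doc-spark": [0.7, 0.3, 0.1],
    "doc-oss": [0.0, 0.2, 0.9],
}

results = top_k([1.0, 0.0, 0.0], index, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

A full scan like this costs time proportional to the number of stored vectors per query, which is why a managed service with purpose‑built indexes becomes attractive at scale.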
The presentation concluded with a call to adopt these innovations to serve customers and gather feedback.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies