
Open Source Big Data Platform 3.0: Streaming Lakehouse, Serverless Architecture, and AI Integration

The talk outlines the evolution of Alibaba Cloud's open‑source big data platform from Hadoop‑based EMR to a 3.0 architecture featuring a streaming lakehouse, full serverless compute and storage, AI‑driven operations, and upcoming vector search services, highlighting technical motivations, challenges, and product releases.

Big Data Technology Architecture

Speaker and Topic : Wang Feng, Alibaba Cloud researcher and head of the open‑source big data platform, presented “Open Source Big Data Platform 3.0 Technical Interpretation”.

Historical Background : The platform originated from internal Hadoop usage in 2009, evolved into the first cloud‑native product EMR (1.0), and later incorporated Apache Flink for real‑time streaming, marking the 2.0 stage with a data‑lake‑centric architecture.

3.0 Technical Explorations :

**Streaming Lakehouse** – a unified real‑time data‑warehouse architecture that merges streaming analytics with lakehouse storage.

**Serverless Transformation** – all core compute and storage components have been re‑engineered to be serverless, enabling elastic resource allocation.

**AI Integration** – AI is embedded for intelligent operations, data management, and future vector‑search services.

Streaming Lakehouse Details : A lakehouse separates storage from compute, which improves scalability and query performance, but the established table formats (Iceberg, Delta Lake, Hudi) make fine-grained real-time updates difficult. To address this, the community launched Apache Paimon, a lake format designed for real-time upserts that works with Flink, Spark, Presto, and StarRocks. According to the talk, it delivers more than 4× faster upserts and more than 10× faster scans than Hudi.
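To make the idea of a real-time upsert concrete, here is a toy model of how a primary-key table can resolve several streaming commits into one current snapshot at read time. It is not Paimon's implementation or API. It is a minimal sketch that assumes each change record carries a primary key and a monotonically increasing sequence number, and the `merge_runs` function and the sample `orders` data are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# One change record: (primary_key, sequence_number, row).
# A row of None marks a delete. This record layout is an assumption of the sketch.
Record = Tuple[str, int, Optional[dict]]


def merge_runs(runs: List[List[Record]]) -> Dict[str, dict]:
    """Merge several write batches into the latest row per primary key.

    For each key, the record with the highest sequence number wins.
    A winning record whose row is None is treated as a delete.
    """
    latest: Dict[str, Tuple[int, Optional[dict]]] = {}
    for run in runs:
        for key, seq, row in run:
            if key not in latest or seq > latest[key][0]:
                latest[key] = (seq, row)
    # Drop keys whose most recent change was a delete.
    return {key: row for key, (_, row) in latest.items() if row is not None}


# Hypothetical data: two streaming commits against an "orders" table.
commit_1 = [
    ("order-1", 1, {"status": "created"}),
    ("order-2", 2, {"status": "created"}),
]
commit_2 = [
    ("order-1", 3, {"status": "paid"}),  # update of an existing key
    ("order-2", 4, None),                # delete of an existing key
]

snapshot = merge_runs([commit_1, commit_2])
print(snapshot)  # {'order-1': {'status': 'paid'}}
```

Writers only append change records, and readers resolve them by key. That is why a format built around this pattern can accept frequent small updates without rewriting whole files on every change.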

Serverless Compute Products :

**Serverless Flink** – integrates with Alibaba Cloud storage, provides a one-click SQL development platform, and is reported to run 2–3× faster than open-source Flink.

**Serverless StarRocks** – a fully managed OLAP engine with vectorized C++ execution, supporting massive concurrency.

**Serverless Spark** – a fully managed Spark service designed to combine the strengths of the serverless Flink and StarRocks products. It includes a serverless shuffle service based on Apache Celeborn, which removes the dependency on local disks.
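Apache Celeborn, mentioned in the Serverless Spark item, is a remote shuffle service: map tasks push their intermediate data to a separate service instead of writing it to local disk, and reduce tasks fetch it from there by partition. The sketch below illustrates that pattern with an in-memory stand-in. The `RemoteShuffleService` class, its `push` and `fetch` methods, and the word-count job are hypothetical and do not reflect Celeborn's actual API.

```python
from typing import Dict, List, Tuple


class RemoteShuffleService:
    """Toy in-memory stand-in for a remote shuffle service (hypothetical)."""

    def __init__(self, num_partitions: int) -> None:
        self.num_partitions = num_partitions
        # partition id -> (key, value) pairs pushed by any map task
        self._partitions: Dict[int, List[Tuple[str, int]]] = {
            pid: [] for pid in range(num_partitions)
        }

    def partition_for(self, key: str) -> int:
        # Deterministic partitioner so the same key always lands together.
        return sum(key.encode("utf-8")) % self.num_partitions

    def push(self, key: str, value: int) -> None:
        """Called by map tasks; nothing is written to the task's local disk."""
        self._partitions[self.partition_for(key)].append((key, value))

    def fetch(self, partition_id: int) -> List[Tuple[str, int]]:
        """Called by reduce tasks to read one partition."""
        return list(self._partitions[partition_id])


def run_word_count(
    map_inputs: List[List[str]], service: RemoteShuffleService
) -> Dict[str, int]:
    # Map phase: each input split pushes (word, 1) pairs to the service.
    for words in map_inputs:
        for word in words:
            service.push(word, 1)
    # Reduce phase: one reducer per partition aggregates what it fetches.
    totals: Dict[str, int] = {}
    for pid in range(service.num_partitions):
        for key, value in service.fetch(pid):
            totals[key] = totals.get(key, 0) + value
    return totals


service = RemoteShuffleService(num_partitions=4)
counts = run_word_count([["flink", "spark"], ["spark", "paimon", "spark"]], service)
print(counts["spark"])  # 3
```

Because the shuffle data lives in the service rather than on the compute node, compute nodes can be added or released without losing intermediate results, which is what makes this design a natural fit for serverless execution.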

Serverless Storage :

**OSS-HDFS** – a fully managed, HDFS-compatible file system built on OSS, providing cloud HDFS with virtually unlimited capacity and pay-as-you-go elasticity.

AI-Driven Operations : Tools such as EMR Doctor and Flink Advisor combine knowledge bases with machine-learning models to diagnose cluster issues automatically. The talk reports a 30% reduction in problem-identification time and a 75% improvement in resource utilization.
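The knowledge-base half of such a diagnostic tool can be pictured as a set of rules evaluated against cluster metrics. The sketch below is purely illustrative: it does not describe how EMR Doctor or Flink Advisor work internally, and the rule names, metric names, and thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    name: str
    matches: Callable[[Metrics], bool]
    advice: str


# Hypothetical knowledge base; metric names and thresholds are illustrative only.
KNOWLEDGE_BASE: List[Rule] = [
    Rule(
        name="low_cpu_utilization",
        matches=lambda m: m.get("cpu_utilization", 1.0) < 0.2,
        advice="Cluster looks over-provisioned; consider scaling in.",
    ),
    Rule(
        name="slow_checkpoint",
        matches=lambda m: m.get("checkpoint_duration_s", 0.0) > 60.0,
        advice="Checkpoints are slow; inspect state size and backpressure.",
    ),
    Rule(
        name="gc_pressure",
        matches=lambda m: m.get("gc_time_ratio", 0.0) > 0.3,
        advice="High GC overhead; review heap size and object churn.",
    ),
]


def diagnose(metrics: Metrics) -> List[str]:
    """Return the names of all rules whose condition holds for these metrics."""
    return [rule.name for rule in KNOWLEDGE_BASE if rule.matches(metrics)]


sample = {"cpu_utilization": 0.1, "checkpoint_duration_s": 12.0, "gc_time_ratio": 0.5}
print(diagnose(sample))  # ['low_cpu_utilization', 'gc_pressure']
```

A learned model would typically complement rules like these, for example by flagging anomalies that no hand-written threshold covers.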

Future Directions : A fully managed serverless vector‑search service based on Milvus will be introduced, completing the AI‑enabled big data ecosystem.
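For readers unfamiliar with vector search, the core operation is finding the stored vectors most similar to a query vector. The sketch below shows an exact, brute-force version using cosine similarity. It is not the Milvus API, and the tiny `index` and the `top_k` helper are hypothetical. Production systems such as Milvus generally rely on approximate nearest-neighbor indexes to avoid scanning every vector.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], index: Dict[str, Sequence[float]], k: int
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical two-dimensional embeddings keyed by document id.
index = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.0, 1.0],
    "doc-c": [1.0, 1.0],
}

results = top_k([1.0, 0.1], index, k=2)
print([doc_id for doc_id, _ in results])  # ['doc-a', 'doc-c']
```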

The presentation concluded with a call to adopt these innovations to serve customers and gather feedback.
