Open Source Big Data Platform 3.0: Streaming Lakehouse, Serverless Architecture, and AI Integration

The talk outlines the evolution of Alibaba Cloud's open‑source big data platform from Hadoop‑based EMR to a 3.0 architecture featuring a streaming lakehouse, full serverless compute and storage, AI‑driven operations, and upcoming vector search services, highlighting technical motivations, challenges, and product releases.

Big Data Technology Architecture

Nov 14, 2023

Speaker and Topic: Wang Feng, Alibaba Cloud researcher and head of the open‑source big data platform, presented “Open Source Big Data Platform 3.0 Technical Interpretation”.

Historical Background: The platform grew out of Alibaba's internal Hadoop usage in 2009. It was first offered as the cloud‑native product EMR (the 1.0 stage) and later incorporated Apache Flink for real‑time streaming, which marked the 2.0 stage and its data‑lake‑centric architecture.

3.0 Technical Explorations:

**Streaming Lakehouse** – a unified real‑time data‑warehouse architecture that merges streaming analytics with lakehouse storage.

**Serverless Transformation** – all core compute and storage components have been re‑engineered to be serverless, enabling elastic resource allocation.

**AI Integration** – AI is embedded for intelligent operations, data management, and future vector‑search services.

Streaming Lakehouse Details: The lakehouse architecture separates storage from compute, which improves scalability and query performance. However, the existing table formats (Iceberg, Delta Lake, Hudi) make fine‑grained real‑time updates difficult. To address this, the community launched Apache Paimon, a lake format designed for real‑time upserts and compatible with Flink, Spark, Presto, and StarRocks. The talk reported more than 4× faster upserts and more than 10× faster scans than Hudi.
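To make "real‑time upsert" concrete, here is a minimal sketch of the general merge‑on‑read idea behind LSM‑style primary‑key tables, which is the approach Paimon takes for such tables: writes land as small key‑sorted runs, and a reader merges the runs so that the newest record per key wins. This is a toy illustration, not Paimon's code or API; `Record`, `merge_runs`, and the sample data are hypothetical names invented for this example.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class Record:
    key: str               # primary key
    seq: int               # write sequence number; larger means newer
    value: Optional[dict]  # None is a tombstone, i.e. a delete


def merge_runs(runs: Iterable[List[Record]]) -> Iterator[Record]:
    """Merge key-sorted runs and yield the newest live record per key.

    Assumes every run is sorted by key and holds at most one record per key.
    """
    # heapq.merge streams the runs in (key, seq) order without loading
    # everything into memory, so the last record seen for a key is the newest.
    merged = heapq.merge(*runs, key=lambda r: (r.key, r.seq))
    latest: Optional[Record] = None
    for rec in merged:
        if latest is not None and rec.key != latest.key:
            if latest.value is not None:  # skip keys whose newest write is a delete
                yield latest
        latest = rec
    if latest is not None and latest.value is not None:
        yield latest


# Hypothetical data: one base run plus two small incremental runs.
base = [Record("a", 1, {"amount": 10}), Record("b", 2, {"amount": 20})]
updates = [Record("a", 3, {"amount": 15}), Record("c", 4, {"amount": 30})]
deletes = [Record("b", 5, None)]

result = list(merge_runs([base, updates, deletes]))
for rec in result:
    print(rec.key, rec.value)
```

In this sketch an upsert only appends a new small run and never rewrites the base data, which is what keeps writes cheap; the cost moves to reads, which is why such systems compact runs in the background.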

Serverless Compute Products:

**Serverless Flink** – integrates with Alibaba Cloud storage, provides a one‑click SQL development platform, and delivers a reported 2–3× performance improvement over open‑source Flink.

**Serverless StarRocks** – a fully managed OLAP engine with vectorized C++ execution, supporting massive concurrency.

**Serverless Spark** – a fully managed Spark service that draws on the strengths of the Flink and StarRocks offerings. It includes a serverless service for shuffle (intermediate) data based on Apache Celeborn, which removes the dependency on local disks.

Serverless Storage:

**OSS‑HDFS** – a fully managed, HDFS‑compatible file system built on OSS, providing a cloud HDFS whose capacity scales elastically and is billed pay‑as‑you‑go.

AI‑Driven Operations: Tools such as EMR Doctor and Flink Advisor use knowledge bases and machine‑learning models to diagnose cluster issues automatically. The talk reported a 30% reduction in problem‑identification time and a 75% improvement in resource utilization.

Future Directions: A fully managed serverless vector‑search service based on Milvus will be introduced, completing the AI‑enabled big data ecosystem.

The presentation concluded with a commitment to bring these innovations to customers and an invitation for their feedback.