How Open‑Source Big Data 3.0 Is Redefining Real‑Time, Serverless, and AI‑Driven Analytics
The talk traces the evolution of Alibaba Cloud's open‑source big data platform to version 3.0, highlighting three pillars: a streaming lakehouse architecture, a full serverless transformation, and AI‑enhanced operations. Together they enable real‑time analytics, higher performance, and smarter data management.
Speaker: Wang Feng, Alibaba Cloud researcher and head of the open‑source big data platform.
Key Insight: Real‑time processing and serverless are inevitable trends for the next generation of open‑source big data platforms.
From Hadoop to Cloud‑Native 3.0
Alibaba's open‑source big data platform began in 2009 with Hadoop (internal name "Yunti"). The first cloud product, EMR (E‑MapReduce), marked the 1.0 era and the start of cloud‑native development.
As real‑time needs grew, the team adopted Apache Flink, which has since become a global standard for streaming analytics. EMR evolved to a Flink‑based real‑time service, and the platform shifted to a data‑lake‑centric architecture, defining the 2.0 era.
In 2023 the team explored three directions for 3.0: a next‑generation Streaming Lakehouse, fully serverless delivery of all core compute and storage components, and deep AI integration for intelligent operations.
Next‑Generation Streaming Lakehouse
The new Streaming Lakehouse combines real‑time analytics with a lakehouse design, offering compute‑storage separation, better scalability, and improved query performance.
Current lakehouse formats (Iceberg, Delta Lake, Hudi) were designed primarily for batch workloads, which limits fine‑grained real‑time updates. To address this, the team created Apache Paimon, a lake format designed around real‑time updates that works with Flink, Spark, Presto, and StarRocks. According to the talk, it delivers up to 4× faster upserts and 10× faster scans than Hudi.
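To make "fine‑grained real‑time updates" concrete, here is a minimal Python sketch of an upsert table that appends small runs on every commit and merges them at read time. Paimon's documentation describes an LSM‑style organization for primary‑key tables; this toy only illustrates that general idea and is not Paimon's implementation. The class name ToyPrimaryKeyTable, its methods, and the sample order rows are all hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


class ToyPrimaryKeyTable:
    """Toy upsert table: each commit appends a small run; reads merge the runs."""

    def __init__(self, key: str) -> None:
        self.key = key
        # Each run maps primary key -> row, or None for a delete marker.
        self.runs: List[Dict[Any, Optional[Row]]] = []

    def commit(self, upserts: List[Row], deletes: Tuple[Any, ...] = ()) -> None:
        """Write one new run without rewriting previously written data."""
        run: Dict[Any, Optional[Row]] = {}
        for row in upserts:
            run[row[self.key]] = dict(row)
        for deleted_key in deletes:
            run[deleted_key] = None
        self.runs.append(run)

    def _merge(self) -> Dict[Any, Optional[Row]]:
        """Apply runs oldest to newest so the latest write per key wins."""
        merged: Dict[Any, Optional[Row]] = {}
        for run in self.runs:
            merged.update(run)
        return merged

    def scan(self) -> List[Row]:
        """Merge-on-read: return live rows, sorted by primary key."""
        live = [row for row in self._merge().values() if row is not None]
        return sorted(live, key=lambda row: row[self.key])

    def compact(self) -> None:
        """Fold all runs into a single run and drop delete markers."""
        merged = self._merge()
        self.runs = [{k: v for k, v in merged.items() if v is not None}]


orders = ToyPrimaryKeyTable(key="order_id")
orders.commit([
    {"order_id": 1, "status": "created"},
    {"order_id": 2, "status": "created"},
])
# A later commit updates order 1 and deletes order 2.
orders.commit([{"order_id": 1, "status": "paid"}], deletes=(2,))
print(orders.scan())
```

The point of the design is that a commit only appends a small run, so updates are cheap and become visible quickly, while the cost of reconciling versions is deferred to reads and to background compaction.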
By integrating Flink and Paimon, the platform provides end‑to‑end real‑time data ingestion, ETL, and analytics using a unified SQL layer, while remaining open to other engines such as Spark and Presto.
Full Serverless Transformation
Serverless has been explored for years; the first serverless product was Serverless Flink. In 2023 four new serverless products were launched:
EMR Serverless StarRocks (OLAP)
EMR Serverless Spark (batch & streaming)
OSS‑HDFS (serverless HDFS‑compatible storage)
A fully managed serverless metadata service compatible with the Hive Metastore (HMS) protocol
These products share a common serverless platform that abstracts heterogeneous hardware, provides multi‑tenant isolation, and enables rapid product iteration.
AI‑Powered Intelligent Operations
AI capabilities have been embedded into the platform through tools like EMR Doctor and Flink Advisor, reducing average issue detection time by 30% and improving resource utilization by 75%.
These tools capture operational knowledge in a knowledge base and apply machine‑learning models to automatically diagnose problems, suggest optimizations, and perform health checks.
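The knowledge‑base idea can be sketched in a few lines of Python. This is an assumption‑laden illustration, not how EMR Doctor or Flink Advisor work: simple threshold rules stand in for the machine‑learning models, and every rule name, metric name, and threshold below is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass(frozen=True)
class Rule:
    """One entry of operational knowledge: a symptom check plus advice."""
    name: str
    matches: Callable[[Metrics], bool]
    advice: str


# Hypothetical knowledge base; metric names and thresholds are illustrative only.
KNOWLEDGE_BASE: List[Rule] = [
    Rule(
        name="backpressure",
        matches=lambda m: m.get("backpressure_ratio", 0.0) > 0.5,
        advice="Raise the parallelism of the slowest operator.",
    ),
    Rule(
        name="over_provisioned",
        matches=lambda m: m.get("cpu_utilization", 1.0) < 0.2,
        advice="Reduce the CPU allocated to the job.",
    ),
    Rule(
        name="slow_checkpoint",
        matches=lambda m: m.get("checkpoint_seconds", 0.0) > 60.0,
        advice="Review state size and checkpoint storage throughput.",
    ),
]


def diagnose(metrics: Metrics, rules: List[Rule] = KNOWLEDGE_BASE) -> List[str]:
    """Run a health check: return advice for every rule the metrics trigger."""
    return [f"{rule.name}: {rule.advice}" for rule in rules if rule.matches(metrics)]


job_metrics = {"backpressure_ratio": 0.8, "cpu_utilization": 0.6, "checkpoint_seconds": 10.0}
for finding in diagnose(job_metrics):
    print(finding)
```

A missing metric never triggers a rule here, which keeps the check conservative when a job does not report everything.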
Vector Search as a New Service
Recognizing the surge in vector‑based retrieval, the platform will offer a fully managed, serverless vector search service built on Milvus, integrated with Alibaba Cloud PAI and large‑model capabilities.
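For readers new to vector‑based retrieval, the core operation is finding the stored vectors most similar to a query vector. The Python sketch below does this by exact brute‑force cosine similarity. It does not use Milvus or its API, and production vector databases typically rely on approximate indexes instead. The function names and the tiny two‑dimensional index are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings; real ones usually have hundreds of dimensions.
toy_index = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [1.0, 1.0],
}
results = top_k([1.0, 0.0], toy_index, k=2)
print([doc_id for doc_id, _ in results])
```

Brute force scans every vector on each query, which is why a managed service with purpose‑built indexes becomes attractive as collections grow.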
Conclusion
The 3.0 roadmap combines streaming lakehouse, full serverless, and AI integration to deliver a more real‑time, scalable, and intelligent big data ecosystem for customers.
