Real-time Big Data Analytics with Apache Paimon and the Streaming Lakehouse Architecture

This article summarizes Wang Feng's presentation on the next‑generation Lakehouse architecture, explaining how Apache Paimon provides a unified, real‑time data lake format that bridges batch and streaming workloads, enabling low‑latency analytics and AI integration for modern big‑data applications.

Big Data Technology & Architecture

Jun 16, 2024

This article is compiled from Wang Feng's talk at the Streaming Lakehouse Meetup on May 16. It covers five topics: the idea that Data Lake + Data Warehouse = Data Lakehouse; Apache Paimon as a unified lake format; the past, present, and future of Apache Paimon; the emergence of the Streaming Lakehouse; and Paimon's designation as Alibaba's unified data lake format.

Lakehouse combines the strengths of data lakes and data warehouses, allowing unified storage of structured, semi‑structured, and unstructured data, and seamless integration with batch processing, streaming, OLAP, and machine‑learning/AI workloads.

Real‑time analytics on a Lakehouse requires two things: a real‑time compute engine and a data format that supports continuous updates. According to the talk, existing lake formats such as Iceberg, Hudi, and Delta Lake were designed primarily around batch workloads and are less suited to frequent, low‑latency updates.

Apache Paimon addresses this gap by offering a streaming‑first storage format that supports low‑latency updates, CDC semantics, and batch operations, and integrates smoothly with engines like Apache Flink, Spark, Trino, Presto, and StarRocks.
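To make "CDC semantics" concrete, here is a minimal sketch in plain Python of how a stream of change records can be folded into a primary‑key table. It does not use Paimon's or Flink's APIs. The `Change` type, the `apply_changelog` helper, and the order data are hypothetical names introduced for illustration. Only the four row kinds (`+I`, `-U`, `+U`, `-D`) follow the changelog notation used by Flink.

```python
from enum import Enum
from typing import Dict, Iterable, NamedTuple


class RowKind(Enum):
    """Changelog row kinds, using the short names Flink prints for them."""
    INSERT = "+I"
    UPDATE_BEFORE = "-U"
    UPDATE_AFTER = "+U"
    DELETE = "-D"


class Change(NamedTuple):
    kind: RowKind
    key: int      # primary key (hypothetical order id)
    value: dict   # remaining columns


def apply_changelog(table: Dict[int, dict], changes: Iterable[Change]) -> Dict[int, dict]:
    """Apply changes in arrival order to an in-memory primary-key table."""
    for change in changes:
        if change.kind in (RowKind.INSERT, RowKind.UPDATE_AFTER):
            table[change.key] = change.value   # upsert: latest row per key wins
        elif change.kind is RowKind.DELETE:
            table.pop(change.key, None)        # remove the key if it is present
        # UPDATE_BEFORE carries the old row image; a keyed upsert can ignore it.
    return table


# Hypothetical change stream captured from an upstream orders table.
changes = [
    Change(RowKind.INSERT, 1, {"status": "created", "amount": 30}),
    Change(RowKind.INSERT, 2, {"status": "created", "amount": 12}),
    Change(RowKind.UPDATE_BEFORE, 1, {"status": "created", "amount": 30}),
    Change(RowKind.UPDATE_AFTER, 1, {"status": "paid", "amount": 30}),
    Change(RowKind.DELETE, 2, {"status": "created", "amount": 12}),
]

orders = apply_changelog({}, changes)
print(orders)
```

After the five changes, only order 1 remains, in its updated `paid` state. A batch‑oriented format would typically reach the same state by rewriting files on a schedule, whereas a streaming‑first format is designed to absorb such changes continuously.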

Paimon originated as the Flink Table Store project inside the Apache Flink community. It entered the Apache Incubator in 2023 under the name Paimon and graduated to a top‑level Apache project in 2024. It now integrates with the major analytics engines, and a planned 1.0 release aims to unify streaming, batch, and OLAP analysis while remaining compatible with formats such as Iceberg.

The Streaming Lakehouse concept leverages Flink + Paimon to create an end‑to‑end real‑time data pipeline, reducing latency from hour‑level to seconds and enabling a true streaming Lakehouse architecture.
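One way to picture why such a pipeline can stay fresh is that each downstream layer consumes only the changes from the layer above instead of rescanning it. The sketch below, in plain Python, keeps a running per‑city total up to date from a changelog. It is an illustration under assumptions, not Flink or Paimon code: `update_totals`, the city names, and the amounts are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# A change is (kind, city, amount); kinds follow Flink's changelog notation.
Change = Tuple[str, str, int]


def update_totals(totals: Dict[str, int], changes: Iterable[Change]) -> Dict[str, int]:
    """Fold a changelog into running per-city totals without rescanning the table."""
    for kind, city, amount in changes:
        if kind in ("+I", "+U"):
            totals[city] += amount   # new row image adds to the aggregate
        elif kind in ("-U", "-D"):
            totals[city] -= amount   # old row image is retracted
        else:
            raise ValueError(f"unknown change kind: {kind}")
    return totals


totals: Dict[str, int] = defaultdict(int)

# First set of changes arriving from the upstream detail table.
update_totals(totals, [("+I", "Hangzhou", 30), ("+I", "Beijing", 12)])

# A later correction: the Hangzhou order amount changes from 30 to 45.
update_totals(totals, [("-U", "Hangzhou", 30), ("+U", "Hangzhou", 45)])

print(dict(totals))
```

The correction touches only the affected aggregate: the old value is retracted and the new one is added, so the summary reflects the change as soon as it arrives rather than at the next scheduled batch run.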

Within Alibaba, Paimon has been adopted as the unified data lake format across products such as Flink, Spark, StarRocks, MaxCompute, and Hologres, and will be offered as a cloud solution to help enterprises perform real‑time analytics at scale.

Finally, the presenter thanks the audience and encourages continued engagement with the Apache Paimon community.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
