
Apache Paimon (Incubating): A Streaming Lakehouse Storage Project Overview

Apache Paimon, newly incubated by the Apache Software Foundation, combines Flink's real‑time streaming capabilities with open lakehouse storage formats, offering high‑throughput, low‑latency data ingestion, partial‑update merges, and seamless integration with engines like Flink, Spark, and Trino for unified batch and streaming analytics.

Big Data Technology & Architecture

On March 12, 2023, the Flink Table Store project was accepted into the Apache Software Foundation incubator and renamed Apache Paimon (incubating).

As Flink has matured, more enterprises have adopted it for stream processing, and lakehouse architectures have emerged alongside it. Paimon aims to combine Flink's streaming capabilities with the advantages of a lakehouse, bringing real‑time data flow to the data lake.

Existing lake storage formats were designed mainly for batch workloads and fall short for streaming, so the community created Flink Table Store (FTS), a streaming‑oriented lake storage project. It now has contributors from companies such as Alibaba Cloud, ByteDance, Confluent, Tongcheng Travel, Bilibili, and others.

Paimon uses open data formats (ORC, Parquet, Avro) and supports engines such as Flink, Spark, Hive, and Trino, with integrations for Doris and StarRocks planned.

Paimon organizes its data files as an LSM (log‑structured merge) tree. Because writes are append‑only, this structure supports high‑performance large‑scale updates, merges, and queries.

The LSM structure provides high‑performance updates via Minor Compaction, efficient ordered merges, and query acceleration through primary‑key file skipping.
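
The two ideas in that paragraph, ordered merging and key-range file skipping, can be sketched in a few lines of Python. This is a toy in-memory model, not Paimon's implementation: `SortedFile`, `merge_files`, and `lookup` are hypothetical names, and each "file" is just a sorted list tagged with a sequence number and an implied min/max key.

```python
import heapq
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SortedFile:
    """Hypothetical stand-in for one sorted data file in an LSM tree."""
    seq: int                      # larger seq = written later
    rows: List[Tuple[int, str]]   # (primary key, value), sorted by key

    @property
    def min_key(self) -> int:
        return self.rows[0][0]

    @property
    def max_key(self) -> int:
        return self.rows[-1][0]


def merge_files(files: List[SortedFile]) -> List[Tuple[int, str]]:
    """Ordered merge of sorted files; the newest write wins for each key."""
    # Tag rows with -seq so that, for equal keys, the newest file sorts first.
    streams = [[(key, -f.seq, value) for key, value in f.rows] for f in files]
    merged: List[Tuple[int, str]] = []
    last_key = None
    for key, _, value in heapq.merge(*streams):
        if key != last_key:       # first occurrence of a key is its newest version
            merged.append((key, value))
            last_key = key
    return merged


def lookup(files: List[SortedFile], key: int) -> Tuple[Optional[str], int]:
    """Point lookup; returns (value, number of files actually opened)."""
    # Skip every file whose [min_key, max_key] range cannot contain the key.
    candidates = [f for f in files if f.min_key <= key <= f.max_key]
    for f in sorted(candidates, key=lambda f: -f.seq):   # newest first
        for k, v in f.rows:
            if k == key:
                return v, len(candidates)
    return None, len(candidates)


files = [
    SortedFile(seq=1, rows=[(1, "a"), (3, "c"), (5, "e")]),
    SortedFile(seq=2, rows=[(3, "C"), (4, "d")]),
    SortedFile(seq=3, rows=[(10, "x"), (12, "y")]),
]
print(merge_files(files))
print(lookup(files, 3))
```

In this sketch a compaction is simply `merge_files` over a few small files, and the lookup for key 3 opens two of the three files because the third covers keys 10 to 12 only.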

Integration with Flink CDC enables real‑time synchronization of MySQL tables (including schema changes) to Paimon with minimal resource consumption.

Paimon's Partial‑Update engine merges streams by primary key, supporting both batch reads (with projection push‑down) and streaming reads (delivering fully merged rows).
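
As a rough illustration of merging by primary key, the sketch below assumes that a later non-null field overwrites an earlier value and that a null field leaves the stored value untouched. The function names and the `price`/`status` columns are made up for the example; this is not Paimon's API.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Row = Dict[str, Optional[object]]


def partial_update_merge(records: Iterable[Tuple[int, Row]]) -> Dict[int, Row]:
    """Merge records by primary key; later non-null fields overwrite earlier ones."""
    table: Dict[int, Row] = {}
    for key, fields in records:
        merged = table.setdefault(key, {})
        for name, value in fields.items():
            # A null only fills a column that has never been written for this key.
            if value is not None or name not in merged:
                merged[name] = value
    return table


def batch_read(table: Dict[int, Row], columns: List[str]) -> Dict[int, Row]:
    """Batch read with projection: return only the requested columns."""
    return {key: {c: row.get(c) for c in columns} for key, row in table.items()}


# Two hypothetical upstream streams write different columns of the same table.
records = [
    (1, {"price": 10, "status": None}),         # from an "orders" stream
    (1, {"price": None, "status": "shipped"}),  # from a "shipments" stream
    (2, {"price": 7, "status": None}),
    (1, {"price": 12, "status": None}),         # later price update for key 1
]
table = partial_update_merge(records)
print(table)                          # fully merged rows, as a streaming read would deliver
print(batch_read(table, ["status"]))  # projection keeps only the status column
```

Key 1 ends up with the latest price and the status written by the other stream, which is the fully merged row a downstream reader would expect.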

Changelog generation ensures downstream consumers receive correct change events, even when input lacks full changelog information or when using partial‑update tables.
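
One simple way to picture changelog generation is to compare the merged table before and after a batch of writes and emit a change event per difference. The sketch below does exactly that, using Flink's row-kind short names (`+I`, `-U`, `+U`, `-D`). It is a conceptual model under that assumption, not a description of how Paimon produces changelogs internally, and `generate_changelog` is a hypothetical name.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Event = Tuple[str, int, Row]


def generate_changelog(before: Dict[int, Row], after: Dict[int, Row]) -> List[Event]:
    """Diff two snapshots keyed by primary key into ordered change events."""
    events: List[Event] = []
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old is None:
            events.append(("+I", key, new))   # insert: key is new
        elif new is None:
            events.append(("-D", key, old))   # delete: key disappeared
        elif old != new:
            events.append(("-U", key, old))   # update: retract the old row...
            events.append(("+U", key, new))   # ...then emit the new row
    return events


before = {1: {"status": "paid"}, 2: {"status": "paid"}}
after = {1: {"status": "shipped"}, 3: {"status": "new"}}
for event in generate_changelog(before, after):
    print(event)
```

Emitting the `-U` row alongside the `+U` row is what lets a downstream aggregation subtract the old value before adding the new one, even if the original input only carried the latest values.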

The project has released three versions, with version 0.4 planned for April, and continues to invest in real‑time capabilities, ecosystem expansion, and data‑warehouse completeness.

The project thanks the Flink community; project champion Li Yu; mentors Qin Jiangjie, Robert Metzger, and Stephan Ewen; and contributors from Alibaba, ByteDance, Confluent, Tongcheng Travel, Bilibili, and many others.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
