Introducing Apache Paimon: An Open‑Source Streaming Lakehouse Storage Engine
Apache Paimon is an open‑source streaming data lake storage system that combines LSM‑based real‑time updates, open file formats, and deep integration with Flink, Spark, and Trino to deliver high‑throughput ingestion, low‑latency queries, and unified batch‑stream processing for modern big‑data workloads.
On March 12, 2023, the Flink Table Store project entered the Apache Software Foundation Incubator as an independent project and was renamed Apache Paimon (incubating).
Apache Paimon is a streaming data lake storage engine that provides high‑throughput, low‑latency data ingestion, streaming subscription, and real‑time queries, and it integrates with Flink, Spark, Trino, and other compute engines.
It uses open file formats (ORC, Parquet, Avro) on distributed file systems and adopts an LSM‑based architecture combined with columnar storage to achieve large‑scale real‑time updates.
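The LSM design brings three benefits: writes stay fast and stable because incoming data is appended and only minor compactions run on the write path; merges are efficient because the files being merged are already sorted; and queries are fast because data is ordered by primary key, so files whose key range cannot match can be skipped.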
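The three LSM properties described above can be illustrated with a toy in‑memory model. This is only a sketch under simplifying assumptions, not Paimon's implementation or API: the names SortedRun, flush, compact, and lookup are hypothetical, files are modelled as Python lists, and the only statistics kept per file are the minimum and maximum primary key.

```python
import heapq
from dataclasses import dataclass


@dataclass
class SortedRun:
    """Hypothetical stand-in for an immutable data file sorted by primary key."""
    seq: int     # higher seq means newer data
    rows: list   # list of (key, value) pairs, sorted by key

    @property
    def min_key(self):
        return self.rows[0][0]

    @property
    def max_key(self):
        return self.rows[-1][0]


def flush(seq, buffer):
    """Write path: sort the in-memory buffer once and emit a new run."""
    return SortedRun(seq, sorted(buffer.items()))


def compact(runs):
    """Merge already-sorted runs, keeping the newest value for each key."""
    # Tag rows with -seq so that, for equal keys, the newest run sorts first.
    streams = [[(k, -r.seq, v) for k, v in r.rows] for r in runs]
    merged, last_key = [], object()
    for key, _, value in heapq.merge(*streams):
        if key != last_key:
            merged.append((key, value))
            last_key = key
    return SortedRun(max(r.seq for r in runs), merged)


def lookup(runs, key):
    """Point query; returns (value, number_of_runs_actually_scanned)."""
    scanned = 0
    for run in sorted(runs, key=lambda r: r.seq, reverse=True):
        if not (run.min_key <= key <= run.max_key):
            continue  # skip the file: its key range cannot contain the key
        scanned += 1
        for k, v in run.rows:
            if k == key:
                return v, scanned
    return None, scanned


run1 = flush(1, {1: "a", 2: "b", 3: "c"})
run2 = flush(2, {2: "B", 10: "j"})
run3 = flush(3, {20: "x", 30: "y"})
value, scanned = lookup([run1, run2, run3], 10)
compacted = compact([run1, run2])
```

In this sketch the lookup for key 10 never opens run3, because its key range (20 to 30) rules it out, and compacting run1 with run2 keeps the newer value for key 2. A real engine would additionally organise runs into levels and store files in a columnar format.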
Recent versions embed Flink CDC, allowing real‑time synchronization of MySQL tables (including schema changes) to Paimon with minimal resource consumption.
Paimon’s partial‑update engine merges streams by primary key to produce wide tables, supporting both batch reads with projection push‑down and streaming reads of fully merged data.
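To make the partial‑update idea concrete, here is a minimal sketch of merging several streams into one wide row per primary key. It assumes that each record carries only the columns its stream knows about and that a missing or None column never overwrites an existing value; the function names partial_update and project, and the column names, are hypothetical and are not Paimon APIs.

```python
def partial_update(records, columns):
    """Merge records by primary key, keeping the latest non-None value per column."""
    table = {}
    for rec in records:  # records are assumed to arrive in write order
        row = table.setdefault(rec["id"], {c: None for c in columns})
        for col in columns:
            if rec.get(col) is not None:
                row[col] = rec[col]
    return table


def project(table, wanted):
    """Batch read with projection: return only the requested columns."""
    return {pk: {c: row[c] for c in wanted} for pk, row in table.items()}


# Two logical streams (orders and payments) interleaved in arrival order.
records = [
    {"id": 1, "item": "book"},
    {"id": 1, "paid": True},
    {"id": 2, "item": "pen"},
    {"id": 1, "item": "ebook"},
]
wide = partial_update(records, ["item", "paid"])
paid_only = project(wide, ["paid"])
```

Here the row for id 1 ends up with both the latest item and the payment flag, even though no single input record contained both, while id 2 still has no payment information. A streaming reader of the merged table would see these fully merged rows rather than the individual fragments.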
As unified stream‑batch storage, Paimon supports both stream write/stream read and batch write/batch read. This enables OLAP queries over historical and fresh data alike, and Paimon can generate a changelog so that downstream streaming jobs compute accurate results.
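Why a changelog matters for downstream accuracy can be shown with a small sketch. It assumes the input is a sequence of (key, value) upserts in which a value of None means delete, and it emits the four row kinds used by Flink (+I insert, -U update‑before, +U update‑after, -D delete). The functions to_changelog and running_sum are hypothetical illustrations, not part of Paimon.

```python
def to_changelog(upserts):
    """Turn (key, value) upserts into a full changelog; value None means delete."""
    state, log = {}, []
    for key, value in upserts:
        old = state.get(key)
        if value is None:
            if key in state:
                log.append(("-D", key, state.pop(key)))
        elif key not in state:
            log.append(("+I", key, value))
            state[key] = value
        elif old != value:
            log.append(("-U", key, old))    # retract the previous value
            log.append(("+U", key, value))  # then emit the new one
            state[key] = value
    return log


def running_sum(changelog):
    """Downstream aggregation that stays correct by applying retractions."""
    total = 0
    for kind, _, value in changelog:
        total += value if kind in ("+I", "+U") else -value
    return total


log = to_changelog([("a", 1), ("b", 5), ("a", 2), ("a", 2), ("b", None)])
total = running_sum(log)
```

Without the -U and -D entries, a downstream sum would count key a twice and never remove key b; with the full changelog, the result reflects only the current value of a.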
Three versions were released under the Flink Table Store name; Paimon 0.4 is planned for April 2023, and the community continues to invest in real‑time capabilities, ecosystem integration, and data‑warehouse completeness.
The project thanks contributors from Alibaba, ByteDance, Confluent, Tongcheng Travel, Bilibili, and the Apache Flink community, and provides contact links to the website, GitHub repository, and community chat groups.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.