Big Data 23 min read

Comprehensive Guide to Apache Paimon and Advanced Flink Integration

This article provides an in‑depth overview of Apache Paimon as a streaming lakehouse, explains its core features, file layout, consistency guarantees, and offers detailed guidance on integrating and tuning Paimon with Apache Flink for both write and read performance, multi‑writer concurrency, table management, and bucket rescaling.

Big Data Technology & Architecture

Dec 8, 2023

Comprehensive Guide to Apache Paimon and Advanced Flink Integration

1.1 Introduction

Flink aims to combine its streaming compute capabilities with the advantages of the lakehouse architecture, creating a next‑generation Streaming Lakehouse that enables real‑time data flow on a data lake and offers a unified batch‑and‑stream development experience. The Flink Table Store (FTS) project was incubated into Apache Paimon (incubating) on March 12, 2023.

Apache Paimon is a streaming data‑lake platform that supports high‑speed ingestion, change‑log tracking, and efficient real‑time analytics.

Reading: Paimon can consume data from historical snapshots (batch mode), the latest offset (stream mode), or a hybrid incremental snapshot.

Writing: It supports stream sync from database change logs (CDC) and batch insert/overwrite from offline data.

Paimon integrates with Apache Hive, Apache Spark, Trino and other compute engines besides Flink.

Internally, Paimon stores columnar files on a file system or object storage and uses an LSM‑tree structure to support massive updates and high‑performance queries.

1.2 Core Features

1) Unified batch and stream processing – supports batch writes/reads, stream updates, and change‑log generation.

2) Data‑lake capabilities – low cost, high reliability, and scalable metadata.

3) Multiple merge engines – users can choose to keep the latest record, perform partial updates, or aggregate records.

4) Change‑log generation – Paimon can generate complete change logs from any data source, simplifying stream analytics.

5) Rich table types – supports primary‑key tables and append‑only tables that provide ordered stream reads as an alternative to message queues.

6) Schema evolution – full support for renaming and reordering columns.

1.3 Basic Concepts

Snapshot : Captures the state of a table at a specific point in time; users can query the latest snapshot or perform time‑travel to earlier snapshots.

Partition : Same concept as Hive; optional partitioning by columns such as date, city, or department. Partition keys must be a subset of the primary key if a primary key is defined.

Bucket : Within a (partitioned) table, data is further divided into buckets based on a hash of one or more columns. Buckets are the smallest read/write unit and affect parallelism; a typical bucket size is around 1 GB.

Consistency Guarantees : Paimon writers use a two‑phase commit protocol, generating up to two snapshots per commit. Commits that modify different buckets are serializable; commits that touch the same bucket provide snapshot isolation.

1.4 File Layout

All table files reside under a base directory and are organized hierarchically.

Snapshot Files : JSON files stored in the snapshot directory, containing the schema in use and a list of manifest lists for the snapshot.

Manifest Files : Stored in the manifest directory; manifest lists contain names of manifest files, each of which records information about LSM data files and change‑log files.

Data Files : Grouped by partition and bucket; Paimon currently supports ORC (default), Parquet, and Avro formats.

LSM Trees : Files are organized into Sorted Runs; each Sorted Run consists of one or more data files with non‑overlapping primary‑key ranges. Compaction merges Sorted Runs to limit their number and improve query performance.

2.1 Advanced Flink Integration – Write Performance

Write performance is closely tied to checkpointing. Recommendations include increasing checkpoint intervals, using batch mode, enlarging the write buffer, enabling buffer overflow, and adjusting bucket count for fixed‑bucket tables.

增加检查点间隔，或者仅使用批处理模式。
增加写入缓冲区大小。
启用写缓冲区溢出。
如果您使用固定存储桶模式，请重新调整存储桶数量。

Parallelism: The sink parallelism should be less than or equal to the number of buckets, preferably equal.

Compaction: When the number of Sorted Runs exceeds a threshold, the writer pauses writes to perform compaction. The threshold can be tuned via num-sorted-run.stop-trigger and sort-spill-threshold.

num-sorted-run.stop-trigger = 2147483647
sort-spill-threshold = 10

Memory usage is dominated by three areas: the writer’s shared buffer, memory consumed during compaction (adjustable via num-sorted-run.compaction-trigger), and memory for reading large rows during compaction.

2.2 Read Performance

Full compaction can be triggered periodically with the full-compaction.delta-commits option to ensure partitions are fully compacted before write completion.

Primary‑key tables use a MergeOnRead strategy, merging multiple LSM layers at read time; Append‑Only tables rely on automatic compaction to reduce small files.

Parquet reads are slightly faster than ORC due to query optimizations.

2.3 Multi‑Writer Concurrency

Paimon supports concurrent writes to different partitions. For concurrent writes to the same partition, a dedicated compaction job can be used to avoid write‑throughput instability and commit conflicts.

<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-0.5-SNAPSHOT.jar \
compact \
--warehouse ...
--database ...
--table ...
[--partition ...]
[--catalog-conf ...]

2.4 Table Management

Snapshot Expiration : Each commit creates one or two snapshots; expired snapshots are automatically cleaned up to free space.

Rollback : Use the rollback action to revert to a specific snapshot.

<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-0.5-SNAPSHOT.jar \
rollback-to \
--warehouse ...
--database ...
--table ...
--snapshot ...

Partition Management : Set partition.expiration-time to automatically drop expired partitions.

Small‑File Management : Small files increase NameNode pressure, storage cost, and query latency. Mitigation includes increasing checkpoint intervals, enlarging write-buffer-size, enabling write-buffer-spillable, and tuning compaction thresholds.

Bucket Rescaling

Users can change the total number of buckets with ALTER TABLE … SET ('bucket' = '…') and reorganize data using INSERT OVERWRITE without recreating the table.

-- rescale number of total buckets
ALTER TABLE table_identifier SET ('bucket' = '...');

-- reorganize data layout
INSERT OVERWRITE table_identifier [PARTITION (part_spec)]
SELECT ... FROM table_identifier [WHERE part_spec];

After changing bucket numbers, new streaming jobs must be paused (e.g., using a savepoint), the bucket count adjusted, and then a batch INSERT OVERWRITE job executed to rewrite existing data before resuming streaming.

Overall, the article walks through the architecture, core capabilities, file organization, and practical tuning tips for achieving high‑performance, reliable streaming lakehouse workloads with Apache Paimon and Flink.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Flink LSM‑Tree Data Lake streaming lakehouse Apache Paimon

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.