
Apache Paimon: Core Capabilities, Table Types, LSM Tree, Buckets, Merge Engines, and Operational Details

This article provides a comprehensive overview of Apache Paimon, covering its real‑time lake ingestion, unified stream‑batch processing, table types (primary‑key and append‑only), LSM‑tree storage, bucket mechanisms, merge‑engine options, compaction strategies, concurrency control, consumption methods, tag management, data cleanup, and system tables for big‑data workloads.

Big Data Technology & Architecture

This series of articles is designed for interview preparation and self‑assessment, offering a deep dive into Apache Paimon fundamentals.

Paimon's Core Capabilities as a Lake Storage

Apache Paimon is a unified stream‑batch data lake storage format that integrates with Flink and Spark to build real‑time lake‑warehouse architectures. It combines lake formats with LSM‑tree technology to provide real‑time lake updates and full stream processing capabilities:

Real‑time Ingestion : Supports real‑time synchronization from multiple databases, including MySQL, with high efficiency and low latency even at tens of millions of records.

Unified Stream‑Batch Processing : Leverages Flink for streaming and Spark for batch, offering consistent data semantics and reduced cost.

Broad Ecosystem Integration : Tight integration with many compute components such as Flink and Spark.

Efficient Query : Uses Deletion Vectors and indexes to boost query performance for streaming, batch, and OLAP workloads.

Table Types in Paimon

Paimon provides two main categories of tables:

Primary‑Key Tables : Define a primary key, allowing insert, update, and delete operations. Data is bucketed; each bucket contains an independent LSM‑tree and changelog file. Recommended bucket size is 200 MB‑1 GB.

Append‑Only Tables : Created when no primary key is defined. Only whole‑record inserts are allowed, suitable for log‑type data.

Append‑only tables support two modes: Scalable (recommended) and Queue.

What Is an LSM Tree?

An LSM (Log‑Structured Merge) tree is a data structure optimized for write‑intensive workloads. It consists of an in‑memory MemTable and on‑disk SSTables. Writes go to the MemTable, which is flushed to an SSTable when full; reads consult the MemTable first, then merge the sorted runs on disk from newest to oldest.

Paimon stores data files using LSM trees, organizing them into sorted runs: within a single sorted run, the key ranges of the files do not overlap, while keys may overlap across different runs.
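The read/write path described above can be illustrated with a minimal sketch. This is a toy model, not Paimon's actual implementation: writes land in an in-memory MemTable, a full MemTable flushes to an immutable sorted run, and reads merge runs from newest to oldest so the latest value wins.

```python
class TinyLSM:
    """Toy LSM tree: a MemTable plus a list of flushed sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}        # in-memory writes
        self.sorted_runs = []     # newest run last; each run is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Flush the MemTable as a new immutable sorted run (here: an in-memory list).
        self.sorted_runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # The MemTable always holds the freshest value.
        if key in self.memtable:
            return self.memtable[key]
        # Then search runs newest-first; a recent flush shadows older ones.
        for run in reversed(self.sorted_runs):
            for k, v in run:
                if k == key:
                    return v
        return None

lsm = TinyLSM()
lsm.put("a", 1)
lsm.put("b", 2)   # reaches the limit and triggers a flush
lsm.put("a", 3)   # newer value for "a" sits in the MemTable
assert lsm.get("a") == 3
assert lsm.get("b") == 2
```

The key property shown here is that no in-place update ever happens: newer data simply shadows older data until compaction merges the runs.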

Bucket Explanation


Concept: A table (partitioned or not) is further divided into multiple buckets to improve query efficiency.

Structure: Each bucket directory contains an LSM tree and its changelog file.

Assignment: Bucket assignment is based on the hash of one or more columns (the bucket‑key). If no bucket‑key is specified, the primary key (or, for append‑only tables, the whole record) is used.

Parallelism: Bucket is the smallest read/write unit; too many buckets cause many small files and degrade read performance. Recommended data size per bucket is 200 MB‑1 GB.

Fixed Bucket

Mechanism: Set a positive bucket count; records are assigned using Math.abs(key_hashcode % numBuckets).

Scalability: Changing the bucket count requires an offline process; too many buckets create many small files, too few reduce write performance.
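The fixed-bucket formula can be mimicked in a few lines. The sketch below mirrors the Java expression Math.abs(key_hashcode % numBuckets); note that Paimon's actual hash function differs, so the bucket ids here will not match a real table.

```python
def assign_bucket(key_hashcode: int, num_buckets: int) -> int:
    """Fixed-bucket assignment, mirroring Math.abs(key_hashcode % numBuckets)."""
    return abs(key_hashcode % num_buckets)

# The same key hash always lands in the same bucket:
assert assign_bucket(12349, 4) == assign_bucket(12349, 4)

# Changing the bucket count reshuffles assignments, which is why rescaling
# a fixed-bucket table requires an offline rewrite of the data:
assert assign_bucket(12349, 4) != assign_bucket(12349, 8)
```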

Dynamic Bucket

Dynamic bucket is the default mode for primary‑key tables; it can be set explicitly with 'bucket' = '-1'.

Allocation Strategy: New data may go to new buckets while old data stays in old buckets; Paimon uses an index to map keys to buckets and can automatically expand the bucket count.

Configuration Options:

dynamic-bucket.target-row-num: controls the target number of rows per bucket
dynamic-bucket.initial-buckets: controls the initial number of buckets

Limitation: Dynamic bucket supports a single write job; running multiple jobs on the same partition may cause duplicate data.
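The allocation strategy can be sketched as a key-to-bucket index plus a per-bucket row budget. This is a simplified model loosely based on dynamic-bucket.target-row-num, not Paimon's actual assigner: existing keys must stay in their bucket (so updates remain local), and new keys open a new bucket once existing ones are full.

```python
class DynamicBucketAssigner:
    """Toy dynamic-bucket assigner: index keys to buckets, grow buckets on demand."""

    def __init__(self, target_row_num=2):
        self.target_row_num = target_row_num
        self.key_to_bucket = {}   # the hash index: primary key -> bucket id
        self.bucket_rows = []     # current row count per bucket

    def assign(self, key):
        # Updates to an existing key must go to its original bucket.
        if key in self.key_to_bucket:
            return self.key_to_bucket[key]
        # Place new keys in the first bucket with room, else create a new bucket.
        for bucket, rows in enumerate(self.bucket_rows):
            if rows < self.target_row_num:
                break
        else:
            bucket = len(self.bucket_rows)
            self.bucket_rows.append(0)
        self.bucket_rows[bucket] += 1
        self.key_to_bucket[key] = bucket
        return bucket

a = DynamicBucketAssigner(target_row_num=2)
assert a.assign("k1") == 0
assert a.assign("k2") == 0
assert a.assign("k3") == 1     # bucket 0 is full, so a new bucket is created
assert a.assign("k1") == 0     # an update to k1 still goes to its original bucket
```

The sketch also makes the single-writer limitation intuitive: two jobs maintaining separate copies of key_to_bucket could assign the same key to different buckets, producing duplicates.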

Dynamic Bucket Modes and Cross‑Partition Updates

In the normal dynamic bucket mode, updates that do not cross partitions use a HASH index to maintain key‑bucket mapping, requiring about 1 GB memory for 100 million entries. Cross‑partition upserts need an index TTL configuration and may cause data duplication.

Three cross‑partition upsert modes are available:

deduplicate: delete the data from the old partition and insert it into the new partition.
partial-update & aggregation: insert the new data into the old partition.
first-row: ignore the new data if an old value exists.
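The difference between the three modes is easiest to see side by side. Below is an illustrative sketch (the table is modeled as a dict from primary key to a (partition, value) pair; all names are hypothetical, not Paimon APIs):

```python
def upsert(table, key, partition, value, mode):
    """Apply one cross-partition upsert under the given mode."""
    old = table.get(key)
    if old is None:
        table[key] = (partition, value)
        return
    old_partition, _old_value = old
    if mode == "deduplicate":
        # Delete from the old partition, insert into the new partition.
        table[key] = (partition, value)
    elif mode in ("partial-update", "aggregation"):
        # Keep the record in its old partition; merge in the new value.
        table[key] = (old_partition, value)
    elif mode == "first-row":
        # An old value exists, so the new record is ignored.
        pass

t = {"k": ("2024-01-01", "v1")}
upsert(t, "k", "2024-01-02", "v2", "deduplicate")
assert t["k"] == ("2024-01-02", "v2")       # record moved to the new partition

t = {"k": ("2024-01-01", "v1")}
upsert(t, "k", "2024-01-02", "v2", "partial-update")
assert t["k"] == ("2024-01-01", "v2")       # value updated, partition unchanged

t = {"k": ("2024-01-01", "v1")}
upsert(t, "k", "2024-01-02", "v2", "first-row")
assert t["k"] == ("2024-01-01", "v1")       # new record ignored
```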

Best Partition Key Choices for Production

Creation Time (recommended) : Immutable and can be added to the primary key.

Event Time : Suitable for CDC data; can be part of the primary key.

CDC Operation Timestamp (op_ts) : Not recommended as a partition key, because every operation produces a new timestamp, so updates to the same key scatter across partitions; this prevents deduplication and consumes extra resources.

Core principle: the partition key should be immutable and unique.

Paimon Table Modes and Their Characteristics

Paimon offers three write‑path modes for primary‑key tables, each built on LSM trees:

MOR (Merge On Read) : Default mode; merging is deferred to read time. Write performance is excellent, read performance is moderate.

COW (Copy On Write) : Full merge on each write; read performance is excellent, write performance suffers.

MOW (Merge On Write) : Generates deletion‑vector files on write; both read and write performance are good.

MOR: Suitable for write‑heavy, read‑insensitive scenarios.

COW: Suitable for read‑heavy, write‑light scenarios.

MOW: Balanced read/write workloads.
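The MOW advantage comes from deletion vectors: the writer records which row positions in each file have been invalidated, so the reader can filter rows directly instead of merging sorted runs in memory. A minimal illustrative sketch (not Paimon's file format):

```python
def read_with_deletion_vector(rows, deletion_vector):
    """MOW-style read: skip row positions marked deleted, no merge needed.

    rows: the rows of one data file, in file order
    deletion_vector: set of row positions invalidated by later writes
    """
    return [row for pos, row in enumerate(rows) if pos not in deletion_vector]

rows = ["r0", "r1", "r2", "r3"]
dv = {1, 3}                      # rows 1 and 3 were overwritten or deleted
assert read_with_deletion_vector(rows, dv) == ["r0", "r2"]
```

This is why MOW keeps reads fast without COW's full-rewrite cost: writes only append a small deletion-vector file rather than rewriting data files.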

File Composition of a Paimon Table

warehouse
└── default.db
    └── my_table
        ├── bucket-0
        │   └── data-59f60cb9-44af-48cc-b5ad-59e85c663c8f-0.orc
        ├── index
        │   └── index-5625e6d9-dd44-403b-a738-2b6ea92e20f1-0
        ├── manifest
        │   ├── index-manifest-5d670043-da25-4265-9a26-e31affc98039-0
        │   ├── manifest-6758823b-2010-4d06-aef0-3b1b597723d6-0
        │   ├── manifest-list-9f856d52-5b33-4c10-8933-a0eddfaa25bf-0
        │   └── manifest-list-9f856d52-5b33-4c10-8933-a0eddfaa25bf-1
        ├── schema
        │   └── schema-0
        └── snapshot
            ├── EARLIEST
            ├── LATEST
            └── snapshot-1

Key file types:

Snapshot Files : Capture table state at a specific point.

Manifest Files : List and describe data and changelog files.

Data Files : Stored per partition and bucket; formats include ORC (default), Parquet, Avro.

Partition Metadata : Optional Hive‑style partitioning.

LSM Trees : Underlying storage structure.

Concurrency Control in Paimon

Paimon supports optimistic concurrency control for multiple concurrent write jobs. Each job writes at its own pace and creates a new snapshot upon commit by applying incremental files.

Merge Engine Options for Primary‑Key Tables

deduplicate (default) : Keeps only the latest record for a given primary key.

first-row : Retains the first record encountered for a primary key.

aggregation : Aggregates non‑key columns using user‑specified functions (default is last_non_null_value).

partial-update : Allows incremental updates where null values do not overwrite existing data.
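The four engines can be contrasted with a small sketch that merges records sharing a primary key, in arrival order. This is illustrative only (Paimon applies merging during compaction and reads, and aggregation is configured per field; last_non_null_value is shown here as the default):

```python
def merge(records, engine):
    """Merge same-key records (dicts of non-key columns) under a merge engine."""
    result = None
    for rec in records:
        if result is None:
            result = dict(rec)
        elif engine == "deduplicate":
            result = dict(rec)     # the latest record wins outright
        elif engine == "first-row":
            pass                   # the first record wins; later ones are ignored
        elif engine in ("partial-update", "aggregation"):
            # partial-update: non-null values overwrite, nulls leave data intact;
            # aggregation with the default last_non_null_value behaves the same.
            for col, val in rec.items():
                if val is not None:
                    result[col] = val
    return result

records = [{"a": 1, "b": 2}, {"a": None, "b": 9}]
assert merge(records, "deduplicate") == {"a": None, "b": 9}
assert merge(records, "first-row") == {"a": 1, "b": 2}
assert merge(records, "partial-update") == {"a": 1, "b": 9}   # null did not erase a
```

In a real table, aggregation would typically use functions like sum or max per column rather than the last-non-null default sketched here.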

Handling Out‑of‑Order Data

By default, Paimon assumes input order is the merge order. To handle out‑of‑order streams, set the 'sequence.field' = '<column-name>' property; the column must be of type TINYINT, SMALLINT, INTEGER, BIGINT, TIMESTAMP, or TIMESTAMP_LTZ.
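The effect of a sequence field can be sketched as follows: instead of letting the last-arriving record win, the record with the largest sequence value wins, regardless of arrival order (illustrative model, not Paimon internals):

```python
def merge_with_sequence(records, seq_field):
    """Pick the winning record for one primary key by largest sequence value."""
    winner = None
    for rec in records:
        # >= keeps the later arrival on ties, matching last-write-wins semantics.
        if winner is None or rec[seq_field] >= winner[seq_field]:
            winner = rec
    return winner

# An out-of-order stream: the update with ts=200 arrives before ts=100.
records = [{"v": "new", "ts": 200}, {"v": "old", "ts": 100}]
assert merge_with_sequence(records, "ts")["v"] == "new"   # stale record loses
```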

Compaction Strategies

Asynchronous Compaction : Configurable via parameters such as num-sorted-run.stop-trigger, sort-spill-threshold, and lookup-wait. Allows high write throughput while compaction runs in the background.

Full Compaction : Uses universal compaction; can be triggered periodically with compaction.optimization-interval or after a certain number of delta commits.

Example settings for asynchronous compaction:

num-sorted-run.stop-trigger = 2147483647
sort-spill-threshold = 10
lookup-wait = false

Options for triggering full compaction:

compaction.optimization-interval: interval for periodic full compaction
full-compaction.delta-commits: trigger full compaction after this many delta commits

Compaction balances write amplification and read amplification; frequent compaction reduces read amplification but increases write cost, and vice versa.
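The trade-off can be made concrete with a sketch: compaction fires once the number of sorted runs exceeds a trigger, loosely modeling num-sorted-run.stop-trigger (greatly simplified; Paimon's universal compaction is more nuanced):

```python
def maybe_compact(sorted_runs, stop_trigger):
    """Merge all runs into one when the run count exceeds the trigger.

    Each run is a dict of key -> value; newer runs appear later in the list.
    """
    if len(sorted_runs) <= stop_trigger:
        return sorted_runs          # cheap writes now, more runs to merge on read
    merged = {}
    for run in sorted_runs:         # apply oldest first so newer runs overwrite
        merged.update(run)
    return [merged]                 # one run: cheap reads, write cost paid up front

runs = [{"a": 1}, {"b": 2}, {"a": 3}]
assert maybe_compact(runs, stop_trigger=5) == runs                  # below trigger
assert maybe_compact(runs, stop_trigger=2) == [{"a": 3, "b": 2}]    # merged, newest wins
```

A high trigger (like the 2147483647 above) effectively never blocks writers on compaction, maximizing write throughput at the cost of read amplification until background compaction catches up.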

Consumption Methods for Paimon Tables

Streaming consumption (from a specific offset or consumer ID).

Batch consumption, including time‑travel queries via scan.timestamp-millis hint and incremental queries between snapshots via incremental-between hint.

Tag Functionality

Tags label specific table versions. The creation mode can be set with tag.automatic-creation (process-time, watermark, or batch), and the creation period can be daily, hourly, or every two hours. Tags can be expired automatically using tag.num-retained-max or tag.default-time-retained.

Data Cleanup Strategies

Adjust snapshot file expiration to remove obsolete data.

Set partition expiration time for time‑based partitions.

Manually clean up abandoned temporary files left by failed jobs.

Paimon System Tables

Metadata system tables:

Snapshots: Information about each snapshot file.

Schemas: Current and historical table schemas.

Options: Table configuration parameters.

Partitions: Partition list, row counts, and file sizes.

Files: Details of data files referenced by snapshots.

Tags: Information about each tag and its associated snapshot.

Special consumption system tables:

Read‑Optimized: Provides data without in‑memory merging, improving query speed at the cost of freshness.

Audit Log: Records insert and delete operations for each row.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Big Data, Flink, LSM-Tree, Streaming, Spark, Apache Paimon
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
