Fundamental Concepts and File Layout of Paimon: Snapshots, Partitions, Buckets, Consistency, and Compaction
This article explains Paimon's core concepts—including snapshots, partitions, buckets, consistency guarantees, file layout, LSM‑tree organization, and compaction strategies—while also covering table management tasks such as snapshot expiration, rollback, partition expiration, and small‑file mitigation techniques.
Basic Concepts
Snapshot: A snapshot captures the state of a table at a specific point in time, allowing users to read the latest data or, via time travel, access earlier states.
Partition: Paimon adopts the same partition concept as Apache Hive. Partitions are optional and can be defined on columns such as date, city, or department. Each table may have one or more partition keys, which must be a subset of the primary key if a primary key is defined.
Bucket: Within a partition (or an unpartitioned table), data can be further divided into buckets to provide additional structure for more efficient queries. A record's bucket is determined by hashing one or more columns specified by bucket-key; if bucket-key is omitted, the primary key (or, for tables without one, the whole record) is used. A bucket is the smallest unit for reading and writing, so the number of buckets caps the achievable parallelism; keeping each bucket to roughly 1 GB of data is recommended.
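A minimal Flink SQL sketch tying the three concepts together (the table, columns, and values are hypothetical):

CREATE TABLE orders (
    order_id BIGINT,
    city STRING,
    amount DECIMAL(10, 2),
    dt STRING,
    PRIMARY KEY (dt, order_id) NOT ENFORCED -- the partition key dt must be part of the primary key
) PARTITIONED BY (dt) WITH (
    'bucket' = '4' -- four buckets per partition; with no bucket-key set, rows are hashed by the primary key
);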
Consistency Guarantees
Paimon writers use a two‑phase commit protocol, generating up to two snapshots per commit.
Concurrent writers that modify different buckets commit serializably; if they modify the same bucket, only snapshot isolation is guaranteed: no changes are lost, but the final table state may interleave the two commits.
File Layout
All files of a table reside under a base directory and are organized hierarchically. The layout includes:
Snapshot files: JSON files stored in the snapshot directory, containing the schema in use and a manifest list of all changes.
Manifest files: Stored in the manifest directory; a manifest list records the names of manifest files, and each manifest file records information about LSM data files and changelog files.
Data files: Grouped by partition and bucket; each bucket directory contains an LSM tree and its changelog files. Supported formats are ORC (default), Parquet, and Avro.
LSM trees: Paimon uses Log-Structured Merge trees. Data files are organized into Sorted Runs, each consisting of one or more data files with non-overlapping primary-key ranges. Queries must merge all Sorted Runs, resolving duplicate keys based on record timestamps and the chosen merge engine.
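Putting the layout together, an illustrative directory tree for the hypothetical orders table with one partition and one bucket (the warehouse path, partition value, and file names are made up; real file names contain generated identifiers):

warehouse/
└── default.db/
    └── orders/
        ├── snapshot/      (snapshot-1, snapshot-2, … plus LATEST and EARLIEST hint files)
        ├── manifest/      (manifest lists and manifest files)
        └── dt=20240101/
            └── bucket-0/  (LSM data files such as data-….orc, plus changelog files)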
Compaction
As more records are written, the number of Sorted Runs grows, degrading query performance and potentially exhausting memory.
Compaction periodically merges multiple Sorted Runs into a larger one, similar to RocksDB's universal compaction strategy.
Compaction is resource‑intensive; overly frequent compaction can slow writes. Paimon triggers compaction automatically during writes, but users can run dedicated compaction jobs.
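For example, Paimon 0.5 ships a dedicated compact action in the same jar used for the snapshot-rollback example below; a sketch (fill in the angle-bracket placeholders, and check your version's docs for the exact flags):

<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-0.5-SNAPSHOT.jar \
compact \
--warehouse <warehouse-path> \
--database <database-name> \
--table <table-name> \
[--partition <partition-name>] \
[--catalog-conf <paimon-catalog-conf> [--catalog-conf <paimon-catalog-conf> …]]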
Table Management
Managing Snapshots
Snapshot expiration: Each writer commit creates one or two snapshots. Old snapshots are retained for time travel but are removed once they expire, freeing disk space.
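Retention is controlled by table options; a sketch against the hypothetical orders table (the option names come from Paimon's configuration, the values here are illustrative):

ALTER TABLE orders SET (
    'snapshot.time-retained' = '24 h',  -- how long snapshots stay readable for time travel
    'snapshot.num-retained.min' = '10', -- never expire below this many snapshots
    'snapshot.num-retained.max' = '200' -- expire the oldest snapshots beyond this count
);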
Snapshot rollback (example command):
<FLINK_HOME>/bin/flink run \
/path/to/paimon-flink-action-0.5-SNAPSHOT.jar \
rollback-to \
--warehouse <warehouse-path> \
--database <database-name> \
--table <table-name> \
--snapshot <snapshot-id> \
[--catalog-conf <paimon-catalog-conf> [--catalog-conf <paimon-catalog-conf> …]]
Managing Partitions
When creating a partitioned table, you can set partition.expiration-time. Paimon periodically checks and deletes expired partitions.
Example DDL:
CREATE TABLE T (…) PARTITIONED BY (dt) WITH (
'partition.expiration-time' = '7 d',
'partition.expiration-check-interval' = '1 d',
'partition.timestamp-formatter' = 'yyyyMMdd'
);
Managing Small Files
A Paimon writer produces one or two snapshots per Flink checkpoint, so a short checkpoint interval creates many small files.
When the writer's in-memory buffer fills up, data is also flushed to DFS as small files; enabling write-buffer-spillable lets the buffer spill to local disk first so that larger files are written to DFS.
Reducing snapshot retention time or configuring full‑compaction can mitigate small‑file proliferation.
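Two of the knobs above, sketched in the Flink SQL client against the hypothetical orders table (the values are illustrative):

SET 'execution.checkpointing.interval' = '2 min'; -- a longer checkpoint interval means fewer snapshots and files
ALTER TABLE orders SET ('write-buffer-spillable' = 'true'); -- spill full buffers locally, write larger files to DFS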
For primary-key tables, the number of Sorted Runs per bucket is controlled by num-sorted-run.compaction-trigger (default 5): compaction starts once a bucket accumulates at least that many Sorted Runs.
Append-only tables also compact small files automatically, but because compaction never crosses buckets, append-only tables with many buckets can still retain many small files.
A full compaction can be scheduled by setting the table property full-compaction.delta-commits, which forces a full compaction after the configured number of delta commits.
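As a sketch (the commit count is arbitrary):

ALTER TABLE orders SET (
    'full-compaction.delta-commits' = '10' -- force a full compaction after every 10 commits
);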