Ensuring Timeliness and Consistency in Apache Paimon: Snapshots, Expiration, and Optimization Strategies

This article explains how Apache Paimon guarantees data timeliness and consistency through snapshot files, two‑phase commit, and configurable expiration policies, and it outlines practical optimization and cleanup techniques for maintaining efficient storage and query performance.

Big Data Technology & Architecture

Jan 6, 2025

This article is part of the Apache Paimon interview preparation series (advanced edition) and introduces key mechanisms that ensure the system’s timeliness and consistency.

Paimon's Timeliness and Consistency Guarantees

Timeliness is achieved via snapshot files, which serve as entry points for reading table data at specific points in time. Consumers can query the state of a table as of the moment a snapshot was created, which enables time‑travel queries and lets a streaming job reset its read position to an earlier snapshot.
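The lookup behind a time‑travel query can be sketched in a few lines of Python. This is a toy model rather than Paimon's implementation: the `Snapshot` class, its field names, and the `snapshot_as_of` helper are assumptions made for illustration only.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    commit_time_ms: int  # commit time in epoch milliseconds (hypothetical field)


def snapshot_as_of(snapshots, timestamp_ms):
    """Return the latest snapshot committed at or before timestamp_ms, or None."""
    ordered = sorted(snapshots, key=lambda s: s.commit_time_ms)
    times = [s.commit_time_ms for s in ordered]
    # bisect_right counts how many snapshots have commit_time_ms <= timestamp_ms
    idx = bisect_right(times, timestamp_ms)
    return ordered[idx - 1] if idx > 0 else None


snapshots = [Snapshot(1, 1_000), Snapshot(2, 2_000), Snapshot(3, 3_000)]
chosen = snapshot_as_of(snapshots, 2_500)
print(chosen.snapshot_id)
```

A timestamp earlier than the first commit yields `None`, which mirrors the fact that no table state exists before the first snapshot.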

How timeliness is ensured

When data is written, the Paimon writer buffers it in memory and temporary files. After a Flink checkpoint is created, the temporary files are committed, producing a snapshot file. Downstream stream consumers monitor the snapshot list and read new snapshots when they appear. Consequently, the timeliness of Paimon depends on the snapshot generation frequency, which is tied to the Flink checkpoint interval. It is recommended to set the checkpoint interval between 1 and 10 minutes, balancing freshness and job efficiency.
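The relationship between checkpoints and visibility can be illustrated with a small Python toy. It is only a sketch: `ToyWriter` and `visible_records` are hypothetical names, and skipping empty checkpoints is an assumption of the toy, not a statement about Paimon's behavior.

```python
class ToyWriter:
    """Buffers records and makes them visible only when a checkpoint commits."""

    def __init__(self):
        self.buffer = []     # records written since the last checkpoint
        self.snapshots = []  # committed snapshots, visible to readers

    def write(self, record):
        self.buffer.append(record)

    def checkpoint(self):
        """Commit buffered records as one snapshot (toy: skip if nothing is buffered)."""
        if not self.buffer:
            return None
        snapshot = {"id": len(self.snapshots) + 1, "records": list(self.buffer)}
        self.snapshots.append(snapshot)
        self.buffer.clear()
        return snapshot


def visible_records(writer):
    """Records a downstream reader can see: only those in committed snapshots."""
    return [r for s in writer.snapshots for r in s["records"]]


writer = ToyWriter()
writer.write("order-1")
writer.write("order-2")
before = visible_records(writer)  # nothing is visible before the checkpoint
writer.checkpoint()
after = visible_records(writer)   # both records become visible together
print(before, after)
```

In this model a record waits at most one checkpoint interval before it becomes readable, which is why a shorter interval improves freshness at the cost of more frequent commits.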

How consistency is ensured

Paimon uses a two‑phase commit protocol to atomically commit data. For two Flink jobs writing to the same table:

If they modify different buckets, their commits can proceed concurrently and are serializable.

If both jobs modify the same bucket (strongly discouraged), Paimon resolves the conflict through job failover and guarantees only snapshot isolation. The final table state may be a mix of both jobs' results, but no data is lost.
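The bucket‑level rule can be expressed as a simple overlap check. The Python below is a hedged sketch: representing a commit as a list of `(partition, bucket, file)` tuples and the helper names are assumptions for illustration, not Paimon's internal data model.

```python
def touched_buckets(commit):
    """Collect the (partition, bucket) pairs that a commit writes to."""
    return {(partition, bucket) for partition, bucket, _file in commit}


def can_commit_in_parallel(commit_a, commit_b):
    """Two commits are independent when they share no (partition, bucket) pair."""
    return touched_buckets(commit_a).isdisjoint(touched_buckets(commit_b))


job_a = [("dt=2025-01-06", 0, "data-a1"), ("dt=2025-01-06", 1, "data-a2")]
job_b = [("dt=2025-01-06", 2, "data-b1")]
job_c = [("dt=2025-01-06", 1, "data-c1")]

print(can_commit_in_parallel(job_a, job_b))  # disjoint buckets
print(can_commit_in_parallel(job_a, job_c))  # both touch bucket 1
```

When the check fails, as for `job_a` and `job_c`, the two writers fall into the conflict case described above, where only snapshot isolation holds.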

Detailed Explanation of Paimon's Snapshots System Table

The Snapshots system table records snapshot metadata, which is essential for version control, time‑travel, and state management in streaming jobs.

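Snapshots can be queried directly via SQL, for example:

SELECT * FROM mycatalog.mydb.`mytbl$snapshots`;

Common columns of the Snapshots table include snapshot_id, schema_id, commit_user, commit_identifier, commit_kind, commit_time, total_record_count, delta_record_count, and changelog_record_count.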

How Paimon Cleans Up Expired Data and Why It Matters

Large volumes of data generate many snapshots and old file versions, consuming storage and degrading query performance. Paimon provides three main ways to control data expiration:

Adjust snapshot expiration, either by bounding the number of retained snapshots (table options snapshot.num-retained.min and snapshot.num-retained.max) or by keeping snapshots for a defined duration (snapshot.time-retained).

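When using partitioned tables, expiration can follow Hive‑style policies: a partition is dropped once the time derived from its partition value is older than a configured threshold. The table options that control partition expiration include partition.expiration-time, partition.expiration-check-interval, and partition.timestamp-formatter.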

Manually remove orphan files that are not referenced by any snapshot. This can be done with a SQL call, for example:

CALL `<catalog-name>`.sys.remove_orphan_files('<database-name>.<table-name>');

In practice, scheduled cleanup jobs can be added to a Paimon cluster to automatically delete such orphan files.
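The three cleanup mechanisms above can be modeled in a few lines of Python. This is a toy under stated assumptions, not Paimon's algorithm: all function names, the `%Y-%m-%d` partition format, and the file names are hypothetical, and the retention rule shown (always keep the newest `num_min` snapshots, never keep more than `num_max`, expire anything older than `time_retained` in between) is a simplified reading of the retention settings.

```python
from datetime import datetime, timedelta


def expired_snapshots(commit_times, now, time_retained, num_min, num_max):
    """commit_times maps snapshot id -> commit datetime. Returns ids to expire."""
    if not 1 <= num_min <= num_max:
        raise ValueError("expected 1 <= num_min <= num_max")
    ids = sorted(commit_times)           # oldest first
    keep_at_least = set(ids[-num_min:])  # newest num_min are always kept
    keep_at_most = set(ids[-num_max:])   # anything beyond the newest num_max goes
    expired = []
    for sid in ids:
        too_many = sid not in keep_at_most
        too_old = now - commit_times[sid] > time_retained
        if sid not in keep_at_least and (too_many or too_old):
            expired.append(sid)
    return expired


def expired_partitions(partitions, now, expiration_time, fmt="%Y-%m-%d"):
    """Partitions whose value, parsed as a date, is older than the threshold."""
    return [p for p in partitions
            if now - datetime.strptime(p, fmt) > expiration_time]


def orphan_files(files_on_disk, snapshot_files, retained_ids):
    """Files on disk that no retained snapshot references."""
    referenced = set()
    for sid in retained_ids:
        referenced |= snapshot_files[sid]
    return sorted(set(files_on_disk) - referenced)


now = datetime(2025, 1, 6, 12, 0)
commit_times = {
    1: now - timedelta(hours=3),
    2: now - timedelta(hours=2),
    3: now - timedelta(minutes=30),
    4: now - timedelta(minutes=5),
}
to_expire = expired_snapshots(commit_times, now, timedelta(hours=1), 1, 3)
old_parts = expired_partitions(["2024-12-01", "2025-01-05"], now, timedelta(days=7))
snapshot_files = {3: {"f2", "f3"}, 4: {"f3", "f4"}}
orphans = orphan_files(["f2", "f3", "f4", "tmp-9"], snapshot_files, [3, 4])
print(to_expire, old_parts, orphans)
```

Here snapshot 1 expires because it falls outside the newest three, snapshot 2 because it is older than one hour, and `tmp-9` is reported as an orphan because neither retained snapshot references it.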

Common Optimization Strategies for Paimon

For a concise list of performance‑tuning tips, refer to the article "Paimon Performance Optimization Summary" (link provided in the original source).


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
