Ensuring Timeliness and Consistency in Apache Paimon: Snapshots, Expiration, and Optimization Strategies
This article explains how Apache Paimon guarantees data timeliness and consistency through snapshot files, two‑phase commit, and configurable expiration policies, and it outlines practical optimization and cleanup techniques for maintaining efficient storage and query performance.
This article is part of the Apache Paimon interview preparation series (advanced edition) and introduces key mechanisms that ensure the system’s timeliness and consistency.
Paimon's Timeliness and Consistency Guarantees
Timeliness is achieved via snapshot files, which serve as entry points for reading table data at specific points in time. Consumers can query the state of a table as of the moment a snapshot was created, which enables time‑travel queries and lets a streaming job adjust the position it reads from.
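How a point-in-time read resolves to a snapshot can be illustrated with a minimal Python sketch. It is a toy model, not Paimon's API: the `Snapshot` class, its fields, and `snapshot_as_of` are hypothetical names, and the rule assumed here is that a time-travel read uses the latest snapshot committed at or before the requested timestamp.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    commit_time_ms: int  # commit timestamp in epoch milliseconds (hypothetical field)


def snapshot_as_of(snapshots: List[Snapshot], timestamp_ms: int) -> Optional[Snapshot]:
    """Return the latest snapshot committed at or before timestamp_ms.

    `snapshots` must be sorted by commit time in ascending order.
    Returns None when the timestamp predates every snapshot.
    """
    commit_times = [s.commit_time_ms for s in snapshots]
    # bisect_right counts the snapshots whose commit time is <= timestamp_ms.
    count = bisect_right(commit_times, timestamp_ms)
    return snapshots[count - 1] if count > 0 else None


# Three snapshots committed one second apart.
history = [Snapshot(1, 1_000), Snapshot(2, 2_000), Snapshot(3, 3_000)]
```

With this history, a read as of 2,500 ms resolves to snapshot 2, and a read before 1,000 ms finds no snapshot at all.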
How timeliness is ensured
When data is written, the Paimon writer buffers it in memory and temporary files. After a Flink checkpoint is created, the temporary files are committed, producing a snapshot file. Downstream stream consumers monitor the snapshot list and read new snapshots when they appear. Consequently, the timeliness of Paimon depends on the snapshot generation frequency, which is tied to the Flink checkpoint interval. It is recommended to set the checkpoint interval between 1 and 10 minutes, balancing freshness and job efficiency.
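The link between checkpoint interval and freshness can be made concrete with a small Python model. This is a simplification and not Flink or Paimon code: it assumes checkpoints complete exactly at multiples of the interval and ignores checkpoint duration and consumer polling delay. The function names are hypothetical.

```python
import math


def visible_at(write_time_s: float, checkpoint_interval_s: float) -> float:
    """Time at which a record written at write_time_s becomes readable.

    Assumes a snapshot is committed at every multiple of the checkpoint
    interval, so the record waits for the first checkpoint at or after
    the moment it was written.
    """
    if checkpoint_interval_s <= 0:
        raise ValueError("checkpoint interval must be positive")
    checkpoints_elapsed = math.ceil(write_time_s / checkpoint_interval_s)
    return checkpoints_elapsed * checkpoint_interval_s


def visibility_delay(write_time_s: float, checkpoint_interval_s: float) -> float:
    """How long the record stays invisible to downstream consumers."""
    return visible_at(write_time_s, checkpoint_interval_s) - write_time_s
```

Under these assumptions the delay is always below one checkpoint interval: a record written 1 second into a 60-second interval waits 59 seconds, while the same record under a 10-minute interval waits 599 seconds.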
How consistency is ensured
Paimon uses a two‑phase commit protocol to atomically commit data. For two Flink jobs writing to the same table:
If they modify different buckets, commits can proceed concurrently, providing sequential consistency.
If both jobs modify the same bucket (strongly discouraged), Paimon resolves the conflict through job failover and guarantees only snapshot isolation. The final table state may be a mix of both jobs' results, but no data is lost.
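The distinction between the two cases comes down to whether the jobs' write sets overlap. The following Python sketch models only that check. It is an illustration under assumptions, not Paimon's commit logic: `PendingCommit`, `conflicting_buckets`, and `can_commit_concurrently` are hypothetical names, and a bucket is identified here by a (partition, bucket number) pair.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

BucketKey = Tuple[str, int]  # (partition, bucket number), a modeling assumption


@dataclass(frozen=True)
class PendingCommit:
    job: str
    touched: FrozenSet[BucketKey]  # buckets this commit modifies


def conflicting_buckets(a: PendingCommit, b: PendingCommit) -> FrozenSet[BucketKey]:
    """Buckets modified by both commits."""
    return a.touched & b.touched


def can_commit_concurrently(a: PendingCommit, b: PendingCommit) -> bool:
    """True when the two commits touch disjoint sets of buckets."""
    return not conflicting_buckets(a, b)


job_a = PendingCommit("job-a", frozenset({("dt=2024-01-01", 0), ("dt=2024-01-01", 1)}))
job_b = PendingCommit("job-b", frozenset({("dt=2024-01-01", 2)}))
job_c = PendingCommit("job-c", frozenset({("dt=2024-01-01", 1)}))
```

Here `job_a` and `job_b` write to disjoint buckets and fall into the first case, while `job_a` and `job_c` both touch bucket 1 of the same partition and fall into the discouraged second case.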
Detailed Explanation of Paimon's Snapshots System Table
The Snapshots system table records snapshot metadata, which is essential for version control, time‑travel, and state management in streaming jobs.
Snapshots can be queried directly via SQL, for example:
SELECT * FROM mycatalog.mydb.`mytbl$snapshots`;
Common columns of the Snapshots table are illustrated in a diagram in the original article.
How Paimon Cleans Up Expired Data and Why It Matters
Large volumes of data generate many snapshots and old file versions, consuming storage and degrading query performance. Paimon provides three main ways to control data expiration:
Adjust the expiration time of snapshot files, either by retaining a specific number of snapshots or by keeping snapshots for a defined duration.
For partitioned tables, expiration can follow a Hive‑style policy in which a partition expires once its partition time is older than a configured threshold. The three table parameters that control partition expiration are shown in a diagram in the original article.
Manually remove orphan files that are not referenced by any snapshot. This can be done with a SQL call, for example:
CALL `<catalog-name>`.sys.remove_orphan_files('<database-name>.<table-name>');
In practice, scheduled cleanup jobs can be added to a Paimon cluster to automatically delete such orphan files.
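The first two policies can be sketched in Python to show how the retention rules interact. This is a simplified model and not Paimon's implementation. The function and parameter names are hypothetical, and the assumed rules are: snapshots expire oldest first, at least a minimum number are always kept, at most a maximum number are kept, and within those bounds a snapshot expires once it is older than the retention time. Partition time is assumed to be parsed from the partition value with a Python `strptime` format.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def expired_snapshots(
    snapshots: List[Tuple[int, int]],  # (snapshot id, commit time in ms)
    now_ms: int,
    time_retained_ms: int,
    num_retained_min: int,
    num_retained_max: int,
) -> List[int]:
    """Return the ids of snapshots to expire, oldest first."""
    if not 1 <= num_retained_min <= num_retained_max:
        raise ValueError("need 1 <= num_retained_min <= num_retained_max")
    ordered = sorted(snapshots)  # ascending by snapshot id
    expired = []
    for position, (snapshot_id, commit_time_ms) in enumerate(ordered):
        # Number of snapshots left if this one and all newer ones are kept.
        remaining = len(ordered) - position
        if remaining <= num_retained_min:
            break  # the minimum count is always retained
        too_many = remaining > num_retained_max
        too_old = now_ms - commit_time_ms > time_retained_ms
        if not (too_many or too_old):
            break  # newer snapshots are neither too old nor over the cap
        expired.append(snapshot_id)
    return expired


def expired_partitions(
    partition_values: List[str],
    now: datetime,
    expiration_time: timedelta,
    formatter: str = "%Y-%m-%d",  # Python strptime format, an assumption
) -> List[str]:
    """Return partitions whose partition time is older than now - expiration_time."""
    cutoff = now - expiration_time
    return [p for p in partition_values if datetime.strptime(p, formatter) < cutoff]


MINUTE = 60_000
snaps = [(1, 0), (2, 10 * MINUTE), (3, 20 * MINUTE), (4, 30 * MINUTE), (5, 40 * MINUTE)]
```

For example, with a 60-minute retention time evaluated at minute 100 and a minimum of two snapshots, snapshots 1 to 3 expire and snapshots 4 and 5 are kept even though snapshot 4 is older than the retention time.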
Common Optimization Strategies for Paimon
For a concise list of performance‑tuning tips, refer to the article "Paimon Performance Optimization Summary" (link provided in the original source).
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
