Understanding Time Travel and Snapshot Retention in Lake Frameworks (Hudi & Paimon)
This article explains how lake frameworks like Hudi and Paimon implement Time Travel by recording older data versions, the snapshot retention policies that limit historical data access, and practical recommendations for managing snapshots and consumption patterns to reduce storage costs in large‑scale data warehouses.
This short article mainly discusses Time Travel in lake frameworks.
Time Travel refers to the framework recording older versions of data, allowing users to query data at a specific point in time, a capability supported by all lake frameworks.
We illustrate with Hudi and Paimon.
In Hudi, each DML operation generates a timeline instant commit file that records which files were modified, enabling retrieval of incremental or snapshot data for a given instant timestamp.
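As a rough illustration (not from the original article), the Hudi Flink connector exposes this timeline through read options such as read.start-commit and read.end-commit; the table name and instant timestamps below are placeholders.

```sql
-- Hypothetical Hudi table: read only the data committed between two timeline instants
-- (a bounded incremental read in Flink batch mode).
SELECT *
FROM ods_orders_hudi
/*+ OPTIONS(
      'read.start-commit' = '20240101000000',  -- earliest instant to include (yyyyMMddHHmmss)
      'read.end-commit'   = '20240101120000'   -- latest instant to include
) */;
-- Older Flink versions may additionally need: SET 'table.dynamic-table-options.enabled' = 'true';
```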
Similarly, Paimon creates a snapshot for each commit or data write, recording the data files of the current version.
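Similarly, as a sketch (not from the original article), Paimon lets a Flink SQL query pin a specific snapshot or point in time through dynamic scan options; the table name, snapshot id, and timestamp are illustrative.

```sql
-- Read the hypothetical table exactly as it looked at snapshot 5 ...
SELECT * FROM ods_orders_paimon /*+ OPTIONS('scan.snapshot-id' = '5') */;

-- ... or as of a wall-clock point in time (epoch milliseconds).
SELECT * FROM ods_orders_paimon /*+ OPTIONS('scan.timestamp-millis' = '1704067200000') */;
```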
Both frameworks enforce strict retention periods for snapshot files.
Example in Hudi:
clean.retain_commits is the number of commits (i.e., writes into the Hudi table) that the cleaner retains.

The checkpoint interval is the time between successive Flink checkpoints, each of which triggers one Hudi commit from the writer.

When a record version's _hoodie_commit_time is older than clean.retain_commits * checkpoint interval, that version is cleaned up.

For example, with a checkpoint interval of 5 minutes and clean.retain_commits set to 10, you can query at most 50 minutes into the past via _hoodie_commit_time; anything older is no longer accessible.
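A minimal Flink SQL sketch of the arithmetic above, assuming the Hudi Flink connector; the table schema, path, and values are illustrative, and only the retention-related options are shown.

```sql
-- Each 5-minute checkpoint makes the Flink writer commit once to Hudi.
SET 'execution.checkpointing.interval' = '5min';

CREATE TABLE ods_orders_hudi (
  id      BIGINT,
  amount  DECIMAL(10, 2),
  ts      TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'connector'            = 'hudi',
  'path'                 = 'hdfs:///warehouse/ods_orders_hudi',   -- illustrative path
  'table.type'           = 'MERGE_ON_READ',
  'clean.retain_commits' = '10'   -- keep 10 commits: 10 * 5 min = 50 min of history
);
```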
In Paimon:

Snapshots are retained for 1 hour by default, governed by Paimon's configuration option snapshot.time-retained.

Once a snapshot outlives this retention window it is cleaned up to save storage space. After a snapshot has been expired or reclaimed, the historical data it referenced can no longer be accessed through time travel.

If you need a longer time-travel window, increase snapshot.time-retained. For example, snapshot.time-retained=24h keeps snapshots for 24 hours, allowing data to be replayed over a longer period (a configuration sketch follows below).

Why are retention windows kept so short? Historical versions quickly inflate storage usage and cost, so the feature's real value lies in scenarios such as a unified ODS layer serving both batch and streaming workloads in a data warehouse.
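A corresponding sketch for Paimon, assuming a Flink SQL session with a Paimon catalog (so no connector option is needed); the table name and schema are placeholders.

```sql
-- Extend the time-travel window from the 1h default to 24h at table creation ...
CREATE TABLE ods_orders_paimon (
  id      BIGINT,
  amount  DECIMAL(10, 2),
  ts      TIMESTAMP(3),
  PRIMARY KEY (id) NOT ENFORCED
) WITH (
  'snapshot.time-retained'    = '24h',  -- snapshots younger than 24h are kept
  'snapshot.num-retained.min' = '10'    -- and never fewer than 10 snapshots
);

-- ... or change it in place for an existing table.
ALTER TABLE ods_orders_paimon SET ('snapshot.time-retained' = '24h');
```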
Currently, most companies have not deployed this capability at scale; it remains largely theoretical, especially for enterprise‑level real‑time/offline unified storage.
To make it work, two actions are needed:
Effectively manage snapshot/commit files, especially for long-term retention (one possible approach is sketched below).
Define clear offline and real‑time consumption standards.
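On the first point, one possible approach (an assumption of this write-up, not something prescribed by the article) is to keep the regular snapshot window short while pinning periodic versions for long-term access. The sketch below uses Paimon's tag options, with an illustrative table and schedule.

```sql
-- Keep working snapshots short-lived, but automatically pin one tagged version per day.
ALTER TABLE ods_orders_paimon SET (
  'snapshot.time-retained'  = '2h',            -- short window for routine snapshots
  'tag.automatic-creation'  = 'process-time',  -- create tags on a processing-time schedule
  'tag.creation-period'     = 'daily',         -- one pinned version per day
  'tag.num-retained-max'    = '90'             -- keep roughly three months of daily versions
);

-- Tagged versions stay readable after the underlying snapshots expire.
SELECT * FROM ods_orders_paimon /*+ OPTIONS('scan.tag-name' = '2024-01-01') */;
```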
Implementing these steps can dramatically lower data‑warehouse storage costs, which for medium to large companies can run into hundreds of millions.
Such expertise is scarce; few professionals understand both the frameworks themselves and the underlying business and development requirements in depth.
Key considerations include:
Choosing a snapshot retention strategy that matches business scenarios to control storage cost.
Handling long‑cycle complex version management, as frequent file merges, clean‑ups, and compactions increase management complexity and background job load.
Selecting the appropriate lake foundation (e.g., Hudi, Paimon, Iceberg) based on performance and cost trade‑offs.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
