Apache Paimon 0.8 Release: Deletion Vectors, File Index, Performance Boosts, and Flink/Spark Integration Enhancements

The article introduces Apache Paimon 0.8, highlighting new Deletion Vectors, a universal file index, memory and I/O optimizations, record‑level TTL, and integration improvements with Flink and Spark, while also discussing broader lake‑house performance trends and future directions.

Big Data Technology & Architecture

May 13, 2024

Earlier this week we released Apache Paimon 0.8, and this article reviews its core new features and performance improvements.

Key changes in version 0.8 include:

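1. Added Deletion Vectors, enabling near‑real‑time updates together with very fast queries
2. Changed the default bucket value to -1, making Paimon easier to use for newcomers
3. Added a universal file index mechanism to improve OLAP query performance
4. Optimized memory usage and performance of the read and write paths, reducing the number of I/O accesses
5. Added a separate management mechanism for changelog files to extend their lifecycle
6. Added a file‑system‑based privilege system to manage read and write permissions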

Read Performance Improvements

Read Optimization 1

Paimon 0.8 introduces a Deletion Vectors mode, enabled with the table option 'deletion-vectors.enabled' = 'true'. It speeds up reads of primary‑key tables, enabling near‑real‑time updates together with very fast queries.

The mechanism works like selection vectors in vectorized computation: during writes, a vector records which rows in existing files have been deleted, so reads can filter those rows out without a costly merge across files.
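To make the idea concrete, here is a minimal conceptual sketch in Python. It is not Paimon's implementation: the `DataFile` class, the `index` dictionary, and the `upsert` and `scan` helpers are hypothetical names introduced only for illustration, and the last file in the list stands in for the file currently being written.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Hypothetical stand-in for a data file plus its deletion vector."""
    rows: list                                  # records in file order
    deleted: set = field(default_factory=set)   # row positions marked as deleted


def upsert(files, index, key, value):
    """Write path: mark the old row position as deleted, then append the new row."""
    if key in index:
        old_file, old_pos = index[key]
        files[old_file].deleted.add(old_pos)
    files[-1].rows.append((key, value))         # last file is the one being written
    index[key] = (len(files) - 1, len(files[-1].rows) - 1)


def scan(files):
    """Read path: skip deleted positions; no merge across files is required."""
    return [row
            for f in files
            for pos, row in enumerate(f.rows)
            if pos not in f.deleted]


# One existing file with two rows, plus an empty file that receives new writes.
files = [DataFile(rows=[("a", 1), ("b", 2)]), DataFile(rows=[])]
index = {"a": (0, 0), "b": (0, 1)}              # key -> (file number, row position)

upsert(files, index, "a", 10)                   # update key "a"
result = scan(files)
```

The update only touches the deletion vector of the old file, and the reader filters by position instead of comparing keys across files, which is where the read‑side saving comes from in this simplified model.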

Read Optimization 2

A universal file index is now maintained separately, improving OLAP query performance, though it is still being refined.

This index can reduce the need for downstream pipelines such as Kafka→OLAP in scenarios where lake‑table query speed meets reporting requirements.
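As a rough illustration of how a per‑file index can speed up point queries, the sketch below assumes a Bloom‑filter‑style index. The `BloomIndex` class, its sizes, and the file names are hypothetical and do not reflect Paimon's on‑disk format; the point is only that a small index lets the reader skip files that cannot contain the requested value.

```python
import hashlib


class BloomIndex:
    """Tiny Bloom filter used here as a hypothetical per-file index."""

    def __init__(self, size_bits=256, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_index(values):
    index = BloomIndex()
    for value in values:
        index.add(value)
    return index


def files_to_read(file_indexes, wanted):
    """Keep only the files whose index says the value may be present."""
    return [name for name, idx in file_indexes.items() if idx.might_contain(wanted)]


data_files = {"file-1": ["u1", "u2"], "file-2": ["u3", "u4"]}
file_indexes = {name: build_index(values) for name, values in data_files.items()}
candidates = files_to_read(file_indexes, "u3")
```

A Bloom filter never produces false negatives, so `file-2` is always among the candidates; other files may occasionally be read unnecessarily because of false positives, which costs time but not correctness.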

Record‑Level TTL Configuration

Traditional lake tables often only support partition‑level TTL, forcing the use of partitioned tables; Paimon now allows record‑level TTL via the record-level.expire-time setting, similar to HBase.
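The effect of a record‑level TTL can be sketched as a filter on each record's own timestamp. This is a conceptual model only: the `drop_expired` helper, the `ts` field, and the use of epoch seconds are assumptions made for the example, not Paimon's API.

```python
import time


def drop_expired(records, time_field, expire_time_s, now_s=None):
    """Keep records whose age, measured from their own time field, is within the TTL."""
    now_s = time.time() if now_s is None else now_s
    return [r for r in records if now_s - r[time_field] <= expire_time_s]


records = [
    {"id": 1, "ts": 1000},   # 1000 seconds old at now_s=2000: expired
    {"id": 2, "ts": 1900},   # 100 seconds old at now_s=2000: still live
]
live = drop_expired(records, time_field="ts", expire_time_s=600, now_s=2000)
```

Because expiry is decided per record, the table does not need to be partitioned by time just to get data retention.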

Integration Optimizations with Flink and Spark

Lookup Join Optimization

1. Flink Lookup Join in this version uses Hash Lookup to avoid RocksDB insertion overhead.
2. Flink Lookup Join introduces a max_pt mode to join against the latest partition data.
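The two points above can be pictured with a small Python sketch. A plain in‑memory dict stands in for the hash lookup structure, and `latest_partition`, `build_hash_table`, and `lookup_join` are hypothetical helpers; none of this is Flink or Paimon code.

```python
def latest_partition(partitions):
    """Pick the greatest partition value, a stand-in for a max_pt-style choice."""
    return max(partitions)


def build_hash_table(rows, key):
    """Index dimension rows by join key so each lookup is a single hash probe."""
    return {row[key]: row for row in rows}


def lookup_join(stream, table, key):
    """Inner-join each incoming event with the matching dimension row, if any."""
    for event in stream:
        dim = table.get(event[key])
        if dim is not None:
            yield {**dim, **event}


# Dimension table partitioned by day; only the newest partition is loaded.
dim_partitions = {
    "2024-05-11": [{"user_id": 1, "level": "silver"}],
    "2024-05-12": [{"user_id": 1, "level": "gold"}],
}
pt = latest_partition(dim_partitions)
table = build_hash_table(dim_partitions[pt], "user_id")

events = [{"user_id": 1, "amount": 30}, {"user_id": 2, "amount": 5}]
joined = list(lookup_join(events, table, "user_id"))
```

Restricting the join to the latest partition keeps the lookup structure small, which suits dimension tables that are fully refreshed into a new partition on each load.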

The current dimension‑table join capabilities still lag behind some storage systems.

Spark Query Optimization

Spark now supports DELETE and UPDATE on append tables using copy‑on‑write (COW), and Spark DELETE also supports primary‑key tables with all merge engines. Spark DELETE and UPDATE additionally support conditions that contain subqueries. The Spark COMPACT procedure now accepts a where condition.
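Copy‑on‑write deletion on an append table can be modeled as rewriting only the files that contain matching rows. The sketch below is a simplified illustration under that assumption; `cow_delete` and the `.rewritten` file naming are hypothetical and are not how Paimon or Spark name or track files.

```python
def cow_delete(files, predicate):
    """Return a new file map plus the names of the files that were rewritten.

    Files without matching rows are carried over untouched; files with matching
    rows are replaced by a new file that holds only the surviving rows.
    """
    new_files = {}
    rewritten = []
    for name, rows in files.items():
        if any(predicate(r) for r in rows):
            kept = [r for r in rows if not predicate(r)]
            if kept:                                   # drop the file if nothing survives
                new_files[name + ".rewritten"] = kept  # hypothetical naming scheme
            rewritten.append(name)
        else:
            new_files[name] = rows
    return new_files, rewritten


files = {"f1": [1, 2, 3], "f2": [4, 5]}
new_files, rewritten = cow_delete(files, lambda r: r == 2)
```

The original file map is left unchanged, mirroring the fact that existing data files are immutable and a delete produces a new snapshot that references the rewritten files.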

Overall, the lake‑house ecosystem continues to evolve, and we will keep sharing production best practices and future trends.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
