Paimon 1.0 Lookup Performance Optimization and PFile File Format Overview

An overview of Paimon 1.0’s milestone improvements, focusing on the optimized local Lookup performance, the new sort‑lookup‑store based PFile key‑value format, its four‑part structure, and detailed write and read procedures that enhance large‑scale dimension table joins.

Big Data Technology & Architecture

Feb 18, 2025

Hi everyone, we are back.

We have been posting less often in 2025. Our attention has gone to large‑model topics and personal health, so the pace is lighter than in previous years.

Future updates will stay relaxed and aim at simple, introductory content for a broader audience. Our main effort will go into an advanced big‑data class and into the integration of large models with data.

Today we introduce the new features of Paimon 1.0.

Paimon 1.0 is a milestone stable release whose kernel optimizations are the primary focus; among its capability upgrades, the most important is the Lookup performance optimization.

Local Lookup is the foundational capability for point lookups against Paimon’s LSM structure, and the following features are built on it:

lookup changelog‑producer: point‑lookup of historical files to generate changelogs.

Primary‑key table deletion‑vectors mode: point‑lookup of historical files to generate deletion vectors.

Flink Lookup Join: when the join condition is the primary key of a dimension table, local Lookup is used.

Previous versions used HashFile for Lookup, which had two drawbacks:

Generating a HashFile required multiple disk copies during write.

The compression ratio of HashFile was poor.

Paimon currently supports the Avro, ORC, and Parquet file formats. In earlier versions, when a table was used for a Lookup join, Paimon converted the stored data to a key‑value format on the fly.

This on‑the‑fly conversion works fine for tens of gigabytes of data, but performance degrades noticeably for very large dimension tables. The new version therefore introduces a key‑value file format based on sort‑lookup‑store. Paimon can download these key‑value files and retrieve data from them by primary‑key lookup.

The Paimon community names this file format PFile, which is used only for primary‑key tables. Configuring it on an append‑only table will cause Paimon to throw an exception. Below is the structure of a PFile.

A PFile consists of four parts: Data Index, Leaf Index, Level Block Index, and Trailer.

How are these files read and written?

Writing a PFile follows these steps:

Write key‑value records into a data block until the block is full.

Compress the bytes and flush them to storage.

Generate a block index and write it into the data index block.
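The three write steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Paimon's actual PFile writer: the function name `write_sorted_kv_file`, the record encoding, the use of zlib, and the footer layout are all hypothetical choices made for the example.

```python
import struct
import zlib

FOOTER = struct.Struct(">QI")  # index offset, index length (hypothetical layout)


def write_sorted_kv_file(records, block_size=64):
    """Serialize sorted (key, value) byte pairs into blocks, an index, and a footer."""
    out = bytearray()
    index = []           # one entry per flushed block: (last_key, offset, length)
    block = bytearray()  # uncompressed bytes of the block being filled
    last_key = None

    def flush():
        if not block:
            return
        compressed = zlib.compress(bytes(block))             # step 2: compress
        index.append((last_key, len(out), len(compressed)))  # step 3: index entry
        out.extend(compressed)                               # step 2: flush
        block.clear()

    for key, value in records:
        if last_key is not None and key <= last_key:
            raise ValueError("records must be sorted by unique key")
        # Length-prefixed record: 2-byte key length, 4-byte value length.
        block.extend(struct.pack(">HI", len(key), len(value)) + key + value)
        last_key = key
        if len(block) >= block_size:  # step 1: the block is full
            flush()
    flush()  # flush the final, partially filled block

    # Write the collected block index after the data, then a fixed-size footer.
    index_offset = len(out)
    for key, offset, length in index:
        out.extend(struct.pack(">HQI", len(key), offset, length) + key)
    out.extend(FOOTER.pack(index_offset, len(out) - index_offset))
    return bytes(out), index
```

Storing the last key of each block in the index is one common design choice: because the records are sorted, a reader can binary-search the index to find the single block that could hold a given key.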

Reading a PFile proceeds as follows:

First open the file and read the footer. From the footer, locate and read the block index and the Bloom filter. Then use them to find the data block that may contain the given primary key, and finally read the record from that block.
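The read path described above can be sketched as follows. This is a toy model, not Paimon's reader: the JSON block encoding, the three-hash Bloom filter, and the names `build_file`, `lookup`, and `bloom_positions` are assumptions made for illustration. The small `build_file` helper exists only so that the reader has something to read.

```python
import bisect
import hashlib
import json
import struct
import zlib

BLOOM_BITS = 1024
FOOTER = struct.Struct(">QI")  # metadata offset, metadata length (hypothetical)


def bloom_positions(key):
    """Derive three bit positions for a key from one SHA-256 digest."""
    digest = hashlib.sha256(key.encode()).digest()
    return [int.from_bytes(digest[i:i + 4], "big") % BLOOM_BITS for i in (0, 4, 8)]


def build_file(sorted_items, per_block=4):
    """Toy writer: compressed blocks, then metadata (index + Bloom bits), then footer."""
    body, index, bloom = bytearray(), [], 0
    for start in range(0, len(sorted_items), per_block):
        chunk = sorted_items[start:start + per_block]
        block = zlib.compress(json.dumps(chunk).encode())
        index.append([chunk[-1][0], len(body), len(block)])  # last key, offset, length
        body += block
        for key, _ in chunk:
            for pos in bloom_positions(key):
                bloom |= 1 << pos
    meta = json.dumps({"index": index, "bloom": bloom}).encode()
    return bytes(body) + meta + FOOTER.pack(len(body), len(meta))


def lookup(data, key):
    """Return the value stored for key, or None if the key is absent."""
    # 1. Read the footer, then the block index and Bloom filter it points to.
    meta_offset, meta_len = FOOTER.unpack(data[-FOOTER.size:])
    meta = json.loads(data[meta_offset:meta_offset + meta_len])
    # 2. The Bloom filter can rule a key out without touching any data block.
    if any(not (meta["bloom"] >> pos) & 1 for pos in bloom_positions(key)):
        return None
    # 3. Binary-search the index for the first block whose last key >= key.
    last_keys = [entry[0] for entry in meta["index"]]
    i = bisect.bisect_left(last_keys, key)
    if i == len(last_keys):
        return None
    _, offset, length = meta["index"][i]
    # 4. Decompress only that block and search inside it.
    block = json.loads(zlib.decompress(data[offset:offset + length]))
    keys = [k for k, _ in block]
    j = bisect.bisect_left(keys, key)
    return block[j][1] if j < len(keys) and keys[j] == key else None
```

A lookup touches the footer, the metadata, and at most one data block, which is why a sorted, block-indexed layout suits point lookups on large files.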

For table lookup, the system checks whether the file exists on the local disk, downloads it if necessary, and then retrieves the key‑value pair for the given primary key.
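The check-then-download behavior can be illustrated with a small local cache. This is a sketch under assumptions, not Paimon's code: the class `LocalLookupCache`, its `download` callback, and the JSON file contents are hypothetical stand-ins for the real remote storage and key‑value file format.

```python
import json
from pathlib import Path


class LocalLookupCache:
    """Serve primary-key lookups from local copies of remote key-value files."""

    def __init__(self, local_dir, download):
        self.local_dir = Path(local_dir)
        # Hypothetical callback: download(file_name, target_path) fetches the
        # remote file and writes it to target_path.
        self.download = download
        self.downloads = 0

    def local_file(self, file_name):
        """Return the local path, downloading the file only if it is missing."""
        path = self.local_dir / file_name
        if not path.exists():
            self.local_dir.mkdir(parents=True, exist_ok=True)
            self.download(file_name, path)
            self.downloads += 1
        return path

    def get(self, file_name, key):
        """Look up key in the given file; returns None when the key is absent."""
        with self.local_file(file_name).open() as f:
            return json.load(f).get(key)
```

Repeated lookups against the same file reuse the local copy, so the download cost is paid once per file rather than once per lookup.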

These are the optimizations that make Lookup joins against primary‑key tables efficient.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
