Optimizing Primary‑Key and Append‑Scalable Tables in Paimon with Flink
This guide explains how to improve both write and read performance for Paimon primary‑key and Append‑Scalable tables in Flink: tuning sink and source parallelism, adjusting checkpoint intervals, making small‑file merges fully asynchronous, changing the file format, and applying data‑ordering strategies.
Paimon write‑job bottlenecks are often caused by small‑file merges. By default, Flink checkpoints wait for these merges to finish, which can lead to back‑pressure and reduced job efficiency.
Optimization tips include:
Adjust Paimon sink parallelism via the sink.parallelism SQL hint.
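As a sketch, the sink parallelism can be set per statement with Flink's dynamic table options hint. The table names and the value '4' here are illustrative placeholders:

```sql
-- Set sink parallelism for this INSERT only, without changing the table definition.
-- 'paimon_sink' and 'kafka_source' are hypothetical table names.
INSERT INTO paimon_sink /*+ OPTIONS('sink.parallelism' = '4') */
SELECT * FROM kafka_source;
```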
Modify Flink checkpoint settings: increase execution.checkpointing.interval and set execution.checkpointing.max-concurrent-checkpoints to 3, choosing values within the end‑to‑end latency your business can tolerate.
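In the Flink SQL client these settings can be applied per session; the interval below is an illustrative value, not a recommendation:

```sql
-- A longer interval reduces checkpoint pressure from small-file merges;
-- allowing concurrent checkpoints keeps later ones from queuing behind a slow merge.
SET 'execution.checkpointing.interval' = '3min';
SET 'execution.checkpointing.max-concurrent-checkpoints' = '3';
```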
Make small‑file merges fully asynchronous so checkpoints no longer wait for merge completion.
Change table parameters (e.g., 'num-sorted-run.stop-trigger' = '2147483647', 'sort-spill-threshold' = '10', 'changelog-producer.lookup-wait' = 'false') via ALTER TABLE or SQL hints.
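The three parameters above can be applied persistently with ALTER TABLE, for example (assuming a hypothetical table named my_table):

```sql
-- Effectively disable the stop trigger so writers never block on sorted runs,
-- spill sorting to disk past 10 runs, and stop waiting for lookup changelog work.
ALTER TABLE my_table SET (
  'num-sorted-run.stop-trigger' = '2147483647',
  'sort-spill-threshold' = '10',
  'changelog-producer.lookup-wait' = 'false'
);
```

The same keys can instead be passed ad hoc through a `/*+ OPTIONS(...) */` hint on a single statement.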
If OLAP queries are not needed, switch the file format to Avro and disable statistics collection with 'file.format' = 'avro', 'metadata.stats-mode' = 'none' to boost write efficiency.
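A minimal sketch of that change, again assuming a table named my_table:

```sql
-- Avro is row-oriented, which favors write throughput over columnar scan speed;
-- disabling stats collection removes per-file statistics overhead on write.
ALTER TABLE my_table SET (
  'file.format' = 'avro',
  'metadata.stats-mode' = 'none'
);
```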
For consumption jobs, adjust Paimon source parallelism using the scan.parallelism hint, and consider reading from the Read‑Optimized system table to avoid small‑file merge overhead.
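Both read-side techniques can be sketched as follows; the table name and parallelism value are placeholders:

```sql
-- Raise source parallelism for this query only.
SELECT * FROM my_table /*+ OPTIONS('scan.parallelism' = '8') */;

-- Read the Read-Optimized system table: it scans only fully compacted files,
-- skipping merge-on-read, at the cost of possibly staler data.
SELECT * FROM `my_table$ro`;
```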
Append‑Scalable tables have additional considerations:
Adjust sink parallelism as above; watch for data skew, and if it appears, set the sink's parallelism independently of the upstream operator's.
For read jobs, increase source parallelism and sort data using Z‑order, Hilbert, or explicit order strategies to improve batch or OLAP query performance.
Example command to compact and order data:
<FLINK_HOME>/bin/flink run \
-D execution.runtime-mode=batch \
/path/to/paimon-flink-action-0.8.2.jar \
compact \
--warehouse <warehouse-path> \
--database <database-name> \
--table <table-name> \
--order_strategy <orderType> \
--order_by <col1,col2,...> \
[--partition <partition-name>] \
[--catalog_conf <paimon-catalog-conf> ...] \
[--table_conf <paimon-table-dynamic-conf> ...]

Balancing write‑side and read‑side performance by tuning these parameters helps achieve efficient data ingestion and query execution in Paimon‑backed Flink pipelines.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
