What’s New in Apache Paimon 2025? Core Performance, AI Integration & Real‑Time Lakehouse Updates

The 2025 Apache Paimon release brings major performance gains, AI-centric multimodal storage, deeper streaming-batch integration, and broader engine compatibility. This article covers the query and write optimizations, the memory management changes, and a unified lake format for structured and unstructured data.

Big Data Technology & Architecture

Nov 28, 2025

Core Performance Improvements

Query Performance Enhancements

In the 2025 release, Paimon introduces two key optimizations for dimension-table Lookup Joins. The first is a pure in-memory cache mode that loads dimension data directly into TaskManager memory, eliminating RocksDB serialization overhead. The second is a bucket-based shuffle optimization in which each lookup subtask processes only the data of the buckets relevant to it, reducing memory usage and load time. The shuffle optimization applies when the dimension table is a Fixed Bucket table and the join key contains all bucket keys.
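The bucket-based idea can be illustrated with a small Python sketch. It is a simplified model, not Paimon's implementation: the bucket count, the modulo-based `bucket_of` function, and the helper names are all assumptions made for illustration. The point is only that a subtask responsible for one bucket caches a fraction of the dimension table.

```python
NUM_BUCKETS = 4  # hypothetical fixed bucket count of the dimension table


def bucket_of(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    # Illustrative assignment only; Paimon's real bucket hashing differs.
    return key % num_buckets


def load_cache(dim_rows, assigned_buckets):
    """Build an in-memory lookup cache holding only the assigned buckets."""
    cache = {}
    for key, value in dim_rows:
        if bucket_of(key) in assigned_buckets:
            cache[key] = value
    return cache


def lookup_join(fact_rows, cache):
    """Enrich fact rows from the in-memory cache (left-join semantics)."""
    return [(key, payload, cache.get(key)) for key, payload in fact_rows]


dim_table = [(k, f"user-{k}") for k in range(8)]

# This subtask is responsible for bucket 1 only, so it caches 2 of the 8 rows.
subtask_cache = load_cache(dim_table, assigned_buckets={1})

# Fact rows are shuffled by the same bucket function, so keys 1 and 5 land here.
joined = lookup_join([(1, "click"), (5, "view")], subtask_cache)
```

Because the fact stream is shuffled with the same bucket function, every key a subtask sees is guaranteed to be in its partial cache, which is why the join key has to contain all bucket keys.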

Additional query‑related upgrades include:

COUNT(*) optimization that returns statistics from the manifest without scanning data files.

Thin‑mode storage for primary‑key tables that removes duplicate key columns, saving space and speeding up queries.

Bitmap index push-down to the Parquet reader at page granularity, so that pages that cannot match a filter are skipped rather than decoded.
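Two of these upgrades can be sketched in a few lines of Python. This is a conceptual model under stated assumptions: `FileMeta`, `count_star`, `build_bitmap_index`, and `pages_to_read` are hypothetical names, and the index here tracks values per page rather than per row as a real bitmap index would. Answering a count from metadata is also only valid when the recorded row counts reflect the final result, for example when no merging or deletes are pending.

```python
from dataclasses import dataclass


@dataclass
class FileMeta:
    """Hypothetical manifest entry: one data file and its recorded row count."""
    path: str
    row_count: int


def count_star(manifest):
    # Sum the recorded counts; no data file is opened.
    return sum(f.row_count for f in manifest)


def build_bitmap_index(pages):
    """Map each value to an int whose set bits mark the pages containing it."""
    index = {}
    for page_no, values in enumerate(pages):
        for v in values:
            index[v] = index.get(v, 0) | (1 << page_no)
    return index


def pages_to_read(index, value, num_pages):
    """Return the page numbers that may contain `value`; all others are skipped."""
    bits = index.get(value, 0)
    return [p for p in range(num_pages) if (bits >> p) & 1]


manifest = [FileMeta("f1.parquet", 1000), FileMeta("f2.parquet", 250)]
total = count_star(manifest)

pages = [["cn", "us"], ["us"], ["de", "cn"]]
index = build_bitmap_index(pages)
hits = pages_to_read(index, "cn", len(pages))
```

With the filter `country = 'cn'`, only pages 0 and 2 need decoding, and a value absent from the index means the whole file can be skipped.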

Write Optimizations

Paimon deepens its small‑file compaction strategy by making the merging process fully asynchronous, alleviating write bottlenecks and back‑pressure caused by frequent checkpoints and a large number of tiny files.

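Configuration parameters such as write-buffer-spill.max-disk-size let users cap the total size of the temporary local files created when the write buffer spills to disk. Spilling relieves the memory pressure that could otherwise lead to OOM errors, and the cap keeps the spill files from exhausting local disk space.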
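The interaction between the in-memory buffer and the disk cap can be modeled with a toy Python class. This is a sketch, not Paimon's writer: the class and field names are hypothetical, sizes are abstract units, and the assumption that reaching the cap forces a flush to the table's storage is made for illustration only.

```python
class SpillableWriteBuffer:
    """Toy model: a memory buffer that spills to local disk up to a cap."""

    def __init__(self, memory_limit, max_disk_size):
        self.memory_limit = memory_limit
        self.max_disk_size = max_disk_size  # plays the role of the disk cap
        self.in_memory = 0
        self.on_disk = 0
        self.peak_disk = 0
        self.flushes = 0

    def write(self, record_size):
        # Spill first if the record would overflow the memory buffer.
        if self.in_memory + record_size > self.memory_limit:
            self._spill()
        self.in_memory += record_size

    def _spill(self):
        if self.on_disk + self.in_memory > self.max_disk_size:
            # Assumed behavior: the cap is reached, so flush instead of spilling.
            self._flush()
        else:
            self.on_disk += self.in_memory
            self.in_memory = 0
            self.peak_disk = max(self.peak_disk, self.on_disk)

    def _flush(self):
        # Everything buffered so far is written out as data files.
        self.flushes += 1
        self.in_memory = 0
        self.on_disk = 0


buf = SpillableWriteBuffer(memory_limit=100, max_disk_size=250)
for _ in range(10):
    buf.write(40)
```

In this run the local spill files never grow beyond the cap, and the buffer flushes once when a further spill would exceed it.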

AI Integration and Multimodal Data Support

Unified Lake Format for Data + AI

The community proposes a Unified Lake Format for Data + AI, offering a single storage interface and metadata management layer that can hold structured, semi‑structured, and unstructured data in one lake, eliminating data migration and redundancy for AI workloads.

Lance File Format Support

Lance is a columnar format designed for AI applications; it retains the advantages of Parquet while optimizing for high‑dimensional vector, image, and text storage and retrieval. Integration with Lance enables Paimon to manage both traditional table data and AI‑specific feature vectors or training samples efficiently.

Real‑Time Lakehouse and Engine Compatibility

Paimon’s streaming‑batch unified engine has been strengthened to support high‑throughput streaming writes and snapshot reads, enabling incremental updates and near‑real‑time queries. In streaming mode, Paimon can act as a Flink sink, continuously ingesting changelog streams into the lake, while downstream query engines such as Flink, Spark, Hive, Trino, StarRocks, and Doris can read the incremental data in real time.
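The incremental read pattern can be sketched as a reader that remembers the last snapshot it consumed and asks only for later changes. The `Table` class below is a hypothetical in-memory stand-in, not a Paimon API. The row kinds (+I, -U, +U, -D) follow the usual changelog convention for insert, update-before, update-after, and delete.

```python
class Table:
    """Toy table: each commit appends a snapshot holding that commit's changes."""

    def __init__(self):
        self.snapshots = []  # snapshots[i] holds the changes of snapshot id i + 1

    def commit(self, changes):
        self.snapshots.append(list(changes))
        return len(self.snapshots)  # id of the new snapshot

    def read_incremental(self, after_snapshot_id):
        """Return (latest_id, changes committed after the given snapshot id)."""
        new = self.snapshots[after_snapshot_id:]
        return len(self.snapshots), [c for snap in new for c in snap]


table = Table()
table.commit([("+I", 1, "a"), ("+I", 2, "b")])
table.commit([("-U", 1, "a"), ("+U", 1, "a2")])

# A downstream reader starts from scratch and consumes snapshots 1 and 2.
position, first = table.read_incremental(0)

# The sink commits again; the reader resumes from its saved position.
table.commit([("-D", 2, "b")])
position, second = table.read_incremental(position)
```

Each reader keeps its own position, so several engines can consume the same table at different speeds without coordinating with the writer.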

These enhancements position Paimon as a versatile, high‑performance lakehouse solution for both traditional BI and modern AI use cases.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
