What’s New in Apache Paimon 2025? Core Performance, AI Integration & Real‑Time Lakehouse Updates
The 2025 Apache Paimon release brings major performance boosts, AI‑centric multimodal storage, deeper streaming‑batch integration, and broader engine compatibility. This article walks through the query and write optimizations, the memory management tweaks, and the proposed unified lake format for structured and unstructured data.
Core Performance Improvements
Query Performance Enhancements
In the 2025 release, Paimon introduces two key optimizations for dimension‑table Lookup Joins. The first is a pure in‑memory cache mode that loads dimension data directly into TaskManager memory, eliminating RocksDB serialization overhead. The second is a bucket‑based shuffle optimization: when the dimension table is a Fixed Bucket table and the join key contains all bucket keys, each join subtask processes only the bucket data relevant to it, reducing memory usage and load time.
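The bucket‑based shuffle idea can be illustrated with a small simulation. This is only a sketch under stated assumptions: the bucket function (a plain modulo on integer keys), the bucket‑to‑subtask mapping, and all names below are hypothetical stand‑ins, not Paimon's actual implementation.

```python
NUM_BUCKETS = 4   # assumed fixed bucket count of the dimension table
PARALLELISM = 2   # assumed number of lookup-join subtasks


def bucket_of(key, num_buckets=NUM_BUCKETS):
    # Stand-in bucket function for integer keys; Paimon's real hash differs.
    return key % num_buckets


def subtask_of(bucket, parallelism=PARALLELISM):
    # Hypothetical mapping from a bucket to the subtask that owns it.
    return bucket % parallelism


def load_cache(dim_rows, subtask):
    # Each subtask keeps only the rows of its own buckets in memory.
    # (A real reader would open only those bucket files instead of filtering.)
    return {
        key: value
        for key, value in dim_rows
        if subtask_of(bucket_of(key)) == subtask
    }


def lookup_join(stream, dim_rows):
    caches = [load_cache(dim_rows, s) for s in range(PARALLELISM)]
    joined = []
    for key, payload in stream:
        # Shuffle the stream record to the subtask owning the key's bucket.
        owner = subtask_of(bucket_of(key))
        joined.append((key, payload, caches[owner].get(key)))
    return joined, caches


dim_rows = [(k, f"dim-{k}") for k in range(8)]
stream = [(3, "click"), (6, "view"), (9, "buy")]
joined, caches = lookup_join(stream, dim_rows)
print(joined)
print([len(c) for c in caches])
```

With two subtasks, each cache holds half of the dimension rows, which is the source of the memory and load‑time savings described above.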
Additional query‑related upgrades include:
COUNT(*) optimization that returns statistics from the manifest without scanning data files.
Thin‑mode storage for primary‑key tables that removes duplicate key columns, saving space and speeding up queries.
Bitmap index push‑down to the Parquet reader at page granularity, dramatically improving filter effectiveness and performance.
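To see why COUNT(*) can be answered without touching data files, consider the following simplified model of manifest metadata. The `ManifestEntry` class and its field names are illustrative assumptions, and the sketch covers only the simple case where per‑file row counts add up directly (no merging of duplicate keys across files).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    kind: str        # "ADD" or "DELETE": file added to or removed from the table
    file_name: str
    row_count: int   # row-count statistic recorded when the file was written


def live_files(entries):
    # Replay the entries in order to find the files that are still live.
    live = {}
    for entry in entries:
        if entry.kind == "ADD":
            live[entry.file_name] = entry
        else:
            live.pop(entry.file_name, None)
    return list(live.values())


def count_star(entries):
    # Sum metadata statistics only; no data file is opened.
    return sum(entry.row_count for entry in live_files(entries))


entries = [
    ManifestEntry("ADD", "data-a.parquet", 100),
    ManifestEntry("ADD", "data-b.parquet", 50),
    ManifestEntry("DELETE", "data-a.parquet", 100),
    ManifestEntry("ADD", "data-c.parquet", 120),
]
print(count_star(entries))
```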
Write Optimizations
Paimon deepens its small‑file compaction strategy by making the merging process fully asynchronous, alleviating write bottlenecks and back‑pressure caused by frequent checkpoints and a large number of tiny files.
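The effect of moving compaction off the write path can be sketched conceptually as follows. This is a toy model, not Paimon code: the `BucketWriter` class, the trigger threshold, and the in‑memory "files" are all hypothetical. The point is only that `flush` returns immediately while merging happens on a background thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BucketWriter:
    def __init__(self, compaction_trigger=4):
        self.files = []                    # each "file" is a list of rows
        self.lock = threading.Lock()
        self.trigger = compaction_trigger
        # A single worker, so at most one compaction runs at a time.
        self.pool = ThreadPoolExecutor(max_workers=1)

    def flush(self, rows):
        # Called on the write path: add a small file, never wait for merging.
        with self.lock:
            self.files.append(list(rows))
            needs_compaction = len(self.files) >= self.trigger
        if needs_compaction:
            self.pool.submit(self._compact)  # returns immediately

    def _compact(self):
        with self.lock:
            picked = list(self.files)      # snapshot of the files to merge
        if len(picked) < 2:
            return
        merged = [row for f in picked for row in f]  # merge outside the lock
        with self.lock:
            # Replace the picked prefix; keep files flushed in the meantime.
            self.files = [merged] + self.files[len(picked):]

    def close(self):
        self.pool.shutdown(wait=True)      # wait for pending compactions


writer = BucketWriter()
for i in range(10):
    writer.flush([i])
writer.close()
all_rows = [row for f in writer.files for row in f]
print(len(writer.files), all_rows)
```

The final file count depends on thread timing, but no rows are lost and the writer is never blocked while a merge is in progress.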
Configuration parameters such as write-buffer-spill.max-disk-size let users cap the total size of the temporary local files created when the write buffer spills to disk, preventing disk‑space exhaustion. Spilling itself relieves memory pressure during writes, which helps avoid OOM situations.
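As a minimal sketch of how such options might be set, the snippet below renders a Flink SQL table definition with a WITH clause. The table name, columns, and the 10 gb value are hypothetical, and pairing the cap with write-buffer-spillable is an assumption about a typical setup; check the configuration reference of your Paimon version before relying on it.

```python
def with_clause(options):
    # Render a dict of table options as a SQL WITH (...) clause.
    lines = [f"    '{key}' = '{value}'" for key, value in options.items()]
    return "WITH (\n" + ",\n".join(lines) + "\n)"


# Illustrative values only.
spill_options = {
    "write-buffer-spillable": "true",            # allow the buffer to spill to disk
    "write-buffer-spill.max-disk-size": "10 gb",  # cap on temporary local files
}

ddl = (
    "CREATE TABLE orders_sink (\n"   # hypothetical table
    "    order_id BIGINT,\n"
    "    amount DECIMAL(10, 2),\n"
    "    PRIMARY KEY (order_id) NOT ENFORCED\n"
    ") " + with_clause(spill_options)
)
print(ddl)
```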
AI Integration and Multimodal Data Support
Unified Lake Format for Data + AI
The community proposes a Unified Lake Format for Data + AI, offering a single storage interface and metadata management layer that can hold structured, semi‑structured, and unstructured data in one lake, eliminating data migration and redundancy for AI workloads.
Lance File Format Support
Lance is a columnar format designed for AI applications; it retains the advantages of Parquet while optimizing for high‑dimensional vector, image, and text storage and retrieval. Integration with Lance enables Paimon to manage both traditional table data and AI‑specific feature vectors or training samples efficiently.
Real‑Time Lakehouse and Engine Compatibility
Paimon’s streaming‑batch unified engine has been strengthened to support high‑throughput streaming writes and snapshot reads, enabling incremental updates and near‑real‑time queries. In streaming mode, Paimon can act as a Flink sink, continuously ingesting changelog streams into the lake, while downstream query engines such as Flink, Spark, Hive, Trino, StarRocks, and Doris can read the incremental data in real time.
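The interplay of continuous changelog ingestion and incremental reads can be modeled in a few lines. This is a conceptual sketch only: `SnapshotLog` and `apply_changes` are hypothetical names, and snapshots are reduced to in‑memory lists of changelog rows tagged with the usual +I / +U / -D row kinds.

```python
class SnapshotLog:
    def __init__(self):
        self.snapshots = []  # list index i holds the changes of snapshot id i + 1

    def commit(self, changes):
        # The sink commits one batch of changelog rows as a new snapshot.
        self.snapshots.append(list(changes))
        return len(self.snapshots)  # id of the new snapshot

    def read_incremental(self, after_snapshot):
        # Return the changes committed after the given snapshot id,
        # plus the latest snapshot id to resume from next time.
        pending = self.snapshots[after_snapshot:]
        changes = [change for snapshot in pending for change in snapshot]
        return changes, len(self.snapshots)


def apply_changes(state, changes):
    # Fold changelog rows into a key-value view of the table.
    for kind, key, value in changes:
        if kind in ("+I", "+U"):
            state[key] = value
        elif kind == "-D":
            state.pop(key, None)
    return state


log = SnapshotLog()
log.commit([("+I", 1, "a"), ("+I", 2, "b")])
log.commit([("+U", 1, "a2"), ("-D", 2, "b")])

state, position = {}, 0
changes, position = log.read_incremental(position)
apply_changes(state, changes)

log.commit([("+I", 3, "c")])
new_changes, position = log.read_incremental(position)
apply_changes(state, new_changes)
print(state, position)
```

A downstream reader only needs to remember the last snapshot id it consumed; each poll then returns just the changes committed since that point.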
These enhancements position Paimon as a versatile, high‑performance lakehouse solution for both traditional BI and modern AI use cases.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
