
How StarRocks and Apache Paimon Build a True Lakehouse Native Engine

This article details the deep integration of StarRocks with Apache Paimon, describing the unified architecture, version evolution, performance enhancements, time‑travel queries, native readers/writers, distributed planning, and future roadmap for achieving lakehouse‑native analytics at scale.

StarRocks

As data lakes become core infrastructure for digital transformation, querying heterogeneous data sources efficiently from a single compute engine has become a key challenge. StarRocks addresses this by integrating tightly with Apache Paimon, forming a Lakehouse Native solution that supports multi-source federated analysis while improving performance, functionality, and observability.

Unified Architecture

StarRocks uses a unified catalog mechanism to manage both internal tables and external Paimon tables within a single engine. This design extends StarRocks' storage-compute separation architecture, so queries on data stored in remote lakes benefit from the same OLAP optimizations as internal tables, such as CPU-level instruction acceleration, vectorized execution, and I/O caching. As a result, the data lake moves from a "cold storage" role to a high-performance analytical source.

Evolution of StarRocks‑Paimon Support

StarRocks 3.1: Introduced Paimon external tables, read through JNI. This release enabled basic reads and supported predicate push-down and materialized-view acceleration over Paimon tables.

StarRocks 3.2: Added a metadata cache in the frontend (FE) planning stage, speeding up plan generation and supporting table- and column-level statistics. It also implemented partition-level refresh for materialized views and native reader support for Paimon deletion vector (DV) tables, improving read performance for Merge-On-Read (MOR) tables.

StarRocks 3.3: Marked a critical step toward Lakehouse Native with multiple core features (detailed below).

Key Feature Enhancements

Time Travel: Supports VERSION AS OF and TIMESTAMP AS OF queries that read a historical snapshot of a table, which is useful for data audits, rollback, and A/B testing.

Paimon Format Table: A Hive-compatible table type that enables seamless migration of existing Hive tables to Paimon. StarRocks recognizes these tables automatically and queries them efficiently.

Native Reader/Writer: Replaces JNI-based reads of MOR tables with a C++ native scanner built on the Paimon CPP SDK, delivering more than 5× the read performance and significantly higher write throughput.

Distributed Plan: When a table has many manifest files, the FE distributes manifest parsing across multiple compute nodes (CNs). Predicate push-down then runs in parallel, and planning capacity scales linearly with the number of nodes.

DV Index Cache: Caches deserialized DV index objects per bucket, avoiding repeated deserialization and raising QPS by more than 80% in high-concurrency scenarios.

Observability: Extends query profiles with plan-stage metrics (cache hit rates, remote reads, predicate push-down effectiveness, and files scanned) and with the backend (BE) stage ratio of JNI to native reads, which helps guide compaction and table-format decisions.
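
As a rough illustration of what the two Time Travel clauses in the list above resolve to, the following Python sketch models a table's snapshot history and picks a snapshot either by id or by time. The Snapshot class, both function names, and the sample history are hypothetical. The rule that TIMESTAMP AS OF selects the latest snapshot committed at or before the given time is an assumption based on common snapshot-table conventions, not a description of the StarRocks implementation.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical stand-in for one committed table snapshot."""
    snapshot_id: int      # increases with every commit
    commit_time_ms: int   # commit time in epoch milliseconds


def resolve_version(snapshots: Sequence[Snapshot], snapshot_id: int) -> Optional[Snapshot]:
    """VERSION AS OF: return the snapshot with exactly this id, or None if absent."""
    for snap in snapshots:
        if snap.snapshot_id == snapshot_id:
            return snap
    return None


def resolve_timestamp(snapshots: Sequence[Snapshot], ts_ms: int) -> Optional[Snapshot]:
    """TIMESTAMP AS OF (assumed rule): latest snapshot committed at or before ts_ms."""
    ordered = sorted(snapshots, key=lambda s: s.commit_time_ms)
    times = [s.commit_time_ms for s in ordered]
    idx = bisect_right(times, ts_ms)  # number of snapshots with commit time <= ts_ms
    return ordered[idx - 1] if idx > 0 else None


# Sample history: three commits, one second apart.
history = [Snapshot(1, 1_000), Snapshot(2, 2_000), Snapshot(3, 3_000)]

by_version = resolve_version(history, 2)      # exact id match
by_time = resolve_timestamp(history, 2_500)   # falls between commits 2 and 3
too_early = resolve_timestamp(history, 500)   # predates the first commit
```

In this model a timestamp that falls between two commits reads the earlier one, and a timestamp older than the first commit has no snapshot to read.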

Performance Optimizations

The native scanner removes JVM overhead, JNI type conversion, and GC pauses from the read path, while the distributed plan reduces the FE latency caused by parsing large numbers of manifest files. Index caching further mitigates read amplification for primary-key tables, and detailed profiling helps identify bottlenecks such as a high share of JNI reads.
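
The fan-out performed by the distributed plan can be pictured with a small Python sketch. It assumes each manifest entry carries min/max statistics for the filter column, and it uses a thread pool as a stand-in for CN nodes, so it shows the split-prune-merge shape rather than a real speedup. FileEntry, prune_manifest, and plan_scan are hypothetical names, not StarRocks or Paimon APIs.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical manifest entry describing one data file."""
    path: str
    min_key: int   # smallest value of the filter column in the file
    max_key: int   # largest value of the filter column in the file


# In this sketch a "manifest" is simply a list of data-file entries.
Manifest = List[FileEntry]


def prune_manifest(manifest: Manifest, lo: int, hi: int) -> List[FileEntry]:
    """Keep files whose [min_key, max_key] range overlaps lo <= key <= hi."""
    return [f for f in manifest if f.max_key >= lo and f.min_key <= hi]


def plan_scan(manifests: Sequence[Manifest], lo: int, hi: int, workers: int) -> List[FileEntry]:
    """Prune each manifest in a separate task, then merge the surviving files."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves manifest order, so the merged plan is deterministic.
        parts = pool.map(lambda m: prune_manifest(m, lo, hi), manifests)
        return [entry for part in parts for entry in part]


manifests = [
    [FileEntry("f1", 0, 9), FileEntry("f2", 10, 19)],
    [FileEntry("f3", 20, 29), FileEntry("f4", 30, 39)],
]
selected = plan_scan(manifests, lo=15, hi=25, workers=2)
paths = [f.path for f in selected]
```

With the sample data, only f2 and f3 overlap the range 15 to 25, so only those two files would be handed to the scan stage.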

Future Roadmap

The long-term goal is to bring Paimon query performance and experience in line with native StarRocks tables. BE execution is already comparable, since both paths scan columnar data (Paimon files are stored in formats such as Parquet or ORC). The FE plan stage remains the challenge, because manifest reads are latency-sensitive. Future work will focus on smarter cache pre-warming, asynchronous parsing, and metadata compression to make query latency stable and predictable and to eliminate performance spikes.

Conclusion

The deep integration of StarRocks and Paimon illustrates how modern lakehouse architectures are evolving from simple data lake access to a truly "lakehouse-native" engine. Such an engine unifies storage and compute, adds functionality, and delivers the performance needed for real-world, high-concurrency, low-latency workloads across industries.

Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
