
How StarRocks and Apache Paimon Unite to Build a True Lakehouse Native Engine

Across successive releases, StarRocks and Apache Paimon have been progressively integrated into a unified lakehouse architecture. The integration now supports multi-source federated analysis, time-travel queries, native readers and writers, distributed planning, and advanced profiling, with performance gains that bring Paimon query speed on par with native StarRocks tables.

Alibaba Cloud Big Data AI Platform

Following the successful Streaming Lakehouse Meetup, the second online session on December 10 featured Alibaba Cloud Computing Platform engineer Zhang Qingyu presenting the deep integration of StarRocks and Apache Paimon and exploring how to build a genuine Lakehouse Native data engine.

StarRocks Data Lake Overall Architecture: Single Engine, Multi-Source Federated Analysis

StarRocks and Paimon share a unified catalog design, allowing StarRocks to manage both internal tables and external lake tables (e.g., Paimon) within a single engine and to execute cross-catalog federated queries. This extends StarRocks' storage–compute separation architecture: although data resides in a remote lake, query execution fully leverages StarRocks' OLAP optimizations—CPU-level instruction acceleration, vectorized execution, and IO-layer caching—so the lake becomes a high-performance analytical source rather than cold storage.
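As a concrete illustration, the catalog design means a Paimon warehouse can be registered once and then joined against native StarRocks tables in ordinary SQL. The catalog, database, table names, and warehouse path below are illustrative, not from the talk:

```sql
-- Register a Paimon external catalog (names and paths are hypothetical).
CREATE EXTERNAL CATALOG paimon_catalog
PROPERTIES (
    "type" = "paimon",
    "paimon.catalog.type" = "filesystem",
    "paimon.catalog.warehouse" = "oss://my-bucket/paimon/warehouse"
);

-- Cross-catalog federated query: a Paimon lake table joined
-- with a native StarRocks table in the default catalog.
SELECT o.order_id, o.amount, u.region
FROM paimon_catalog.sales_db.orders AS o
JOIN default_catalog.app_db.users AS u
  ON o.user_id = u.id
WHERE o.dt = '2024-01-01';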

Evolution of StarRocks + Paimon

StarRocks 3.1: Introduced Paimon external tables via JNI, providing basic read capability and supporting materialized view acceleration and predicate push-down.

StarRocks 3.2: Added a metadata cache to speed up plan generation, collected table- and column-level statistics, enabled partition-level refresh of materialized views, and allowed native reading of Paimon DV tables, dramatically improving read performance for high-throughput, low-latency scenarios.
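Partition-level refresh means an asynchronous materialized view over a Paimon table can refresh only the partitions whose lake data changed, rather than the whole view. A minimal sketch, assuming a partitioned `orders` table in a previously registered Paimon catalog (all names hypothetical):

```sql
-- Async materialized view over a Paimon external table,
-- partitioned by the same column as the base table so that
-- only changed partitions need to be refreshed.
CREATE MATERIALIZED VIEW daily_sales_mv
PARTITION BY dt
REFRESH ASYNC
AS
SELECT dt, SUM(amount) AS total_amount
FROM paimon_catalog.sales_db.orders
GROUP BY dt;
```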

StarRocks 3.3: Marked a key step toward a Lakehouse Native solution, delivering multiple core features that are detailed below.

Latest Advances in StarRocks + Paimon

Feature Enhancements

Time Travel: Supports querying historical snapshots using VERSION AS OF or TIMESTAMP AS OF, valuable for data audit, rollback, and A/B testing.
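The two time-travel forms mentioned above look like this in practice (table name and snapshot values are illustrative):

```sql
-- Read a specific Paimon snapshot by version/snapshot ID.
SELECT * FROM paimon_catalog.sales_db.orders VERSION AS OF 5;

-- Read the table state as of a point in time,
-- e.g. to audit or reproduce a past report.
SELECT *
FROM paimon_catalog.sales_db.orders
TIMESTAMP AS OF '2024-01-01 00:00:00';
```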

Paimon Format Table: A Hive-compatible table type that lets users migrate existing Hive tables to Paimon, with StarRocks seamlessly recognizing and efficiently querying them, reducing migration cost.

Performance Optimizations

Native Reader/Writer: Replaces JNI-based Java processing with a C++ Paimon native scanner built on the Paimon C++ SDK, achieving >5× read speedup for MOR tables and significantly higher write throughput.

Distributed Plan: When manifest files are numerous, the Frontend distributes parsing tasks to multiple Compute Nodes, enabling parallel predicate push-down and linear scaling of plan-stage performance.

DV Index Cache: Caches DV index objects at bucket level, avoiding repeated full-manifest deserialization; this Java-object cache reduces CPU and memory pressure and boosts QPS by over 80% in high-concurrency scenarios.

Observability: Enhanced Profiling Metrics

StarRocks now provides a comprehensive profile metric system covering both plan and execution phases. Users can monitor manifest cache hit rates, remote read counts, predicate push‑down effectiveness, and scanned file numbers during planning, as well as the proportion of JNI versus native reads during execution, helping identify cache or compaction needs.
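A typical workflow for inspecting these metrics is to enable profiling for the session, run the query, and then pull up its profile. This is a hedged sketch based on StarRocks' session-level profiling commands; the query and placeholder ID are illustrative:

```sql
-- Enable runtime profiling for the current session.
SET enable_profile = true;

-- Run the workload to be inspected.
SELECT COUNT(*)
FROM paimon_catalog.sales_db.orders
WHERE dt = '2024-01-01';

-- List recent profiled queries, then analyze one by its query ID
-- to see plan- and execution-phase metrics (cache hits, remote
-- reads, push-down effectiveness, JNI vs. native read ratio).
SHOW PROFILELIST;
ANALYZE PROFILE FROM '<query_id>';
```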

Future Plans: Align Performance with Native Tables

The long‑term goal is to make Paimon query performance and experience indistinguishable from querying StarRocks native tables. While BE execution gaps have largely closed, the FE planning stage still faces latency spikes due to manifest cache misses. Future work will focus on smarter cache pre‑warming, asynchronous parsing, and metadata compression to stabilize latency and eliminate performance “jitter”.

Conclusion

The deep integration of StarRocks and Paimon exemplifies the evolution of modern lakehouse architectures: it moves beyond merely "querying the lake" to truly "understanding the lake", delivering a unified architecture, a rich feature set, and deep performance optimizations that have been validated in high-concurrency, low-latency scenarios across e-commerce, logistics, and finance workloads.

Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
