
How Paimon Revamps Lakehouse Management and Supercharges Queries with StarRocks

This article details Tongcheng Travel's migration from Hive/Kudu/Hudi to Paimon for lakehouse integration, highlighting a 30% resource reduction, three‑fold write speed gains, significant query acceleration via StarRocks, the end‑to‑end architecture across ODS‑DWD‑DWS‑ADS layers, and future roadmap plans.


Background and Pain Points

Initially, Tongcheng Travel's data warehouse relied on Hive for offline analytics, supplemented by Apache Kudu to meet near‑real‑time needs. This dual‑storage approach caused data duplication, long Spark batch windows, and high SSD‑based storage costs.

- Two copies of data (Hive and Kudu) that could not be shared.
- Batch-oriented Spark jobs that could not satisfy real-time requirements.
- Kudu's SSD storage, which incurred prohibitive costs.

In 2022, Hudi was introduced to provide near‑real‑time ODS updates and streaming reads, but it suffered from low write efficiency, heavy compaction overhead, and occasional data loss.

Adopting Paimon

In 2023, Paimon replaced Hudi. Compared with Hudi, Paimon offers:

- Support for multiple table models (primary-key, append-only, partial-update, aggregation) to fit diverse warehouse needs.
- Multiple changelog modes (lookup, full compaction) covering more business scenarios.
- An ecosystem that enables query acceleration through StarRocks external tables.
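As an illustration of the table models above, a minimal Flink SQL sketch of a Paimon primary-key table might look like the following. The table, columns, and option values are hypothetical, and a Paimon catalog is assumed to be the current catalog:

```sql
-- Illustrative only: a Paimon primary-key table created from Flink SQL.
CREATE TABLE ods_orders (
    order_id    BIGINT,
    user_id     BIGINT,
    amount      DECIMAL(10, 2),
    update_time TIMESTAMP(3),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    -- 'deduplicate' is the default merge engine for primary-key tables;
    -- 'partial-update' and 'aggregation' are the other models the article names.
    'merge-engine' = 'deduplicate',
    -- 'lookup' is one changelog producer; 'full-compaction' is the other
    -- mode mentioned above.
    'changelog-producer' = 'lookup'
);
```

Append-only tables are the same DDL without a primary key; the merge engine then no longer applies.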

Performance gains after migration include:

- ~30% reduction in ODS synchronization resource consumption.
- Roughly three-fold improvement in write throughput.
- Significant query speedup, especially for point-lookup workloads, thanks to ordered Paimon files.

Current ODS ingestion exceeds 2,000 tasks, with Paimon storing nearly 600 TB of data, and over 1,000 Hudi tables have been fully switched to Paimon.

Lakehouse Architecture Design

The storage foundation is built on a federated HDFS cluster. Data sources include binlog and server logs, which are streamed into Paimon ODS tables via Flink. After initial cleansing, Flink further processes ODS data into the DWD layer.
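The binlog-to-ODS ingestion step could be sketched in Flink SQL roughly as follows. The connector options, hostnames, and table names are illustrative placeholders, not details from Tongcheng's pipeline:

```sql
-- Hypothetical source: MySQL binlog via the flink-cdc connector.
CREATE TEMPORARY TABLE mysql_orders (
    order_id BIGINT,
    status   STRING,
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector'     = 'mysql-cdc',
    'hostname'      = 'mysql-host',
    'port'          = '3306',
    'username'      = 'flink',
    'password'      = '***',
    'database-name' = 'trade',
    'table-name'    = 'orders'
);

-- Continuously sync the changelog into a Paimon ODS table.
INSERT INTO paimon_catalog.ods.orders
SELECT * FROM mysql_orders;
```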

In the DWD layer, partial‑update tables widen data, and aggregation tables generate DWS‑level aggregates. DWS data can be accessed by StarRocks either through external table queries or by materializing into StarRocks local tables, achieving minute‑level latency.
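A DWS-level aggregate of the kind described above might be declared with Paimon's aggregation merge engine; the metrics and aggregate functions below are invented for illustration:

```sql
-- Hypothetical DWS table: the 'aggregation' merge engine pre-aggregates
-- per-key metrics at the storage layer on every write.
CREATE TABLE dws_city_sales (
    city      STRING,
    order_cnt BIGINT,
    gmv       DECIMAL(18, 2),
    PRIMARY KEY (city) NOT ENFORCED
) WITH (
    'merge-engine' = 'aggregation',
    'fields.order_cnt.aggregate-function' = 'sum',
    'fields.gmv.aggregate-function'       = 'sum'
);
```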

The entire pipeline implements a unified batch‑stream model: storage is handled by Paimon, while computation is delegated to Flink. All layers are queryable via Flink SQL, Spark SQL, Trino, or StarRocks, providing high flexibility.

Application Scenarios

For real‑time order‑wide tables, multiple dimension tables (product, city, supplier) and extension tables (order tracking, payment, quantity) are joined and widened. Traditional Flink‑based multi‑stream joins suffered from complex state management, high memory usage, and difficult extensibility.

Using Paimon’s partial‑update model, streams are unioned and updates are applied at the storage layer, resulting in lower memory consumption, easier data correction, and straightforward addition of new tables.

SQL examples (omitted for brevity) illustrate creating a wide order table with 'merge-engine' = 'partial-update' and using sequence groups to control per-stream update ordering.
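Since the original SQL is omitted, here is a hedged sketch of what such a table could look like. All table and column names are invented; only the 'merge-engine' and 'sequence-group' options are Paimon's actual mechanism:

```sql
-- Hypothetical wide order table: 'partial-update' merges each stream's
-- columns into one row, and each sequence group orders updates for its
-- own column set independently.
CREATE TABLE dwd_order_wide (
    order_id     BIGINT,
    pay_amount   DECIMAL(10, 2),
    pay_time     TIMESTAMP(3),
    pay_seq      BIGINT,
    track_status STRING,
    track_seq    BIGINT,
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'merge-engine' = 'partial-update',
    -- updates to pay_amount/pay_time are ordered by pay_seq
    'fields.pay_seq.sequence-group'   = 'pay_amount,pay_time',
    -- updates to track_status are ordered by track_seq
    'fields.track_seq.sequence-group' = 'track_status'
);
```

Each upstream stream then writes only its own columns (leaving the rest NULL), and late-arriving records from one stream cannot roll back another stream's newer columns.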

StarRocks Query Acceleration

After the lakehouse is built, the ADS layer relies on StarRocks for fast analytics. Compared with ClickHouse (strong on single-table queries but weak on joins) and Greenplum (concurrency bottlenecks), StarRocks delivers 2-5× higher performance on TPC-H workloads.

StarRocks external tables allow direct querying of Paimon data without data duplication. Benchmarks on a 10 GB TPC-H dataset show a 4-10× speedup over Trino.
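Querying Paimon from StarRocks typically goes through an external catalog. A sketch, with placeholder warehouse path and names:

```sql
-- Hypothetical StarRocks external catalog over a Paimon filesystem warehouse.
CREATE EXTERNAL CATALOG paimon_lake
PROPERTIES (
    "type" = "paimon",
    "paimon.catalog.type"      = "filesystem",
    "paimon.catalog.warehouse" = "hdfs://nameservice1/paimon/warehouse"
);

-- Query Paimon tables in place; no data is copied into StarRocks.
SELECT * FROM paimon_lake.ods.orders LIMIT 10;
```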

Storage‑Compute Separation

To address scaling limits of the traditional integrated architecture, StarRocks’ storage‑compute separation decouples data stored in remote HDFS/S3 from compute nodes. FE nodes handle metadata, while CN nodes focus on query execution, enabling elastic scaling without costly data migration.
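In StarRocks 3.x this mode is selected at FE startup. The fragment below is a sketch under that assumption; the exact key names should be checked against the documentation for your version:

```properties
# Hypothetical fe.conf fragment enabling shared-data (storage-compute
# separated) mode, backed by remote HDFS.
run_mode = shared_data
cloud_native_storage_type = HDFS
cloud_native_hdfs_url = hdfs://nameservice1/starrocks
```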

In offline analysis, querying Hive external tables via StarRocks eliminates the need to maintain duplicate Hive and StarRocks internal tables, reducing storage and compute costs.

Future Plans

- Upgrade Paimon from version 0.6 to 0.9 and adopt deletion-vector (DV) tables for further query efficiency.
- Expand lakehouse usage to more real-time and offline scenarios, with guidance to lower the adoption barrier.
- Continue enhancing the StarRocks experience: intelligent materialized view creation, visual management platforms, and unified query-engine consolidation.

Tags: performance optimization, big data, Flink, StarRocks, Paimon, lakehouse
Written by StarRocks

StarRocks is an open-source project under the Linux Foundation, focused on building a high-performance, scalable analytical database that enables enterprises to create an efficient, unified lakehouse paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.