How StarRocks Beats Trino: 4.3× Faster Queries on Apache Paimon Lakehouse
This article shows how to build a high‑performance data‑lake analytics stack by pairing StarRocks with Apache Paimon. It covers direct queries, Data Cache acceleration, and asynchronous materialized views, and presents benchmark results in which StarRocks runs TPC‑H queries up to 4.3× faster than Trino, with further latency reductions from caching and materialized views.
Background
Apache Paimon is a lake format built on a log‑structured merge‑tree (LSM) architecture, providing real‑time updates for lakehouse workloads. StarRocks is an open‑source MPP database with a fully vectorized engine and cost‑based optimizer, capable of querying external data sources such as Paimon without data migration.
Direct query of Paimon tables
Create a filesystem‑based Paimon external catalog in StarRocks:
```sql
CREATE EXTERNAL CATALOG paimon_fs_catalog
PROPERTIES (
    "type" = "paimon",
    "paimon.catalog.type" = "filesystem",
    "paimon.catalog.warehouse" = "oss://your-bucket/paimon/warehouse"
);
```

An example query (TPC‑H Q1) on a 100 GB append‑only ORC table runs in 148.92 s on StarRocks versus 640.8 s on Trino (≈4.3× faster).
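Once the catalog is registered, Paimon tables can be queried in place with three‑part names and no ingestion step. A minimal sketch, using the `paimon_tpch_flat_orc_100` database from the benchmark and an abbreviated Q1‑style aggregation for illustration:

```sql
-- Inspect what the external catalog exposes
SET CATALOG paimon_fs_catalog;
SHOW DATABASES;

-- Query a Paimon table directly with a fully qualified name
SELECT
    l_returnflag,
    l_linestatus,
    SUM(l_quantity) AS sum_qty
FROM paimon_fs_catalog.paimon_tpch_flat_orc_100.lineitem
WHERE l_shipdate <= DATE '1998-09-02'
GROUP BY l_returnflag, l_linestatus;
```

No table creation or data copy is needed on the StarRocks side; the external catalog resolves schemas from Paimon metadata at query time.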
Data Cache acceleration (StarRocks 2.5+)
Configure the BE to enable the data cache and allocate disk/memory space:
```
# Enable data cache
datacache_enable = true
# Disk cache size (e.g., 20 GB)
datacache_disk_size = 21474836480
# Memory cache size (e.g., 4 GB)
datacache_mem_size = 4294967296
# Disk paths for cache storage
datacache_disk_path = /mnt/disk1/starrocks/storage/datacache;/mnt/disk2/starrocks/storage/datacache;/mnt/disk3/starrocks/storage/datacache;/mnt/disk4/starrocks/storage/datacache
```

Enable the cache for a session:

```sql
SET enable_scan_datacache = true;
```

After enabling, the same TPC‑H query runs three times with total latencies of 134.59 s, 110.20 s, and 113.12 s. Cache‑hit metrics (DataCacheReadBytes ≈ 10.1 GB) indicate that most data is served from local storage, yielding a ~35 % performance improvement.
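To see the cache at work, one simple approach (a sketch; exact profile counter names may vary across StarRocks versions) is to run the same scan twice and compare the query profile:

```sql
-- First full scan: reads from OSS and populates the local cache
SELECT COUNT(*) FROM paimon_fs_catalog.paimon_tpch_flat_orc_100.lineitem;

-- Second scan: with enable_scan_datacache on, the profile's
-- DataCacheReadBytes counter should account for most bytes read
SELECT COUNT(*) FROM paimon_fs_catalog.paimon_tpch_flat_orc_100.lineitem;

-- For an A/B comparison, disable cache reads in this session only
SET enable_scan_datacache = false;
```

Because the session variable is scoped to the connection, cached and uncached latencies can be compared without restarting the BE.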
Asynchronous materialized view
For high‑frequency complex queries, create an asynchronous materialized view that pre‑computes results on the Paimon table:
```sql
CREATE MATERIALIZED VIEW lineitem
DISTRIBUTED BY HASH(l_shipdate)
REFRESH IMMEDIATE MANUAL
AS
SELECT
    l_returnflag,
    l_linestatus,
    l_shipdate,
    SUM(l_quantity) AS sum_qty,
    SUM(l_extendedprice) AS sum_base_price,
    SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    AVG(l_quantity) AS avg_qty,
    AVG(l_extendedprice) AS avg_price,
    AVG(l_discount) AS avg_disc,
    COUNT(*) AS count_order
FROM paimon_fs_catalog.paimon_tpch_flat_orc_100.lineitem
GROUP BY l_returnflag, l_linestatus, l_shipdate;
```

After the view is built, TPC‑H Q1 finishes in ~0.04 s, orders of magnitude faster than the 148.92 s uncached direct query.
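Since the view is declared with manual refresh, it must be refreshed explicitly; afterwards, matching queries against the base Paimon table can be served by the view via transparent rewrite. A sketch:

```sql
-- Trigger a refresh of the manual-refresh view
REFRESH MATERIALIZED VIEW lineitem;

-- This query keeps targeting the base Paimon table; when the optimizer's
-- transparent rewrite applies, it is answered from the materialized view
SELECT
    l_returnflag,
    l_linestatus,
    SUM(l_quantity) AS sum_qty
FROM paimon_fs_catalog.paimon_tpch_flat_orc_100.lineitem
GROUP BY l_returnflag, l_linestatus;

-- EXPLAIN should show the view in the plan when the rewrite kicks in
EXPLAIN
SELECT l_returnflag, COUNT(*)
FROM paimon_fs_catalog.paimon_tpch_flat_orc_100.lineitem
GROUP BY l_returnflag;
```

No application changes are required: queries that aggregate a subset of the view's GROUP BY keys can be rolled up from the pre‑computed results.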
Current capabilities
Support for HDFS, OSS, S3 and OSS‑HDFS storage
Integration with Hive Metastore (HMS) and Alibaba DLF metadata services
Native handling of Paimon primary‑key and append‑only tables
Query of Paimon system tables (e.g., read‑optimized, snapshots)
Cross‑format joins between Paimon and other lake formats
Joins between Paimon tables and StarRocks internal tables
Data Cache for accelerated reads
Asynchronous materialized view with transparent query rewrite
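The cross‑catalog join capabilities above can be illustrated with a hypothetical query (the `orders` table exists in the benchmark dataset; the internal dimension table and its columns are invented for illustration):

```sql
-- Join a Paimon fact table with a StarRocks internal dimension table.
-- default_catalog is the built-in internal catalog; dim_db.region_dim
-- and its columns are illustrative names, not part of the benchmark.
SELECT
    o.o_orderkey,
    d.region_name
FROM paimon_fs_catalog.paimon_tpch_flat_orc_100.orders AS o
JOIN default_catalog.dim_db.region_dim AS d
    ON o.o_custkey = d.cust_key;
```

The same pattern applies to joins across lake formats, e.g. a Paimon table against a Hudi or Iceberg table exposed through another external catalog.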
Future roadmap
Native reader support for primary‑key tables with deletion vectors
Metadata caching to reduce I/O during analysis
Incorporate Paimon table statistics into optimizer plans
Enhance async materialized view creation and rewrite capabilities
Test environment
Benchmarks were executed on an Alibaba Cloud EMR on ECS cluster (1 master + 3 core nodes, ecs.g6.4xlarge, 16 vCPU, 64 GiB). Software versions: StarRocks 3.2.4, Trino 427, Paimon 0.7. The TPC‑H dataset consists of 100 GB of Append‑Only ORC files stored in OSS. JVM heap size for both Trino and StarRocks was set to 50 GB. Data Cache disk and memory limits were configured as shown above.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.