How StarRocks Beats Trino: 4.3× Faster Queries on Apache Paimon Lakehouse
This article shows how to build a high‑performance data‑lake analytics stack by pairing StarRocks with Apache Paimon. It covers direct queries, Data Cache acceleration, and asynchronous materialized views, and presents benchmark results in which StarRocks runs TPC‑H queries up to 4.3× faster than Trino, with further latency reductions from caching and materialized views.
Background
Apache Paimon is a lake format built on a log‑structured merge‑tree (LSM) architecture, providing real‑time updates for lakehouse workloads. StarRocks is an open‑source MPP database with a fully vectorized engine and cost‑based optimizer, capable of querying external data sources such as Paimon without data migration.
Direct query of Paimon tables
Create a filesystem‑based Paimon external catalog in StarRocks:
```sql
CREATE EXTERNAL CATALOG paimon_fs_catalog
PROPERTIES (
    "type" = "paimon",
    "paimon.catalog.type" = "filesystem",
    "paimon.catalog.warehouse" = "oss://your-bucket/paimon/warehouse"
);
```

An example query (TPC‑H Q1) on a 100 GB append‑only ORC table runs in 148.92 s on StarRocks versus 640.8 s on Trino (≈4.3× faster).
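Once the catalog is registered, Paimon tables can be queried in place with three‑part names and no ingestion step. A minimal sketch, using the `paimon_tpch_flat_orc_100` database from the benchmark and an abbreviated Q1‑style aggregation for illustration:

```sql
-- Inspect what the external catalog exposes
SET CATALOG paimon_fs_catalog;
SHOW DATABASES;

-- Query a Paimon table directly with a fully qualified name
SELECT
    l_returnflag,
    l_linestatus,
    SUM(l_quantity) AS sum_qty
FROM paimon_fs_catalog.paimon_tpch_flat_orc_100.lineitem
WHERE l_shipdate <= DATE '1998-09-02'
GROUP BY l_returnflag, l_linestatus;
```

No table creation or data copy is needed on the StarRocks side; the external catalog resolves schemas from Paimon metadata at query time.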
Data Cache acceleration (StarRocks 2.5+)
Configure the BE to enable the data cache and allocate disk/memory space:
```
# Enable data cache
datacache_enable = true
# Disk cache size (e.g., 20 GB)
datacache_disk_size = 21474836480
# Memory cache size (e.g., 4 GB)
datacache_mem_size = 4294967296
# Disk paths for cache storage
datacache_disk_path = /mnt/disk1/starrocks/storage/datacache;/mnt/disk2/starrocks/storage/datacache;/mnt/disk3/starrocks/storage/datacache;/mnt/disk4/starrocks/storage/datacache
```

Enable the cache for a session:

```sql
SET enable_scan_datacache = true;
```

After enabling, the same TPC‑H query runs three times with total latencies of 134.59 s, 110.20 s, and 113.12 s. Cache‑hit metrics (DataCacheReadBytes ≈ 10.1 GB) indicate that most data is served from local storage, yielding a ~35 % performance improvement.
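To see the cache at work, one simple approach (a sketch; exact profile counter names may vary across StarRocks versions) is to run the same scan twice and compare the query profile:

```sql
-- First full scan: reads from OSS and populates the local cache
SELECT COUNT(*) FROM paimon_fs_catalog.paimon_tpch_flat_orc_100.lineitem;

-- Second scan: with enable_scan_datacache on, the profile's
-- DataCacheReadBytes counter should account for most bytes read
SELECT COUNT(*) FROM paimon_fs_catalog.paimon_tpch_flat_orc_100.lineitem;

-- For an A/B comparison, disable cache reads in this session only
SET enable_scan_datacache = false;
```

Because the session variable is scoped to the connection, cached and uncached latencies can be compared without restarting the BE.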
Asynchronous materialized view
For high‑frequency complex queries, create an asynchronous materialized view that pre‑computes results on the Paimon table:
```sql
CREATE MATERIALIZED VIEW lineitem
DISTRIBUTED BY HASH(l_shipdate)
REFRESH IMMEDIATE MANUAL
AS
SELECT
    l_returnflag,
    l_linestatus,
    l_shipdate,
    SUM(l_quantity) AS sum_qty,
    SUM(l_extendedprice) AS sum_base_price,
    SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    AVG(l_quantity) AS avg_qty,
    AVG(l_extendedprice) AS avg_price,
    AVG(l_discount) AS avg_disc,
    COUNT(*) AS count_order
FROM paimon_fs_catalog.paimon_tpch_flat_orc_100.lineitem
GROUP BY l_returnflag, l_linestatus, l_shipdate;
```

After the view is built, TPC‑H Q1 finishes in ~0.04 s, orders of magnitude faster than the 148.92 s uncached direct query.
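Since the view is declared with manual refresh, it must be refreshed explicitly; afterwards, matching queries against the base Paimon table can be served by the view via transparent rewrite. A sketch:

```sql
-- Trigger a refresh of the manual-refresh view
REFRESH MATERIALIZED VIEW lineitem;

-- This query keeps targeting the base Paimon table; when the optimizer's
-- transparent rewrite applies, it is answered from the materialized view
SELECT
    l_returnflag,
    l_linestatus,
    SUM(l_quantity) AS sum_qty
FROM paimon_fs_catalog.paimon_tpch_flat_orc_100.lineitem
GROUP BY l_returnflag, l_linestatus;

-- EXPLAIN should show the view in the plan when the rewrite kicks in
EXPLAIN
SELECT l_returnflag, COUNT(*)
FROM paimon_fs_catalog.paimon_tpch_flat_orc_100.lineitem
GROUP BY l_returnflag;
```

No application changes are required: queries that aggregate a subset of the view's GROUP BY keys can be rolled up from the pre‑computed results.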
Current capabilities
Support for HDFS, OSS, S3 and OSS‑HDFS storage
Integration with Hive Metastore (HMS) and Alibaba DLF metadata services
Native handling of Paimon primary‑key and append‑only tables
Query of Paimon system tables (e.g., read‑optimized, snapshots)
Cross‑format joins between Paimon and other lake formats
Joins between Paimon tables and StarRocks internal tables
Data Cache for accelerated reads
Asynchronous materialized view with transparent query rewrite
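The cross‑catalog join capabilities above can be illustrated with a hypothetical query (the `orders` table exists in the benchmark dataset; the internal dimension table and its columns are invented for illustration):

```sql
-- Join a Paimon fact table with a StarRocks internal dimension table.
-- default_catalog is the built-in internal catalog; dim_db.region_dim
-- and its columns are illustrative names, not part of the benchmark.
SELECT
    o.o_orderkey,
    d.region_name
FROM paimon_fs_catalog.paimon_tpch_flat_orc_100.orders AS o
JOIN default_catalog.dim_db.region_dim AS d
    ON o.o_custkey = d.cust_key;
```

The same pattern applies to joins across lake formats, e.g. a Paimon table against a Hudi or Iceberg table exposed through another external catalog.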
Future roadmap
Native reader support for primary‑key tables with deletion vectors
Metadata caching to reduce I/O during analysis
Incorporate Paimon table statistics into optimizer plans
Enhance async materialized view creation and rewrite capabilities
Test environment
Benchmarks were executed on an Alibaba Cloud EMR on ECS cluster (1 master + 3 core nodes, ecs.g6.4xlarge, 16 vCPU, 64 GiB). Software versions: StarRocks 3.2.4, Trino 427, Paimon 0.7. The TPC‑H dataset consists of 100 GB of Append‑Only ORC files stored in OSS. JVM heap size for both Trino and StarRocks was set to 50 GB. Data Cache disk and memory limits were configured as shown above.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.