
How StarRocks Achieves Lightning‑Fast Data Lake Analytics

This article explains StarRocks' streamlined architecture, cost‑based optimizer, massively parallel processing and vectorized engine, and how they enable high‑performance queries over data stored in Hive, Iceberg, Hudi and other lake formats, backed by benchmark results and future roadmap details.

StarRocks

Apr 13, 2022

Overview

StarRocks is a high‑performance analytical database that combines a minimal architecture, a vectorized execution engine, and a cost‑based optimizer (CBO) to achieve fast query processing, especially for multi‑table joins.

System Architecture

FE (Frontend): Parses SQL, performs semantic analysis, generates logical and physical plans, and creates execution fragments.

BE (Backend): Executes fragments in parallel, reads data from storage, applies filters, aggregates, and returns results.

Data‑lake components include Table Format (metadata), File Format (e.g., Parquet, ORC, Avro), and Storage (HDFS, OSS, S3).

Technical Details

CBO Optimizer

The optimizer is built on the Cascades and ORCA frameworks and supports TPC‑DS queries, expression and CTE reuse, join reordering, runtime filter push‑down, and low‑cardinality dictionary optimization. Accurate cost estimation depends on up‑to‑date table‑ and column‑level statistics, which can be collected automatically or manually in full or sample mode.
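To make the role of statistics in join reordering concrete, here is a minimal sketch of cost-based join ordering. It is not StarRocks' optimizer: a Cascades-style optimizer explores plans through a memo and transformation rules, while this toy simply enumerates left-deep orders. The row counts, selectivities, and function names are hypothetical values chosen for illustration.

```python
from itertools import permutations

# Hypothetical row counts, standing in for collected table statistics.
ROWS = {"lineitem": 6_000_000, "orders": 1_500_000, "customer": 150_000}

# Hypothetical join selectivities: the fraction of the cross product
# that survives the join predicate between two tables.
SELECTIVITY = {
    frozenset({"lineitem", "orders"}): 1 / 1_500_000,
    frozenset({"orders", "customer"}): 1 / 150_000,
}


def pair_selectivity(joined, new_table):
    """Combined selectivity of every predicate linking new_table to joined tables."""
    sel = 1.0
    for table in joined:
        # Pairs without a join predicate default to 1.0 (a cross product).
        sel *= SELECTIVITY.get(frozenset({table, new_table}), 1.0)
    return sel


def plan_cost(order):
    """Cost of a left-deep plan: total rows produced by its intermediate joins."""
    joined = [order[0]]
    rows = ROWS[order[0]]
    cost = 0.0
    for table in order[1:]:
        rows = rows * ROWS[table] * pair_selectivity(joined, table)
        cost += rows
        joined.append(table)
    return cost


def best_join_order(tables):
    """Exhaustively pick the cheapest left-deep order (fine for a few tables only)."""
    return min(permutations(tables), key=plan_cost)


if __name__ == "__main__":
    best = best_join_order(["lineitem", "orders", "customer"])
    print(best, plan_cost(best))
```

With these made-up numbers the cheapest plans join the two smaller tables first and bring in the largest table last. If the row counts were stale, the same procedure could pick a much worse order, which is why up-to-date statistics matter.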

MPP Execution

StarRocks splits a query into multiple Query Fragments. Each fragment contains operators such as Scan, Filter, and Aggregate and is scheduled to BE nodes. Fragments run in parallel pipelines, avoiding stage‑by‑stage execution. Shuffle operations enable horizontal scaling for large joins and high‑cardinality aggregations.
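The shuffle idea can be sketched in a few lines. The example below is an illustration under simplifying assumptions, not StarRocks code: threads stand in for BE nodes, and the function names are hypothetical. Rows are hash-partitioned by the grouping key so that each worker can aggregate its own partition independently.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def shuffle(rows, num_workers):
    """Hash-partition (key, value) rows so each key lands on exactly one worker."""
    partitions = [[] for _ in range(num_workers)]
    for key, value in rows:
        partitions[hash(key) % num_workers].append((key, value))
    return partitions


def aggregate_fragment(partition):
    """Per-worker fragment: SUM(value) GROUP BY key over its own partition."""
    sums = defaultdict(int)
    for key, value in partition:
        sums[key] += value
    return dict(sums)


def distributed_sum(rows, num_workers=3):
    """Shuffle, aggregate each partition in parallel, then merge the results."""
    partitions = shuffle(rows, num_workers)
    # Threads are a stand-in for BE nodes in this sketch.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(aggregate_fragment, partitions))
    # After the shuffle, keys are disjoint across workers, so merging is a union.
    result = {}
    for partial in partials:
        result.update(partial)
    return result


if __name__ == "__main__":
    rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
    print(distributed_sum(rows))
```

Because every key is owned by a single worker, adding workers spreads a high-cardinality aggregation across more machines without any worker needing to see the full key set.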

Vectorized Execution Engine

All operators are reimplemented to process column‑wise batches. Vectorized execution reduces virtual‑function calls and branch mispredictions, improves CPU cache utilization, and enables SIMD optimizations, delivering a 5‑10× performance boost over row‑wise execution.
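The difference in control flow between the two execution models can be sketched as follows. This is only a structural illustration: pure Python cannot demonstrate SIMD or cache effects, and the function names and the batch size of 4096 are assumptions made for the example. What it does show is that the batch version invokes its operator logic once per batch rather than once per row.

```python
def row_at_a_time(prices, quantities, min_qty):
    """Row-wise: the filter and the expression are evaluated once per row."""
    total = 0
    calls = 0
    for price, qty in zip(prices, quantities):
        calls += 1  # one operator invocation per row
        if qty >= min_qty:
            total += price * qty
    return total, calls


def vectorized(prices, quantities, min_qty, batch_size=4096):
    """Column-wise: each operator handles a whole batch per invocation."""
    total = 0
    calls = 0
    for start in range(0, len(prices), batch_size):
        p = prices[start:start + batch_size]
        q = quantities[start:start + batch_size]
        calls += 1  # one operator invocation per batch
        # Filter step: build a selection vector of matching row positions.
        selection = [i for i, qty in enumerate(q) if qty >= min_qty]
        # Expression step: evaluate price * qty over the selected rows only.
        total += sum(p[i] * q[i] for i in selection)
    return total, calls


if __name__ == "__main__":
    prices = list(range(10_000))
    quantities = [i % 10 for i in range(10_000)]
    print(row_at_a_time(prices, quantities, 5))
    print(vectorized(prices, quantities, 5))
```

Both functions return the same total, but for 10,000 rows the row-wise version performs 10,000 invocations while the batch version performs 3. In a compiled engine, that reduction in per-row dispatch is what leaves room for tight loops that the CPU can pipeline and vectorize.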

Data Lake Query Support

Iceberg v1 table queries – https://github.com/StarRocks/starrocks/issues/1030

Hive external table queries – https://docs.starrocks.com/zh-cn/main/using_starrocks/External_table

Hudi COW table queries – https://github.com/StarRocks/starrocks/issues/2772

Optimization techniques applied to lake queries include:

Statistics collection (full or sample) for accurate cost estimates.

Partition pruning to scan only relevant partitions.

Join reordering based on data size and distribution.

Predicate push‑down (e.g., Min/Max filters on Parquet) to filter data at the storage layer.
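Two of the techniques above, partition pruning and Min/Max filtering, can be sketched together. The `RowGroup` class and both functions are hypothetical stand-ins: a real reader would take the minimum and maximum from Parquet row-group metadata and the partition keys from the table's metadata, rather than from in-memory objects.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group and its column statistics."""
    min_value: int
    max_value: int
    values: list


def prune_partitions(partitions, lo, hi):
    """Keep only partitions whose key falls inside [lo, hi]."""
    return {key: groups for key, groups in partitions.items() if lo <= key <= hi}


def scan_with_min_max(row_groups, lo, hi):
    """Skip row groups whose [min, max] range cannot overlap [lo, hi]."""
    matched = []
    groups_read = 0
    for group in row_groups:
        if group.max_value < lo or group.min_value > hi:
            continue  # statistics prove no row can match, so skip the read
        groups_read += 1
        matched.extend(v for v in group.values if lo <= v <= hi)
    return matched, groups_read


if __name__ == "__main__":
    groups = [
        RowGroup(1, 10, [1, 5, 10]),
        RowGroup(11, 20, [11, 15, 20]),
        RowGroup(21, 30, [21, 25, 30]),
    ]
    print(scan_with_min_max(groups, 12, 18))  # prints ([15], 1)
```

In this run only one of the three row groups is read. Pruning works the same way one level up: partitions whose key is outside the predicate range are never opened at all.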

MySQL [hive_test]> explain select l_quantity from lineitem;

create external table store_sales(
  ss_sold_time_sk bigint,
  ss_item_sk bigint,
  ...
) ENGINE=HIVE PROPERTIES (
  "resource" = "hive_tpcds",
  "database" = "tpcds",
  "table" = "store_sales"
);

-- Filter on a range of date keys, then sort by sale time.
select ss_sold_time_sk from store_sales
where ss_sold_date_sk between 2451911 and 2451941
order by ss_sold_time_sk;

Benchmark Results

Using the TPC‑H 100 GB dataset (22 queries), StarRocks was compared with Trino. Three execution modes were measured:

StarRocks local storage: total time 21 s

StarRocks on Hive (external table): total time 92 s

Trino on Hive: total time 307 s

The additional latency for Hive queries is mainly due to network I/O and remote storage access. Future work includes caching mechanisms to narrow the gap.
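The relative gaps implied by these totals are easy to compute. The snippet below uses only the three totals reported above. The dictionary keys and the `ratio` helper are illustrative names.

```python
# Total times in seconds for the 22 TPC-H queries, as reported above.
totals_s = {
    "StarRocks local": 21,
    "StarRocks on Hive": 92,
    "Trino on Hive": 307,
}


def ratio(slower, faster):
    """How many times longer the slower configuration took."""
    return totals_s[slower] / totals_s[faster]


if __name__ == "__main__":
    print(round(ratio("Trino on Hive", "StarRocks on Hive"), 2))    # 3.34
    print(round(ratio("StarRocks on Hive", "StarRocks local"), 2))  # 4.38
    print(round(ratio("Trino on Hive", "StarRocks local"), 2))      # 14.62
```

On these totals, StarRocks on Hive is roughly 3.3 times faster than Trino on Hive, and roughly 4.4 times slower than StarRocks on local storage.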

Future Roadmap

Integrate a push‑based pipeline execution engine to further reduce query latency.

Automate hot‑cold data tiering, moving infrequently accessed data from local tables to the lake.

Eliminate explicit external‑table creation by syncing lake resources automatically.

Support advanced lake features: Hudi MOR tables, Iceberg v2 tables, direct writes to the lake, time‑travel queries, and richer catalog integration.

Introduce hierarchical caching to boost lake‑query performance.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
