How StarRocks Powers Ultra‑Fast Data Lake Analytics: Architecture and Core Techniques

This article explains the fundamentals of data lake analytics, compares implementation strategies (rule‑based vs cost‑based optimization, record‑oriented vs block‑oriented processing, and pull‑based vs push‑based execution), describes StarRocks' lightweight frontend/backend architecture, and presents TPC‑H benchmark results comparing StarRocks with Trino.

StarRocks

Mar 4, 2022

What Is a Data Lake

A data lake is a repository that stores raw data in its natural format—typically object blobs or files—on cheap object storage or distributed file systems, presenting a unified semantic view such as tables to downstream applications.

Why Use a Data Lake for Analytics

Data lakes enable low‑cost, real‑time ingestion of both relational and non‑relational data from diverse sources (e.g., operational databases, IoT devices, social media). They also provide security mechanisms like metadata tagging, classification, encryption, and access control, ensuring data protection and compliance.

Fast Analysis on Data Lakes

To meet the demand for rapid, flexible analytics, a data‑lake‑specific analysis engine must combine high‑throughput data ingestion with powerful query processing. The engine typically consists of four modules:

Parser – converts SQL text into an abstract syntax tree (AST).

Analyzer – validates syntax and semantics.

Optimizer – generates a low‑cost physical plan.

Execution Engine – executes the plan and returns results.

The optimizer and execution engine are the performance‑critical components. Three key technical dimensions are examined:

Rule‑Based Optimization (RBO) vs Cost‑Based Optimization (CBO)

RBO applies predefined algebraic rewrite rules (e.g., predicate push‑down, limit push‑down, constant folding) to produce a deterministic plan, but it cannot adapt to data size or distribution. CBO collects statistics (row counts, column cardinalities, etc.), estimates the cost of candidate plans, and chooses the cheapest one, for example by selecting a join order. The search over candidate plans often uses dynamic programming or heuristics.
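To make the contrast concrete, the following Python sketch applies one rewrite rule (constant folding) without looking at any data, and then picks a join order from row‑count statistics. The expression encoding, table names, row counts, uniform join selectivity, and cost formula are all illustrative assumptions, not StarRocks internals.

```python
from itertools import permutations

# --- Rule-based: constant folding on a tiny expression tree ---------------
# An expression is an int literal, a column name (str), or ("+", left, right).
def fold_constants(expr):
    if isinstance(expr, tuple):
        op, left, right = expr
        left, right = fold_constants(left), fold_constants(right)
        if op == "+" and isinstance(left, int) and isinstance(right, int):
            return left + right  # both operands known: evaluate at plan time
        return (op, left, right)
    return expr

# --- Cost-based: choose a left-deep join order from statistics ------------
# Hypothetical statistics: row counts per table, one selectivity for all joins.
ROW_COUNTS = {"orders": 1_000_000, "customer": 100_000, "nation": 25}
JOIN_SELECTIVITY = 0.001

def plan_cost(order):
    """Sum of estimated intermediate result sizes for a left-deep plan."""
    rows = ROW_COUNTS[order[0]]
    cost = 0.0
    for table in order[1:]:
        rows = rows * ROW_COUNTS[table] * JOIN_SELECTIVITY
        cost += rows
    return cost

def cheapest_join_order(tables):
    # Exhaustive search is fine for three tables; real optimizers prune.
    return min(permutations(tables), key=plan_cost)
```

The rule fires the same way on every query, whereas the chosen join order changes as soon as the assumed row counts change, which is the adaptability that RBO lacks.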

Record‑Oriented vs Block‑Oriented Processing

Traditional row‑oriented engines process one tuple at a time, leading to poor CPU cache utilization and many branch mispredictions. Block‑oriented (or vectorized) processing groups rows into batches, reducing per‑tuple overhead. Column‑oriented storage further improves locality and enables SIMD optimizations.

CREATE TABLE t (n int, m int, o int, p int);
SELECT o FROM t WHERE m < n + 1;

Row‑oriented pseudo‑code:

next:
  for each row in source:
    if filterExpr.Eval(row):
      returnedRow = []
      for col in selectedCols:
        returnedRow.append(row[col])
      return returnedRow

Column‑oriented pseudo‑code (batch processing):

// Evaluate n + 1 for the whole batch into a temporary column
for i in 0 .. batch.n - 1:
  tmpCol[i] = nCol[i] + 1
// Build the selection vector: indices of rows where m < n + 1
for i in 0 .. batch.n - 1:
  if mCol[i] < tmpCol[i]:
    selectionVector.append(i)
// Materialize the selected columns for the surviving rows
for i in selectionVector:
  returnedRow = []
  for col in selectedCols:
    returnedRow.append(cols[col][i])
  yield returnedRow
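For readers who want to run the comparison, here is a minimal Python sketch of both strategies for the query `SELECT o FROM t WHERE m < n + 1`. The sample rows and the function names are made up for illustration.

```python
# Sample rows for table t(n, m, o, p); the values are made up for illustration.
ROWS = [(1, 1, 10, 0), (5, 9, 20, 0), (3, 2, 30, 0), (0, 4, 40, 0)]

def select_row_at_a_time(rows):
    """Record-oriented: evaluate the predicate and project one tuple at a time."""
    for n, m, o, p in rows:
        if m < n + 1:
            yield o

def to_columns(rows):
    """Transpose row tuples into one list per column."""
    n, m, o, p = (list(col) for col in zip(*rows))
    return {"n": n, "m": m, "o": o, "p": p}

def select_batch(columns):
    """Block-oriented: run each step over a whole batch of column values."""
    n_col, m_col, o_col = columns["n"], columns["m"], columns["o"]
    # Step 1: compute n + 1 for every row in the batch.
    tmp_col = [n + 1 for n in n_col]
    # Step 2: selection vector holding indices of rows where m < n + 1.
    selection = [i for i in range(len(m_col)) if m_col[i] < tmp_col[i]]
    # Step 3: materialize only the projected column for the surviving rows.
    return [o_col[i] for i in selection]
```

Both functions return the same rows. In Python the batch version is not inherently faster; the point is the shape of the loops. In a compiled engine, each step becomes a tight loop over a contiguous array, which is what makes SIMD and good cache behavior possible.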

Pull‑Based vs Push‑Based Execution

Pull‑based execution (the Volcano model) has each downstream operator request rows from its upstream operator, while push‑based execution has each operator hand data to its downstream consumer as soon as it is produced. Push‑based pipelines improve cache efficiency and can yield higher throughput.
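The difference in control flow can be sketched in a few lines of Python. The class and function names below are hypothetical and only illustrate who drives the loop; they are not taken from any engine.

```python
# Pull-based (Volcano style): each operator exposes next(); the consumer drives.
class Scan:
    def __init__(self, rows):
        self._it = iter(rows)

    def next(self):
        return next(self._it, None)  # None signals end of stream

class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def next(self):
        # Keep pulling from the child until a row passes or input ends.
        while (row := self.child.next()) is not None:
            if self.predicate(row):
                return row
        return None

def run_pull(root):
    out = []
    while (row := root.next()) is not None:
        out.append(row)
    return out

# Push-based: the producer drives and hands each row to its consumer.
class PushFilter:
    def __init__(self, predicate, consumer):
        self.predicate, self.consumer = predicate, consumer

    def push(self, row):
        if self.predicate(row):
            self.consumer.push(row)

class Collect:
    def __init__(self):
        self.rows = []

    def push(self, row):
        self.rows.append(row)

def run_push(rows, first_operator):
    for row in rows:
        first_operator.push(row)
```

In the pull version, the loop in `run_pull` sits at the top of the plan and every row costs one `next()` call per operator. In the push version, the loop sits at the source, so a row travels through the whole pipeline before the next one is read.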

Modern Data Lake Analytics Engine Architecture (StarRocks)

StarRocks adopts a minimalist architecture with only two process types: Frontend (FE) and Backend (BE). No external components are required, simplifying deployment.

Frontend (FE)

FE parses SQL, performs analysis, generates logical plans, applies cost‑based optimization, and produces executable fragments that are dispatched to BE nodes.

SQL Parse – AST generation.

Analyze – Syntax & semantic checks.

Logical Plan – Convert AST to relational operators.

Optimize – Apply statistics‑driven cost model.

Fragment Generation – Translate physical plan to BE‑runnable fragments.

Coordinate – Schedule fragments across BE nodes.

Backend (BE)

BE nodes execute fragments by reading data from the lake (e.g., Parquet or ORC readers), applying vectorized filters and aggregations, and returning results to FE. All BE nodes are peers; FE distributes work based on data locality.
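As a rough illustration of locality‑aware distribution, the sketch below assigns scan ranges to backends, preferring a node that holds a local copy and otherwise choosing the least‑loaded node. The function, its inputs, and the tie‑breaking policy are assumptions made for this example; the article does not describe StarRocks' actual scheduling algorithm.

```python
from collections import defaultdict

def assign_scan_ranges(scan_ranges, backends):
    """Assign each scan range to one backend (hypothetical policy).

    scan_ranges: list of (range_id, hosts_with_local_copy)
    backends:    list of backend host names
    A backend holding a local copy is preferred; ties and remote reads
    go to the backend with the fewest ranges assigned so far.
    """
    load = {be: 0 for be in backends}
    assignment = defaultdict(list)
    for range_id, local_hosts in scan_ranges:
        # Fall back to all backends when no node has the data locally.
        candidates = [be for be in backends if be in local_hosts] or backends
        chosen = min(candidates, key=lambda be: load[be])
        load[chosen] += 1
        assignment[chosen].append(range_id)
    return dict(assignment)
```

Because all BE nodes are peers, any of them can take any range; locality only changes which choice is cheapest.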

Appendix: Benchmark

The benchmark uses the TPC‑H 100 GB dataset (22 queries) to compare three configurations: StarRocks local tables, StarRocks on Hive, and Trino (formerly PrestoSQL) on Hive. All Hive tables are stored as ORC with zlib compression, and the tests run on Alibaba Cloud EMR.

Results: StarRocks on local storage completes all queries in 21 s, StarRocks on Hive in 92 s, and Trino on Hive in 307 s. StarRocks on Hive is therefore roughly 3.3 times faster than Trino on Hive, though it is still about 4.4 times slower than native storage because of network and I/O overhead. Future work includes caching and other optimizations to narrow this gap.
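The ratios implied by these totals can be checked with a few lines of Python. The dictionary keys are labels chosen for this example; the numbers are the totals reported above.

```python
# Total runtimes in seconds, as reported in the benchmark results.
TOTAL_SECONDS = {
    "StarRocks local": 21,
    "StarRocks on Hive": 92,
    "Trino on Hive": 307,
}

def speedup(slower, faster):
    """How many times faster `faster` is than `slower`, by total runtime."""
    return TOTAL_SECONDS[slower] / TOTAL_SECONDS[faster]

if __name__ == "__main__":
    hive_vs_trino = speedup("Trino on Hive", "StarRocks on Hive")
    local_vs_hive = speedup("StarRocks on Hive", "StarRocks local")
    print(f"StarRocks on Hive vs Trino on Hive: {hive_vs_trino:.1f}x")
    print(f"StarRocks local vs StarRocks on Hive: {local_vs_hive:.1f}x")
```

These are ratios of total runtime across all queries, so they say nothing about how individual queries compare.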

Conclusion

The article details the core technical principles of a high‑performance data lake analytics engine, contrasts alternative implementation strategies, and demonstrates how StarRocks integrates these ideas into a compact, efficient architecture.
