How Modern Data Lake Engines Accelerate Analytics: Inside StarRocks Architecture
This article explains why data lakes are essential for today’s analytics, outlines the three main user demands, defines data lakes, compares rule‑based and cost‑based optimizers, explores record‑oriented versus block‑oriented processing, and details StarRocks’ frontend‑backend architecture and benchmark results.
01 Introduction
With digital industrialization and industry digitization becoming key economic drivers, enterprises face richer data‑analysis scenarios and higher architectural requirements. Users now demand three things: lower‑cost, real‑time ingestion and storage of any amount of relational (e.g., operational databases) and non‑relational data (e.g., mobile apps, IoT, social media); strong protection of their data assets; and faster, more flexible, real‑time analytics.
Data lakes satisfy the first two demands by allowing unlimited, real‑time data ingestion, storing raw data in cheap, highly scalable object storage, and providing security mechanisms such as sensitive‑data detection, classification, privacy protection, access control, encryption, risk identification, and compliance auditing.
To meet the growing need for lake‑based analytics, a specialized analysis engine is required—one that can process more data from more sources in less time and enable collaborative data processing for better, faster decisions. This article reveals the key technologies of such an engine, using StarRocks as a concrete example.
Two follow‑up articles will dive deeper into the engine’s core and real‑world use cases: a code‑walkthrough of StarRocks’ kernel and a case‑study of large enterprises using StarRocks on data lakes.
02 What Is a Data Lake?
According to Wikipedia, a data lake is a system or repository that stores data in its natural/raw format, usually as object blobs or files. In practice, a data lake wraps cheap object storage or distributed file systems with a unified semantic layer (e.g., table‑like semantics).
Before data lakes, many organizations stored raw operational data in HDFS or S3, hoping to extract value later. However, these systems expose only file or object semantics, making it hard to understand stored data without parsing. Engineers introduced “metadata” to describe data, enabling later interpretation.
Modern data lakes have evolved to provide database‑like ACID semantics, point‑in‑time views, and high‑performance ingestion, turning them into low‑cost analytical‑processing ("AP") databases. Yet a full database also needs analytical capabilities, which is the focus of the next sections.
03 How to Perform Lightning‑Fast Analytics on a Data Lake?
An analytics engine for a data lake shares the same architectural blocks as a traditional database engine: Parser, Analyzer, Optimizer, and Execution Engine.
Parser: converts the user query into an abstract syntax tree (AST).
Analyzer: validates syntax and semantics.
Optimizer: generates a low‑cost physical plan.
Execution Engine: runs the physical plan and returns results.
The Optimizer and Execution Engine are the performance‑critical modules. We examine them from three angles: rule‑based vs. cost‑based optimization, record‑oriented vs. block‑oriented processing, and pull‑based vs. push‑based execution.
RBO vs CBO
Rule‑Based Optimization (RBO) applies predefined relational‑algebra rewrite rules (e.g., predicate push‑down, limit push‑down, constant folding). RBO is deterministic, but because it ignores statistics it produces the same plan whether a table holds ten rows or ten billion.
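One of the rules named above, constant folding, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not StarRocks code: the `Const`, `Col`, and `BinOp` node types and the `fold_constants` function are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str  # one of "+", "*", "<"
    left: "Expr"
    right: "Expr"


Expr = Union[Const, Col, BinOp]

# How each operator is evaluated when both inputs are known constants.
_OPS = {
    "+": lambda a, b: a + b,
    "*": lambda a, b: a * b,
    "<": lambda a, b: int(a < b),
}


def fold_constants(expr: Expr) -> Expr:
    """Rewrite rule: replace an operator over two constants with its value."""
    if isinstance(expr, BinOp):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Const) and isinstance(right, Const):
            return Const(_OPS[expr.op](left.value, right.value))
        return BinOp(expr.op, left, right)
    return expr  # columns and constants are already in simplest form


# WHERE m < 2 * 3 + 1   is rewritten to   WHERE m < 7
predicate = BinOp("<", Col("m"), BinOp("+", BinOp("*", Const(2), Const(3)), Const(1)))
folded = fold_constants(predicate)
print(folded)
```

The rule fires purely on the shape of the expression tree, which is why the result is the same regardless of how much data sits behind column `m`.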
Cost‑Based Optimization (CBO) gathers statistics (row counts, column cardinalities, etc.) to estimate the cost of candidate plans and pick the cheapest. For example, in a three‑way join of A, B, and C, knowing that table C is tiny compared to A and B lets the optimizer join B with C first, reducing intermediate data size.
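The join‑order decision can be made concrete with a small cost estimate. The sketch below assumes hypothetical row counts and join‑key statistics (none of these numbers come from a real system) and uses the textbook estimate that a join produces roughly |R| × |S| / NDV rows, where NDV is the number of distinct join‑key values.

```python
# Hypothetical table statistics a CBO might have collected.
ROWS = {"A": 1_000_000, "B": 1_000_000, "C": 100}

# Assumed number of distinct join-key values for each join predicate.
JOIN_KEY_NDV = {
    frozenset(("A", "B")): 1_000_000,
    frozenset(("B", "C")): 10_000,
}


def first_join_rows(left: str, right: str) -> int:
    """Estimate the intermediate result size of joining two base tables."""
    ndv = JOIN_KEY_NDV[frozenset((left, right))]
    return ROWS[left] * ROWS[right] // ndv


# Two candidate plans for A JOIN B JOIN C, keyed by which pair is joined first.
candidates = {
    ("A", "B"): first_join_rows("A", "B"),
    ("B", "C"): first_join_rows("B", "C"),
}
best = min(candidates, key=candidates.get)

for pair, rows in candidates.items():
    print(f"join {pair[0]} with {pair[1]} first: ~{rows:,} intermediate rows")
print("chosen first join:", best)
```

Under these assumed statistics, joining B with C first feeds about 10,000 rows into the second join instead of about 1,000,000, which is the kind of difference a rule‑only optimizer cannot see.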
When the search space becomes huge, CBO uses dynamic programming, greedy, or heuristic algorithms to balance search time and plan quality.
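The trade‑off between search time and plan quality can be illustrated by comparing two search strategies over left‑deep join orders. This is a sketch under stated assumptions: the statistics are hypothetical and chosen to expose the difference, the cost model (sum of intermediate result sizes) is deliberately simple, and the greedy heuristic shown is only one naive variant among many.

```python
from itertools import combinations
from math import prod

# Hypothetical statistics: row counts and, per join predicate, the divisor
# applied to the cross product (i.e. 1 / selectivity).
ROWS = {"A": 1_000, "B": 1_000, "C": 1_000, "D": 10}
JOIN_DENOMINATOR = {
    frozenset("AB"): 1_000_000,
    frozenset("BC"): 1_000,
    frozenset("CD"): 10,
}


def cardinality(tables: frozenset) -> int:
    """Estimated rows after joining `tables` (cross product if no predicate)."""
    rows = prod(ROWS[t] for t in tables)
    denom = prod(d for pair, d in JOIN_DENOMINATOR.items() if pair <= tables)
    return rows // denom


def dp_best_order(tables):
    """Dynamic programming over subsets: optimal left-deep order, O(2^n) states."""
    best = {frozenset([t]): (0, (t,)) for t in tables}
    for size in range(2, len(tables) + 1):
        for subset in map(frozenset, combinations(tables, size)):
            options = []
            for last in subset:  # `last` is the table joined last
                cost, order = best[subset - {last}]
                options.append((cost + cardinality(subset), order + (last,)))
            best[subset] = min(options)
    return best[frozenset(tables)]


def greedy_order(tables):
    """Greedy: start from the smallest table, always add the cheapest next join."""
    remaining = set(tables)
    first = min(sorted(remaining), key=ROWS.get)
    remaining.remove(first)
    joined, order, cost = frozenset([first]), (first,), 0
    while remaining:
        nxt = min(sorted(remaining), key=lambda t: cardinality(joined | {t}))
        remaining.remove(nxt)
        joined |= {nxt}
        order += (nxt,)
        cost += cardinality(joined)
    return cost, order


tables = ["A", "B", "C", "D"]
dp_cost, dp_order = dp_best_order(tables)
greedy_cost, greedy_plan = greedy_order(tables)
print("dynamic programming:", dp_order, "cost", dp_cost)
print("greedy:             ", greedy_plan, "cost", greedy_cost)
```

With these numbers the greedy search is lured by the small table D and ends up with a plan costing 2001 estimated rows, while the exhaustive dynamic program finds the highly selective A and B join and reaches a cost of 3. The price is that the dynamic program visits every subset of tables, which grows exponentially with the number of joins.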
Record‑Oriented vs. Block‑Oriented
Traditional engines are row‑oriented: operators pass one row at a time, causing many CPU cache stalls. Block‑oriented processing groups rows into batches, improving cache locality. Column‑oriented execution further enhances locality and enables SIMD optimizations.
Consider the following table and query:

    CREATE TABLE t (n int, m int, o int, p int);
    SELECT o FROM t WHERE m < n + 1;

Row‑oriented pseudo‑code:

    next:
        for:
            row = source.next()
            if filterExpr.Eval(row):
                // copy only the selected columns into the output row
                returnedRow = []
                for col in selectedCols:
                    returnedRow.append(row[col])
                return returnedRow

Column‑oriented pseudo‑code (batch processing):

    // create n+1 result column
    projPlusIntIntConst.Next():
        batch = source.Next()
        for i < batch.n:
            outCol[i] = intCol[i] + constArg
        return batch

    // filter using selection vector
    selectLTIntInt.Next():
        batch = source.Next()
        for i < batch.n:
            if int1Col[i] < int2Col[i]:
                selectionVector.append(i)
        return batch with selectionVector

    // materialize final rows
    materialize.Next():
        batch = source.Next()
        for s < batch.n:  // batch.n is assumed to count the selected rows here
            i = selectionVector[s]
            returnedRow = []
            for col in selectedCols:
                returnedRow.append(cols[col][i])
            yield returnedRow

Pull‑Based vs. Push‑Based
Push‑Based (data‑driven) pipelines push data downstream, improving cache efficiency. Pull‑Based (demand‑driven) pipelines, exemplified by the Volcano model, pull data from upstream operators on demand.
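The two control‑flow styles can be contrasted on the same filter‑and‑project query. The following is a minimal sketch, assuming row dictionaries and hypothetical operator names; real engines operate on batches and add scheduling on top. In the pull version the consumer drives execution by asking the top operator for rows; in the push version the source drives execution by handing each row to the next operator.

```python
# --- Pull-based (Volcano style): each operator pulls from its child ---------

def scan(rows):
    for row in rows:
        yield row


def filter_pull(child, predicate):
    for row in child:  # asks the child for the next row on demand
        if predicate(row):
            yield row


def project_pull(child, column):
    for row in child:
        yield row[column]


# --- Push-based: each operator hands its output to the next operator --------

def project_push(column, sink):
    def consume(row):
        sink(row[column])
    return consume


def filter_push(predicate, downstream):
    def consume(row):
        if predicate(row):
            downstream(row)
    return consume


def run_push(rows, first_operator):
    for row in rows:  # the source drives the pipeline
        first_operator(row)


# SELECT o FROM t WHERE m < n + 1, over a tiny in-memory table
table = [
    {"n": 1, "m": 1, "o": 10},
    {"n": 1, "m": 5, "o": 20},
    {"n": 4, "m": 2, "o": 30},
]
predicate = lambda r: r["m"] < r["n"] + 1

pulled = list(project_pull(filter_pull(scan(table), predicate), "o"))

pushed = []
run_push(table, filter_push(predicate, project_push("o", pushed.append)))

print(pulled, pushed)
```

Both pipelines return the same rows; what differs is which side holds the loop, and that in turn determines how easily the engine can keep data hot in cache or split a pipeline across threads.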
04 Modern Data Lake Analytics Engine Architecture
Using StarRocks as an example, the engine consists of a Frontend (FE) and Backend (BE) without external dependencies.
Frontend
FE parses SQL, performs analysis, creates a logical plan, applies cost‑based optimization, generates executable fragments, and schedules them to BE nodes.
SQL Parse → AST
Analyze → syntax & semantics
Logical Plan → relational operators
Optimize → lowest‑cost physical plan
Generate Fragments → BE‑executable units
Coordinate → dispatch fragments to BE
Backend
BE nodes execute fragments: read files from the lake (e.g., Parquet, ORC), apply vectorized filtering and aggregation, and return results to FE. All BE nodes are peers; FE balances workload.
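The division of labor between FE and BE resembles a scatter‑gather pattern, which can be sketched as follows. Everything here is a hypothetical stand‑in rather than StarRocks' actual API: `LAKE_FILES` plays the role of Parquet or ORC files, `plan_fragments` the role of the FE assigning work, and `execute_fragment` the role of a BE node, with threads standing in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for files in the lake: file name -> values of one integer column.
LAKE_FILES = {
    "part-0": [3, 8, 1],
    "part-1": [7, 2],
    "part-2": [9, 4, 6],
    "part-3": [5],
}


def plan_fragments(files, num_backends):
    """FE side: spread files round-robin so every BE peer gets similar work."""
    fragments = [[] for _ in range(num_backends)]
    for i, name in enumerate(sorted(files)):
        fragments[i % num_backends].append(name)
    return fragments


def execute_fragment(file_names, threshold):
    """BE side: scan assigned files, filter, and compute a partial aggregate."""
    total = count = 0
    for name in file_names:
        selected = [v for v in LAKE_FILES[name] if v > threshold]
        total += sum(selected)
        count += len(selected)
    return total, count


def run_query(num_backends, threshold):
    """FE side: dispatch fragments, then merge the partial results."""
    fragments = plan_fragments(LAKE_FILES, num_backends)
    with ThreadPoolExecutor(max_workers=num_backends) as pool:
        partials = list(pool.map(lambda f: execute_fragment(f, threshold), fragments))
    return sum(p[0] for p in partials), sum(p[1] for p in partials)


# Roughly: SELECT sum(v), count(v) FROM lake WHERE v > 4
print(run_query(num_backends=2, threshold=4))
```

Because each fragment returns only a partial sum and count, the amount of data travelling back to the coordinator stays small no matter how many files each peer scans.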
05 Summary
This article introduced the core technologies of a high‑performance data‑lake analytics engine, compared various implementation choices, and presented StarRocks’ architecture as a concrete example.
06 Appendix – Benchmark
Using the TPC-H 100 GB benchmark, we compared StarRocks native tables, StarRocks on Hive, and Trino (formerly PrestoSQL) on Hive. StarRocks native queries finished in 21 s, StarRocks on Hive in 92 s, and Trino in 307 s. The results show StarRocks on Hive outperforms Trino, though network latency and remote‑storage I/O still leave a gap to native storage, which can be narrowed with caching strategies.
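The relative gaps follow directly from the three timings reported above. The short calculation below only divides those numbers; the dictionary keys and the `speedup` helper are names chosen for the example.

```python
# Elapsed times reported in the benchmark above, in seconds.
ELAPSED_SECONDS = {
    "StarRocks native": 21,
    "StarRocks on Hive": 92,
    "Trino on Hive": 307,
}


def speedup(slower: str, faster: str) -> float:
    """How many times faster `faster` finished compared with `slower`."""
    return ELAPSED_SECONDS[slower] / ELAPSED_SECONDS[faster]


print(f"StarRocks on Hive vs. Trino on Hive: {speedup('Trino on Hive', 'StarRocks on Hive'):.1f}x")
print(f"StarRocks native vs. StarRocks on Hive: {speedup('StarRocks on Hive', 'StarRocks native'):.1f}x")
```

So StarRocks on Hive is roughly 3.3 times faster than Trino on the same data, while native storage is in turn about 4.4 times faster than the Hive path, which is the gap that caching aims to close.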
Alibaba Cloud Developer
