Databases 14 min read

In-Depth Analysis of StarRocks Optimizer Architecture and Techniques

This article provides a comprehensive technical overview of StarRocks' query optimizer, covering its cascades/ORCA-inspired architecture, logical and physical plan transformations, cost modeling, statistics derivation, memo structure, task scheduling, and practical examples of join optimization in a distributed OLAP engine.

Big Data Technology Architecture
Big Data Technology Architecture
Big Data Technology Architecture
In-Depth Analysis of StarRocks Optimizer Architecture and Techniques

StarRocks, a modern OLAP engine, implements an efficient and stable planner/optimizer inspired by the Cascades and ORCA frameworks. The optimizer consists of several stages: Analyzer validates catalog information, Rewriter applies logical‑to‑logical transformations using rule‑based tree rewrites, and the Cost‑Based Optimizer (CBO) generates physical plans based on cost estimates.

The overall optimization flow includes Exploration (generating equivalent logical plans via algebraic rules such as join commutativity and associativity), Statistics Derivation (collecting row counts, column cardinalities, histograms, etc., via a bottom‑up traversal), Implementation (converting logical operators to physical operators like HashJoin, SortMergeJoin, or IndexScan), and Optimization (evaluating costs and pruning sub‑trees).

Rewriter performs top‑down pattern matching on binary trees, splitting predicates and pushing them down to improve join ordering.

CBO maintains two key data structures: LowestCostExpressions (mapping required physical properties to the best expression) and LowestCostTable (tracking required properties for child groups).

Property enforcement handles required sort and distribution properties by inserting Enforcer operators (e.g., Shuffle, Broadcast) and accounting for their costs.

StarRocks adopts a stratified search strategy where logical‑to‑logical and logical‑to‑physical transformations are interleaved, allowing early cost‑based pruning and reducing the search space.

The optimizer is driven by a task scheduler with distinct task types: DeriveStatsTask (statistics derivation), OptimizeExpressionTask (rule application), ApplyRuleTask (transformations), and EnforceAndCostTask (cost calculation and property enforcement). These tasks operate on GroupExpression and OptExpression structures stored in a Memo.

Cost modeling for operators such as HashJoin considers CPU cost (output row size), memory cost (size of the build side), and network cost (typically zero for hash joins). Statistics objects contain fields like outputRowCount , columnStatistics (min, max, average row size, distinct values, etc.), which feed into the cost formulas.

Implementation details include classes for statistics collection (CreateAnalyzeJobStmt, AnalyzeStmt, StatisticAutoCollector, StatisticsCalculator) and the overall optimizer pipeline illustrated with diagrams of memo evolution, join optimization examples, and task flow.

Finally, the article references foundational papers (Cascades, ORCA, CMU 15‑721) and provides links to StarRocks source code for deeper exploration.

DatabaseStarRocksOLAPcost modelQuery OptimizerCascades
Big Data Technology Architecture
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.