Databases 26 min read

Inside StarRocks Optimizer: Architecture, Multi‑Stage Optimization, and Advanced Features

This article provides a comprehensive technical overview of StarRocks' query optimizer, covering its evolution, core architecture, multi‑stage optimization pipeline, key optimizations such as multi‑join colocate, low‑cardinality global dictionary, MV union rewrite, and advanced mechanisms like cost‑estimation fixes, query feedback, adaptive execution, runtime filters, join‑reorder strategies, and SQL plan management.

StarRocks
StarRocks
StarRocks
Inside StarRocks Optimizer: Architecture, Multi‑Stage Optimization, and Advanced Features

Development History and Architecture

StarRocks has released three major versions in the past five years: 1.0 introduced a vectorized engine and CBO for a fast OLAP database; 2.0 added primary‑key models, data‑lake analysis, and further query engine optimizations; 3.0 brought storage‑compute separation, materialized views, and an Open LakeHouse architecture.

The system consists of two core components: Frontend (FE) – Java‑based metadata management, query optimization, and scheduling; and Compute Node (CN) – C++‑based execution engine and data cache for lake storage.

StarRocks architecture diagram
StarRocks architecture diagram

Optimizer Overview

The optimizer follows the Cascades framework with a rule‑based layer and a cost‑based layer (CBO). Its multi‑stage pipeline includes:

Logical Rewrite : tree‑to‑tree transformations such as sub‑query rewrite, CTE inline, predicate push‑down, and column pruning.

Cost‑Based Optimization : uses a memo structure to explore join orders, join distribution strategies, aggregation distribution, and materialized‑view rewrites.

Physical Rewrite : applies low‑cost physical rewrites (e.g., global dictionary rewrite, expression reuse) and inserts decode nodes when necessary.

Feedback Tuning (2024): dynamically adjusts plans based on runtime feedback.

Optimizer pipeline diagram
Optimizer pipeline diagram

Key Optimizations

1. Multi‑Left Join Colocate

StarRocks distinguishes four join strategies—Shuffle, Broadcast, Replication, and Colocate. Colocate joins execute locally when tables are already distributed by the join key, avoiding network shuffles. The optimizer uses a Distribution Property Enforcer to enforce colocate properties, handling NULL values with NULL‑relax and NULL‑strict modes.

Colocate join illustration
Colocate join illustration

2. Low‑Cardinality Global Dictionary Optimization

Starting from version 2.0, StarRocks builds global dictionaries for low‑cardinality string columns (default threshold 256). During the Physical Rewrite stage, eligible columns are replaced with their integer dictionary IDs, enabling faster scans, joins, aggregations, and string functions. This can improve aggregation query performance by up to three times.

Global dictionary optimization example
Global dictionary optimization example

3. Partitioned MV Union Rewrite

Materialized views (MVs) have been a core feature since 1.0 and were enhanced in 2.0. The optimizer can automatically rewrite queries to union MV results with base‑table data, handling scenarios such as recent data queries on a lake, or partially refreshed MV partitions. The rewrite adds compensation predicates to the MV scan and unions it with the base‑table scan.

MV union rewrite diagram
MV union rewrite diagram

Cost Estimation Challenges and Solutions

Common issues include missing or inaccurate statistics, sampling errors, independence assumptions, data skew, and plan volatility. StarRocks addresses these with three advanced mechanisms:

Query Feedback : records bad plans and tuning guides from runtime information, then applies improved plans for similar queries.

Adaptive Execution : dynamically selects aggregation strategies (one‑stage vs. two‑stage) and adjusts runtime filter usage based on observed selectivity.

SQL Plan Management (SPM) : generates baseline plans with parameter placeholders, stores them, and forces their reuse to avoid performance regressions after upgrades.

Query Feedback Details

Two modules—Plan Tuning Analyzer and SQL Tuning Advisor—detect sub‑optimal join orders, join distribution strategies, or streaming aggregation choices, and store corrective hints. Subsequent queries retrieve and apply these hints automatically.

Query feedback workflow
Query feedback workflow

Adaptive Execution Details

For group‑by queries, StarRocks initially generates a two‑stage distributed plan with local aggregation. If runtime statistics show low aggregation benefit, it switches to a one‑stage plan, avoiding unnecessary hash tables. Adaptive runtime filters also limit the number of active Bloom filters based on observed selectivity.

Adaptive execution illustration
Adaptive execution illustration

SQL Plan Management (SPM)

SPM creates a baseline plan by constant‑parameterizing the SQL, generating a physical plan, and storing a hint‑rich SQL version. When the same query is issued, the optimizer substitutes actual constants back into the baseline SQL and reuses the stored plan, ensuring stable performance across system upgrades.

SQL plan management flow
SQL plan management flow

Join Reorder Strategies

StarRocks employs different algorithms based on join count: up to 4 joins – associative and commutative rules; up to 10 joins – left‑deep tree + dynamic programming + greedy; up to 16 joins – left‑deep tree + greedy; up to 50 joins – left‑deep tree only; more than 50 joins – no reorder. This limits the explosion of join‑order and distribution‑strategy combinations.

FAQ and Summary

1. Cost estimation errors are inevitable; the executor must adapt using feedback and adaptive execution.

2. Robust testing (correctness, performance, plan quality) is essential for optimizer reliability.

3. Null handling remains a complex challenge for both optimizer and vectorized execution.

4. Integrated optimizers that collaborate with storage, statistics, and execution layers achieve superior performance.

StarRocksOLAPMaterialized ViewsAdaptive ExecutionQuery Optimizercost‑based optimizationSQL Plan Management
StarRocks
Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.