Databases 26 min read

Inside StarRocks Optimizer: Architecture, Multi‑Stage Optimization, and Advanced Features

This article provides a comprehensive technical overview of StarRocks' query optimizer, covering its evolution, core architecture, multi‑stage optimization pipeline, key optimizations such as multi‑join colocate, low‑cardinality global dictionary, MV union rewrite, and advanced mechanisms like cost‑estimation fixes, query feedback, adaptive execution, runtime filters, join‑reorder strategies, and SQL plan management.

StarRocks

Apr 24, 2025

Development History and Architecture

StarRocks has released three major versions in the past five years: 1.0 introduced a vectorized engine and CBO for a fast OLAP database; 2.0 added primary‑key models, data‑lake analysis, and further query engine optimizations; 3.0 brought storage‑compute separation, materialized views, and an Open LakeHouse architecture.

The system consists of two core components: Frontend (FE) – Java‑based metadata management, query optimization, and scheduling; and Compute Node (CN) – C++‑based execution engine and data cache for lake storage.

Optimizer Overview

The optimizer follows the Cascades framework with a rule‑based layer and a cost‑based layer (CBO). Its multi‑stage pipeline includes:

Logical Rewrite : tree‑to‑tree transformations such as sub‑query rewrite, CTE inline, predicate push‑down, and column pruning.

Cost‑Based Optimization : uses a memo structure to explore join orders, join distribution strategies, aggregation distribution, and materialized‑view rewrites.

Physical Rewrite : applies low‑cost physical rewrites (e.g., global dictionary rewrite, expression reuse) and inserts decode nodes when necessary.

Feedback Tuning (2024): dynamically adjusts plans based on runtime feedback.

Key Optimizations

1. Multi‑Left Join Colocate

StarRocks distinguishes four join strategies—Shuffle, Broadcast, Replication, and Colocate. Colocate joins execute locally when tables are already distributed by the join key, avoiding network shuffles. The optimizer uses a Distribution Property Enforcer to enforce colocate properties, handling NULL values with NULL‑relax and NULL‑strict modes.

2. Low‑Cardinality Global Dictionary Optimization

Starting from version 2.0, StarRocks builds global dictionaries for low‑cardinality string columns (default threshold 256). During the Physical Rewrite stage, eligible columns are replaced with their integer dictionary IDs, enabling faster scans, joins, aggregations, and string functions. This can improve aggregation query performance by up to three times.

3. Partitioned MV Union Rewrite

Materialized views (MVs) have been a core feature since 1.0 and were enhanced in 2.0. The optimizer can automatically rewrite queries to union MV results with base‑table data, handling scenarios such as recent data queries on a lake, or partially refreshed MV partitions. The rewrite adds compensation predicates to the MV scan and unions it with the base‑table scan.

Cost Estimation Challenges and Solutions

Common issues include missing or inaccurate statistics, sampling errors, independence assumptions, data skew, and plan volatility. StarRocks addresses these with three advanced mechanisms:

Query Feedback : records bad plans and tuning guides from runtime information, then applies improved plans for similar queries.

Adaptive Execution : dynamically selects aggregation strategies (one‑stage vs. two‑stage) and adjusts runtime filter usage based on observed selectivity.

SQL Plan Management (SPM) : generates baseline plans with parameter placeholders, stores them, and forces their reuse to avoid performance regressions after upgrades.

Query Feedback Details

Two modules—Plan Tuning Analyzer and SQL Tuning Advisor—detect sub‑optimal join orders, join distribution strategies, or streaming aggregation choices, and store corrective hints. Subsequent queries retrieve and apply these hints automatically.

Adaptive Execution Details

For group‑by queries, StarRocks initially generates a two‑stage distributed plan with local aggregation. If runtime statistics show low aggregation benefit, it switches to a one‑stage plan, avoiding unnecessary hash tables. Adaptive runtime filters also limit the number of active Bloom filters based on observed selectivity.

SQL Plan Management (SPM)

SPM creates a baseline plan by constant‑parameterizing the SQL, generating a physical plan, and storing a hint‑rich SQL version. When the same query is issued, the optimizer substitutes actual constants back into the baseline SQL and reuses the stored plan, ensuring stable performance across system upgrades.

Join Reorder Strategies

StarRocks employs different algorithms based on join count: up to 4 joins – associative and commutative rules; up to 10 joins – left‑deep tree + dynamic programming + greedy; up to 16 joins – left‑deep tree + greedy; up to 50 joins – left‑deep tree only; more than 50 joins – no reorder. This limits the explosion of join‑order and distribution‑strategy combinations.

FAQ and Summary

1. Cost estimation errors are inevitable; the executor must adapt using feedback and adaptive execution.

2. Robust testing (correctness, performance, plan quality) is essential for optimizer reliability.

3. Null handling remains a complex challenge for both optimizer and vectorized execution.

4. Integrated optimizers that collaborate with storage, statistics, and execution layers achieve superior performance.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

StarRocks OLAP Materialized Views Adaptive Execution Query Optimizer Cost-Based Optimization SQL Plan Management

Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.