How StarRocks 2.5 Improves Materialized Views for Real‑Time and Offline Queries
This article analyzes the requirements, design choices, and implementation details of materialized views in StarRocks, covering demand analysis, synchronous and asynchronous refresh solutions, partition binding, task scheduling, partition‑refresh maintenance, Insert‑Overwrite mechanics, view invalidation handling, and the upcoming features planned for version 2.5.
Demand Analysis of Materialized Views
To understand materialized views, first recall that a view is a virtual table defined by a query. A materialized view (MV) materializes this virtual table as a physical table, enabling query acceleration through automatic QueryRewrite when similar SQL statements are issued.
Typical scenarios include:
Real‑time joins of two streaming tables – single‑table sync MVs cannot handle multi‑table joins.
Offline multi‑table modeling where fact and dimension tables are joined – users expect both query acceleration and modeling simplification.
Cases where the MV result set is small but the default partition/bucket layout would be large, causing performance degradation.
These gaps motivate the design of more flexible MV capabilities.
Design of Materialized Views
Two complementary solutions are introduced:
Synchronous solution – optimal for low‑complexity SQL with strong consistency, but struggles with complex queries.
Asynchronous refresh solution – accepts limited refresh latency to support complex SQL. Users can configure periodic refresh (e.g., every minute) for high‑frequency streaming data, or manual refresh for fine‑grained control.
When creating an MV, users can bind it to a specific partition of a base table using PARTITION BY. Buckets are mandatory but can be set independently of the base table, allowing reduction of partition/bucket count when the MV result is small.
Refresh options include: REFRESH ASYNC – automatically determines which partitions need refresh when base tables change. REFRESH MANUAL – lets users trigger refresh explicitly.
These features are already available in StarRocks 2.4.
Implementation Principles
StarRocks 2.4 implements a multi‑table asynchronous MV framework that relies on three core technologies:
Task scheduling framework
Partition‑refresh maintenance
Insert‑Overwrite mechanism
Task Scheduling Framework
A Task acts as a reusable template; a TaskRun is the actual execution unit derived from the template, analogous to a Java class and its object. The framework supports three refresh modes: manual, trigger‑based, and periodic.
Execution flow:
TaskRun objects are created from a Task and placed into a Pending queue.
The scheduler pulls TaskRuns from Pending, runs them, and moves them to a Running queue.
After completion, TaskRuns are recorded in a TaskRun history collection.
Default parameters: queue length 500, concurrency 20, history retention 3 days.
Partition Refresh Maintenance
When an MV is created, it can be bound to a specific partition of a base table. The refresh framework adds or removes partitions before each refresh to ensure the MV’s partition set covers the bound base partitions. Partition mapping can be 1:1, 1:n, n:1, or n:n.
Refresh decisions are made by comparing version numbers of base‑table partitions and MV partitions. If a base partition’s version is newer than the corresponding MV partition, the MV partition is refreshed using the Insert‑Overwrite technique.
Insert Overwrite
The Insert‑Overwrite process consists of three steps:
Create a temporary partition.
Write the refreshed data into the temporary partition.
Atomically replace the target partition with the temporary one.
This mechanism guarantees that readers either see the old data or the fully refreshed new data, never a partially updated state.
View Invalidation
If the schema of a base table changes (e.g., a column is dropped), the associated MV becomes invalid for refresh. The MV can still be queried, but it must be rebuilt to resume refresh capability.
Roadmap for StarRocks 2.5
Version 2.5 will add several critical features that were missing in 2.4:
QueryRewrite support for automatic MV substitution.
Ability to create MVs from external tables and to create MVs on top of existing MVs.
MV TTL (time‑to‑live) management.
Optimized refresh efficiency, including partial‑refresh syntax and configuration to handle complex refresh scenarios.
These enhancements aim to broaden MV applicability and improve performance in both real‑time and offline workloads.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
