
How StarRocks Redefines Lakehouse Architecture with Compute‑Storage Separation

StarRocks, an open‑source MPP analytical database, consolidates BI, interactive, and real‑time analytics into a single engine. This article traces its evolution from version 1.0 to 3.x, covering compute‑storage separation, the unified catalog, generated columns, operator spill, and advanced materialized views, and outlines its cloud‑native lakehouse roadmap.


Community Overview

StarRocks was open‑sourced in September 2021. Within two years it has attracted more than 5,700 GitHub stars and close to 300 contributors, making it one of the fastest‑growing open‑source database projects. Core code (>70%) is maintained by MirrorZhou Technology, with major contributions from Alibaba Cloud, Tencent Cloud, Volcano Engine and Didi (e.g., materialized views, CN elastic nodes, Pulsar source, Paimon catalog).

Technical Evolution

1.0 : Introduced the cost‑based optimizer (CBO), the vectorized execution engine and runtime filters for high query performance.

2.0 : Added the pipeline execution engine, the Primary Key model, data‑lake analysis, materialized views and resource‑group isolation to support unified workloads.

3.0 : Focuses on lakehouse integration, with compute‑storage separation, enhanced materialized views and native support for external catalogs.

Compute‑Storage Separation (v3.0)

In version 3.0 the architecture splits into a Front‑End (FE) layer that manages metadata and query planning, and a Back‑End (BE) layer that executes plans while table data is persisted in remote storage (S3‑compatible object storage or HDFS) rather than on local disks. StarOS abstracts distributed scheduling, storage access and cache management, which is what allows BE nodes to be stateless and to scale elastically.

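Cost impact: moving from three‑replica local disks to a single copy on S3 or HDFS reduces storage cost by roughly 80 %. Keeping one copy instead of three accounts for about two‑thirds of that on its own; the remainder would have to come from the lower unit price of remote storage.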

Performance: with the data cache enabled, latency of the separated architecture matches that of the integrated design; cold reads are only 2‑3× slower, which is acceptable for most analytical workloads.
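
As a rough illustration of how a table opts into the local data cache in a compute‑storage separated (shared‑data) cluster, the sketch below creates a table with caching enabled. The table and column names are hypothetical. The property name `datacache.enable` is the one documented for StarRocks 3.1 and later; earlier 3.0 releases may use a different property name, so treat it as an assumption to check against the version in use.

```sql
-- Hypothetical table in a shared-data cluster. Data files live in remote
-- storage; hot data is additionally cached on the compute node's local disk.
CREATE TABLE IF NOT EXISTS events (
    event_id  BIGINT NOT NULL,
    event_day DATE   NOT NULL,
    payload   STRING
)
DUPLICATE KEY(event_id)
DISTRIBUTED BY HASH(event_id)
PROPERTIES (
    -- Assumed property name (StarRocks 3.1+); enables the local data cache.
    "datacache.enable" = "true"
);
```

The first scan of a partition pays the cold‑read cost described above; subsequent scans of the same data can be served from the local cache.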

Performance Example

On a cluster of three nodes (each with 16 CPU cores and 64 GB of RAM), StarRocks completed all 99 TPC‑DS queries on a 10 TB dataset without out‑of‑memory failures. With operator spill enabled, its throughput was 4.35× that of Spark.
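
Operator spill is switched on per session. A minimal sketch, assuming a 3.x release that ships the `enable_spill` and `spill_mode` session variables:

```sql
-- Allow memory-intensive operators (joins, aggregations, sorts) to spill
-- intermediate results to disk instead of failing with an out-of-memory error.
SET enable_spill = true;

-- 'auto' spills only once memory usage reaches the engine's threshold;
-- 'force' spills for every eligible operator regardless of memory usage.
SET spill_mode = 'auto';
```

With `auto`, queries that fit in memory run unchanged, so the setting mainly affects the large joins and aggregations typical of TPC‑DS‑style workloads.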

Lakehouse Paradigm

StarRocks provides a unified catalog that can query both internal tables and external data‑lake sources such as Hive, Iceberg, Hudi, Paimon, MySQL, PostgreSQL and Elasticsearch. This eliminates the need for separate ETL pipelines and enables federated queries across heterogeneous data sources, realizing a “One Data, All Analytics” model.
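
The unified catalog can be sketched as follows: an external catalog is registered once, after which its tables can be joined with internal tables in a single query. The catalog, database, table and column names and the metastore address are hypothetical placeholders; `default_catalog` is the name StarRocks gives to its internal catalog.

```sql
-- Register a Hive metastore as an external catalog (placeholder address).
CREATE EXTERNAL CATALOG hive_lake
PROPERTIES (
    "type" = "hive",
    "hive.metastore.uris" = "thrift://metastore.example.internal:9083"
);

-- Federated query: join a Hive table with an internal StarRocks table
-- without copying the Hive data first.
SELECT c.region, SUM(o.amount) AS total_amount
FROM hive_lake.sales.orders AS o
JOIN default_catalog.crm.customers AS c
  ON o.customer_id = c.customer_id
GROUP BY c.region;
```

Tables are addressed as `catalog.database.table`, which is what lets one statement span several sources.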

Key Feature Highlights

Generated Columns : Materialize frequently accessed fields of JSON, ARRAY, MAP or STRUCT types, dramatically accelerating semi‑structured queries.

Operator Spill : Intermediate results can be spilled to disk when memory is insufficient, allowing complex or ETL‑heavy queries to run without OOM failures. This was demonstrated by completing TPC‑DS at 10 TB on three nodes with 16 cores and 64 GB of RAM each, at 4.35× the throughput of Spark.

Advanced Materialized Views : Can be defined over arbitrary SQL queries, including multi‑table joins, and support partition‑level refresh and resource‑group isolation so that refresh jobs do not disturb online workloads.

Primary‑Key Model : Implements updates as delete‑plus‑insert; column‑wise updates are more than 10× faster than row‑wise ones; the primary‑key index can be kept in memory or persisted to disk.

Partition, Bucketing & Sorting Optimizations : Automatic expression‑based partition creation, LIST partition, random bucketing, and ORDER BY‑based sorting decouple sort keys from column order. Users can run OPTIMIZE TABLE to improve layout with a single command.

Single‑Leader Replication (v2.5+) : Only one replica writes the segment file; other replicas receive a physical copy, cutting CPU, memory, I/O and network overhead roughly by half.

Data Cache : Local disk cache of remote data brings I/O latency close to that of local storage; with cache enabled, compute‑storage separation matches integrated latency.
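
Materialized View Example

The materialized‑view capabilities listed above (multi‑table join, partition‑level refresh, resource‑group isolation) can be sketched in one definition. All object names are hypothetical. The sketch assumes that the base table `orders_fact` is partitioned by `order_day` and that a resource group named `mv_refresh_rg` already exists.

```sql
CREATE MATERIALIZED VIEW daily_region_sales
-- Aligning the view's partitions with the base table's lets a refresh
-- touch only the partitions whose source data changed.
PARTITION BY order_day
DISTRIBUTED BY HASH(region)
-- Asynchronous refresh on a fixed schedule.
REFRESH ASYNC EVERY (INTERVAL 1 HOUR)
PROPERTIES (
    -- Run refresh tasks in a dedicated resource group so they do not
    -- compete with online queries (assumed to exist already).
    "resource_group" = "mv_refresh_rg"
)
AS
SELECT o.order_day, c.region, SUM(o.amount) AS total_amount
FROM orders_fact AS o
JOIN customers AS c ON o.customer_id = c.customer_id
GROUP BY o.order_day, c.region;
```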

Generated Column Example

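CREATE TABLE orders (
    order_id BIGINT NOT NULL,
    order_info JSON,
    -- StarRocks declares a generated column as `<name> <type> AS <expr>`;
    -- the value is computed at load time and stored with the row.
    order_date DATE AS CAST(get_json_string(order_info, '$.date') AS DATE),
    order_amount DOUBLE AS get_json_double(order_info, '$.amount')
)
PRIMARY KEY (order_id)
-- Partitioning is omitted: a Primary Key table can only be partitioned
-- by columns that belong to the primary key, and order_date does not.
DISTRIBUTED BY HASH(order_id);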
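
Partitioning and Sort Key Example

The partitioning, bucketing and sorting options described in the feature list can be sketched with two hypothetical tables. Expression‑based partitioning and the separate `ORDER BY` clause are assumed to be available from 3.0, and random bucketing from 3.1 on Duplicate Key tables; verify against the release in use.

```sql
-- Duplicate Key table: one partition per day is created automatically
-- as data arrives, and rows are spread across buckets at random.
CREATE TABLE page_views (
    view_time DATETIME NOT NULL,
    user_id   BIGINT   NOT NULL,
    url       STRING
)
DUPLICATE KEY(view_time, user_id)
PARTITION BY date_trunc('day', view_time)
DISTRIBUTED BY RANDOM;

-- Primary Key table: the sort key (ORDER BY) is declared separately
-- from the primary key, so it does not have to follow column order.
CREATE TABLE user_profiles (
    user_id     BIGINT NOT NULL,
    city        VARCHAR(64),
    last_active DATETIME
)
PRIMARY KEY (user_id)
DISTRIBUTED BY HASH(user_id)
ORDER BY (city, last_active);
```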

Multi‑Warehouse Elasticity

Multiple warehouses can share the same underlying data. Each warehouse can be sized independently and isolated via resource groups, enabling workloads such as BI reporting and ad‑hoc analysis to run on separate compute pools without interference.
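
The resource‑group side of this isolation can be sketched as follows. The group name, user name and limits are hypothetical, and the example does not show how a warehouse itself is created. The limit names follow the `cpu_core_limit` / `mem_limit` form; these names have changed in later releases, so treat them as an assumption.

```sql
-- Route queries issued by the BI service account into their own group
-- with a bounded share of CPU and memory.
CREATE RESOURCE GROUP bi_reporting
TO (user = 'bi_user')
WITH (
    'cpu_core_limit' = '8',
    'mem_limit' = '30%'
);
```

A second group with different limits could be defined for ad‑hoc analysts in the same way, so that a heavy exploratory query cannot starve scheduled reports.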

Future Roadmap (next 12 months)

Enhance multi‑warehouse elasticity and add time‑travel capabilities.

Extend compute‑storage separation to the FE layer for greater scalability.

Introduce Pipe for continuous S3 ingestion/export; future support for Kafka and relational sources.

Improve cold‑read performance, add cache pre‑warming, and increase primary‑key throughput.

Broaden support for open table formats (e.g., Spark‑compatible) and strengthen operator spill mechanisms.

Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
