
How StarRocks’ Spill to Disk Boosts Query Stability and Performance

StarRocks 3.0 introduces a spill-to-disk mechanism that writes the intermediate results of memory-intensive operators to disk, freeing memory and enabling stable execution of ETL jobs and ad‑hoc queries. Combined with materialized views, it dramatically improves query success rates and delivers up to 4.35× the performance of Spark on the same hardware.

Memory Management in StarRocks

StarRocks stores all intermediate query data in memory. When memory is insufficient, queries are aborted, which can affect online workloads. Typical scenarios that stress memory include lightweight ETL jobs (e.g., INSERT INTO … SELECT), building wide‑table materialized views, and ad‑hoc analytical queries that involve large‑table joins or high‑cardinality aggregations.
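As an illustration, a lightweight ETL job of the kind described above might look like the following; the table and column names are hypothetical:

```sql
-- Hypothetical lightweight ETL job: the join plus high-cardinality
-- aggregation builds large intermediate hash tables in memory.
INSERT INTO daily_user_summary
SELECT u.user_id,
       u.region,
       DATE_TRUNC('day', o.order_time) AS order_day,
       COUNT(*)      AS order_cnt,
       SUM(o.amount) AS total_amount
FROM orders AS o
JOIN users  AS u ON o.user_id = u.user_id
GROUP BY u.user_id, u.region, DATE_TRUNC('day', o.order_time);
```

Without spill, a query like this competes for the same memory pool as online traffic and is aborted if the hash tables outgrow the query's memory limit.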

Resource Isolation (since 2.2)

StarRocks 2.2 introduced resource groups to allocate CPU and memory separately for online queries and background workloads. While this isolates online traffic, it does not guarantee that ETL or occasional large queries can finish without running out of memory.
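A resource group of the kind described here can be sketched as follows; the group name, classifier, and exact property names are assumptions based on StarRocks' resource-group syntax and may differ by version:

```sql
-- Hypothetical resource group that caps the CPU and memory available
-- to an ETL user, isolating online queries from background workloads.
CREATE RESOURCE GROUP etl_group
TO (user = 'etl_user')
WITH (
    'cpu_core_limit' = '8',
    'mem_limit'      = '30%'
);
```

Note that this only bounds how much memory the ETL workload may claim; a large query that genuinely needs more than its quota will still fail without spill.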

Spill (Intermediate Result Disk‑Write) – Version 3.0+

StarRocks 3.0 adds a spill mechanism that writes the intermediate results of heavy operators to disk, freeing memory for the rest of the query. The system estimates the memory an operator requires before it processes data; if the estimated reservation would exceed 80% of the query's memory limit, spill is triggered.

Supported operators:

Aggregation

Sort

Hash join (including LEFT, RIGHT, and FULL OUTER joins, as well as SEMI and INNER joins)

Two spill modes are available:

Forced spill: users explicitly enable spill for a query.

Automatic spill: the engine decides at runtime based on current memory usage.

When an operator begins processing, StarRocks pre‑allocates the estimated memory. If the allocation stays below the 80 % threshold, execution continues in memory; otherwise the operator’s intermediate data is written to disk.
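Both modes are controlled through session variables. The snippet below is a sketch assuming the `enable_spill` and `spill_mode` session variables introduced with StarRocks 3.0; exact names and accepted values may vary by version:

```sql
-- Forced spill: explicitly make the current session spill heavy operators.
SET enable_spill = true;
SET spill_mode = 'force';

-- Automatic spill: let the engine decide at runtime based on the
-- 80% memory-reservation threshold described above.
SET enable_spill = true;
SET spill_mode = 'auto';
```

Automatic mode is the safer default for mixed workloads, since queries that fit in memory keep their in-memory performance and only genuinely oversized operators pay the disk I/O cost.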

Performance Evaluation

Cluster configuration: 1 frontend (8 CPU × 2 GB) + 3 backends (16 CPU × 4 GB), each BE attached to two PL1‑level cloud disks.

Test set: 99 TPC‑DS queries on a 10 TB dataset.

Without spill: 15 queries failed due to memory exhaustion.

With spill (automatic): all 99 queries succeeded, demonstrating a dramatic increase in stability.

Comparison with Spark: under the same hardware and with automatic spill enabled, StarRocks achieved 4.35× higher overall query throughput than Spark.

Materialized Views + Spill

Materialized views (MVs) accelerate common analytical patterns but can fail to build on large datasets because they require full‑table joins without predicate push‑down, exhausting memory. Enabling spill allows MVs to be constructed and refreshed successfully, and queries that are rewritten to use the MV gain significant speed‑ups.

TPC‑H 1 TB test:

Direct queries without spill often failed due to memory limits.

Enabling spill allowed all queries to finish.

Building a wide‑table MV failed without spill; with spill the MV was built and refreshed, and subsequent queries showed marked performance improvement.
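A wide-table MV build with spill enabled could be sketched as below; the MV name, source tables (drawn from the TPC‑H schema), and the `session.enable_spill` property are illustrative assumptions:

```sql
-- Hypothetical wide-table materialized view over TPC-H tables.
-- Enabling spill for its refresh task lets the full-table joins
-- (no predicate push-down) complete instead of exhausting memory.
CREATE MATERIALIZED VIEW lineitem_wide_mv
REFRESH ASYNC
PROPERTIES ('session.enable_spill' = 'true')
AS
SELECT l.l_orderkey,
       o.o_orderdate,
       c.c_name,
       l.l_extendedprice * (1 - l.l_discount) AS revenue
FROM lineitem AS l
JOIN orders   AS o ON l.l_orderkey = o.o_orderkey
JOIN customer AS c ON o.o_custkey  = c.c_custkey;
```

Once the MV is built, queries that the optimizer rewrites against it avoid the expensive joins entirely, which is where the reported speed-ups come from.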

Future Directions

Spill will be further integrated with resource isolation and multi‑warehouse capabilities to provide finer‑grained resource control for diverse workloads, enabling customized memory and CPU quotas per workflow.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
