
How Replacing Spark with StarRocks Cut Data Refresh Time by 90% and Saved 99% Cost

The article details how the Xiaohongshu data warehouse team integrated StarRocks into their offline processing pipeline, replacing Spark for heavy Cube calculations, which reduced job execution from hours to minutes, cut resource consumption by over 90%, advanced daily data output by 1.5 hours, and lowered refresh cost by more than 99%.


Data processing efficiency is a core concern in the era of big data, driving continuous evolution of offline data warehouse engines from early MapReduce to modern Spark. While Spark‑based warehouses have improved large‑scale data back‑fill, they still suffer from high resource and time costs.

Problem with Spark

Spark does not manage data distribution, storage format, or metadata for query optimization, and it must write intermediate data to disk during cross-node transfers. For large back-fill jobs (e.g., two years of transaction data), Spark can require the equivalent of 70,000 machines running for 30 days, costing over a million yuan.

Technical Choice: StarRocks vs. ClickHouse

Both ClickHouse and StarRocks are columnar MPP OLAP engines. ClickHouse offers fast columnar scans but lacks mature distributed join support and handles high-concurrency workloads poorly. StarRocks provides vectorized execution, a cost-based optimizer, smart materialized views, and a compute-storage-integrated architecture (FE nodes for metadata and query planning, BE nodes for storage and computation), allowing queries to run directly on BE nodes without data movement.

Architecture Redesign

The team introduced a layered offline warehouse design (ODS → DWD → DWS → DM → APP → DIM) and replaced the most resource‑intensive Cube calculations with StarRocks. Key changes include:

Direct Import: DM, DWS, and frequently changing DIM tables are loaded straight into StarRocks, simplifying the pipeline.

Cube Modeling in StarRocks: Compute‑intensive Cube tables are built inside StarRocks, leveraging its vectorized engine.

To further reduce data volume, the traditional COUNT(DISTINCT …) approach was replaced with BitMap deduplication. BitMap stores a single bit per ID, achieving up to 32× storage reduction for int32 values and O(1) update time. The team also introduced an encoding table to map string IDs to compact numeric IDs before BitMap conversion.
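A hedged sketch of the encoding-table-plus-BitMap layout described above, in StarRocks SQL (table and column names are illustrative, not from the original pipeline):

-- Illustrative encoding table: maps string user IDs to dense integer IDs,
-- since BitMap deduplication works on compact numeric IDs.
CREATE TABLE user_id_map (
    user_id  VARCHAR(64) NOT NULL,
    user_int BIGINT      NOT NULL
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 16;

-- Aggregate-key table keeping one pre-merged bitmap of buyers per
-- seller_level; the BITMAP_UNION aggregate type merges bitmaps on load.
CREATE TABLE buy_uv_cube (
    seller_level INT,
    buy_users    BITMAP BITMAP_UNION
)
AGGREGATE KEY(seller_level)
DISTRIBUTED BY HASH(seller_level) BUCKETS 16;

INSERT INTO buy_uv_cube
SELECT t.seller_level, TO_BITMAP(m.user_int)
FROM tb AS t
JOIN user_id_map AS m ON t.user_id = m.user_id
WHERE t.buy_num > 0;

With this layout, buyer UV per seller_level becomes a cheap BITMAP_UNION_COUNT over pre-merged bitmaps instead of a full COUNT(DISTINCT …) scan.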

SQL Example

SELECT seller_level,
       COUNT(DISTINCT IF(buy_num > 0, user_id, NULL))   AS buy_uv,
       COUNT(DISTINCT IF(imp_num > 0, user_id, NULL))   AS imp_uv,
       COUNT(DISTINCT IF(click_num > 0, user_id, NULL)) AS click_uv
FROM tb
GROUP BY seller_level;

During execution, Spark would create an intermediate table, expand three virtual dimensions (c1, c2, c3), and perform three shuffle rounds, inflating data size. In StarRocks, the same logic runs with a single shuffle and no intermediate materialization.
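The same metrics can be expressed with StarRocks bitmap functions; a sketch, assuming user_id is already a compact numeric ID (e.g., produced by an encoding table):

SELECT seller_level,
       BITMAP_UNION_COUNT(TO_BITMAP(IF(buy_num > 0, user_id, NULL)))   AS buy_uv,
       BITMAP_UNION_COUNT(TO_BITMAP(IF(imp_num > 0, user_id, NULL)))   AS imp_uv,
       BITMAP_UNION_COUNT(TO_BITMAP(IF(click_num > 0, user_id, NULL))) AS click_uv
FROM tb
GROUP BY seller_level;

Each distinct-count collapses into a bitmap merge, which StarRocks evaluates in a single aggregation pass rather than expanding the input once per metric.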

Materialized Views and Colocate Join

StarRocks materialized views pre‑compute and store aggregation results, automatically redirecting queries to the materialized data and eliminating repeated scans of base tables. Colocate Join groups tables with identical distribution on the same nodes, enabling local joins without network transfer.
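A minimal sketch of a synchronous materialized view over the example table tb (the view name is illustrative); once created, the optimizer can transparently rewrite matching distinct-count queries to read the pre-aggregated bitmaps:

CREATE MATERIALIZED VIEW uv_by_seller_level AS
SELECT seller_level,
       BITMAP_UNION(TO_BITMAP(user_id))
FROM tb
GROUP BY seller_level;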

Performance Results

Job execution time compressed from hour‑level to minute‑level.

Resource usage reduced by >90%.

Daily data production advanced by 1.5 hours.

Back‑fill time cut by 90% and cost lowered by >99% (from millions of yuan to a few thousand).

These gains were achieved without provisioning additional resources; the StarRocks cluster handled both back‑fill and regular daily workloads.

Implementation Steps

Control the size of DM and DWS tables (rows, columns, field width) to limit resource consumption.

Rewrite SQL to batch large Cube calculations, reducing intermediate data expansion.

Deploy BitMap deduplication and encoding tables for distinct‑count metrics.

Enable materialized views for frequently queried aggregates.

Use Colocate Join for high‑concurrency join operations.
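The Colocate Join setup in the last step can be sketched as follows (table names are illustrative). Tables sharing a colocation group must use the same bucketing column type and bucket count:

CREATE TABLE fact_orders (
    user_id BIGINT,
    amount  DECIMAL(10, 2)
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 16
PROPERTIES ("colocate_with" = "user_group");

CREATE TABLE dim_user (
    user_id BIGINT,
    city    VARCHAR(32)
)
DUPLICATE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 16
PROPERTIES ("colocate_with" = "user_group");

-- Joins on user_id between tables in the same colocation group
-- execute locally on each BE node, with no network shuffle.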

Future Outlook

The team plans to explore StarRocks in lake‑warehouse integration and compute‑storage separation scenarios to build more flexible data production pipelines and self‑service analytics.

Tags: performance optimization, Big Data, StarRocks, OLAP, Spark
Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
