Optimizing Offline Data Warehouse with StarRocks: Replacing Spark for Faster, Cost‑Effective Data Processing
By replacing part of its Spark‑based offline pipeline with StarRocks, Xiaohongshu’s data‑warehouse team cut job execution time from hours to minutes, reduced resource usage by more than 90 %, lowered back‑fill cost by over 99 %, and moved daily data production forward by 1.5 hours.
Data processing efficiency is a core issue in the big‑data era. Traditional offline warehouses rely on Spark, which, while robust, suffers from high resource consumption and long latency during large‑scale data back‑fill.
To overcome these limitations, the Xiaohongshu data‑warehouse team integrated StarRocks into the offline pipeline, replacing part of the Spark workload and optimizing costly Cube calculations. The new architecture reduces job execution time from hours to minutes, cuts resource usage by more than 90 %, advances daily data production by 1.5 hours, and lowers back‑fill cost by over 99 %.
The warehouse follows a layered design: ODS (raw logs), DWD (cleaned facts), DWS (aggregated data), DM (wide tables), APP (reports and services), and DIM (shared dimensions). StarRocks, an MPP‑based OLAP engine, handles the APP layer and many Cube computations, offering vectorized execution, columnar storage, and a cost‑based optimizer (CBO).
Key technical improvements include:
Direct import of DM, DWS, and frequently changing DIM tables into StarRocks, simplifying the data flow.
Cube modeling inside StarRocks to accelerate compute‑intensive queries.
Use of Roaring Bitmap for distinct‑count operations, achieving O(1) updates and O(n) aggregation with a far lower memory footprint.
Materialized views to pre‑compute and store results, with matching queries automatically rewritten to read the pre‑computed data.
Colocation Join to execute joins locally on nodes with matching shard distribution, eliminating network shuffle.
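The bitmap idea behind the distinct‑count item above can be sketched in a few lines. This is a minimal illustration, not StarRocks code: it uses a plain Python integer as an uncompressed bitmap rather than a real Roaring Bitmap, and the table contents, the `day` dimension, and the helper names (`bitmap_add`, `bitmap_union`, `bitmap_count`, `rollup_by_level`) are all assumptions made for the example. It shows why bitmaps suit Cube‑style roll‑ups: per‑cell bitmaps can be merged with a bitwise OR, whereas per‑cell distinct counts cannot simply be summed.

```python
from collections import defaultdict


def bitmap_add(bitmap: int, user_id: int) -> int:
    """Mark user_id as present by setting its bit."""
    return bitmap | (1 << user_id)


def bitmap_union(a: int, b: int) -> int:
    """Merge two bitmaps; duplicates collapse automatically."""
    return a | b


def bitmap_count(bitmap: int) -> int:
    """Number of distinct user ids recorded in the bitmap."""
    return bin(bitmap).count("1")


# Hypothetical (seller_level, day, user_id) events.
events = [
    ("gold", "d1", 1), ("gold", "d1", 2),
    ("gold", "d2", 2), ("gold", "d2", 3),
    ("silver", "d1", 3), ("silver", "d2", 3),
]

# Finest-grained cells: one bitmap per (seller_level, day).
cells = defaultdict(int)
for level, day, uid in events:
    cells[(level, day)] = bitmap_add(cells[(level, day)], uid)


def rollup_by_level(cells):
    """Roll up to seller_level by OR-ing the per-day bitmaps."""
    per_level = defaultdict(int)
    for (level, _day), bm in cells.items():
        per_level[level] = bitmap_union(per_level[level], bm)
    return {level: bitmap_count(bm) for level, bm in per_level.items()}


uv_by_level = rollup_by_level(cells)

# Grand total: union of every cell, again without rescanning raw events.
total_bitmap = 0
for bm in cells.values():
    total_bitmap = bitmap_union(total_bitmap, bm)
total_uv = bitmap_count(total_bitmap)

# Summing per-day counts would overcount users active on several days.
naive_gold_sum = sum(
    bitmap_count(bm) for (level, _), bm in cells.items() if level == "gold"
)

print(uv_by_level, total_uv, naive_gold_sum)
```

In this toy data user 2 is active for "gold" sellers on both days, so adding the two daily counts overcounts, while the bitmap union does not. A compressed format such as Roaring Bitmap applies the same merge logic with much less memory for sparse id ranges.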
An example SQL query for UV calculation is shown below:
select
    seller_level,
    count(distinct if(buy_num > 0, user_id, null)) buy_uv,
    count(distinct if(imp_num > 0, user_id, null)) imp_uv,
    count(distinct if(click_num > 0, user_id, null)) click_uv
from tb
group by seller_level;

Benchmarking the Spark‑based and StarRocks‑based pipelines demonstrates a 90 % reduction in back‑fill time and a 99 % drop in cost, while daily processing time shrinks to a few minutes. The team also plans to explore lake‑house integration and compute‑storage separation to further enhance flexibility and efficiency.
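To make the semantics of the UV query above concrete, here is a small sketch that runs an equivalent query on toy data. It is an illustration under stated assumptions: it uses Python's built‑in SQLite rather than StarRocks, replaces `if(cond, user_id, null)` with the portable `case when ... then user_id end`, and the rows in `tb` are invented for the example. The point is that each conditional expression yields NULL for non‑matching rows, and `count(distinct ...)` ignores NULLs, so each column counts only the users who performed that action.

```python
import sqlite3

# In-memory database standing in for the warehouse table `tb`.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table tb ("
    " seller_level text, user_id integer,"
    " buy_num integer, imp_num integer, click_num integer)"
)

# Hypothetical rows: (seller_level, user_id, buy_num, imp_num, click_num).
rows = [
    ("A", 1, 1, 5, 2),
    ("A", 1, 0, 3, 0),  # user 1 appears twice but is counted once
    ("A", 2, 0, 4, 1),
    ("A", 3, 0, 2, 0),
    ("B", 3, 2, 6, 1),
    ("B", 4, 0, 0, 0),  # no activity, so excluded from every UV column
]
conn.executemany("insert into tb values (?, ?, ?, ?, ?)", rows)

# Same shape as the article's query; CASE yields NULL when the
# condition is false, and count(distinct ...) skips NULLs.
query = """
select
    seller_level,
    count(distinct case when buy_num > 0 then user_id end) as buy_uv,
    count(distinct case when imp_num > 0 then user_id end) as imp_uv,
    count(distinct case when click_num > 0 then user_id end) as click_uv
from tb
group by seller_level
order by seller_level
"""
result = conn.execute(query).fetchall()

for seller_level, buy_uv, imp_uv, click_uv in result:
    print(seller_level, buy_uv, imp_uv, click_uv)
```

Each of the three distinct counts needs its own deduplication pass over `user_id`. That per‑column deduplication is the part the bitmap approach described earlier is meant to make cheaper.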
Xiaohongshu Tech REDtech
Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.