
Optimizing Offline Data Warehouse with StarRocks: Replacing Spark for Faster, Cost‑Effective Data Processing

By replacing part of its Spark‑based offline pipeline with StarRocks, Xiaohongshu’s data‑warehouse team cut job execution time from hours to minutes, reduced resource usage by over 90 %, lowered back‑fill cost by over 99 %, and moved daily data production forward by 1.5 hours.

Xiaohongshu Tech REDtech

Mar 18, 2024


Data processing efficiency is a core issue in the big‑data era. Traditional offline warehouses rely on Spark, which, while robust, suffers from high resource consumption and long latency during large‑scale data back‑fill.

To overcome these limitations, the Xiaohongshu data‑warehouse team integrated StarRocks into the offline pipeline, replacing part of the Spark workload and optimizing costly Cube calculations. The new architecture reduces job execution time from hours to minutes, cuts resource usage by more than 90 %, advances daily data production by 1.5 hours, and lowers back‑fill cost by over 99 %.

The warehouse follows a layered design: ODS (raw logs), DWD (cleaned facts), DWS (aggregated data), DM (wide tables), APP (reports and services), and DIM (shared dimensions). StarRocks, an MPP‑based OLAP engine, handles the APP layer and many Cube computations, offering vectorized execution, columnar storage, and a cost‑based optimizer (CBO).

Key technical improvements include:

Direct import of DM, DWS, and frequently changing DIM tables into StarRocks, simplifying the data flow.

Cube modeling inside StarRocks to accelerate compute‑intensive queries.

Use of Roaring Bitmap for distinct‑count operations, achieving O(1) updates and O(n) aggregation with a far lower memory footprint.

Materialized views to pre‑compute and store results, automatically redirecting queries to the cached data.

Colocation Join to execute joins locally on nodes with matching shard distribution, eliminating network shuffle.
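
As a rough illustration of the Colocation Join and materialized‑view items above, the following StarRocks SQL is a minimal sketch, not the team's actual schema. The table names, columns, bucket counts, the colocation group name seller_group, and the view name are all hypothetical, and the sketch assumes a StarRocks version that supports asynchronous materialized views.

```sql
-- Hypothetical fact and dimension tables (names and columns are assumptions).
-- Tables in one colocation group must share the same bucketing column type,
-- bucket count, and replica count, so both are hashed on seller_id into
-- 16 buckets here.
CREATE TABLE dm_user_seller_action (
    dt        DATE,
    seller_id BIGINT,
    user_id   BIGINT,
    buy_num   BIGINT,
    imp_num   BIGINT,
    click_num BIGINT
)
DUPLICATE KEY(dt, seller_id)
DISTRIBUTED BY HASH(seller_id) BUCKETS 16
PROPERTIES ("colocate_with" = "seller_group");

CREATE TABLE dim_seller (
    seller_id    BIGINT,
    seller_level VARCHAR(32)
)
DUPLICATE KEY(seller_id)
DISTRIBUTED BY HASH(seller_id) BUCKETS 16
PROPERTIES ("colocate_with" = "seller_group");

-- The join key equals the bucketing key of both tables, so matching rows
-- already live on the same node and the join needs no network shuffle.
SELECT d.seller_level, COUNT(DISTINCT a.user_id) AS buy_uv
FROM dm_user_seller_action a
JOIN dim_seller d ON a.seller_id = d.seller_id
WHERE a.dt = '2024-03-01' AND a.buy_num > 0
GROUP BY d.seller_level;

-- Hypothetical asynchronous materialized view that precomputes the same
-- aggregation per day, so reports can read stored results instead of
-- re-running the join.
CREATE MATERIALIZED VIEW mv_seller_level_buy_uv
DISTRIBUTED BY HASH(seller_level) BUCKETS 8
REFRESH ASYNC
AS
SELECT a.dt, d.seller_level, COUNT(DISTINCT a.user_id) AS buy_uv
FROM dm_user_seller_action a
JOIN dim_seller d ON a.seller_id = d.seller_id
WHERE a.buy_num > 0
GROUP BY a.dt, d.seller_level;
```

Running EXPLAIN on the join is one way to check whether the planner actually chose a colocate join. Whether a given query is transparently rewritten to use the view depends on the StarRocks version and settings.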

An example SQL query for UV calculation is shown below:

-- One distinct-user count per action type, grouped by seller level
select
    seller_level,
    count(distinct if(buy_num > 0, user_id, null)) as buy_uv,
    count(distinct if(imp_num > 0, user_id, null)) as imp_uv,
    count(distinct if(click_num > 0, user_id, null)) as click_uv
from tb
group by seller_level;
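
For comparison, the same UV metrics could be served from a bitmap‑based aggregate table. This is only a sketch under stated assumptions: tb and its columns come from the query above, while the table seller_level_uv_cube, its action column, and the bucket count are hypothetical. It also assumes user_id is a non‑negative integer that to_bitmap accepts; string IDs would first need to be mapped to integers.

```sql
-- Hypothetical aggregate table: one bitmap of user IDs per
-- (seller_level, action). Rows with the same key are merged with
-- BITMAP_UNION when data is loaded.
CREATE TABLE seller_level_uv_cube (
    seller_level VARCHAR(32),
    action       VARCHAR(16),
    users        BITMAP BITMAP_UNION
)
AGGREGATE KEY(seller_level, action)
DISTRIBUTED BY HASH(seller_level) BUCKETS 8;

-- Load one filtered slice per action type. Each user_id becomes a
-- single-element bitmap, and the table unions them per key.
INSERT INTO seller_level_uv_cube
SELECT seller_level, 'buy', to_bitmap(user_id) FROM tb WHERE buy_num > 0;

INSERT INTO seller_level_uv_cube
SELECT seller_level, 'imp', to_bitmap(user_id) FROM tb WHERE imp_num > 0;

INSERT INTO seller_level_uv_cube
SELECT seller_level, 'click', to_bitmap(user_id) FROM tb WHERE click_num > 0;

-- Distinct counts are now the cardinality of the unioned bitmaps;
-- no per-query deduplication over raw rows is needed.
SELECT
    seller_level,
    action,
    bitmap_union_count(users) AS uv
FROM seller_level_uv_cube
GROUP BY seller_level, action;
```

This layout returns one row per seller level and action rather than three columns per seller level. The design choice is that bitmaps can be unioned again at coarser granularity, which plain pre‑computed distinct counts cannot.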

Benchmarking the Spark‑based and StarRocks‑based pipelines shows resource usage falling by more than 90 % and back‑fill cost by more than 99 %, while job execution time shrinks from hours to a few minutes. The team also plans to explore lake‑house integration and compute‑storage separation to further improve flexibility and efficiency.
