How Xiaohongshu Accelerated Data Warehouse Queries with Logical Datasets & Materialized Views
Xiaohongshu tackled low reuse of APP tables, limited scalability of single-table BI datasets, and poor dashboard query performance by introducing logical datasets and materialized views, which enable query pruning, reduce data redundancy, and accelerate BI queries, achieving up to 80% latency reduction and higher hit rates.
Background and Motivation
Xiaohongshu faced three major data‑warehouse challenges: low reuse of APP tables, insufficient scalability of single‑table BI datasets, and poor performance of dashboard queries built on wide tables. As the business grew, the number of exported datasets increased, putting pressure on data‑engineers and degrading query response times.
Problem Analysis
The existing RedBI platform offered two dataset types: single‑table datasets and SQL datasets (logical views). SQL datasets often involve a fact table joined with many dimension tables, leading to heavy join operations and inability to prune unused columns. This caused slower queries compared to direct APP table queries, especially when only a subset of fields was needed.
Additionally, wide‑table datasets suffered from redundant fields, making maintenance difficult and hindering the addition of new analysis scenarios.
Overall Solution
The team introduced a logical dataset model (wide‑table + dimension‑table) combined with materialized views to achieve both data‑generalization and high query performance.
Logical datasets enable query pruning: only the tables and columns required for a specific analysis are joined, dramatically reducing unnecessary computation.
Materialized views pre‑aggregate data into smaller tables, providing fast query paths for dashboard workloads that typically scan large time ranges.
Logical Dataset Design
A logical dataset consists of two parts: dataset configuration and query pruning. Users configure datasets via a graphical UI where nodes represent tables or SQL statements, and edges represent join keys. The system enforces two conditions for pruning:
All joins must be LEFT‑JOIN.
The join key on the right‑hand table must be unique (no duplicate values).
During query execution, the platform extracts the list of nodes that can be pruned based on the query’s dimensions, measures, and filters, removes them, and then joins the remaining tables to form the final BI query.
Query Pruning Steps
Obtain the list of prune‑eligible nodes from dataset metadata.
Flatten all fields used in the current query (dimensions, measures, filters) and exclude nodes that are not needed.
Derive the optimized node list after removal.
Join the remaining tables according to the defined join conditions and execute the BI query.
Materialized View Strategy
Materialized views are built on top of logical datasets to accelerate dashboard queries. Compared with traditional APP tables (CUBE), materialized views offer two key advantages:
Lower computational complexity: They aggregate wide tables directly without generating multiple intermediate copies, avoiding data explosion.
Higher reusability: Because a materialized view is bound to a logical dataset, any query that matches its pattern can reuse it, regardless of the originating dashboard.
The implementation uses StarRocks’ materialized view feature, but the native engine’s synchronous and asynchronous modes were found unsuitable for production due to high refresh overhead and scheduling constraints. Therefore, RedBI implements its own materialized‑view layer that leverages the platform’s existing job scheduler and can handle bitmap‑based UV metrics.
Technical Implementation
Configure materialized views via RedBI’s UI.
Monitor view freshness and trigger refresh jobs as needed.
Rewrite incoming queries: match query signatures against configured views, replace the original query with a materialized‑view query, and apply further optimizations (predicate push‑down, window‑function handling, engine‑specific functions).
Generate and schedule ETL tasks that pre‑compute aggregates and store them as compact materialized tables.
For UV metrics, the system encodes distinct identifiers into BITMAP structures (using Roaring Bitmap) and applies bucketed ID ranges to reduce skew and improve compression. After bucketing, query latency improved by up to five times.
Operational Considerations
Materialized view maintenance addresses three change scenarios:
Upstream dataset data refreshes trigger downstream view refreshes automatically.
Dashboard or template modifications may invalidate existing views; the platform notifies owners to retest and adjust view definitions.
Dataset schema changes (new or altered fields) cause view invalidation; alerts are sent to data owners for remediation.
Query rewriting relies on an internal abstract representation called LodQuery, which separates dimensions, measures, and join information, enabling flexible matching against materialized‑view definitions before converting to the engine‑specific TableQuery and finally to executable SQL.
Results and Impact
Since deployment, logical datasets have replaced most SQL datasets, covering over 100 datasets and accounting for 30% of BI queries. Materialized views have been applied to more than 40 datasets across transaction, live‑stream, and advertising domains, reducing average query time by 80% and achieving a 30% hit rate.
Future Directions
Planned improvements include:
Supporting high‑cardinality deduplication metrics with more efficient algorithms.
Developing smarter, resource‑aware refresh scheduling to minimize cluster impact.
Leveraging AI for automated lifecycle management of materialized views (creation, consolidation, retirement).
Extending logical datasets to handle multi‑granularity tables (e.g., combining daily and historical data) to further reduce the number of core datasets.
Overall, the logical‑dataset + materialized‑view architecture has paved the way for a more intelligent, scalable data‑warehouse ecosystem at Xiaohongshu.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Xiaohongshu Tech REDtech
Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
