Big Data 24 min read

How Xiaohongshu Boosted Data Warehouse Performance with Logical Datasets and Materialized Views

Xiaohongshu introduced logical datasets and materialized views to overcome low reuse of APP tables, limited scalability of single‑table BI datasets, and poor dashboard query performance, achieving higher data processing efficiency and faster query responses through optimized data flow, query pruning, and accelerated ETL scheduling.

dbaplus Community
dbaplus Community
dbaplus Community
How Xiaohongshu Boosted Data Warehouse Performance with Logical Datasets and Materialized Views

Background

Since the 2004 Google MapReduce paper, big‑data processing has become a core capability for internet companies. Xiaohongshu’s data warehouse follows a five‑layer model (ODS → DWD → DWS → DM → APP), runs batch jobs on Spark, serves data through OLAP engines, and uses an offline scheduler for BI tools and data products.

Challenges

Single‑table BI datasets scale poorly.

Wide‑table datasets cause slow dashboard queries.

Problem Analysis

The traditional processing model cannot simultaneously guarantee data universality and query performance. As the number of exported datasets grew, a method was needed to improve query speed without sacrificing reuse.

Logical Datasets

A logical dataset consists of a wide table (core metrics) and one or more dimension tables . Queries only join the dimensions required by the request, enabling query pruning and reducing unnecessary computation.

Execution Flow

Configure datasets via a graphical UI: each node represents a table or a SQL source, and edges define join keys.

Enable pruning when joins are LEFT JOIN and the right‑hand key is unique.

At execution time the system extracts the list of required nodes, removes pruned nodes, and joins the remaining tables.

Query‑Pruning Conditions

All joins must be LEFT JOIN.

The join key on the right side must have no duplicate values (uniqueness is verified during data preview).

Pruning Algorithm

From dataset metadata obtain the list of nodes that can be pruned.

Flatten the DSL (dimensions, measures, filters) to identify fields used in the current query and map them to node IDs.

Remove nodes that are not required, producing an optimized node list.

Join the remaining tables according to the defined join conditions and execute the BI query.

Materialized Views

Materialized views provide pre‑aggregated tables that accelerate dashboard queries while keeping the model count low. StarRocks offers synchronous and asynchronous materialized views; synchronous refresh incurs high load, while asynchronous refresh may lag behind data freshness. RedBI schedules materialized‑view jobs using its existing job scheduler, avoiding cluster contention.

Lifecycle

When upstream tables are refreshed, downstream materialized‑view tasks are triggered automatically.

Dashboard or template changes that invalidate a view generate alerts for re‑testing and updating.

Dataset field changes also trigger view‑invalidation notifications.

Materialized‑View Generation & ETL

After a view is configured in RedBI’s UI, a materialized‑ETL task is generated. The task extracts the aggregation dimensions and measures from the LodQuery representation. Simple measures (SUM, COUNT, MAX, MIN) are aggregated directly; ratio measures are split into numerator and denominator; UV metrics are materialized as BITMAP columns.

UV bitmap handling uses Roaring Bitmap. IDs are compacted into numeric buckets (e.g., 1‑1,000,000 per bucket) to improve compression and query speed. Containers smaller than 4 KB use Array storage, larger ones switch to Bitset and optionally to RunLen encoding.

Query Rewrite

When a user submits a query, RedBI converts the UI configuration into a QueryDSL, parses it into a LodQuery (which separates dimensions, measures, from, filters, etc.), and then matches the query against the list of materialized views sorted by priority. If a view matches, the LodQuery is rewritten to a MaterializedQuery, which is further optimized (predicate push‑down, window‑function handling) and finally translated to a TableQuery SQL sent to StarRocks.

Benefits

By early 2024 more than 100 logical datasets were deployed, accounting for ~30 % of all queries and surpassing SQL datasets. Over 40 datasets with materialized views achieved an average 80 % reduction in query latency (most queries < 4 s) and a 30 % hit rate on dashboard workloads.

Future Outlook

Planned improvements include:

Supporting multi‑granularity logical datasets for real‑time stream analysis.

Enhancing bitmap handling for high‑cardinality UV metrics (better encoding, bucket sizing).

Applying AI/ML to automate materialized‑view lifecycle management, selection, and tuning, further reducing engineering effort.

Architecture Diagrams

Data processing architecture
Data processing architecture
Logical dataset and materialized view architecture
Logical dataset and materialized view architecture
Logical dataset execution flow
Logical dataset execution flow
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataQuery Optimizationlogical dataset
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.