How Xiaohongshu Boosted Data Warehouse Performance with Logical Datasets and Materialized Views
Xiaohongshu introduced logical datasets and materialized views to overcome low reuse of APP tables, limited scalability of single‑table BI datasets, and poor dashboard query performance, achieving higher data processing efficiency and faster query responses through optimized data flow, query pruning, and accelerated ETL scheduling.
Background
Since the 2004 Google MapReduce paper, big‑data processing has become a core capability for internet companies. Xiaohongshu’s data warehouse follows a five‑layer model (ODS → DWD → DWS → DM → APP), runs batch jobs on Spark, serves data through OLAP engines, and uses an offline scheduler for BI tools and data products.
Challenges
Single‑table BI datasets scale poorly.
Wide‑table datasets cause slow dashboard queries.
Problem Analysis
The traditional processing model cannot simultaneously guarantee data universality and query performance. As the number of exported datasets grew, a method was needed to improve query speed without sacrificing reuse.
Logical Datasets
A logical dataset consists of a wide table (core metrics) and one or more dimension tables . Queries only join the dimensions required by the request, enabling query pruning and reducing unnecessary computation.
Execution Flow
Configure datasets via a graphical UI: each node represents a table or a SQL source, and edges define join keys.
Enable pruning when joins are LEFT JOIN and the right‑hand key is unique.
At execution time the system extracts the list of required nodes, removes pruned nodes, and joins the remaining tables.
Query‑Pruning Conditions
All joins must be LEFT JOIN.
The join key on the right side must have no duplicate values (uniqueness is verified during data preview).
Pruning Algorithm
From dataset metadata obtain the list of nodes that can be pruned.
Flatten the DSL (dimensions, measures, filters) to identify fields used in the current query and map them to node IDs.
Remove nodes that are not required, producing an optimized node list.
Join the remaining tables according to the defined join conditions and execute the BI query.
Materialized Views
Materialized views provide pre‑aggregated tables that accelerate dashboard queries while keeping the model count low. StarRocks offers synchronous and asynchronous materialized views; synchronous refresh incurs high load, while asynchronous refresh may lag behind data freshness. RedBI schedules materialized‑view jobs using its existing job scheduler, avoiding cluster contention.
Lifecycle
When upstream tables are refreshed, downstream materialized‑view tasks are triggered automatically.
Dashboard or template changes that invalidate a view generate alerts for re‑testing and updating.
Dataset field changes also trigger view‑invalidation notifications.
Materialized‑View Generation & ETL
After a view is configured in RedBI’s UI, a materialized‑ETL task is generated. The task extracts the aggregation dimensions and measures from the LodQuery representation. Simple measures (SUM, COUNT, MAX, MIN) are aggregated directly; ratio measures are split into numerator and denominator; UV metrics are materialized as BITMAP columns.
UV bitmap handling uses Roaring Bitmap. IDs are compacted into numeric buckets (e.g., 1‑1,000,000 per bucket) to improve compression and query speed. Containers smaller than 4 KB use Array storage, larger ones switch to Bitset and optionally to RunLen encoding.
Query Rewrite
When a user submits a query, RedBI converts the UI configuration into a QueryDSL, parses it into a LodQuery (which separates dimensions, measures, from, filters, etc.), and then matches the query against the list of materialized views sorted by priority. If a view matches, the LodQuery is rewritten to a MaterializedQuery, which is further optimized (predicate push‑down, window‑function handling) and finally translated to a TableQuery SQL sent to StarRocks.
Benefits
By early 2024 more than 100 logical datasets were deployed, accounting for ~30 % of all queries and surpassing SQL datasets. Over 40 datasets with materialized views achieved an average 80 % reduction in query latency (most queries < 4 s) and a 30 % hit rate on dashboard workloads.
Future Outlook
Planned improvements include:
Supporting multi‑granularity logical datasets for real‑time stream analysis.
Enhancing bitmap handling for high‑cardinality UV metrics (better encoding, bucket sizing).
Applying AI/ML to automate materialized‑view lifecycle management, selection, and tuning, further reducing engineering effort.
Architecture Diagrams
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
