Databases 20 min read

How StarRocks Materialized Views Enable High‑Concurrency Precise Deduplication

StarRocks’ materialized view feature lets Didi replace costly fuzzy deduplication with precise, high‑concurrency deduplication for real‑time dashboards, using global dictionary mapping, layered ODS/DWD/ADS views, synchronous and asynchronous refreshes, and transparent query rewrite to cut query latency by 80% and boost QPS dramatically.

StarRocks
StarRocks
StarRocks
How StarRocks Materialized Views Enable High‑Concurrency Precise Deduplication

In Didi’s real‑time dashboard scenario, billions of rows are generated daily and many key metrics require exact distinct‑count calculations. Traditional OLAP engines such as Druid resorted to fuzzy deduplication to reduce resource consumption, but this introduced significant errors and could not sustain the high query concurrency of large‑scale promotional events.

To address this, Didi adopted StarRocks as its unified OLAP engine and built a multi‑layered solution centered on materialized views that provides precise deduplication while keeping resource usage under control.

1. Data Pre‑processing with a Global Dictionary

Incoming raw data from the data warehouse is first synchronized to StarRocks via Flink. Inside StarRocks a global dictionary converts high‑cardinality STRING columns (e.g., user_id, order_id) to BIGINT IDs. The conversion uses the partial_update feature of primary‑key tables so that each distinct string receives a monotonically increasing integer, enabling efficient BITMAP aggregation.

CREATE MATERIALIZED VIEW mv_dwd AS
SELECT dt,
       time_slice(`table_time`, INTERVAL 5 minute, floor) AS ts,
       city,
       source_type,
       bitmap_union(to_bitmap(order_id))
FROM base_table
GROUP BY dt, ts, city, source_type;

2. ODS Layer – Storing Raw Detail Data

The transformed rows are stored in an ODS (Operational Data Store) layer that retains the original detail level. This layer serves as the source for downstream aggregation.

3. DWD Layer – Synchronous Materialized Views for Incremental Aggregation

At the DWD (Data Warehouse Detail) layer, synchronous materialized views are created for each dimension combination that requires distinct‑count. These views act like indexes: they are refreshed in real time, keep strong consistency, and store intermediate BITMAP results. Queries that contain count(distinct order_id) are automatically rewritten to bitmap_union_count(to_bitmap(order_id)) and hit the view.

4. ADS Layer – Asynchronous Materialized Views for Transparent Acceleration

For the final presentation layer (ADS), asynchronous materialized views are defined with a timed refresh (e.g., every 30 seconds). Although they may lag behind the source by a short window, they store the final aggregated result, acting as a persistent query cache. The refresh strategy reduces the cost of recomputing the same aggregation repeatedly.

CREATE MATERIALIZED VIEW `mv_ads`
PARTITION BY (dt)
DISTRIBUTED BY HASH(`ts`) BUCKETS 1
REFRESH ASYNC START("2023-06-28 21:00:00") EVERY(INTERVAL 30 SECOND)
PROPERTIES ("partition_refresh_number" = "3")
AS SELECT dt,
          time_slice(`table_time`, INTERVAL 5 minute, floor) AS ts,
          city,
          source_type,
          count(DISTINCT order_id) AS order_num
FROM base_table
GROUP BY dt, ts, city, source_type;

5. Accelerating Strategy – Transparent Query Rewrite

StarRocks’ query engine analyses incoming SQL, matches it against the existing materialized views, and rewrites the plan to use the most appropriate view (async → sync → raw). This process is completely transparent to the user and guarantees identical query semantics while delivering orders‑of‑magnitude speedups.

Example rewrite:

SELECT ts, SUM(order_num)
FROM (
  SELECT time_slice(`table_time`, INTERVAL 5 minute) AS ts,
         count(DISTINCT `order_id`) AS order_num
  FROM base_table
  WHERE ...
  GROUP BY dt, ts, city, source_type
) sub
WHERE dt = '2023-07-01'
GROUP BY ts;

The engine will automatically replace the inner count(DISTINCT order_id) with the pre‑computed bitmap aggregation from the corresponding materialized view.

6. Results and Limitations

Single‑query latency reduced by ~80%.

CPU consumption for deduplication lowered by ~95%.

Supported QPS increased by roughly 100× on the same cluster.

Complex data pipeline requiring manual view definition and maintenance.

Asynchronous views cannot guarantee strong consistency and may refresh even when dashboards are idle, causing some resource waste.

7. Future Directions

Improve BITMAP performance by adopting range‑based bucketization and leveraging Roaring BITMAP’s fastunion function.

Reduce resource overhead of async view refreshes through smarter partition‑level incremental updates.

Automate materialized view generation by mining frequent query patterns from historical logs.

Cache global dictionary tables in BE/CN memory to cut network latency for dictionary lookups.

Overall, the combination of global dictionary conversion, layered materialized views, and transparent query rewrite enables Didi to achieve high‑concurrency, precise deduplication for real‑time business monitoring, turning a previously prohibitive workload into a scalable, cost‑effective solution.

big dataStarRocksOLAPMaterialized ViewsPrecise DeduplicationQuery Acceleration
StarRocks
Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.