
Real-time Precise Deduplication Using StarRocks Materialized Views at Didi

Didi pairs StarRocks materialized views with a global dictionary and bitmap aggregation to run real-time precise deduplication over high-cardinality data. With transparent query rewriting and managed view refreshes, the approach cut query latency by roughly 80%, reduced resource consumption by about 95%, and raised sustainable concurrent QPS by up to 100x. Further automation and bitmap optimizations are planned.

Didi Tech

Didi introduced StarRocks in 2022 and, after more than a year of effort, it became the company's primary OLAP engine. By December 2023, Didi was operating over 40 StarRocks clusters, handling daily query volumes in the tens of millions and supporting almost all business lines, including ride-hailing, bike-sharing, energy, and freight. This article focuses on the practical application of StarRocks materialized views for precise deduplication.

Real‑time data insight is indispensable for business management. Didi's ride‑hailing real‑time dashboard is a critical monitoring tool, containing more than 20 key metrics (e.g., real‑time call volume, bubble count, GMV). These metrics enable rapid response to market fluctuations and informed decision‑making. However, calculating these metrics often requires massive precise distinct counting (count‑distinct) on high‑cardinality string identifiers, which puts extreme pressure on resources during peak traffic.

Previously, Didi relied on Druid with approximate deduplication to reduce cost, but this introduced calculation errors and could not sustain the high concurrency of large‑scale promotional events, leading to performance degradation or system crashes.

To address these challenges, Didi adopted StarRocks materialized views, which provide high‑concurrency precise deduplication with controllable resource consumption.

The dashboard workload has three notable characteristics:

High‑cardinality precise deduplication: Daily incremental data reaches billions of rows, and distinct counting is performed on string user or order IDs.

Flexible dimension filtering: Queries involve more than ten dimension fields (time, city, business line, etc.) with thousands of possible dimension combinations.

High query concurrency: During large promotions, thousands of users may query the dashboard simultaneously, generating hundreds of QPS. Directly querying raw detail tables would be prohibitively expensive.

StarRocks offers three key capabilities that solve these problems:

Materialized query results: A materialized view caches the result of a SQL query as a physical table, providing superior performance under high concurrency compared to traditional indexes.

Managed refresh process: Views can be refreshed automatically on a schedule or incrementally at the partition level, reducing refresh cost.

Transparent query rewrite: The engine automatically rewrites user queries to hit the appropriate materialized view without user awareness.

The overall optimization pipeline consists of five layers:

Data preprocessing: A global dictionary converts high‑cardinality string columns to BIGINT IDs, enabling efficient bitmap aggregation.

ODS layer: Stores raw detail data in a StarRocks table.

DWD layer: Builds an incremental aggregation layer using synchronous materialized views.

ADS layer: Constructs a transparent acceleration layer with asynchronous materialized views.

Acceleration strategy: Leverages intelligent rewrite to automatically select the best view for a given query.

Data preprocessing details:

A global dictionary table is created with an auto‑increment BIGINT primary key and the original STRING column as the key. Each distinct string receives a unique BIGINT.

A user‑defined function dict_mapping(dict_table, key) looks up the dictionary at query time, returning the BIGINT ID. The function uses StarRocks primary‑key indexing for high performance.

The Flink-StarRocks connector is modified to write first to the dictionary table (ensuring ID generation) and then to the detail-model table, applying dict_mapping to replace strings with IDs.
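The steps above can be sketched in StarRocks SQL roughly as follows. Table and column names (order_id_dict, staging_orders) are hypothetical; AUTO_INCREMENT columns and the dict_mapping function are standard StarRocks features, but exact DDL details (buckets, properties) will vary by deployment:

```sql
-- Hypothetical global dictionary: AUTO_INCREMENT assigns each
-- distinct string key a unique BIGINT exactly once.
CREATE TABLE order_id_dict (
    order_id STRING NOT NULL,
    id       BIGINT NOT NULL AUTO_INCREMENT
)
PRIMARY KEY (order_id)
DISTRIBUTED BY HASH (order_id);

-- Insert only keys not yet in the dictionary, so existing keys
-- keep their IDs (LEFT ANTI JOIN filters out known keys).
INSERT INTO order_id_dict (order_id)
SELECT DISTINCT s.order_id
FROM staging_orders s
LEFT ANTI JOIN order_id_dict d ON s.order_id = d.order_id;

-- dict_mapping resolves the BIGINT ID via the primary-key index,
-- e.g. while loading the detail table or building bitmaps.
SELECT dict_mapping('order_id_dict', order_id) AS order_id_int
FROM staging_orders;
```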

Example of creating a synchronous materialized view for the DWD layer (bitmap aggregation):

CREATE MATERIALIZED VIEW mv_dwd AS
SELECT dt,
       time_slice(`table_time`, INTERVAL 5 minute, floor) AS `ts`,
       city,
       source_type,
       bitmap_union(to_bitmap(order_id)) AS order_id_bitmap
FROM base_table
GROUP BY dt, ts, city, source_type;

Example of an asynchronous materialized view for the ADS layer:

CREATE MATERIALIZED VIEW `mv_ads`
PARTITION BY (dt)
DISTRIBUTED BY HASH(`ts`) BUCKETS 1
REFRESH ASYNC START("2023-06-28 21:00:00") EVERY(INTERVAL 30 SECOND) PROPERTIES (
  "partition_refresh_number" = "3"
)
AS SELECT `dt`,
          time_slice(`table_time`, INTERVAL 5 minute, floor) AS `ts`,
          `city`,
          `source_type`,
          count(DISTINCT `order_id`) AS `order_num`
FROM `base_table`
GROUP BY `dt`, `ts`, `city`, `source_type`;

To limit the number of asynchronous views, dimensions are classified as additive or non‑additive. Only one view is needed for each combination of non‑additive dimensions, dramatically reducing view count from 2^N to 2^(N‑M), where M is the number of additive dimensions.
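As an illustration of an additive dimension (assuming, hypothetically, that each order_id belongs to exactly one city): the global distinct count then equals the sum of per-city distinct counts, so the view grouped by city also answers city-agnostic queries and no separate "without city" view is needed:

```sql
-- Sketch against the mv_ads example above: rolling up the
-- additive dimension `city` with a plain SUM stays exact.
SELECT dt, ts, SUM(order_num) AS order_num
FROM mv_ads
GROUP BY dt, ts;
```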

Transparent rewrite example:

SELECT ts, SUM(order_num)
FROM (
  SELECT `dt`,
         time_slice(`table_time`, INTERVAL 5 MINUTE) AS ts,
         count(DISTINCT `order_id`) AS `order_num`
  FROM `base_table`
  WHERE (...)
  GROUP BY `dt`, ts, `city`, `source_type`
) sub
WHERE dt = '2023-07-01'
GROUP BY ts;

The engine automatically rewrites the inner count(DISTINCT order_id) to bitmap_union_count(to_bitmap(order_id)) and routes the query to the appropriate materialized view.
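In effect, the rewritten inner aggregation looks roughly like the following (a sketch; in practice the optimizer reads the pre-computed bitmap column from the matching materialized view rather than rebuilding bitmaps from the detail table):

```sql
-- Equivalent bitmap form of count(DISTINCT order_id):
SELECT `dt`,
       time_slice(`table_time`, INTERVAL 5 MINUTE) AS ts,
       bitmap_union_count(to_bitmap(`order_id`)) AS `order_num`
FROM `base_table`
WHERE (...)
GROUP BY `dt`, ts, `city`, `source_type`;
```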

Performance impact:

Single‑query latency reduced by ~80% and resource consumption lowered by ~95%.

Supported QPS increased by up to 100× on the same cluster size.

Precise deduplication queries now sustain hundreds of concurrent requests, whereas the previous Druid setup supported only tens of concurrent approximate-deduplication queries.

Current limitations and future work include the complexity of view management, lack of strong consistency for async refreshes, and opportunities to improve bitmap performance (e.g., range‑based bitmap partitioning, Roaring bitmap fast‑union). Didi plans to collaborate with the StarRocks community to:

Enhance bitmap computation efficiency.

Reduce resource waste in async view refreshes.

Automate view creation based on high‑frequency query patterns, improving usability and scalability.

Develop a more efficient global dictionary that caches directly in BE/CN memory.

Tags: Big Data, real-time analytics, StarRocks, OLAP, Materialized Views, Precise Deduplication