Big Data 14 min read

How Didi Scaled Real‑Time Funnel Analysis with StarRocks: Architecture, Design, and Performance Tips

Didi's data architecture team migrated high‑volume, real‑time funnel analysis from ClickHouse to StarRocks, built a multi‑layer pipeline with Kafka, Flink/Spark, Hive, and materialized views, and achieved sub‑3‑second query times on billions of rows, while outlining future enhancements.

StarRocks

Sep 24, 2021

Demand Introduction

Existing funnel analysis dashboards were fragmented, each supporting only a single scenario, lacking flexible step selection, drill‑down, and cross‑scenario comparison. The team needed a unified tool that could handle massive traffic data, allow dynamic dimension filtering, support custom funnel steps, and pinpoint churn and conversion users for precise operational optimization.

Technical Selection

StarRocks was chosen over ClickHouse due to better usability, maintainability, and superior SQL monitoring and operations features. Its ability to create materialized views on detailed tables enables faster multi‑dimensional analysis.

System Architecture

The pipeline consists of six layers:

Data Sources : Web and client event logs.

Data Ingestion Layer : A data bus validates and cleans logs, then pushes them to a Kafka cluster, which decouples ingestion from downstream processing.

Data Compute & Storage Layer : Kafka data is processed by Flink or Spark for real‑time and batch ETL, then stored in StarRocks.

Data Service Layer : Provides unified metric definitions and query interfaces for both offline and real‑time access.

Funnel Analysis System : Enables flexible funnel creation, editing, viewing, and export of detailed data.

Data Middle‑Platform : Offers metadata management, data catalog, and job scheduling to improve data production and consumption efficiency.

Detailed Design – Global Dictionary for Bitmap Aggregation

StarRocks bitmap columns accept only integer values, but the original user_id contains alphanumeric strings. To enable bitmap calculations, a global dictionary mapping each original user_id to a unique integer ID was built using Hive tables.

Steps:

Create a temporary table with distinct user_id values:

create table 'temp_table' { 'user_id' string COMMENT '原始表去重后的用户ID' }

Insert distinct IDs:

insert overwrite table temp_table select user_id from fact_log_user_hive_table group by user_id

Define the global dictionary table:

create table 'global_dict_by_userid_hive_table' { 'user_id' string COMMENT '原始用户ID', 'new_user_id' int COMMENT '对原始用户ID编码后的整型用户ID' }

Left‑join the temporary table with the dictionary, assign new IDs using the current max ID plus row number, and upsert into the dictionary.

--4 更新Hive字典表
insert overwrite global_dict_by_userid_hive_table
select user_id, new_user_id from global_dict_by_userid_hive_table
--3 与历史的字段数据求并集
union all select t1.user_id,
(row_number() over(order by t1.user_id) + t2.max_id) as new_user_id
--1 获得新增的去重值集合
from (select user_id from temp_table where user_id is not null) t1
left join (select user_id, new_user_id, (max(new_user_id) over()) as max_id from global_dict_by_userid_hive_table) t2
on t1.user_id = t2.user_id where t2.new_user_id is null

Update the original fact table with the new integer IDs:

insert overwrite fact_log_user_hive_table
select a.user_id, b.new_user_id
from fact_log_user_hive_table a left join global_dict_by_userid_hive_table b on a.user_id=b.user_id

Create the StarRocks detail table and sync data via Spark:

CREATE TABLE `fact_log_user_doris_table` (
  `new_user_id` bigint(20) NULL COMMENT "整型用户id",
  `user_id` varchar(65533) NULL COMMENT "用户id",
  `event_source` varchar(65533) NULL COMMENT "端（1：商城小程序 2：团长小程序 3：独立APP 4：主端）",
  `is_new` varchar(65533) NULL COMMENT "是否新用户",
  `identity` varchar(65533) NULL COMMENT "用户身份(团长或者普通用户)",
  `biz_channel_name` varchar(65533) NULL COMMENT "当天首次落地页渠道名称",
  `pro_id` varchar(65533) NULL COMMENT "省ID",
  `pro_name` varchar(65533) NULL COMMENT "省名称",
  `city_id` varchar(65533) NULL COMMENT "城市ID",
  `city_name` varchar(65533) NULL COMMENT "城市名称",
  `dt` date NULL COMMENT "分区",
  `period_type` varchar(65533) NULL DEFAULT "daily" COMMENT ""
) ENGINE=OLAP DUPLICATE KEY(`index_id`,`user_id`,`biz_channel_name`,`pro_id`,`city_id`) PARTITION BY RANGE(`dt`)(
  PARTITION p20210731 VALUES [('2021-07-31'), ('2021-08-01')),
  ...
) DISTRIBUTED BY HASH(`index_id`,`user_id`) BUCKETS 10 PROPERTIES (
  "replication_num" = "3",
  "dynamic_partition.enable" = "true",
  "dynamic_partition.time_unit" = "DAY",
  "dynamic_partition.time_zone" = "Asia/Shanghai",
  "dynamic_partition.start" = "-2147483648",
  "dynamic_partition.end" = "1",
  "dynamic_partition.prefix" = "p",
  "dynamic_partition.replication_num" = "-1",
  "dynamic_partition.buckets" = "3",
  "in_memory" = "false",
  "storage_format" = "DEFAULT"
);

To accelerate distinct count queries, a materialized view with bitmap_union was created:

create materialized view city_user_count as
select city_id, bitmap_union(to_bitmap(new_user_id))
from fact_log_user_doris_table
group by city_id;

StarRocks guarantees that count(distinct) and bitmap_union_count produce identical results, while the bitmap‑based view dramatically reduces query latency.

Final Effect and Benefits

After optimization, precise distinct‑count queries across 150 city IDs on billions of rows complete in under 3 seconds. The funnel analysis dashboard now supports flexible step selection, multi‑dimensional drilling, and rapid insight generation for product and operations teams.

Future Plans

Integrate StarRocks tooling with Didi’s internal data scheduling and development platforms for one‑click ingestion from MySQL, Elasticsearch, Hive, etc.

Advance StarRocks’ streaming‑batch unified model, leveraging both update and detail models with materialized views, and roll it out across all Orange Heart products.

Combine StarRocks with Elasticsearch to enable unified OLAP queries over heterogeneous data sources, accelerating data‑driven value creation.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data data pipeline StarRocks Hive Spark materialized view Funnel Analysis

Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.