
Optimizing Multi‑Dimensional User Count Computation in Feed Using Data Tagging

By deduplicating logs and assigning compact numeric tags to each user‑dimension combination, the data‑tagging method replaces costly lateral‑view expansions with a user‑level aggregation, cutting shuffle volume from terabytes to gigabytes and reducing runtime from 49 minutes to 14 minutes, enabling scalable multi‑dimensional user‑count analysis for Baidu Feed.

Baidu Tech Salon

Feed is a core component of the Baidu App, with daily active users (DAU) at the hundred-million level. For data analysis, product teams often need to count users across many log dimensions (product line, pay type, resource type, channel type, page type, and so on). Because user counts from different dimensions are not additive, each dimension, and each combination of dimensions, must be computed separately. With only a few dimensions a simple COUNT(DISTINCT …) suffices, but as the dimensions multiply the computation becomes extremely costly.
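The non-additivity is easy to see on toy data: a user who appears under several values of a dimension is counted once in each slice, so slice counts overlap. A minimal illustration (Python, hypothetical log rows):

```python
# Illustrative only: why per-dimension user counts are not additive.
logs = [
    ("u1", "video"), ("u1", "text"),   # u1 visits two resource types
    ("u2", "video"),
    ("u3", "text"),
]

video_users = {cuid for cuid, r_type in logs if r_type == "video"}
text_users = {cuid for cuid, r_type in logs if r_type == "text"}
all_users = {cuid for cuid, _ in logs}

print(len(video_users), len(text_users), len(all_users))  # 2 2 3
# 2 + 2 != 3: the "ALL" row needs its own distinct count, it cannot
# be derived by summing the per-value rows.
```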

A typical business requirement is to obtain a result table like the one below (dimensions on the left, metric "user count" on the right). All dimensions must also include an "ALL" row representing the overall count.

Table omitted for brevity – it shows combinations of product line, pay type, resource type, channel type, page type and the corresponding user counts (e.g., 2 million, 3 million, …).

Common implementation: calculate each dimension combination directly with distinct counts. This expands the data dramatically (the "data explosion" problem), because every original log row is duplicated for every dimension combination before aggregation.
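The explosion is easy to quantify: each of the five dimensions is expanded to its own value plus 'all', so every deduplicated row becomes 2^5 = 32 rows before the final GROUP BY. A sketch of that expansion (Python used for illustration; the row values are hypothetical):

```python
from itertools import product

# One deduplicated (cuid, dims) row from the log; values are hypothetical.
row = {"appid": "feed", "pay_type": "free", "r_type": "video",
       "tab_type": "rec", "page_type": "landing"}

# Mimic the chained `lateral view explode(array(col, 'all'))` calls:
# every dimension independently takes its real value or 'all'.
choices = [(v, "all") for v in row.values()]
expanded = list(product(*choices))

print(len(expanded))  # 32 == 2 ** 5 combinations per input row
```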

Core optimization idea – data tagging: instead of expanding the data, first deduplicate the log by cuid (user identifier) and the original dimensions, producing an intermediate user-level dataset. Then assign a numeric tag (dimension ID) to each required dimension combination and attach it to the user record. This "tagging" keeps the processing at user-level granularity, avoids the data explosion, and reduces shuffle volume.

2.1 Core concept: Create a mapping from each original dimension value (or combination) to a compact numeric ID using DENSE_RANK(). Store the list of IDs for each user in an array, then aggregate the arrays to obtain the distinct dimension IDs per user.
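What the encoding produces can be pictured outside SQL: sort the distinct expanded combinations and number them 1..N without gaps, exactly as `DENSE_RANK()` does. A sketch (Python used for illustration; the combinations are hypothetical and reduced to two dimensions):

```python
# Assign each distinct expanded dimension combination a compact numeric ID,
# mirroring DENSE_RANK() OVER (ORDER BY <dimension columns>).
combos = [
    ("feed", "all"), ("all", "all"), ("feed", "free"), ("all", "all"),
]

# DENSE_RANK: distinct combinations, sorted, numbered 1..N without gaps;
# duplicates collapse onto the same ID.
dim_id = {c: i + 1 for i, c in enumerate(sorted(set(combos)))}

print(dim_id)
# {('all', 'all'): 1, ('feed', 'all'): 2, ('feed', 'free'): 3}
```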

2.2 Code implementation – Direct calculation (the naive approach):

-- Table: feed_dws_kpi_dau_1d
-- Columns: appid##product line, pay_type##pay type, r_type##resource type, tab_type##channel type, page_type##page type, cuid##user ID
select
    appid_all,      -- product line
    pay_type_all,   -- pay type
    r_type_all,     -- resource type
    tab_type_all,   -- channel type
    page_type_all,  -- page type
    count(distinct cuid) as feed_dau
from(
    select
        cuid,
        appid,
        pay_type,
        r_type,
        tab_type,
        page_type
    from feed_dws_kpi_dau_1d
    group by 1,2,3,4,5,6
) tab
-- each lateral view doubles the rows: the real value plus 'all'
lateral view explode(array(appid, 'all')) b1 as appid_all
lateral view explode(array(pay_type, 'all')) b2 as pay_type_all
lateral view explode(array(r_type, 'all')) b3 as r_type_all
lateral view explode(array(tab_type, 'all')) b4 as tab_type_all
lateral view explode(array(page_type, 'all')) b5 as page_type_all
group by 1,2,3,4,5

2.3 Code implementation – data-tagging approach (optimized):

-- Derive the dimension combinations from the detail data and encode them
with dim_res as (
    select
        distinct
        appid_all,      -- product line
        pay_type_all,   -- pay type
        r_type_all,     -- resource type
        tab_type_all,   -- channel type
        page_type_all,  -- page type
        dim_key,
        DENSE_RANK() OVER(ORDER BY appid_all, pay_type_all, r_type_all, tab_type_all, page_type_all) AS dim_id
    from(
        select
            appid,
            pay_type,
            r_type,
            tab_type,
            page_type,
            -- key identifying the original (un-expanded) dimension combination
            concat_ws('#', coalesce(appid, 'unknown'), coalesce(pay_type, 'unknown'), coalesce(r_type, 'unknown'), coalesce(tab_type, 'unknown'), coalesce(page_type, 'unknown')) as dim_key
        from feed_dws_kpi_dau_1d
        group by 1,2,3,4,5,6
    ) t0
    lateral view explode(array(appid, 'all')) b1 as appid_all
    lateral view explode(array(pay_type, 'all')) b2 as pay_type_all
    lateral view explode(array(r_type, 'all')) b3 as r_type_all
    lateral view explode(array(tab_type, 'all')) b4 as tab_type_all
    lateral view explode(array(page_type, 'all')) b5 as page_type_all
),

-- Per user: collect the distinct set of dimension IDs the user falls under
cuid_dim as (
    select /*+ MAPJOIN(t1) */
        cuid,
        array_distinct(split(concat_ws(',', collect_set(concat_ws(',', dim_id_array))), ',')) as click_dim_id_array
    from(
        select
            cuid,
            concat_ws('#', coalesce(appid, 'unknown'), coalesce(pay_type, 'unknown'), coalesce(r_type, 'unknown'), coalesce(tab_type, 'unknown'), coalesce(page_type, 'unknown')) as dim_key
        from feed_dws_kpi_dau_1d
        group by 1,2
    ) t0
    join (
        select
            dim_key,
            -- cast to string so the IDs can later be joined with concat_ws
            collect_set(cast(dim_id as string)) as dim_id_array
        from dim_res
        group by dim_key
    ) t1 on t0.dim_key = t1.dim_key
    group by cuid
)

-- Map the dimension IDs back to readable dimension values
select /*+ MAPJOIN(t1) */
    appid_all,      -- product line
    pay_type_all,   -- pay type
    r_type_all,     -- resource type
    tab_type_all,   -- channel type
    page_type_all,  -- page type
    feed_dau
from(
    select
        dim_id,
        sum(feed_dau) as feed_dau
    from(
        -- users with identical ID sets are first counted once per set,
        -- so the explode below processes far fewer rows
        select
            concat_ws(',', click_dim_id_array) as dim_id_str,
            count(1) as feed_dau
        from cuid_dim
        group by 1
    ) tab
    lateral view explode(split(dim_id_str, ',')) b as dim_id
    group by dim_id
) t0
join (
    select
        distinct
        appid_all,      -- product line
        pay_type_all,   -- pay type
        r_type_all,     -- resource type
        tab_type_all,   -- channel type
        page_type_all,  -- page type
        dim_id
    from dim_res
) t1 on t0.dim_id = cast(t1.dim_id as string)
order by 1,2,3,4,5,6
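The whole pipeline can be sanity-checked on toy data: encode every expanded combination, attach the ID set to each user, explode the IDs, and confirm the per-ID user counts match direct distinct counts. A compact simulation (Python, hypothetical data, two dimensions instead of five):

```python
from itertools import product
from collections import defaultdict

# Toy log: (cuid, appid, r_type); values are hypothetical.
logs = [("u1", "feed", "video"), ("u1", "feed", "text"),
        ("u2", "feed", "video"), ("u3", "mini", "text")]

def expand(dims):
    # Each dimension takes its real value or 'all' (the lateral views).
    return list(product(*[(v, "all") for v in dims]))

# 1) Encode every distinct expanded combination (the dim_res CTE).
all_combos = sorted({c for _, *dims in logs for c in expand(dims)})
dim_id = {c: i + 1 for i, c in enumerate(all_combos)}

# 2) Per user, collect the distinct dimension IDs (the cuid_dim CTE).
user_ids = defaultdict(set)
for cuid, *dims in logs:
    for c in expand(dims):
        user_ids[cuid].add(dim_id[c])

# 3) Explode the IDs and count users per ID (the final aggregation).
tagged = defaultdict(int)
for ids in user_ids.values():
    for i in ids:
        tagged[i] += 1

# 4) Cross-check against a direct distinct count per combination.
direct = {c: len({cuid for cuid, *dims in logs if c in expand(dims)})
          for c in all_combos}
assert all(tagged[dim_id[c]] == direct[c] for c in all_combos)
print(tagged[dim_id[("all", "all")]])  # 3 distinct users overall
```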

Performance comparison:

Using lateral view + distinct (the data-explosion approach), the job runs for 49 minutes with a shuffle volume of ~16 TB (the expansion stage grows 3.3 billion rows to 1,707 billion rows).

Using the dimension-encoding (data-tagging) method, the job finishes in 14 minutes with a maximum shuffle of ~800 GB (only the user-level dimension-ID aggregation). The dominant remaining cost is sorting the dimension IDs, which can be pre-computed.

When the number of dimensions increases from 9 to 12, the explosion method would require ~120 TB of shuffle and fails outright, while the tagging method still completes in similar runtime (still dominated by the encoding stage).
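The scaling gap follows from simple arithmetic: each 'all'-expanded dimension doubles the row count, so adding three dimensions multiplies the exploded volume by eight, while the tagged user-level shuffle stays proportional to the number of users. A rough check of that factor (Python, illustration only):

```python
# Rough expansion factors: each 'all'-expanded dimension doubles row count.
rows_9 = 2 ** 9    # combinations per input row with 9 dimensions
rows_12 = 2 ** 12  # with 12 dimensions

print(rows_9, rows_12, rows_12 // rows_9)  # 512 4096 8
# Adding 3 dimensions multiplies the exploded shuffle volume by 8; the
# tagged user-level aggregation does not grow with the expansion at all.
```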

Conclusion:

The data-tagging approach dramatically reduces data explosion and shuffle traffic, making multi-dimensional user-count calculations scalable to dozens of dimensions and billions of users. It has been deployed in the Feed core scenario and the short-video business, supporting more than 10 dimensions. Future work includes encapsulating the more complex logic (array aggregation, deduplication) into UDFs and pre-computing dimension encodings for recurring jobs, to further improve readability and reduce cost.

Big Data · Performance Tuning · Hive · SQL Optimization · data tagging · dimensional aggregation
Written by

Baidu Tech Salon

Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
