Optimizing Multi-Dimensional User Count Statistics in Big Data Computing: A Data Tagging Approach
By replacing exponential row expansion with a data‑tagging strategy that encodes dimension combinations and aggregates at the user level, the authors cut Baidu Feed’s multi‑dimensional user‑count computation time from 49 to 14 minutes and shuffle size from 16 TB to 800 GB, enabling scalable analysis across dozens of dimensions for billions of daily users.
This article presents an optimization method for calculating multi-dimensional user counts in big data computing scenarios, specifically applied to Baidu's Feed business with billions of daily active users.
Background: The Feed business requires analyzing user counts across multiple dimensions (product line, payment type, resource type, channel type, page type). Each dimension needs both individual and aggregate (ALL) values, resulting in exponential data expansion when using traditional methods.
Traditional Approach: Use lateral view explode to expand each row into every dimension combination, then compute distinct counts. Row count grows exponentially with the number of dimensions (2^n: 2^2 = 4x for 2 dimensions, 8x for 3), producing massive shuffle data (16 TB in the case study) and long execution times (49 minutes).
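To make the expansion concrete, here is a minimal Python sketch of the explode-then-distinct-count pattern on a toy dataset (all names and values hypothetical; the real implementation is Hive SQL over billions of rows):

```python
from itertools import product

# Hypothetical mini-dataset: (user_id, product_line, pay_type) click events.
events = [
    ("u1", "feed", "free"),
    ("u1", "feed", "paid"),
    ("u2", "video", "free"),
]

# "lateral view explode" analogue: every row fans out into 2^d rows,
# one per choice of keeping each dimension value or replacing it with 'ALL'.
def explode(row):
    uid, *dims = row
    for combo in product(*[(v, "ALL") for v in dims]):
        yield combo, uid

# Distinct user count per dimension combination.
counts = {}
for row in events:
    for combo, uid in explode(row):
        counts.setdefault(combo, set()).add(uid)
dau = {combo: len(users) for combo, users in counts.items()}

# With d = 2 dimensions each row became 2^2 = 4 rows before the distinct
# count; the shuffle volume grows exponentially with the dimension count.
print(dau[("ALL", "ALL")])  # 2 distinct users overall
```

The fan-out happens before any aggregation, which is why the shuffle size explodes as dimensions are added.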
Optimized Approach - Data Tagging Strategy:
1. Generate dimension encoding table: Create dimension combinations with ALL values, assign numeric IDs using DENSE_RANK()
2. Join encoding back to user details: Use MAPJOIN to associate dimension IDs with user-level data
3. Aggregate at user level: Use array_distinct to collect all dimension IDs for each user
4. Count and map back: Count occurrences of each dimension ID and join with original dimension mappings
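The four steps above can be sketched in Python on the same toy data (names hypothetical; in production each step is a Hive SQL stage):

```python
from itertools import product

# Hypothetical mini-dataset: (user_id, product_line, pay_type) events.
events = [
    ("u1", "feed", "free"),
    ("u1", "feed", "paid"),
    ("u2", "video", "free"),
]

def expand(dims):
    # All 2^d combinations of a dimension tuple with 'ALL' substitutions.
    return product(*[(v, "ALL") for v in dims])

# Step 1: dimension encoding table. Assign each observed combination a
# dense numeric id -- the Python analogue of DENSE_RANK() over the
# sorted combinations. Each raw dim_key maps to the ids it covers.
all_combos = sorted({c for _, *dims in events for c in expand(dims)})
dim_id = {combo: i + 1 for i, combo in enumerate(all_combos)}
key_to_ids = {}
for _, *dims in events:
    key = tuple(dims)
    key_to_ids[key] = [dim_id[c] for c in expand(key)]

# Steps 2-3: join the id arrays onto user rows (the MAPJOIN analogue)
# and collect the distinct dim ids each user touches
# (the collect_set/array_distinct analogue).
user_ids = {}
for uid, *dims in events:
    user_ids.setdefault(uid, set()).update(key_to_ids[tuple(dims)])

# Step 4: count users per dimension id, then map ids back to
# readable dimension combinations.
dau = {}
for ids in user_ids.values():
    for i in ids:
        dau[i] = dau.get(i, 0) + 1
readable = {combo: dau[i] for combo, i in dim_id.items()}
print(readable[("ALL", "ALL")])  # 2
```

The key difference from the traditional approach: rows never multiply. Each user collapses to one compact set of dimension ids, so the pipeline converges as it aggregates instead of expanding before the distinct count.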
Performance Results: Execution time dropped from 49 minutes to 14 minutes, and shuffle data from 16 TB to 800 GB. When the dimension count grew from 9 to 12, the traditional approach would have required roughly 120 TB of shuffle and failed to execute, while the optimized approach completed successfully.
Code Implementation:
-- Generate dimension encoding table with DENSE_RANK()
with dim_res as (
    select distinct appid_all, pay_type_all, r_type_all, tab_type_all, page_type_all,
        dim_key,
        DENSE_RANK() OVER (ORDER BY appid_all, pay_type_all, r_type_all, tab_type_all, page_type_all) AS dim_id
    from (...) lateral view explode(array(..., 'all')) B as ...
),
-- Aggregate user-level data with dimension encoding combinations
cuid_dim as (
    select cuid, array_distinct(collect_set(dim_id_arry)) as click_dim_id_arry
    from t0
    join (select dim_key, collect_set(dim_id) as dim_id_arry from dim_res group by dim_key) t1
        on t0.dim_key = t1.dim_key
    group by cuid
)
-- Final aggregation and dimension mapping
select appid_all, pay_type_all, r_type_all, tab_type_all, page_type_all, feed_dau
from (...) t0
join (select distinct appid_all, ..., dim_id from dim_res) t1
    on t0.dim_id = t1.dim_id
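As a sanity check on the logic, a compact Python sketch (toy data, hypothetical names) can confirm that the tagging pipeline reproduces exactly the counts the explode-based approach would produce:

```python
from itertools import product

events = [("u1", "feed", "free"), ("u1", "feed", "paid"),
          ("u2", "video", "free"), ("u3", "feed", "free")]

def expand(dims):
    # All 2^d combinations with 'ALL' substitutions.
    return list(product(*[(v, "ALL") for v in dims]))

# Naive explode: distinct users per combination.
naive = {}
for uid, *dims in events:
    for c in expand(dims):
        naive.setdefault(c, set()).add(uid)
naive = {c: len(users) for c, users in naive.items()}

# Tagging: encode combinations, collect per-user id sets, count.
ids = {c: i for i, c in enumerate(sorted(naive))}  # DENSE_RANK analogue
per_user = {}
for uid, *dims in events:
    per_user.setdefault(uid, set()).update(ids[c] for c in expand(dims))
tagged = {}
for id_set in per_user.values():
    for i in id_set:
        tagged[i] = tagged.get(i, 0) + 1
tagged = {c: tagged[i] for c, i in ids.items()}

assert tagged == naive  # same answer, but aggregation happens per user
```

Both paths agree on every combination; the difference is purely in where the work happens (per-user convergence versus per-row expansion).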
Conclusion: The data tagging approach transforms the calculation from data expansion to data convergence. As dimensions increase, the performance advantage becomes more significant. This solution has been deployed in Baidu's Feed core scenarios, supporting 10+ dimensions with billions of users.
Baidu Geek Talk