Optimizing Multi-Dimensional User Count Statistics in Big Data Computing: A Data Tagging Approach
By replacing exponential row expansion with a data‑tagging strategy that encodes dimension combinations and aggregates at the user level, the authors cut Baidu Feed’s multi‑dimensional user‑count computation time from 49 to 14 minutes and shuffle size from 16 TB to 800 GB, enabling scalable analysis across dozens of dimensions for billions of daily users.
This article presents an optimization method for calculating multi-dimensional user counts in big data computing scenarios, specifically applied to Baidu's Feed business with billions of daily active users.
Background: The Feed business requires analyzing user counts across multiple dimensions (product line, payment type, resource type, channel type, page type). Each dimension needs both individual and aggregate (ALL) values, resulting in exponential data expansion when using traditional methods.
Traditional Approach: Use lateral view explode to expand each row into every dimension combination, then compute distinct counts. Row count grows exponentially with the number of dimensions (2^n: 2^2 = 4x for 2 dimensions, 8x for 3), producing massive shuffle data (16 TB in the case study) and long execution times (49 minutes).
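To make the expansion concrete, here is a minimal Python sketch of the explode-then-distinct-count pattern on a toy dataset (all names and values hypothetical; the real implementation is Hive SQL over billions of rows):

```python
from itertools import product

# Hypothetical mini-dataset: (user_id, product_line, pay_type) click events.
events = [
    ("u1", "feed", "free"),
    ("u1", "feed", "paid"),
    ("u2", "video", "free"),
]

# "lateral view explode" analogue: every row fans out into 2^d rows,
# one per choice of keeping each dimension value or replacing it with 'ALL'.
def explode(row):
    uid, *dims = row
    for combo in product(*[(v, "ALL") for v in dims]):
        yield combo, uid

# Distinct user count per dimension combination.
counts = {}
for row in events:
    for combo, uid in explode(row):
        counts.setdefault(combo, set()).add(uid)
dau = {combo: len(users) for combo, users in counts.items()}

# With d = 2 dimensions each row became 2^2 = 4 rows before the distinct
# count; the shuffle volume grows exponentially with the dimension count.
print(dau[("ALL", "ALL")])  # 2 distinct users overall
```

The fan-out happens before any aggregation, which is why the shuffle size explodes as dimensions are added.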
Optimized Approach - Data Tagging Strategy:
1. Generate dimension encoding table: Create dimension combinations with ALL values, assign numeric IDs using DENSE_RANK()
2. Join encoding back to user details: Use MAPJOIN to associate dimension IDs with user-level data
3. Aggregate at user level: Use array_distinct to collect all dimension IDs for each user
4. Count and map back: Count occurrences of each dimension ID and join with original dimension mappings
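The four steps above can be sketched in Python on the same toy data (names hypothetical; in production each step is a Hive SQL stage):

```python
from itertools import product

# Hypothetical mini-dataset: (user_id, product_line, pay_type) events.
events = [
    ("u1", "feed", "free"),
    ("u1", "feed", "paid"),
    ("u2", "video", "free"),
]

def expand(dims):
    # All 2^d combinations of a dimension tuple with 'ALL' substitutions.
    return product(*[(v, "ALL") for v in dims])

# Step 1: dimension encoding table. Assign each observed combination a
# dense numeric id -- the Python analogue of DENSE_RANK() over the
# sorted combinations. Each raw dim_key maps to the ids it covers.
all_combos = sorted({c for _, *dims in events for c in expand(dims)})
dim_id = {combo: i + 1 for i, combo in enumerate(all_combos)}
key_to_ids = {}
for _, *dims in events:
    key = tuple(dims)
    key_to_ids[key] = [dim_id[c] for c in expand(key)]

# Steps 2-3: join the id arrays onto user rows (the MAPJOIN analogue)
# and collect the distinct dim ids each user touches
# (the collect_set/array_distinct analogue).
user_ids = {}
for uid, *dims in events:
    user_ids.setdefault(uid, set()).update(key_to_ids[tuple(dims)])

# Step 4: count users per dimension id, then map ids back to
# readable dimension combinations.
dau = {}
for ids in user_ids.values():
    for i in ids:
        dau[i] = dau.get(i, 0) + 1
readable = {combo: dau[i] for combo, i in dim_id.items()}
print(readable[("ALL", "ALL")])  # 2
```

The key difference from the traditional approach: rows never multiply. Each user collapses to one compact set of dimension ids, so the pipeline converges as it aggregates instead of expanding before the distinct count.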
Performance Results: Execution time dropped from 49 minutes to 14 minutes, and shuffle data from 16 TB to 800 GB. When the dimension count grew from 9 to 12, the traditional approach would have required roughly 120 TB of shuffle and failed to execute, while the optimized approach completed successfully.
Code Implementation:
-- Generate dimension encoding table with DENSE_RANK()
with dim_res as (
    select distinct appid_all, pay_type_all, r_type_all, tab_type_all, page_type_all,
        dim_key,
        DENSE_RANK() OVER (ORDER BY appid_all, pay_type_all, r_type_all, tab_type_all, page_type_all) AS dim_id
    from (...) lateral view explode(array(..., 'all')) B as ...
),
-- Aggregate user-level data with dimension encoding combinations
cuid_dim as (
    select cuid, array_distinct(collect_set(dim_id_arry)) as click_dim_id_arry
    from t0
    join (select dim_key, collect_set(dim_id) as dim_id_arry from dim_res group by dim_key) t1
        on t0.dim_key = t1.dim_key
    group by cuid
)
-- Final aggregation and dimension mapping
select appid_all, pay_type_all, r_type_all, tab_type_all, page_type_all, feed_dau
from (...) t0
join (select distinct appid_all, ..., dim_id from dim_res) t1
    on t0.dim_id = t1.dim_id
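As a sanity check on the logic, a compact Python sketch (toy data, hypothetical names) can confirm that the tagging pipeline reproduces exactly the counts the explode-based approach would produce:

```python
from itertools import product

events = [("u1", "feed", "free"), ("u1", "feed", "paid"),
          ("u2", "video", "free"), ("u3", "feed", "free")]

def expand(dims):
    # All 2^d combinations with 'ALL' substitutions.
    return list(product(*[(v, "ALL") for v in dims]))

# Naive explode: distinct users per combination.
naive = {}
for uid, *dims in events:
    for c in expand(dims):
        naive.setdefault(c, set()).add(uid)
naive = {c: len(users) for c, users in naive.items()}

# Tagging: encode combinations, collect per-user id sets, count.
ids = {c: i for i, c in enumerate(sorted(naive))}  # DENSE_RANK analogue
per_user = {}
for uid, *dims in events:
    per_user.setdefault(uid, set()).update(ids[c] for c in expand(dims))
tagged = {}
for id_set in per_user.values():
    for i in id_set:
        tagged[i] = tagged.get(i, 0) + 1
tagged = {c: tagged[i] for c, i in ids.items()}

assert tagged == naive  # same answer, but aggregation happens per user
```

Both paths agree on every combination; the difference is purely in where the work happens (per-user convergence versus per-row expansion).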
Conclusion: The data tagging approach transforms the calculation from data expansion to data convergence. As dimensions increase, the performance advantage becomes more significant. This solution has been deployed in Baidu's Feed core scenarios, supporting 10+ dimensions with billions of users.
Baidu Geek Talk