How to Cut Data Cube Processing Time by 60% with Deduplication Optimization
This article explains how to dramatically reduce the cost of deduplication‑Cube calculations in large‑scale data pipelines by replacing costly data‑expansion steps with a UID‑level tagging approach, detailing the scenario, common methods, performance analysis, a new solution, implementation steps, and experimental results.
SQL is the most widely used query language, but poorly written statements can be dozens or hundreds of times slower than optimized ones; writing performant, maintainable SQL requires deep understanding of data processing.
The article introduces a series called “Creative SQL” and presents the first case: optimizing a deduplication‑Cube used in Ant Group’s executive data‑line upgrade, where distinct counts (e.g., unique users) must be computed across many dimension combinations.
Scenario Description
When aggregating metrics such as daily payment users by province, city, and district, each dimension combination requires a separate distinct count. With large data volumes, naïve per‑combination calculations become infeasible.
Common Implementation Methods
Typical approaches include:
Calculating each dimension combination separately (multiple tables).
Data‑expansion techniques such as UNION ALL , LATERAL VIEW EXPLODE , or MaxCompute’s CUBE function, which duplicate rows for every required combination and then apply a standard DISTINCT aggregation.
All these methods share the same core idea: expand the data, then deduplicate, leading to similar performance but differing maintainability.
Performance Analysis
Experiments show that over 80% of the runtime is spent on data expansion and shuffling. For a benchmark of 10 billion rows across 25 dimension combos, the intermediate data grew from 100 GB to over 1 TB (≈10× expansion), and the total execution time reached 47 minutes.
A New Approach
The proposed solution avoids data expansion by first tagging each record with a unique user‑level identifier (UID) and encoding the required dimension combos as compact numeric codes. This creates a converging data flow where the intermediate size does not increase with the number of dimensions.
Logical Implementation
Prepare detail data : collect order number, user ID, payment date, province, city, amount.
Generate Cube and encode rows : use DENSE_RANK to assign a cube_id for each dimension combination (province only, province+city, etc.).
Write Cube IDs back to detail records (MapJoin) and aggregate per UID to obtain cube_id_arry.
Aggregate by UID and deduplicate the cube_id_arry (e.g., ARRAY_DISTINCT).
Translate Cube IDs back to readable dimensions and compute final distinct counts.
Code Implementation
WITH base_dwd AS (SELECT pay_no, user_id, gmt_pay, pay_amt, prov_name, prov_code, city_name, city_code FROM tmp_user_pay_order_detail), dim_cube AS (SELECT *, DENSE_RANK() OVER (PARTITION BY 1 ORDER BY cube_prov_name, cube_city_name) AS cube_id FROM (SELECT dim_key, COALESCE(IF(GROUPING(prov_name)=0, prov_name, 'ALL'), 'na') AS cube_prov_name, COALESCE(IF(GROUPING(city_name)=0, city_name, 'ALL'), 'na') AS cube_city_name FROM (SELECT CONCAT(COALESCE(prov_name,''), '#', COALESCE(city_name,''), '#') AS dim_key, prov_name, city_name FROM base_dwd GROUP BY prov_name, city_name) base GROUP BY dim_key, prov_name, city_name GROUPING SETS ((dim_key, prov_name), (dim_key, prov_name, city_name))) , detail_ext AS (SELECT user_id, ARRAY_DISTINCT(SPLIT(WM_CONCAT(';', cube_ids), ';')) AS cube_id_arry FROM (SELECT user_id, cube_ids FROM (SELECT user_id, CONCAT(COALESCE(prov_name,''), '#', COALESCE(city_name,''), '#') AS dim_key FROM base_dwd) dwd_detail JOIN (SELECT dim_key, WM_CONCAT(';', cube_id) AS cube_ids FROM dim_cube GROUP BY dim_key) dim_cube ON dwd_detail.dim_key = dim_cube.dim_key) GROUP BY user_id), base_dws AS (SELECT cube_id, MAX(prov_name) AS prov_name, MAX(city_name) AS city_name, MAX(user_cnt) AS user_cnt FROM (SELECT cube_id, COUNT(1) AS user_cnt, CAST(NULL AS STRING) AS prov_name, CAST(NULL AS STRING) AS city_name FROM detail_ext LATERAL VIEW EXPLODE(cube_id_arry) arr AS cube_id GROUP BY cube_id UNION ALL SELECT CAST(cube_id AS STRING) AS cube_id, CAST(NULL AS BIGINT) AS user_cnt, cube_prov_name AS prov_name, cube_city_name AS city_name FROM dim_cube) GROUP BY cube_id) SELECT prov_name, city_name, user_cnt FROM base_dws;Experimental Results
The new tagging pipeline processed 200 billion rows with 50 dimension combos in about 18 minutes, while the traditional expansion method would require >2.5 hours for the same workload (estimated 3 TB intermediate data). The new method therefore reduces runtime by more than 60% and scales linearly with data size.
Other Solutions
Two additional techniques are discussed:
Bitmap : encode user IDs into a bitmap where each bit represents presence; suitable for exact distinct counts but adds encoding complexity.
HyperLogLog : probabilistic distinct counting with much lower memory usage, at the cost of accuracy.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
