Big Data 15 min read

How to Cut Data Cube Processing Time by 60% with Deduplication Optimization

This article explains how to dramatically reduce the cost of deduplication‑Cube calculations in large‑scale data pipelines by replacing costly data‑expansion steps with a UID‑level tagging approach, detailing the scenario, common methods, performance analysis, a new solution, implementation steps, and experimental results.

Alibaba Cloud Developer

Dec 25, 2023

How to Cut Data Cube Processing Time by 60% with Deduplication Optimization

SQL is the most widely used query language, but poorly written statements can be dozens or hundreds of times slower than optimized ones; writing performant, maintainable SQL requires deep understanding of data processing.

The article introduces a series called “Creative SQL” and presents the first case: optimizing a deduplication‑Cube used in Ant Group’s executive data‑line upgrade, where distinct counts (e.g., unique users) must be computed across many dimension combinations.

Scenario Description

When aggregating metrics such as daily payment users by province, city, and district, each dimension combination requires a separate distinct count. With large data volumes, naïve per‑combination calculations become infeasible.

Common Implementation Methods

Typical approaches include:

Calculating each dimension combination separately (multiple tables).

Data‑expansion techniques such as UNION ALL , LATERAL VIEW EXPLODE , or MaxCompute’s CUBE function, which duplicate rows for every required combination and then apply a standard DISTINCT aggregation.

All these methods share the same core idea: expand the data, then deduplicate, leading to similar performance but differing maintainability.

Performance Analysis

Experiments show that over 80% of the runtime is spent on data expansion and shuffling. For a benchmark of 10 billion rows across 25 dimension combos, the intermediate data grew from 100 GB to over 1 TB (≈10× expansion), and the total execution time reached 47 minutes.

A New Approach

The proposed solution avoids data expansion by first tagging each record with a unique user‑level identifier (UID) and encoding the required dimension combos as compact numeric codes. This creates a converging data flow where the intermediate size does not increase with the number of dimensions.

Logical Implementation

Prepare detail data : collect order number, user ID, payment date, province, city, amount.

Generate Cube and encode rows : use DENSE_RANK to assign a cube_id for each dimension combination (province only, province+city, etc.).

Write Cube IDs back to detail records (MapJoin) and aggregate per UID to obtain cube_id_arry.

Aggregate by UID and deduplicate the cube_id_arry (e.g., ARRAY_DISTINCT).

Translate Cube IDs back to readable dimensions and compute final distinct counts.

Code Implementation

WITH base_dwd AS (SELECT pay_no, user_id, gmt_pay, pay_amt, prov_name, prov_code, city_name, city_code FROM tmp_user_pay_order_detail), dim_cube AS (SELECT *, DENSE_RANK() OVER (PARTITION BY 1 ORDER BY cube_prov_name, cube_city_name) AS cube_id FROM (SELECT dim_key, COALESCE(IF(GROUPING(prov_name)=0, prov_name, 'ALL'), 'na') AS cube_prov_name, COALESCE(IF(GROUPING(city_name)=0, city_name, 'ALL'), 'na') AS cube_city_name FROM (SELECT CONCAT(COALESCE(prov_name,''), '#', COALESCE(city_name,''), '#') AS dim_key, prov_name, city_name FROM base_dwd GROUP BY prov_name, city_name) base GROUP BY dim_key, prov_name, city_name GROUPING SETS ((dim_key, prov_name), (dim_key, prov_name, city_name))) , detail_ext AS (SELECT user_id, ARRAY_DISTINCT(SPLIT(WM_CONCAT(';', cube_ids), ';')) AS cube_id_arry FROM (SELECT user_id, cube_ids FROM (SELECT user_id, CONCAT(COALESCE(prov_name,''), '#', COALESCE(city_name,''), '#') AS dim_key FROM base_dwd) dwd_detail JOIN (SELECT dim_key, WM_CONCAT(';', cube_id) AS cube_ids FROM dim_cube GROUP BY dim_key) dim_cube ON dwd_detail.dim_key = dim_cube.dim_key) GROUP BY user_id), base_dws AS (SELECT cube_id, MAX(prov_name) AS prov_name, MAX(city_name) AS city_name, MAX(user_cnt) AS user_cnt FROM (SELECT cube_id, COUNT(1) AS user_cnt, CAST(NULL AS STRING) AS prov_name, CAST(NULL AS STRING) AS city_name FROM detail_ext LATERAL VIEW EXPLODE(cube_id_arry) arr AS cube_id GROUP BY cube_id UNION ALL SELECT CAST(cube_id AS STRING) AS cube_id, CAST(NULL AS BIGINT) AS user_cnt, cube_prov_name AS prov_name, cube_city_name AS city_name FROM dim_cube) GROUP BY cube_id) SELECT prov_name, city_name, user_cnt FROM base_dws;

Experimental Results

The new tagging pipeline processed 200 billion rows with 50 dimension combos in about 18 minutes, while the traditional expansion method would require >2.5 hours for the same workload (estimated 3 TB intermediate data). The new method therefore reduces runtime by more than 60% and scales linearly with data size.

How to Cut Data Cube Processing Time by 60% with Deduplication Optimization

Scenario Description

Common Implementation Methods

Performance Analysis

A New Approach

Logical Implementation

Code Implementation

Experimental Results

Other Solutions

Alibaba Cloud Developer

How this landed with the community

Was this worth your time?

0 Comments