Big Data 17 min read

Mastering Efficient SQL in ODPS: Union, Count‑Distinct, and Join Optimizations

This article walks through common SQL development scenarios on ODPS, examining why naïve UNION and COUNT DISTINCT can be slow, how to rewrite queries with GROUP BY, UNION ALL, JSON aggregation, and map‑join techniques, and shows the resulting execution‑plan improvements with concrete code and performance numbers.

ITPUB

Mar 25, 2023

Mastering Efficient SQL in ODPS: Union, Count‑Distinct, and Join Optimizations

Union Efficiency

In many business cases data from multiple tables must be merged, such as combining customer information stored across two tables with overlapping records. A naïve UNION without deduplication first merges the datasets, causing a large shuffle and high intermediate I/O because the data volume does not shrink before the join.

SELECT cst_id, cst_info
FROM (
    SELECT cst_id, cst_info FROM @cst_info_a WHERE dt = '${bizdate}'
    UNION
    SELECT cst_id, cst_info FROM cst_info_b WHERE dt = '${bizdate}'
) cst_info;

The execution plan shows that ODPS has already optimized this pattern, producing the best possible plan. Two alternative high‑performance versions add a GROUP BY before the UNION (or use UNION ALL followed by a final GROUP BY) to reduce data early. Both alternatives generate identical execution plans, confirming that the extra aggregation does not hurt performance.

Count Distinct Optimization

Counting distinct users over the last five days is a typical exploratory query. The straightforward COUNT(DISTINCT cst_id) appears slow, but ODPS rewrites it into two deduplication steps internally. An alternative GROUP BY approach explicitly performs the deduplication and can be faster.

-- Common way
SELECT COUNT(DISTINCT cst_id) AS cst_cnt
FROM @pc_bill_bal
WHERE dt BETWEEN '${bizdate-5}' AND '${bizdate}';

-- Optimized way
SELECT COUNT(1) AS cst_cnt
FROM (
    SELECT cst_id FROM @pc_bill_bal
    WHERE dt BETWEEN '${bizdate-5}' AND '${bizdate}'
    GROUP BY cst_id
) base;

Execution‑plan screenshots reveal that the naïve version still performs two deduplication passes, while the optimized version adds an extra aggregation step but remains fast. In the test, the naïve query was 28 % faster (62 s vs 86 s) and used 28 % less resources.

When counting per day, the naïve COUNT(DISTINCT) incurs a massive shuffle (26× slower, 2× more resources). The GROUP BY rewrite dramatically improves runtime.

Multi‑Table Join Acceleration (Aggregation Type)

Joining several large tables to compute aggregated attributes (e.g., total assets across initial, retail, and buy tables) can be expensive. A traditional approach uses multiple FULL OUTER JOIN or sequential LEFT JOIN operations.

SELECT
    COALESCE(tt1.cst_id, tt2.cst_id) AS cst_id,
    COALESCE(tt1.bal_init_prin, 0) AS bal_init_prin,
    COALESCE(tt1.amt_retail_prin, 0) AS amt_retail_prin,
    COALESCE(tt2.amt_buy_prin, 0) AS amt_buy_prin
FROM (
    SELECT COALESCE(t1.cst_id, t2.cst_id) AS cst_id,
           COALESCE(t1.bal_init_prin, 0) AS bal_init_prin,
           COALESCE(t2.amt_retail_prin, 0) AS amt_retail_prin
    FROM @bal_init t1 FULL OUTER JOIN @amt_retail t2 ON t1.cst_id = t2.cst_id
) tt1 FULL OUTER JOIN @amt_buy tt2 ON tt1.cst_id = tt2.cst_id;

An optimized pattern first unions all rows with placeholder columns and then aggregates with a single GROUP BY. This reduces the number of shuffle and join stages.

SELECT cst_id,
       SUM(bal_init_prin) AS bal_init_prin,
       SUM(amt_retail_prin) AS amt_retail_prin,
       SUM(amt_buy_prin) AS amt_buy_prin
FROM (
    SELECT cst_id, bal_init_prin, 0 AS amt_retail_prin, 0 AS amt_buy_prin FROM @bal_init
    UNION ALL
    SELECT cst_id, 0, amt_retail_prin, 0 FROM @amt_retail
    UNION ALL
    SELECT cst_id, 0, 0, amt_buy_prin FROM @amt_buy
) t1
GROUP BY cst_id;

The execution plan shows the optimized version uses far fewer MAP and REDUCE tasks, cutting runtime by ~20 % and resource usage by ~30 % as the number of tables grows.

Multi‑Table Join Acceleration (String Type)

When the attributes are strings, the same union‑and‑group technique can be applied using JSON aggregation. Each table contributes a JSON fragment; COLLECT_SET gathers fragments per key, which are then parsed with GET_JSON_OBJECT.

SELECT cst_id,
       GET_JSON_OBJECT(all_val, '$.bal_init_prin') AS bal_init_prin,
       GET_JSON_OBJECT(all_val, '$.amt_retail_prin') AS amt_retail_prin,
       GET_JSON_OBJECT(all_val, '$.amt_buy_prin') AS amt_buy_prin
FROM (
    SELECT cst_id,
           CONCAT('{', CONCAT_WS(',', COLLECT_SET(all_val)), '}') AS all_val
    FROM (
        SELECT cst_id, CONCAT('"bal_init_prin":"', bal_init_prin, '"') AS all_val FROM @bal_init
        UNION ALL
        SELECT cst_id, CONCAT('"amt_retail_prin":"', amt_retail_prin, '"') FROM @amt_retail
        UNION ALL
        SELECT cst_id, CONCAT('"amt_buy_prin":"', amt_buy_prin, '"') FROM @amt_buy
    ) t1
    GROUP BY cst_id
) tt1;

Execution‑plan comparison confirms the JSON‑based method eliminates repeated joins, delivering similar speed gains while handling string‑type attributes.

MapJoin and DistMapJoin

For a large‑table‑small‑table join, ODPS can automatically apply a map‑join (broadcast join) without explicit hints. The join runs entirely in the MAP phase, removing the separate JOIN task. However, map‑join fails if the small table exceeds the configured memory limit.

-- MapJoin works (inner join)
SELECT base.*, fee_year_rate.*
FROM @base base
INNER JOIN @fee_year_rate fee_year_rate ON base.terms = fee_year_rate.terms;

-- MapJoin fails (right join, full outer join)
SELECT base.*, fee_year_rate.*
FROM @base base RIGHT JOIN @fee_year_rate fee_year_rate ON base.terms = fee_year_rate.terms;

Execution‑plan screenshots show map‑join producing only MAP tasks, while the failing cases retain a JOIN stage. The memory parameter can be tuned, e.g., set odps.sql.mapjoin.memory.max=2048 (MB).

DistMapJoin extends map‑join to medium‑sized tables by sharding the small table across reducers. The hint syntax is /*+distmapjoin(table_name(shard_count=n,replica_count=m))*/. The plan adds a single REDUCE task for sharding but still avoids a full JOIN stage.

SELECT /*+distmapjoin(cst_info(shard_count=20))*/ base.*, cst_info.*
FROM @base base LEFT JOIN @cst_info cst_info ON base.cst_id = cst_info.cst_id;

Predicate Placement

Developers often place filter conditions immediately after reading a table, assuming it improves performance. ODPS pushes predicates early in the execution plan regardless of where the WHERE clause appears. Both the “standard” (filter before join) and “non‑standard” (filter after join) queries generate identical plans.

-- Standard: filter before join
SELECT base.*, fee_year_rate.*
FROM (SELECT * FROM @base WHERE terms = '12') base
INNER JOIN @fee_year_rate fee_year_rate ON base.terms = fee_year_rate.terms;

-- Non‑standard: filter after join
SELECT base.*, fee_year_rate.*
FROM @base base
INNER JOIN @fee_year_rate fee_year_rate ON base.terms = fee_year_rate.terms
WHERE base.terms = '12';

ODPS optimizes predicate push‑down effectively, but keeping filters close to the source table improves code readability.

Overall Takeaway

Effective SQL development on ODPS requires reading and interpreting execution plans, experimenting with UNION ALL + GROUP BY, map‑join, and dist‑map‑join patterns, and understanding how predicate placement is optimized. Mastery of these techniques leads to faster queries and lower resource consumption.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data SQL Query Optimization ODPS UNION MapJoin CountDistinct

Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Union Efficiency

Count Distinct Optimization

Multi‑Table Join Acceleration (Aggregation Type)

Multi‑Table Join Acceleration (String Type)

MapJoin and DistMapJoin

Predicate Placement

Overall Takeaway

ITPUB

How this landed with the community

Was this worth your time?

0 Comments

Count Distinct Optimization