
SQL Optimization Techniques for ODPS (Open Data Processing Service)

This article presents practical ODPS SQL optimization strategies—including explicit column selection, partition limiting, multi‑insert, proper handling of nulls, join‑type choices, map‑join and skew hints, bucketed tables, and tuned task parameters—illustrated with three real‑world cases that dramatically cut execution time and resource usage.

DaTaobao Tech

Background

ODPS (Open Data Processing Service) is Alibaba's massive‑data processing platform built on the proprietary Feitian distributed operating system. It is a core component of the company's cloud‑computing solution. This article shares practical SQL‑level optimization experiences accumulated from years of data‑warehouse development on ODPS.

Basic Knowledge

Hive SQL execution flow can be referenced in the article "Hive SQL Compilation Process". The official Hive documentation also provides basic syntax guidance.

Practical Tips

Comparisons against NULL in conditions such as =, <>, IN, and NOT IN evaluate to unknown, so rows with NULL values are silently excluded from the result. Use nvl or coalesce to substitute a default when those rows should participate.
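A minimal sketch of this pitfall (table and column names are hypothetical):

```sql
-- Rows where seller_id IS NULL match neither branch and are dropped:
SELECT * FROM orders WHERE seller_id <> 1001;
SELECT * FROM orders WHERE seller_id NOT IN (1001);

-- Substitute a sentinel so NULL rows are retained:
SELECT * FROM orders WHERE coalesce(seller_id, -1) <> 1001;
```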

Prefer explicit column lists over select * to reduce unnecessary data reads and avoid future schema‑change failures.

Use multi insert to write the results of a single table scan into multiple target tables, so the source is read only once.
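A sketch of the multi-insert form, assuming a hypothetical source table src_orders and two target tables:

```sql
-- One pass over src_orders feeds two aggregation targets.
FROM src_orders
INSERT OVERWRITE TABLE ads_order_cnt PARTITION (ds='20230101')
  SELECT seller_id, COUNT(*)      GROUP BY seller_id
INSERT OVERWRITE TABLE ads_order_amt PARTITION (ds='20230101')
  SELECT seller_id, SUM(pay_amt)  GROUP BY seller_id;
```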

Always limit partitions in queries (e.g., where ds='20230101') to avoid full‑table scans.

Apply limit in exploratory queries to reduce resource consumption.

Push UDF calls down into the innermost sub‑query so they are evaluated once on filtered data rather than repeatedly at outer query levels.

Use collect_set or lateral view for row‑to‑column and column‑to‑row transformations.
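A sketch of both directions, with hypothetical table and column names:

```sql
-- Rows to column: collect each user's item ids into one array field.
SELECT user_id, collect_set(item_id) AS item_ids
FROM user_item
GROUP BY user_id;

-- Column to rows: explode a delimited field back into one row per item.
SELECT u.user_id, t.item_id
FROM user_items u
LATERAL VIEW explode(split(u.item_ids, ',')) t AS item_id;
```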

Window functions such as row_number() or rank() over (partition by … order by …) can provide grouped ordering.
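For example, a common "latest record per group" pattern (table and column names are hypothetical):

```sql
-- Keep each user's most recent order by ranking within the user partition.
SELECT user_id, order_id, pay_time
FROM (
  SELECT user_id, order_id, pay_time,
         row_number() OVER (PARTITION BY user_id ORDER BY pay_time DESC) AS rn
  FROM orders
) t
WHERE rn = 1;
```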

Choose the appropriate join type (left, inner, right, left anti, left semi) and ensure join keys have identical data types.
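A sketch of the semi/anti variants, which return only left-table columns (table names are hypothetical; availability depends on the ODPS/MaxCompute version):

```sql
-- Keep orders whose seller appears in the whitelist:
SELECT a.* FROM orders a
LEFT SEMI JOIN seller_whitelist b ON a.seller_id = b.seller_id;

-- Keep orders whose seller does NOT appear in the whitelist:
SELECT a.* FROM orders a
LEFT ANTI JOIN seller_whitelist b ON a.seller_id = b.seller_id;
```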

A Cartesian product (e.g., to expand each row into N copies) can be produced by joining against a small, purpose-built dimension table.
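A sketch of the row-expansion trick, assuming a hypothetical three-row number table dim_nums (ODPS typically requires a mapjoin hint for joins without an equality condition):

```sql
-- Expand every order into 3 copies by joining a tiny number table
-- holding values 0, 1, 2.
SELECT /*+ mapjoin(n) */ o.order_id, n.num
FROM orders o
JOIN dim_nums n ON 1 = 1;
```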

For data skew, consider a mapjoin hint (/*+ mapjoin(b) */) to broadcast the small table, or split hot keys into a separate table and handle them independently.
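A minimal mapjoin sketch (table names are hypothetical):

```sql
-- Broadcast the small dimension table to every mapper, so skewed
-- join keys never pass through a shuffle.
SELECT /*+ mapjoin(b) */ a.item_id, b.cate_name
FROM big_orders a
LEFT JOIN dim_cate b ON a.cate_id = b.cate_id;
```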

When count(distinct …) causes skew, pre‑aggregate with group by then count.
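The rewrite can be sketched as follows (table and column names are hypothetical):

```sql
-- Skew-prone single-stage form:
--   SELECT ds, COUNT(DISTINCT user_id) FROM logs GROUP BY ds;

-- Two-stage rewrite: deduplicate first, then count the deduplicated rows.
SELECT ds, COUNT(*) AS uv
FROM (
  SELECT ds, user_id
  FROM logs
  GROUP BY ds, user_id
) t
GROUP BY ds;
```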

Parameter Settings (ODPS)

Typical performance tuning involves adjusting the number and memory of map, join, and reduce tasks. Example settings:

Map: set odps.sql.mapper.cpu=100, set odps.sql.mapper.memory=1024, set odps.sql.mapper.split.size=256

Join: set odps.sql.joiner.instances=-1, set odps.sql.joiner.cpu=100, set odps.sql.joiner.memory=1024

Reduce: set odps.sql.reducer.instances=-1, set odps.sql.reducer.cpu=100, set odps.sql.reducer.memory=1024

Small‑file merge: set odps.merge.cross.paths=true, set odps.merge.smallfile.filesize.threshold=64, set odps.merge.max.filenumber.per.instance=10000

UDF: set odps.sql.udf.jvm.memory=1024, set odps.sql.udf.timeout=1800, set odps.sql.udf.strict.mode=false

Mapjoin memory: set odps.sql.mapjoin.memory.max=512

Dynamic partition: set odps.sql.reshuffle.dynamicpt=true

Skew handling: set odps.sql.groupby.skewindata=true, set odps.sql.skewjoin=true

SQL Optimization Case 1 – Join & Data Skew

Background: A 90‑day order table (large) was left‑joined with product and SKU attribute tables. After expanding the query to all categories, the join stage slowed to 4 hours.

Steps:

Applied skewjoin hint – reduced time to ~2 hours but still not optimal.

Separated hot keys (top 500,000 items) into a temporary table and used mapjoin for them – still ~2 hours.

Analyzed the execution plan and discovered that item_id and sku_id were implicitly cast to DOUBLE, causing precision loss for 18‑digit IDs.

Forced both sides to STRING type in the join condition – execution time dropped to ~40 minutes.
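The fix can be sketched as follows (table and column names are hypothetical; the source describes joining an order table to SKU attributes):

```sql
-- Mixed BIGINT/STRING join keys are implicitly cast through DOUBLE,
-- which loses precision above ~16 digits. Cast both sides explicitly.
SELECT a.*, b.sku_attr
FROM orders a
LEFT JOIN sku_attrs b
  ON CAST(a.sku_id AS STRING) = CAST(b.sku_id AS STRING);
```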

Summary:

Skew can be mitigated with skewjoin or hot‑key separation.

Ensure join keys share the same data type; cast to STRING when IDs exceed 15‑16 digits.

Store large IDs as STRING to avoid precision issues.

SQL Optimization Case 2 – Bucket Join for Large‑Large Tables

Background: DWS layer stores user‑item interaction aggregates. A query grouping by user, shop, and category was slow.

Solution:

SELECT
  cate_id,
  shop_type,
  user_id,
  SUM(itm_sty_tme) AS itm_sty_tme,
  …
FROM dws_tm_brd_pwr_deep_usr_cat_1d
WHERE ds = TO_CHAR(DATEADD(TO_DATE('${bizdate}', 'yyyymmdd'), -89, 'dd'), 'yyyymmdd')
  AND cate_flag = '1'
GROUP BY cate_id, shop_type, user_id;

Steps:

Create a test table cloned from the source.

Alter it to be clustered by user_id into 1024 buckets:

ALTER TABLE tmp_zhangtao_test_hash_range CLUSTERED BY (user_id) SORTED BY (user_id) INTO 1024 BUCKETS;

Insert data with INSERT OVERWRITE into the bucketed table.

Run the same aggregation against the bucketed table.

Result:

Bucketed version: CPU 0.34 Core·Min, Memory 0.61 GB·Min.

Non‑bucketed version: CPU 175.85 Core·Min, Memory 324.24 GB·Min.

Takeaway: Hash clustering dramatically reduces shuffle and resource consumption, but may introduce its own skew if the bucket key is unevenly distributed.

SQL Optimization Case 3 – Hash Clustering in Business Scenarios

Background: Same DWS table, need to aggregate 30‑day user behavior.

Approach:

Use hash clustering (bucketed tables) to co‑locate join keys.

When upstream tables cannot be altered, focus on rewriting SQL logic (e.g., multi‑insert, avoiding unnecessary scans).

Result: Reduced map‑phase data read and saved ~40 minutes of execution time.

Final Thoughts

SQL is a static language while business requirements evolve. Efficient SQL design that can cover multiple scenarios is essential for data engineers. Balancing performance gains with resource constraints requires a deep understanding of both the platform (ODPS) and the underlying data characteristics.

Team Introduction

We are the Big‑Juhuasuan Data Science team. Our mission is to make product‑ and mind‑operation efficient and predictable. We support platforms such as Juhuasuan, Billion‑Subsidy, and Daily Deals, focusing on discount‑driven shopping experiences through data insight.
