Big Data 17 min read

Unlock ODPS SQL Performance: Deep Dive into Execution Plans & Optimizations

This article examines ODPS SQL performance by dissecting logical execution plans and Logview visualizations, explaining the underlying principles of various optimization techniques such as multi‑distinct handling, shuffle reduction, system parameters, and different join strategies, and demonstrates how to apply these methods to improve query efficiency in real‑world data engineering tasks.

Alibaba Cloud Developer

Jan 11, 2024

Unlock ODPS SQL Performance: Deep Dive into Execution Plans & Optimizations

Background

When using ODPS SQL for offline data processing, developers often encounter performance issues and need to tune jobs to meet timeliness requirements and avoid resource waste. Existing documentation rarely explains the underlying reasons from the execution‑plan perspective.

ODPS Architecture

ODPS is divided into three layers: access layer, logical (control) layer, and storage/compute layer. The access layer uses LVS for load balancing and an HTTP server for authentication. The logical layer contains Worker, Scheduler, and Executor roles. The compute layer runs on independent clusters (Fuxi, Pangu, Nuwa/ZK, Shennong) and stores metadata in OTS.

Basic Concepts

One ODPS job corresponds to an instance, which contains tasks that may be split into multiple Fuxi jobs, each further divided into map, reduce, and joiner tasks.

Execution Plan

Using EXPLAIN can retrieve the logical plan, which helps check syntax, table/partition correctness, and map‑reduce structure.

Optimization Methods

Multi‑Distinct Optimization

Case 1: COUNT without DISTINCT – pre‑aggregation reduces shuffle.

EXPLAIN SELECT app_id, count(user_id) FROM xxx.table_vst_user_test WHERE dt = '20230816' GROUP BY app_id;

Case 2: COUNT with DISTINCT – intermediate results contain user_id, causing larger shuffle and possible data skew.

EXPLAIN SELECT app_id, count(DISTINCT user_id) FROM xxx.table_vst_user_test WHERE dt = '20230816' GROUP BY app_id;

Case 3: Multiple DISTINCT columns – further increases shuffle volume.

EXPLAIN SELECT app_id, count(DISTINCT user_id), count(DISTINCT cy23_source_name_l1), count(DISTINCT cy23_source_name_l2) FROM xxx.table_vst_user_test WHERE dt = '20230816' GROUP BY app_id;

Case 4: Rewrite to aggregate before distinct, adding an extra reducer to change the hash key to app_id&user_id, reducing skew.

EXPLAIN SELECT app_id, COUNT(user_id) FROM (SELECT app_id, user_id FROM xxx.table_vst_user_test WHERE dt = '${bizdate}' GROUP BY app_id, user_id) t GROUP BY app_id;

System Parameter odps.sql.groupby.skewindata

Setting this parameter to true changes the hash key similarly to Case 4, achieving the same effect.

SET odps.sql.groupby.skewindata = true;

EXPLAIN SELECT app_id, COUNT(DISTINCT user_id) FROM xxx.table_vst_user_test WHERE dt = '${bizdate}' GROUP BY app_id;

Join Strategies

Sort‑Merge Join : standard shuffle‑sort‑merge process, may suffer from data skew.

Broadcast Hash Join (mapjoin hint) : small dimension table is broadcast, avoiding shuffle of the large table.

EXPLAIN SELECT /*+mapjoin(t2)*/ mini_cat_name_l1, COUNT(DISTINCT user_id) FROM (SELECT app_id, user_id FROM xxx.table_vst_user_test WHERE dt = '20230816' GROUP BY app_id, user_id) t1 LEFT JOIN (SELECT app_id, mini_cat_name_l1 FROM xxx.dim_category WHERE dt = '20230816') t2 ON t1.app_id = t2.app_id GROUP BY mini_cat_name_l1;

Distributed MapJoin : builds a distributed hash table for the small side; suitable when the large table is much bigger.

EXPLAIN SELECT /*+ DISTMAPJOIN(t2 (shard_count = 2)) */ mini_cat_name_l1, COUNT(DISTINCT user_id) FROM (SELECT app_id, user_id FROM xxx.table_vst_user_test WHERE dt = '20230816' GROUP BY app_id, user_id) t1 LEFT JOIN (SELECT app_id, mini_cat_name_l1 FROM xxx.dim_category WHERE dt = '20230816') t2 ON t1.app_id = t2.app_id GROUP BY mini_cat_name_l1;

Summary

The article demonstrates how to analyze ODPS SQL logical and Logview execution plans to understand the rationale behind various optimizations, providing practical guidance for improving query performance in everyday data‑engineering tasks.

References

Sort‑Merge Join: https://www.sparkcodehub.com/spark-what-is-a-sort-merge-join-in-spark-sql

Join implementation: https://www.jianshu.com/p/97e76dddcbfb

SparkSQL join types: https://blog.csdn.net/wlk_328909605/article/details/82933552

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

ODPS Shuffle execution plan join strategies

Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.