Unlock ODPS SQL Performance: Deep Dive into Execution Plans & Optimizations
This article examines ODPS SQL performance by dissecting logical execution plans and Logview visualizations, explaining the underlying principles of various optimization techniques such as multi‑distinct handling, shuffle reduction, system parameters, and different join strategies, and demonstrates how to apply these methods to improve query efficiency in real‑world data engineering tasks.
Background
When using ODPS SQL for offline data processing, developers often encounter performance issues and need to tune jobs to meet timeliness requirements and avoid resource waste. Existing documentation rarely explains the underlying reasons from the execution‑plan perspective.
ODPS Architecture
ODPS is divided into three layers: access layer, logical (control) layer, and storage/compute layer. The access layer uses LVS for load balancing and an HTTP server for authentication. The logical layer contains Worker, Scheduler, and Executor roles. The compute layer runs on independent clusters (Fuxi, Pangu, Nuwa/ZK, Shennong) and stores metadata in OTS.
Basic Concepts
One ODPS job corresponds to an instance, which contains tasks that may be split into multiple Fuxi jobs, each further divided into map, reduce, and joiner tasks.
Execution Plan
Using EXPLAIN can retrieve the logical plan, which helps check syntax, table/partition correctness, and map‑reduce structure.
Optimization Methods
Multi‑Distinct Optimization
Case 1: COUNT without DISTINCT – pre‑aggregation reduces shuffle.
EXPLAIN SELECT app_id, count(user_id) FROM xxx.table_vst_user_test WHERE dt = '20230816' GROUP BY app_id;Case 2: COUNT with DISTINCT – intermediate results contain user_id, causing larger shuffle and possible data skew.
EXPLAIN SELECT app_id, count(DISTINCT user_id) FROM xxx.table_vst_user_test WHERE dt = '20230816' GROUP BY app_id;Case 3: Multiple DISTINCT columns – further increases shuffle volume.
EXPLAIN SELECT app_id, count(DISTINCT user_id), count(DISTINCT cy23_source_name_l1), count(DISTINCT cy23_source_name_l2) FROM xxx.table_vst_user_test WHERE dt = '20230816' GROUP BY app_id;Case 4: Rewrite to aggregate before distinct, adding an extra reducer to change the hash key to app_id&user_id, reducing skew.
EXPLAIN SELECT app_id, COUNT(user_id) FROM (SELECT app_id, user_id FROM xxx.table_vst_user_test WHERE dt = '${bizdate}' GROUP BY app_id, user_id) t GROUP BY app_id;System Parameter odps.sql.groupby.skewindata
Setting this parameter to true changes the hash key similarly to Case 4, achieving the same effect.
SET odps.sql.groupby.skewindata = true; EXPLAIN SELECT app_id, COUNT(DISTINCT user_id) FROM xxx.table_vst_user_test WHERE dt = '${bizdate}' GROUP BY app_id;Join Strategies
Sort‑Merge Join : standard shuffle‑sort‑merge process, may suffer from data skew.
Broadcast Hash Join (mapjoin hint) : small dimension table is broadcast, avoiding shuffle of the large table.
EXPLAIN SELECT /*+mapjoin(t2)*/ mini_cat_name_l1, COUNT(DISTINCT user_id) FROM (SELECT app_id, user_id FROM xxx.table_vst_user_test WHERE dt = '20230816' GROUP BY app_id, user_id) t1 LEFT JOIN (SELECT app_id, mini_cat_name_l1 FROM xxx.dim_category WHERE dt = '20230816') t2 ON t1.app_id = t2.app_id GROUP BY mini_cat_name_l1;Distributed MapJoin : builds a distributed hash table for the small side; suitable when the large table is much bigger.
EXPLAIN SELECT /*+ DISTMAPJOIN(t2 (shard_count = 2)) */ mini_cat_name_l1, COUNT(DISTINCT user_id) FROM (SELECT app_id, user_id FROM xxx.table_vst_user_test WHERE dt = '20230816' GROUP BY app_id, user_id) t1 LEFT JOIN (SELECT app_id, mini_cat_name_l1 FROM xxx.dim_category WHERE dt = '20230816') t2 ON t1.app_id = t2.app_id GROUP BY mini_cat_name_l1;Summary
The article demonstrates how to analyze ODPS SQL logical and Logview execution plans to understand the rationale behind various optimizations, providing practical guidance for improving query performance in everyday data‑engineering tasks.
References
Sort‑Merge Join: https://www.sparkcodehub.com/spark-what-is-a-sort-merge-join-in-spark-sql
Join implementation: https://www.jianshu.com/p/97e76dddcbfb
SparkSQL join types: https://blog.csdn.net/wlk_328909605/article/details/82933552
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
