How to Diagnose and Optimize Data Skew and Data Expansion in Big Data SQL
This article shares practical methods, based on real‑world team experience, to identify and resolve data skew and data expansion issues in big data SQL queries, offering systematic investigation steps and optimization techniques for Map, Reduce, and Join stages.
Background
Various big‑data query engines such as Spark, Hive, and Presto provide friendly SQL syntax and are widely used; our organization also offers ODPS SQL for internal users. While using ODPS SQL to detect security risks, we discovered performance bottlenecks mainly caused by data skew and data expansion.
Problem Overview
Data Skew
Data skew occurs when many identical keys are sent to the same Reduce node, causing that node to process far more data than others and dramatically slowing overall SQL execution. The root cause is uneven key distribution, leading to unbalanced Reduce processing.
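A minimal sketch of how a single hot key overloads one Reduce node under hash partitioning. The CRC32-based partitioner, the key names, and the 4-reducer setup are assumptions chosen for illustration, not the partitioner any particular engine uses.

```python
import zlib

def partition_of(key, num_reducers):
    # Deterministic stand-in for an engine's hash partitioner (an assumption).
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers

def reducer_loads(keys, num_reducers):
    """Count how many rows each reducer receives."""
    loads = [0] * num_reducers
    for key in keys:
        loads[partition_of(key, num_reducers)] += 1
    return loads

def skew_ratio(loads):
    """Largest reducer load divided by the mean load (1.0 means perfectly even)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0

# Hypothetical input: one hot key carries 90% of the rows.
keys = ["hot_user"] * 9000 + [f"user_{i}" for i in range(1000)]
loads = reducer_loads(keys, 4)
print(loads, round(skew_ratio(loads), 2))
```

Whichever reducer receives "hot_user" handles at least 9,000 of the 10,000 rows, so the stage finishes only when that one reducer does.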
Data Expansion
Data expansion refers to situations where the output data volume is much larger than the input, e.g., 100 MB of input producing 1 TB of output, which can degrade performance and even cause resource exhaustion.
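One common source of expansion is a join on a key with many duplicates on both sides. The sketch below estimates the output row count of an inner join from per-key row counts; the function names and the sample keys are hypothetical.

```python
from collections import Counter

def join_output_rows(left_keys, right_keys):
    """Rows produced by an inner join: sum over shared keys of left_count * right_count."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    return sum(n * right_counts[k] for k, n in left_counts.items() if k in right_counts)

def expansion_factor(input_rows, output_rows):
    """Output rows divided by input rows."""
    return output_rows / input_rows if input_rows else 0.0

# Hypothetical tables joined on a low-cardinality key.
left = ["cn"] * 1000 + ["us"] * 10
right = ["cn"] * 1000 + ["us"] * 10
out = join_output_rows(left, right)  # 1000 * 1000 + 10 * 10 = 1,000,100 rows
factor = expansion_factor(len(left) + len(right), out)
print(out, round(factor, 1))
```

Running this kind of estimate on the per-key counts before the join is a cheap way to see whether the output will dwarf the input.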
Investigation
The steps below target problems in the business SQL itself; parameter tuning and platform failures are out of scope. Steps: (1) Check the input data volume for abnormal spikes; (2) Compare the runtime of each execution stage with its typical value; (3) Identify tasks that run unusually long or process a disproportionate share of the data; (4) Trace those tasks back to the corresponding business logic in the code.
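Step (3) can be sketched as a comparison of each task against the median of its stage. The metric names ("seconds", "rows"), the task ids, and the 3x threshold are assumptions for illustration; real values would come from the engine's job log or UI.

```python
from statistics import median

def find_stragglers(task_metrics, threshold=3.0):
    """Return ids of tasks whose runtime or row count exceeds threshold x the stage median."""
    med_time = median(m["seconds"] for m in task_metrics.values())
    med_rows = median(m["rows"] for m in task_metrics.values())
    flagged = []
    for task_id, m in task_metrics.items():
        slow = med_time > 0 and m["seconds"] > threshold * med_time
        heavy = med_rows > 0 and m["rows"] > threshold * med_rows
        if slow or heavy:
            flagged.append(task_id)
    return flagged

# Hypothetical per-task metrics for one Reduce stage.
metrics = {
    "reduce_0": {"seconds": 40, "rows": 1_000_000},
    "reduce_1": {"seconds": 45, "rows": 1_100_000},
    "reduce_2": {"seconds": 38, "rows": 900_000},
    "reduce_3": {"seconds": 1800, "rows": 60_000_000},
}
print(find_stragglers(metrics))
```

The median is used rather than the mean so that a single extreme task does not hide itself by inflating the baseline.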
Optimization
Map‑side Optimizations
Merge small files during data read to reduce the number of Map tasks.
Column pruning: avoid SELECT * and read only the columns the query actually needs.
Predicate pushdown: apply filter conditions as close to the data source as possible.
Data redistribution: use DISTRIBUTE BY RAND() to spread rows evenly across downstream tasks instead of hashing on a skewed key.
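To make the small-file item above concrete, the sketch below greedily packs files into input splits and compares the resulting task count with one task per file. The 256 MB target and the packing strategy are assumptions; actual engines use their own split logic and also split large files, which this sketch does not.

```python
def plan_splits(file_sizes_mb, target_split_mb=256):
    """Greedily pack files into splits of at most target_split_mb each.

    A file larger than the target gets a split of its own.
    """
    splits = []
    current, current_size = [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if current and current_size + size > target_split_mb:
            splits.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        splits.append(current)
    return splits

# Hypothetical input: 1000 files of 2 MB each.
files = [2] * 1000
splits = plan_splits(files)
print(len(files), "tasks without merging;", len(splits), "tasks with merging")
```

With these numbers each split holds 128 files, so 1000 files collapse into 8 Map tasks.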
Reduce‑side Optimizations
Check for null or empty join keys; filter them out early, or replace them with random values so that dirty data does not pile up on a single Reduce node:
-- Null ids get a random value, so they spread across Reduce tasks and match nothing in tb
SELECT ta.id FROM ta LEFT JOIN tb ON coalesce(ta.id, rand()) = tb.id;
Sorting strategies: ORDER BY sorts globally (expensive on large tables), SORT BY sorts locally within each Reduce task, DISTRIBUTE BY hash‑partitions rows across Reduce tasks, and CLUSTER BY combines DISTRIBUTE BY and SORT BY on the same columns. Replacing a global ORDER BY with DISTRIBUTE BY plus SORT BY spreads the sorting work over several Reduce tasks:
-- Original script
SELECT * FROM user_pay_table WHERE dt = '20221015' ORDER BY amt LIMIT 500;
-- Improved script
SELECT * FROM user_pay_table WHERE dt = '20221015'
DISTRIBUTE BY (CASE WHEN amt < 100 THEN 0 WHEN amt >= 100 AND amt <= 2000 THEN 1 ELSE 2 END)
SORT BY amt
LIMIT 500;
Join‑side Optimizations
Broadcast small tables to the Map side (map‑join) so the join runs without a shuffle and skewed keys never pile up on one Reduce node.
For large‑large joins, whitelist hot keys, process them separately, and then merge results.
Avoid Cartesian products caused by incorrect join conditions.
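Validate join‑key cardinality; a key with few distinct values matches many rows on each side, so the join output can grow toward the product of the per‑key row counts and cause data expansion.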
Split heavy aggregation into multiple queries to prevent intermediate result blow‑up.
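The hot-key approach for large-large joins can be modeled in a few lines. This is a logical sketch only: `hash_join` stands in for the engine's join, the hot-key whitelist is supplied by hand, and the assumption is that the right-side rows for the hot keys are few enough to broadcast.

```python
from collections import defaultdict

def hash_join(left, right):
    """Inner-join two lists of (key, value) pairs on key."""
    index = defaultdict(list)
    for key, right_value in right:
        index[key].append(right_value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]

def split_join(left, right, hot_keys):
    """Join hot keys and the remaining keys separately, then concatenate (UNION ALL)."""
    hot = set(hot_keys)
    # Hot part: few right-side rows, so in a real engine this could be a map-join.
    hot_rows = hash_join([r for r in left if r[0] in hot],
                         [r for r in right if r[0] in hot])
    # Cold part: the hot keys are gone, so an ordinary shuffle join is balanced.
    cold_rows = hash_join([r for r in left if r[0] not in hot],
                          [r for r in right if r[0] not in hot])
    return hot_rows + cold_rows

# Hypothetical data: "hot" dominates the left table.
left = [("hot", i) for i in range(1000)] + [("a", 1), ("b", 2)]
right = [("hot", "H"), ("a", "A"), ("c", "C")]
result = split_join(left, right, hot_keys=["hot"])
print(len(result))
```

The split version returns the same rows as a single join over the full tables; only the way the work is distributed changes.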
Conclusion
Big data SQL optimization requires a broad knowledge base, careful analysis of execution plans, and understanding of query‑engine design; the techniques presented address common issues and provide actionable solutions for practitioners.
Alibaba Cloud Developer
