Big Data 9 min read

How to Diagnose and Optimize Data Skew and Data Expansion in Big Data SQL

This article shares practical methods, based on real‑world team experience, to identify and resolve data skew and data expansion issues in big data SQL queries, offering systematic investigation steps and optimization techniques for Map, Reduce, and Join stages.

Alibaba Cloud Developer

Jun 14, 2023

How to Diagnose and Optimize Data Skew and Data Expansion in Big Data SQL

Background

Various big‑data query engines such as Spark, Hive, and Presto provide friendly SQL syntax and are widely used; our organization also offers ODPS SQL for internal users. While using ODPS SQL to detect security risks, we discovered performance bottlenecks mainly caused by data skew and data expansion.

Problem Overview

Data Skew

Data skew occurs when many identical keys are sent to the same Reduce node, causing that node to process far more data than others and dramatically slowing overall SQL execution. The root cause is uneven key distribution, leading to unbalanced Reduce processing.

Data Expansion

Data expansion refers to situations where the output data volume is much larger than the input, e.g., 100 MB of input producing 1 TB of output, which can degrade performance and even cause resource exhaustion.

Investigation

Focus on business‑SQL execution issues, ignoring parameter tuning and platform failures. Steps: (1) Check input data volume for abnormal spikes; (2) Observe runtime of each execution stage and compare with typical values; (3) Identify tasks with unusually long runtimes or disproportionate data processing; (4) Locate the corresponding business logic in the code.

Optimization

Map‑side Optimizations

Merge small files during data read to reduce the number of Map tasks.

Column pruning: avoid SELECT * and filter unnecessary columns.

Predicate pushdown: apply filter conditions as close to the data source as possible.

Data redistribution: use DISTRIBUTE BY RAND() to randomize key distribution and prevent skew.

Reduce‑side Optimizations

Check for null or empty join keys; filter them early or replace with random values to avoid clustering of dirty data.

Sorting strategies: ORDER BY for global sorting (expensive on large tables), SORT BY for local sorting within a Reduce task, DISTRIBUTE BY for hash partitioning, and CLUSTER BY which combines distribution and sorting.

SELECT ta.id FROM ta LEFT JOIN tb ON coalesce(ta.id, rand()) = tb.id;

-- Original script
select * from user_pay_table where dt = '20221015' order by amt limit 500;
-- Improved script
SELECT * FROM user_pay_table WHERE dt = '20221015'
DISTRIBUTE BY (CASE WHEN amt < 100 THEN 0 WHEN amt >= 100 AND age <= 2000 THEN 1 ELSE 2 END)
SORT BY amt
LIMIT 500;

Join‑side Optimizations

Broadcast small tables to the Map side (map‑join) to avoid skewed key distribution.

For large‑large joins, whitelist hot keys, process them separately, and then merge results.

Avoid Cartesian products caused by incorrect join conditions.

Validate join‑key cardinality; low distinct counts increase the risk of data explosion.

Split heavy aggregation into multiple queries to prevent intermediate result blow‑up.

Conclusion

Big data SQL optimization requires a broad knowledge base, careful analysis of execution plans, and understanding of query‑engine design; the techniques presented address common issues and provide actionable solutions for practitioners.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Hive sql-optimization ODPS Data Skew data expansion

Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.