Big Data 10 min read

Optimizing Big Data SQL: Handling Data Skew and Data Explosion

This article examines common performance issues in big data SQL queries, such as data skew and data explosion, and provides systematic troubleshooting steps and practical optimization techniques across the Map, Reduce, and Join stages, including partition merging, column pruning, predicate pushdown, and join strategies.

Big Data Technology & Architecture

Jun 16, 2023

Optimizing Big Data SQL: Handling Data Skew and Data Explosion

Background

Many big‑data query engines (Spark, Hive, Presto, etc.) are widely used because of their friendly SQL syntax. In our team we use ODPS SQL to detect potential security risks, and we observed that most performance‑bottleneck queries suffer from data skew and data explosion.

This article focuses on execution‑level SQL optimizations (not parameter tuning) and is divided into three parts: the causes of data skew and data explosion, how to locate them, and systematic optimization ideas.

Problem Overview

Data Skew

Data skew occurs when a large number of identical keys are sent to the same Reduce node, causing that node to process far more data than others and become a performance bottleneck.

Data Explosion

Data explosion refers to cases where the output data volume is much larger than the input (e.g., 100 MB input becomes 1 TB output), leading to low efficiency or even task failure.

Investigation &定位

When a business SQL runs unusually long or fails, follow these steps:

Check the input data volume for abnormal spikes (e.g., during large‑scale promotions).

Observe the runtime of each stage after task splitting and compare with normal days.

Identify tasks (or specific Task instances) that take significantly longer or process much more data than average.

Locate the problematic business logic based on the identified code lines.

Optimization Strategies

Data Skew

1. Map‑side Optimizations

1.1 Merge small files: Reduce the number of Map tasks by combining many small files into larger ones.

1.2 Column pruning: Avoid SELECT * and filter unnecessary columns, adding partition predicates when possible.

1.3 Predicate push‑down: Push filter expressions close to the data source to reduce data transferred.

1.4 Data redistribution: Use distribute by rand() to randomize key distribution before Reduce.

2. Reduce‑side Optimizations

2.1 Null‑key handling: Filter out or replace null values in the Map stage; if needed later, convert nulls to random values to avoid hot keys.

2.2 Sorting optimization: Prefer distribute by + sort by instead of global order by to avoid full data shuffles.

3. Join‑side Optimizations

3.1 Broadcast small table: Load the small table into Map memory and perform the join on the Map side (MapJoin in Hive/ODPS, broadcast join in Spark).

3.2 Large‑table‑to‑large‑table join: Split hot keys into a whitelist, process them separately, and then merge results to avoid hotspot‑induced tail latency.

Data Explosion

Avoid Cartesian products caused by missing join conditions.

Validate join‑key cardinality: Low distinctness increases the risk of data blow‑up.

Prevent excessive aggregation: Split complex aggregations into multiple steps to limit intermediate data size.

Conclusion

Big‑data SQL optimization requires a broad knowledge base, including understanding of execution plans and the underlying query engine design. This article summarizes practical solutions for common issues, providing a reference for engineers facing performance problems in large‑scale SQL workloads.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Data Skew Distributed Computing Data Explosion

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.