Understanding Data Skew in Big Data: Causes, Symptoms, and Solutions for Hadoop and Spark
This article explains what data skew is, why it occurs in large-scale Hadoop and Spark jobs, and how to recognize symptoms such as stuck reducers or OOM executors. It then presents practical mitigation strategies: business-level adjustments, code refactoring, and platform-specific tuning.
0x00 Introduction
Data skew is an unavoidable obstacle in the big‑data field; when the amount of data to be processed reaches hundreds of millions or even billions of records, data skew becomes a massive barrier. If you can overcome it, the sky is the limit; if not, you may spend weeks or months troubleshooting bizarre issues caused by data skew.
Disclaimer:
The topic is broad and technically demanding; the author does his best to share his understanding and welcomes discussion on any inaccuracies.
Some examples are not perfectly rigorous; minor details do not affect the overall comprehension of the article.
Article Structure
Briefly explain what data skew is.
Describe several scenarios that lead to data skew.
Analyze the causes of data skew in Hadoop and Spark.
Provide solutions (optimizations) for data skew.
0x01 What Is Data Skew
Simply put, data skew occurs when data is not spread evenly across the cluster, so a large share of it lands on one or a few machines. Those machines take far longer than the average to finish, and the entire computation has to wait for them.
Keyword: Data Skew
Most data engineers have encountered data skew, which can appear in various stages of data development, for example:
Hive reduce stage stuck at 99.99%.
A Spark Streaming job repeatedly hits OOM on one executor while the other executors use very little memory.
These problems often make us wait for hours for a job that never finishes.
Keyword: Hundred‑Billion Scale
Why emphasize such massive data volumes? The author shares his initial perception of data size and illustrates two contrasting company scenarios to show that the same operation (e.g., a join) can be trivial for a small dataset but catastrophic for a hundred‑billion‑record dataset.
0x02 What Data Skew Looks Like
The author describes typical symptoms observed in Hadoop and Spark.
1. Data Skew in Hadoop
In Hadoop, data skew mainly manifests as the reduce phase stuck at 99.99% and never completing.
One or a few reducers are stuck.
Containers report OOM.
Data read/write volume for the stuck reducers is far larger than for normal reducers.
These symptoms often lead to task kills and other strange behaviors.
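One way to confirm the third symptom is to compare each task's input volume against the typical task. The following is a minimal Python sketch; the task names, the byte counts, and the 5x threshold are all invented for illustration and are not platform defaults.

```python
from statistics import median

def find_skewed_tasks(bytes_per_task, factor=5.0):
    """Return IDs of tasks whose input is more than `factor` times the median.

    `bytes_per_task` maps a task ID to the bytes it read. The default
    factor of 5 is an arbitrary choice for this sketch.
    """
    typical = median(bytes_per_task.values())
    return [task for task, size in bytes_per_task.items()
            if size > factor * typical]

# Hypothetical counters copied by hand from a job's task list.
reducer_input = {
    "reducer_0": 210_000_000,
    "reducer_1": 195_000_000,
    "reducer_2": 9_800_000_000,   # the reducer that never finishes
    "reducer_3": 205_000_000,
}

print(find_skewed_tasks(reducer_input))  # ['reducer_2']
```

The median is used instead of the mean because a single huge task would drag the mean upward and hide itself.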
2. Data Skew in Spark
Common Spark data‑skew symptoms include:
Executor lost, OOM, or shuffle errors.
Driver OOM.
One executor runs for an excessively long time, causing the whole job to stall.
Sudden failure of a normally running task.
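Skew surfaces even more readily in Spark Streaming, especially when the job contains join or group operations. Streaming jobs usually run with modest memory, so even moderate skew can easily push an executor into OOM.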
0x03 The Principle Behind Data Skew
1. Root Causes of Data Skew
Operations such as count(distinct), group by, and join trigger a shuffle, which sends all records with the same key to the same node. When one key accounts for a disproportionate share of the records, that node becomes a hotspot.
2. The Evil Shuffle
Both Hadoop and Spark rely on shuffle to redistribute data. When the key distribution is uneven, a large portion of the data ends up on a single node, which becomes the bottleneck for the whole job.
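To make the hotspot concrete, here is a minimal Python sketch of hash partitioning, the usual way a shuffle assigns keys to reducers. The key names, record counts, and helper functions are invented for illustration, and zlib.crc32 merely stands in for whatever hash function the engine actually uses.

```python
import zlib
from collections import Counter

def partition_of(key: str, num_partitions: int) -> int:
    """Assign a key to a partition the way a hash partitioner would.

    zlib.crc32 is an assumption standing in for the engine's hash; it is
    deterministic across runs, unlike Python's built-in hash() on strings.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_sizes(keys, num_partitions: int) -> list:
    """Count how many records each partition receives after the shuffle."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# Hypothetical workload: one hot key with 90,000 records,
# plus 10,000 distinct keys with one record each.
keys = ["hot_key"] * 90_000 + [f"user_{i}" for i in range(10_000)]

sizes = partition_sizes(keys, num_partitions=8)
print("records per partition:", sizes)
print("largest / average:", max(sizes) / (sum(sizes) / len(sizes)))
```

Raising the partition count does not help in this situation: every record of the hot key still hashes to the same partition.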
3. Understanding Skew from a Data Perspective
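Consider two example tables, user and ip, that are joined on the IP field. If missing IPs are stored with a default value (for example, null mapped to 0), every such record shares one key, so the join sends all of them to the same node.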
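A quick frequency check on the join key usually exposes this kind of default-value concentration before the join is ever run. The sketch below assumes a made-up user table in which 70% of the rows carry the placeholder IP 0; the table contents and the helper name are hypothetical.

```python
from collections import Counter

def top_key_share(keys):
    """Return the most frequent key and the fraction of records it holds."""
    counts = Counter(keys)
    total = sum(counts.values())
    key, n = counts.most_common(1)[0]
    return key, n / total

# Hypothetical user table as (user_id, ip) pairs.
users = [(uid, 0) for uid in range(7_000)]                     # null IP stored as 0
users += [(uid, 1_000 + uid) for uid in range(7_000, 10_000)]  # real, distinct IPs

key, share = top_key_share(ip for _, ip in users)
print(f"most common join key: {key}, share of rows: {share:.0%}")
```

If a single key holds a large share of the rows, it is worth asking whether that key is real data or a placeholder before deciding how to treat it.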
4. Understanding Skew from a Business Perspective
Business scenarios directly affect data distribution; a sudden promotion in two cities can cause order counts to explode, making a group‑by operation highly skewed.
0x04 How to Solve Data Skew
Data skew can often be mitigated with platform‑independent methods such as better data preprocessing and outlier filtering.
1. General Strategies
Optimize business logic (e.g., compute hot cities separately and merge later).
Refactor code; for example, replace a single count(distinct) reduce with a two‑step approach: first group by, then count.
Tune platform parameters (Hadoop or Spark) to alleviate skew.
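The count(distinct) refactoring mentioned above can be sketched in plain Python. In SQL terms it corresponds to rewriting SELECT COUNT(DISTINCT uid) FROM t as SELECT COUNT(*) FROM (SELECT uid FROM t GROUP BY uid) x, where the table and column names are hypothetical. The partition count and the use of zlib.crc32 as the hash are assumptions made for this illustration.

```python
import zlib
from collections import defaultdict

def count_distinct_two_step(values, num_partitions=4):
    """Count distinct values in two steps instead of one global reduce.

    Step 1 (the "group by"): route each value to a partition by its hash
    and deduplicate inside that partition; partitions could run in parallel.
    Step 2 (the "count"): add up the per-partition distinct counts. This is
    valid because equal values always land in the same partition.
    """
    partitions = defaultdict(set)
    for v in values:
        p = zlib.crc32(str(v).encode("utf-8")) % num_partitions
        partitions[p].add(v)                          # step 1: dedupe locally
    return sum(len(s) for s in partitions.values())   # step 2: count

# Hypothetical input: 100,000 page views from 1,000 distinct users.
user_ids = [i % 1_000 for i in range(100_000)]

result = count_distinct_two_step(user_ids)
print(result)  # 1000
```

The point of the rewrite is that deduplication is spread over many workers, so the final counting step only sees one row per distinct value.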
2. Data‑Level Solutions
When data distribution is uneven:
Lossy method: filter out abnormal data (e.g., IP = 0).
Lossless methods: compute skewed keys separately, add a hash layer to disperse keys, or perform data preprocessing.
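The "hash layer" idea is often called salting: a random suffix spreads one hot key over several intermediate keys, and a second aggregation merges the partial results. Below is a minimal sketch for a per-key count, with invented city names and an arbitrary choice of eight salts.

```python
import random
from collections import Counter

def salted_count(keys, num_salts=8, seed=0):
    """Count records per key with a two-stage, salted aggregation.

    Stage 1: pair each key with a random salt so one hot key is spread over
    up to `num_salts` intermediate keys, then aggregate per (key, salt).
    Stage 2: drop the salt and merge the partial counts per original key.
    """
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    # Stage 1: partial aggregation on the salted key.
    partial = Counter((k, rng.randrange(num_salts)) for k in keys)
    # Stage 2: final aggregation on the original key.
    final = Counter()
    for (k, _salt), n in partial.items():
        final[k] += n
    return partial, final

# Hypothetical orders: two promoted cities dominate the data.
orders_by_city = ["city_a"] * 50_000 + ["city_b"] * 40_000 + ["city_c"] * 100

partial, final = salted_count(orders_by_city)
print("largest stage-1 group:", max(partial.values()))
print(final["city_a"], final["city_b"], final["city_c"])  # 50000 40000 100
```

This works directly for aggregates that can be merged from partial results, such as counts and sums. An average would need to carry a sum and a count through both stages.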
3. Hadoop‑Specific Optimizations
Use map‑join.
Convert count(distinct) to a two‑step group‑then‑count.
Enable hive.groupby.skewindata=true.
Use left‑semi join.
Enable compression for map-side output and intermediate results. This does not remove skew, but it cuts disk I/O and network transfer.
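Of the options above, map-join is the one that sidesteps the shuffle entirely, which is why it helps with skewed join keys. The following Python sketch shows the idea under simplifying assumptions: the small table fits in memory, its keys are unique, and the join is an inner join. The table contents are invented.

```python
def map_join(large_rows, small_rows):
    """Join two tables on their first field without a shuffle.

    The small table is loaded into an in-memory dict, standing in for the
    copy every mapper would hold. Each mapper then streams its slice of the
    large table and looks keys up locally, so no key is routed to a single
    reducer. Inner-join semantics; assumes unique keys in `small_rows`.
    """
    lookup = {key: value for key, value in small_rows}   # build side
    for key, payload in large_rows:                       # probe side
        if key in lookup:
            yield key, payload, lookup[key]

# Hypothetical tables: a small city dimension and a larger, skewed order log.
city_dim = [(1, "city_a"), (2, "city_b")]
order_log = [(1, f"order_{i}") for i in range(5)] + [(2, "order_x"), (3, "order_y")]

joined = list(map_join(order_log, city_dim))
print(len(joined))  # 6
print(joined[0])    # (1, 'order_0', 'city_a')
```

In Hive this behavior is typically requested with the MAPJOIN hint or enabled automatically through hive.auto.convert.join. It only applies when one side of the join is small enough to fit in each task's memory.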
4. Spark‑Specific Optimizations
Use map-join (known as a broadcast join in Spark).
Enable RDD compression.
Allocate sufficient driver memory.
Apply Spark SQL optimizations similar to Hive.
0xFF Summary
Data skew remains a significant challenge; handling it is a long‑term effort, and the ideas presented here aim to help practitioners mitigate its impact.
Further detailed topics such as Hive SQL tuning and data‑cleaning pitfalls will be covered in future articles.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.