Understanding and Solving Data Skew in Hadoop and Spark
This article explains what data skew is and why it occurs in large-scale Hadoop and Spark jobs, describes typical symptoms, and presents practical strategies (business-level adjustments, code changes, and platform-specific tuning) to mitigate and resolve it.
0x00 Introduction
Data skew is a common bottleneck in big‑data processing; when billions of records are handled, uneven data distribution can cause a few machines to become overloaded, dramatically slowing the whole job.
Left unaddressed, the problem can take weeks of troubleshooting to resolve.
0x01 What Is Data Skew
Data skew occurs when data is not spread evenly enough across nodes, so that one or a few nodes receive a disproportionate share of the records. Those nodes finish far later than the average, and because a job completes only when its slowest task completes, the whole job slows down.
Typical scenarios include a Hive reduce stage stuck at 99.99% and Spark Streaming executors running out of memory (OOM) while other executors sit idle.
0x02 Appearance of Data Skew
In Hadoop, skew often shows as reducers stuck at 99.99%, OOM errors in containers, and massive read/write volume on a single reducer.
In Spark, symptoms include executor loss, driver OOM, long‑running single executors, and sudden task failures, especially in streaming jobs that involve joins or group‑by operations.
0x03 Causes of Data Skew
Skew is usually triggered by operations such as count(distinct), group by, or join. Each of these causes a shuffle that sends all records with the same key to the same node, so a key that carries a large share of the records overloads that node.
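The effect of a shuffle on a hot key can be reproduced in a few lines. The following is a minimal sketch, not how Hadoop or Spark partition internally: `zlib.crc32` stands in for the engine's hash function, and the city names, record counts, and partition count are made-up values for illustration.

```python
import zlib
from collections import Counter

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic hash partitioner (crc32 is a stand-in for the engine's hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_sizes(keys, num_partitions):
    """Count how many records land in each shuffle partition."""
    sizes = Counter(partition_for(k, num_partitions) for k in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]

# Hypothetical workload: one hot city with 90,000 records
# plus 1,000 small cities with 10 records each.
keys = ["beijing"] * 90_000 + [f"city_{i}" for i in range(1_000)] * 10

sizes = partition_sizes(keys, num_partitions=8)
print("records per partition:", sizes)
print("max / mean load:", max(sizes) / (sum(sizes) / len(sizes)))
```

Whichever partition receives the hot key holds at least 90% of the records, and adding partitions does not help because a single key cannot be split by hashing alone.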
Uneven data distribution, business‑driven hot keys (e.g., a few cities generating massive order volume), or default placeholder values (e.g., IP = 0) can also create skew.
0x04 How to Solve Data Skew
1. Business‑Level Strategies
Separate hot‑spot data (e.g., specific cities) and compute their metrics independently before merging with the rest.
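As a rough illustration of this split-then-merge idea, the sketch below aggregates hot cities and the remaining cities in two independent passes and then combines the results. The function names, the `(city, amount)` record shape, and the sample data are assumptions made for the example; in a real pipeline the two passes would be separate jobs or queries.

```python
from collections import defaultdict

def total_by_city(orders):
    """Sum order amounts per city."""
    totals = defaultdict(float)
    for city, amount in orders:
        totals[city] += amount
    return dict(totals)

def split_and_merge(orders, hot_cities):
    """Aggregate hot cities and the rest separately, then merge the results."""
    hot = [o for o in orders if o[0] in hot_cities]
    rest = [o for o in orders if o[0] not in hot_cities]
    # The two aggregations are independent, so the hot part can be given
    # its own resources or its own schedule.
    merged = total_by_city(rest)
    # The key sets are disjoint, so update() cannot overwrite a result.
    merged.update(total_by_city(hot))
    return merged

# Hypothetical sample data.
orders = [("beijing", 10.0), ("beijing", 5.0), ("lhasa", 2.0)]
print(split_and_merge(orders, hot_cities={"beijing"}))
```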
2. Program‑Level Adjustments
Rewrite count(distinct) as a two‑step process: first group by the key, then count the groups.
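The reason the two-step form helps is that deduplication can run in parallel across partitions, whereas a single global count(distinct) tends to funnel every value through one reducer. The sketch below imitates that in plain Python under stated assumptions: `count_distinct_two_step` is a hypothetical helper, and `zlib.crc32` stands in for the shuffle hash.

```python
import zlib

def count_distinct_two_step(values, num_partitions=4):
    """Count distinct values by deduplicating per partition, then summing."""
    # Step 1 ("group by"): route each value to a partition by hash and
    # deduplicate inside that partition. Partitions could run in parallel.
    partitions = [set() for _ in range(num_partitions)]
    for v in values:
        p = zlib.crc32(str(v).encode("utf-8")) % num_partitions
        partitions[p].add(v)
    # Step 2 ("count"): a value lives in exactly one partition, so the
    # per-partition group counts can simply be added up.
    return sum(len(p) for p in partitions)

# Hypothetical input: 50,000 events from 1,000 distinct user ids.
user_ids = [i % 1000 for i in range(50_000)]
print(count_distinct_two_step(user_ids))
```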
3. Parameter Tuning
Both Hadoop and Spark provide configuration options to mitigate skew; proper tuning resolves many cases, although it does not remove a hot key from the data itself.
4. Data‑Side Solutions
Filter or preprocess abnormal data (e.g., remove records with IP = 0), hash hot keys to increase parallelism, or compute skewed partitions separately.
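One common way to "hash hot keys" is salting: append a random suffix to each hot key so its records spread over several groups, aggregate, then strip the suffix and aggregate again. The sketch below shows that two-stage pattern in plain Python. The function name, the salt count, and the sample IPs are assumptions for illustration, and the list filter at the top stands in for the preprocessing step that drops the placeholder value.

```python
import random
from collections import Counter

def salted_count(keys, hot_keys, num_salts=8, seed=0):
    """Two-stage count: salt hot keys, count partially, then merge."""
    rng = random.Random(seed)
    # Stage 1: each hot-key record gets a random salt, so the hot key is
    # split into up to num_salts smaller groups that can be processed apart.
    partial = Counter()
    for k in keys:
        salt = rng.randrange(num_salts) if k in hot_keys else 0
        partial[(k, salt)] += 1
    # Stage 2: drop the salt and merge the partial counts per original key.
    final = Counter()
    for (k, _salt), n in partial.items():
        final[k] += n
    return partial, final

# Hypothetical input; "0" plays the role of the placeholder IP.
raw = ["0"] * 50 + ["10.0.0.1"] * 1_000 + ["10.0.0.2"] * 3
records = [ip for ip in raw if ip != "0"]  # filter abnormal data first

partial, final = salted_count(records, hot_keys={"10.0.0.1"}, num_salts=4)
print("largest stage-1 group:", max(partial.values()))
print("final counts:", dict(final))
```

The final counts are the same as a direct count; only the size of the largest intermediate group changes. Salting works for aggregations that can be merged from partial results, such as counts and sums.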
Hadoop Optimization Methods
Use map‑side join.
Transform count(distinct) into a group by followed by count.
Enable hive.groupby.skewindata=true.
Leverage left‑semi join.
Compress map‑side output and intermediate results to reduce I/O.
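The map-side join at the top of this list avoids skew because the join no longer shuffles by key: each mapper holds a full copy of the small table and joins the records it reads locally. The sketch below is a plain-Python picture of that idea, assuming the small table fits in memory; `map_side_join`, the row shapes, and the sample tables are hypothetical.

```python
def map_side_join(large_rows, small_rows, key):
    """Inner-join a large stream against a small in-memory table."""
    # Build a lookup from the small table. In a real job, every mapper
    # would hold its own copy of this structure.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Stream the large table; each record is joined where it is read,
    # so no record is moved to another node based on its key.
    for row in large_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}

# Hypothetical tables: a large orders stream and a small city dimension.
orders = [
    {"city_id": 1, "amount": 10},
    {"city_id": 2, "amount": 5},
    {"city_id": 3, "amount": 7},  # no matching city, dropped by the inner join
]
cities = [
    {"city_id": 1, "name": "Beijing"},
    {"city_id": 2, "name": "Shanghai"},
]

joined = list(map_side_join(orders, cities, "city_id"))
print(joined)
```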
Spark Optimization Methods
Use map‑side join.
Enable RDD compression (spark.rdd.compress=true).
Allocate sufficient driver memory.
Apply Spark SQL optimizations similar to Hive.
0xFF Summary
Data skew remains a significant challenge in large‑scale data processing; addressing it requires a combination of business insight, data preprocessing, code refactoring, and platform‑specific tuning. The techniques described here provide a solid starting point for mitigating skew in Hadoop and Spark workloads.
Architecture Digest