Understanding and Solving Data Skew in Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, illustrates typical symptoms, and presents practical strategies—including business‑level adjustments, code tweaks, and platform‑specific tuning—to mitigate and resolve skew in big‑data processing.

Architecture Digest

Apr 24, 2017

0x00 Introduction

Data skew is a common bottleneck in big‑data processing; when billions of records are handled, uneven data distribution can cause a few machines to become overloaded, dramatically slowing the whole job.

Left undiagnosed, a skew problem can consume weeks of troubleshooting.

0x01 What Is Data Skew

Data skew occurs when data is not spread evenly enough across the cluster, so one or a few nodes receive far more data than the rest. Those nodes finish much later than the average node, and because a stage cannot complete until its slowest task does, the whole job slows down.

Typical scenarios include Hive reduce stages stuck at 99.99% and Spark Streaming executors running out of memory (OOM) while other executors sit idle.
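To make the definition concrete, here is a minimal pure-Python sketch of how a hash partitioner distributes records when one key dominates. It only simulates the behaviour; the city names, record counts, and the use of CRC32 as the hash are assumptions for illustration, not what Hadoop or Spark use internally.

```python
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic stand-in for a shuffle's hash partitioner."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys, num_partitions):
    """Count how many records each partition would receive."""
    sizes = Counter(partition_of(k, num_partitions) for k in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]


# Hypothetical data: one hot city with 90,000 records,
# plus 1,000 ordinary cities with 10 records each.
keys = ["beijing"] * 90_000 + [f"city_{i}" for i in range(1_000)] * 10

sizes = partition_sizes(keys, num_partitions=8)

# Ratio of the largest partition to the average partition size.
# A value close to 1 means balanced; here the hot key pushes it far above 1.
skew_ratio = max(sizes) / (sum(sizes) / len(sizes))
print(sizes, round(skew_ratio, 2))
```

Because every record with the same key goes to the same partition, adding more partitions does not help the hot key: its 90,000 records still land on one task.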

0x02 Symptoms of Data Skew

In Hadoop, skew often shows up as reducers stuck at 99.99%, OOM errors in containers, and a single reducer reading and writing far more data than its peers.

In Spark, symptoms include lost executors, driver OOM, a single executor running far longer than the others, and sudden task failures, especially in streaming jobs that involve joins or group-by operations.

0x03 Causes of Data Skew

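Skew usually appears in operations that trigger a shuffle, such as count(distinct), group by, or join. A shuffle routes every record with the same key to the same node, so a key that appears far more often than the others overloads whichever node receives it.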

Uneven data distribution, business‑driven hot keys (e.g., a few cities generating massive order volume), or default placeholder values (e.g., IP = 0) can also create skew.
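Before choosing a fix, it helps to confirm which keys are actually hot. The sketch below counts key frequencies on a sample and reports keys whose share exceeds a threshold. The function name `find_hot_keys`, the 5% threshold, and the sample IP values are assumptions chosen for illustration.

```python
from collections import Counter


def find_hot_keys(keys, threshold=0.05):
    """Return {key: share} for keys holding more than `threshold` of all records."""
    counts = Counter(keys)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items() if c / total > threshold}


# Hypothetical sample of an IP column: the placeholder value "0"
# accounts for 600 of 1,000 records.
sample = (
    ["0"] * 600
    + ["10.0.0.1"] * 40
    + [f"10.0.1.{i}" for i in range(360)]
)

hot = find_hot_keys(sample)
print(hot)
```

On a real table the same idea is usually run as a group-by count over a sample rather than over the full data set.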

0x04 How to Solve Data Skew

1. Business‑Level Strategies

Separate hot‑spot data (e.g., specific cities) and compute their metrics independently before merging with the rest.
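The idea of handling hot-spot data separately can be sketched as follows: aggregate the hot cities in one pass, the remaining cities in another, and merge the two results. This is a simplified in-memory illustration; the function names and the sample orders are hypothetical, and in practice each pass would be its own job with its own resources.

```python
from collections import defaultdict


def total_by_city(orders):
    """Sum order amounts per city."""
    totals = defaultdict(int)
    for city, amount in orders:
        totals[city] += amount
    return dict(totals)


def total_by_city_split(orders, hot_cities):
    """Aggregate hot cities and the remaining cities separately, then merge."""
    hot = [o for o in orders if o[0] in hot_cities]
    rest = [o for o in orders if o[0] not in hot_cities]

    merged = total_by_city(rest)        # pass 1: ordinary cities
    # pass 2: hot cities. The key sets are disjoint, so update() cannot
    # overwrite a total computed in pass 1.
    merged.update(total_by_city(hot))
    return merged


# Hypothetical orders: (city, amount)
orders = [("beijing", 10)] * 1_000 + [("lhasa", 7), ("sanya", 3), ("lhasa", 5)]

split_result = total_by_city_split(orders, hot_cities={"beijing"})
print(split_result)
```

The split produces the same totals as a single aggregation; the benefit is operational, since the hot subset can be given more parallelism or a different algorithm.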

2. Program‑Level Adjustments

Rewrite count(distinct) as a two-step process: first group by the column being counted, so that duplicates are removed in parallel across many reducers, then count the resulting groups.
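A rough simulation of the two-step rewrite is shown below. Step 1 plays the role of the group by: values are routed to partitions by hash and deduplicated inside each partition. Step 2 counts the groups. The partition count and the CRC32 hash are assumptions for illustration only.

```python
import zlib


def count_distinct_two_step(values, num_partitions=4):
    """Count distinct values by deduplicating per partition, then summing."""
    # Step 1 ("group by value"): route each value to a partition by hash and
    # keep one copy per partition. Equal values always hash to the same
    # partition, so no value can be counted twice.
    partitions = [set() for _ in range(num_partitions)]
    for v in values:
        p = zlib.crc32(str(v).encode("utf-8")) % num_partitions
        partitions[p].add(v)

    # Step 2 ("count the groups"): partitions hold disjoint values,
    # so their sizes simply add up.
    return sum(len(p) for p in partitions)


# Hypothetical user IDs with many repeats.
user_ids = [i % 1_000 for i in range(50_000)]

distinct_users = count_distinct_two_step(user_ids)
print(distinct_users)
```

The deduplication work in step 1 is spread over all partitions, whereas a single global count(distinct) tends to funnel every value through one reducer.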

3. Parameter Tuning

Both Hadoop and Spark provide configuration options that mitigate skew (the specific settings are listed under the optimization methods below); tuning them resolves many cases without touching the code.

4. Data‑Side Solutions

Filter or preprocess abnormal data (e.g., remove records with IP = 0), hash hot keys to increase parallelism, or compute skewed partitions separately.
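Hashing hot keys to increase parallelism is often called salting. A minimal sketch, assuming a simple count aggregation: stage 1 appends a random salt to each key so a hot key is split into several smaller groups, and stage 2 strips the salt and merges the partial counts. The number of salts, the fixed seed, and the sample keys are illustrative assumptions.

```python
import random
from collections import Counter


def salted_count(keys, num_salts=8, seed=0):
    """Two-stage count: aggregate on (key, salt), then merge per key."""
    rng = random.Random(seed)

    # Stage 1: each record gets a random salt, so one hot key becomes
    # up to `num_salts` distinct shuffle keys that can run in parallel.
    partial = Counter((k, rng.randrange(num_salts)) for k in keys)

    # Stage 2: drop the salt and add up the partial counts. This stage sees
    # at most `num_salts` rows per original key, so it is cheap.
    final = Counter()
    for (k, _salt), count in partial.items():
        final[k] += count
    return partial, final


# Hypothetical data: one hot key and a few ordinary ones.
keys = ["hot"] * 8_000 + ["a", "b", "c"] * 10

partial, final = salted_count(keys)
largest_group = max(partial.values())
print(largest_group, final["hot"])
```

This works directly for aggregations that can be merged from partial results, such as count, sum, min, and max. Salting a join takes more care, because the other side of the join has to be replicated once per salt value.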

Hadoop Optimization Methods

Use map‑side join.

Transform count(distinct) into a group by followed by count.

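Set hive.groupby.skewindata=true, which makes Hive plan a skewed group by as two jobs: the first spreads rows randomly across reducers for partial aggregation, and the second merges the partial results by key.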

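Use left-semi join when only the existence of a match matters, so each left-side row is emitted at most once no matter how many matching rows the right side holds.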

Compress map‑side output and intermediate results to reduce I/O.

Spark Optimization Methods

Use map‑side join.

Enable RDD compression (spark.rdd.compress) to shrink serialized RDD partitions.

Allocate sufficient driver memory (spark.driver.memory), particularly when small tables are collected to the driver for broadcast.

Apply Spark SQL optimizations similar to Hive.
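Map-side join appears in both lists above, so it is worth sketching. The small table is loaded into memory on every map task and the large table is streamed past it, which means the large table is never shuffled by key and a hot key cannot pile up on one reducer. In Spark this corresponds to a broadcast join. The code below is a plain-Python illustration with hypothetical table contents, not engine code.

```python
def map_side_join(large_rows, small_table):
    """Inner-join a stream of (city_id, amount) rows against an in-memory dict.

    `small_table` maps city_id -> city_name and is assumed to fit in the
    memory of every task that runs this function.
    """
    joined = []
    for city_id, amount in large_rows:
        city_name = small_table.get(city_id)
        if city_name is not None:  # inner join: drop rows without a match
            joined.append((city_id, city_name, amount))
    return joined


# Hypothetical dimension table (small) and fact rows (large).
cities = {1: "Beijing", 2: "Shanghai"}
orders = [(1, 120), (1, 80), (2, 45), (3, 10)]

result = map_side_join(orders, cities)
print(result)
```

The technique only applies when one side is small enough to be held in memory; otherwise salting or splitting out the hot keys is the usual fallback.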

0xFF Summary

Data skew remains a significant challenge in large‑scale data processing; addressing it requires a combination of business insight, data preprocessing, code refactoring, and platform‑specific tuning. The techniques described here provide a solid starting point for mitigating skew in Hadoop and Spark workloads.
