Understanding and Mitigating Data Skew in Spark and Hadoop

Data skew in Spark and Hadoop occurs when a few keys dominate shuffle traffic, causing slow tasks, OOM errors, and job failures; the article describes how to detect skew via UI metrics or sampling and offers mitigation tactics such as filtering keys, increasing shuffle partitions, custom partitioners, broadcast joins, salted keys, and Hadoop‑specific settings.

vivo Internet Technology

Dec 25, 2019

This article explains the concept of data skew in distributed big‑data systems such as Spark and Hadoop, describes its harms, symptoms, and causes, and presents a set of mitigation techniques.

What is data skew? In an ideal distributed environment, adding nodes reduces overall execution time roughly linearly. Data skew occurs when the workload is unevenly distributed, e.g., one of three nodes processes 80% of the data while the other two handle only 10% each. Because a stage finishes only when its slowest task does, that one overloaded task determines the stage duration and parallelism suffers.
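To make the effect concrete, here is a back-of-the-envelope sketch in plain Python (no Spark involved). The record count and per-record cost are made-up numbers; the only point is that a stage takes as long as its slowest task.

```python
# Hypothetical workload: 1,000,000 records processed by three parallel tasks.
TOTAL_RECORDS = 1_000_000
SECONDS_PER_RECORD = 0.0001  # assumed constant processing cost per record


def stage_duration(shares):
    """A stage ends when its slowest task ends, so take the max task time."""
    return max(share * TOTAL_RECORDS * SECONDS_PER_RECORD for share in shares)


balanced = stage_duration([1 / 3, 1 / 3, 1 / 3])
skewed = stage_duration([0.8, 0.1, 0.1])
slowdown = skewed / balanced  # 0.8 / (1/3), i.e. about 2.4x slower
```

Under these assumptions the skewed stage runs about 2.4 times longer than the balanced one, even though the total amount of work is identical.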

Harms of data skew include excessive overall job runtime, out‑of‑memory (OOM) failures, and even complete application crashes, because the overloaded tasks consume a disproportionate share of resources.

Typical symptoms are:

Most tasks finish quickly, but a few run extremely slowly, causing the job to stall.

Sudden OOM errors in Spark jobs that previously ran fine.

Root cause: During shuffle operations (join, groupByKey, reduceByKey, etc.), all records that share a key are sent to the same reducer, so a key with a very large number of records overloads that reducer and creates an imbalance.

Detection: Use the Spark Web UI to inspect Shuffle Read Size / Records per task, or sample the key frequencies:

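// Sample 10% of the rows and list the ten most frequent keys.
// reduceByKey is an RDD operation, so convert the DataFrame with .rdd first.
df.select("key").rdd
    .sample(withReplacement = false, fraction = 0.1)
    .map(row => (row.get(0), 1)).reduceByKey(_ + _)
    .map(_.swap).sortByKey(ascending = false)
    .take(10)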

If a few keys dominate the distribution, skew is present.
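The same sampling idea can be tried out without a cluster. The following plain-Python sketch uses made-up data and hypothetical names (`sample_key_counts`, `keys`); it only illustrates the logic of counting keys in a sample and is not Spark code.

```python
import random
from collections import Counter


def sample_key_counts(keys, fraction=0.1, seed=0):
    """Keep each key with probability `fraction` and count what remains."""
    rng = random.Random(seed)
    return Counter(k for k in keys if rng.random() < fraction)


# Hypothetical data: one hot key plus 1,000 ordinary keys with 10 records each.
keys = ["hot"] * 50_000 + [f"user_{i}" for i in range(1_000)] * 10

counts = sample_key_counts(keys)
hot_key, hot_count = counts.most_common(1)[0]
hot_share = hot_count / sum(counts.values())  # share of the sample held by the top key
```

A top key that holds a large share of the sample, as `hot` does here, is the signal to look for.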

Mitigation strategies:

Filter abnormal keys: Remove or clean keys that cause extreme imbalance.

Increase shuffle parallelism: Adjust spark.sql.shuffle.partitions or RDD parallelism to spread keys across more tasks.
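A small simulation shows what more partitions can and cannot do. It is a sketch in plain Python that assumes integer keys and simple modulo partitioning; the data sets and the helper `max_partition_load` are hypothetical.

```python
from collections import Counter


def max_partition_load(keys, num_partitions):
    """Records in the busiest partition under modulo (hash-like) partitioning."""
    loads = Counter(key % num_partitions for key in keys)
    return max(loads.values())


# Hypothetical integer keys: 1,000 distinct keys with 10 records each ...
many_small_keys = [k for k in range(1_000) for _ in range(10)]
# ... and the same data plus one hot key (7) with 50,000 records.
with_hot_key = many_small_keys + [7] * 50_000

for n in (10, 100):
    print(n, max_partition_load(many_small_keys, n), max_partition_load(with_hot_key, n))
# prints:
# 10 1000 51000
# 100 100 50100
```

Going from 10 to 100 partitions cuts the busiest partition tenfold when many moderate keys collide, but barely changes it when one key dominates, because all records of a single key still land in the same partition.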

Custom partitioner:

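.groupByKey(new Partitioner() {
  @Override
  public int numPartitions() { return 12; }

  @Override
  public int getPartition(Object key) {
    int id = Integer.parseInt(key.toString());
    // Hot ids 9500000, 9500012, ..., 9500084 all share the same id % 12,
    // so give each of them its own partition (0 to 7) instead.
    if (id >= 9500000 && id <= 9500084 && ((id - 9500000) % 12) == 0) {
      return (id - 9500000) / 12;
    } else {
      // floorMod keeps the result non-negative even for negative ids
      return Math.floorMod(id, 12);
    }
  }
})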
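To see why the special case helps, the getPartition logic above can be mirrored in plain Python and compared with plain modulo partitioning. This is a sketch that assumes integer keys; the function names are hypothetical, and the id range is the one used in the partitioner.

```python
NUM_PARTITIONS = 12
HOT_START, HOT_END = 9_500_000, 9_500_084


def default_partition(key_id):
    """Plain modulo partitioning, as hashing an integer key would give."""
    return key_id % NUM_PARTITIONS


def custom_partition(key_id):
    """Mirror of the getPartition logic: spread the known hot ids."""
    offset = key_id - HOT_START
    if HOT_START <= key_id <= HOT_END and offset % NUM_PARTITIONS == 0:
        return offset // NUM_PARTITIONS
    return key_id % NUM_PARTITIONS


hot_ids = range(HOT_START, HOT_END + 1, 12)  # 9500000, 9500012, ..., 9500084
default_targets = sorted({default_partition(i) for i in hot_ids})
custom_targets = sorted({custom_partition(i) for i in hot_ids})
# default_targets == [8]
# custom_targets == [0, 1, 2, 3, 4, 5, 6, 7]
```

With modulo partitioning all eight hot ids collide in partition 8; the custom rule gives each of them a separate partition.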

Broadcast (Map‑side) Join to eliminate shuffle:

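from pyspark.sql.functions import broadcast
# A is the small table, B the large one. In a left outer join Spark can only
# broadcast the right side, so the broadcast table goes on the right.
result = B.join(broadcast(A), ["join_col"], "left")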
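What a map-side join does inside each task can be sketched in plain Python: the small table becomes an in-memory lookup, and every row of the large table is matched locally. The tables `orders` and `users` and the helper `map_side_left_join` are hypothetical, and this illustrates the idea rather than Spark's implementation.

```python
def map_side_left_join(large_rows, small_rows, key):
    """Left join large_rows with small_rows through an in-memory lookup table.

    Each task needs only its own slice of large_rows plus a full copy of the
    small table, so no rows have to be regrouped by key across tasks.
    """
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for row in large_rows:
        matches = lookup.get(row[key])
        if matches is None:
            joined.append(dict(row))  # keep unmatched left rows, as a left join does
        else:
            joined.extend({**match, **row} for match in matches)
    return joined


# Hypothetical tables: a skewed fact table and a small dimension table.
orders = [{"join_col": 1, "amount": 10}] * 5 + [
    {"join_col": 2, "amount": 7},
    {"join_col": 3, "amount": 1},
]
users = [{"join_col": 1, "name": "a"}, {"join_col": 2, "name": "b"}]
result = map_side_left_join(orders, users, "join_col")
```

Because the hot key (1) is resolved by a local dictionary lookup, its five rows never have to meet in a single reducer. The approach only works when the small table fits in memory.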

Split‑join‑union: Separate skewed keys, join them separately with a random prefix, then union with the normal join results.
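The steps can be sketched in plain Python on lists of (key, value) pairs. This is an illustration under assumptions, not Spark code: the helper names, the number of salts, and the sample data are hypothetical, and the set of skewed keys is assumed to be known in advance (for example from sampling).

```python
import random


def hash_join(left, right):
    """Inner join of two lists of (key, value) pairs on key."""
    index = {}
    for k, v in right:
        index.setdefault(k, []).append(v)
    return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]


def split_join_union(large, small, skewed_keys, salts=4, seed=0):
    rng = random.Random(seed)
    # 1. Split both sides into a skewed part and a normal part.
    large_skewed = [(k, v) for k, v in large if k in skewed_keys]
    large_normal = [(k, v) for k, v in large if k not in skewed_keys]
    small_skewed = [(k, v) for k, v in small if k in skewed_keys]
    small_normal = [(k, v) for k, v in small if k not in skewed_keys]
    # 2. Salt the large skewed side randomly; replicate the small side once per salt
    #    so that every salted key still finds its match.
    salted_large = [((rng.randrange(salts), k), v) for k, v in large_skewed]
    salted_small = [((s, k), v) for k, v in small_skewed for s in range(salts)]
    # 3. Join each part, strip the salt, and union the results.
    skewed_part = [(k, lv, rv) for (_, k), lv, rv in hash_join(salted_large, salted_small)]
    return skewed_part + hash_join(large_normal, small_normal)


large = [("hot", i) for i in range(1000)] + [("a", 1), ("b", 2)]
small = [("hot", "H"), ("a", "A"), ("c", "C")]
result = split_join_union(large, small, {"hot"})
```

The result contains the same rows as a direct join of `large` and `small`; the difference is that the hot key is now spread over `salts` distinct join keys, at the cost of replicating the matching rows of the small side.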

Salted keys with local aggregation: Add a random prefix to keys before shuffle, perform a local reduce, then remove the prefix and do a global reduce.

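import scala.util.Random

def antiSkew(): RDD[(String, Int)] = {
  val SPLIT = "-"
  // Draw the salt per record; one salt shared by all records would not spread a hot key
  pairs.map(t => (s"${Random.nextInt(10)}$SPLIT${t._1}", 1))
       .reduceByKey(_ + _)                          // local aggregation on salted keys
       .map(t => (t._1.split(SPLIT, 2)(1), t._2))   // strip the salt (keys may contain "-")
       .reduceByKey(_ + _)                          // global aggregation on original keys
}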
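The two-stage idea can be checked in plain Python. The sketch below uses hypothetical names and made-up data: it counts records per (salt, key) first and then merges the partial counts per key.

```python
import random
from collections import Counter


def salted_count(keys, salts=10, seed=0):
    """Two-stage count: aggregate on (salt, key) first, then on key alone."""
    rng = random.Random(seed)
    # Stage 1: every record draws its own salt, so a hot key is split into
    # up to `salts` sub-keys that different tasks could reduce in parallel.
    partial = Counter((rng.randrange(salts), key) for key in keys)
    # Stage 2: drop the salt and merge the partial counts.
    total = Counter()
    for (_, key), count in partial.items():
        total[key] += count
    return partial, total


keys = ["hot"] * 10_000 + ["a"] * 3 + ["b"] * 2
partial, total = salted_count(keys)
largest_partial = max(partial.values())  # biggest group any single reducer would see
```

The final counts match a direct count, while the largest group handled in the first stage is roughly a tenth of the hot key's 10,000 records. The second stage then merges at most `salts` partial results per key, which is cheap.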

Hadoop MapReduce also suffers from skew, typically manifesting as reducers stuck at 99.99% progress, OOM containers, and task kills. Common fixes include map‑side joins, converting count(distinct) to groupBy + count, enabling hive.map.aggr=true, and setting hive.groupby.skewindata=true for automatic load balancing.

Overall, the choice of mitigation depends on the data distribution, job size, and resource constraints. Simple parameter tuning is often tried first; if insufficient, more involved approaches like custom partitioners or salted joins are applied.
