Big Data 31 min read

Data Skew Optimization Techniques in Spark

This article explains the phenomenon, causes, detection methods, and a comprehensive set of solutions—including Hive preprocessing, key filtering, shuffle parallelism, two‑stage aggregation, map‑join, sampling, random prefixing, and combined strategies—to mitigate data skew in Spark jobs and improve performance.

Big Data Technology & Architecture

May 30, 2019

Data Skew Optimization Techniques in Spark

1. Overview of Optimization

Sometimes Spark jobs suffer from data skew, where a few tasks run extremely slowly while most finish quickly, leading to poor performance or OOM errors.

1.1 Symptoms of Data Skew

Most tasks finish within a minute, but a few take hours.

A job that previously ran fine suddenly throws an OOM exception caused by business code.

1.2 Why Data Skew Happens

During shuffle, Spark pulls all records with the same key to a single task for aggregation or join. If a particular key has a massive amount of data (e.g., 1 000 000 rows) while others have only a few, the task handling that key becomes a bottleneck, slowing the whole stage and possibly causing memory overflow.

The overall job progress is determined by the slowest task.

For example, the key "hello" may have 7 records spread across three nodes, all of which are shuffled to one task, making it 7 × slower than tasks handling "world" or "you" with a single record.

1.3 Locating Skewed Code

Data skew only occurs during shuffle. Common shuffle operators include distinct, groupByKey, reduceByKey, aggregateByKey, join, cogroup, repartition. If a job is slow, check which stage contains a shuffle operator.

Task Execution Too Slow

Identify the stage where the slowdown occurs. In yarn-client mode, logs show the current stage; in yarn-cluster mode, use Spark Web UI. The UI displays each task’s runtime and data size, helping confirm skew.

After locating the skewed stage, map it back to the code. The stage boundary is usually a shuffle operator (e.g., reduceByKey in the classic word‑count example).

val conf = new SparkConf()
val sc = new SparkContext(conf)

val lines = sc.textFile("hdfs://...")
val words = lines.flatMap(_.split(" "))
val pairs = words.map((_, 1))
val wordCounts = pairs.reduceByKey(_ + _)

wordCounts.collect().foreach(println(_))

This example shows stage0 (reading, mapping, shuffle write) and stage1 (shuffle read, reduceByKey, collect).

Task Unexpected OOM

When OOM occurs, examine the driver or YARN logs to find the offending line, which often contains a shuffle operator. However, do not assume OOM always means skew; verify with UI metrics.

1.4 Inspecting Key Distribution

After locating the skewed RDD/Hive table, examine key distribution to decide the appropriate mitigation technique. For Spark SQL, query the table; for RDDs, use countByKey() on a sample.

val sampledPairs = pairs.sample(false, 0.1)
val sampledWordCounts = sampledPairs.countByKey()
sampledWordCounts.foreach(println(_))

2. Solutions to Data Skew

2.1 Hive ETL Pre‑processing

When to use: The source data resides in a Hive table with highly uneven key distribution and the job frequently accesses it.

Approach: Pre‑aggregate or join data in Hive, producing a balanced intermediate table that Spark can read without shuffle operators.

Pros: Simple, eliminates Spark‑side skew.

Cons: Hive ETL itself may still experience skew; it merely moves the problem upstream.

2.2 Filter Out Skewed Keys

When to use: A few keys dominate the data volume and their removal does not affect business results.

Approach: Use where in Spark SQL or filter on RDDs to drop those keys. Dynamically detect heavy keys via sampling and filter them.

2.3 Increase Shuffle Parallelism

When to use: General skew mitigation; increase the number of shuffle partitions.

Approach: Specify a larger partition count, e.g., reduceByKey(1000) or set spark.sql.shuffle.partitions (default 200) to a higher value.

2.4 Two‑Stage Aggregation (Partial + Global)

When to use: Aggregation‑type shuffle operations such as reduceByKey or group by.

Approach: Add a random prefix to keys, perform a local aggregation, strip the prefix, then do a global aggregation.

// Step 1: add random prefix
JavaPairRDD<String, Long> randomPrefixRdd = rdd.mapToPair(...);
// Step 2: local aggregation
JavaPairRDD<String, Long> localAggrRdd = randomPrefixRdd.reduceByKey(...);
// Step 3: remove prefix
JavaPairRDD<Long, Long> removedRandomPrefixRdd = localAggrRdd.mapToPair(...);
// Step 4: global aggregation
JavaPairRDD<Long, Long> globalAggrRdd = removedRandomPrefixRdd.reduceByKey(...);

2.5 Convert Reduce‑Join to Map‑Join

When to use: One side of the join is small enough to broadcast.

Approach: Collect the small RDD, broadcast it, and perform a map‑side join, completely avoiding shuffle.

// Broadcast small RDD
final Broadcast<List<Tuple2<Long, Row>>> rdd1DataBroadcast = sc.broadcast(rdd1Data);
// Map‑side join
JavaPairRDD<String, Tuple2<String, Row>> joinedRdd = rdd2.mapToPair(...);

2.6 Sample Skewed Keys and Split Join

When to use: One side has a few extremely heavy keys while the other side is uniformly distributed.

Approach: Sample the heavy‑key RDD, identify top skewed keys, split them into a separate RDD, add random prefixes, and join with an expanded version of the other RDD.

// Identify skewed key
final Long skewedUserid = reversedSampledRDD.sortByKey(false).take(1).get(0)._2;
// Split and prefix
JavaPairRDD<Long, String> skewedRDD = rdd1.filter(...);
JavaPairRDD<Long, String> commonRDD = rdd1.filter(...);
// Expand the other side
JavaPairRDD<String, Row> skewedRdd2 = rdd2.filter(...).flatMapToPair(...);
// Join separately and union results
JavaPairRDD<Long, Tuple2<String, Row>> joinedRDD1 = skewedRDD.mapToPair(...).join(...);
JavaPairRDD<Long, Tuple2<String, Row>> joinedRDD2 = commonRDD.join(rdd2);
JavaPairRDD<Long, Tuple2<String, Row>> joinedRDD = joinedRDD1.union(joinedRDD2);

2.7 Random Prefix + Expansion for Join

When to use: Both sides contain many skewed keys; a full‑scale expansion is required.

Approach: Expand the uniformly distributed RDD by replicating each record N times, add a random prefix to the skewed‑key RDD, then join.

// Expand uniformly distributed RDD
JavaPairRDD<String, Row> expandedRDD = rdd1.flatMapToPair(...);
// Prefix skewed RDD
JavaPairRDD<String, String> mappedRDD = rdd2.mapToPair(...);
// Join
JavaPairRDD<String, Tuple2<String, Row>> joinedRDD = mappedRDD.join(expandedRDD);

2.8 Combining Multiple Strategies

Complex skew scenarios often require a combination of the above methods: pre‑process with Hive, filter keys, increase shuffle parallelism, and apply two‑stage aggregation or map‑join as needed.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Data Skew Spark Shuffle

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.