How to Tackle Spark Data Skew: Practical Solutions and Real‑World Examples

This article explains what Spark data skew is, why it hurts performance, and presents six practical mitigation techniques—including adjusting parallelism, custom partitioners, map‑side joins, and adding random prefixes—backed by detailed experiments, code snippets, and performance comparisons.

dbaplus Community

Author: Guo Jun, big‑data architect familiar with Kafka, Flume, Hadoop, Spark, data‑warehouse modeling and SQL tuning. Blog: http://www.jasongj.com/.

Why Handle Data Skew

In large-scale data systems such as Spark or Hadoop, sheer data volume is usually not the main problem; data skew is often the more serious bottleneck. Data skew occurs when one partition (e.g., a Spark or Kafka partition) contains far more records than the others. Because a stage finishes only when its slowest task finishes, the task processing that partition dominates the overall job duration while the remaining tasks sit idle.

How to Mitigate / Eliminate Data Skew

1. Avoid Skew at the Data Source

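When reading Kafka via DirectStream, each Kafka partition maps to one Spark partition, and therefore to one task. If the Kafka producer uses a random partitioner, messages are spread evenly across partitions, so the Spark job starts without skew. However, business requirements may force all messages with the same key into the same partition so that they are consumed in order (e.g., user-level PV), and a few hot keys can then re-introduce skew.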
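The difference between the two producer strategies can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Kafka client code: the function names, the traffic pattern (one hot user among 1,000 ordinary ones), and the use of `user_id % num_partitions` as the key partitioner are all assumptions made for illustration.

```python
import random
from collections import Counter

def assign_random(messages, num_partitions, rng):
    """Pick a partition at random for each message, ignoring its key."""
    return Counter(rng.randrange(num_partitions) for _ in messages)

def assign_by_key(messages, num_partitions):
    """Send every message with the same key (user id) to the same partition."""
    return Counter(user_id % num_partitions for user_id in messages)

# Hypothetical traffic: user 7 produces 9,000 page views,
# 1,000 other users produce one page view each.
messages = [7] * 9000 + list(range(1000, 2000))
rng = random.Random(42)  # fixed seed so the run is repeatable

random_counts = assign_random(messages, 4, rng)
keyed_counts = assign_by_key(messages, 4)

print("random:", sorted(random_counts.items()))
print("by key:", sorted(keyed_counts.items()))
```

With key-based assignment, the partition that receives user 7 holds 9,250 of the 10,000 messages, while random assignment leaves each of the four partitions with roughly a quarter of them.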

2. Adjust Parallelism to Disperse Keys

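Shuffle in Spark uses HashPartitioner by default, which assigns each key to a partition based on the key's hash modulo the number of partitions. With an unsuitable parallelism, many distinct keys may be assigned to the same task, creating skew. Changing the shuffle parallelism (usually by increasing it) spreads those keys across more tasks, reducing the data volume per task. This helps only when the skew comes from many different keys colliding; all records of a single hot key still land in one task.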

Example experiment:

Using a test table student_external with 1.05 billion rows, we selected 150 million rows (id 9 × 10⁸–10.5 × 10⁸) and artificially skewed a subset so that one task processed 45 million rows while each of the others processed only 5 million. With groupByKey(12), the slowest task took 38 s and handled 9× as many records as any other task. Raising the shuffle parallelism to 48 reduced the maximum task record count to ~11.25 million and the duration to 24 s. Decreasing the parallelism to 11 also improved balance, which shows that what matters is how keys map onto partitions rather than simply having more of them; the optimal parallelism depends on the data distribution.

spark-submit \
  --queue ambari \
  --num-executors 4 \
  --executor-cores 12 \
  --executor-memory 12g \
  --class com.jasongj.spark.driver.SparkDataSkew \
  --master yarn \
  --deploy-mode client \
  SparkExample-with-dependencies-1.0.jar
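The effect of the three parallelism settings can be reproduced in miniature. The sketch below is a plain-Python simulation, not the experiment's Spark job: it assumes integer keys whose hash equals their value (as with Java's Integer.hashCode), and the key set is made up so that the extra keys all collide when the parallelism is 12.

```python
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Count records per partition under hash partitioning (int hash = value)."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# Hypothetical data: 1,200 evenly spread keys plus 800 extra keys
# that all satisfy key % 12 == 8.
uniform_keys = list(range(1200))
hot_keys = [12 * k + 8 for k in range(100, 900)]
keys = uniform_keys + hot_keys

for n in (12, 48, 11):
    sizes = partition_sizes(keys, n)
    print(f"parallelism={n:2d} max={max(sizes)} min={min(sizes)}")
```

With these keys, a parallelism of 12 leaves one partition with 900 records against 100 in each of the others, 48 cuts the largest partition to 225, and 11 yields nearly equal partitions because the colliding keys no longer share a remainder.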

3. Use a Custom Partitioner

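Replacing the default HashPartitioner with a custom implementation gives explicit control over which task each key is sent to. In the same test scenario, keeping the parallelism at 12 and applying a custom partitioner resulted in the longest task processing ~10 million rows in 15 s, with all tasks handling comparable data sizes. Like adjusting parallelism, this only helps when the skew is caused by different keys colliding, and it requires knowing the key distribution in advance.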
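A custom partitioner boils down to a function from key to partition index. The following is a minimal sketch in plain Python, assuming integer keys; the class name SkewAwarePartitioner, its hot_residue parameter, and the made-up key set are hypothetical and do not reproduce the partitioner used in the experiment.

```python
from collections import Counter

class SkewAwarePartitioner:
    """Hypothetical partitioner that re-routes keys known to collide under key % n."""

    def __init__(self, num_partitions, hot_residue):
        self.num_partitions = num_partitions
        self.hot_residue = hot_residue

    def get_partition(self, key):
        if key % self.num_partitions == self.hot_residue:
            # Colliding keys share a remainder but differ in their quotient,
            # so the quotient is used to spread them out.
            return (key // self.num_partitions) % self.num_partitions
        return key % self.num_partitions

def partition_sizes(keys, partition_fn, num_partitions):
    """Count how many keys each partition receives."""
    counts = Counter(partition_fn(key) for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# Hypothetical data: 1,200 evenly spread keys plus 800 keys with key % 12 == 8.
keys = list(range(1200)) + [12 * k + 8 for k in range(100, 900)]

default_sizes = partition_sizes(keys, lambda key: key % 12, 12)
custom = SkewAwarePartitioner(num_partitions=12, hot_residue=8)
custom_sizes = partition_sizes(keys, custom.get_partition, 12)

print("default max:", max(default_sizes))
print("custom  max:", max(custom_sizes))
```

On this data the largest partition shrinks from 900 records to 175 without changing the number of partitions. The rule is tailored to one known collision pattern, which is the main cost of the approach: a different data set needs a different rule.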

4. Convert Reduce‑Side Join to Map‑Side Join (Broadcast Join)

Broadcasting a small table to every executor eliminates the shuffle required for a reduce-side join, and with it any skew that shuffle would introduce. In the example, joining a large table (150 million rows) with a small table (500,000 rows) using a regular join produced a three-stage DAG with severe skew in the join stage (slowest task 7.1 min). Setting spark.sql.autoBroadcastJoinThreshold (a size in bytes) above the size of the small table and using a broadcast join reduced the DAG to a single stage, with total execution time dropping from 7.3 min to 1.5 min and no observable skew. The technique applies only when one side of the join is small enough to be held in memory on the driver and on every executor.
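The mechanics of a map-side join can be sketched without Spark. In the plain-Python illustration below, the function name map_side_join and the toy rows are assumptions; the point is that each partition of the large table is joined against an in-memory copy of the small table, so no rows are regrouped by key.

```python
def map_side_join(large_partitions, small_table):
    """Inner-join each partition of the large table against an in-memory lookup.

    large_partitions: list of partitions, each a list of (key, value) rows.
    small_table: list of (key, value) rows, small enough to copy to every task.
    """
    lookup = {}
    for key, value in small_table:
        lookup.setdefault(key, []).append(value)  # tolerate duplicate keys

    joined = []
    for partition in large_partitions:
        # Rows stay in their original partition: nothing is moved by key.
        out = [(key, left, right)
               for key, left in partition
               for right in lookup.get(key, [])]
        joined.append(out)
    return joined

large = [[(1, "a"), (1, "b"), (2, "c")], [(1, "d"), (3, "e")]]
small = [(1, "x"), (2, "y")]

result = map_side_join(large, small)
print(result)
```

Key 1 dominates the large table here, yet its rows are joined wherever they already sit, which is why a skewed key distribution does not translate into a skewed task.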

5. Add Random Prefix/Suffix to Skewed Keys

Adding a random prefix (or suffix) to a heavily skewed key turns one key into several distinct keys, allowing its records to be processed by different tasks. For a join, the matching rows on the other side must be copied once per possible prefix so that every prefixed key still finds its partner; the non-skewed keys are joined as usual and the two results are combined. After the join, the prefixes are removed to restore the original keys. In the test, two skewed keys (9500048 and 9500096) were each expanded to 4 million rows, while other keys had only 100 rows each. Adding a random prefix from 1 to 24 and joining with parallelism 48 eliminated the skew, reducing the join stage from 1.7 min to 33 s.
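The steps of a prefixed join on a few known skewed keys can be sketched in plain Python. This is an illustration under stated assumptions rather than the experiment's code: hash_join and salted_join are hypothetical helper names, the row counts are scaled down, and only the two key values are taken from the test described above.

```python
import random
from collections import defaultdict

def hash_join(left, right):
    """Plain inner join of two lists of (key, value) rows."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]

def salted_join(left, right, skewed_keys, num_salts, rng):
    # Split both sides into rows for skewed keys and everything else.
    left_hot = [(k, v) for k, v in left if k in skewed_keys]
    left_rest = [(k, v) for k, v in left if k not in skewed_keys]
    right_hot = [(k, v) for k, v in right if k in skewed_keys]
    right_rest = [(k, v) for k, v in right if k not in skewed_keys]

    # Large side: one random prefix per row, so a hot key becomes num_salts keys.
    left_salted = [((rng.randint(1, num_salts), k), v) for k, v in left_hot]
    # Other side: copy each hot row once per prefix so every prefixed key matches.
    right_salted = [((s, k), v) for k, v in right_hot
                    for s in range(1, num_salts + 1)]

    hot_joined = hash_join(left_salted, right_salted)
    # Drop the prefix to restore the original key, then add the ordinary join.
    restored = [(salted_key[1], lv, rv) for salted_key, lv, rv in hot_joined]
    return restored + hash_join(left_rest, right_rest)

skewed = {9500048, 9500096}
left = ([(9500048, i) for i in range(400)]
        + [(9500096, i) for i in range(400)]
        + [(k, 0) for k in range(100)])
right = [(9500048, "x"), (9500096, "y")] + [(k, "r") for k in range(100)]

salted = salted_join(left, right, skewed, 24, random.Random(0))
plain = hash_join(left, right)
print(len(salted), sorted(salted) == sorted(plain))
```

The prefixed join returns the same rows as the plain join; the difference is that each hot key's rows are divided among up to 24 join keys instead of one.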

6. Random Prefix on Large Table and Expand Small Table N‑fold

When many keys are skewed, extracting each one individually is impractical. Instead, add a random prefix drawn from a set of N values to every record of the large, skewed table, and replicate every record of the small table N times, once per prefix (the Cartesian product of the small table with the prefix set). Each skewed key is thereby spread across up to N tasks. The approach works well when the other table is small and evenly distributed, because that table grows N-fold.
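The same idea applied to the whole table can be sketched as follows. This is a plain-Python illustration with made-up data (ten hot keys of 1,000 rows each); the helper names salt_large and expand_small and the choice of N = 8 are assumptions. The largest group of rows per join key stands in for the workload of the busiest task.

```python
import random
from collections import Counter, defaultdict

def hash_join(left, right):
    """Plain inner join of two lists of (key, value) rows."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]

def salt_large(large, num_salts, rng):
    """Attach a random prefix in [0, num_salts) to every row of the large table."""
    return [((rng.randrange(num_salts), k), v) for k, v in large]

def expand_small(small, num_salts):
    """Copy every small-table row once per prefix (N-fold growth)."""
    return [((s, k), v) for k, v in small for s in range(num_salts)]

N = 8
rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical data: ten hot keys with 1,000 rows each, 100 keys with one row.
large = ([(k, i) for k in range(10) for i in range(1000)]
         + [(k, 0) for k in range(10, 110)])
small = [(k, f"dim{k}") for k in range(110)]

salted_large = salt_large(large, N, rng)
expanded_small = expand_small(small, N)

before = max(Counter(k for k, _ in large).values())
after = max(Counter(sk for sk, _ in salted_large).values())

joined = [(sk[1], lv, rv) for sk, lv, rv in hash_join(salted_large, expanded_small)]
plain = hash_join(large, small)

print("largest key group before:", before, "after:", after)
print("same result:", sorted(joined) == sorted(plain))
```

The largest group drops from 1,000 rows to roughly 1,000 / N, at the price of an N-fold larger small table, so N is a trade-off between balance and the extra data to be shuffled.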

Summary

There is no single universal solution for data skew. Effective mitigation usually requires a combination of techniques based on the characteristics of the data set—such as the number of skewed keys, the size of each table, and the overall job topology. Adjusting shuffle parallelism, using custom partitioners, applying broadcast joins, and adding random prefixes are practical tools that can significantly reduce or eliminate skew and improve Spark job performance.
