How to Tackle Spark Data Skew: Practical Solutions and Real‑World Examples
This article explains what Spark data skew is, why it hurts performance, and presents six practical mitigation techniques—including adjusting parallelism, custom partitioners, map‑side joins, and adding random prefixes—backed by detailed experiments, code snippets, and performance comparisons.
Author: Guo Jun, big‑data architect familiar with Kafka, Flume, Hadoop, Spark, data‑warehouse modeling and SQL tuning. Blog: http://www.jasongj.com/.
Why Handle Data Skew
In large-scale data systems such as Spark or Hadoop, sheer data volume is rarely the hardest problem; data skew is often the more serious bottleneck. Data skew occurs when one partition (e.g., a Spark or Kafka partition) contains far more records than the others. Because a stage finishes only when its slowest task finishes, the task processing the oversized partition dominates the overall job duration while the remaining tasks sit idle.
How to Mitigate / Eliminate Data Skew
1. Avoid Skew at the Data Source
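When reading Kafka via DirectStream, each Kafka partition maps to one Spark partition, and therefore to one task. If the Kafka producer uses a random partitioner, records are spread evenly across partitions, which prevents skew at the source. However, business requirements may call for all records with the same key to land in the same partition, for example to preserve per-user ordering when computing user-level PV (page views). Keyed partitioning of this kind can re-introduce skew when a few keys dominate the traffic.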
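The difference between random and keyed partitioning can be illustrated with a small pure-Python simulation. This is only a sketch: it does not use a Kafka client, the traffic mix and the partition count of 12 are invented for illustration, and CRC32 stands in for whatever hash the real producer applies.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 12  # hypothetical topic partition count


def keyed_partition(key: str, num_partitions: int) -> int:
    """Deterministic key-hash assignment (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def simulate(records, num_partitions, by_key, seed=0):
    """Return a Counter mapping partition id -> number of records."""
    rng = random.Random(seed)
    load = Counter()
    for user_id in records:
        if by_key:
            partition = keyed_partition(user_id, num_partitions)
        else:
            partition = rng.randrange(num_partitions)
        load[partition] += 1
    return load


# Hypothetical traffic: one very active user plus 6,000 ordinary users.
records = ["user-hot"] * 6000 + [f"user-{i}" for i in range(6000)]

random_load = simulate(records, NUM_PARTITIONS, by_key=False)
keyed_load = simulate(records, NUM_PARTITIONS, by_key=True)

print("random partitioner, busiest partition:", max(random_load.values()))
print("keyed partitioner,  busiest partition:", max(keyed_load.values()))
```

With keyed assignment, every record of the hot user lands in a single partition, so that partition holds at least 6,000 records no matter how many partitions exist. Random assignment keeps all partitions close to the average.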
2. Adjust Parallelism to Disperse Keys
Shuffle in Spark uses HashPartitioner by default. If the parallelism is too low, many distinct keys may be assigned to the same task, creating skew. Increasing the shuffle parallelism spreads those keys across more tasks, reducing the data volume per task.
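A toy model makes the effect of parallelism on key collisions concrete. The sketch below assumes non-negative integer keys whose hash equals their value (as for an Int on the JVM), so the target partition is simply key modulo the parallelism. The key layout is hypothetical and chosen so that many distinct keys collide when the parallelism is 12.

```python
from collections import Counter


def hash_partition(key: int, num_partitions: int) -> int:
    """Mimic a hash partitioner for non-negative integer keys (hash == value)."""
    return key % num_partitions


def records_per_task(keys, num_partitions):
    """Return the number of keys routed to each partition, indexed by partition id."""
    load = Counter(hash_partition(k, num_partitions) for k in keys)
    return [load.get(p, 0) for p in range(num_partitions)]


# Hypothetical data: 1,200 evenly spread keys plus 960 distinct keys
# that all happen to equal 8 modulo 12.
uniform_keys = list(range(1200))
colliding_keys = [8 + 12 * i for i in range(960)]
keys = uniform_keys + colliding_keys

for parallelism in (12, 48, 11):
    loads = records_per_task(keys, parallelism)
    print(f"parallelism={parallelism:2d} busiest={max(loads)} lightest={min(loads)}")
```

In this layout the busiest task receives 1,060 keys at parallelism 12, 265 at parallelism 48, and 197 at parallelism 11. A smaller parallelism can therefore balance better than a larger one when it breaks the modular pattern behind the collisions.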
Example experiment:
Using a test table student_external with 1.05 billion rows, we selected 150 million rows (id 9 × 10⁸–10.5 × 10⁸) and artificially skewed a subset so that one task processed 45 million rows while others processed only 5 million. With groupByKey(12) the slowest task took 38 s (9× longer than others). Raising the shuffle parallelism to 48 reduced the maximum task record count to ~11.25 million and the duration to 24 s. Decreasing parallelism to 11 also improved balance, showing that the optimal parallelism depends on the data distribution.
spark-submit --queue ambari --num-executors 4 --executor-cores 12 --executor-memory 12g --class com.jasongj.spark.driver.SparkDataSkew --master yarn --deploy-mode client SparkExample-with-dependencies-1.0.jar
3. Use a Custom Partitioner
Replacing the default HashPartitioner with a custom implementation can explicitly control key distribution. In the same test scenario, setting the parallelism to 12 and applying a custom partitioner resulted in the longest task processing ~10 million rows in 15 s, with all tasks handling comparable data sizes.
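The idea can be sketched in pure Python as a partition function that treats a known range of colliding keys differently from everything else. This is not the partitioner used in the experiment above: the key range, the constant `HOT_BASE`, and the routing rule are all hypothetical, and the approach assumes the skewed key pattern is known in advance.

```python
from collections import Counter

NUM_PARTITIONS = 12
HOT_BASE = 1_200_000  # hypothetical start of the colliding key range (divisible by 12)


def default_partition(key: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Hash-style partitioning for integer keys (hash == value)."""
    return key % num_partitions


def custom_partition(key: int, num_partitions: int = NUM_PARTITIONS) -> int:
    """Spread the known colliding keys by their rank; hash everything else."""
    offset = key - HOT_BASE
    if offset >= 0 and offset % 12 == 8:
        return (offset // 12) % num_partitions
    return key % num_partitions


def loads(keys, partition_func):
    """Return keys per partition for the given partition function."""
    counter = Counter(partition_func(k) for k in keys)
    return [counter.get(p, 0) for p in range(NUM_PARTITIONS)]


# Hypothetical data: 1,200 uniform keys plus 960 keys equal to 8 modulo 12.
keys = list(range(1200)) + [HOT_BASE + 8 + 12 * i for i in range(960)]

print("default:", loads(keys, default_partition))
print("custom: ", loads(keys, custom_partition))
```

In PySpark, a function of this shape can be passed as the `partitionFunc` argument of `RDD.partitionBy`. In Scala, the equivalent is a subclass of `org.apache.spark.Partitioner` that implements `numPartitions` and `getPartition`.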
4. Convert Reduce‑Side Join to Map‑Side Join (Broadcast Join)
Broadcasting a small table eliminates the shuffle required for a reduce‑side join. In the example, joining a large table (150 million rows) with a small table (500 k rows) using a regular join produced a three‑stage DAG with severe skew in the join stage (slowest task 7.1 min). Setting spark.sql.autoBroadcastJoinThreshold sufficiently high and using a broadcast join reduced the DAG to a single stage, with total execution time dropping from 7.3 min to 1.5 min and no observable skew.
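The mechanics of a map-side join can be shown without Spark. In the sketch below, the small table is collected into an in-memory dictionary and each partition of the large table is joined locally, so no record of the large table has to move. The sample rows are invented, and the sketch assumes join keys are unique in the small table.

```python
def build_lookup(small_table):
    """Collect the small table into a dict; assumes its join keys are unique."""
    return dict(small_table)


def map_side_join(large_partition, lookup):
    """Inner-join one partition of the large table against the in-memory lookup."""
    for key, large_value in large_partition:
        if key in lookup:
            yield key, large_value, lookup[key]


# Hypothetical large table, already split into partitions as it was read.
# Key 1 is over-represented, but the rows never need to be regrouped by key.
large_partitions = [
    [(1, "a"), (1, "b"), (2, "c")],
    [(1, "d"), (3, "e")],
    [(4, "f")],
]
small_table = [(1, "x"), (2, "y"), (3, "z")]

# In Spark, this copy would be broadcast once to every executor.
lookup = build_lookup(small_table)

joined = [list(map_side_join(partition, lookup)) for partition in large_partitions]
rows_per_partition = [len(rows) for rows in joined]
print(rows_per_partition)
```

Because each partition is processed where it already lives, the task sizes follow the input split sizes rather than the key distribution. In Spark SQL, the same effect can be requested explicitly with the `broadcast` function from `pyspark.sql.functions` instead of relying on the threshold alone.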
5. Add Random Prefix/Suffix to Skewed Keys
Adding a random prefix (or suffix) to a heavily skewed key turns identical keys into distinct ones, allowing them to be processed by different tasks. For a join to still match, the rows with the same keys on the other side must be replicated once per possible prefix. After the join, the prefixes are removed to restore the original keys. In the test, two skewed keys (9500048 and 9500096) were each expanded to 4 million rows, while other keys had only 100 rows each. Adding a random prefix of 1–24 and joining with parallelism 48 eliminated the skew, reducing the join stage from 1.7 min to 33 s.
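The steps can be sketched in pure Python on a scaled-down data set. The two skewed keys and the prefix range 1 to 24 come from the text; the row counts, the helper names, and the in-memory `hash_join` are assumptions made for illustration, not the code used in the experiment.

```python
import random
from collections import defaultdict

SALT_RANGE = 24  # prefixes 1..24, as in the text


def hash_join(left, right):
    """Plain inner join of two lists of (key, value) pairs."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]


def salted_join(large, small, skewed_keys, salt_range=SALT_RANGE, seed=0):
    rng = random.Random(seed)
    # 1. Split both sides into skewed and non-skewed subsets.
    large_hot = [(k, v) for k, v in large if k in skewed_keys]
    large_rest = [(k, v) for k, v in large if k not in skewed_keys]
    small_hot = [(k, v) for k, v in small if k in skewed_keys]
    small_rest = [(k, v) for k, v in small if k not in skewed_keys]
    # 2. Large side: attach one random prefix to each skewed record.
    large_salted = [((rng.randint(1, salt_range), k), v) for k, v in large_hot]
    # 3. Small side: replicate each matching row once per possible prefix.
    small_salted = [((s, k), v) for k, v in small_hot
                    for s in range(1, salt_range + 1)]
    # 4. Join on the salted key, then strip the prefix again.
    hot = [(salted_key[1], lv, rv)
           for salted_key, lv, rv in hash_join(large_salted, small_salted)]
    # 5. Join the non-skewed remainder normally and union both results.
    return hot + hash_join(large_rest, small_rest)


skewed = {9500048, 9500096}
large = ([(9500048, i) for i in range(400)]
         + [(9500096, i) for i in range(400)]
         + [(k, 0) for k in range(1, 11)])
small = [(k, f"s-{k}") for k in [9500048, 9500096] + list(range(1, 11))]

result = salted_join(large, small, skewed)
print(len(result))
```

The salted join returns the same rows as a plain join, but the records of each skewed key are now spread over up to 24 distinct join keys, so they no longer have to be handled by a single task.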
6. Random Prefix on Large Table and Expand Small Table N‑fold
When many keys are skewed, extracting each individually is impractical. Instead, add a random prefix to every record of the large, skewed table and replicate the small table N times (Cartesian product with the prefix set). This spreads the skewed keys across N tasks. The approach works well when the small table is uniformly distributed.
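A scaled-down sketch of this variant follows. Every record of the large table receives a random prefix, and every row of the small table is copied once per prefix. The replication factor `N = 10`, the data layout, and the helper names are hypothetical.

```python
import random
from collections import Counter, defaultdict

N = 10  # hypothetical number of prefixes and replication factor


def salt_large(large, n, seed=0):
    """Attach a random prefix in [0, n) to every record of the large table."""
    rng = random.Random(seed)
    return [((rng.randrange(n), k), v) for k, v in large]


def expand_small(small, n):
    """Replicate every row of the small table once per prefix."""
    return [((s, k), v) for k, v in small for s in range(n)]


def join(left, right):
    """Plain inner join of two lists of (key, value) pairs."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]


# Hypothetical data: 50 keys; every tenth key is hot (1,000 rows), the rest have 10.
large = [(k, i) for k in range(50) for i in range(1000 if k % 10 == 0 else 10)]
small = [(k, f"dim-{k}") for k in range(50)]

salted_large = salt_large(large, N)
expanded_small = expand_small(small, N)

# Join on (prefix, key), then drop the prefix to restore the original key.
result = [(k, lv, rv) for (_, k), lv, rv in join(salted_large, expanded_small)]

biggest_group_before = max(Counter(k for k, _ in large).values())
biggest_group_after = max(Counter(k for k, _ in salted_large).values())
print(biggest_group_before, biggest_group_after, len(expanded_small))
```

The largest group of records sharing one join key shrinks by roughly a factor of N, which is what lets the shuffle spread the work. The cost is that the small table grows N-fold, so N has to be chosen with the size of the small table in mind.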
Summary
There is no single universal solution for data skew. Effective mitigation usually requires a combination of techniques based on the characteristics of the data set—such as the number of skewed keys, the size of each table, and the overall job topology. Adjusting shuffle parallelism, using custom partitioners, applying broadcast joins, and adding random prefixes are practical tools that can significantly reduce or eliminate skew and improve Spark job performance.
