Understanding and Mitigating Data Skew in Spark and Hadoop
Data skew in Spark and Hadoop occurs when a few keys dominate shuffle traffic, causing slow tasks, OOM errors, and job failures; the article describes how to detect skew via UI metrics or sampling and offers mitigation tactics such as filtering keys, increasing shuffle partitions, custom partitioners, broadcast joins, salted keys, and Hadoop‑specific settings.
This article explains data skew in distributed big‑data systems such as Spark and Hadoop, describes its harms, symptoms, and causes, and presents a set of mitigation techniques.
What is data skew? In an ideal distributed environment, adding nodes reduces overall execution time roughly linearly. Data skew occurs when the workload is unevenly distributed, e.g., in a three‑node cluster one node processes 80% of the data while the other two handle only 10% each. Because a stage finishes only when its slowest task does, the overloaded task determines the stage duration and most of the parallelism is wasted.
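The effect of the 80/10/10 split can be made concrete with a little arithmetic. The sketch below is plain Python, not Spark; the 300 units of work and the one-second-per-unit cost are made-up numbers chosen only for illustration.

```python
def stage_duration(task_loads, seconds_per_unit=1):
    """A stage ends when its slowest task ends, so take the maximum load."""
    return max(task_loads) * seconds_per_unit

# Hypothetical workload: 300 units spread over three tasks.
balanced = [100, 100, 100]   # even split
skewed = [240, 30, 30]       # 80% / 10% / 10%

balanced_time = stage_duration(balanced)
skewed_time = stage_duration(skewed)
slowdown = skewed_time / balanced_time
print(balanced_time, skewed_time, slowdown)
```

Under these assumptions the skewed stage takes 2.4 times as long as the balanced one, even though the total amount of work is identical.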
Harms of data skew include excessive overall job runtime, memory‑overflow (OOM) failures, and even complete application crashes because overloaded tasks consume disproportionate resources.
Typical symptoms are:
Most tasks finish quickly, but a few run extremely slowly, causing the job to stall.
Sudden OOM errors in Spark jobs that previously ran fine.
Root cause : During shuffle operations (join, groupByKey, reduceByKey, etc.), keys with a very large number of associated records are sent to the same reducer, creating an imbalance.
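A small simulation can show why one hot key overloads a single reducer. This is a plain-Python sketch rather than Spark code: it assumes non-negative integer keys and uses `key % num_partitions` as a stand-in for a hash partitioner, and the data set is invented for illustration.

```python
def partition_sizes(keys, num_partitions):
    """Count records per partition when each key goes to key % num_partitions."""
    sizes = [0] * num_partitions
    for key in keys:
        sizes[key % num_partitions] += 1
    return sizes

# Hypothetical data: key 7 appears 9,000 times, keys 0..999 appear once each.
keys = [7] * 9000 + list(range(1000))
sizes = partition_sizes(keys, num_partitions=4)
print(sizes)
```

Every record with key 7 lands in partition 3, so that partition holds 9,250 records while the other three hold 250 each.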
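Detection : Use the Spark Web UI to inspect Shuffle Read Size/Records per task, or sample key frequencies:
// Sample 10% of the rows and count how often each key occurs.
// Dataset has no reduceByKey, so drop to the underlying RDD first.
df.select("key").sample(false, 0.1).rdd
  .map(row => (row.get(0), 1)).reduceByKey(_ + _)
  .map(kv => (kv._2, kv._1)).sortByKey(false) // sort by count, descending
  .take(10)
If a few keys dominate the distribution, skew is present.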
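To judge whether a key "dominates", it can help to turn the sampled counts into shares of the sample. The following is a minimal plain-Python sketch of that idea, not a Spark API; `top_keys`, the fixed seed, and the sample data are all hypothetical.

```python
import random
from collections import Counter

def top_keys(keys, fraction=0.1, n=10, seed=42):
    """Sample roughly `fraction` of the keys and return (key, count, share) rows."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = [k for k in keys if rng.random() < fraction]
    counts = Counter(sample).most_common(n)
    return [(k, c, c / len(sample)) for k, c in counts]

# Hypothetical data: one hot key plus 1,000 keys that occur once each.
keys = ["hot"] * 9000 + [f"k{i}" for i in range(1000)]
top = top_keys(keys)
print(top[0])
```

A top key that accounts for most of the sample, as "hot" does here, is a strong hint that the corresponding shuffle will be skewed.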
Mitigation strategies (each described with pros and cons):
Filter abnormal keys : Remove or clean keys that cause extreme imbalance.
Increase shuffle parallelism : Adjust spark.sql.shuffle.partitions or RDD parallelism to spread keys across more tasks.
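More partitions help when several large keys happen to collide in one partition, but not when a single key is hot, because all records of one key still go to one task. The sketch below illustrates both cases in plain Python, again assuming non-negative integer keys and `key % num_partitions` as a simplified partitioner; the key counts are invented.

```python
def max_partition_load(key_counts, num_partitions):
    """Largest number of records any single partition receives."""
    loads = [0] * num_partitions
    for key, count in key_counts.items():
        loads[key % num_partitions] += count
    return max(loads)

# Case 1: four moderately large keys that collide when there are 4 partitions.
colliding = {0: 1000, 4: 1000, 8: 1000, 12: 1000}
# Case 2: one hot key next to a few small ones.
hot = {0: 4000, 1: 10, 2: 10, 3: 10}

colliding_4 = max_partition_load(colliding, 4)
colliding_16 = max_partition_load(colliding, 16)
hot_4 = max_partition_load(hot, 4)
hot_16 = max_partition_load(hot, 16)
print(colliding_4, colliding_16, hot_4, hot_16)
```

Going from 4 to 16 partitions cuts the worst load in the first case from 4,000 to 1,000 records, while in the second case it stays at 4,000.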
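Custom Partitioner :
// Requires: import org.apache.spark.Partitioner;
.groupByKey(new Partitioner() {
    @Override
    public int numPartitions() { return 12; }
    @Override
    public int getPartition(Object key) {
        int id = Integer.parseInt(key.toString());
        // Under plain modulo, the ids 9500000, 9500012, ..., 9500084 would all
        // share one partition; spread them over partitions 0..7 instead.
        if (id >= 9500000 && id <= 9500084 && ((id - 9500000) % 12) == 0) {
            return (id - 9500000) / 12;
        } else {
            // floorMod keeps the result in 0..11 even for negative ids.
            return Math.floorMod(id, 12);
        }
    }
})
Broadcast (Map‑side) Join to eliminate shuffle: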
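from pyspark.sql.functions import broadcast
# Assumes B is the small table: broadcasting it lets the large table A be joined without a shuffle.
# In a left outer join Spark can only broadcast the right-hand side.
result = A.join(broadcast(B), ["join_col"], "left")
Split‑join‑union : Split out the rows with skewed keys, join them separately after adding a random prefix to the key on the large side and replicating the matching rows of the other side once per prefix, then union the result with the join of the remaining, non‑skewed rows.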
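The split, salted join, and union steps can be sketched on in-memory lists. This is a plain-Python illustration under simplifying assumptions (inner join, both sides as lists of `(key, value)` pairs, a known set of skewed keys); `hash_join` and `split_join_union` are hypothetical helper names, not Spark functions.

```python
import random
from collections import defaultdict

def hash_join(left, right):
    """Inner join of two lists of (key, value) pairs."""
    index = defaultdict(list)
    for k, v in right:
        index[k].append(v)
    return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]

def split_join_union(left, right, skewed_keys, salts=4, seed=0):
    rng = random.Random(seed)
    # 1. Split both sides into skewed and normal rows.
    left_skew = [(k, v) for k, v in left if k in skewed_keys]
    left_norm = [(k, v) for k, v in left if k not in skewed_keys]
    right_skew = [(k, v) for k, v in right if k in skewed_keys]
    right_norm = [(k, v) for k, v in right if k not in skewed_keys]
    # 2. Salt the large side randomly; replicate the other side once per salt.
    salted_left = [((rng.randrange(salts), k), v) for k, v in left_skew]
    salted_right = [((s, k), v) for k, v in right_skew for s in range(salts)]
    # 3. Join on (salt, key), then drop the salt again.
    skew_joined = [(sk[1], lv, rv)
                   for sk, lv, rv in hash_join(salted_left, salted_right)]
    # 4. Union with the ordinary join of the normal rows.
    return skew_joined + hash_join(left_norm, right_norm)

left = [("hot", i) for i in range(8)] + [("a", 100), ("b", 200)]
right = [("hot", "H"), ("a", "A"), ("c", "C")]
result = split_join_union(left, right, skewed_keys={"hot"})
print(len(result))
```

The result contains the same rows as a direct `hash_join(left, right)`, but the hot key is now spread over `salts` distinct join keys, at the cost of replicating the matching rows on the other side.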
Salted keys with local aggregation : Add a random prefix to keys before shuffle, perform a local reduce, then remove the prefix and do a global reduce.
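import scala.util.Random

def antiSkew(): RDD[(String, Int)] = {
  val SPLIT = "-"
  pairs
    // Draw a fresh prefix per record; a single shared prefix would not spread the keys.
    .map(t => (s"${Random.nextInt(10)}$SPLIT${t._1}", 1))
    .reduceByKey(_ + _) // first-stage aggregation on the salted keys
    // Strip the prefix; limit 2 keeps keys that themselves contain "-" intact.
    .map(t => (t._1.split(SPLIT, 2)(1), t._2))
    .reduceByKey(_ + _) // global aggregation on the original keys
}
Hadoop MapReduce also suffers from skew, typically manifesting as reducers stuck at 99.99% progress, OOM‑killed containers, and killed tasks. Common fixes in Hive on MapReduce include map‑side joins, rewriting count(distinct) as a group by followed by a count, enabling hive.map.aggr=true for map‑side partial aggregation, and setting hive.groupby.skewindata=true for automatic load balancing.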
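The count(distinct) rewrite works because the first grouping shuffles on the (group, value) pair rather than on the group alone, so a hot group is spread over many reducers before the final count. Here is a minimal plain-Python sketch of the two stages; the function name and the sample rows are hypothetical, and a real job would run each stage as its own group by.

```python
from collections import Counter

def count_distinct_two_stage(rows):
    """rows: iterable of (group, value) pairs; returns distinct values per group."""
    # Stage 1: group by (group, value), which removes duplicate pairs.
    deduped = set(rows)
    # Stage 2: group by group and count the surviving rows.
    return dict(Counter(group for group, _ in deduped))

rows = [("cn", "u1"), ("cn", "u1"), ("cn", "u2"), ("us", "u3"), ("us", "u3")]
distinct_counts = count_distinct_two_stage(rows)
print(distinct_counts)
```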
Overall, the choice of mitigation depends on the data distribution, job size, and resource constraints. Simple parameter tuning is often tried first; if insufficient, more involved approaches like custom partitioners or salted joins are applied.
vivo Internet Technology
