May 13, 2016 · Big Data

Spark Performance Optimization Guide: Data Skew and Shuffle Tuning

This advanced Spark performance guide explains how data skew arises during shuffles and presents eight practical solutions: Hive preprocessing, key filtering, increased shuffle parallelism, two-stage aggregation, map joins, sampling, random prefixes, and combined strategies. It also details key shuffle-tuning parameters such as spark.shuffle.file.buffer, spark.reducer.maxSizeInFlight, and spark.shuffle.manager to improve memory usage and execution speed.

Big Data · Data Skew · Performance Optimization

33 min read
