Architecture Digest
May 25, 2016 · Big Data
Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning
This article is a guide to resolving Spark performance bottlenecks caused by data skew. It covers how to diagnose skew, locate the offending stages and operators, and apply practical fixes: Hive pre-processing, filtering out skewed keys, raising shuffle parallelism, two-stage aggregation, map-side join, and combinations of these. It closes with an in-depth look at how the shuffle manager has evolved and which configuration parameters matter most for fine-tuning.
Data Skew · Performance Tuning · Shuffle Optimization
35 min read