Big Data Technology & Architecture
Dec 31, 2024 · Big Data

Eliminating Shuffle in Spark Joins with Storage Partitioned Join (SPJ) for Iceberg Tables

This article explains how Storage Partitioned Join (SPJ), available in Spark 3.3 and later, avoids costly shuffle operations when joining partitioned V2 data source tables such as Apache Iceberg. It covers the required conditions, configuration settings, practical code examples, and join scenarios including mismatched partitions and data skew.

Bucketing · Data Skew · SQL
15 min read