Tagged articles

Shuffle Optimization

12 articles · Page 1 of 1

May 5, 2026 · Big Data

A New Data Lake Paradigm: Volcano Engine’s Multi‑Modal Data Lake Built on Lance

The article presents Volcano Engine’s AI‑focused data lake built on the Lance format, detailing why traditional lakes fall short for multimodal data, the engineering enhancements such as Binary Copy Compaction, Lance Insight, distributed vector indexing, JSON‑based tagging, Row‑ID shuffle optimization, and real‑world case studies that demonstrate significant performance and cost gains.

AIBinary Copy CompactionData Lake

0 likes · 18 min read

A New Data Lake Paradigm: Volcano Engine’s Multi‑Modal Data Lake Built on Lance

Alibaba Cloud Big Data AI Platform

Aug 19, 2025 · Big Data

Cut Shuffle Costs by 60% with MaxCompute’s Cluster Optimization Tool

MaxCompute’s new Cluster Optimization Recommendation analyzes 31 days of shuffle data to automatically suggest optimal hash clustering keys, dramatically cutting shuffle traffic and CU consumption for large jobs, while providing one‑click ALTER TABLE scripts and detailed benefit reports to boost big‑data processing efficiency.

Big DataHash ClusteringMaxCompute

0 likes · 8 min read

Cut Shuffle Costs by 60% with MaxCompute’s Cluster Optimization Tool

Big Data Technology & Architecture

Dec 31, 2024 · Big Data

Eliminating Shuffle in Spark Joins with Storage Partitioned Join (SPJ) for Iceberg Tables

This article explains how Spark ≥ 3.3 introduces Storage Partitioned Join (SPJ) to avoid costly shuffle operations when joining partitioned V2 source tables such as Apache Iceberg, detailing the required conditions, configuration settings, practical code examples, and various join scenarios including mismatched partitions and data skew.

BucketingData SkewSQL

0 likes · 15 min read

Eliminating Shuffle in Spark Joins with Storage Partitioned Join (SPJ) for Iceberg Tables

Past Memory Big Data

Dec 26, 2024 · Big Data

Eliminate Shuffle: Deep Dive into Spark’s Storage Partition Join (SPJ)

This article explains how Spark ≥ 3.3’s Storage Partition Join (SPJ) can avoid costly shuffle operations by using Iceberg tables, outlines the required table properties and Spark configurations, demonstrates the effect with code examples and execution plans, and explores several realistic join scenarios.

Apache IcebergBig DataSPJ

0 likes · 16 min read

Eliminate Shuffle: Deep Dive into Spark’s Storage Partition Join (SPJ)

Data Thinking Notes

Oct 27, 2022 · Big Data

Boost Spark Performance: Proven Code Optimizations & Tuning Tips

This article outlines practical Spark job optimization techniques—from code-level improvements and resource tuning to data skew handling, persistence strategies, shuffle reduction, broadcast variables, Kryo serialization, and efficient data structures—demonstrating how each can dramatically cut execution time.

Big DataKryo SerializationPerformance Tuning

0 likes · 19 min read

Boost Spark Performance: Proven Code Optimizations & Tuning Tips

DataFunTalk

Apr 28, 2021 · Big Data

Accelerating Apache Spark 3.0 with NVIDIA RAPIDS: Architecture, Performance Gains, and New Features

This article explains how NVIDIA's RAPIDS Accelerator leverages GPUs to speed up Apache Spark 3.0 workloads, detailing the underlying architecture, benchmark results on TPC‑DS and recommendation models, required configuration changes, supported operators, shuffle optimizations, and the enhancements introduced in versions 0.2 and 0.3.

Apache SparkBig DataGPU Acceleration

0 likes · 19 min read

Accelerating Apache Spark 3.0 with NVIDIA RAPIDS: Architecture, Performance Gains, and New Features

JD Tech

Feb 8, 2021 · Big Data

JD Remote Shuffle Service: Design, Implementation, and Performance Evaluation

This article presents JD's self‑developed Remote Shuffle Service for Spark, detailing its architecture, goals, implementation details, performance benchmarks, and real‑world production case studies that demonstrate its impact on shuffle efficiency and system stability in large‑scale data processing.

EtcdRemote Shuffle ServiceShuffle Optimization

0 likes · 17 min read

JD Remote Shuffle Service: Design, Implementation, and Performance Evaluation

JD Retail Technology

Jan 19, 2021 · Big Data

Design, Implementation, and Performance Evaluation of JD's Remote Shuffle Service for Spark

This article describes JD's research and production deployment of a self‑developed Remote Shuffle Service for Spark, covering its motivations, architectural design, cloud‑native features, monitoring, performance benchmarks against external shuffle solutions, and a real‑world promotion‑period case study that demonstrates improved stability and resource efficiency.

Cloud NativeRemote Shuffle ServiceShuffle Optimization

0 likes · 17 min read

Design, Implementation, and Performance Evaluation of JD's Remote Shuffle Service for Spark

DataFunTalk

Nov 13, 2019 · Big Data

ByteDance’s Core Optimization Practices on Spark SQL

ByteDance’s data warehouse team shares comprehensive optimizations for Spark SQL, covering architecture overview, bucket join enhancements, materialized columns and views, and shuffle stability and performance improvements, illustrating practical techniques that boost query efficiency and job reliability in large‑scale big‑data environments.

Big DataMaterialized ColumnsShuffle Optimization

0 likes · 20 min read

ByteDance’s Core Optimization Practices on Spark SQL

dbaplus Community

Aug 21, 2018 · Big Data

Master Spark Performance: Practical Development and Resource Tuning Guide

This article explains why Spark needs careful performance tuning, then details concrete development‑level optimizations (RDD reuse, persistence, shuffle avoidance, broadcast variables, Kryo serialization, data‑structure choices) and resource‑level settings (executor count, memory, cores, parallelism, memory fractions) with code examples and practical recommendations.

Broadcast VariablesKryo SerializationPerformance Tuning

0 likes · 32 min read

Master Spark Performance: Practical Development and Resource Tuning Guide

Liulishuo Tech Team

Jun 12, 2018 · Big Data

Highlights from Spark+AI Summit 2018: Hydrogen, MLflow, Delta, Spark 2.3, and Shuffle Optimization

The 2018 Spark+AI Summit in San Francisco showcased Spark's evolution toward unified AI and big‑data processing, introducing the Hydrogen project with gang scheduling, the open‑source MLflow platform, the Delta unified analytics engine, Spark 2.3 enhancements, and Facebook's shuffle I/O optimizations.

AIDelta LakeHydrogen

0 likes · 8 min read

Highlights from Spark+AI Summit 2018: Hydrogen, MLflow, Delta, Spark 2.3, and Shuffle Optimization

Architecture Digest

May 25, 2016 · Big Data

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning

This article provides a comprehensive guide on tackling Spark performance bottlenecks by diagnosing data skew, locating the offending stages and operators, and applying a range of practical solutions—including Hive pre‑processing, key filtering, shuffle parallelism, two‑stage aggregation, map‑join, and combined strategies—followed by an in‑depth discussion of shuffle manager evolution and key configuration parameters for fine‑tuning.

Big DataData SkewPerformance Tuning

0 likes · 35 min read

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning