Tagged articles
11 articles
DataFunSummit
May 5, 2026 · Big Data

A New Data Lake Paradigm: Volcano Engine’s Multi‑Modal Data Lake Built on Lance

The article presents Volcano Engine’s AI‑focused data lake built on the Lance format. It explains why traditional lakes fall short for multimodal data, describes engineering enhancements (Binary Copy Compaction, Lance Insight, distributed vector indexing, JSON‑based tagging, and Row‑ID shuffle optimization), and closes with real‑world case studies that demonstrate significant performance and cost gains.

AI · Binary Copy Compaction · Data Lake
18 min read
Alibaba Cloud Big Data AI Platform
Aug 19, 2025 · Big Data

Cut Shuffle Costs by 60% with MaxCompute’s Cluster Optimization Tool

MaxCompute’s new Cluster Optimization Recommendation analyzes 31 days of shuffle data to automatically suggest optimal hash clustering keys, dramatically cutting shuffle traffic and CU consumption for large jobs, while providing one‑click ALTER TABLE scripts and detailed benefit reports to boost big‑data processing efficiency.

Big Data · Cost reduction · Hash Clustering
8 min read
Big Data Technology & Architecture
Dec 31, 2024 · Big Data

Eliminating Shuffle in Spark Joins with Storage Partitioned Join (SPJ) for Iceberg Tables

This article explains how Storage Partitioned Join (SPJ), available in Spark 3.3 and later, avoids costly shuffle operations when joining partitioned V2 source tables such as Apache Iceberg. It covers the required conditions, configuration settings, practical code examples, and join scenarios including mismatched partitions and data skew.

Bucketing · Data Skew · SQL
15 min read
Data Thinking Notes
Oct 27, 2022 · Big Data

Boost Spark Performance: Proven Code Optimizations & Tuning Tips

This article outlines practical Spark job optimization techniques—from code-level improvements and resource tuning to data skew handling, persistence strategies, shuffle reduction, broadcast variables, Kryo serialization, and efficient data structures—demonstrating how each can dramatically cut execution time.

Big Data · Kryo Serialization · RDD Persistence
19 min read
DataFunTalk
Apr 28, 2021 · Big Data

Accelerating Apache Spark 3.0 with NVIDIA RAPIDS: Architecture, Performance Gains, and New Features

This article explains how NVIDIA's RAPIDS Accelerator leverages GPUs to speed up Apache Spark 3.0 workloads, detailing the underlying architecture, benchmark results on TPC‑DS and recommendation models, required configuration changes, supported operators, shuffle optimizations, and the enhancements introduced in versions 0.2 and 0.3.

Apache Spark · Big Data · GPU Acceleration
19 min read
JD Tech
Feb 8, 2021 · Big Data

JD Remote Shuffle Service: Design, Implementation, and Performance Evaluation

This article presents JD's self‑developed Remote Shuffle Service for Spark, detailing its architecture, goals, implementation details, performance benchmarks, and real‑world production case studies that demonstrate its impact on shuffle efficiency and system stability in large‑scale data processing.

Distributed Systems · Remote Shuffle Service · Shuffle Optimization
17 min read
JD Retail Technology
Jan 19, 2021 · Big Data

Design, Implementation, and Performance Evaluation of JD's Remote Shuffle Service for Spark

This article describes JD's research and production deployment of a self‑developed Remote Shuffle Service for Spark, covering its motivations, architectural design, cloud‑native features, monitoring, performance benchmarks against external shuffle solutions, and a real‑world promotion‑period case study that demonstrates improved stability and resource efficiency.

Cloud Native · Remote Shuffle Service · Shuffle Optimization
17 min read
DataFunTalk
Nov 13, 2019 · Big Data

ByteDance’s Core Optimization Practices on Spark SQL

ByteDance’s data warehouse team shares comprehensive optimizations for Spark SQL, covering architecture overview, bucket join enhancements, materialized columns and views, and shuffle stability and performance improvements, illustrating practical techniques that boost query efficiency and job reliability in large‑scale big‑data environments.

Big Data · Materialized Columns · Shuffle Optimization
20 min read
dbaplus Community
Aug 21, 2018 · Big Data

Master Spark Performance: Practical Development and Resource Tuning Guide

This article explains why Spark needs careful performance tuning, then details concrete development‑level optimizations (RDD reuse, persistence, shuffle avoidance, broadcast variables, Kryo serialization, data‑structure choices) and resource‑level settings (executor count, memory, cores, parallelism, memory fractions) with code examples and practical recommendations.

Broadcast Variables · Kryo Serialization · RDD
32 min read
Architecture Digest
May 25, 2016 · Big Data

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning

This article is a comprehensive guide to tackling Spark performance bottlenecks. It shows how to diagnose data skew and locate the offending stages and operators, then applies a range of practical solutions, including Hive pre‑processing, key filtering, shuffle parallelism, two‑stage aggregation, map‑join, and combined strategies. It closes with an in‑depth discussion of shuffle manager evolution and the key configuration parameters for fine‑tuning.

Big Data · Data Skew · Shuffle Optimization
35 min read