Tagged articles

Remote Shuffle

3 articles

Feb 1, 2025 · Big Data

Spark Native and Cloud Native: Vectorized SQL Engines, Remote Shuffle, and EMR Serverless Spark Practices

This article explains the challenges of big‑data processing in the cloud era, introduces Spark’s native‑language SQL engine rewrites, discusses vectorization and code generation techniques, describes cloud‑native storage‑compute separation with Remote Shuffle services such as Apache Celeborn, and presents the production benefits of Alibaba Cloud’s EMR Serverless Spark.

Big Data · Codegen · EMR Serverless

12 min read

DataFunTalk

Dec 31, 2023 · Big Data

Apache Celeborn (Incubating): Addressing Traditional Shuffle Limitations in Big Data Processing

Apache Celeborn (Incubating) is a remote shuffle service that addresses the inefficiency, high storage demands, network overhead, and limited fault tolerance of traditional Spark shuffle implementations. It does so through push‑based shuffle, partition splitting, columnar shuffle, and multi‑layer storage, on an architecture designed to be elastic, stable, and scalable.

Apache Spark · Big Data · Performance Optimization

15 min read

ByteDance Cloud Native

Sep 2, 2022 · Big Data

How ByteDance’s Cloud Shuffle Service Boosts Big Data Job Stability and Performance

ByteDance’s Cloud Shuffle Service (CSS) replaces the traditional pull‑based sort shuffle in Spark, Flink batch, and MapReduce with a push‑based remote shuffle. The change improves stability, performance, and elasticity, supports compute‑storage separation, and delivers significant speedups in large‑scale TPC‑DS benchmarks.

Performance Optimization · Remote Shuffle · Shuffle Service

11 min read
