
Spark Native and Cloud Native: Vectorized SQL Engines, Remote Shuffle, and EMR Serverless Spark Practices

This article explains the challenges of big‑data processing in the cloud era, introduces Spark’s native‑language SQL engine rewrites, discusses vectorization and code generation techniques, describes cloud‑native storage‑compute separation with Remote Shuffle services such as Apache Celeborn, and presents the production benefits of Alibaba Cloud’s EMR Serverless Spark.


In the era of massive data growth and increasingly complex business scenarios, traditional big‑data frameworks face performance and resource‑management challenges, prompting a need for Spark optimizations that fit cloud environments and leverage Cloud Native principles.

The presentation is divided into four parts: background introduction, Spark Native (rewriting the SQL engine in native languages), Cloud Native (extreme elasticity with storage‑compute separation), and a practical case of EMR Serverless Spark.

Background: The Lakehouse architecture combines the openness of data lakes with the freshness of data warehouses. Spark serves as the core engine for lakehouse workloads, and Databricks’ Photon (a C++ vectorized engine) demonstrates 2‑5× performance gains over Apache Spark.

Spark Native:

Trend: moving from JVM‑based to native (C++, Rust) SQL engines.

Mainstream native solutions: Apache Gluten (with Velox or ClickHouse backends), Apache Comet (DataFusion), and Blaze (DataFusion), each integrating a vectorized execution library into Spark.

Key techniques: vectorization (processing columnar batches to reduce interpreter overhead) and code generation (compiling query‑specific code to eliminate interpretation).

Vectorization replaces the iterator model with columnar batches, improving cache locality and enabling SIMD acceleration. Codegen generates and compiles specialized code for each query, removing type checks and virtual calls.

Cloud Native:

Materialized shuffle is the foundation of BSP (Bulk Synchronous Parallel) engines, enabling cheap task re‑execution and adaptive execution.

Local shuffle suffers from disk dependency, stateful executors, and inefficient all‑to‑all data transfer.

Remote Shuffle services (e.g., Apache Celeborn) decouple shuffle data from compute nodes, aggregate partitions, and transform shuffle reads into one‑to‑one transfers, greatly improving stability, performance, and elasticity.

Combining Spark Native with Cloud Native, Celeborn now supports Gluten and Blaze. The workflow partitions, serializes, and compresses shuffle data in the vectorized engine, stores it via Celeborn, and reads it back for vectorized processing.

EMR Serverless Spark:

Alibaba Cloud’s EMR Serverless Spark is a Lakehouse product that integrates OSS, lake formats (Paimon, Delta, Iceberg, Hudi), metadata management (DLF), and a highly optimized Spark engine with vectorization and Celeborn. It offers executor‑level elasticity and full ecosystem compatibility.

Production results show up to 60% resource savings and significant cost‑performance improvements across industries such as internet, gaming, new energy, finance, and manufacturing.

References: Lakehouse papers, Databricks Photon, Snowflake architecture, Dremel, MonetDB/X100, Hyper, various benchmark studies, and links to Apache projects (UnityCatalog, Polaris, Gravitino, Celeborn) and Alibaba EMR Serverless Spark documentation.

Written by DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
