Integrating Apache Spark with Cloud‑Native Technologies: Principles, Kubernetes Deployments, EMR on ACK, and Serverless Spark on DLF

This article examines the challenges of traditional Spark clusters and explains how integrating Spark with cloud‑native platforms—through Kubernetes deployment modes, EMR on ACK practices, Remote Shuffle Service, and serverless Spark on DLF—provides elastic scaling, lower operational costs, and advanced features such as executor rolling and custom scheduler support.

DataFunSummit

Oct 30, 2022

Traditional Spark clusters deployed on Hadoop or EMR suffer from complex installation, limited elasticity, and tight coupling of storage and compute.

Combining Spark with cloud‑native concepts such as containerization and micro‑services brings elastic scaling, job‑centric deployments, and lower operational costs.

Spark supports four cluster managers (Standalone, YARN, Mesos, and Kubernetes). On Kubernetes, jobs can be submitted in two ways: the native spark‑submit with a K8s master URL, or the Spark‑on‑K8s operator, which treats jobs as custom resources and offers richer management features.
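The two submission approaches described above can be sketched side by side. This is a minimal illustration, not the talk's actual setup: the API server address, image name, namespace, service account, and jar path are placeholders, and the manifest assumes the `sparkoperator.k8s.io/v1beta2` API of the open-source Spark operator.

```python
import shlex


def spark_submit_argv(api_server, name, image, namespace, main_class, app_jar, executors=2):
    """Build a spark-submit command line that targets a Kubernetes API server."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",  # the k8s:// prefix selects the Kubernetes cluster manager
        "--deploy-mode", "cluster",         # the driver runs in a pod inside the cluster
        "--name", name,
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


def spark_application_manifest(name, namespace, image, main_class, app_jar, executors=2):
    """Describe the same job as a SparkApplication custom resource for the operator."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": main_class,
            "mainApplicationFile": app_jar,
            "sparkVersion": "3.3.0",  # assumed version, adjust to the image in use
            "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},  # placeholder account
            "executor": {"cores": 1, "instances": executors, "memory": "1g"},
        },
    }


# All values below are placeholders chosen for illustration only.
job = dict(
    name="spark-pi",
    image="registry.example.invalid/spark:3.3.0",
    namespace="spark-jobs",
    main_class="org.apache.spark.examples.SparkPi",
    app_jar="local:///opt/spark/examples/jars/spark-examples.jar",
)
print(shlex.join(spark_submit_argv("https://k8s.example.invalid:6443", **job)))
print(spark_application_manifest(**job)["kind"])
```

With spark‑submit the job exists only as a command line and the pods it creates. With the operator, the same job is a declarative object that can be applied, listed, and deleted with the usual Kubernetes tooling, which is where the richer management features come from.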

Native K8s support first shipped in Spark 2.3 and was declared generally available in Spark 3.1; the 3.x line adds dynamic allocation, custom scheduler integration, and other production features. Spark 3.3 introduces executor rolling and built‑in third‑party scheduler support.
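As a rough sketch of how these features are switched on, the helper below assembles the relevant Spark properties and renders them as `--conf` flags. The property names are taken from the Spark "Running on Kubernetes" and configuration documentation as best understood here and should be checked against the Spark version in use; the helper names, default values, and the `volcano` scheduler name are illustrative assumptions.

```python
def k8s_elasticity_conf(min_executors=1, max_executors=10, roll_interval="3600s", scheduler=None):
    """Collect Spark properties for dynamic allocation and executor rolling on K8s."""
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    conf = {
        # Dynamic allocation without an external shuffle service relies on shuffle tracking.
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Executor rolling (Spark 3.3+) is provided by a driver plugin.
        "spark.plugins": "org.apache.spark.scheduler.cluster.k8s.ExecutorRollPlugin",
        "spark.kubernetes.executor.rollInterval": roll_interval,
        "spark.kubernetes.executor.rollPolicy": "OUTLIER",
    }
    if scheduler is not None:
        # Hands pod scheduling to a third-party scheduler; the scheduler itself
        # must already be installed in the cluster and may need further settings.
        conf["spark.kubernetes.scheduler.name"] = scheduler
    return conf


def to_conf_args(conf):
    """Render a property dict as spark-submit --conf arguments in a stable order."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


for flag in to_conf_args(k8s_elasticity_conf(max_executors=20, scheduler="volcano")):
    print(flag)
```

Executor rolling periodically replaces executors chosen by the roll policy, which is intended to keep long-running jobs healthy; combined with dynamic allocation, the executor set can both resize and refresh over a job's lifetime.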

On Alibaba Cloud, EMR on ACK provides a semi‑managed Spark platform where a dedicated namespace, Spark operator, and History Server are installed; elastic container instances (ECI) and Remote Shuffle Service (RSS) further improve resource elasticity and shuffle performance.

Serverless Spark on DLF hides the underlying cluster completely, exposing a stateless DLF‑SQL service backed by Apache Livy and Spark sessions. Optimizations include pre‑warming sessions, parallel statement execution, enhanced error reporting, and support for the Delta, Hudi, and Iceberg formats.
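To make the Livy-backed session model more concrete, here is a small sketch of the generic Apache Livy REST flow: create a session, submit a SQL statement, and poll until it finishes. It illustrates open-source Livy only, not the DLF‑SQL service's actual interface; the base URL is a placeholder and the helper names are hypothetical.

```python
import json
import time
import urllib.request


def create_session_request(base_url, kind="sql"):
    """POST /sessions: ask Livy to start an interactive session of the given kind."""
    body = json.dumps({"kind": kind}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/sessions", data=body,
        headers={"Content-Type": "application/json"}, method="POST",
    )


def statement_request(base_url, session_id, code):
    """POST /sessions/{id}/statements: run one statement inside an existing session."""
    body = json.dumps({"code": code}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/sessions/{session_id}/statements", data=body,
        headers={"Content-Type": "application/json"}, method="POST",
    )


def http_json(request):
    """Send a request and decode the JSON response (needs a reachable Livy server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_statement(fetch, base_url, session_id, statement_id, poll_seconds=1.0, max_polls=60):
    """Poll GET /sessions/{id}/statements/{stmt} until the statement reaches a final state."""
    url = f"{base_url}/sessions/{session_id}/statements/{statement_id}"
    for _ in range(max_polls):
        statement = fetch(urllib.request.Request(url, method="GET"))
        if statement["state"] in ("available", "error", "cancelled"):
            return statement
        time.sleep(poll_seconds)
    raise TimeoutError(f"statement {statement_id} did not finish after {max_polls} polls")
```

Against a real server, `http_json` would be passed as `fetch`. Because starting a session launches a Spark application, it is the slow step, which is presumably why pre‑warming sessions matters: a statement sent to an already idle session skips that startup cost.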
