Deploying Apache Spark on YARN vs Kubernetes: Architecture, Benefits, and Comparison

This article explains how Apache Spark can be deployed using the traditional Hadoop YARN resource manager and the newer Kubernetes approach, detailing configuration steps, submission methods, and a comprehensive comparison of isolation, scalability, learning curve, logging, performance, and cost considerations.

DevOps

Jun 7, 2023

Apache Spark is an open‑source data‑processing framework that distributes tasks across a cluster, but it relies on a cluster manager to allocate resources and launch those tasks. Historically, Hadoop YARN has filled this role, which requires extensive configuration of Java, Scala, YARN, and Spark on each node.

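With the rise of cloud computing and containers, Spark gained native support for Kubernetes (first shipped as experimental in Spark 2.3 and declared generally available in Spark 3.1, so a release such as 3.3.1 includes it), decoupling Spark deployments from Hadoop and HDFS and offering greater flexibility. The article outlines the traditional YARN deployment, including the two launch modes, client and cluster (historically written as the master values yarn‑client and yarn‑cluster, now selected with --master yarn plus --deploy-mode), and the use of spark-submit to submit jobs.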
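As a rough illustration of the YARN submission path described above, the sketch below assembles a spark-submit command line in Python. The flags (--master, --deploy-mode, --class, --num-executors) are standard spark-submit options; the helper name yarn_submit_command, the jar path, and the class name are hypothetical placeholders, not taken from the original article.

```python
import shlex


def yarn_submit_command(app_jar, main_class, deploy_mode="cluster", num_executors=2):
    """Build a spark-submit argument list for a YARN-managed cluster.

    deploy_mode "client" keeps the driver on the submitting machine;
    "cluster" runs the driver inside a YARN container.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,
        "--class", main_class,
        "--num-executors", str(num_executors),
        app_jar,
    ]


# Hypothetical jar path and class name, for illustration only.
cmd = yarn_submit_command("/opt/jobs/example-job.jar", "com.example.ExampleJob")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each argument separate, so it can be handed to subprocess.run without shell quoting concerns.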

It then discusses the challenges of YARN: a complex initial setup, ongoing maintenance overhead, and difficulty scaling resources dynamically during traffic spikes. Because the cluster has to be sized for peak load, hardware sits under‑utilized the rest of the time, which raises costs.

The Kubernetes deployment section describes how Spark on Kubernetes works: when a job is submitted, a driver pod is created, and the driver in turn requests the executor pods. When the application finishes, the executor pods are terminated and cleaned up, while the driver pod remains in a completed state so its logs can still be read. Submission methods include direct spark-submit with a master URL prefixed by k8s://, using the Spark Operator (installed via Helm) with YAML manifests, and integrating with Airflow via SparkKubernetesOperator and SparkKubernetesSensor.
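The first two submission methods can be sketched as follows, under stated assumptions. The first helper builds a spark-submit command with a k8s:// master URL; the second builds a SparkApplication manifest of the kind the Spark Operator consumes, emitted here as JSON (which kubectl accepts alongside YAML). The configuration keys and manifest fields are standard, but the helper names, API server address, image name, jar path, class name, resource sizes, and the "spark" service account are all hypothetical placeholders.

```python
import json
import shlex


def k8s_submit_command(api_server, image, app_jar, main_class,
                       executors=2, namespace="default"):
    """Build a spark-submit argument list targeting a Kubernetes API server."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", "example-job",
        "--class", main_class,
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.executor.instances={executors}",
        app_jar,
    ]


def spark_application_manifest(name, image, app_jar, main_class,
                               executors=2, namespace="default"):
    """Build a SparkApplication resource for the Spark Operator as a dict."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Scala",
            "mode": "cluster",
            "image": image,
            "mainClass": main_class,
            "mainApplicationFile": app_jar,
            "sparkVersion": "3.3.1",
            # Resource sizes and service account are illustrative assumptions.
            "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
            "executor": {"cores": 1, "instances": executors, "memory": "1g"},
        },
    }


# Hypothetical values; local:// means the jar is already inside the image.
jar = "local:///opt/spark/jobs/example-job.jar"
image = "registry.example.com/spark-job:latest"

print(shlex.join(k8s_submit_command("https://k8s.example.com:6443", image, jar,
                                    "com.example.ExampleJob")))
print(json.dumps(spark_application_manifest("example-job", image, jar,
                                            "com.example.ExampleJob"), indent=2))
```

With the operator route, the manifest would be applied with kubectl and the operator performs the submission itself. That is also the kind of resource an Airflow task would hand to the cluster.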

A side‑by‑side comparison highlights that Kubernetes provides finer‑grained environment isolation via containers, dynamic scaling of executor pods, and better cost control in cloud environments, while requiring developers to understand pod concepts and cloud basics. Logging differs because logs reside in individual pods and must be persisted externally for later analysis.
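To make the dynamic scaling and log persistence points concrete, here is a minimal sketch of the Spark configuration typically involved. The spark.dynamicAllocation.* and spark.eventLog.* keys are standard Spark settings; shuffle tracking is enabled because Spark 3.x on Kubernetes has no external shuffle service. The helper names and the s3a:// bucket are hypothetical. Note that the event log only persists application history for the Spark History Server; container stdout and stderr would still need a separate log collector.

```python
def elastic_executor_conf(min_executors, max_executors, event_log_dir=None):
    """Return Spark settings that let the executor count grow and shrink."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    conf = {
        "spark.dynamicAllocation.enabled": "true",
        # Tracks shuffle data on executors instead of an external shuffle service.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }
    if event_log_dir:
        # Persist application events outside the short-lived pods.
        conf["spark.eventLog.enabled"] = "true"
        conf["spark.eventLog.dir"] = event_log_dir
    return conf


def as_submit_flags(conf):
    """Turn a settings dict into repeated --conf key=value arguments."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


# Hypothetical bucket name, for illustration only.
flags = as_submit_flags(elastic_executor_conf(1, 10, "s3a://example-bucket/spark-events"))
print(" ".join(flags))
```

These flags would be appended to a spark-submit invocation; the upper bound caps how many executor pods a single job can request during a spike.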

Performance differences are minimal (≈4.5% according to TPC‑DS benchmarks), but Kubernetes may lose data locality when accessing HDFS, a gap mitigated by modern network speeds. Cost advantages stem from on‑demand resource provisioning, avoiding idle hardware.

In conclusion, Spark on Kubernetes offers a flexible, cost‑effective solution for cloud‑native deployments, especially when combined with DevOps practices, while YARN remains suitable for on‑premises Hadoop ecosystems or when extensive local data storage is required.
