
How Spark Runs on YARN: From Client Submission to Executor Execution

This article explains the end‑to‑end workflow of Spark on YARN, covering client initialization, ApplicationMaster actions, driver and executor roles, RDD fundamentals, SparkSQL processing, and practical code examples for building and tuning distributed Spark jobs.

21CTO

Spark Overview

Spark is the core component of the Berkeley Data Analytics Stack (BDAS), a distributed programming framework that extends the MapReduce model with richer operators such as filter, join, and groupByKey. It abstracts distributed data as Resilient Distributed Datasets (RDDs) and provides APIs inspired by Scala’s functional programming model.

Spark on YARN

Spark runs on YARN by submitting an application that the ResourceManager schedules across the cluster.

1. Client Operations

Initialize yarnClient using yarnConf and start it.

Create a client Application, obtain its ID, and verify that the memory requested for each executor and for the ApplicationMaster fits within what the cluster can grant to a single container; otherwise throw IllegalArgumentException.

Configure resources and environment variables, including the staging directory, local resources (JARs, log4j.properties), and container launch context.

Set the Application submission context: application name, queue, AM container, and mark the job type as Spark.

Request memory and submit the Application to the ResourceManager via yarnClient.submitApplication.

After submission, the client can exit (in yarn-cluster mode, where the driver runs inside the ApplicationMaster); the job runs entirely within the YARN cluster, and its results are written to HDFS or to the logs.
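The resource check in the client steps above can be sketched in plain Scala. This is a simplified model, not Spark's source: the names ResourceRequest, verifyClusterResources, and overheadMb are hypothetical, and it assumes the check compares each request (plus some per-container overhead) against the largest container the cluster will grant.

```scala
// Simplified model of the client-side resource check; not Spark's actual code.
// ResourceRequest, verifyClusterResources, and overheadMb are hypothetical names.
final case class ResourceRequest(memoryMb: Int, overheadMb: Int) {
  def totalMb: Int = memoryMb + overheadMb
}

def verifyClusterResources(
    maxContainerMb: Int,
    executor: ResourceRequest,
    appMaster: ResourceRequest): Unit = {
  // Reject the submission early if a single executor cannot fit in any container.
  if (executor.totalMb > maxContainerMb)
    throw new IllegalArgumentException(
      s"Required executor memory (${executor.totalMb} MB) is above the " +
        s"maximum container size of this cluster ($maxContainerMb MB)")
  // The ApplicationMaster container is subject to the same limit.
  if (appMaster.totalMb > maxContainerMb)
    throw new IllegalArgumentException(
      s"Required ApplicationMaster memory (${appMaster.totalMb} MB) is above the " +
        s"maximum container size of this cluster ($maxContainerMb MB)")
}

// A request that fits passes silently.
verifyClusterResources(8192, ResourceRequest(4096, 384), ResourceRequest(1024, 384))

// A request that does not fit is rejected before anything is submitted.
val rejected =
  try {
    verifyClusterResources(2048, ResourceRequest(4096, 384), ResourceRequest(1024, 384))
    false
  } catch {
    case _: IllegalArgumentException => true
  }
```

Failing on the client side means an impossible request never reaches the ResourceManager queue.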

2. YARN Operations

Run the ApplicationMaster’s run method.

Set necessary environment variables.

Create and start amClient.

Configure the Spark UI’s AmIpFilter before the UI starts.

Launch a driver thread that creates the SparkContext.

Wait for SparkContext initialization (default up to 10 retries); on timeout the application fails.

Register the ApplicationMaster with the ResourceManager once the driver is ready.

Allocate and start executors by obtaining containers from yarnAllocator and launching them.

Tasks run inside CoarseGrainedExecutorBackend, reporting status to the scheduler via Akka (in Spark 1.x; later releases replaced Akka with a Netty-based RPC layer) until the job completes.
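The "wait for SparkContext initialization" step is a bounded polling loop. The following is a minimal sketch of that pattern in plain Scala, not the ApplicationMaster's actual code; waitUntilReady, maxTries, and intervalMs are hypothetical names chosen for illustration.

```scala
import scala.annotation.tailrec

// Hypothetical helper: checks `isReady` at most `maxTries` times,
// sleeping `intervalMs` milliseconds between checks.
def waitUntilReady(maxTries: Int, intervalMs: Long)(isReady: () => Boolean): Boolean = {
  require(maxTries >= 1, "maxTries must be at least 1")
  @tailrec
  def loop(attempt: Int): Boolean =
    if (isReady()) true
    else if (attempt >= maxTries) false // out of attempts: the caller fails the application
    else {
      Thread.sleep(intervalMs)
      loop(attempt + 1)
    }
  loop(1)
}

// Simulate a context that becomes available on the third poll.
var polls = 0
val ready = waitUntilReady(maxTries = 10, intervalMs = 1) { () =>
  polls += 1
  polls >= 3
}

// Simulate a context that never initializes.
val timedOut = !waitUntilReady(maxTries = 3, intervalMs = 1)(() => false)
```

Bounding the wait lets the ApplicationMaster release its container when the user code never creates a SparkContext, rather than holding cluster resources indefinitely.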

Spark Node Concepts

Driver

The driver process runs the user’s main() method, creates the SparkContext, builds RDDs, and defines transformation and action operations. It translates the logical DAG into a physical execution plan and schedules tasks to executors.

Executor

Executors run the tasks assigned by the driver and return results. They host a block manager that caches RDD partitions in memory, enabling fast reuse across actions.
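As a small illustration of executor-side caching, the sketch below marks an RDD with cache() and then runs two actions against it. It assumes a Spark dependency on the classpath and uses a local master purely for demonstration; on YARN the master is supplied at submission time.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local master is an assumption for a quick test run.
val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("cache-sketch"))

// cache() only marks the RDD; partitions are stored when first computed.
val squares = sc.parallelize(1 to 1000).map(n => n.toLong * n).cache()

// First action: computes the partitions and stores them in the block manager.
val total = squares.reduce(_ + _)

// Second action: can read the cached partitions without recomputing the map.
val evens = squares.filter(_ % 2 == 0).count()

sc.stop()
```

Without cache(), the second action would recompute the map step from the source collection.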

RDD Fundamentals

An RDD is an immutable distributed collection of objects partitioned across the cluster. RDDs are created either by reading external data sources or by parallelizing in‑memory collections.

RDDs support two types of operations:

Transformations (e.g., map, filter, join) produce a new RDD and are lazily evaluated.

Actions (e.g., count, collect, saveAsTextFile) trigger computation and return results to the driver or write data to storage.
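The difference between the two kinds of operations shows up in their return types. In this sketch (local master assumed for demonstration), map and filter only return new RDDs, while count and collect run a job and hand plain values back to the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("lazy-sketch"))

val numbers = sc.parallelize(1 to 10) // RDD built from an in-memory collection
val doubled = numbers.map(_ * 2)      // transformation: returns an RDD, no job runs yet
val large = doubled.filter(_ > 10)    // transformation: still no job

val howMany = large.count()           // action: runs a job, returns a Long
val values = large.collect().sorted   // action: returns an Array[Int] to the driver

sc.stop()
```

Because nothing runs until the action, Spark sees the whole chain at once and can pipeline map and filter in a single pass over each partition.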

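Spark records the lineage between RDDs so that it can recompute lost partitions. Narrow dependencies (each parent partition feeds at most one child partition, as in map or filter) allow pipelined execution and efficient fault recovery; wide dependencies (e.g., groupByKey, or a join whose inputs are not co-partitioned) require data shuffling and broader recomputation.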
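Lineage and dependency types can be inspected directly on an RDD. The sketch below assumes a reasonably recent Spark release (1.3 or later, where pair-RDD operations need no extra import and ShuffleDependency takes three type parameters) and a local master for demonstration.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("lineage-sketch"))

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 2)
val pairs = words.map(w => (w, 1))    // narrow: each output partition reads one input partition
val counts = pairs.reduceByKey(_ + _) // wide: values for a key may arrive from any partition

// Each RDD keeps references to its parents; this is the recorded lineage.
val mapIsNarrow = pairs.dependencies.head.isInstanceOf[NarrowDependency[_]]
val reduceIsWide = counts.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]

// Human-readable lineage; indentation marks the shuffle boundary.
println(counts.toDebugString)

val result = counts.collect().toMap

sc.stop()
```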

Spark Example Workflow

Create RDDs.

Generate an execution plan; Spark pipelines transformations and splits the job into stages.

Schedule tasks for each stage, ensuring all tasks of a stage finish before moving to the next.
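How a job is cut into stages can be modeled with a toy function. This is a deliberately simplified sketch, not Spark's scheduler: Op and splitIntoStages are hypothetical names, and it handles only a linear chain of operations, starting a new stage at every operation that needs a shuffle.

```scala
// Toy model of stage splitting for a linear chain of operations.
// Op and splitIntoStages are hypothetical; Spark's real scheduler works on a DAG.
final case class Op(name: String, wide: Boolean)

def splitIntoStages(ops: List[Op]): List[List[Op]] =
  ops.foldLeft(List.empty[List[Op]]) {
    case (Nil, op)               => List(List(op))                   // first operation opens stage 0
    case (stages, op) if op.wide => stages :+ List(op)               // shuffle boundary: new stage
    case (stages, op)            => stages.init :+ (stages.last :+ op) // narrow: pipeline into current stage
  }

val plan = List(
  Op("textFile", wide = false),
  Op("map", wide = false),
  Op("reduceByKey", wide = true),
  Op("filter", wide = false)
)

val stages = splitIntoStages(plan)
// stages: List(List(textFile, map), List(reduceByKey, filter))
```

In this model, every task of the first stage must finish writing its shuffle output before any task of the second stage can start, which is the ordering constraint described above.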

Data Partitioning

Spark controls data placement across nodes to minimize network traffic. Key‑value RDDs can be partitioned (e.g., hash partitioning) so that records with the same key reside on the same executor.
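Hash partitioning can be checked with Spark's HashPartitioner and partitionBy. The sketch assumes Spark 1.3 or later and a local master; the sample keys and the choice of four partitions are made up for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("partition-sketch"))

val partitioner = new HashPartitioner(4)
val visits = sc.parallelize(Seq(("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4), ("bob", 5)))

// Shuffle once so that every record with the same key lands in the same partition.
val partitioned = visits.partitionBy(partitioner)

// For each partition index, collect the distinct keys stored there.
val keysPerPartition = partitioned
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.map(_._1).toSet)))
  .collect()

// Flatten to (key, partitionIndex) pairs and check that no key spans two partitions.
val keyHomes = keysPerPartition.flatMap { case (idx, keys) => keys.map(k => (k, idx)) }
val eachKeyInOnePartition = keyHomes.groupBy(_._1).values.forall(_.length == 1)

sc.stop()
```

Once an RDD carries a partitioner, later key-based operations such as reduceByKey with the same partitioner can avoid a second shuffle.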

SparkSQL Processing

SparkSQL adds schema information to existing RDDs, registers them as tables, and executes SQL queries against them. When used with Hive, Spark reads the Hive metadata, creates a SchemaRDD, and runs queries through HiveContext.

SparkSQL Parsing

The parser builds a logical plan tree, then performs binding, optimization, and physical planning similar to traditional databases. SparkSQL has two entry points: SQLContext (Catalyst parser) and HiveContext (supports HiveQL).
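To make the optimization phase concrete, here is a toy logical plan with a single rewrite rule that merges adjacent filters. The case classes and the combineFilters function are hypothetical stand-ins written for this illustration, not SparkSQL's own classes; the users table comes from the example query used in this article, and the city column is invented.

```scala
// Toy logical plan; these types are hypothetical and far simpler than SparkSQL's.
sealed trait Plan
final case class Scan(table: String) extends Plan
final case class Filter(condition: String, child: Plan) extends Plan
final case class Project(columns: List[String], child: Plan) extends Plan

// One optimization rule: collapse directly nested filters into a single filter,
// so the data is scanned and tested once.
def combineFilters(plan: Plan): Plan = plan match {
  case Filter(outer, child) =>
    combineFilters(child) match {
      case Filter(inner, grandChild) => Filter(s"($inner) AND ($outer)", grandChild)
      case other                     => Filter(outer, other)
    }
  case Project(cols, child) => Project(cols, combineFilters(child))
  case scan: Scan           => scan
}

// Logical plan as a parser might produce it: project over two stacked filters.
val logical =
  Project(List("name"), Filter("age > 18", Filter("city = 'NYC'", Scan("users"))))

val optimized = combineFilters(logical)
```

A real optimizer applies many such rules repeatedly until the tree stops changing, then hands the result to physical planning.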

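Code snippets for Hive and JDBC integration:

import org.apache.spark.sql.hive.HiveContext

// sc is an existing SparkContext
val hiveCtx = new HiveContext(sc)
val rows = hiveCtx.sql("SELECT name, age FROM users")
val firstRow = rows.first()
println(firstRow.getString(0)) // column 0 is name

import java.sql.DriverManager

// mySQLUrl, dataDate, allCreatedLabels, sqlContext, and TABLE_DATA_CYCLE are defined elsewhere
Class.forName("com.mysql.jdbc.Driver")
val conn = DriverManager.getConnection(mySQLUrl)
try {
  val stmt = conn.createStatement()
  try {
    // Values are concatenated into the SQL text, so they must come from trusted sources
    stmt.execute("UPDATE CI_LABEL_INFO SET DATA_STATUS_ID = 2, DATA_DATE = '" + dataDate + "' WHERE LABEL_ID IN (" + allCreatedLabels.mkString(",") + ")")
  } finally stmt.close()
} finally conn.close() // release the connection even if the update fails
// Load a dimension table over JDBC and cache it for reuse
val dimTable = sqlContext.jdbc(mySQLUrl, "DIM_COC_INDEX_MODEL_TABLE_CONF").cache()
val targets = dimTable.filter("TABLE_DATA_CYCLE = " + TABLE_DATA_CYCLE).collect()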