How Spark Runs on YARN: From Client Submission to Executor Execution
This article explains the end‑to‑end workflow of Spark on YARN, covering client initialization, ApplicationMaster actions, driver and executor roles, RDD fundamentals, SparkSQL processing, and practical code examples for building and tuning distributed Spark jobs.
Spark Overview
Spark is the core component of the BDAS platform, a distributed programming framework that extends MapReduce with richer operators such as filter , join , and groupByKey . It abstracts distributed data as Resilient Distributed Datasets (RDD) and provides APIs inspired by Scala’s functional programming model.
Spark on YARN
Spark runs on YARN by submitting an application that the ResourceManager schedules across the cluster.
1. Client Operations
Initialize yarnClient using yarnConf and start it.
Create a client Application, obtain its ID, and verify that cluster resources satisfy the requested executor and ApplicationMaster resources; otherwise throw IllegalArgumentException.
Configure resources and environment variables, including the staging directory, local resources (JARs, log4j.properties), and container launch context.
Set the Application submission context: application name, queue, AM container, and mark the job type as Spark.
Request memory and submit the Application to the ResourceManager via yarnClient.submitApplication.
After submission, the client can exit; the job runs entirely within the YARN cluster and its results are stored in HDFS or logs.
2. YARN Operations
Run the ApplicationMaster’s run method.
Set necessary environment variables.
Create and start amClient.
Configure
Spark UI AmIpFilterbefore the UI starts.
Launch a driver thread that creates the SparkContext.
Wait for SparkContext initialization (default up to 10 retries); on timeout the application fails.
Register the ApplicationMaster with the ResourceManager once the driver is ready.
Allocate and start executors by obtaining containers from yarnAllocator and launching them.
Tasks run inside CoarseGrainedExecutorBackend, reporting status to the scheduler via Akka until the job completes.
Spark Node Concepts
Driver
The driver process runs the user’s main() method, creates the SparkContext, builds RDDs, and defines transformation and action operations. It translates the logical DAG into a physical execution plan and schedules tasks to executors.
Executor
Executors run the tasks assigned by the driver and return results. They host a block manager that caches RDD partitions in memory, enabling fast reuse across actions.
RDD Fundamentals
An RDD is an immutable distributed collection of objects partitioned across the cluster. RDDs are created either by reading external data sources or by parallelizing in‑memory collections.
RDDs support two types of operations:
Transformations (e.g., map, filter, join) produce a new RDD and are lazily evaluated.
Actions (e.g., count, collect, save) trigger computation and return results to the driver or write data to storage.
Spark records lineage between RDDs to recompute lost partitions. Narrow dependencies (one‑to‑one) allow pipelined execution and efficient fault recovery; wide dependencies (e.g., join) require data shuffling and broader recomputation.
Spark Example Workflow
Create RDDs.
Generate an execution plan; Spark pipelines transformations and splits the job into stages.
Schedule tasks for each stage, ensuring all tasks of a stage finish before moving to the next.
Data Partitioning
Spark controls data placement across nodes to minimize network traffic. Key‑value RDDs can be partitioned (e.g., hash partitioning) so that records with the same key reside on the same executor.
SparkSQL Shuffle Process
SparkSQL adds schema information to existing RDDs, registers them as tables, and executes SQL queries. When using Hive, Spark reads Hive metadata, creates SchemaRDD, and runs queries via HiveContext.
SparkSQL Parsing
The parser builds a logical plan tree, then performs binding, optimization, and physical planning similar to traditional databases. SparkSQL has two entry points: SQLContext (Catalyst parser) and HiveContext (supports HiveQL).
Code snippets for Hive integration:
import org.apache.spark.sql.hive.HiveContext
val hiveCtx = new HiveContext(sc)
val rows = hiveCtx.sql("SELECT name, age FROM users")
val firstRow = rows.first()
println(firstRow.getString(0))
Class.forName("com.mysql.jdbc.Driver")
val conn = DriverManager.getConnection(mySQLUrl)
val stmt = conn.createStatement()
stmt.execute("UPDATE CI_LABEL_INFO SET DATA_STATUS_ID = 2, DATA_DATE = '" + dataDate + "' WHERE LABEL_ID IN (" + allCreatedLabels.mkString(",") + ")")
stmt.close()
val dimTable = sqlContext.jdbc(mySQLUrl, "DIM_COC_INDEX_MODEL_TABLE_CONF").cache()
val targets = dimTable.filter("TABLE_DATA_CYCLE = " + TABLE_DATA_CYCLE).collect()Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
