Big Data 18 min read

Unveiling Spark on YARN: From RDD Basics to Cluster Execution

This article explains Apache Spark’s core concepts, the RDD programming model, how Spark runs on YARN with driver and executor nodes, the distinction between transformations and actions, partitioning strategies, and an overview of SparkSQL processing.

21CTO

Mar 30, 2016

Unveiling Spark on YARN: From RDD Basics to Cluster Execution

Apache Spark is a big‑data processing framework built for speed, ease of use, and complex analytics, originally created at UC Berkeley in 2009 and released as an Apache open‑source project in 2010.

Spark Overview

Spark is the central component of many big‑data platforms, providing a distributed programming framework that implements MapReduce‑style operators such as map and reduce, as well as richer operators like filter, join, and groupByKey. It abstracts distributed data as Resilient Distributed Datasets (RDDs) and offers APIs for task scheduling, RPC, serialization, and compression, all written in Scala and exposing a Scala‑like programming interface.

Spark on YARN

The diagram shows the end‑to‑end process from job submission to completion when Spark runs on YARN.

1. Client Operations

Initialize yarnClient using yarnConf and start it.

Create an Application object and obtain its ID, then verify that the cluster has enough resources for executors and the ApplicationMaster; otherwise throw IllegalArgumentException.

Set resources and environment variables, including the staging directory, local resources (jar files, log4j.properties), and launch the Container with its Context.

Configure the application’s context (name, queue, AM container, mark job type as Spark).

Request memory and finally submit the application to the ResourceManager via yarnClient.submitApplication.

After submission, the client can exit; the job runs entirely on the YARN cluster, with results stored in HDFS or logs.

2. YARN Cluster Operations

Run the

ApplicationMaster

run

method.

Set necessary environment variables.

Create and start amClient.

Configure the Spark UI before it starts, setting AmIpFilter.

In startUserClass, launch a driver thread that creates the SparkContext.

Wait for SparkContext initialization, retrying up to spark.yarn.applicationMaster.waitTries (default 10); if it exceeds, the application fails.

When the driver and SparkContext are ready, register the ApplicationMaster with the ResourceManager via amClient.

Allocate and start executors: the yarnAllocator obtains the number of executors, creates containers, and launches them.

If any step fails, the application status is marked FAILED and the SparkContext is shut down.

Driver and Executor Nodes

Driver runs the user’s main() method, creates the SparkContext, builds RDDs, and translates logical operations into a DAG of stages and tasks. It also schedules tasks to executors, tracks cached data locations, and performs data‑local scheduling to minimize network traffic.

Executor runs the tasks assigned by the driver, stores cached RDD partitions in memory, and returns results back to the driver. Executors are launched inside containers allocated by YARN.

RDD Basics

An RDD is an immutable, partitioned collection of records that can be created by reading external data or by parallelizing a local collection. RDDs support two types of operations:

Transformations (e.g., map, filter, join, groupByKey) produce a new RDD and are evaluated lazily.

Actions (e.g., count, collect, save) trigger computation and return results to the driver or write data to external storage.

Because transformations are lazy, Spark builds a lineage graph to track dependencies. Narrow dependencies (e.g., map) allow pipelined execution and efficient fault recovery, while wide dependencies (e.g., join) may require shuffling and data persistence.

RDD Dependencies

Narrow dependencies mean each child partition depends on a single parent partition, enabling pipeline execution and localized recomputation. Wide dependencies involve multiple parent partitions, requiring data shuffle and often persisting intermediate results to speed up recovery.

SparkSQL Overview

SparkSQL registers RDDs with schema information as temporary tables and allows SQL queries over them. When using the Hive integration, metadata is read from hive-site.xml, a HiveContext is created, and SQL (or HQL) queries are executed against Hive tables, returning results as RDDs.

SQL parsing in Spark follows a similar pipeline to traditional databases: the query is parsed into a logical tree, bound to catalog metadata, optimized via rule‑based transformations, and finally executed as a physical plan.

Sample Code Snippets

Creating an RDD from a text file:

val linesRDD = sc.textFile("/path/to/file")
val sparkLines = linesRDD.filter(line => line.contains("spark"))
val count = sparkLines.count()

Reading from MySQL via JDBC:

Class.forName("com.mysql.jdbc.Driver")
val conn = DriverManager.getConnection(jdbcUrl)
val stmt = conn.createStatement()
stmt.executeUpdate("UPDATE table SET col='value' WHERE id=1")
stmt.close()
val df = sqlContext.jdbc(jdbcUrl, "table_name").cache()

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

SparkSQL YARN Apache Spark RDD

Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.