
Introduction to Apache Spark: Architecture, RDD, Spark on YARN, and SparkSQL

This article introduces Apache Spark’s core architecture, explains how Spark runs on YARN, details driver and executor roles, describes RDD concepts and dependencies, and outlines SparkSQL’s schema‑based query processing, providing code examples for HiveContext and JDBC integration.

Architecture Digest

Apache Spark, a core component of the Berkeley Data Analytics Stack (BDAS), is a distributed computing framework that extends the MapReduce model with richer operators such as filter, join, and groupByKey, providing a fast, scalable platform for cluster computing.
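To give a feel for those operators, here they are mimicked on plain Scala collections with small made-up keyed datasets — an analogy only, not the Spark RDD API itself:

```scala
// Hypothetical keyed datasets: (userId, name) and (userId, item)
val users  = Seq((1, "alice"), (2, "bob"))
val orders = Seq((1, "book"), (1, "pen"), (3, "mug"))

// filter: keep only the records matching a predicate
val filtered = orders.filter { case (_, item) => item != "mug" }

// groupByKey: collect all values that share a key
val grouped = orders.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

// join: combine two keyed datasets on their common key
val joined = for {
  (uid, name) <- users
  (oid, item) <- orders
  if uid == oid
} yield (uid, (name, item))
```

In Spark the same operator names apply to RDDs of key-value pairs, but the work is distributed across partitions instead of running on one local collection.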

Spark abstracts data as Resilient Distributed Datasets (RDDs) and distinguishes lazy transformations from actions that trigger computation. It runs on several cluster managers, most notably YARN: a client initializes a YarnClient, creates an application, uploads the required resources, and submits the application; the ApplicationMaster then coordinates the driver and requests containers for the executors.
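As a configuration sketch (Spark 1.x API, with a hypothetical application name), an application targets YARN simply by setting the master in its SparkConf; the submission machinery described above runs underneath:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: run on YARN in client mode, so the driver stays in the submitting
// process while executors run in YARN containers
val conf = new SparkConf()
  .setAppName("yarn-example")   // hypothetical application name
  .setMaster("yarn-client")     // Spark 1.x syntax; later versions use "yarn"
val sc = new SparkContext(conf)
```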

The driver process runs the user program, creates the SparkContext, builds the execution plan, and schedules tasks; executors run those tasks, cache RDD partitions, and return results. Spark's DAG scheduler turns the chain of transformations into stages, and the task scheduler dispatches each stage's tasks to workers.
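The lazy-evaluation half of this can be mimicked with a plain Scala Iterator (an analogy only — an Iterator is local and single-use, unlike an RDD): the pipeline is declared up front, but nothing executes until a terminal operation plays the role of an action.

```scala
var evaluated = 0
val source = (1 to 10).iterator

// Like an RDD transformation: declared now, executed later
val doubled = source.map { x => evaluated += 1; x * 2 }

val before = evaluated   // still 0: no element has been processed yet
val total  = doubled.sum // the "action": forces the whole pipeline to run
val after  = evaluated   // now 10: every element went through the map
```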

RDDs are immutable, partitioned collections. Transformations produce new RDDs lazily, while actions trigger computation. Dependencies between RDDs are either narrow (each parent partition feeds at most one child partition) or wide (a child partition depends on many parent partitions), which determines shuffle behavior and how much work fault recovery must redo.
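The distinction can be illustrated with local collections standing in for partitions (a toy model, not Spark code): a map is narrow because each output partition needs exactly one input partition, while a group-by is wide because every output group may need records from every partition.

```scala
// Two hypothetical partitions of an RDD
val partitions = Seq(Seq(1, 2, 3), Seq(4, 5, 6))

// Narrow dependency: each output partition is computed from exactly one
// input partition, so a lost partition is recomputed from a single parent
val narrow = partitions.map(_.map(_ * 2))

// Wide dependency: grouping by key must pull records from every partition
// (the shuffle in Spark); recomputing one group may touch all parents
val wide = partitions.flatten.groupBy(_ % 2)
```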

SparkSQL adds a schema layer on top of RDDs, allowing SQL queries via SQLContext or HiveContext, with parsing, analysis, and optimization phases similar to those of a traditional database.

// Query a Hive table through HiveContext (Spark 1.x API);
// sc is an existing SparkContext
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
val rows = hiveCtx.sql("SELECT name, age FROM users")
val firstRow = rows.first()
println(firstRow.getString(0)) // first column of the first row: name
// Write results back to MySQL over JDBC; mySQLUrl, dataDate, and
// allCreatedLabels are defined elsewhere in the original program
import java.sql.DriverManager

Class.forName("com.mysql.jdbc.Driver") // register the MySQL JDBC driver
val conn = DriverManager.getConnection(mySQLUrl)
val stat1 = conn.createStatement()
stat1.execute("UPDATE CI_LABEL_INFO SET DATA_STATUS_ID = 2, DATA_DATE = '" + dataDate +
  "' WHERE LABEL_ID IN (" + allCreatedLabels.mkString(",") + ")")
stat1.close()
conn.close() // release the connection as well as the statement
Tags: Big Data, SparkSQL, distributed computing, YARN, Spark, RDD
Written by Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
