
Introduction to Apache Spark: Architecture, RDD, Spark on YARN, and SparkSQL

This article introduces Apache Spark’s core architecture, explains how Spark runs on YARN, details driver and executor roles, describes RDD concepts and dependencies, and outlines SparkSQL’s schema‑based query processing, providing code examples for HiveContext and JDBC integration.

Architecture Digest

Apache Spark, a core component of the Berkeley Data Analytics Stack (BDAS), is a distributed computing framework that extends the MapReduce model with richer operators such as filter, join, and groupByKey, providing a fast, scalable platform for cluster computing.
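To give a feel for those operators, here they are mimicked on plain Scala collections with small made-up keyed datasets — an analogy only, not the Spark RDD API itself:

```scala
// Hypothetical keyed datasets: (userId, name) and (userId, item)
val users  = Seq((1, "alice"), (2, "bob"))
val orders = Seq((1, "book"), (1, "pen"), (3, "mug"))

// filter: keep only the records matching a predicate
val filtered = orders.filter { case (_, item) => item != "mug" }

// groupByKey: collect all values that share a key
val grouped = orders.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }

// join: combine two keyed datasets on their common key
val joined = for {
  (uid, name) <- users
  (oid, item) <- orders
  if uid == oid
} yield (uid, (name, item))
```

In Spark the same operator names apply to RDDs of key-value pairs, but the work is distributed across partitions instead of running on one local collection.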

Spark abstracts data as Resilient Distributed Datasets (RDDs) and distinguishes lazy transformations from actions that trigger computation. It runs on several cluster managers, most notably YARN: a client initializes a YarnClient, creates an application, uploads the required resources, and submits the application; the ApplicationMaster then coordinates the driver and requests containers for the executors.
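As a configuration sketch (Spark 1.x API, with a hypothetical application name), an application targets YARN simply by setting the master in its SparkConf; the submission machinery described above runs underneath:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: run on YARN in client mode, so the driver stays in the submitting
// process while executors run in YARN containers
val conf = new SparkConf()
  .setAppName("yarn-example")   // hypothetical application name
  .setMaster("yarn-client")     // Spark 1.x syntax; later versions use "yarn"
val sc = new SparkContext(conf)
```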

The driver process runs the user program, creates the SparkContext, builds the execution plan, and schedules tasks; executors run those tasks, cache RDD partitions, and return results. Spark's DAG scheduler turns the chain of transformations into stages, and the task scheduler dispatches each stage's tasks to workers.
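The lazy-evaluation half of this can be mimicked with a plain Scala Iterator (an analogy only — an Iterator is local and single-use, unlike an RDD): the pipeline is declared up front, but nothing executes until a terminal operation plays the role of an action.

```scala
var evaluated = 0
val source = (1 to 10).iterator

// Like an RDD transformation: declared now, executed later
val doubled = source.map { x => evaluated += 1; x * 2 }

val before = evaluated   // still 0: no element has been processed yet
val total  = doubled.sum // the "action": forces the whole pipeline to run
val after  = evaluated   // now 10: every element went through the map
```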

RDDs are immutable, partitioned collections. Transformations produce new RDDs lazily, while actions trigger computation. Dependencies between RDDs are either narrow (each parent partition feeds at most one child partition) or wide (a child partition depends on many parent partitions), which determines shuffle behavior and how much work fault recovery must redo.
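The distinction can be illustrated with local collections standing in for partitions (a toy model, not Spark code): a map is narrow because each output partition needs exactly one input partition, while a group-by is wide because every output group may need records from every partition.

```scala
// Two hypothetical partitions of an RDD
val partitions = Seq(Seq(1, 2, 3), Seq(4, 5, 6))

// Narrow dependency: each output partition is computed from exactly one
// input partition, so a lost partition is recomputed from a single parent
val narrow = partitions.map(_.map(_ * 2))

// Wide dependency: grouping by key must pull records from every partition
// (the shuffle in Spark); recomputing one group may touch all parents
val wide = partitions.flatten.groupBy(_ % 2)
```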

SparkSQL adds a schema layer on top of RDDs, allowing SQL queries via SQLContext or HiveContext, with parsing, analysis, and optimization phases similar to those of a traditional database.

// Query a Hive table through HiveContext (Spark 1.x API);
// sc is an existing SparkContext
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
val rows = hiveCtx.sql("SELECT name, age FROM users")
val firstRow = rows.first()
println(firstRow.getString(0)) // first column of the first row: name
// Write results back to MySQL over JDBC; mySQLUrl, dataDate, and
// allCreatedLabels are defined elsewhere in the original program
import java.sql.DriverManager

Class.forName("com.mysql.jdbc.Driver") // register the MySQL JDBC driver
val conn = DriverManager.getConnection(mySQLUrl)
val stat1 = conn.createStatement()
stat1.execute("UPDATE CI_LABEL_INFO SET DATA_STATUS_ID = 2, DATA_DATE = '" + dataDate +
  "' WHERE LABEL_ID IN (" + allCreatedLabels.mkString(",") + ")")
stat1.close()
conn.close() // release the connection as well as the statement
Tags: Big Data, SparkSQL, distributed computing, YARN, Spark, RDD
Written by Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
