Introduction to Apache Spark: Architecture, RDD, Spark on YARN, and SparkSQL
This article introduces Apache Spark’s core architecture, explains how Spark runs on YARN, details driver and executor roles, describes RDD concepts and dependencies, and outlines SparkSQL’s schema‑based query processing, providing code examples for HiveContext and JDBC integration.
Apache Spark is a core component of BDAS (the Berkeley Data Analytics Stack). It is a distributed programming framework that extends the MapReduce model with richer operators such as filter, join, and groupByKey, providing a fast, scalable platform for cluster computing.
Spark abstracts data as Resilient Distributed Datasets (RDDs), which support lazy transformations and actions that trigger computation. It runs on several cluster managers, notably YARN. There, the client initializes a YarnClient, creates an application, and submits it together with its resource request; the ApplicationMaster then coordinates the driver and the executors.
The driver process runs the user program, creates SparkContext, schedules tasks, and communicates with executors; executors run tasks, cache RDD partitions, and return results. Spark’s DAG scheduler builds execution plans from transformations, and the task scheduler dispatches tasks to workers.
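The driver's role can be sketched with a small program. This is a minimal illustration, not the article's own code: it assumes Spark is on the classpath and uses a local master (`local[2]`) instead of YARN, and the names `DriverSketch` and `sumOfEvenSquares` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  // Runs in the driver: builds a lazy pipeline, then triggers it with an action.
  def sumOfEvenSquares(sc: SparkContext): Int = {
    // Four partitions, so each stage is split into up to four tasks.
    val numbers = sc.parallelize(1 to 10, numSlices = 4)
    val evens   = numbers.filter(_ % 2 == 0) // transformation: nothing runs yet
    val squares = evens.map(n => n * n)      // transformation: still lazy
    // Action: tasks are scheduled on executors and the result returns to the driver.
    squares.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with two worker threads, for illustration only.
    val conf = new SparkConf().setAppName("driver-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(sumOfEvenSquares(sc)) // prints 220
    finally sc.stop()
  }
}
```

On a real cluster the master would normally be supplied at submission time rather than hard-coded in the program.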
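RDDs are immutable, partitioned collections. Transformations produce new RDDs lazily, while actions trigger computation. A dependency between RDDs is either narrow or wide. In a narrow dependency (map, filter), each parent partition feeds at most one child partition, so a lost partition can be recomputed from its parent alone. In a wide dependency (groupByKey), a child partition draws on many parent partitions, which requires a shuffle and makes recovery more expensive.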
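The difference between the two dependency types can be inspected directly on an RDD. The following is a sketch under assumptions: it targets Spark 1.3 or later (where pair-RDD operations such as reduceByKey need no extra import), runs in local mode, and the names `DependencySketch`, `kind`, and `build` are hypothetical.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DependencySketch {
  // Classifies an RDD by the type of its first parent dependency.
  def kind(rdd: RDD[_]): String = rdd.dependencies.headOption match {
    case Some(_: ShuffleDependency[_, _, _]) => "wide"
    case Some(_: NarrowDependency[_])        => "narrow"
    case _                                   => "other" // no parent, or another dependency type
  }

  // Builds a word-count lineage: map is narrow, reduceByKey is wide.
  def build(sc: SparkContext): (RDD[(String, Int)], RDD[(String, Int)]) = {
    val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 2)
    val pairs  = words.map(w => (w, 1))   // each output partition reads one input partition
    val counts = pairs.reduceByKey(_ + _) // values for a key may come from any input partition
    (pairs, counts)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for illustration only.
    val sc = new SparkContext(new SparkConf().setAppName("dependency-sketch").setMaster("local[2]"))
    try {
      val (pairs, counts) = build(sc)
      println(kind(pairs))          // narrow
      println(kind(counts))         // wide
      println(counts.toDebugString) // lineage, with indentation marking the shuffle boundary
    } finally sc.stop()
  }
}
```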
SparkSQL adds a schema layer on top of RDDs, so data can be queried with SQL through SQLContext or HiveContext. Queries go through parsing, analysis, and optimization stages similar to those of a traditional database.
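As a minimal sketch of the schema layer, the snippet below attaches a schema to an RDD and queries it with SQL. It assumes the Spark 1.3 to 1.6 API (SQLContext, toDF, registerTempTable), which matches the HiveContext era of this article; later releases replace these entry points. The `User` type, the sample rows, and the names `SchemaSketch` and `adultNames` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type; its fields become the table's columns.
case class User(name: String, age: Int)

object SchemaSketch {
  // Registers an RDD of User as a table and returns the names of adults.
  def adultNames(sc: SparkContext): Array[String] = {
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // enables rdd.toDF()

    val users = sc.parallelize(Seq(User("Ann", 34), User("Bob", 17), User("Cid", 52)))
    users.toDF().registerTempTable("users") // schema is inferred from the case class

    sqlContext.sql("SELECT name FROM users WHERE age >= 18 ORDER BY name")
      .collect()
      .map(_.getString(0)) // column 0 is name
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for illustration only.
    val sc = new SparkContext(new SparkConf().setAppName("schema-sketch").setMaster("local[2]"))
    try adultNames(sc).foreach(println)
    finally sc.stop()
  }
}
```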
import java.sql.DriverManager
import org.apache.spark.sql.hive.HiveContext

// Query a Hive table through SparkSQL; sc is an existing SparkContext.
val hiveCtx = new HiveContext(sc)
val rows = hiveCtx.sql("SELECT name, age FROM users")
val firstRow = rows.first()
println(firstRow.getString(0)) // column 0 is name

// Write a status update back to MySQL over JDBC.
// mySQLUrl, dataDate, and allCreatedLabels are defined elsewhere in the application.
Class.forName("com.mysql.jdbc.Driver")
val conn = DriverManager.getConnection(mySQLUrl)
val stat1 = conn.createStatement()
stat1.execute("UPDATE CI_LABEL_INFO set DATA_STATUS_ID = 2 , DATA_DATE ='" + dataDate +"' where LABEL_ID in ("+allCreatedLabels.mkString(",")+")")
stat1.close()
conn.close()

Architecture Digest