Understanding Spark Core, RDD, and Scheduler Components: A Practical Guide

This article introduces Spark's core concepts, explains the RDD abstraction and its four main properties, and details the roles of DAGScheduler, SchedulerBackend, TaskScheduler, and ExecutorBackend, providing practical insights for beginners and intermediate users in big‑data processing.

Big Data Technology & Architecture

Dec 1, 2021

Preface

The author announces a new series on Spark after completing a similar series on Flink, and shares several Chinese‑language resources for learning Flink and Spark, emphasizing that Spark remains popular in Europe and North America for batch processing.

Spark Core

About RDD

RDD (Resilient Distributed Dataset) is the fundamental abstraction in Spark, representing a fault‑tolerant, read‑only collection of data partitions that can be processed in parallel across a cluster.

An RDD is a fault‑tolerant collection of data that is distributed across cluster nodes and manipulated through functional operations.

In plain terms, an RDD is a distributed, read‑only collection of partitioned records; each partition can reside on a different node, enabling parallel computation.

Both definitions are generic and, on their own, do little to build intuition.

The author likens RDD to a series of potatoes at different processing stages, illustrating the four key properties:

partitions: each potato piece corresponds to a data partition.

partitioner: the rule that decides how raw potatoes are split into chips, analogous to the partitioning strategy.

dependencies: the dependence of each processing stage on the previous one, similar to ingredient transformations.

compute: the specific processing method applied at each stage.
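As a rough illustration of the four properties, the following pure‑Python sketch models an RDD as a small data class. Everything in it (ToyRDD, parallelize, partition_by, the parity partitioner) is a hypothetical stand‑in written for this article, not Spark's actual API; it only shows how partitions, partitioner, dependencies, and compute relate to one another.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToyRDD:
    """Toy model of an RDD's four properties; not Spark's real class."""
    num_partitions: int                                        # partitions
    compute: Callable[[int], list]                             # partition index -> records
    dependencies: List["ToyRDD"] = field(default_factory=list)  # parent RDDs
    partitioner: Optional[Callable[[object], int]] = None      # key -> partition index

    def map(self, f):
        # Narrow dependency: child partition i reads only parent partition i.
        return ToyRDD(self.num_partitions,
                      lambda i: [f(x) for x in self.compute(i)],
                      dependencies=[self])

    def partition_by(self, n, partitioner):
        # Wide dependency: each child partition may read every parent partition.
        def compute(i):
            return [kv for p in range(self.num_partitions)
                    for kv in self.compute(p) if partitioner(kv[0]) == i]
        return ToyRDD(n, compute, dependencies=[self], partitioner=partitioner)

    def collect(self):
        # Evaluate every partition lazily, only when a result is requested.
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


def parallelize(data, num_partitions):
    # Split a local list into one slice per partition (round-robin).
    chunks = [data[i::num_partitions] for i in range(num_partitions)]
    return ToyRDD(num_partitions, lambda i: chunks[i])


potatoes = parallelize([1, 2, 3, 4, 5, 6], 3)
chips = potatoes.map(lambda x: x * 10)
result = chips.collect()

pairs = potatoes.map(lambda x: (x, x * 10))
by_parity = pairs.partition_by(2, lambda key: key % 2)  # even keys -> partition 0
```

Nothing is computed until collect() or compute(i) is called; each ToyRDD only records how to derive its partitions from its parents, which is the idea behind the dependencies and compute properties.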

Understanding these properties is essential for mastering Spark.

Key Scheduler Roles in Spark

DAGScheduler

DAGScheduler builds the DAG from user code, splits it into stages at shuffle boundaries, creates TaskSets for each stage, and submits them to the lower‑level TaskScheduler.

Build the DAG from user code.

Split the DAG into stages, using shuffles as boundaries.

Create a TaskSet for each stage and submit it to TaskScheduler.

DAGScheduler Stage‑Division Principle

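Spark turns each job into a DAG of RDDs and schedules it stage by stage. A shuffle (wide dependency) marks a stage boundary: a stage cannot start until the parent stages that produce its shuffle input have finished. Transformations connected only by narrow dependencies are pipelined inside a single stage, and stages that do not depend on one another can run in parallel.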

The example diagram shows how RDDs requiring shuffle become stage boundaries, while narrow‑dependency RDDs stay within the same stage.
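The stage‑division rule can be sketched as a walk backwards from the final RDD. The Node class, the "narrow"/"shuffle" edge labels, and split_stages below are assumptions made for illustration; Spark's DAGScheduler is considerably more involved (it caches stages, handles shared parents, and so on).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical lineage node: a name plus (parent, kind) edges."""
    name: str
    deps: List[Tuple["Node", str]] = field(default_factory=list)  # kind: "narrow" or "shuffle"


def split_stages(final: Node) -> List[List[str]]:
    """Return stages as lists of node names, parent stages first."""
    stages: List[List[str]] = []

    def build_stage(root: Node) -> None:
        members, parents, todo = [], [], [root]
        while todo:
            node = todo.pop()
            members.append(node.name)
            for parent, kind in node.deps:
                if kind == "shuffle":
                    parents.append(parent)  # boundary: parent belongs to an earlier stage
                else:
                    todo.append(parent)     # narrow: pipelined in the same stage
        for parent in parents:
            build_stage(parent)             # emit parent stages before this one
        stages.append(members[::-1])        # members were collected child-first

    build_stage(final)
    return stages


# A word-count style lineage: only reduceByKey needs a shuffle.
lines = Node("textFile")
words = Node("flatMap", [(lines, "narrow")])
pairs = Node("map", [(words, "narrow")])
counts = Node("reduceByKey", [(pairs, "shuffle")])
word_count_stages = split_stages(counts)

# A join with two shuffled inputs: the two parent stages are independent.
left, right = Node("left"), Node("right")
joined = Node("join", [(left, "shuffle"), (right, "shuffle")])
join_stages = split_stages(joined)
```

In the word‑count lineage the three narrow transformations end up in one stage and reduceByKey in a second one. In the join lineage, the "left" and "right" stages have no dependency on each other, so a scheduler would be free to run them in parallel.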

SchedulerBackend

SchedulerBackend tracks the resources of the executors in the cluster. It keeps a hash map from executor ID to that executor's resource state (executorDataMap in CoarseGrainedSchedulerBackend) and turns the free resources into WorkerOffer objects, which it offers to TaskScheduler for task scheduling.
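A minimal sketch of this bookkeeping, assuming a simplified model in which an executor's state is just a host name and a count of free cores. ToySchedulerBackend and its method names are hypothetical and only mirror the idea of an executor map that produces offers.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExecutorData:
    host: str
    free_cores: int


@dataclass
class WorkerOffer:
    executor_id: str
    host: str
    cores: int


class ToySchedulerBackend:
    """Tracks free cores per executor and turns them into offers (illustrative only)."""

    def __init__(self) -> None:
        self.executor_data_map: Dict[str, ExecutorData] = {}

    def register_executor(self, executor_id: str, host: str, cores: int) -> None:
        self.executor_data_map[executor_id] = ExecutorData(host, cores)

    def make_offers(self) -> List[WorkerOffer]:
        # Only executors with at least one free core are offered to the scheduler.
        return [WorkerOffer(eid, data.host, data.free_cores)
                for eid, data in self.executor_data_map.items()
                if data.free_cores > 0]

    def launch_task(self, executor_id: str, cpus_per_task: int = 1) -> None:
        # Launching a task consumes cores on the chosen executor.
        self.executor_data_map[executor_id].free_cores -= cpus_per_task

    def task_finished(self, executor_id: str, cpus_per_task: int = 1) -> None:
        # A finished task returns its cores, making the executor offerable again.
        self.executor_data_map[executor_id].free_cores += cpus_per_task


backend = ToySchedulerBackend()
backend.register_executor("exec-1", "host-a", 2)
backend.register_executor("exec-2", "host-b", 1)
backend.launch_task("exec-2")    # exec-2 now has no free cores
offers = backend.make_offers()   # only exec-1 is offered
```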

TaskScheduler

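TaskScheduler creates a TaskSetManager for each TaskSet it receives, takes data locality into account when assigning tasks, retries failed tasks, launches speculative copies of straggling tasks when speculation is enabled, and reports execution status (including shuffle fetch failures) back to DAGScheduler.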
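The locality and retry behavior can be approximated in a few lines. ToyTaskSetManager below is a hypothetical simplification: it knows only one preferred host per task, whereas Spark works with several locality levels and delay scheduling, and it omits speculation entirely. The default of 4 failures mirrors the default of spark.task.maxFailures.

```python
from typing import Dict, List, Optional


class ToyTaskSetManager:
    """Tracks the pending tasks of one TaskSet; a simplification, not Spark's class."""

    def __init__(self, preferred_hosts: Dict[int, str], max_failures: int = 4):
        self.preferred_hosts = preferred_hosts            # task index -> preferred host
        self.pending: List[int] = sorted(preferred_hosts)
        self.failures: Dict[int, int] = {t: 0 for t in preferred_hosts}
        self.max_failures = max_failures
        self.aborted = False

    def resource_offer(self, host: str) -> Optional[int]:
        """Pick a task for a free core on `host`, preferring data-local tasks."""
        if self.aborted or not self.pending:
            return None
        local = [t for t in self.pending if self.preferred_hosts[t] == host]
        task = local[0] if local else self.pending[0]
        self.pending.remove(task)
        return task

    def task_failed(self, task: int) -> None:
        """Re-queue a failed task, or abort the TaskSet after too many failures."""
        self.failures[task] += 1
        if self.failures[task] >= self.max_failures:
            self.aborted = True
        else:
            self.pending.append(task)


manager = ToyTaskSetManager({0: "host-a", 1: "host-b", 2: "host-a"})
first = manager.resource_offer("host-b")    # task 1 is local to host-b
second = manager.resource_offer("host-c")   # nothing is local, so the first pending task
manager.task_failed(first)                  # task 1 goes back into the queue
```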

ExecutorBackend

ExecutorBackend receives tasks from the scheduler, dispatches them to executor threads, and reports task completion status to the Driver via SchedulerBackend, completing the Spark job lifecycle.
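The receive, run, and report loop can be sketched with a thread pool. ToyExecutorBackend and the status_update callback are hypothetical names; in Spark the status updates travel over RPC to the driver, which is replaced here by a plain Python callback.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


class ToyExecutorBackend:
    """Runs tasks on a thread pool and reports their state (illustrative only)."""

    def __init__(self, status_update: Callable[[int, str, object], None], threads: int = 2):
        self.status_update = status_update   # stands in for the message sent to the driver
        self.pool = ThreadPoolExecutor(max_workers=threads)

    def launch_task(self, task_id: int, func: Callable[[], object]):
        def run():
            self.status_update(task_id, "RUNNING", None)
            try:
                result = func()
            except Exception as exc:
                self.status_update(task_id, "FAILED", repr(exc))
            else:
                self.status_update(task_id, "FINISHED", result)
        return self.pool.submit(run)

    def shutdown(self) -> None:
        # Wait for all submitted tasks to finish before returning.
        self.pool.shutdown(wait=True)


updates: List[Tuple[int, str, object]] = []
executor_backend = ToyExecutorBackend(
    lambda task_id, state, payload: updates.append((task_id, state, payload)))
executor_backend.launch_task(0, lambda: sum(range(10)))
executor_backend.launch_task(1, lambda: 1 / 0)   # this task fails on purpose
executor_backend.shutdown()
```

After shutdown, updates holds a RUNNING entry followed by a FINISHED or FAILED entry for each task, which is the information a driver‑side scheduler would need to free resources or schedule a retry.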

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
