Understanding Spark Core, RDD, and Scheduler Components: A Practical Guide
This article introduces Spark's core concepts, explains the RDD abstraction and its four main properties, and details the roles of DAGScheduler, SchedulerBackend, TaskScheduler, and ExecutorBackend, providing practical insights for beginners and intermediate users in big‑data processing.
Preface
The author announces a new series on Spark after completing a similar series on Flink, and shares several Chinese‑language resources for learning Flink and Spark, emphasizing that Spark remains popular in Europe and North America for batch processing.
Spark Core
About RDD
RDD (Resilient Distributed Dataset) is the fundamental abstraction in Spark, representing a fault‑tolerant, read‑only collection of data partitions that can be processed in parallel across a cluster.
An RDD is a special fault‑tolerant data collection that can be distributed across cluster nodes and operated on in a functional style.
In plain terms, an RDD is a distributed, read‑only collection of partitioned records; each partition can reside on a different node, enabling parallel computation.
Definitions like these are accurate but abstract, and not very helpful on their own. The author likens an RDD to a batch of potatoes at different processing stages, which illustrates its four key properties:
partitions: each potato piece corresponds to a data partition.
partitioner: the rule that decides how raw potatoes are split into chips, analogous to the partitioning strategy.
dependencies: each processing stage depends on the previous one, just as each ingredient is derived from an earlier one.
compute: the specific processing method applied at each stage.
Understanding these properties is essential for mastering Spark.
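To make the four properties concrete, here is a minimal pure‑Python sketch. It is not Spark's RDD class: ToyRDD, parallelize, and the potato/chip variable names are hypothetical and exist only for illustration. It assumes a narrow, one‑to‑one dependency between parent and child partitions.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's class)."""
    num_partitions: int                         # partitions
    compute: Callable[[int], Iterable]          # compute: partition index -> records
    dependencies: List["ToyRDD"] = field(default_factory=list)  # parent RDDs
    partitioner: Optional[Callable[[object], int]] = None       # key -> partition index

    def map(self, f):
        # Narrow dependency: child partition i reads only parent partition i.
        return ToyRDD(
            num_partitions=self.num_partitions,
            compute=lambda i: (f(x) for x in self.compute(i)),
            dependencies=[self],
        )

    def collect(self):
        # Evaluate every partition and concatenate the results.
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


def parallelize(data, num_partitions):
    # Split a local list into roughly equal slices, one per partition.
    data = list(data)

    def compute(i):
        start = i * len(data) // num_partitions
        end = (i + 1) * len(data) // num_partitions
        return iter(data[start:end])

    return ToyRDD(num_partitions=num_partitions, compute=compute)


potatoes = parallelize(range(6), 3)      # raw potatoes in 3 partitions
chips = potatoes.map(lambda x: x * 10)   # one processing stage
result = chips.collect()
```

Nothing is computed until collect() is called: map only records the dependency and a compute function, which is why a lost partition can be rebuilt from its parent.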
Key Scheduler Roles in Spark
DAGScheduler
DAGScheduler builds the DAG from user code, splits it into stages at shuffle boundaries, creates TaskSets for each stage, and submits them to the lower‑level TaskScheduler.
Construct DAG based on user code.
Divide stages using shuffle as a boundary.
Create TaskSets from stages and submit them to TaskScheduler.
DAGScheduler Stage‑Division Principle
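Spark partitions the data, turns each job into a DAG of RDDs, and schedules it stage by stage. Whenever a shuffle (wide dependency) is required, the DAG is cut at that point and the job is divided into multiple stages; transformations connected only by narrow dependencies are pipelined within a single stage. Stages that do not depend on one another can run in parallel.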
The example diagram shows how RDDs requiring shuffle become stage boundaries, while narrow‑dependency RDDs stay within the same stage.
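The stage‑division rule can be sketched as a walk backwards over the lineage graph. The following is a simplified illustration in plain Python, not DAGScheduler's actual code: Node and build_stages are hypothetical names, and the sketch assumes each RDD belongs to exactly one stage, which real Spark does not require.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    name: str
    # Each parent is (node, is_shuffle); is_shuffle=True marks a wide dependency.
    parents: List[Tuple["Node", bool]] = field(default_factory=list)


def build_stages(final):
    """Return stages in execution order; each stage is a list of RDD names."""
    stages = []
    visited = set()

    def visit_stage(root):
        members, shuffle_parents, stack = [], [], [root]
        while stack:
            node = stack.pop()
            if node.name in visited:
                continue
            visited.add(node.name)
            members.append(node.name)
            for parent, is_shuffle in node.parents:
                if is_shuffle:
                    shuffle_parents.append(parent)   # stage boundary
                else:
                    stack.append(parent)             # same stage (pipelined)
        # Parent stages must run first, so emit them before this stage.
        for parent in shuffle_parents:
            if parent.name not in visited:
                visit_stage(parent)
        stages.append(members[::-1])

    visit_stage(final)
    return stages


# Word-count style lineage: only the last step needs a shuffle.
lines = Node("lines")
words = Node("words", [(lines, False)])
pairs = Node("pairs", [(words, False)])
counts = Node("counts", [(pairs, True)])

stages = build_stages(counts)
```

Here the three narrow‑dependency RDDs land in one stage, and the RDD behind the shuffle forms a second stage that runs afterwards.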
SchedulerBackend
SchedulerBackend tracks the resources of the executors in the cluster using an ExecutorDataMap (a hash map from executor identifier to that executor's resource state) and turns the free resources into WorkerOffer objects, which it offers to TaskScheduler for task scheduling.
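As a rough illustration of this bookkeeping, the sketch below keeps a dictionary of executor state and derives offers from it. It is a toy model in plain Python under stated assumptions: the field names, the functions make_offers, launch_task, and task_finished, and the one‑core‑per‑task constant are all hypothetical and do not reproduce Spark's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List

CPUS_PER_TASK = 1  # assumed for this sketch


@dataclass
class ExecutorData:
    host: str
    total_cores: int
    free_cores: int


@dataclass
class WorkerOffer:
    executor_id: str
    host: str
    cores: int


def make_offers(executor_data_map: Dict[str, ExecutorData]) -> List[WorkerOffer]:
    # Only executors that can fit at least one more task produce an offer.
    return [
        WorkerOffer(executor_id, data.host, data.free_cores)
        for executor_id, data in executor_data_map.items()
        if data.free_cores >= CPUS_PER_TASK
    ]


def launch_task(executor_data_map, executor_id):
    # Launching a task consumes cores on the chosen executor.
    executor_data_map[executor_id].free_cores -= CPUS_PER_TASK


def task_finished(executor_data_map, executor_id):
    # A finished task gives its cores back, capped at the executor's total.
    data = executor_data_map[executor_id]
    data.free_cores = min(data.total_cores, data.free_cores + CPUS_PER_TASK)


executors = {
    "exec-1": ExecutorData(host="host-a", total_cores=2, free_cores=2),
    "exec-2": ExecutorData(host="host-b", total_cores=1, free_cores=0),
}

offers_before = make_offers(executors)   # only exec-1 has free cores
launch_task(executors, "exec-1")
launch_task(executors, "exec-1")
offers_busy = make_offers(executors)     # every executor is now full
task_finished(executors, "exec-2")
offers_after = make_offers(executors)    # exec-2 has a core again
```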
TaskScheduler
TaskScheduler creates and manages a TaskSetManager for each TaskSet, handles task locality, retries failed tasks, can speculatively re‑launch straggling tasks, and reports execution status (including shuffle fetch failures) back to DAGScheduler.
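The failure‑handling part can be sketched as a small retry loop. This is a simplified illustration, not TaskSetManager's real logic: run_task_set and run_attempt are hypothetical names, locality and speculation are left out, and the limit of 4 mirrors the documented default of the spark.task.maxFailures setting.

```python
from collections import deque

MAX_TASK_FAILURES = 4  # mirrors the default of spark.task.maxFailures


def run_task_set(tasks, run_attempt, max_failures=MAX_TASK_FAILURES):
    """Run every task, re-queueing failed ones.

    tasks: list of task ids.
    run_attempt(task_id, attempt_number) -> True on success, False on failure.
    Returns (status, attempts), where attempts maps task id -> attempts made.
    """
    pending = deque(tasks)
    failures = {t: 0 for t in tasks}
    attempts = {t: 0 for t in tasks}

    while pending:
        task = pending.popleft()
        attempt = attempts[task]
        attempts[task] += 1
        if run_attempt(task, attempt):
            continue                      # task succeeded
        failures[task] += 1
        if failures[task] >= max_failures:
            return "aborted", attempts    # give up on the whole task set
        pending.append(task)              # retry later

    return "succeeded", attempts


# Task 1 fails on its first two attempts, then succeeds.
flaky_status, flaky_attempts = run_task_set(
    [0, 1, 2], lambda task, attempt: not (task == 1 and attempt < 2)
)

# A task that never succeeds aborts the task set after max_failures attempts.
broken_status, broken_attempts = run_task_set([0], lambda task, attempt: False)
```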
ExecutorBackend
ExecutorBackend receives tasks from the scheduler, dispatches them to executor threads, and reports task completion status to the Driver via SchedulerBackend, completing the Spark job lifecycle.
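The receive, dispatch, and report cycle might look roughly like this in miniature. The sketch uses Python's standard thread pool and a queue as a stand‑in for messages sent to the Driver; ToyExecutorBackend, its methods, and the state names are hypothetical and are not Spark's classes or protocol.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


class ToyExecutorBackend:
    """Toy model: run tasks on a thread pool and report status to a 'driver' queue."""

    def __init__(self, num_threads, driver_inbox):
        self._pool = ThreadPoolExecutor(max_workers=num_threads)
        self._driver_inbox = driver_inbox

    def launch_task(self, task_id, fn):
        def run():
            # Report the outcome of the task as a status-update message.
            try:
                self._driver_inbox.put((task_id, "FINISHED", fn()))
            except Exception as exc:
                self._driver_inbox.put((task_id, "FAILED", repr(exc)))

        self._pool.submit(run)

    def shutdown(self):
        # Wait for all running tasks before returning.
        self._pool.shutdown(wait=True)


inbox = queue.Queue()
backend = ToyExecutorBackend(num_threads=2, driver_inbox=inbox)
backend.launch_task(0, lambda: sum(range(10)))
backend.launch_task(1, lambda: 1 / 0)   # this task raises and is reported as failed
backend.shutdown()

# Driver side: drain the status updates.
updates = {}
while not inbox.empty():
    task_id, state, payload = inbox.get()
    updates[task_id] = (state, payload)
```

On the driver side, a failed status like the one for task 1 is what would prompt TaskScheduler to retry the task.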
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.