Big Data 19 min read

Understanding Spark RDD Logical Execution Graph and Dependency Types

This article explains how Spark builds the logical execution graph for RDDs, describes the four-step job processing pipeline, details the various dependency types such as NarrowDependency and ShuffleDependency, and reviews common transformations and their data‑flow characteristics.

Big Data Technology & Architecture
Big Data Technology & Architecture
Big Data Technology & Architecture
Understanding Spark RDD Logical Execution Graph and Dependency Types

Introduction

The article revisits the differences between Spark's Cache and Checkpoint and then focuses on how Spark constructs the logical execution graph for a job, detailing the data dependencies between RDDs.

General Logical Plan

A typical Spark job follows four steps: (1) read data from a source (file, memory, HDFS, HBase, etc.) to create the initial RDD; (2) apply a series of transformation() operations that generate new RDDs; (3) perform an action() on the final RDD to produce results; (4) send the results back to the driver for final computation.

RDDs can be cached in memory or checkpointed to disk; the number of partitions is user‑defined, and partition dependencies may be 1‑to‑1, N‑to‑1, or N‑to‑N.

Logical Execution Graph Generation

When writing a program, developers often imagine a data‑dependency graph, but the actual generated graph contains many more RDDs due to the internal implementation of transformations.

How are RDDs created?

How are dependencies between RDDs established?

1. How are RDDs created?

Each transformation() returns a new RDD. Some transformations are complex and produce multiple child RDDs, which explains why the actual number of RDDs exceeds the mental model.

Each RDD has a compute() method that receives input records from its parent RDD or source, performs the transformation logic, and outputs records.

2. How are dependencies established?

Dependencies are divided into three questions:

Is the new RDD dependent on a single parent or multiple parents?

How many partitions does the new RDD have?

What is the partition‑level relationship between the child and its parents?

Example of a multi‑parent dependency: x = rdda.transformation(rddb) Determining the number of partitions:

max(numPartitions[parent RDD 1], .., numPartitions[parent RDD n])

Partition‑level relationships depend on the semantics of the transformation. For instance, map() is 1:1, while groupByKey() creates a ShuffledRDD where each partition depends on all parent partitions.

Dependency Types

Two main categories:

NarrowDependency (complete dependency, shown with black solid or dashed arrows):

OneToOneDependency (1:1)

N:1 NarrowDependency

N:N NarrowDependency (rare)

RangeDependency (used only by UnionRDD)

ShuffleDependency (partial dependency, shown with red arrows)

In practice, most RDDs use NarrowDependency when the child partition uses at most one parent partition; otherwise a ShuffleDependency is introduced.

Typical Transformations and Their Dependency Graphs

1) union(otherRDD)

Simply concatenates two RDDs without changing partition contents. Internally it uses a RangeDependency (1:1) to preserve original range boundaries.

2) groupByKey(numPartitions)

Aggregates records with the same key via a shuffle, producing a ShuffledRDD followed by a MapPartitionsRDD. No map‑side combine is performed.

Map‑side combine can be enabled when the key frequency is very high.

3) reduceByKey(func, numPartitions)

Implements a classic MapReduce pattern: map‑side combine creates a MapPartitionsRDD, then a shuffle produces a ShuffledRDD, and finally a reduce step yields another MapPartitionsRDD.

4) distinct(numPartitions)

Deduplicates records by converting each element to <K, null>, shuffling via reduceByKey, and then mapping back to the original key.

5) cogroup(otherRDD, numPartitions)

Aggregates two or more RDDs; the resulting CoGroupedRDD has partitions determined by the user’s partitioner. If partitioners differ, a ShuffleDependency is required.

Implementation hint:

CoGroupedRDD.getDependencies()

6) intersection(otherRDD)

Finds common records by converting each RDD to <T, null>, applying cogroup, filtering empty groups, and finally extracting keys.

7) join(otherRDD, numPartitions)

Performs an SQL‑style join by cogroup to obtain <K, (Iterable[V1], Iterable[V2])>, then computes the Cartesian product of the iterables.

8) sortByKey(ascending, numPartitions)

Shuffles records to appropriate partitions and sorts each partition locally.

Sorting is implemented by collecting all records of a partition into an array and then applying a sort algorithm.

9) cartesian(otherRDD)

Generates the Cartesian product; the number of partitions equals the product of the two input RDDs’ partition counts. All dependencies are NarrowDependency because each child partition fully depends on one parent partition from each side.

Dependency retrieval: CartesianRDD.getDependencies()

10) coalesce(numPartitions, shuffle = false)

Reduces the number of partitions without a shuffle; it merges existing partitions based on size, locality, and balance. Example: a.coalesce(3, shuffle = false) When shuffle = true, records are re‑partitioned evenly by assigning incremental keys and applying a hash‑based shuffle. Example key generation:

var position = (new Random(index)).nextInt(numPartitions); position = position + 1

11) repartition(numPartitions)

Equivalent to coalesce(numPartitions, shuffle = true).

Primitive Transformation: combineByKey()

The core primitive used by many higher‑level transformations. Its signature is:

def combineByKey[C](createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null): RDD[(K, C)]

It works by creating an initial combiner for the first value of a key, merging subsequent values with mergeValue, and finally merging combiners from different partitions with mergeCombiners.

Discussion

The article concludes that the logical execution graph reveals the complex data‑dependency and computation logic hidden behind Spark’s simple API. Understanding these dependencies helps developers reason about performance, shuffle costs, and the suitability of different transformations for a given workload.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

TransformationbigdataSparkdependencyShuffleRDD
Big Data Technology & Architecture
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.