
Understanding mapPartitions vs map in Apache Spark: Performance, Pitfalls, and Proper Usage

This article examines why many developers favor Spark's mapPartitions over map, analyzes the underlying source code, highlights common pitfalls such as complexity and OOM risks, and provides practical guidelines and code examples for correctly using mapPartitions in both simple and advanced scenarios.


In a recent code review, the author noticed extensive use of mapPartitions in a colleague's Spark application and decided to investigate the common claim that it outperforms map. The article begins by presenting the popular belief that processing an entire partition with a single function call is faster because it reduces the number of function invocations.

Quoting common blog statements, the author shows that the perceived speed gain often stems from a misunderstanding: both map and mapPartitions ultimately process each record, so the total number of element‑wise operations does not change.
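This point can be demonstrated outside Spark with plain Scala iterators standing in for a partition (a sketch; the counter and function names are illustrative): the per-record function runs exactly once per element under either style.

```scala
// Count how many times the per-record function runs under each style.
// Plain Iterators stand in for an RDD partition; f is an illustrative function.
var calls = 0
def f(x: Int): Int = { calls += 1; x * 2 }

// map-style: Spark internally runs iter.map(cleanF) on each partition
val viaMap = List(1, 2, 3, 4).iterator.map(f).toList
val mapCalls = calls

// mapPartitions-style: the user function receives the whole iterator,
// but each record still passes through f exactly once
calls = 0
val g: Iterator[Int] => Iterator[Int] = it => it.map(f)
val viaPartitions = g(List(1, 2, 3, 4).iterator).toList
val partitionCalls = calls

println(s"map calls: $mapCalls, mapPartitions calls: $partitionCalls")
```

Both paths produce the same output and invoke f the same number of times; only the granularity at which the user function is called differs.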

The article then dives into the actual Scala source code of the two operators. The map implementation wraps the user function with sc.clean and creates a MapPartitionsRDD that applies iter.map(cleanF) to each partition iterator. The mapPartitions implementation also creates a MapPartitionsRDD but passes the whole iterator to the user function, which returns another iterator.

def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

def mapPartitions[U: ClassTag](
    f: Iterator[T] => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U] = withScope {
  val cleanedF = sc.clean(f)
  new MapPartitionsRDD(
    this,
    (context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),
    preservesPartitioning)
}

From a practical standpoint, the author argues that mapPartitions offers no intrinsic performance advantage in most cases and should not be used indiscriminately.

Two major drawbacks are highlighted:

Usability: writing mapPartitions code is more verbose and error‑prone. The article provides a helper function to emulate map behavior, but notes that this still adds complexity.

Out‑of‑Memory (OOM) risk: developers often misuse the iterator by eagerly consuming all elements, defeating Spark's lazy evaluation and causing the entire partition to be loaded into memory.
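The helper mentioned above can be as small as a one-liner that lifts a per-record function into a per-partition one (a sketch; the article's actual helper and its name may differ):

```scala
// Sketch of a helper that emulates map on top of mapPartitions:
// lift a per-record function f into a function over partition iterators.
def asPartitionFunc[T, U](f: T => U): Iterator[T] => Iterator[U] =
  iter => iter.map(f)

// Usage, with a plain iterator standing in for a partition;
// in Spark this would be rdd.mapPartitions(asPartitionFunc(f))
val doubled = asPartitionFunc[Int, Int](_ * 2)(Iterator(1, 2, 3)).toList
```

Even with such a helper, the indirection adds a layer that plain map does not need.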

To illustrate the OOM issue, the article shows a faulty pattern where the iterator is traversed inside a while (x.hasNext) { … } loop, preventing the lazy execution model.
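A minimal sketch of this anti-pattern, using plain Scala iterators (the function names are illustrative): the eager version drains the iterator into a buffer before returning, so the whole partition is resident in memory at once, while the lazy version lets the downstream consumer pull records one at a time.

```scala
import scala.collection.mutable.ArrayBuffer

// Anti-pattern: while (hasNext) drains the iterator into a buffer,
// so the entire partition is materialized in memory before returning
def eagerDouble(iter: Iterator[Int]): Iterator[Int] = {
  val buf = ArrayBuffer[Int]()
  while (iter.hasNext) buf += iter.next() * 2
  buf.iterator
}

// Correct: return a lazily evaluated iterator; each record is
// transformed only when the downstream consumer asks for it
def lazyDouble(iter: Iterator[Int]): Iterator[Int] = iter.map(_ * 2)
```

Both produce identical results on small inputs; on a real partition with millions of records, only the lazy version keeps memory usage flat.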

Correct usage scenarios are then presented. When a single operation per partition (e.g., opening a database connection) is needed, mapPartitions can be beneficial because the connection is established once per partition instead of once per record. Example:

rdd.mapPartitions(x => {
  println("Connect to DB")
  val res = x.map(line => {
    println("Write: " + line)
    line
  })
  res
})

For more control, the article demonstrates creating a custom iterator that closes the connection when the partition is exhausted:

rdd1.mapPartitions(x => {
  println("Connect to DB")
  new Iterator[Any] {
    override def hasNext: Boolean = {
      if (x.hasNext) true else { println("Close DB"); false }
    }
    override def next(): Any = "Write: " + x.next()
  }
})

Advanced usage is discussed for one-to-many transformations. The author revisits the implementation of Iterator.flatMap, provides a simplified version with comments, and supplies a complete Spark example that uses a custom iterator to split comma-separated strings into multiple output records.
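A simplified re-implementation of Iterator.flatMap, in the spirit of the standard library (an illustrative sketch, not the actual Scala source), captures the pattern the Spark example below follows: keep a cursor over the current sub-iterator and advance to the next input element only once it is exhausted.

```scala
// Simplified sketch of Iterator.flatMap: cur holds the sub-iterator
// produced from the most recent input element; hasNext advances to the
// next input element only when cur runs dry, so evaluation stays lazy.
def myFlatMap[A, B](self: Iterator[A])(f: A => Iterator[B]): Iterator[B] =
  new Iterator[B] {
    private var cur: Iterator[B] = Iterator.empty
    def hasNext: Boolean =
      cur.hasNext || (self.hasNext && { cur = f(self.next()); hasNext })
    def next(): B = (if (hasNext) cur else Iterator.empty).next()
  }

// One-to-many: each comma-separated string expands into several records
val out = myFlatMap(Iterator("a,b", "c,d,e"))(s => s.split(",").iterator).toList
```

The recursive call in hasNext transparently skips input elements whose sub-iterator is empty.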

import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.AbstractIterator

val conf = new SparkConf().setMaster("local[1]").setAppName("test")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")
sc.parallelize(Seq("a,a,a,a,a"))
  .mapPartitions(iter => {
    new AbstractIterator[String] {
      // split one input record into multiple output records
      def myF(data: String): Iterable[String] = {
        println(data)
        data.split(",").toIterable
      }
      var cur: Iterator[String] = Iterator.empty
      override def hasNext: Boolean =
        cur.hasNext || (iter.hasNext && { cur = myF(iter.next()).toIterator; hasNext })
      override def next(): String = (if (hasNext) cur else Iterator.empty).next()
    }
  })
  .foreach(println)

The article concludes that while mapPartitions can be useful for specific cases such as batch database writes or custom partition‑level logic, the simpler map operator should be preferred in most situations, and developers must remember to leverage the iterator's lazy execution to avoid memory problems.

Tags: Performance, Big Data, Iterator, Spark, Scala, mapPartitions
Written by

Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies
