
Optimizing Spark mapPartitions: Memory Management and Best Practices

The article details how Meituan’s Turing machine‑learning platform cut offline resource use by 80% and task time by 63% through memory‑level techniques such as column pruning, adaptive caching, and a deep dive into Spark’s mapPartitions operator, including source‑code analysis, GC behavior, and a low‑memory batch‑iterator best practice.

Meituan Technology Team

Nov 10, 2022


Introduction

Turing, Meituan’s in‑house algorithm platform launched in 2018, provides an end‑to‑end service for model lifecycle, aiming to free engineers from engineering overhead so they can focus on algorithmic improvements.

After months of iterative optimization, the platform achieved an 80% reduction in offline resource consumption and a 63% decrease in average task duration, as shown in the resource‑consumption chart.

1. Business Background

The platform comprises four core components: Machine‑Learning Platform, Feature Platform, Online Serving, and AB‑testing. Its offline training engine is built on Spark. As user volume grew, optimizing the training engine’s throughput became critical for saving compute resources.

2. Turing Training Engine Optimizations

Optimizations are categorized into three layers:

Memory optimization – column pruning, adaptive cache, operator optimization.

Computation optimization – graph optimization, Spark source‑code tweaks, XGBoost source tweaks.

Disk‑IO optimization – automated small‑file merging.

Column pruning removes unused fields from datasets, reducing memory footprint. Adaptive cache decides when to persist or unpersist datasets based on runtime characteristics. The most impactful technique is operator optimization, which dramatically improves throughput.

3. Spark Operator Deep Dive

A table of Spark operator development tips is presented, highlighting multi‑row input/output, multi‑column output, intermediate result reuse, and heavyweight object reuse at executor or partition level.

Among these, mapPartitions stands out for its flexibility: it can map M input rows to N output rows, emit arbitrary columns, and reuse a heavyweight object across all rows of a partition.
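As a rough illustration of per-partition reuse, the plain-Java sketch below (no Spark dependency) contrasts building a heavyweight object for every row with building it once per partition. The names HeavyObject, perRow, and perPartition are hypothetical, and the iterator-in, iterator-out method only mimics the shape of a mapPartitions function.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class PartitionReuseSketch {
    // Counts constructions so the two styles can be compared (illustration only).
    static int constructed = 0;

    // Hypothetical stand-in for an object that is expensive to build.
    static final class HeavyObject {
        HeavyObject() { constructed++; }
        String process(String row) { return row.toUpperCase(); }
    }

    // map-style: a new HeavyObject for every row.
    static List<String> perRow(List<String> rows) {
        List<String> out = new ArrayList<>();
        for (String row : rows) {
            out.add(new HeavyObject().process(row));
        }
        return out;
    }

    // mapPartitions-style: one HeavyObject per partition, applied lazily row by row.
    static Iterator<String> perPartition(Iterator<String> rows) {
        HeavyObject obj = new HeavyObject();
        return new Iterator<String>() {
            @Override public boolean hasNext() { return rows.hasNext(); }
            @Override public String next() { return obj.process(rows.next()); }
        };
    }

    public static void main(String[] args) {
        List<String> partition = Arrays.asList("a", "b", "c");
        perRow(partition);
        System.out.println("per-row constructions: " + constructed);
        constructed = 0;
        perPartition(partition.iterator()).forEachRemaining(r -> { });
        System.out.println("per-partition constructions: " + constructed);
    }
}
```

For a three-row partition, the per-row style constructs three objects and the per-partition style constructs one, which is where the reduction in allocation and GC work comes from.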

4. The Pitfall of mapPartitions

Typical code creates a heavyweight object once per partition and reuses it for every row, which reduces GC pressure compared with per‑row object creation. However, the pattern below also collects every output row in a list before returning, so the buffer grows with the partition and can cause an out‑of‑memory error.

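dataset.mapPartitions((MapPartitionsFunction<Row, Row>) iterator -> {
  // Created once per partition and reused for every row.
  HeavyObject obj = new HeavyObject();
  // Pitfall: this list ends up holding the entire partition's output.
  List<Row> list = new ArrayList<>();
  while (iterator.hasNext()) {
    Row row = iterator.next();
    obj.process(row);
    list.add(...);
  }
  // Nothing is emitted downstream until every input row has been consumed.
  return list.iterator();
}, RowEncoder.apply(schema));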

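When multiple mapPartitions stages are chained, each stage builds its own buffer. While a stage fills its output list it is still reading from the previous stage's list, so two partition‑sized buffers are live at once and peak memory reaches roughly twice the partition size.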
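The factor of two can be reasoned about with a small accounting model. The sketch below is plain Java rather than Spark, and it counts reachable rows instead of measuring the heap, so it should be read as a simplified model under those assumptions. EagerBufferSketch, eagerStage, and peakForChain are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.stream.IntStream;

class EagerBufferSketch {
    // Largest number of rows reachable at once (a model, not a heap measurement).
    static long peakRows = 0;

    // Eager stage: buffers every output row before returning, like the pattern above.
    static List<Integer> eagerStage(List<Integer> upstreamBuffer, Iterator<Integer> in) {
        List<Integer> out = new ArrayList<>();
        while (in.hasNext()) {
            out.add(in.next() + 1);
            // The upstream buffer is still referenced by `in` while this stage fills `out`.
            peakRows = Math.max(peakRows, (long) upstreamBuffer.size() + out.size());
        }
        return out;
    }

    // Runs `stages` eager stages back to back over `rows` rows and returns the peak row count.
    static long peakForChain(int rows, int stages) {
        peakRows = 0;
        Iterator<Integer> in = IntStream.range(0, rows).iterator(); // streamed source, not buffered
        List<Integer> buffer = Collections.emptyList();
        for (int s = 0; s < stages; s++) {
            buffer = eagerStage(buffer, in); // the previous buffer becomes unreachable after this call
            in = buffer.iterator();
        }
        return peakRows;
    }

    public static void main(String[] args) {
        System.out.println("peak rows, 1 stage:  " + peakForChain(1000, 1));
        System.out.println("peak rows, 3 stages: " + peakForChain(1000, 3));
    }
}
```

In this model a single eager stage peaks at one partition's worth of rows, while any chain of two or more stages peaks at two, because only the buffer being read and the buffer being written are reachable at the same time.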

5. Spark Pipeline and mapPartitions

Spark’s lazy execution builds a DAG of Transformations and Actions. mapPartitions functions are nested, forming a pipeline such as fCount(funcC(funcB(funcA))). The source code of ResultTask.runTask and RDD.iterator shows how each stage’s iterator is obtained and how checkpoints affect execution.

override def runTask(context: TaskContext): U = {
  val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](...)
  func(context, rdd.iterator(partition, context))
}

final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) getOrCompute(split, context)
  else computeOrReadCheckpoint(split, context)
}
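The nesting fCount(funcC(funcB(funcA))) is easier to follow with a toy pull-based pipeline. The sketch below uses plain java.util iterators, not Spark's classes; lazyStage and the trace list are hypothetical helpers that record the order in which each wrapped stage handles a row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class LazyPipelineSketch {
    // Records "<stage><row>" each time a stage hands a row downstream.
    static final List<String> trace = new ArrayList<>();

    // Wraps the upstream iterator; no row moves until someone calls next().
    static Iterator<Integer> lazyStage(String name, Iterator<Integer> upstream) {
        return new Iterator<Integer>() {
            @Override public boolean hasNext() { return upstream.hasNext(); }
            @Override public Integer next() {
                Integer row = upstream.next();
                trace.add(name + row);
                return row;
            }
        };
    }

    // Stand-in for the final action: drains the iterator and counts rows.
    static int count(Iterator<Integer> it) {
        int n = 0;
        while (it.hasNext()) {
            it.next();
            n++;
        }
        return n;
    }

    static List<String> run() {
        trace.clear();
        Iterator<Integer> source = Arrays.asList(1, 2).iterator();
        // Same nesting as funcC(funcB(funcA(source))).
        Iterator<Integer> pipeline = lazyStage("C", lazyStage("B", lazyStage("A", source)));
        count(pipeline); // only now do rows flow through A, B, and C
        return new ArrayList<>(trace);
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

When every stage is lazy, each row travels through A, B, and C before the next row is read, so the trace is A1, B1, C1, A2, B2, C2. A stage that buffers its whole input breaks this interleaving, which is what makes the eager pattern expensive.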

Experimental verification with a single‑partition RDD of 3 million rows (≈180 MiB) showed that chaining three mapPartitions stages produced a memory peak of about 360 MiB, twice the partition data size, which is consistent with the hypothesis.

6. Best Practice: Low‑Memory Batch Iterator

To avoid the double‑memory issue, a batch‑iterator demo processes data in fixed‑size batches (e.g., 5 input rows → 2 output rows). The buffer size never exceeds the batch size, keeping memory usage minimal.

Dataset<Row> result = dataset.mapPartitions((MapPartitionsFunction<Row, Row>) inputIterator -> new Iterator<Row>() {
  private static final int INPUT_BATCH_PROCESS_SIZE = 5;
  // Holds at most one batch of input rows at a time.
  private final List<Row> batchRows = new ArrayList<>(INPUT_BATCH_PROCESS_SIZE);
  private Iterator<Row> batchResult = Collections.emptyIterator();

  @Override public boolean hasNext() {
    // Loop rather than test once: a batch may produce no output rows.
    while (!batchResult.hasNext()) {
      batchRows.clear();
      while (batchRows.size() < INPUT_BATCH_PROCESS_SIZE && inputIterator.hasNext()) {
        batchRows.add(inputIterator.next());
      }
      if (batchRows.isEmpty()) return false; // input exhausted
      batchResult = processBatch(batchRows); // randomly emit 2 rows
    }
    return true;
  }

  @Override public Row next() {
    // Calling hasNext() first loads the next batch when the current one is drained.
    if (!hasNext()) throw new NoSuchElementException();
    return batchResult.next();
  }
}, RowEncoder.apply(dataset.schema()));

When applied to a pipeline of three mapPartitions stages, each stage holds only one batch buffer, so memory consumption stays around the batch size (7 rows in the demo) instead of growing with the partition size.
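The three-stage behaviour can be approximated outside Spark with plain java.util iterators. The sketch below is a simplified stand-in for the demo, not the authors' code: BatchIterator, firstTwo, and run are hypothetical names, and the batch function simply keeps the first two rows of each batch of five instead of choosing rows at random.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;
import java.util.stream.IntStream;

class BatchPipelineSketch {
    // Lazily turns batches of up to batchSize input rows into output rows.
    static final class BatchIterator<T> implements Iterator<T> {
        private final Iterator<T> input;
        private final int batchSize;
        private final Function<List<T>, List<T>> processBatch;
        private final List<T> batch = new ArrayList<>();
        private Iterator<T> out = Collections.emptyIterator();
        int maxBuffered = 0; // most rows (input batch plus its output) held at once

        BatchIterator(Iterator<T> input, int batchSize, Function<List<T>, List<T>> processBatch) {
            this.input = input;
            this.batchSize = batchSize;
            this.processBatch = processBatch;
        }

        @Override public boolean hasNext() {
            // Refill until a batch yields output or the input is exhausted.
            while (!out.hasNext()) {
                batch.clear();
                while (batch.size() < batchSize && input.hasNext()) {
                    batch.add(input.next());
                }
                if (batch.isEmpty()) return false;
                List<T> result = processBatch.apply(batch);
                maxBuffered = Math.max(maxBuffered, batch.size() + result.size());
                out = result.iterator();
            }
            return true;
        }

        @Override public T next() {
            if (!hasNext()) throw new NoSuchElementException();
            return out.next();
        }
    }

    // Hypothetical batch function: keep the first two rows of each batch.
    static List<Integer> firstTwo(List<Integer> batch) {
        return new ArrayList<>(batch.subList(0, Math.min(2, batch.size())));
    }

    // Chains three batch stages; returns {output rows, max rows buffered by any one stage}.
    static int[] run(int rows) {
        Iterator<Integer> source = IntStream.range(0, rows).iterator();
        BatchIterator<Integer> a = new BatchIterator<>(source, 5, BatchPipelineSketch::firstTwo);
        BatchIterator<Integer> b = new BatchIterator<>(a, 5, BatchPipelineSketch::firstTwo);
        BatchIterator<Integer> c = new BatchIterator<>(b, 5, BatchPipelineSketch::firstTwo);
        int produced = 0;
        while (c.hasNext()) {
            c.next();
            produced++;
        }
        int max = Math.max(a.maxBuffered, Math.max(b.maxBuffered, c.maxBuffered));
        return new int[] { produced, max };
    }

    public static void main(String[] args) {
        int[] r = run(1000);
        System.out.println("output rows: " + r[0] + ", max rows buffered per stage: " + r[1]);
    }
}
```

With 1000 input rows this sketch produces 64 output rows (1000 to 400 to 160 to 64), and no stage ever holds more than 7 rows, that is, 5 buffered inputs plus 2 outputs. The per-stage bound stays the same however large the input is.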

7. Summary

The first article of the “Secrets of Turing ML Platform Performance” series explains memory‑level operator optimizations, especially the inner workings and pitfalls of Spark’s mapPartitions operator. By analyzing source code, GC behavior, and conducting controlled experiments, the authors propose a batch‑iterator pattern that eliminates the double‑partition memory blow‑up and greatly improves throughput. Future installments will cover additional optimization techniques.


