Spark Performance Optimization Guide: Development and Resource Tuning
This article provides a comprehensive guide to Spark performance optimization, covering development‑level tuning principles, resource configuration parameters, practical code examples, and best‑practice recommendations to achieve high‑throughput big‑data processing.
Optimization Overview
Effective Spark performance tuning starts with applying basic development principles such as proper RDD lineage design, judicious use of operators, and optimizing special operations. These principles should be adapted to specific business scenarios and data characteristics.
Principle 1: Avoid Creating Duplicate RDDs
When building a Spark job, each data source should correspond to a single RDD. Creating multiple RDDs for the same data leads to repeated computation and unnecessary overhead.
Simple Example
// Incorrect: creating two RDDs for the same HDFS file
val rdd1 = sc.textFile("hdfs://192.168.0.1:9000/hello.txt")
rdd1.map(...)
val rdd2 = sc.textFile("hdfs://192.168.0.1:9000/hello.txt")
rdd2.reduce(...)
// Correct: reuse a single RDD
val rdd1 = sc.textFile("hdfs://192.168.0.1:9000/hello.txt")
rdd1.map(...)
rdd1.reduce(...)

Principle 2: Reuse the Same RDD Whenever Possible
If different operations can be performed on the same underlying data, use the original RDD rather than creating a new one that contains a subset of the data.
Simple Example
// Incorrect: creating a subset RDD and processing both separately
JavaPairRDD<Long, String> rdd1 = ...
JavaRDD<String> rdd2 = rdd1.map(...)
rdd1.reduceByKey(...)
rdd2.map(...)
// Correct: reuse rdd1 for both operations
JavaPairRDD<Long, String> rdd1 = ...
rdd1.reduceByKey(...)
rdd1.map(tuple._2...)

Principle 3: Persist Frequently Used RDDs
Cache or persist an RDD that is accessed multiple times so that it is computed only once and stored in memory or disk.
Persist Code Example
// Using cache()
val rdd1 = sc.textFile("hdfs://192.168.0.1:9000/hello.txt").cache()
rdd1.map(...)
rdd1.reduce(...)
// Using persist() with a specific storage level
val rdd1 = sc.textFile("hdfs://192.168.0.1:9000/hello.txt").persist(StorageLevel.MEMORY_AND_DISK_SER)
rdd1.map(...)
rdd1.reduce(...)

Principle 4: Minimize Shuffle Operations
Shuffle is the most expensive part of a Spark job. Avoid using shuffle‑heavy operators (e.g., reduceByKey, join) when possible, and prefer map‑type operators.
Broadcast Join Example
// Traditional join (causes shuffle)
val rdd3 = rdd1.join(rdd2)
// Broadcast + map join (no shuffle)
val rdd2Data = rdd2.collect()
val rdd2Broadcast = sc.broadcast(rdd2Data)
val rdd3 = rdd1.map(rdd2Broadcast...)

Principle 5: Use Map‑Side Pre‑Aggregation
When shuffle is unavoidable, prefer operators like reduceByKey or aggregateByKey that perform local aggregation before shuffling, reducing data transferred across the network.
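The effect of map-side combining can be seen without a cluster. The sketch below (plain Scala, no Spark dependency; the partition data is invented for illustration) simulates what reduceByKey's map-side combine does versus a groupByKey-style raw shuffle: pre-aggregating each partition locally shrinks the number of records that must cross the network, while the final result is identical.

```scala
// Plain-Scala sketch of map-side pre-aggregation. The two inner Seqs
// stand in for two map-task partitions of (word, count) pairs.
object MapSideCombineSketch {
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq(("a", 1), ("b", 1), ("a", 1), ("a", 1)),
    Seq(("b", 1), ("b", 1), ("a", 1))
  )

  // groupByKey-style: every record is shuffled across the network.
  val recordsWithoutCombine: Int = partitions.map(_.size).sum // 7 records

  // reduceByKey-style: each partition aggregates locally first, so at
  // most (distinct keys per partition) records are shuffled.
  val combined: Seq[Seq[(String, Int)]] =
    partitions.map(_.groupMapReduce(_._1)(_._2)(_ + _).toSeq)
  val recordsWithCombine: Int = combined.map(_.size).sum // 4 records

  // The reduce-side merge yields the same totals either way.
  val totals: Map[String, Int] =
    combined.flatten.groupMapReduce(_._1)(_._2)(_ + _)
}
```

Here 7 records shrink to 4 before the shuffle; on real data with heavily repeated keys the reduction is far larger.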
Principle 6: Prefer High‑Performance Operators
Use reduceByKey/aggregateByKey instead of groupByKey, mapPartitions instead of map, and foreachPartitions instead of foreach to reduce function‑call overhead and improve throughput.
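The per-partition operators pay off whenever each call has fixed setup cost. The following plain-Scala sketch (no Spark; `openConnection` is a hypothetical stand-in for any expensive per-call resource such as a database connection) shows why foreachPartition-style processing amortizes that cost to once per partition instead of once per record.

```scala
// Plain-Scala sketch: per-record vs per-partition resource setup.
object PerPartitionSketch {
  var connectionsOpened = 0
  def openConnection(): Unit = connectionsOpened += 1 // stand-in for an expensive setup call

  val partitions: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5))

  // foreach-style: one connection per record (5 here).
  def perRecord(): Int = {
    connectionsOpened = 0
    partitions.foreach(_.foreach { _ => openConnection() })
    connectionsOpened
  }

  // foreachPartition-style: one connection per partition (2 here).
  def perPartition(): Int = {
    connectionsOpened = 0
    partitions.foreach { part =>
      openConnection()      // set up once for the whole partition
      part.foreach(_ => ()) // then process every record with it
    }
    connectionsOpened
  }
}
```

The same caveat from the Spark documentation applies to mapPartitions: the whole partition is materialized through one function call, so very large partitions can pressure executor memory.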
Principle 7: Broadcast Large Variables
For large read‑only data (e.g., >100 MB), broadcast it so that each executor holds a single copy, avoiding repeated network transfer and excessive memory usage.
Broadcast Code Example
val list1 = ...
val list1Broadcast = sc.broadcast(list1)
rdd1.map(list1Broadcast...)

Principle 8: Use Kryo for Faster Serialization
Configure Spark to use KryoSerializer and register custom classes to achieve roughly ten‑fold faster serialization compared to Java's default.
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))

Principle 9: Optimize Data Structures
Avoid memory‑heavy objects, strings, and collection types inside RDD transformations; prefer primitive types and arrays to reduce GC pressure.
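One common application of this principle is column-oriented storage: keeping records in parallel primitive arrays instead of a collection of objects, so there is no per-record object header or pointer overhead for the JVM to allocate and garbage-collect. A minimal sketch (plain Scala; the `Score` record type and its data are invented for illustration):

```scala
// Plain-Scala sketch: object-per-record layout vs parallel primitive arrays.
object PrimitiveLayoutSketch {
  final case class Score(userId: Int, score: Double) // one heap object per record

  val asObjects: Seq[Score] = Seq(Score(1, 0.5), Score(2, 0.9), Score(3, 0.1))

  // Same data, column-oriented: two primitive arrays, zero per-record
  // object allocation, so far less GC pressure at scale.
  val userIds: Array[Int]   = asObjects.map(_.userId).toArray
  val scores: Array[Double] = asObjects.map(_.score).toArray

  def maxScoreUser: Int = userIds(scores.indexOf(scores.max))
}
```

The trade-off is readability: Spark's own tuning guide recommends this style only where memory or GC is an actual bottleneck.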
Resource Parameter Tuning
After optimizing the job logic, configure Spark’s resource parameters (e.g., num‑executors, executor‑memory, executor‑cores, driver‑memory, spark.default.parallelism, spark.storage.memoryFraction, spark.shuffle.memoryFraction) to fully utilize cluster resources and avoid OOM or under‑utilization.
num‑executors
Set the total number of executor processes; typically 50‑100 for most workloads.
executor‑memory
Allocate 4‑8 GB per executor, adjusting based on queue limits.
executor‑cores
Assign 2‑4 CPU cores per executor to balance parallelism.
driver‑memory
Usually 1 GB is sufficient unless large collect operations are performed.
spark.default.parallelism
Set to 2‑3×(num‑executors × executor‑cores), e.g., roughly 600‑900 tasks for a cluster with 300 total executor cores.
spark.storage.memoryFraction & spark.shuffle.memoryFraction
Adjust these fractions based on the relative amount of persisted RDDs versus shuffle intensity.
Example spark‑submit Command
./bin/spark-submit \
--master yarn-cluster \
--num-executors 100 \
--executor-memory 6G \
--executor-cores 4 \
--driver-memory 1G \
--conf spark.default.parallelism=1000 \
--conf spark.storage.memoryFraction=0.5 \
--conf spark.shuffle.memoryFraction=0.3

Following these development and resource tuning guidelines typically yields high‑performance Spark jobs, though more advanced techniques may be required for data skew or extreme performance demands.