Master Spark Performance: Practical Development and Resource Tuning Guide
This article explains why Spark needs careful performance tuning, then details concrete development‑level optimizations (RDD reuse, persistence, shuffle avoidance, broadcast variables, Kryo serialization, data‑structure choices) and resource‑level settings (executor count, memory, cores, parallelism, memory fractions) with code examples and practical recommendations.
Spark has become a popular platform for batch, SQL, streaming, machine learning, and graph workloads, but achieving high performance requires systematic tuning. The author outlines a comprehensive optimization framework divided into development tuning, resource tuning, data‑skew tuning, and shuffle tuning, focusing here on development and resource tuning.
Development Tuning
Key principles for writing efficient Spark jobs include:
Avoid creating duplicate RDDs : Create a single RDD per data source and reuse it; otherwise Spark recomputes the same data, incurring extra I/O and CPU.
Reuse the same RDD whenever possible : When different transformations operate on the same underlying data, use the original RDD instead of creating a derived subset RDD.
Persist frequently used RDDs : Call cache() or persist() to keep an RDD in memory or on disk so subsequent actions do not recompute it from the source.
Minimize shuffle operations : Prefer map‑type transformations over operators that trigger a shuffle (reduceByKey, join, distinct, repartition). When a shuffle is unavoidable, choose an operator that pre‑aggregates on the map side, e.g., reduceByKey or aggregateByKey instead of groupByKey, so less data crosses the network.
Broadcast large static data : Use sc.broadcast() for read‑only variables larger than a few hundred MB, so each executor keeps a single shared copy instead of Spark shipping a copy with every task.
Use Kryo for serialization : Set spark.serializer to org.apache.spark.serializer.KryoSerializer and register custom classes to gain up to ten‑fold speed improvements over Java serialization.
Choose memory‑efficient data structures : Prefer primitive types and arrays over objects, strings, and collection classes to reduce GC pressure.
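The first four principles can be sketched together as follows. This is a minimal illustration, not the original article's code: the input records, the `countryNames` lookup table, and the object name `DevTuningSketch` are made up for the example, and it assumes a local master rather than a cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object DevTuningSketch {
  // Hypothetical lookup table standing in for large static data.
  val countryNames: Map[String, String] =
    Map("cn" -> "China", "us" -> "United States")

  def clicksPerCountry(sc: SparkContext): Map[String, Int] = {
    // Build the source RDD once and reuse it for every downstream computation.
    val clicks = sc.parallelize(
      Seq(("cn", 1), ("us", 1), ("cn", 1), ("us", 1), ("cn", 1)))

    // Persist because two actions below read the same RDD.
    clicks.persist(StorageLevel.MEMORY_AND_DISK_SER)
    val total = clicks.count() // first action materializes the persisted RDD

    // Ship the lookup table to each executor once instead of once per task.
    val names = sc.broadcast(countryNames)

    // reduceByKey combines values inside each partition before the shuffle;
    // the lookup against the broadcast map needs no shuffle at all.
    val perCountry = clicks
      .reduceByKey(_ + _)
      .map { case (code, n) => (names.value.getOrElse(code, code), n) }
      .collect()
      .toMap

    clicks.unpersist()
    require(perCountry.values.sum == total) // sanity check: no records lost
    perCountry
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DevTuningSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(clicksPerCountry(sc)) finally sc.stop()
  }
}
```

Replacing the broadcast lookup with a join against a second RDD would give the same result but add a shuffle, which is the cost the list above recommends avoiding.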
The original article illustrates each principle with code: reading a file once and reusing the RDD, persisting with .cache() or .persist(StorageLevel.MEMORY_AND_DISK_SER), broadcasting a list, and configuring Kryo. The Kryo configuration looks like this:
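import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("MyApp").setMaster(...)
// Switch from the default Java serializer to Kryo
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Register application classes so Kryo does not store full class names with each object
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
Resource Tuning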
After the job logic is optimized, appropriate Spark resources must be allocated via spark-submit parameters. Mis‑configured resources lead to under‑utilization or OOM failures.
num-executors : Total number of executor processes; typical range 50‑100 for large clusters.
executor-memory : Memory per executor; 4‑8 GB is a common starting point.
executor-cores : CPU cores per executor; 2‑4 cores balance parallelism and overhead.
driver-memory : Memory for the driver; 1 GB is usually sufficient unless large collect() operations are used.
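spark.default.parallelism : Default number of partitions, and therefore tasks per stage, for shuffle operations such as reduceByKey and join when none is given explicitly; set to 2‑3 × (num-executors × executor-cores), e.g., around 1000 tasks for 300 cores.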
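spark.storage.memoryFraction : Portion of executor memory for persisted RDDs (default 0.6). Increase if many RDDs are cached; decrease if shuffle memory pressure causes frequent GC.
spark.shuffle.memoryFraction : Portion of executor memory for shuffle buffers (default 0.2). Adjust in the opposite direction to spark.storage.memoryFraction based on workload. Both fractions belong to the legacy memory manager: since Spark 1.6 they take effect only when spark.memory.useLegacyMode is enabled, and unified memory management (spark.memory.fraction, spark.memory.storageFraction) is used otherwise.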
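To make the parallelism rule concrete, the arithmetic can be written out. The helper below is only a sketch: `ResourceSizing` and its method names are hypothetical, and the 2‑3× multiplier is the rule of thumb quoted above, not something Spark enforces.

```scala
object ResourceSizing {
  // Total task slots the application can run at the same time.
  def totalCores(numExecutors: Int, executorCores: Int): Int =
    numExecutors * executorCores

  // Rule-of-thumb range for spark.default.parallelism: 2x to 3x the total cores.
  def parallelismRange(numExecutors: Int, executorCores: Int): (Int, Int) = {
    val cores = totalCores(numExecutors, executorCores)
    (2 * cores, 3 * cores)
  }

  def main(args: Array[String]): Unit = {
    // Example sizing: 100 executors with 4 cores each.
    val (low, high) = parallelismRange(100, 4)
    println(s"cores=${totalCores(100, 4)} parallelism between $low and $high")
    // prints: cores=400 parallelism between 800 and 1200
  }
}
```

With 100 executors of 4 cores each, a setting of 1000 falls inside the suggested range. Values far below the core count leave executors idle, which is why the rule is expressed as a multiple of total cores.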
A sample spark-submit command demonstrates a balanced configuration:
./bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 100 \
--executor-memory 6G \
--executor-cores 4 \
--driver-memory 1G \
--conf spark.default.parallelism=1000 \
--conf spark.storage.memoryFraction=0.5 \
--conf spark.shuffle.memoryFraction=0.3 \
your_app.jar
Understanding Spark’s execution model (driver, executors, stages, tasks, and shuffle boundaries) helps map each parameter to a concrete part of the runtime, enabling systematic performance improvements.
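One way to see where shuffle boundaries, and therefore stage boundaries, fall is to print an RDD's lineage with `toDebugString`; each deeper indentation level in its output marks a shuffle dependency. The snippet is a sketch with made-up data and a hypothetical object name, assuming a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  // Returns the lineage description of a small word-count style pipeline.
  def lineage(sc: SparkContext): String = {
    val words = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 2)
    val counts = words
      .map(w => (w, 1))   // narrow transformation: stays in the same stage
      .reduceByKey(_ + _) // wide transformation: needs a shuffle, so a new stage starts
    counts.toDebugString
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LineageSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(lineage(sc)) finally sc.stop()
  }
}
```

The printed lineage lists a ShuffledRDD for the reduceByKey step, with the map and parallelize steps indented beneath it as the upstream stage.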
Most Spark jobs that follow these development and resource tuning guidelines achieve satisfactory performance, though more advanced topics such as data‑skew handling and deep shuffle tuning may be required for extreme cases.
dbaplus Community
