Big Data 32 min read

Performance Optimization Techniques for Spark and Spark Streaming Applications

This article explains how to improve Spark and Spark Streaming performance by tuning serialization, broadcast variables, parallelism, batch intervals, memory usage, garbage collection, and Kafka integration, providing practical code examples and real‑world optimization results.

DataFunTalk

Aug 9, 2019

Performance Optimization Techniques for Spark and Spark Streaming Applications

When deploying Spark and Spark Streaming on a cluster, users often encounter slow execution, excessive resource consumption, or instability; targeted optimizations can dramatically improve efficiency and resource usage.

Data Serialization – Serialization overhead is significant in distributed applications. Spark offers Java serialization (slow, large) and Kryo serialization (fast, compact). To enable Kryo, set

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

and register custom classes with

conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))

. Adjust spark.kryoserializer.buffer for large objects.

Broadcast Large Variables – Broadcasting large read‑only data (e.g., configuration tables) reduces network and serialization costs. Update a broadcast variable by unpersisting the old one and rebroadcasting the new value. Example wrapper:

import java.io.{ObjectInputStream, ObjectOutputStream}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.StreamingContext
import scala.reflect.ClassTag

case class BroadcastWrapper[T: ClassTag](
  @transient private val ssc: StreamingContext,
  @transient private val _v: T) {
  @transient private var v = ssc.sparkContext.broadcast(_v)
  def update(newValue: T, blocking: Boolean = false): Unit = {
    v.unpersist(blocking)
    v = ssc.sparkContext.broadcast(newValue)
  }
  def value: T = v.value
  private def writeObject(out: ObjectOutputStream): Unit = out.writeObject(v)
  private def readObject(in: ObjectInputStream): Unit = {
    v = in.readObject().asInstanceOf[Broadcast[T]]
  }
}

Usage:

val yourBroadcast = BroadcastWrapper[yourType](ssc, yourValue)
yourStream.transform(rdd => {
  if (System.currentTimeMillis - someTime > Conf.updateFreq) {
    yourBroadcast.update(newValue, true)
  }
  // other processing
})

Parallelism and Partitioning – Increasing the number of partitions improves resource utilization. Spark’s default parallelism can be set via Spark.default.parallelism, and specific operations accept a partition argument (e.g., reduceByKey(numPartitions)). For Kafka sources, match Spark partitions to Kafka partitions to achieve sufficient parallelism.

Batch Interval Settings – Choose a batch interval that balances latency and throughput. Monitor Total Delay in the Spark UI; keep batch processing time below the interval. Start with a conservative interval (5‑10 s) and adjust based on observed processing speed.

Memory Optimization – Spark memory is split into execution and storage regions. Tune spark.memory.fraction (default 0.6) and spark.memory.storageFraction (default 0.5). Use serialized storage levels (e.g., MEMORY_ONLY_SER) and consider spark.rdd.compress. Reduce object overhead by using primitive arrays, fastutil collections, and avoiding excessive wrapper objects.

Garbage Collection (GC) Optimization – Monitor GC logs with -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps. Adjust heap regions ( -Xmn for young generation) and consider G1GC for executors or CMS for Spark Streaming to minimize pause times. Increase spark.executor.extraJavaOptions accordingly.

Spark Streaming Specific Optimizations – Use Kryo serialization for DStreams, set appropriate storage levels, and clean up old data with streamingContext.remember. For Kafka, control ingestion rate via spark.streaming.kafka.maxRatePerPartition and align it with batch interval.

Example Kafka parallelism:

val numStreams = 5
val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...)}
val unifiedStream = streamingContext.union(kafkaStreams)
unifiedStream.print()

Real‑World Tuning Example – In a sentiment‑analysis pipeline processing millions of game comments, the team adjusted batch duration, Kafka pull rate, caching, parallelism, GC settings, and Kryo registration, achieving stable processing where Spark Streaming’s processing time stayed below the defined threshold.

Conclusion – Key takeaways: use Kryo serialization, broadcast large read‑only data, balance ingestion and processing rates, configure memory fractions, minimize object overhead, and select appropriate GC strategies. Optimization is an iterative process of observation, adjustment, and re‑evaluation.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Memory optimization Streaming serialization Performance Tuning Spark Broadcast Variables Kryo

Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.