Big Data 19 min read

Master Spark Performance: Practical Tuning Tips and Real‑World Examples

This article explains essential Spark concepts, illustrates common performance bottlenecks, and provides concrete tuning strategies for memory, CPU, serialization, data locality, file I/O, and shuffle reduction, backed by real‑world examples and visual metrics.

ITPUB
ITPUB
ITPUB
Master Spark Performance: Practical Tuning Tips and Real‑World Examples

Basic Concepts and Principles

Understanding Spark’s core architecture is a prerequisite for any performance optimization. Each host can run N workers, each worker can host M executors, and tasks are scheduled on executors. A stage groups parallel tasks; a shuffle marks a stage boundary.

CPU cores are allocated per executor; over‑provisioning cores without sufficient CPU utilization wastes resources. Adjusting the number of executors, cores per executor, or workers can improve CPU usage, but memory must be balanced to avoid spills or OOM.

Partitions define data slices; too few partitions cause large data per task and memory pressure, while too many increase overhead. Parallelism controls the default number of partitions for reduce‑type operations. Both are tuned via spark.default.parallelism or explicit partition arguments.

Practical Tuning Examples

Example 1: An EMR Spark job showed low CPU usage. Reducing cores per executor while increasing executor count and partition count raised CPU utilization and sped up processing.

Example 2: A job suffered OOM; increasing partition count reduced per‑task data size and, by decreasing executor parallelism, allocated more memory per task, mitigating OOM at the cost of slower execution.

Example 3: When input data is small but many tiny files are generated, reducing the number of partitions avoids unnecessary task creation.

Configuration Methods

Set environment variables for hardware‑related settings.

Pass command‑line arguments prefixed with double dashes for per‑run changes.

Configure programmatically via SparkConf inside the application.

Worker‑Executor Ratio Adjustments

Two common patterns:

Keep one executor per worker and increase the number of workers per host (e.g., SPARK_WORKER_INSTANCES), adjusting SPARK_WORKER_CORES accordingly.

Run a single worker per host and launch multiple executors inside it, setting spark.executor.cores and spark.executor.memory as fractions of the host’s resources.

On YARN, the spark.yarn.executor.memoryOverhead (default 10%) must be considered when calculating executor memory.

Memory Tuning

Java objects can occupy 2–5× the raw data size. Use cache and the Spark UI to monitor storage, or SizeEstimator for estimates. Enable -XX:+UseCompressedOops to shrink pointers.

Typical memory‑related errors include executor memory exceeding cluster limits and task scheduler warnings about insufficient resources.

Reduce‑side operations (e.g., reduceByKey, join) can be memory‑intensive; increasing parallelism or raising shuffle memory limits (often to 50% of executor memory) helps.

CPU and Parallelism

Set spark.default.parallelism or spark.default.parallelism to control task count; a rule of thumb is 2–3 tasks per CPU core.

Executor cores can be allocated exclusively or shared. Tests show exclusive core allocation can yield slightly better performance.

Serialization and Data Transfer

Java serialization is default but slow; Kryo offers better speed and compression when compatible.

Broadcast large read‑only variables (recommended when size > 20 KB) to avoid repeated shipping.

Data Locality

Prefer PROCESS_LOCAL , then NODE_LOCAL , RACK_LOCAL , and finally ANY . Spark’s spark.locality.wait governs how long it waits for a local slot before falling back.

File I/O Optimizations

Use columnar formats (Parquet, ORC) and read only required columns. When writing to S3 or HDFS, choose appropriate compression. Adjust coalesce to control output file count; too few or too many files hurt performance.

Task‑Level Optimizations

Enable speculation via spark.speculation to re‑run straggling tasks.

Minimize shuffle by designing joins that keep related keys on the same partition, or by reducing data size before shuffle (e.g., avoid groupByKey when possible).

Repartitioning can balance partition sizes but incurs shuffle overhead; monitor with rdd.partitions().size().

Distribute heavy resources (e.g., database connections) using mapPartitions to reuse them across tasks.

Code Example: Avoid Serializing Large Objects

Original code serializes the whole enclosing object:

rdd.map(r => {
    println(BackfillTypeIndex)
})

Refactor by extracting the variable before the closure:

val dereferencedVariable = this.BackfillTypeIndex
rdd.map(r => println(dereferencedVariable)) // "this" is not serialized

Mark large variables as @transient to exclude them from serialization.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataMemory ManagementConfigurationCPU optimizationperformance tuningSpark
ITPUB
Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.