
Practical Tips and Common Pitfalls for Tuning Apache Spark Performance

This article shares hands‑on experience from Spark Summit attendees: why Spark is powerful, common performance problems (slow jobs, out‑of‑memory errors, data skew, too many partitions), and concrete tuning advice on executors, cores, memory, and debugging techniques.

Liulishuo Tech Team

Why Spark Is So Awesome

Spark became popular because developers can abstract many data sources as RDDs, use high‑level operations that leverage memory, optionally define a schema to get a DataFrame with an optimized physical plan, run near‑real‑time jobs with Spark Streaming using largely the same code, and apply machine‑learning algorithms via MLlib without moving data around.

Why Does It Run So Slowly?

Although Spark code is far shorter than equivalent MapReduce code, jobs can still stall when tasks are poorly balanced. Two practices help:

For input or intermediate RDDs, use repartition() to create a suitable number of partitions, typically a multiple of the number of cores.

When using shuffle operations such as reduceByKey, explicitly set the numTasks parameter.
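The effect of the partition count can be sketched without Spark at all. The pure‑Python snippet below mimics how a hash partitioner (such as Spark's default HashPartitioner) spreads keys over a chosen number of partitions; the helper names are illustrative, not Spark APIs.

```python
# Minimal pure-Python sketch of hash partitioning: each key lands in
# hash(key) mod numPartitions, so the partition count directly bounds
# how evenly work can be spread across tasks.

def partition_for(key, num_partitions):
    # Python's % always yields a non-negative result for a positive
    # modulus, so negative hashes still map into the valid range.
    return hash(key) % num_partitions

def bucket_counts(keys, num_partitions):
    counts = [0] * num_partitions
    for k in keys:
        counts[partition_for(k, num_partitions)] += 1
    return counts

keys = [f"user-{i}" for i in range(10_000)]
counts = bucket_counts(keys, 8)
print(counts)  # roughly even counts when keys are well distributed
```

With well‑distributed keys, every bucket receives a similar share; the skew problems discussed below arise when the keys themselves are concentrated.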

OOM and Key Construction

Long keys (e.g., full HDFS file paths or S3 object keys) can cause out‑of‑memory errors: ten million 200‑byte keys alone already exceed 1.8 GB. A good solution is to replace long keys with short unique integers or hash strings, converting them back only when necessary.
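A minimal sketch of this substitution, assuming a small driver‑side lookup table; the helper names and paths are illustrative, not a Spark API:

```python
# Sketch: replace long keys (e.g., full file paths) with compact integer
# ids before shuffling, keeping a lookup table to translate back later.

def build_key_index(long_keys):
    """Assign a small unique int to each distinct long key."""
    index = {}
    for k in long_keys:
        if k not in index:
            index[k] = len(index)
    return index

paths = [
    "hdfs://nn/warehouse/events/2015/09/07/part-00000",
    "hdfs://nn/warehouse/events/2015/09/07/part-00001",
    "hdfs://nn/warehouse/events/2015/09/07/part-00000",  # repeat
]
index = build_key_index(paths)
reverse = {v: k for k, v in index.items()}

# Shuffle on the small int keys instead of the ~50-byte paths...
short_keyed = [(index[p], 1) for p in paths]
# ...and translate back only when the long key is actually needed.
restored = [(reverse[i], v) for i, v in short_keyed]
print(short_keyed)  # [(0, 1), (1, 1), (0, 1)]
```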

Even short keys can lead to data skew, where a few tasks receive most of the data. Mitigate this by appending a random number (e.g., drawn from a small multiple of the core count) to each key before shuffling, then aggregating a second time after stripping the salt.
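The two‑stage aggregation behind key salting can be sketched in plain Python; the salt format and helper names here are illustrative assumptions, not Spark code:

```python
import random

# Sketch of two-stage aggregation with key salting: a hot key is split
# across NUM_SALTS salted variants, so its work spreads over several
# tasks instead of landing on one reducer.

NUM_SALTS = 4  # e.g., a small multiple of the core count

def salt(key):
    return f"{random.randrange(NUM_SALTS)}#{key}"

def unsalt(salted_key):
    return salted_key.split("#", 1)[1]

# One hot key dominates the data.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10

# Stage 1: aggregate on salted keys (work spreads over the salts).
stage1 = {}
for k, v in records:
    sk = salt(k)
    stage1[sk] = stage1.get(sk, 0) + v

# Stage 2: strip the salt and combine the partial sums.
stage2 = {}
for sk, v in stage1.items():
    k = unsalt(sk)
    stage2[k] = stage2.get(k, 0) + v

print(stage2)  # {'hot': 1000, 'cold': 10}
```

In Spark the two stages correspond to two shuffles (e.g., two reduceByKey passes), trading a second shuffle for balanced tasks.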

Why So Many Partitions?

Task parallelism depends on the number of partitions. Useful facts:

Partitions are created from the InputSplits defined by the input format; on HDFS each split is at most one block (default 128 MB), so a large file yields at least one partition per block, and a directory of many small files yields one partition per file.
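The back‑of‑the‑envelope arithmetic, assuming one split per block with the default block size:

```python
import math

# One InputSplit (and hence one partition) per HDFS block, assuming a
# splittable format and default split settings.

BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size, 128 MB

def partitions_for(file_size_bytes, block_size=BLOCK_SIZE):
    return max(1, math.ceil(file_size_bytes / block_size))

ten_gb = 10 * 1024**3
print(partitions_for(ten_gb))  # 80 partitions for a 10 GB file
```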

If an RDD has more than 2000 partitions, Spark uses a higher‑compression map status implementation (HighlyCompressedMapStatus).

On S3, a very large number of partitions can make listing objects expensive.

Too Many Configuration Parameters

Even experienced Spark users can feel overwhelmed by the dozens of available parameters. The most critical ones are:

Number of executors: --num-executors

Number of cores per executor: --executor-cores

Memory per executor: --executor-memory

Too few cores per executor hurts data locality and cache reuse; too many can saturate I/O. As a rule of thumb, avoid both extremes: with 64 cores in total, neither a single 64‑core executor nor 64 single‑core executors is a good choice.

When running on YARN, pay special attention to spark.yarn.executor.memoryOverhead, which reserves off‑heap memory for the container. The default (the larger of 384 MB and 10% of executor memory) is often too low, and exceeding it causes YARN to kill the container.

If your workload uses little CPU, you can oversubscribe by advertising more logical cores than the machine physically has: adjust SPARK_WORKER_CORES in spark-env.sh (standalone mode) or yarn.nodemanager.resource.cpu-vcores in yarn-site.xml (YARN mode). The YARN setting should be a multiple of --executor-cores so the scheduler can always place whole executors.
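Putting the flags above together, a submission might look like the following; the cluster shape (16‑core, 64 GB nodes) and all numbers are illustrative assumptions, not recommendations:

```shell
# Example resource settings for a hypothetical YARN cluster of
# 16-core, 64 GB worker nodes (values are illustrative only).
spark-submit \
  --master yarn \
  --num-executors 8 \
  --executor-cores 4 \
  --executor-memory 12g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  my_job.py
```

Here 12 GB of heap plus 2 GB of overhead stays safely inside a 16 GB YARN container, and 4 cores per executor sits between the two extremes discussed above.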

Ultimate Question: How to Debug?

When a Spark job behaves strangely, follow this debugging flow:

Find the application ID (e.g., app-20150907095733-0079).

Identify the running stage (e.g., saveAsSequenceFile()).

Locate the task in that stage whose status is RUNNING.

Determine the host machine running the task (e.g., ip-172-31-7-108.cn-north-1.compute.internal).

On that host, run jps -lv | grep spark-application-id to get the PID of the executor JVM running the task.

Run jstack <executor-pid> to inspect the thread dump for clues.

Summary

Spark delivers excellent performance once you master its tuning knobs; the right settings can make a job run orders of magnitude faster, saving substantial engineering time and compute cost. This article covers only a subset of the pitfalls we have encountered; for deeper study, see the tuning guides below.

Tuning Spark (Apache Spark official documentation)

How‑to: Tune Your Apache Spark Job (Cloudera Engineering Blog)

For teams interested in Spark for ETL, real‑time analytics, or recommendation systems, feel free to contact us at [email protected] .
