How to Avoid Common Spark SQL Pitfalls and Boost Performance

This article shares a comprehensive set of practical tips and solutions for common Spark SQL issues—including out‑of‑memory errors, UDF‑induced GC, thread blocking, system‑property initialization, speculation side‑effects, accumulator traps, concurrent job scheduling, and excessive logging—helping engineers improve stability and efficiency of their Spark‑based financial systems.

dbaplus Community

Problem 1: Out of Memory During Job Execution

When Spark executors lack sufficient memory, YARN kills them, causing task failures. The article lists four mitigation strategies:

Ensure data partitions are evenly distributed and increase the number of partitions while reducing the size of each partition.

Explicitly release cached RDDs with rdd.unpersist() once they are no longer needed, to free memory.

Persist with a storage level that serializes data and spills to disk, for example rdd.persist(StorageLevel.MEMORY_AND_DISK_SER), trading speed for stability.

If the above are insufficient, increase executor memory (spark.executor.memory) or the memoryOverhead setting (spark.yarn.executor.memoryOverhead in older releases, spark.executor.memoryOverhead in newer ones), though results may vary.

Note: rdd.cache() is equivalent to rdd.persist(StorageLevel.MEMORY_ONLY); partitions that do not fit in memory are not cached and are recomputed each time they are needed. MEMORY_AND_DISK_SER writes the partitions that do not fit to disk instead, avoiding recomputation at the cost of I/O.
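
The sizing arithmetic behind the first and last strategies can be sketched in a few lines. This is a minimal Python sketch under stated assumptions, not a tuning recipe: the helper names and the 128 MiB target partition size are illustrative, and the overhead rule (the larger of 384 MiB or 10% of executor memory) is the default that recent Spark documentation gives for the memoryOverhead setting, so check it against your Spark version.

```python
import math

MIB = 1024 * 1024

def suggested_partitions(total_bytes, target_partition_bytes=128 * MIB):
    """Return a partition count that keeps each partition near the target size."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_partition_bytes))

def yarn_container_mib(executor_memory_mib, overhead_mib=None):
    """Estimate the memory YARN must grant for one executor, in MiB.

    Assumption: without an explicit overhead, use max(384 MiB, 10% of executor memory).
    """
    if overhead_mib is None:
        overhead_mib = max(384, int(executor_memory_mib * 0.10))
    return executor_memory_mib + overhead_mib

# 50 GiB of input split into partitions of roughly 128 MiB each
print(suggested_partitions(50 * 1024 * MIB))
# Container size for an executor with an 8 GiB heap
print(yarn_container_mib(8 * 1024))
```

The first number could be passed to rdd.repartition(n). The second shows how little headroom the default overhead leaves: a container whose off-heap usage grows past that allowance can be killed by YARN even when the heap itself is within limits.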

Relevant references for YARN container concepts and configuration:

http://dongxicheng.org/mapreduce-nextgen/understand-yarn-container-concept/

http://ju.outofmemory.cn/entry/199175

https://yq.aliyun.com/articles/25468

Problem 2: Full GC Caused by Instance Variable in a UDF

A custom Spark SQL UDF (CuxApVendorSetupUDF) held a large map (hundreds of thousands of entries) as an instance variable. The map was rebuilt in every task, which caused a spike in GC time.

Solution: Broadcast the map so that it is stored once per executor and shared across tasks, eliminating repeated construction.
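
The cost pattern, and why sharing removes it, can be illustrated without Spark. The sketch below is a toy Python model, not the author's UDF: load_vendor_map and the two task functions are hypothetical, and functools.lru_cache only stands in for a broadcast variable by keeping a single copy per process.

```python
from functools import lru_cache

counter = {"builds": 0}  # how many times the large map has been constructed

def load_vendor_map():
    """Hypothetical loader for the large lookup table."""
    counter["builds"] += 1
    return {i: f"vendor-{i}" for i in range(100_000)}

def task_with_private_map(keys):
    """Anti-pattern: every task builds its own copy of the map."""
    vendor_map = load_vendor_map()
    return [vendor_map[k] for k in keys]

@lru_cache(maxsize=1)
def shared_vendor_map():
    """Build the map once per process and reuse it afterwards."""
    return load_vendor_map()

def task_with_shared_map(keys):
    """Tasks read the shared copy instead of rebuilding it."""
    vendor_map = shared_vendor_map()
    return [vendor_map[k] for k in keys]

# Four tasks, each constructing the map again
for _ in range(4):
    task_with_private_map([1, 2, 3])
private_builds = counter["builds"]

# Four tasks sharing one copy
counter["builds"] = 0
for _ in range(4):
    task_with_shared_map([1, 2, 3])
shared_builds = counter["builds"]

print(private_builds, shared_builds)
```

In a real job the corresponding step would be to call SparkContext.broadcast on the driver and read the broadcast's value inside the UDF, so that each executor holds one copy.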

Problem 3: Thread Blocking Due to Excessive TCP Connections

During a job run, many executor threads were blocked on socket communication, preventing the Spark UI from retrieving executor information.

Findings from the investigation:

Insufficient executor memory or CPU was ruled out, because the cluster had idle resources.

The actual cause was frequent opening and closing of MySQL connections, which created so many sockets that the OS limit was exhausted.

Problem 4: System Properties Not Initialized Correctly

The team's Java developers originally stored configuration in OS environment variables, so any change required a full Spark cluster restart. The team switched to reading from application config files while still supporting environment variables, and wrote the combined settings into Java system properties with System.setProperty. However, properties set only on the driver were not visible on the executors, which run in separate JVMs.

Fix: Ensure System.setProperty is executed on each executor before tasks run.

Problem 5: Caution When Using spark.speculation for Database Writes

Speculation may duplicate slow tasks on other executors, potentially causing duplicate writes to a database. In the author's case, unique indexes prevented dirty data, but the job still failed due to write errors.

Note: The default value of spark.speculation is false (disabled).

When enabled, Spark may launch a duplicate task without waiting for the original to finish; the first successful result is used.
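
If speculation has to stay enabled, one common way to make duplicate attempts harmless is to make the write idempotent, so that a second attempt overwrites existing rows rather than colliding with them. This is not the author's fix, only a sketch of the idea in Python: sqlite3 stands in for the target database, and the balances table and write_partition function are hypothetical.

```python
import sqlite3

def write_partition(conn, rows):
    """Idempotent write: re-running with the same rows leaves the same final state."""
    conn.executemany(
        "INSERT OR REPLACE INTO balances(account_id, amount) VALUES (?, ?)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE balances (account_id INTEGER PRIMARY KEY, amount INTEGER)")

rows = [(1, 100), (2, 250), (3, 75)]
write_partition(conn, rows)  # original task attempt
write_partition(conn, rows)  # speculative duplicate of the same task

final_rows = conn.execute(
    "SELECT account_id, amount FROM balances ORDER BY account_id"
).fetchall()
print(final_rows)
```

With a plain INSERT, the second call would fail on the primary key, which matches the failure described above. In MySQL the analogous statements are REPLACE INTO and INSERT ... ON DUPLICATE KEY UPDATE; whether they suit a given table depends on its keys and on how concurrent attempts contend for locks.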

Problem 6: Monitoring Target Storage State to Prevent Failures

High‑throughput writes to MySQL (up to 200,000 UPDATEs per second) caused CPU saturation and task failures. The team limited write throughput to around 10,000 TPS and applied several mitigations:

Control the number of concurrent Spark jobs.

Reduce the number of partitions, thereby limiting executor and task parallelism.

Combine multiple UPDATE statements into batch SQL to lower command count.
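
The last mitigation can be sketched as follows: several single-row UPDATEs are folded into one statement using a CASE expression. This is an illustrative Python sketch, not the team's code. The accounts table and build_batch_update helper are hypothetical, and sqlite3 stands in for MySQL (MySQL drivers usually use %s placeholders instead of ?).

```python
import sqlite3

def build_batch_update(rows):
    """Build one UPDATE statement covering all (id, amount) pairs in rows."""
    if not rows:
        raise ValueError("rows must not be empty")
    cases = " ".join("WHEN ? THEN ?" for _ in rows)
    id_list = ", ".join("?" for _ in rows)
    sql = f"UPDATE accounts SET amount = CASE id {cases} END WHERE id IN ({id_list})"
    # Parameters: (id, amount) pairs for the CASE arms, then the ids for the IN list
    params = [value for row in rows for value in row] + [row[0] for row in rows]
    return sql, params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 0), (2, 0), (3, 0), (4, 0)])

sql, params = build_batch_update([(1, 10), (2, 20), (3, 30)])
cursor = conn.execute(sql, params)  # one command instead of three
conn.commit()

updated = cursor.rowcount
amounts = conn.execute("SELECT id, amount FROM accounts ORDER BY id").fetchall()
print(updated, amounts)
```

Batch size needs an upper bound in practice, since databases limit statement length and the number of bound parameters.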

Problem 7: org.apache.spark.SparkException: Task not serializable

Using non‑serializable external objects inside closures causes this exception. The article lists four ways to resolve it:

Make the object and all referenced objects implement java.io.Serializable.

Create the object inside rdd.foreachPartition rather than capturing it.

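Define the object as a static member, so that it is initialized in each executor JVM instead of being serialized with the closure.

Avoid accessing fields of external objects from within the task, because referencing a field captures the whole enclosing object; copy the value into a local variable first.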

Problem 8: Accumulator Traps

Accumulators are useful for lightweight debugging but can yield incorrect results if misused. Guidelines:

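Accumulator updates made inside a transformation are applied again every time the RDD is recomputed. If several actions run on the same RDD, cache or persist it first so that the transformation, and with it the accumulator update, runs only once.

When a stage fails and tasks are retried, accumulator updates made inside transformations may be applied more than once, producing wrong values. Spark guarantees exactly-once updates only for accumulators updated inside actions.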

If speculative execution is enabled, the same task may run on multiple nodes, causing duplicate accumulator updates.
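
The recomputation trap can be reproduced with a toy model. The sketch below is plain Python, not the Spark API: ToyRDD and Counter are hypothetical stand-ins for a lazily evaluated RDD and an accumulator, built only to show why a second action doubles the count unless the data is cached.

```python
class ToyRDD:
    """A tiny lazy dataset: the map function runs again on every action unless cached."""

    def __init__(self, source, fn):
        self._source = list(source)
        self._fn = fn
        self._keep = False
        self._materialized = None

    def cache(self):
        """Ask for the computed result to be kept after the first action."""
        self._keep = True
        return self

    def _compute(self):
        if self._materialized is not None:
            return self._materialized
        result = [self._fn(x) for x in self._source]
        if self._keep:
            self._materialized = result
        return result

    def count(self):
        return len(self._compute())

    def collect(self):
        return list(self._compute())

class Counter:
    """Stand-in for an accumulator."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n

def run(use_cache):
    seen = Counter()

    def tag(x):
        seen.add(1)  # accumulator update inside a transformation
        return x * 2

    rdd = ToyRDD(range(10), tag)
    if use_cache:
        rdd.cache()
    rdd.count()    # first action
    rdd.collect()  # second action
    return seen.value

uncached_total = run(use_cache=False)
cached_total = run(use_cache=True)
print(uncached_total, cached_total)
```

Ten input records produce a count of 20 without caching and 10 with it. Task retries and speculative attempts inflate the count in the same way, by running the transformation more than once.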

Problem 9: Running Multiple Spark Jobs Simultaneously on a Small Cluster

The cluster allowed only one job at a time, even when resources were idle, limiting throughput for high‑priority tasks. The team adopted YARN queue‑based resource sharing with the following principles:

Create queues based on business function, priority, and required resources.

Give each queue a small guaranteed minimum, ideally just enough to run a single executor.

The sum of maximum capacities of high‑priority queues must not exceed total cluster resources.
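
The second and third principles can be checked mechanically before a queue plan is applied. The sketch below is a hypothetical Python validator, not a YARN API: the dictionary layout, the queue names, and the percentages are invented for illustration.

```python
def check_queues(queues):
    """Return a list of problems found in a queue plan.

    queues maps a queue name to a dict with 'capacity' and 'max_capacity'
    (both as a percentage of the cluster) and a boolean 'high_priority'.
    """
    problems = []
    for name, q in queues.items():
        # Principle: every queue keeps a small guaranteed minimum
        if q["capacity"] <= 0:
            problems.append(f"{name}: no guaranteed capacity")
    # Principle: high-priority maximums together must fit in the cluster
    high_max = sum(q["max_capacity"] for q in queues.values() if q["high_priority"])
    if high_max > 100:
        problems.append(f"high-priority max capacities sum to {high_max}% (> 100%)")
    return problems

good_plan = {
    "etl":       {"capacity": 10, "max_capacity": 60, "high_priority": True},
    "reporting": {"capacity": 10, "max_capacity": 40, "high_priority": True},
    "adhoc":     {"capacity": 5,  "max_capacity": 80, "high_priority": False},
}
bad_plan = {
    "etl":       {"capacity": 10, "max_capacity": 80, "high_priority": True},
    "reporting": {"capacity": 0,  "max_capacity": 40, "high_priority": True},
}

print(check_queues(good_plan))
print(check_queues(bad_plan))
```

The low-priority queue is deliberately left out of the sum: it may borrow idle capacity, but it is the high-priority queues that must all be able to reach their maximum at the same time.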

Problem 10: Excessive Spark Log Output Blocking Log Collection System

Spark's default log level (INFO) generated a massive volume of logs, which overwhelmed the Logstash‑based log collection pipeline and caused client crashes. The solution is to set Spark's log level to WARN while keeping business‑specific logs at INFO. Because Spark runs in YARN mode, the custom log4j.properties must be placed in the cluster's configuration directory.
