Spark SQL Parameter Tuning and Performance Optimization (Spark 2.3.2)

This article explains how to troubleshoot and tune Spark SQL configuration parameters—covering exception‑related settings such as spark.sql.hive.convertMetastoreParquet, file‑ignore options, and partition verification, as well as performance‑focused tweaks like broadcast join thresholds, adaptive execution, and parquet schema merging—while providing a comprehensive parameter reference table.

Big Data Technology & Architecture

Aug 12, 2019

Introduction

Spark SQL exposes many configuration parameters that the official documentation does not fully cover. Running the SQL command set -v (for example, in the spark-sql shell) lists every supported option together with its value and meaning. This article walks through practical tuning cases encountered when migrating Hive workloads to Spark.

Exception‑related tuning

spark.sql.hive.convertMetastoreParquet

When true (the default), Spark reads and writes Hive Parquet tables with its built-in Parquet reader/writer, which performs better. Setting it to false forces Hive's SerDe, which can throw a ClassCastException when the column type the SerDe expects differs from the type stored in the files (in the trace below, the data is a LongWritable where an IntWritable is expected).

java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
    at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector.get(WritableIntObjectInspector.java:36)

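The related spark.sql.hive.convertMetastoreParquet.mergeSchema flag controls whether Spark also tries to merge different but compatible schemas found across multiple Parquet files. It only takes effect when spark.sql.hive.convertMetastoreParquet is true.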

spark.sql.files.ignoreMissingFiles & spark.sql.files.ignoreCorruptFiles

These flags apply only to DataSource tables. When true, Spark skips missing or corrupted files instead of failing the job. In the source, the reader catches FileNotFoundException (missing file) and RuntimeException or IOException (corrupt file); if the corresponding flag is enabled it logs a warning and stops reading that file, otherwise the exception propagates.

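// Excerpt: the catch clause around reading the next record of the current file.
catch {
    // Missing file and ignoreMissingFiles is on: warn, mark the file finished, return no record.
    case e: FileNotFoundException if ignoreMissingFiles =>
        logWarning(s"Skipped missing file: $currentFile", e)
        finished = true
        null
    // Missing file and ignoreMissingFiles is off: rethrow, even if ignoreCorruptFiles is on.
    case e: FileNotFoundException if !ignoreMissingFiles => throw e
    // Any other read failure and ignoreCorruptFiles is on: skip the rest of this file.
    case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>
        logWarning(s"Skipped the rest of the content in the corrupted file: $currentFile", e)
        finished = true
        null
}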

spark.sql.hive.verifyPartitionPath

Defaults to false. When enabled, Spark checks that each partition directory exists before reading it and skips partitions whose path is missing. This prevents a FileNotFoundException when a partition is registered in the metastore but its path no longer exists.

java.io.FileNotFoundException: File does not exist: hdfs://.../day=2019-06-25/os=Android/000067_0
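The check can be pictured as a filter over the partition locations that the metastore returns. The sketch below is a local-file-system analogy in plain Scala, not Spark's implementation; the object name, `existingPartitions`, and the directory names are hypothetical and exist only for illustration.

```scala
import java.nio.file.{Files, Path}

object PartitionPathCheck {
  /** Keep only the partition directories that actually exist on disk. */
  def existingPartitions(candidates: Seq[Path]): Seq[Path] =
    candidates.filter(p => Files.isDirectory(p))

  def main(args: Array[String]): Unit = {
    val root = Files.createTempDirectory("tbl")
    // One partition directory is created, the other is only "registered".
    val present = Files.createDirectories(root.resolve("day=2019-06-24"))
    val missing = root.resolve("day=2019-06-25")
    println(existingPartitions(Seq(present, missing)))
  }
}
```

The trade-off is an extra existence check per partition before the read starts, which is presumably why the setting is off by default.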

spark.files.ignoreCorruptFiles & spark.files.ignoreMissingFiles

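These look similar to the spark.sql.files.* settings above but cover a different read path. The spark.sql.files.* flags take effect when Spark reads DataSource tables. A query against a Hive table is executed through HadoopRDD instead, so the spark.sql.files.* flags do not apply there and the spark.files.* flags are the ones that matter.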

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 107052 in stage 914.0 failed 4 times
    ...
    java.io.FileNotFoundException: File does not exist: hdfs://.../day=2019-06-25/os=Android/000067
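As a rough way to keep the two flag families apart, the sketch below maps a table's read path to the configuration prefix whose ignore flags govern it: spark.sql.files for DataSource reads and spark.files for HadoopRDD reads. The types and `flagPrefix` are hypothetical names, and treating a converted Hive Parquet table (spark.sql.hive.convertMetastoreParquet=true) as a DataSource read is an assumption based on how that setting is described earlier.

```scala
object IgnoreFlagFamily {
  sealed trait TableKind
  case object DataSourceTable extends TableKind

  /** A Hive SerDe table. `convertedToDataSource` models a Parquet table read
    * with spark.sql.hive.convertMetastoreParquet=true (assumption, see lead-in). */
  final case class HiveSerdeTable(convertedToDataSource: Boolean) extends TableKind

  /** Configuration prefix whose ignoreCorruptFiles/ignoreMissingFiles flags apply. */
  def flagPrefix(kind: TableKind): String = kind match {
    case DataSourceTable       => "spark.sql.files"
    case HiveSerdeTable(true)  => "spark.sql.files" // read through the DataSource path
    case HiveSerdeTable(false) => "spark.files"     // read through HadoopRDD
  }
}
```

For example, `IgnoreFlagFamily.flagPrefix(IgnoreFlagFamily.HiveSerdeTable(false))` yields the prefix to use for a Hive table that still goes through the SerDe.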

Performance tuning

spark.hadoopRDD.ignoreEmptySplits

When true, empty splits are ignored, reducing the number of tasks.

spark.hadoop.mapreduce.input.fileinputformat.split.minsize

Controls the minimum size of input splits to avoid creating too many small map tasks.
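To see why raising the minimum split size reduces the task count, it helps to reproduce the arithmetic Hadoop's FileInputFormat uses for splittable files. The sketch below is a simplified model in plain Scala: `upperBound` stands in for the maximum or goal size that the two Hadoop APIs use, and the object and method names are illustrative rather than part of any library.

```scala
object SplitSizing {
  // Hadoop lets the final split run up to 10% over the split size.
  private val SplitSlop = 1.1

  /** splitSize = max(minSize, min(upperBound, blockSize)). */
  def splitSize(minSize: Long, upperBound: Long, blockSize: Long): Long =
    math.max(minSize, math.min(upperBound, blockSize))

  /** Number of splits produced for one splittable file of the given size. */
  def numSplits(fileSize: Long, splitSize: Long): Int = {
    var remaining = fileSize
    var n = 0
    while (remaining.toDouble / splitSize > SplitSlop) {
      remaining -= splitSize
      n += 1
    }
    if (remaining > 0) n += 1
    n
  }
}
```

Under these assumptions, a 1 GB file with 128 MB blocks yields 8 splits at the default minimum and 4 splits once the minimum is raised to 256 MB, so each split becomes one larger task instead of two smaller ones.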

Broadcast join thresholds

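spark.sql.autoBroadcastJoinThreshold & spark.sql.broadcastTimeout

spark.sql.autoBroadcastJoinThreshold sets the maximum size, in bytes, of a table that Spark will broadcast automatically when performing a join (default 10 MB; -1 disables automatic broadcasting). spark.sql.broadcastTimeout sets how long, in seconds, Spark waits for the broadcast to finish (default 300).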
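The size check behind the threshold is simple enough to model directly. The following is a minimal sketch of that comparison in plain Scala, assuming the planner has a size estimate for the table; `canBroadcast` here is an illustrative function, not a call into Spark.

```scala
object BroadcastDecision {
  // Spark's documented default for spark.sql.autoBroadcastJoinThreshold (10 MB).
  val DefaultThresholdBytes: Long = 10L * 1024 * 1024

  /** True when a table of the estimated size qualifies for automatic broadcast.
    * A negative threshold such as -1 can never be satisfied, which is how
    * automatic broadcasting is switched off. */
  def canBroadcast(estimatedSizeBytes: Long,
                   thresholdBytes: Long = DefaultThresholdBytes): Boolean =
    estimatedSizeBytes >= 0 && estimatedSizeBytes <= thresholdBytes
}
```

Because the decision depends on an estimate, stale or missing table statistics can push a small table over the threshold or let a large one under it, which is worth checking before raising the limit.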

Adaptive execution

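Setting spark.sql.adaptive.enabled to true (default false) turns on Spark's adaptive query execution. spark.sql.adaptive.shuffle.targetPostShuffleInputSize then sets the target input size, in bytes, of each post-shuffle partition (default 64 MB), so that adjacent small shuffle partitions are combined and the job does not launch a large number of tiny tasks.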
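One way to picture the effect of the target size is a greedy pass that packs adjacent shuffle partitions together until the next one would exceed the target. The sketch below is a simplified model in plain Scala under that assumption; it ignores other constraints Spark may apply (such as a minimum partition count), and the names are illustrative.

```scala
import scala.collection.mutable.ArrayBuffer

object PostShuffleCoalescing {
  /** Given the byte size of each pre-shuffle partition, returns the index at
    * which each coalesced post-shuffle partition starts. */
  def startIndices(preShuffleSizes: Seq[Long], targetSize: Long): Seq[Int] = {
    val starts = ArrayBuffer(0)
    var current = 0L
    for ((size, i) <- preShuffleSizes.zipWithIndex) {
      if (i > 0 && current + size > targetSize) {
        // Adding this partition would exceed the target: start a new group.
        starts += i
        current = size
      } else {
        current += size
      }
    }
    starts.toSeq
  }
}
```

A larger target produces fewer, bigger tasks; a smaller one keeps more parallelism at the cost of more task overhead.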

Parquet schema handling

spark.sql.parquet.mergeSchema

Defaults to false. When true, Spark merges the schemas collected from all Parquet data files rather than taking the schema from a single file. Other related settings include spark.sql.parquet.binaryAsString, spark.sql.parquet.int96AsTimestamp, and spark.sql.parquet.outputTimestampType, which exist for compatibility with external systems.

Other notable settings

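spark.sql.files.maxPartitionBytes: maximum number of bytes packed into a single partition when reading files (default 128 MB).

spark.sql.shuffle.partitions: number of partitions used when shuffling data for joins and aggregations. The source lists 4096, which is likely the value configured on the author's cluster; Spark's built-in default is 200.

spark.sql.sources.default: default data source (parquet).

spark.sql.session.timeZone: session time zone. The source lists Asia/Shanghai, which reflects the author's environment; when the setting is not specified, Spark uses the JVM's local time zone.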
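For spark.sql.files.maxPartitionBytes, the configured value is an upper bound rather than the size every partition gets. The sketch below models how Spark 2.x derives the effective split size for file sources; it brings in spark.sql.files.openCostInBytes (default 4 MB), a setting not discussed in this article, and the object and function names are illustrative.

```scala
object FilePartitionSizing {
  val DefaultMaxPartitionBytes: Long = 128L * 1024 * 1024 // spark.sql.files.maxPartitionBytes
  val DefaultOpenCostInBytes: Long = 4L * 1024 * 1024     // spark.sql.files.openCostInBytes

  /** Effective split size: min(maxPartitionBytes, max(openCost, bytesPerCore)). */
  def maxSplitBytes(fileSizes: Seq[Long],
                    parallelism: Int,
                    maxPartitionBytes: Long = DefaultMaxPartitionBytes,
                    openCostInBytes: Long = DefaultOpenCostInBytes): Long = {
    // Every file is padded with the estimated cost of opening it.
    val totalBytes = fileSizes.map(_ + openCostInBytes).sum
    val bytesPerCore = totalBytes / parallelism
    math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
  }
}
```

With large inputs the result is capped at the configured maximum; with a handful of small files it falls toward the open cost, which is why lowering maxPartitionBytes alone does not always change the partition count.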

Parameter reference table

The article concludes with a detailed table listing key Spark‑SQL parameters, their default values, and brief meanings, covering adaptive execution, broadcast joins, Hive integration, Parquet/ORC handling, statistics, streaming, and UI settings.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
