Spark SQL Parameter Tuning and Performance Optimization (Spark 2.3.2)
This article explains how to troubleshoot and tune Spark SQL configuration parameters—covering exception‑related settings such as spark.sql.hive.convertMetastoreParquet, file‑ignore options, and partition verification, as well as performance‑focused tweaks like broadcast join thresholds, adaptive execution, and parquet schema merging—while providing a comprehensive parameter reference table.
Introduction
Spark SQL exposes many configuration parameters that are not fully documented on the official site. Running the SQL command set -v (for example, from the spark-sql shell) lists all supported options together with their current values and descriptions. The article discusses practical tuning cases encountered when migrating Hive workloads to Spark.
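The listing command can also be driven from Scala. The following is a minimal sketch, assuming a local SparkSession created only for illustration (the object name and app name are hypothetical); it filters the output to the Parquet-related options so the result stays readable.

```scala
import org.apache.spark.sql.SparkSession

object ListSqlParameters {
  def main(args: Array[String]): Unit = {
    // Local session purely for illustration; on a cluster the master comes from spark-submit.
    val spark = SparkSession.builder()
      .appName("list-sql-parameters")
      .master("local[*]")
      .getOrCreate()

    // SET -v returns one row per SQL option, including its description.
    val options = spark.sql("SET -v")

    // Narrow the listing to Parquet-related options.
    options
      .filter(options("key").contains("parquet"))
      .show(100, truncate = false)

    spark.stop()
  }
}
```

Changing the string passed to contains gives the same view for other option families, such as "adaptive" or "hive".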
Exception‑related tuning
spark.sql.hive.convertMetastoreParquet
When true (default), Spark uses its built‑in Parquet reader/writer, offering better performance. Setting it to false forces Hive’s SerDe, which can cause ClassCastException errors when the underlying data types differ (e.g., LongWritable vs. IntWritable).
java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.IntWritable
    at org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableIntObjectInspector.get(WritableIntObjectInspector.java:36)
The related spark.sql.hive.convertMetastoreParquet.mergeSchema flag controls whether Spark attempts to merge schemas from multiple Parquet files.
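As a rough sketch of how these two flags might be set per session, assuming a Hive-enabled SparkSession (which needs the spark-hive module and a reachable metastore) and a hypothetical table named example_db.events:

```scala
import org.apache.spark.sql.SparkSession

object ConvertMetastoreParquetExample {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() requires the spark-hive module and a reachable metastore.
    val spark = SparkSession.builder()
      .appName("convert-metastore-parquet")
      .enableHiveSupport()
      .getOrCreate()

    // Use Spark's built-in Parquet reader/writer for Hive Parquet tables (the default).
    spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
    // Leave schema merging off unless the table's files carry differing but compatible schemas.
    spark.conf.set("spark.sql.hive.convertMetastoreParquet.mergeSchema", "false")

    // Hypothetical table name, used only for illustration.
    spark.sql("SELECT * FROM example_db.events LIMIT 10").show()

    spark.stop()
  }
}
```

If a query against a given table fails under one setting, flipping convertMetastoreParquet for that session is a cheap first experiment before changing cluster-wide defaults.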
spark.sql.files.ignoreMissingFiles & spark.sql.files.ignoreCorruptFiles
These flags apply only to DataSource tables. When true, Spark silently skips missing or corrupted files instead of throwing exceptions. The source logic essentially catches FileNotFoundException and IOException and logs a warning if the corresponding flag is enabled.
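catch {
  // Missing file and the flag is on: log it, mark this file as done, and move on.
  case e: FileNotFoundException if ignoreMissingFiles =>
    logWarning(s"Skipped missing file: $currentFile", e)
    finished = true
    null
  // FileNotFoundException is an IOException, so rethrow it here;
  // otherwise the ignoreCorruptFiles case below would swallow it.
  case e: FileNotFoundException if !ignoreMissingFiles => throw e
  // Corrupt file and the flag is on: skip whatever is left of this file.
  case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>
    logWarning(s"Skipped the rest of the content in the corrupted file: $currentFile", e)
    finished = true
    null
}
spark.sql.hive.verifyPartitionPath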
When enabled (the default is false), Spark checks that each partition directory exists before reading it. Enabling it prevents FileNotFoundException errors when a partition path is missing.
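A minimal sketch of turning the check on for one session, assuming a Hive-enabled SparkSession; the table name example_db.events and the partition column day are hypothetical and stand in for a partitioned Hive table.

```scala
import org.apache.spark.sql.SparkSession

object VerifyPartitionPathExample {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() requires the spark-hive module and a reachable metastore.
    val spark = SparkSession.builder()
      .appName("verify-partition-path")
      .enableHiveSupport()
      .getOrCreate()

    // Check partition directories before reading them.
    spark.conf.set("spark.sql.hive.verifyPartitionPath", "true")

    // Hypothetical partitioned Hive table and partition column.
    spark.sql("SELECT COUNT(*) FROM example_db.events WHERE day = '2019-06-25'").show()

    spark.stop()
  }
}
```

The check adds file-system calls for every partition touched by the query, so it is worth measuring on tables with many partitions. With the flag left at false, a missing path can surface as an error like the following.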
java.io.FileNotFoundException: File does not exist: hdfs://.../day=2019-06-25/os=Android/000067_0
spark.files.ignoreCorruptFiles & spark.files.ignoreMissingFiles
These are the Spark core counterparts of the spark.sql.files.* settings, and the difference matters. The spark.sql.files.* flags take effect only when Spark reads DataSource tables. Hive tables are read through HadoopRDD, where the spark.files.* flags apply instead.
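The two families are set in different places. The sketch below assumes a Hive-enabled SparkSession and sets both, purely to show where each key goes; whether spark.files.ignoreMissingFiles is recognized depends on the Spark version in use, so treat that line as an assumption to verify.

```scala
import org.apache.spark.sql.SparkSession

object IgnoreBadFilesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ignore-bad-files")
      // Core-level flags, read by HadoopRDD; set them before the session starts.
      .config("spark.files.ignoreCorruptFiles", "true")
      // Assumption: confirm that your Spark version supports this key.
      .config("spark.files.ignoreMissingFiles", "true")
      .enableHiveSupport()
      .getOrCreate()

    // SQL-level flags, read by the file-based DataSource scan; adjustable per session.
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
    spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

    spark.stop()
  }
}
```

Skipped files mean silently missing rows, so these flags suit jobs that can tolerate partial input. When the relevant flag is not set, a missing file fails the job with an error such as: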
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 107052 in stage 914.0 failed 4 times
...
java.io.FileNotFoundException: File does not exist: hdfs://.../day=2019-06-25/os=Android/000067
Performance tuning
spark.hadoopRDD.ignoreEmptySplits
When true, empty splits are ignored, reducing the number of tasks.
spark.hadoop.mapreduce.input.fileinputformat.split.minsize
Controls the minimum size of input splits to avoid creating too many small map tasks.
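Both input-split settings are read when the session is created, so a sketch of applying them would put them on the builder. The 128 MB minimum split size below is an illustrative value, not a recommendation from the article, and the object name is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object InputSplitTuningExample {
  def main(args: Array[String]): Unit = {
    val minSplitBytes = 128L * 1024 * 1024 // illustrative value: 128 MB

    val spark = SparkSession.builder()
      .appName("input-split-tuning")
      .master("local[*]")
      // Do not schedule tasks for empty input splits.
      .config("spark.hadoopRDD.ignoreEmptySplits", "true")
      // Keys prefixed with spark.hadoop. are copied into the Hadoop configuration.
      .config("spark.hadoop.mapreduce.input.fileinputformat.split.minsize", minSplitBytes.toString)
      .getOrCreate()

    // Read the value back from the Hadoop configuration to confirm it was propagated.
    val effective = spark.sparkContext.hadoopConfiguration
      .get("mapreduce.input.fileinputformat.split.minsize")
    println(s"split.minsize = $effective")

    spark.stop()
  }
}
```

A larger minimum split size means fewer, larger map tasks, which trades scheduling overhead against per-task parallelism.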
Broadcast join thresholds
spark.sql.autoBroadcastJoinThreshold sets the maximum size, in bytes, of a table that Spark will broadcast automatically in a join, and spark.sql.broadcastTimeout sets how long, in seconds, Spark waits for the broadcast future to complete.
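A small sketch of both settings in use, assuming a local session and two tiny in-memory tables invented for illustration; the 50 MB threshold and 600 second timeout are example values only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-join")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Broadcast tables up to 50 MB (value in bytes); -1 disables automatic broadcast joins.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (50L * 1024 * 1024).toString)
    // Allow up to 600 seconds for the broadcast to complete.
    spark.conf.set("spark.sql.broadcastTimeout", "600")

    // Illustrative data only.
    val facts = Seq((1, "click"), (2, "view"), (1, "view")).toDF("user_id", "action")
    val users = Seq((1, "alice"), (2, "bob")).toDF("user_id", "name")

    // The explicit hint asks Spark to broadcast `users` regardless of the threshold.
    val joined = facts.join(broadcast(users), "user_id")
    joined.explain() // inspect the physical plan for a broadcast join
    joined.show()

    spark.stop()
  }
}
```

Raising the threshold increases driver and executor memory pressure, so the timeout and the threshold are usually adjusted together.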
Adaptive execution
Enabling spark.sql.adaptive.enabled activates Spark’s adaptive query execution. The spark.sql.adaptive.shuffle.targetPostShuffleInputSize parameter limits the average input size per task after shuffle to prevent task explosion.
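The sketch below shows the two adaptive-execution keys as named above for Spark 2.3.x (later Spark releases renamed some adaptive settings). The 128 MB target and the synthetic aggregation are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

object AdaptiveExecutionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("adaptive-execution")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Turn on adaptive execution for this session.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    // Target input size per post-shuffle task, in bytes (illustrative: 128 MB).
    spark.conf.set(
      "spark.sql.adaptive.shuffle.targetPostShuffleInputSize",
      (128L * 1024 * 1024).toString)

    // A synthetic aggregation that forces a shuffle.
    val counts = spark.range(0, 1000000)
      .groupBy(($"id" % 100).as("bucket"))
      .count()

    counts.show(5)
    // Number of post-shuffle partitions actually used for this query.
    println(s"post-shuffle partitions: ${counts.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

Comparing the printed partition count with and without the first setting is a quick way to see whether adaptive execution is coalescing small shuffle partitions for a given query.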
Parquet schema handling
spark.sql.parquet.mergeSchema (default false) decides whether Spark merges schemas from all Parquet files. Other related settings include spark.sql.parquet.binaryAsString, spark.sql.parquet.int96AsTimestamp, and spark.sql.parquet.outputTimestampType for compatibility with external systems.
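To make schema merging concrete, the following sketch writes two Parquet datasets with different columns into a temporary directory and reads them back with merging enabled for that read only. The column names and directory layout are invented for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object ParquetMergeSchemaExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-merge-schema")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Temporary location, used only for this illustration.
    val base = Files.createTempDirectory("parquet-merge").toString

    // Two datasets that share `id` but otherwise carry different columns.
    Seq((1, "a")).toDF("id", "name").write.parquet(s"$base/first")
    Seq((2, 3.5)).toDF("id", "score").write.parquet(s"$base/second")

    // Per-read option; spark.sql.parquet.mergeSchema is the session-wide equivalent.
    val merged = spark.read
      .option("mergeSchema", "true")
      .parquet(s"$base/first", s"$base/second")

    // The merged schema should contain the union of the columns.
    merged.printSchema()
    merged.show()

    spark.stop()
  }
}
```

Merging requires reading the footers of the files involved, which is why it is off by default; enabling it per read keeps that cost limited to the tables that need it.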
Other notable settings
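spark.sql.files.maxPartitionBytes – maximum number of bytes packed into a single partition when reading files.
spark.sql.shuffle.partitions – number of shuffle partitions (listed here as 4096; the stock Spark default is 200).
spark.sql.sources.default – default data source (parquet).
spark.sql.session.timeZone – session time zone (listed here as Asia/Shanghai; by default Spark uses the JVM's local time zone).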
Parameter reference table
The article concludes with a detailed table listing key Spark‑SQL parameters, their default values, and brief meanings, covering adaptive execution, broadcast joins, Hive integration, Parquet/ORC handling, statistics, streaming, and UI settings.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
