
Common Spark SQL, Spark Core, PySpark, and Streaming Issues and Their Solutions

This article compiles frequent Spark SQL, Spark Core, PySpark, and Spark Streaming problems (filesystem errors, configuration pitfalls, memory limits, shuffle failures, and version incompatibilities), with a short explanation of each cause and its fix for big-data environments.

Big Data Technology & Architecture

Jul 18, 2020

1. SparkSQL related issues

"FileSystem closed" error in INSERT statements: Hadoop caches FileSystem instances, so a handle closed by one caller is also closed for every other caller sharing it; disable the cache by setting fs.hdfs.impl.disable.cache=true in hdfs-site.xml.
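The caching behaviour behind this error can be illustrated with a toy model. This is only a sketch: ToyFileSystem, get_filesystem, and the hdfs://nn:8020 URI are hypothetical stand-ins, not Hadoop's classes, and the real cache key involves more than the URI.

```python
class ToyFileSystem:
    """Stand-in for a filesystem client handle (not Hadoop's real class)."""

    def __init__(self, uri):
        self.uri = uri
        self.closed = False

    def close(self):
        self.closed = True

    def write(self, data):
        if self.closed:
            raise IOError("Filesystem closed")
        return len(data)


_cache = {}


def get_filesystem(uri, disable_cache=False):
    """With caching on, every caller gets the same shared instance per URI."""
    if disable_cache:
        return ToyFileSystem(uri)
    if uri not in _cache:
        _cache[uri] = ToyFileSystem(uri)
    return _cache[uri]


def second_writer_fails(disable_cache):
    """Caller A closes its handle; report whether caller B's write then fails."""
    a = get_filesystem("hdfs://nn:8020", disable_cache)  # hypothetical URI
    b = get_filesystem("hdfs://nn:8020", disable_cache)
    a.close()
    try:
        b.write(b"row")
        return False
    except IOError:
        return True
```

With the cache enabled, second_writer_fails(False) reports a failure because both callers hold the same object; with the cache disabled each caller owns its handle and the second write succeeds.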

UnresolvedAddressException when connecting to a host due to missing hosts entry; add the correct mapping to the /etc/hosts file.

ORC table errors (IndexOutOfBoundsException, NullPointerException) caused by empty ORC files; avoid by setting hive.exec.orc.split.strategy=BI.

Spark 2.1.0 lacks permanent functions because it cannot load JARs from HDFS; upgrade to Spark 2.2+.

ThriftServer timeout errors (socket timeout) due to busy Hive metastore or GC; increase hive.metastore.client.socket.timeout or set DriverManager.setLoginTimeout(100).

Snappy compression errors because the native library is missing; add the library path via spark.executor.extraLibraryPath or spark.executor.extraJavaOptions.

Excessive task count for small files caused by default partition calculation; adjust mapreduce.job.maps or spark.default.parallelism to reduce tasks.
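To see why the split-count hint matters for small files, here is a simplified model of the split arithmetic used by Hadoop's older FileInputFormat, written from memory of that algorithm. The function name, the 128 MB block size, and the file sizes are assumptions for illustration; real jobs also depend on the input format and compression.

```python
SPLIT_SLOP = 1.1  # tolerance before a file's tail becomes its own split


def count_splits(file_sizes, num_splits_hint,
                 block_size=128 * 1024 * 1024, min_size=1):
    """Estimate the number of input splits (and so tasks) for a set of files."""
    total = sum(file_sizes)
    goal_size = total // max(num_splits_hint, 1)
    split_size = max(min_size, min(goal_size, block_size))
    splits = 0
    for length in file_sizes:
        if length == 0:
            splits += 1  # an empty file still yields one empty split
            continue
        remaining = length
        while remaining / split_size > SPLIT_SLOP:
            splits += 1
            remaining -= split_size
        if remaining:
            splits += 1
    return splits


ten_small_files = [1024 * 1024] * 10  # ten 1 MB files (made-up sizes)
with_large_hint = count_splits(ten_small_files, num_splits_hint=100)
with_small_hint = count_splits(ten_small_files, num_splits_hint=2)
```

In this model a hint of 100 cuts every 1 MB file into ten pieces (100 splits), while a hint of 2 leaves one split per file (10 splits), which is the effect the parameter change above is after.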

ThriftServer LDAP authentication failures caused by wrong password or LDAP service issues; correct credentials or fix LDAP service.

JDBC operations failing due to missing or unwritable temporary directories; restart ThriftServer and set proper spark.local.dir permissions.

StackOverflowError for complex SQL caused by insufficient JVM stack size; launch SparkSQL with --driver-java-options "-Xss10m".

INSERT INTO repeated execution bug in Spark 2.1.0; upgrade to 2.1.1 or use INSERT OVERWRITE without partition.

Shuffle-related failures (missing output location, connection errors, OOM, container killed) caused by improper shuffle partition settings; tune spark.sql.shuffle.partitions, spark.default.parallelism, executor memory, and check for data skew.
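A quick way to check for data skew before tuning partition counts is to count a sample of the shuffle keys. The sketch below is plain Python; the function name and the factor of 5 are arbitrary choices for illustration, and in practice the keys would come from a sample of the join or group-by column.

```python
from collections import Counter


def skewed_keys(keys, factor=5.0):
    """Return keys whose count exceeds `factor` times the mean count per key."""
    counts = Counter(keys)
    if not counts:
        return {}
    mean = sum(counts.values()) / len(counts)
    return {key: n for key, n in counts.items() if n > factor * mean}


# Made-up sample: one hot key and ten rare ones.
sample = ["a"] * 90 + [f"b{i}" for i in range(10)]
hot = skewed_keys(sample)
```

If a handful of keys dominate, raising spark.sql.shuffle.partitions alone is unlikely to help, because all rows for a hot key still land in the same partition.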

Executor OOM with "GC overhead limit exceeded"; increase executor memory and switch to G1GC with spark.executor.extraJavaOptions=-XX:+UseG1GC.

ORC access permission errors in HiveServer2/Spark ThriftServer; use a superuser for first query or disable caching with hive.fetch.task.conversion=none.

Slow small‑data queries because of default spark.locality.wait=3s; set it to 0 for low‑latency workloads.

2. Spark Core related issues

Jersey package conflict on YARN leading to NoClassDefFoundError; disable timeline service with --conf spark.hadoop.yarn.timeline-service.enabled=false.

"No space left on device" errors due to full Spark temporary directories; enlarge spark.local.dir or add multiple directories.

Result size exceeds spark.driver.maxResultSize; increase the configuration value.

Common OOM (Java heap space) caused by large data or many partitions; avoid collect, address data skew, and increase executor memory.

Executor loss from Full GC or long‑running tasks; increase executor memory, adjust spark.network.timeout, or tune GC.

Jar version conflicts causing ClassNotFoundException; set spark.driver.userClassPathFirst and spark.executor.userClassPathFirst to true.

Shuffle fetch OOM; increase executor memory and reduce spark.reducer.maxSizeInFlight (default 48 MB).

Direct buffer memory errors; raise -XX:MaxDirectMemorySize via spark.executor.extraJavaOptions.

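Node failures (e.g., a read-only disk) keep failing the tasks scheduled on that node; set spark.blacklist.enabled=true so the scheduler stops placing tasks there (Spark 3.1 and later name this setting spark.excludeOnFailure.enabled).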
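Several of the Spark Core fixes above are plain configuration changes, so they can be passed as --conf flags at submit time. The helper below is a minimal sketch: conf_flags and app.py are hypothetical names, and the values (2g, 300s, 1g) are illustrative, not recommendations.

```python
def conf_flags(settings):
    """Turn a dict of Spark settings into a flat list of --conf arguments."""
    flags = []
    for key, value in sorted(settings.items()):
        flags += ["--conf", f"{key}={value}"]
    return flags


# Illustrative values only; size them for the actual workload.
core_settings = {
    "spark.driver.maxResultSize": "2g",
    "spark.network.timeout": "300s",
    "spark.blacklist.enabled": "true",
    "spark.executor.extraJavaOptions": "-XX:MaxDirectMemorySize=1g",
}

# app.py is a placeholder for the real application.
command = ["spark-submit"] + conf_flags(core_settings) + ["app.py"]
```

Keeping the command as a list (for example, to hand to subprocess.run) avoids shell-quoting problems with values such as the JVM options string.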

3. PySpark related issues

Driver and executor Python versions mismatch; set spark.pyspark.python and environment variables PYSPARK_PYTHON, PYSPARK_DRIVER_PYTHON to the same interpreter.
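One way to keep the three settings in step is to derive them from a single interpreter path. This is a sketch under the assumption that the same path exists on every node; pyspark_python_env is a hypothetical helper, and sys.executable is used only so the example runs anywhere.

```python
import os
import sys


def pyspark_python_env(interpreter, base_env=None):
    """Copy an environment and point driver and executors at one interpreter."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_PYTHON"] = interpreter
    env["PYSPARK_DRIVER_PYTHON"] = interpreter
    return env


interpreter = sys.executable  # assumption: identical path on all nodes
env = pyspark_python_env(interpreter)
conf_flag = ["--conf", f"spark.pyspark.python={interpreter}"]
```

The env dict can be passed to whatever launches spark-submit, and conf_flag appended to its arguments, so that the environment variables and the Spark property never disagree.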

Serialization pickle EOFError on YARN due to conflicting Spark installations; remove the extra Spark path from the NodeManager.

Random hash seed differences causing PYTHONHASHSEED errors; set spark.executorEnv.PYTHONHASHSEED to a fixed value.
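The reason a fixed seed matters is that each Python worker is a separate interpreter, and Python 3 randomizes string hashes per process unless PYTHONHASHSEED is set. The sketch below shows the effect with fresh interpreters started from the standard library; the helper name is hypothetical.

```python
import os
import subprocess
import sys


def string_hash_in_fresh_interpreter(text, seed):
    """Start a new Python process with PYTHONHASHSEED set and hash a string."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(result.stdout)


# Two separate processes with the same seed agree on the hash.
first = string_hash_in_fresh_interpreter("spark", seed=0)
second = string_hash_in_fresh_interpreter("spark", seed=0)
```

Setting spark.executorEnv.PYTHONHASHSEED=0 gives every executor the same guarantee, so string keys hash to the same partition regardless of which worker computes the hash.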

4. Streaming related issues

Kafka consumer reads all existing messages because auto.offset.reset=earliest; change to latest or specify offset ranges.

Performance tips: prefer reduceByKey / aggregateByKey over groupByKey, use mapPartitions and foreachPartition instead of per-record map and foreach, coalesce after filters that shrink the data, and use repartitionAndSortWithinPartitions instead of a separate repartition followed by a sort.
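The first tip comes down to how many records cross the shuffle. The toy model below counts them in plain Python; it is not Spark code, and the function names and sample data are made up for illustration.

```python
def shuffled_records_group_by_key(partitions):
    """groupByKey ships every (key, value) pair across the shuffle."""
    return sum(len(part) for part in partitions)


def shuffled_records_reduce_by_key(partitions, func):
    """reduceByKey merges values per key inside each partition first."""
    total = 0
    for part in partitions:
        combined = {}
        for key, value in part:
            combined[key] = func(combined[key], value) if key in combined else value
        total += len(combined)  # one record per key per partition is shuffled
    return total


# Two input partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("a", 1), ("b", 1), ("a", 1)],
    [("b", 1), ("b", 1), ("a", 1)],
]
grouped = shuffled_records_group_by_key(partitions)
reduced = shuffled_records_reduce_by_key(partitions, lambda x, y: x + y)
```

Here the group-by path moves all 7 records, while the reduce path moves 4, and the gap grows with the number of repeated keys per partition.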

Batch lag and HBase RegionTooBusyException; lower spark.streaming.kafka.maxRatePerPartition, tune storage, and enable back‑pressure with spark.streaming.backpressure.enabled.
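When choosing a value for spark.streaming.kafka.maxRatePerPartition, it helps to work out the resulting batch size. The arithmetic below is a sketch for the direct Kafka stream; the function names and the numbers (6 partitions, 10-second batches, a sink that absorbs 3000 records per second) are assumptions for illustration.

```python
def max_records_per_batch(max_rate_per_partition, kafka_partitions, batch_seconds):
    """Upper bound on records in one batch under the per-partition rate cap."""
    return max_rate_per_partition * kafka_partitions * batch_seconds


def rate_for_sink_capacity(sink_records_per_second, kafka_partitions):
    """Per-partition rate that keeps total intake within the sink's capacity."""
    return sink_records_per_second // kafka_partitions


batch_cap = max_records_per_batch(1000, kafka_partitions=6, batch_seconds=10)
safe_rate = rate_for_sink_capacity(3000, kafka_partitions=6)
```

If the sink cannot absorb batch_cap records within one batch interval, batches queue up; lowering the rate toward safe_rate, or letting back-pressure adjust it, keeps processing time under the interval.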

Kafka OffsetOutOfRangeException; adjust offsets to the earliest or latest available range.
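Adjusting stored offsets amounts to clamping each one into the range the broker still holds. The sketch below uses plain dicts keyed by partition number; clamp_offsets and the sample offsets are hypothetical, and in a real job the earliest and latest values would be fetched from Kafka.

```python
def clamp_offsets(stored, earliest, latest):
    """Move each stored offset into [earliest, latest] for its partition."""
    fixed = {}
    for partition, offset in stored.items():
        low, high = earliest[partition], latest[partition]
        fixed[partition] = min(max(offset, low), high)
    return fixed


# Made-up offsets: partition 1 has expired data, partition 2 is past the end.
stored = {0: 120, 1: 5, 2: 900}
earliest = {0: 100, 1: 50, 2: 300}
latest = {0: 500, 1: 400, 2: 800}
resume_from = clamp_offsets(stored, earliest, latest)
```

Offsets that were clamped upward imply skipped records, so it is worth logging which partitions changed before resuming.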

Kafka leader not found; raise spark.streaming.kafka.maxRetries above its default of 1.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
