Common Spark SQL, Spark Core, PySpark, and Streaming Issues and Their Solutions
This article compiles frequent Spark SQL, Spark Core, PySpark, and Streaming problems—such as filesystem errors, configuration pitfalls, memory limits, shuffle failures, and version incompatibilities—along with concise explanations of their causes and step‑by‑step remediation methods for big‑data environments.
1. Spark SQL related issues
FileSystem closed error in insert statements caused by Hadoop FileSystem caching; fix by disabling cache in hdfs-site.xml with fs.hdfs.impl.disable.cache=true.
UnresolvedAddressException when connecting to a host due to missing hosts entry; add the correct mapping to the /etc/hosts file.
ORC table errors (IndexOutOfBoundsException, NullPointerException) caused by empty ORC files; avoid by setting hive.exec.orc.split.strategy=BI.
Spark 2.1.0 lacks permanent functions because it cannot load JARs from HDFS; upgrade to Spark 2.2+.
ThriftServer timeout errors (socket timeout) due to busy Hive metastore or GC; increase hive.metastore.client.socket.timeout or set DriverManager.setLoginTimeout(100).
Snappy compression errors because the native library is missing; add the library path via spark.executor.extraLibraryPath or spark.executor.extraJavaOptions.
Excessive task count for small files caused by default partition calculation; adjust mapreduce.job.maps or spark.default.parallelism to reduce tasks.
ThriftServer LDAP authentication failures caused by wrong password or LDAP service issues; correct credentials or fix LDAP service.
JDBC operations failing due to missing or unwritable temporary directories; restart ThriftServer and set proper spark.local.dir permissions.
StackOverflowError for complex SQL caused by insufficient JVM stack size; launch SparkSQL with --driver-java-options "-Xss10m".
INSERT INTO repeated execution bug in Spark 2.1.0; upgrade to 2.1.1 or use INSERT OVERWRITE without partition.
Shuffle-related failures (missing output location, connection errors, OOM, container killed) caused by improper shuffle partition settings; tune spark.sql.shuffle.partitions, spark.default.parallelism, executor memory, and check for data skew.
Executor OOM due to GC overhead; increase executor memory and switch to G1GC by setting spark.executor.extraJavaOptions=-XX:+UseG1GC.
ORC access permission errors in HiveServer2/Spark ThriftServer; use a superuser for first query or disable caching with hive.fetch.task.conversion=none.
Slow small‑data queries because of default spark.locality.wait=3s; set it to 0 for low‑latency workloads.
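Several of the fixes above are launch-time settings, so they can be collected in one place. The following is a minimal sketch, assuming a `spark-sql` launcher on the PATH; the function name `build_spark_sql_command` and the default values (400 shuffle partitions, 8g executor memory) are illustrative placeholders, not recommendations from the original article.

```python
import shlex


def build_spark_sql_command(shuffle_partitions=400, executor_memory="8g",
                            low_latency=False):
    """Return an argv list for spark-sql using settings named in this section.

    The numeric defaults are hypothetical; tune them for the actual workload.
    """
    conf = {
        # More shuffle partitions means smaller shuffle blocks per task.
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
        "spark.executor.memory": executor_memory,
        # G1GC on executors, as suggested for GC-overhead OOMs.
        "spark.executor.extraJavaOptions": "-XX:+UseG1GC",
    }
    if low_latency:
        # Do not wait for a data-local slot on small, latency-sensitive queries.
        conf["spark.locality.wait"] = "0"

    # A larger driver thread stack guards against StackOverflowError on complex SQL.
    argv = ["spark-sql", "--driver-java-options", "-Xss10m"]
    for key, value in sorted(conf.items()):
        argv += ["--conf", f"{key}={value}"]
    return argv


# Print a shell-safe command line for inspection before running it.
print(shlex.join(build_spark_sql_command(low_latency=True)))
```

Keeping the settings in a dictionary makes it easy to compare what was actually passed against what the job needed when a shuffle failure recurs.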
2. Spark Core related issues
Jersey package conflict on YARN leading to NoClassDefFoundError; disable timeline service with --conf spark.hadoop.yarn.timeline-service.enabled=false.
"No space left on device" errors due to full Spark temporary directories; enlarge spark.local.dir or add multiple directories.
Result size exceeds spark.driver.maxResultSize; increase the configuration value.
Common OOM (Java heap space) caused by large data or many partitions; avoid collect, address data skew, and increase executor memory.
Executor loss from Full GC or long‑running tasks; increase executor memory, adjust spark.network.timeout, or tune GC.
Jar version conflicts causing ClassNotFoundException; set spark.driver.userClassPathFirst and spark.executor.userClassPathFirst to true.
Shuffle fetch OOM; increase executor memory and reduce spark.reducer.maxSizeInFlight (default 48 MB).
Direct buffer memory errors; raise -XX:MaxDirectMemorySize via spark.executor.extraJavaOptions.
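Repeated task failures on a faulty node (e.g., a read‑only disk); set spark.blacklist.enabled=true so the scheduler stops placing tasks on it (the setting is named spark.excludeOnFailure.enabled from Spark 3.1 onward).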
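Data skew is named above as a cause of OOM, but the article does not say how to spot it. One possible check, sketched below in plain Python, compares the largest per-key row count with the median. The function name `skew_report`, the threshold of 10, and the sample counts are all assumptions for illustration; in practice the counts would come from something like a group-by-key count on the real dataset.

```python
from statistics import median


def skew_report(key_counts, ratio_threshold=10.0, top_n=3):
    """Summarise how unevenly rows are spread across keys.

    key_counts maps each key to its row count. The threshold is an
    arbitrary illustrative cut-off, not a Spark setting.
    """
    if not key_counts:
        return {"skewed": False, "ratio": 0.0, "hot_keys": []}

    counts = list(key_counts.values())
    typical = median(counts)
    largest = max(counts)
    # Guard against a zero median so the ratio is always defined.
    ratio = largest / typical if typical else float("inf")

    # Keys with the most rows are the candidates for salting or separate handling.
    hottest = sorted(key_counts.items(), key=lambda kv: kv[1], reverse=True)
    return {
        "skewed": ratio >= ratio_threshold,
        "ratio": ratio,
        "hot_keys": [key for key, _ in hottest[:top_n]],
    }


# Made-up counts: one key holds almost all rows.
sample = {"user_a": 1_000_000, "user_b": 1200, "user_c": 900, "user_d": 1100}
report = skew_report(sample)
print(report["skewed"], report["hot_keys"])
```

If a few keys dominate, adding executor memory alone rarely helps, because the single task that processes the hot key still receives most of the data.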
3. PySpark related issues
Driver and executor Python version mismatch; set spark.pyspark.python and the environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to the same interpreter.
Pickle serialization EOFError on YARN due to conflicting Spark installations; remove the extra Spark path from the NodeManager.
Random hash seed differences causing PYTHONHASHSEED errors; set spark.executorEnv.PYTHONHASHSEED to a fixed value.
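The interpreter and hash-seed fixes above can be applied together before the Spark context is created. This is a minimal sketch that assumes the same interpreter path exists on every node; the helper name `pyspark_env_settings` and the seed value "0" are illustrative choices.

```python
import os
import sys


def pyspark_env_settings(python_path=sys.executable, hash_seed="0"):
    """Return (environment variables, Spark conf entries) that pin one interpreter.

    Assumes python_path is valid on the driver and on every executor host.
    """
    env = {
        "PYSPARK_PYTHON": python_path,
        "PYSPARK_DRIVER_PYTHON": python_path,
    }
    conf = {
        "spark.pyspark.python": python_path,
        # A fixed seed makes string hashing agree across executor processes.
        "spark.executorEnv.PYTHONHASHSEED": hash_seed,
    }
    return env, conf


env, conf = pyspark_env_settings()
# Export the variables before any SparkContext or SparkSession is started.
os.environ.update(env)
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

The printed `--conf` pairs can be passed to spark-submit, or the dictionary can be loaded into a SparkConf with `setAll(list(conf.items()))`.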
4. Streaming related issues
Kafka consumer reads all existing messages because auto.offset.reset=earliest; change to latest or specify offset ranges.
Performance tips: prefer reduceByKey / aggregateByKey over groupByKey, use mapPartitions and foreachPartition, coalesce after filters, and use repartitionAndSortWithinPartitions instead of a separate repartition and sort.
Batch lag and HBase RegionTooBusyException; lower spark.streaming.kafka.maxRatePerPartition, tune storage, and enable back‑pressure with spark.streaming.backpressure.enabled.
Kafka OffsetOutOfRangeException; adjust offsets to the earliest or latest available range.
Kafka leader not found; increase spark.streaming.kafka.maxRetries above 1.
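Two of the streaming fixes above reduce to simple arithmetic. Because spark.streaming.kafka.maxRatePerPartition is a per-second, per-partition limit, the largest possible batch is the rate times the partition count times the batch interval. For OffsetOutOfRangeException, a stored offset can be moved back into the range the broker still holds. The sketch below uses hypothetical helper names and sample numbers; how the earliest and latest offsets are fetched is left out and depends on the Kafka client in use.

```python
def max_records_per_batch(max_rate_per_partition, num_partitions, batch_seconds):
    """Upper bound on records one batch can pull under the rate limit."""
    return max_rate_per_partition * num_partitions * batch_seconds


def clamp_offset(stored, earliest, latest):
    """Move a stored offset into the [earliest, latest] range still available."""
    return max(earliest, min(stored, latest))


# Illustrative numbers: 1000 records/s per partition, 12 partitions, 5 s batches.
print(max_records_per_batch(1000, 12, 5))

# A stored offset of 50 has been aged out; the broker now starts at 100.
print(clamp_offset(50, earliest=100, latest=900))
```

Clamping forward to the earliest offset means the expired messages are skipped, so it is worth logging the gap whenever the stored offset had to be adjusted.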
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
