Spark SQL Interview Guide: Concepts, APIs, Optimization and Common Pitfalls
This article provides a comprehensive overview of Spark SQL, covering its architecture, DataSet/DataFrame APIs, code examples for creating and querying datasets, join strategy selection, handling Hive tables, small‑file issues, inefficient NOT‑IN subqueries, Cartesian products, and a catalog of useful built‑in functions.
Spark SQL is Spark's module for processing structured data. Originally derived from Shark, it was redesigned to drop the Hive dependency while remaining Hive-compatible, and it provides in-memory columnar storage, runtime bytecode generation, rule-based and cost-based optimization, and access to multiple data sources such as JDBC, HDFS, and HBase.
DataSet and DataFrame are distributed collections provided by Spark SQL; DataSet retains strong typing and lambda support (Scala/Java) while DataFrame is a type‑alias for DataSet[Row] with a schema similar to a relational table, both using the Catalyst optimizer.
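A minimal sketch of the distinction, assuming a local SparkSession (the names here are illustrative, not from the article):

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.col

case class User(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(("Ann", 30), ("Bob", 16)).toDF("name", "age") // untyped rows
val ds: Dataset[User] = df.as[User] // typed view of the same data

ds.filter(u => u.age > 18) // field access checked at compile time
df.filter(col("age") > 18) // column errors surface only at analysis time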
DataSet creation examples include loading JSON or MySQL data and converting RDDs to DataSets by defining case classes or explicit schemas:
// Load a JSON file
val ds = sparkSession.read.json("/path/people.json")

// Load a MySQL table over JDBC
val ds = sparkSession.read.format("jdbc")
  .options(Map(
    "url" -> "jdbc:mysql://ip:port/db",
    "driver" -> "com.mysql.jdbc.Driver",
    "dbtable" -> "tableName",
    "user" -> "root",
    "password" -> "123"))
  .load()

// RDD to DataSet via case class (requires import sparkSession.implicits._)
case class Person(id: Int, name: String, age: Int)
val lineRDD = sparkContext.textFile("hdfs://ip:port/person.txt").map(_.split(" "))
val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))
val ds = personRDD.toDF // or .toDS() for a typed Dataset[Person]

// RDD to DataSet via manual schema
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val schemaString = "name age"
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true)))
val rowRdd = peopleRdd.map(p => Row(p(0), p(1)))
val ds = sparkSession.createDataFrame(rowRdd, schema)

Two query styles are supported: DSL syntax using DataSet methods (select, filter, groupBy, etc.) and SQL syntax after registering a temporary view:
// DSL style (requires import org.apache.spark.sql.functions.col)
personDS.select(col("name"), col("age"))
personDS.filter(col("age") > 18)
personDS.groupBy("age").count()

// SQL style: register a temporary view first
// (createOrReplaceTempView replaces registerTempTable, deprecated since Spark 2.0)
personDS.createOrReplaceTempView("person")
val result = sparkSession.sql("select * from person order by age desc limit 2")
result.write.format("json").save("hdfs://ip:port/res2")

Spark SQL offers three main join strategies: Broadcast Hash Join, Shuffle Hash Join, and Sort-Merge Join. The planner selects among them based on table size, broadcast hints, and whether the join keys are sortable, falling back to Broadcast Nested Loop Join or CartesianProduct when no equi-join keys exist.
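For example, a small dimension table can be forced into a Broadcast Hash Join with the broadcast hint; factDS and dimDS below are hypothetical, and automatic broadcasting is governed by spark.sql.autoBroadcastJoinThreshold (10 MB by default):

import org.apache.spark.sql.functions.broadcast

// Ship the small dimension table to every executor, avoiding a shuffle
// of the large fact table
val joined = factDS.join(broadcast(dimDS), Seq("key"))

// SQL equivalent using a hint comment
sparkSession.sql("SELECT /*+ BROADCAST(d) */ * FROM fact f JOIN dim d ON f.key = d.key")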
UDF, UDAF, and typed Aggregator examples illustrate how to register custom functions and aggregate logic:
// UDF (requires import org.apache.spark.sql.functions.udf)
val udf_str_length = udf { (str: String) => str.length }
spark.udf.register("str_length", udf_str_length)

// Untyped UDAF: implement inputSchema, bufferSchema, dataType, deterministic,
// initialize, update, merge, and evaluate
// (deprecated in Spark 3.0 in favor of the typed Aggregator with functions.udaf)
object MyAverage extends UserDefinedAggregateFunction { ... }
spark.udf.register("myAverage", MyAverage)
// Typed Aggregator over a Dataset[Employee]
case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)
object MyAverage extends Aggregator[Employee, Average, Double] { ... }
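// For reference, the elided Aggregator body could look like the following
// minimal sketch, which mirrors the typed-Aggregator average example in the
// Spark SQL documentation:
//   def zero: Average = Average(0L, 0L)
//   def reduce(buf: Average, emp: Employee): Average = {
//     buf.sum += emp.salary; buf.count += 1; buf
//   }
//   def merge(b1: Average, b2: Average): Average = {
//     b1.sum += b2.sum; b1.count += b2.count; b1
//   }
//   def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
//   def bufferEncoder: Encoder[Average] = Encoders.product
//   def outputEncoder: Encoder[Double] = Encoders.scalaDouble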
val averageSalary = MyAverage.toColumn.name("average_salary")

When reading Hive tables stored as Parquet, Spark SQL uses its own Parquet support rather than the Hive SerDe for better performance. Compatibility issues arise because Hive is case-insensitive and marks all columns nullable while Parquet is neither, so the two schemas must be reconciled, and cached metadata must be refreshed explicitly with sparkSession.catalog.refreshTable or refreshByPath after the underlying files change.
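A minimal sketch of the refresh calls, with a hypothetical table name and path; after another job rewrites the table's files, invalidate the cached metadata before the next read:

// Refresh cached metadata for a Hive-metastore table
sparkSession.catalog.refreshTable("mydb.parquet_table")
// Or refresh everything cached under a filesystem path
sparkSession.catalog.refreshByPath("hdfs://ip:port/warehouse/mydb.db/parquet_table")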
Common performance pitfalls include the small-file problem, where an excessive number of HDFS files can exhaust driver memory or spawn one task per tiny file; it is mitigated with repartition / coalesce, Hive-style merge hints, or periodic file compaction. Another pitfall is the NOT-IN subquery, which forces a Broadcast Nested Loop Join; it can be avoided by rewriting the query or by analyzing the execution plan before submission, as sketched below.
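Two illustrative sketches with hypothetical table and column names: bounding the number of output files, and rewriting NOT IN as a left anti join (equivalent only when the subquery side has no NULL keys):

// Compact output into a bounded number of files; coalesce avoids a full
// shuffle, while repartition rebalances evenly at the cost of one
resultDS.coalesce(16).write.mode("overwrite").parquet("hdfs://ip:port/output")

// Instead of: SELECT * FROM orders WHERE uid NOT IN (SELECT uid FROM blacklist)
val cleaned = ordersDS.join(blacklistDS, Seq("uid"), "left_anti")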
Cartesian products arise from joins without ON conditions, non‑equi joins, or misuse of OR/|| in join predicates; detection relies on examining logical and physical plans for CartesianProductExec or BroadcastNestedLoopJoinExec nodes.
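Detection can be done before submitting the job; a minimal sketch with hypothetical DataFrames:

// An OR between join keys defeats hash and sort-merge strategies and falls
// back to a nested-loop join; explain() exposes this before execution
val joined = ordersDS.join(usersDS,
  ordersDS("uid") === usersDS("uid") || ordersDS("phone") === usersDS("phone"))
joined.explain(true) // look for BroadcastNestedLoopJoin / CartesianProduct nodes

In Spark 2.x such a plan also fails at runtime unless spark.sql.crossJoin.enabled is set to true or an explicit crossJoin is used.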
The article concludes with a catalog of useful Spark SQL functions covering string manipulation, JSON handling, date/time extraction and conversion, and window functions such as cume_dist, lead, lag, rank, and row_number.
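For example, the ranking and offset window functions can be exercised directly in SQL; the employees table below is hypothetical:

sparkSession.sql("""
  SELECT name, dept, salary,
         row_number() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn,
         rank()       OVER (PARTITION BY dept ORDER BY salary DESC) AS rk,
         lag(salary, 1) OVER (PARTITION BY dept ORDER BY salary)    AS prev_salary
  FROM employees
""").show()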
Big Data Technology & Architecture
Wang Zhiwu, a big data expert dedicated to sharing big data technology.