Big Data 13 min read

How SparkSQL Executes Queries Faster Than Hive: A Deep Dive

This article explains SparkSQL's query processing pipeline—from parsing and logical planning through optimization and physical execution—highlighting why it often outperforms Hive on MapReduce by reducing I/O, minimizing shuffle stages, and reusing JVMs.

ITPUB
ITPUB
ITPUB
How SparkSQL Executes Queries Faster Than Hive: A Deep Dive

Spark, developed by UC Berkeley's AMP Lab and commercialized by Databricks, provides a fast in‑memory computation framework, while SparkSQL builds an interactive SQL layer on top of it.

Although many benchmarks claim SparkSQL is faster than Hive on MapReduce, the article walks through a concrete SparkSQL job to show where the speed gains actually come from.

SQL‑on‑Hadoop solutions such as Hive, Impala, Drill, Presto, SparkSQL, and Apache Tajo aim to improve interactive query performance because traditional Hive on MapReduce is notoriously slow.

Using a simple example query that computes a weighted average per department, the article shows how the SQL string is first parsed by a lexer/parser (often ANTLR) into an abstract syntax tree (AST) and then wrapped into a LogicalPlan :

TableScan(students)
  -> Project(dept, avg(math_score * 1.2) + avg(eng_score * 0.8))
  -> TableSink

The initial logical plan is unresolved because column types and schema information are missing. Metadata (table schema, function signatures) is then attached, allowing column‑reference and function binding, type checking, and semantic validation, which yields a resolved logical plan.

Next, the optimizer applies rule‑based transformations (e.g., predicate push‑down, constant folding) and, in some engines, cost‑based decisions such as join ordering. SparkSQL’s cost model is simple, mainly choosing broadcast joins for small tables.

After optimization, the logical plan looks like:

TableScan(students=>dept:String, eng_score:double, math_score:double)
  -> Project(dept, math_score * 1.2:expr1, eng_score * 0.8:expr2)
  -> Aggregate(avg(expr1):expr3, avg(expr2):expr4, GROUP:dept)
  -> Project(dept, expr3+expr4:avg_result)
  -> TableSink(dept, avg_result->Client)

Physical planning then decides how to evaluate expressions. Two strategies exist: interpreted evaluation (used by Hive) and code‑generation (used by SparkSQL, Impala, Drill). An example of generated code for the expression math_score * 1.2 is:

// math_score * 1.2
val leftOp = row.get(1/* math_score column index */);
val result = if (leftOp == null) then null else leftOp * 1.2;

Aggregations like AVG are split into a partial aggregate (computed on each mapper) and a final aggregate (after a shuffle that groups rows by the key). This introduces a ShuffleExchange stage.

The resulting physical plan for the example query is:

TableScan -> Project(dept, math_score * 1.2, eng_score * 0.8)
  -> AggregatePartial(avg(expr1), avg(expr2), GROUP:dept)
  -> ShuffleExchange(Row, KEY:dept)
  -> AggregateFinal(avg1, avg2, GROUP:dept)
  -> Project(dept, avg1 + avg2)
  -> TableSink

In a MapReduce engine, the plan is executed by special mapper and reducer classes that deserialize the plan and evaluate rows. In Spark, the plan becomes a series of RDD transformations. Example mapper code for a Project operation:

void configuration() {
  context = loadContext();
}
void map(inputRow) {
  outputRow = context.projectEvaluator(inputRow);
  write(outputRow);
}

And the equivalent Spark fragment:

currentPlan.mapPartitions { iter =>
  projection = loadContext()
  iter.map { row => projection(row) }
}

Why SparkSQL often outperforms Hive on MapReduce:

It eliminates redundant HDFS reads/writes and extra MapReduce stages, reducing I/O.

Shuffle data is written once and read directly, avoiding the three‑fold replication overhead of HDFS.

Short‑lived tasks reuse JVMs, saving the costly JVM startup time that dominates many Hive jobs.

For simple queries with a single shuffle, the performance gap may be modest, but for complex queries involving multiple joins or sub‑queries, SparkSQL’s streamlined execution model yields significant speedups.

Finally, the article includes a diagram comparing the MR and Spark execution graphs for a more complex join query, illustrating the reduced disk I/O and fewer JVM launches in Spark.

SparkSQL vs Hive execution diagram
SparkSQL vs Hive execution diagram
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big DataSparkSQLquery optimizationHivedistributed computing
ITPUB
Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.