
Predicate Pushdown Rules in SparkSql Inner Join Queries

SparkSql optimizes inner-join queries by pushing predicates down to the scan phase: filter conditions connected with AND can safely be applied before the join without changing the result, while OR-connected conditions are generally unsafe to push down, except when they involve only the join key or when partition pruning applies to partitioned tables.

vivo Internet Technology

This article provides an in-depth explanation of SparkSql's predicate pushdown optimization in join queries. SparkSql is a distributed SQL engine built on the Spark computing framework, using DataFrame and Dataset to handle structured and semi-structured data queries. It provides both a Scala DSL for writing queries and a ThriftServer for service-oriented SQL access.

The article first explains the concepts of join queries (inner join, outer join, semi join) and join conditions. A join condition determines when rows from two tables can be joined together, while post-join conditions filter the result after the join operation.
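The distinction can be sketched in plain Python (not Spark; table contents and names here are illustrative, not taken from the article): the join condition decides which row pairs are matched at all, while the post-join condition filters the already-joined result.

```python
# Hypothetical two-column tables, rows as dicts.
LT = [{"id": 1, "value": "one"}, {"id": 2, "value": "two"}]
RT = [{"id": 1, "value": "uno"}, {"id": 2, "value": "two"}]

def inner_join(left, right, on):
    """The join condition `on` determines which row pairs are matched."""
    return [(l, r) for l in left for r in right if on(l, r)]

# Join condition: LT.id = RT.id.
joined = inner_join(LT, RT, on=lambda l, r: l["id"] == r["id"])

# Post-join condition: filters the result AFTER the join has produced it.
filtered = [(l, r) for l, r in joined if r["value"] == "two"]
```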

Predicate pushdown in SparkSql has two meanings: (1) which component performs data filtering - the SQL layer's Filter operator or the data source during scan, and (2) when filtering occurs - before or after the join operation. The article details the query processing flow in SparkSql, including analysis (lexical, syntactic, and semantic analysis), logical plan optimization (where predicate pushdown occurs), and physical plan generation.

The core content focuses on predicate pushdown rules for inner join queries:

4.1 Join post-conditions connected by AND: When join post-conditions use AND to connect filter conditions from both tables, predicate pushdown produces the same result as post-join filtering. The article demonstrates this with examples showing that filtering each table first, then joining, yields identical results to joining first then filtering.
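This equivalence can be checked directly with a small plain-Python simulation (hypothetical data, not from the article): pushing each AND-connected conjunct down to its own table's scan yields the same result as filtering after the join.

```python
# Hypothetical tables; rows are dicts.
LT = [{"id": 1, "value": "one"}, {"id": 2, "value": "two"}]
RT = [{"id": 2, "value": "two"}, {"id": 3, "value": "three"}]

def inner_join(left, right):
    """Inner join on LT.id = RT.id."""
    return [(l, r) for l in left for r in right if l["id"] == r["id"]]

# Join first, then apply LT.value = 'two' AND RT.value = 'two'.
post = [(l, r) for l, r in inner_join(LT, RT)
        if l["value"] == "two" and r["value"] == "two"]

# Push each conjunct down to its own table's scan, then join.
pushed = inner_join([l for l in LT if l["value"] == "two"],
                    [r for r in RT if r["value"] == "two"])

assert post == pushed  # identical results for an inner join
```

The pushed-down version scans and joins fewer rows, which is exactly why the optimizer prefers it.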

4.2 Join post-conditions connected by OR: When OR is used, predicate pushdown can produce incorrect results. For example, with the condition "LT.value = 'two' OR RT.value = 'two'", filtering the left table by LT.value = 'two' before the join would incorrectly discard left-table rows whose value is not 'two' but that would join with right-table rows satisfying RT.value = 'two'. There are two exceptions: (1) when the filter field is exactly the join field, and (2) partition table scenarios.
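The failure mode can be reproduced with the same plain-Python sketch (hypothetical data): naively pushing the left-side disjunct of the OR down to the left table's scan loses rows that qualify only through the right-side disjunct.

```python
# Hypothetical tables; rows are dicts.
LT = [{"id": 1, "value": "one"}, {"id": 2, "value": "two"}]
RT = [{"id": 1, "value": "two"}, {"id": 2, "value": "dos"}]

def inner_join(left, right):
    """Inner join on LT.id = RT.id."""
    return [(l, r) for l in left for r in right if l["id"] == r["id"]]

# Correct: join first, then apply LT.value = 'two' OR RT.value = 'two'.
correct = [(l, r) for l, r in inner_join(LT, RT)
           if l["value"] == "two" or r["value"] == "two"]

# Wrong: push LT.value = 'two' down as if it were a standalone filter.
wrong = [(l, r)
         for l, r in inner_join([l for l in LT if l["value"] == "two"], RT)
         if l["value"] == "two" or r["value"] == "two"]

# The id=1 pair qualifies only through RT.value = 'two'; pushing the
# left-side filter down drops it before the join can ever see it.
```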

4.3 Partition tables with OR conditions: For partitioned tables, SparkSql uses "partition pruning" optimization. Instead of treating partition fields as regular filter conditions, it directly excludes directories that don't match the partition criteria from scanning. This requires partition fields to be marked separately in SparkSql's metadata for proper semantic analysis.
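Partition pruning can be sketched in plain Python (the directory layout and names below are hypothetical, mimicking a `dt=.../` partitioned table): rather than scanning every partition directory and filtering rows afterward, the planner compares the predicate against the partition values recorded in metadata and skips non-matching directories entirely.

```python
# A partitioned table modeled as {partition_value: rows}, standing in for
# one directory per partition value.
partitions = {
    "2023-01-01": [{"id": 1}, {"id": 2}],
    "2023-01-02": [{"id": 3}],
    "2023-01-03": [{"id": 4}],
}

def scan_with_pruning(parts, wanted_dt):
    """Only directories whose partition value matches are ever read."""
    scanned_dirs = [dt for dt in parts if dt == wanted_dt]
    rows = [row for dt in scanned_dirs for row in parts[dt]]
    return scanned_dirs, rows

dirs, rows = scan_with_pruning(partitions, "2023-01-02")
# Only one of the three directories is touched; the other two are pruned
# without reading any of their rows.
```

Because the pruning decision consults only partition metadata, a partition-field disjunct in an OR condition can be applied safely at scan time, unlike an ordinary column filter.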

Tags: Big Data, SQL Optimization, Distributed Computing, Join Optimization, Partition Pruning, Predicate Pushdown, SparkSql
Written by

vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.
