Big Data 7 min read

Predicate Pushdown Rules in SparkSQL Outer Join Queries – Detailed Analysis

The article examines SparkSQL’s predicate‑pushdown behavior for left outer joins, detailing four rules that show when pushing join‑condition filters to the left or right tables yields correct, faster results and when it produces incorrect outcomes, highlighting both performance gains and subtle errors.

vivo Internet Technology
vivo Internet Technology
vivo Internet Technology
Predicate Pushdown Rules in SparkSQL Outer Join Queries – Detailed Analysis

In the first part of the series we introduced basic concepts of predicate pushdown and analyzed the rules for inner‑join queries. This second article focuses on the more complex rules for outer‑join queries in SparkSQL.

We start with a table summarizing the pushdown rules for left outer joins (see image below).

The first rule is Left‑table join condition pushdown . After applying the post‑join filter LT.id > 1 to the left table, 50% of the left‑table rows are eliminated, and the subsequent join produces the same result as applying the filter after the join.

The second rule shows the case where the left‑table join condition is not pushed down. The query is rewritten as a non‑correlated sub‑query, and the result differs from the correct answer, demonstrating that pushdown of join‑condition filters can produce wrong results.

The third rule examines right‑table join condition pushdown . After filtering the right table with RT.id > 1 , the left‑table rows are joined with the reduced right‑table, yielding the same final result as the non‑pushed‑down case, but with a potential performance gain.

The fourth rule demonstrates a scenario where pushing down the right‑table join condition leads to an incorrect result, because the join condition and the post‑join filter behave differently with respect to row preservation.

Overall, the analysis shows that SparkSQL does not always push outer‑join filters down to the data source. Correctly applied pushdown can dramatically improve query performance, while incorrect pushdown may produce subtle, wrong results.

The article concludes with a preview of the next installment, which will discuss how various storage engines (RDBMS, NoSQL, columnar stores, search engines, HDFS) implement efficient predicate pushdown.

Big DataSparkSQLQuery OptimizationOuter JoinPredicate Pushdown
vivo Internet Technology
Written by

vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.