Databases 17 min read

Mastering Subquery Unnesting: Turn Correlated Queries into Fast Joins

This article explains why correlated subqueries are a performance bottleneck, introduces the Apply (Correlated Join) operator and a systematic set of transformation rules that push Apply down and replace subqueries with efficient joins, covering scalar, existential, and quantified subqueries, aggregation handling, and set operations, all illustrated with TPC‑H examples.

Open Source Linux

May 5, 2023

Mastering Subquery Unnesting: Turn Correlated Queries into Fast Joins

Subquery (Subquery) optimization has long been a difficult point in SQL query optimization. Correlated subqueries are executed like a Nested‑Loop, which becomes intolerably slow as data grows, so they must be decorrelated (de‑correlation or Unnesting) into more efficient operators such as Semi‑Join.

The discussion is based on classic papers from SQL Server and HyPer and uses TPC‑H tables for examples.

Subquery Overview

A subquery is a syntax defined in the SQL standard that can appear in almost any clause (SELECT, FROM, WHERE, etc.). Subqueries are divided into:

Correlated Subquery – depends on parameters from the outer query.

Non‑correlated Subquery – can be executed independently and materialised before the outer query.

Only correlated subqueries are considered in this article.

Correlated subqueries are further classified by their result shape:

Scalar‑valued – returns a single row and column; if empty, returns NULL. Example:

SELECT c_custkey
FROM CUSTOMER
WHERE 1000000 < (
    SELECT SUM(o_totalprice)
    FROM ORDERS
    WHERE o_custkey = c_custkey
);

Existential Test – EXISTS subquery returning a Boolean value (a Semi‑Join). Example:

SELECT c_custkey
FROM CUSTOMER
WHERE c_nationkey = 86 AND EXISTS(
    SELECT * FROM ORDERS
    WHERE o_custkey = c_custkey
);

Quantified Comparison – IN, SOME, ANY, ALL forms returning a Boolean value. Example:

SELECT c_name
FROM CUSTOMER
WHERE c_nationkey <> ALL (SELECT s_nationkey FROM SUPPLIER);

Original Execution Plan

Using Query 1 as an example, the original plan shows the subquery attached under a Filter expression, causing the executor to invoke the evaluator for each outer row, which is extremely inefficient.

Apply Operator

The Apply operator (also called Correlated Join) receives two relation trees; the inner tree may contain parameters from the outer row. Four forms exist:

Cross Apply (A×A×)

Left Outer Apply (ALOJ)

Semi Apply (A∃)

Anti‑Semi Apply (A∄)

Basic Decorrelation Rules

The first group of rules removes Apply when its inner side does not reference outer parameters, turning it into a regular Join. Applying these rules to Query 1 shows why decorrelation is essential.

Project and Filter Decorrelation

The second rule set pushes Apply down and lifts operators such as Project and Filter above it, turning correlated expressions into ordinary variables that can be joined efficiently.

Aggregate Decorrelation

When a correlated subquery contains aggregation, the goal is still to push Apply down. ScalarAgg can be transformed into GroupAgg by grouping on the outer key. However, care must be taken with NULL handling because F(∅) ≠ F({NULL}) for some aggregates. The article suggests replacing COUNT(*) with COUNT(non‑null‑key) or splitting expressions that use IF_NULL before aggregation.

Example of a problematic transformation:

SELECT c_custkey,
    (SELECT COUNT(*) FROM ORDERS WHERE o_custkey = c_custkey) AS count_orders
FROM CUSTOMER;

After transformation without proper NULL handling, a customer with no orders would incorrectly receive a count of 1.

Set‑Operation Decorrelation

The final rule group handles subqueries that appear under UNION ALL, EXCEPT ALL, or Inner Join, again pushing Apply down as far as possible.

FAQ

Can any correlated subquery be decorrelated? Yes, with a few mild restrictions, theory shows that any correlated subquery can be transformed.

Differences between HyPer and SQL Server? HyPer covers more cases and provides explicit transformation rules for many operators, while SQL Server’s papers prove that such cases can be reduced to a manageable form.

Should these rules be applied in RBO or CBO? Not always; the optimizer should estimate costs and choose the best plan, sometimes keeping Apply when the outer side is tiny or the inner side has a suitable index.

References

Parameterized Queries and Nesting Equivalencies – C. Galindo‑Legaria

Orthogonal Optimization of Subqueries and Aggregation – C. Galindo‑Legaria, M. Joshi

Unnesting Arbitrary Queries – T. Neumann, A. Kemper

The Complete Story of Joins (in HyPer) – T. Neumann, V. Leis, A. Kemper

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

SQL query optimization TPC-H subquery Database Engines Apply Operator Decorrelate

Written by

Open Source Linux

Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.