Databases 24 min read

Mastering Correlated Subqueries: Decorrelation, Push‑Down Rules, and Optimization Techniques

This article explains what correlated subqueries are, why they challenge database engines, and how they can be transformed into ordinary SQL using decorrelation, push‑down rules, result reuse, and window‑function tricks, illustrated with TPCH examples and code snippets.

Alibaba Cloud Developer
Alibaba Cloud Developer
Alibaba Cloud Developer
Mastering Correlated Subqueries: Decorrelation, Push‑Down Rules, and Optimization Techniques

Background

This article introduces correlated subqueries, explains their semantics, benefits for writing concise SQL, and the execution challenges they pose to database systems.

Correlated Subquery Definition

A correlated subquery references columns from the outer query, e.g., using

NOT EXISTS (SELECT * FROM orders WHERE o_custkey = c_custkey)

to find customers who have never placed an order.

-- All customers who have never placed an order
SELECT c_custkey
FROM customer
WHERE NOT EXISTS (
    SELECT *
    FROM orders
    WHERE o_custkey = c_custkey
);

Without correlation, the same result requires a LEFT JOIN and additional filtering.

SELECT c_custkey
FROM customer
LEFT JOIN (
    SELECT DISTINCT o_custkey
    FROM orders
) AS o ON o.o_custkey = c_custkey
WHERE o.o_custkey IS NULL;

Correlated Join Operator

The optimizer represents a correlated subquery as a CorrelatedJoin (also called lateral or dependent join). Its left child is the outer query, the right child is the subquery, and the correlated columns are passed from left to right. Execution is iterative: for each outer row the subquery is evaluated with the current correlated values.

Decorrelation Techniques

1. Push‑Down Rules

CorrelatedJoin can be pushed past operators (FILTER, PROJECT, AGGREGATION) when possible, converting it into ordinary joins. For example, if neither side contains correlated columns, the operator becomes a left join without join predicates.

-- Example of pushing CorrelatedJoin through a filter
SELECT ... FROM outer
WHERE outer.col < (SELECT MAX(inner.val) FROM inner WHERE inner.key = outer.key);

After pushing, the subquery can be rewritten as a regular join.

2. Simple Correlation Conversion

When the subquery’s result is independent of the outer row, the CorrelatedJoin is replaced by a cross‑join or a constant‑value join.

3. Push‑Through Project and Aggregation

When pushing through a PROJECT, the outer columns used by the subquery must be added to the projection. When pushing through an AGGREGATION, the grouping keys are extended with the outer keys, and a enforceSingleRow node is added to preserve semantics.

Result Reuse

Because the subquery result for the same correlated value is identical, the optimizer can compute the subquery once per distinct key and reuse the result, e.g., by materializing a distinct set of keys and joining back.

Window‑Function Based Decorrelation

CorrelatedJoin behaves like a window operator: it adds a column without changing row count. Some subqueries can be rewritten using window aggregates, reducing table scans. Example:

SELECT c_custkey,
       SUM(o_totalprice) OVER (PARTITION BY c_custkey) AS total_spent
FROM customer LEFT JOIN orders ON o_custkey = c_custkey;

This approach saves scans but may hinder further push‑down optimizations.

Practical Considerations and Pitfalls

Key issues include handling NULLs correctly (NULL in empty set vs. NULL in non‑empty set), avoiding the “count bug” where decorrelation changes row filtering, and ensuring deterministic subquery results for distributed execution. The article also discusses left‑mark‑join semantics in distributed environments.

Conclusion

Correlated subqueries provide expressive power but require careful optimizer transformations—push‑down, decorrelation, result reuse, and window‑based rewrites—to achieve efficient execution in modern distributed SQL engines.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

SQLdatabasequery optimizationDecorrelationCorrelated Subquery
Alibaba Cloud Developer
Written by

Alibaba Cloud Developer

Alibaba's official tech channel, featuring all of its technology innovations.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.