When to Use MySQL Hash Join vs. Nested Loop: A Practical Guide
This article explains MySQL's hash join algorithm, compares it with nested‑loop joins, shows how to read execution steps and EXPLAIN output, discusses extended hash‑join scenarios, and weighs the trade‑offs of performing multi‑table joins inside the database versus in application code.
Hash Join Overview
Since MySQL 8.0.20 the cache‑block nested‑loop join has been removed and replaced by hash join for many cases. When the driving table lacks an index, hash join usually outperforms the old nested‑loop approach because it builds an in‑memory hash table on the smaller table and probes it with the larger table.
MySQL Implementation Example
SELECT * FROM product p INNER JOIN `order` o ON p.id = o.product_id WHERE p.id IN (1,2);Execution steps:
Build a hash table from the small table product on the join column id.
Scan the large table order, compute the hash of product_id, locate the corresponding bucket in the hash table, and match rows.
Combine matching rows into the final result set.
If the optimizer selects a hash join, the EXTRA column of EXPLAIN shows Using where; Using join buffer (hash join). When the hash table exceeds the join_buffer_size limit, MySQL spills to disk and the execution plan changes accordingly.
Extended Hash Join Scenarios
From MySQL 8.0.20 onward, hash join can be used for more than simple equi‑joins. Supported forms include:
Inner non‑equi‑join, e.g., SELECT * FROM t1 JOIN t2 ON t1.c1 < t2.c1 Semijoin, e.g., SELECT * FROM t1 WHERE t1.c1 IN (SELECT t2.c2 FROM t2) Antijoin, e.g.,
SELECT * FROM t2 WHERE NOT EXISTS (SELECT * FROM t1 WHERE t1.c1 = t2.c1)Left outer join and right outer join
Application‑Level Join vs. Database Join
Implementing joins in application code typically involves reading both tables into memory and either using a nested‑loop merge or building a hashmap on the join column, which mirrors the database hash join.
Advantages of moving the join to the application layer:
Application servers can be horizontally scaled, reducing pressure on the database.
For simple merges, code reuse and maintenance may be easier.
Higher availability when the database is under heavy load, as the application can continue processing.
Disadvantages include increased network traffic, higher memory consumption, and the risk of saturating the database’s network interface, especially with large tables. In most realistic workloads, keeping multi‑table joins inside the database yields better performance and resource utilization.
Below are illustrative diagrams from the original article:
Senior Tony
Former senior tech manager at Meituan, ex‑tech director at New Oriental, with experience at JD.com and Qunar; specializes in Java interview coaching and regularly shares hardcore technical content. Runs a video channel of the same name.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
