
Deep Dive into ClickHouse Join Implementation and Optimization Techniques

This article examines ClickHouse's join mechanisms, detailing the limitations of standard joins, the advantages of Global joins, and optimization strategies such as hash and merge joins, subquery filtering, and memory considerations, illustrated with SQL examples and source‑code analysis.

Snowball Engineer Team

Background

As the company’s data volume grew across multiple business lines, a unified tag system (Snow+ system) was built to support user, post, and fund tags, enabling fine‑grained operations and personalized services.

ClickHouse Selection

After evaluating alternatives, ClickHouse was chosen for its distributed cluster capability, high‑speed OLAP queries, support for mutable data, and ability to handle both batch imports and real‑time writes.

Join Weaknesses

While ClickHouse excels at single‑table queries, its join performance can degrade dramatically once large tables are involved, because the engine lacks a fully fledged shuffle join: a distributed join either pulls the right‑hand table to every shard or broadcasts it with GLOBAL.

SQL Example and Issue

SELECT DISTINCT s.uid
FROM user_label_2 AS d
JOIN id_maping AS m ON d.uid = m.sq_id
JOIN user_label_1 AS s ON s.uid = m.dj_id
WHERE s.day = '$lastday'        -- '$lastday' is a templated date variable
  AND d.day = '$lastday'
  AND d.k000088 = '稳健型'      -- label value, roughly 'conservative risk profile'

On a distributed cluster the query fails unless the GLOBAL keyword is added to both joins, and even with GLOBAL, execution time can reach minutes.
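The minimal fix is to prefix each join with GLOBAL (a sketch based on the query above; table and column names are unchanged):

```sql
SELECT DISTINCT s.uid
FROM user_label_2 AS d
GLOBAL JOIN id_maping AS m ON d.uid = m.sq_id
GLOBAL JOIN user_label_1 AS s ON s.uid = m.dj_id
WHERE s.day = '$lastday'
  AND d.day = '$lastday'
  AND d.k000088 = '稳健型'
```

GLOBAL makes the initiator materialize each right‑hand table once and ship it to every shard, but here the unfiltered tables are still broadcast in full, which is why execution can still take minutes.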

Optimization Techniques

Key optimizations include:

Place the smaller table on the right side of the join.

Filter data early using subqueries to reduce the data set before the join.

Use GLOBAL JOIN to broadcast the right‑hand table once, avoiding the N² read amplification of ordinary distributed joins.

Leverage ClickHouse’s HashJoin (the default) and fall back to MergeJoin when memory limits are exceeded, a switch managed by JoinSwitcher.
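Applied to the earlier query, these techniques combine into a rewrite along the following lines (a sketch only; the best shape depends on table sizes and cluster layout):

```sql
SELECT DISTINCT s.uid
FROM
(
    -- filter the driving table first so the join input is small
    SELECT uid FROM user_label_2
    WHERE day = '$lastday' AND k000088 = '稳健型'
) AS d
GLOBAL JOIN
(
    -- project only the two columns the join needs
    SELECT sq_id, dj_id FROM id_maping
) AS m ON d.uid = m.sq_id
GLOBAL JOIN
(
    SELECT uid FROM user_label_1 WHERE day = '$lastday'
) AS s ON s.uid = m.dj_id
```

Each subquery prunes rows and columns before anything crosses the network, and the filtered right‑hand tables keep both the broadcast payload and the in‑memory hash tables small.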

Source‑Code Exploration

The execution flow starts in TCPHandler::runImpl, which parses the query, builds an Interpreter, creates a QueryPipeline, and processes the pipeline steps.

During a join, the JoiningTransform is invoked. Its transformHeader method receives a JoinPtr that can be a HashJoin, MergeJoin, or JoinSwitcher implementation.

JoinSwitcher begins with an in‑memory hash join and switches to a disk‑based merge join when limits.softCheck(rows, bytes) reports memory overflow:

bool JoinSwitcher::addJoinedBlock(const Block & block, bool)
{
    std::lock_guard lock(switch_mutex);
    if (switched)
        return join->addJoinedBlock(block);
    // HashJoin with external limits check
    join->addJoinedBlock(block, false);
    size_t rows = join->getTotalRowCount();
    size_t bytes = join->getTotalByteCount();
    if (!limits.softCheck(rows, bytes))
        switchJoin();
    return true;
}

The IJoin interface defines addJoinedBlock (build phase) and joinBlock (probe phase), which are used by the pipeline to materialize the right‑hand hash map and then probe it with the left‑hand rows.

Distributed Join Mechanics

ClickHouse distinguishes between ordinary joins and GLOBAL JOIN. In an ordinary join, every shard independently pulls the full right‑hand table from all shards, causing N² reads. A GLOBAL JOIN materializes the right‑hand data on the initiator node, broadcasts it once to each shard as a temporary table, and each shard then performs the join locally, dramatically reducing network traffic.
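Syntactically the two strategies differ only in one keyword (a sketch using hypothetical Distributed tables dist_left and dist_right over N shards):

```sql
-- Ordinary join: each of the N shards pulls dist_right from all
-- N shards, so the right-hand table is read on the order of N x N times.
SELECT l.id
FROM dist_left AS l
JOIN dist_right AS r ON l.id = r.id;

-- GLOBAL JOIN: the initiator reads dist_right once and ships it to
-- each shard as a temporary table: one read, N transfers.
SELECT l.id
FROM dist_left AS l
GLOBAL JOIN dist_right AS r ON l.id = r.id;
```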

Summary of Optimizations

To achieve sub‑second join performance on large datasets, the article recommends:

Using GLOBAL JOIN to avoid read amplification.

Placing the smaller table on the right side to keep the hash map in memory.

Applying early filters and aggregations in subqueries before the join.

Understanding the underlying hash‑join and merge‑join implementations to tune memory limits and join algorithms.
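For the last point, the relevant knobs are exposed as query‑level settings. The setting names below exist in ClickHouse; the values are illustrative, not recommendations:

```sql
SELECT count()
FROM dist_left AS l
GLOBAL JOIN dist_right AS r ON l.id = r.id
SETTINGS
    join_algorithm = 'auto',          -- start with hash, switch to merge on overflow
    max_bytes_in_join = 1000000000,   -- soft memory limit that triggers the switch
    join_overflow_mode = 'break';     -- behaviour if the limit cannot be honoured
```

With join_algorithm = 'auto', the JoinSwitcher behaviour shown in the source‑code section is what enforces these limits at runtime.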

These practices enable efficient, scalable analytics on ClickHouse clusters.

Tags: distributed systems, SQL, ClickHouse, Hash Join, Join Optimization, Merge Join
Written by Snowball Engineer Team

Proactivity, efficiency, professionalism, and empathy are the core values of the Snowball Engineer Team; curiosity, passion, and sharing of technology drive their continuous progress.
