
Optimizing Complex Queries in ClickHouse: Multi‑Stage Execution, Exchange Management, and Runtime Filters

This article explains how ByteHouse, a heavily optimized ClickHouse variant, addresses complex query challenges by introducing a multi‑stage execution model, sophisticated exchange management, various join strategies, runtime filters, and diagnostic metrics to improve performance, scalability, and resource utilization in large‑scale data environments.

DataFunTalk

ClickHouse has become a mainstream open‑source OLAP engine, but its two‑stage execution model can cause bottlenecks for complex queries as data volumes grow.

The presentation outlines ByteHouse’s approach to solving these issues, covering project background, technical design, and future outlook.

Project Background

ClickHouse’s distributed execution involves a coordinator and multiple workers. In the first stage the coordinator distributes the query to the workers; in the second stage it collects and merges their partial results. Because the final merge is concentrated on a single node, this model struggles with large result sets, memory‑intensive joins, and multi‑table queries.
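The scatter‑gather shape of the two‑stage model can be sketched as follows (function names are invented for illustration; this is not ClickHouse’s actual code):

```python
# Minimal sketch of two-stage execution: workers aggregate their local
# shards in parallel, then the coordinator alone merges every partial
# result -- that single-node merge is the bottleneck described above.
from concurrent.futures import ThreadPoolExecutor

def worker_partial_agg(shard):
    # Stage 1: each worker aggregates its local shard.
    return sum(shard)

def coordinator(shards):
    # Stage 2: the coordinator merges all partials by itself,
    # so its memory and CPU bound the whole query.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(worker_partial_agg, shards))
    return sum(partials)

print(coordinator([[1, 2], [3, 4], [5, 6]]))  # -> 21
```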

Technical Solution

Design Philosophy

Replace the two‑stage model with a multi‑stage execution model, similar to Presto or Impala: the plan is split into Stages, data moves between Stages through an ExchangeManager, and no data exchange occurs within a Stage.

Key Terminology

ExchangeNode – represents a data‑exchange point in the query plan.

PlanSegment – the executable fragment for a single Stage.

ExchangeManager – handles data transfer between Stages.

SegmentScheduler – dispatches PlanSegments to workers.

InterpreterPlanSegment – executes a PlanSegment on a worker.
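The relationships among these terms can be sketched with hypothetical data structures (these mirror the terminology above, not ByteHouse’s actual C++ types):

```python
# Illustrative shapes only: a PlanSegment is one Stage's executable
# fragment, and ExchangeNodes mark where data crosses Stage boundaries.
from dataclasses import dataclass, field

@dataclass
class ExchangeNode:
    mode: str  # e.g. "repartition", "broadcast", "gather"

@dataclass
class PlanSegment:
    segment_id: int
    operators: list = field(default_factory=list)  # intra-Stage operators
    inputs: list = field(default_factory=list)     # upstream segment ids
    outputs: list = field(default_factory=list)    # downstream segment ids

seg = PlanSegment(segment_id=1, operators=["TableScan", "PartialAgg"],
                  outputs=[3])
print(seg.segment_id)  # -> 1
```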

Execution Flow

Coordinator inserts Exchange nodes into the query plan.

Plan is split into PlanSegments (Stages).

SegmentScheduler sends each PlanSegment to the appropriate workers.

Workers execute their PlanSegments and exchange data via ExchangeManager.

Coordinator collects final results and returns them to the client.

Plan Splitting Example

A two‑table join can be divided into four Stages, allowing parallel processing of data reads, joins, and aggregations.
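One plausible four‑Stage split (illustrative; the exact plan ByteHouse produces may differ) has Stages 1 and 2 scan and repartition the two tables in parallel, Stage 3 join matching partitions, and Stage 4 aggregate and gather:

```python
# Hypothetical Stage DAG for a two-table join. "depends_on" lists the
# upstream Stages whose output flows in through the ExchangeManager.
stages = {
    1: {"ops": ["Scan(t1)", "ShuffleWrite(key)"], "depends_on": []},
    2: {"ops": ["Scan(t2)", "ShuffleWrite(key)"], "depends_on": []},
    3: {"ops": ["ShuffleRead", "HashJoin"],       "depends_on": [1, 2]},
    4: {"ops": ["Aggregate", "GatherToClient"],   "depends_on": [3]},
}

# Stages with no dependencies can start immediately and run concurrently.
ready = [s for s, d in stages.items() if not d["depends_on"]]
print(ready)  # -> [1, 2]
```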

Segment Scheduler Strategies

Dependency‑driven scheduling – respects Stage dependencies (DAG).

All‑At‑Once – schedules all Stages simultaneously after computing dependencies.
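Dependency‑driven scheduling amounts to dispatching a Stage only after all of its upstream Stages have been dispatched, i.e. a topological order over the Stage DAG. A minimal sketch using Kahn’s algorithm (names are illustrative, not the SegmentScheduler API):

```python
# Dispatch Stages in dependency order: a Stage becomes ready once every
# upstream Stage it depends on has already been scheduled.
from collections import deque

def schedule(deps):
    """deps: stage id -> list of upstream stage ids. Returns dispatch order."""
    indegree = {s: len(up) for s, up in deps.items()}
    downstream = {s: [] for s in deps}
    for s, up in deps.items():
        for u in up:
            downstream[u].append(s)
    queue = deque(s for s, d in indegree.items() if d == 0)
    order = []
    while queue:
        s = queue.popleft()
        order.append(s)  # here the PlanSegment would be sent to workers
        for t in downstream[s]:
            indegree[t] -= 1
            if indegree[t] == 0:
                queue.append(t)
    return order

print(schedule({1: [], 2: [], 3: [1, 2], 4: [3]}))  # -> [1, 2, 3, 4]
```

All‑At‑Once differs only in that every PlanSegment is shipped to workers up front, and readiness is resolved at runtime by the exchanges.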

InterpreterPlanSegment

Executes the serialized PlanSegment, reads input (local table or exchange), runs the plan logic, and outputs results either to the client or downstream ExchangeManager.

ExchangeManager

Manages push‑based data transfer, implements back‑pressure, fine‑grained memory control, connection reuse, and optional RDMA for high‑throughput networks.
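The back‑pressure idea can be sketched with a bounded queue between an upstream and a downstream Stage: when the consumer lags, the producer’s `put` blocks, which caps in‑flight memory. This is a toy model, not ByteHouse’s transport:

```python
# Push-based exchange with back-pressure: the bounded queue acts as a
# fine-grained memory cap per exchange channel.
import queue
import threading

channel = queue.Queue(maxsize=4)   # at most 4 blocks in flight
received = []

def producer():
    for block in range(10):
        channel.put(block)         # blocks while the channel is full
    channel.put(None)              # end-of-stream marker

def consumer():
    while (block := channel.get()) is not None:
        received.append(block)

t = threading.Thread(target=producer)
t.start()
consumer()
t.join()
print(received)  # -> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```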

Optimization & Diagnosis

Multiple join implementations: Shuffle Join, Broadcast Join, Co‑locate Join.
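A typical heuristic for choosing among these strategies (thresholds and names invented for illustration, not ByteHouse’s planner): use Co‑locate when both sides are already partitioned on the join key, Broadcast when the build side is small, and Shuffle otherwise.

```python
# Hypothetical join-strategy selection by table size and data placement.
def pick_join(left_rows, right_rows, colocated, broadcast_limit=1_000_000):
    if colocated:
        return "co-locate"   # data already aligned; no network transfer
    if right_rows <= broadcast_limit:
        return "broadcast"   # ship the small table to every worker
    return "shuffle"         # repartition both sides on the join key

print(pick_join(10**9, 5_000, colocated=False))   # -> broadcast
print(pick_join(10**9, 10**9, colocated=False))   # -> shuffle
```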

Network connection reuse to limit the number of sockets per cluster.

Runtime Filters (min‑max, Bloom) to prune left‑table (probe‑side) data before the join.
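The runtime‑filter idea, sketched for the min‑max case: build a filter from the join’s build side, then skip probe‑side rows that cannot match before they reach the join. A Bloom filter is applied the same way for equality pruning. Function names are illustrative only:

```python
# Build a min-max runtime filter from the build-side join keys, then use
# it to drop probe-side rows whose key falls outside [lo, hi].
def build_minmax(build_keys):
    return (min(build_keys), max(build_keys))

def prune(probe_rows, key_fn, minmax):
    lo, hi = minmax
    return [r for r in probe_rows if lo <= key_fn(r) <= hi]

rf = build_minmax([40, 42, 47])               # build side: 3 keys
left = [(1, 10), (2, 41), (3, 45), (4, 99)]   # probe side: (id, join_key)
print(prune(left, lambda r: r[1], rf))        # -> [(2, 41), (3, 45)]
```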

Extensive metrics and back‑pressure monitoring to identify bottlenecks.

Results & Outlook

Benchmarks on a 1 TB SSB dataset across 8 nodes show significant speed‑ups: complex aggregations improve from 8.5 s to 2.2 s, large‑table joins from 17.2 s to 1.7 s, and five‑table joins from 8.6 s to 4.5 s. Future work includes further performance tuning of execution and exchange, richer metrics, and smarter automated diagnostics.

Tags: Optimization · ClickHouse · Distributed Query · ByteHouse · Exchange Manager · Stage Execution · Runtime Filter
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
