Accelerating Spark with ClickHouse: Native Optimization Techniques and Performance Evaluation
This article presents a comprehensive technical overview of using ClickHouse as a native backend to accelerate Spark SQL execution, covering Spark performance bottlenecks, ClickHouse's CPU‑level optimizations, the design and implementation of the Spark‑Native integration, and detailed TPC‑DS benchmark results demonstrating up to 3.5× speedup.
1 Spark Performance Optimization Background
Spark originated at UC Berkeley's AMPLab and has become a leading open‑source engine for batch, streaming, and machine‑learning workloads. In Spark SQL, the driver generates a logical plan, optimizes it, and produces a physical plan that is split into stages and tasks; the driver's work is lightweight, while the tasks consume most of the compute resources.
1.1 Spark Overview
Typical SQL execution involves generating a logical plan, applying multiple optimization passes, and finally creating a physical plan that is divided into stages. Each stage consists of many tasks that run on executors, processing data row‑wise using JVM‑based operators.
1.2 Spark Task Characteristics
Tasks run on the JVM, which limits low‑level optimizations such as SIMD usage and efficient memory access. Row‑oriented data structures increase GC pressure and hinder vectorized execution.
1.3 Native Acceleration Idea
The proposed solution rewrites the task‑level execution path in C++ and stores data column‑wise, leveraging ClickHouse's high‑performance OLAP engine. By replacing Scan, Agg, and Shuffle operators with ClickHouse equivalents, the system can exploit columnar storage, SIMD, and loop‑unrolling techniques.
2 ClickHouse Performance Advantages
ClickHouse achieves superior CPU utilization by keeping CPU pipelines full, applying SIMD vectorization, and aggressively unrolling loops. Columnar storage enables efficient prefetching and reduces memory bandwidth pressure. The engine also uses C++ templates and just‑in‑time (JIT) compilation to generate highly optimized machine code.
2.1 CPU Pipeline and SIMD
Modern CPUs execute instructions in multiple pipeline stages; keeping all stages busy maximizes throughput. SIMD allows a single instruction to process multiple data elements simultaneously; for example, a 128‑bit register holds four 32‑bit values, yielding up to a 4× speedup for simple arithmetic.
2.2 ClickHouse Code Example
For a simple aggregation query, ClickHouse generates C++ loops with explicit loop unrolling and SIMD intrinsics, as illustrated by the following SQL snippet:
select f1 + f2, f3, f4, f5, f6 * f7 from tbl
2.3 Template‑Based Optimizations
ClickHouse heavily uses C++ templates to eliminate virtual function overhead and to produce type‑specific code paths, especially in hash‑based aggregation and join operators.
3 Spark Native Acceleration Design and Implementation
The integration follows two design principles: full compatibility with native Spark semantics and maximal performance gain. Execution plans are generated in the driver, serialized via Protobuf, and dispatched to a ClickHouse dynamic library through JNI. Custom operators such as SparkShuffleSink and SparkShuffleSource bridge Spark's shuffle framework with ClickHouse.
3.1 Execution Plan Serialization
Physical operators (e.g., FileScanExec) are mapped to ClickHouse equivalents (e.g., DFSSource) and their parameters are encoded in Protobuf messages, preserving stage boundaries and shuffle identifiers.
3.2 Decimal Type Compatibility
To ensure identical results, the Spark‑Native layer adjusts ClickHouse's decimal precision/scale inference to match Spark's Hive‑compatible rules.
3.3 Conditional Join Optimization
For semi‑joins with inequality predicates, the solution implements a custom code‑generated expression evaluator that avoids Cartesian product expansion, extending ClickHouse's JIT framework to support column‑index based inputs.
3.4 Fallback Mechanism
When an operator cannot be executed in ClickHouse, the system falls back to native Spark at the operator, stage, or driver level, guaranteeing that performance never degrades below the vanilla Spark baseline.
4 Acceleration Effect Analysis
Benchmarking on a 10 TB TPC‑DS workload shows an average 2.3× speedup, with some queries (e.g., Q23) achieving up to 3.5× improvement. Queries dominated by simple scans, hash joins, and columnar shuffle writes benefit most, while queries with many consecutive joins (e.g., Q72) fall back to Spark and see no gain.
4.1 TPC‑DS Performance Summary
The overall reduction in execution time translates to roughly halving cluster resource requirements for the same workload.
4.2 Practical Usage
Users can enable the feature in Baidu Cloud BMR by setting spark.sql.execution.clickhouse=true. Currently supported file formats are Parquet and ORC.
For more details, refer to the original presentation and the linked source code repositories.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.